Changes between Initial Version and Version 1 of BluePrint/SurveyTool/Statistics


Timestamp:
10/04/11 09:12:39 (13 years ago)
Author:
Michael Howden
Comment:

--

  • BluePrint/SurveyTool/Statistics

    v1 v1  
     1= Survey Tool Blue Print - Assessment Data Analysis Tool (ADAT) =
     2== Statistics ==
     3=== Statistics Package ===
     4Your data may look like:
     5
     6Question A (4 Options) | Question B (3 Options)
     7
     8Answers:
     9
     10Option A-1 | Option B-1
     11Option A-2 | Option B-2
     12Option A-1 | Option B-3
     13...
     14
     15To display a diagram, you need a table like:
     16
     17            | Option B-1 | Option B-2 | Option B-3
     18Option A-1  | <count>    | <count>    | <count>
     19Option A-2  | <count>    | <count>    | <count>
     20Option A-3  | <count>    | <count>    | <count>
     21Option A-4  | <count>    | <count>    | <count>
     22
     23where <count> represents the "formula" here.
     24
     25Certainly, for options the <count> formula is the simplest one. There are
     26plenty more of these, e.g. <sum>, <average>, <relative frequency> etc.
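
As a rough illustration of the <count> formula, the following minimal sketch builds the count table from raw answer pairs. The function name `cross_tabulate` and the sample answers are made up for this example and are not part of Eden.

```python
from collections import Counter

# Hypothetical raw answers: one (answer to A, answer to B) pair per response.
answers = [
    ("Option A-1", "Option B-1"),
    ("Option A-2", "Option B-2"),
    ("Option A-1", "Option B-3"),
]

def cross_tabulate(pairs, row_labels, col_labels):
    """Return {row label: [count for each column label]}."""
    counts = Counter(pairs)
    # Counter returns 0 for combinations that never occurred.
    return {r: [counts[(r, c)] for c in col_labels] for r in row_labels}

rows = ["Option A-1", "Option A-2", "Option A-3", "Option A-4"]
cols = ["Option B-1", "Option B-2", "Option B-3"]
table = cross_tabulate(answers, rows, cols)
```

Each entry of `table` is one row of the output: a label followed by purely numeric columns, which is the shape a diagram would need.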
     27
     28To make this more comprehensible, say that one of the fields is a numeric
     29value, like in:
     30
     31Family size (4 Options) | Family Income (numeric)
     32
     33Answers:
     34
     35Single | 25000
     36Single | 32000
     372 Persons | 38000
     38Single | 92000
     393-5 Persons | 80000
     40more than 5 Persons | 123000
     41....and so forth
     42
     43Now the analysis could give a table like:
     44
     45                      | Income
     46Single                | <sum>
     472 Persons             | <sum>
     483-5 Persons           | <sum>
     49more than 5 Persons   | <sum>
     50
     51...which can be displayed as a barchart.
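
A minimal sketch of how such a table could be computed, assuming the answers arrive as (label, value) pairs. The names `FORMULAS` and `analyse` are hypothetical; the sample rows are the ones listed above.

```python
from collections import defaultdict
from statistics import mean, median

# The sample answers from the text: (family size, family income).
answers = [
    ("Single", 25000),
    ("Single", 32000),
    ("2 Persons", 38000),
    ("Single", 92000),
    ("3-5 Persons", 80000),
    ("more than 5 Persons", 123000),
]

# Hypothetical registry mapping a formula name to a plain Python function.
FORMULAS = {
    "count": len,
    "sum": sum,
    "average": mean,
    "minimum": min,
    "maximum": max,
    "median": median,
}

def analyse(rows, formula):
    """Group the values by label and apply the named formula to each group."""
    groups = defaultdict(list)
    for label, value in rows:
        groups[label].append(value)
    func = FORMULAS[formula]
    return {label: func(values) for label, values in groups.items()}

income_sum = analyse(answers, "sum")
```

Swapping the formula name is all it takes to get a different table from the same rows, for example `analyse(answers, "median")`.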
     52
     53Instead of <sum> you can use other formulas of interest here: <average>,
     54<maximum>, <minimum>, <median>, or you may want to analyze <percentiles>
     55or a <distribution>, like:
     56
     57                      | 50% or less | average | 150% or more
     58Single                | <count>     | <count> | <count>
     592 Persons             | <count>     | <count> | <count>
     603-5 Persons           | <count>     | <count> | <count>
     61more than 5 Persons   | <count>     | <count> | <count>
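
One possible reading of this <distribution> table, sketched below, is that each value is bucketed relative to the overall average of the analyzed question. That interpretation, the function name `distribution`, and the `low`/`high` parameters are assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean

# The sample answers from the text: (family size, family income).
answers = [
    ("Single", 25000),
    ("Single", 32000),
    ("2 Persons", 38000),
    ("Single", 92000),
    ("3-5 Persons", 80000),
    ("more than 5 Persons", 123000),
]

def distribution(rows, low=0.5, high=1.5):
    """Count values per label in three buckets relative to the overall mean.

    Bucket 0: value <= low * mean, bucket 2: value >= high * mean,
    bucket 1: everything in between.
    """
    overall = mean(value for _, value in rows)
    table = defaultdict(lambda: [0, 0, 0])
    for label, value in rows:
        if value <= low * overall:
            table[label][0] += 1
        elif value >= high * overall:
            table[label][2] += 1
        else:
            table[label][1] += 1
    return dict(table)

income_distribution = distribution(answers)
```

With these six rows the overall average is 65000, so the bucket boundaries fall at 32500 and 97500.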
     62
     63
     64To come to those output tables (which contain a label column on the left, and
     65several purely numeric columns right of it), you need to do the statistical
     66analysis using a *formula* (that is the term I'm used to).
     67
     68Once you have these tables, *then* you can choose the best fitting diagram.
     69
     70In a UI you would choose the "label question", which should generally be an
     71option field or at least a field with a limited number of discrete values,
     72then the "question to analyze", which can be options, numeric or other things,
     73then you choose one or more formulas (<count>, <average>, <sum>, <min>, <max>,
     74<median>, <distribution>, <percentiles> etc.) and *then* the diagram to
     75display the results (or no diagram if you want to see the results in a table).
     76
     77Actually, you would first choose the "question to analyze" and then the
     78"formula", and then - only if appropriate or required - the labels column. That
     79is because simple distribution formulas (for histograms or pie charts), for
     80example, do not need an additional labels column.
     81
     82A simple statistics package for Eden would include just a limited number of
     83formulas (which we would have to choose, but I think that <count> and <sum>
     84are not enough to produce real value), and it would take the raw data with the
     85answers (i.e. the DB rows) as input and spit out that numeric table.
     86
     87The chart package would then take that numeric table as input and produce the
     88chart, of whatever type.
     89
     90As a next step beyond this basic analysis, you would certainly be interested
     91in analyzing trends and predicting developments, which is kinda "advanced"
     92statistics. Generally, you would still run the base formulas over surveys, but
     93then compare the results of multiple surveys and calculate trends and
     94forecasts. This would really go beyond the capabilities of a simple
     95spreadsheet and therefore be of extremely high value (is this a selling point?).
     96
     97Anyway - the architecture would include a "statistics package" (the
     98"formulas") and a "chart package" (the graph representation).
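
The split between the two packages could look roughly like the sketch below. Both functions are hypothetical stand-ins, and the text bar chart only takes the place of a real chart renderer; the only contract between them is the numeric table.

```python
def statistics_package(rows, formula):
    """Stand-in for the statistics package: raw (label, value) rows -> numeric table."""
    groups = {}
    for label, value in rows:
        groups.setdefault(label, []).append(value)
    return [(label, formula(values)) for label, values in groups.items()]

def chart_package(table, width=40):
    """Stand-in for the chart package: numeric table -> lines of a text bar chart.

    Assumes at least one value in the table is greater than zero.
    """
    largest = max(value for _, value in table)
    lines = []
    for label, value in table:
        bar = "#" * round(width * value / largest)
        lines.append(f"{label:<20} {bar} {value}")
    return lines

# Hypothetical raw rows, as they might come from the DB.
rows = [("Single", 25000), ("Single", 32000), ("2 Persons", 38000)]
table = statistics_package(rows, sum)
chart = chart_package(table)
```

Because the chart side never sees the raw rows, a different chart type can be added without touching the formulas, and the other way round.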
     99
     100=== Z Scores ===
     101
     102Z-scores are a simple statistical tool for measuring how unusual a particular value is within a data set. If the data is normally distributed, a histogram of it gives a bell curve with the mean at the apex of the curve. A tall, thin curve represents data with little variance; a short, fat curve represents data with a lot of variance. A common measure of this spread is the standard deviation (the square root of the variance), so a data set with little variance has a small standard deviation and a data set with a lot of variance has a larger one.
     103
     104As I said at the beginning, the z-score is a measure of how unusual a data item is. It assumes that most data lies at or around the mean, which is true for normally distributed data (that bell curve), and so it counts how far the item is from the mean in units of the standard deviation.
     105
     106Time for an example:
     107
     108Let us assume that the mean IQ of people living in Bangkok is 112 with a standard deviation of 16, and that when we measured your IQ we found it to be 148. Wow, that's good, but just how good is it? That's where the z-score comes in. You are 148 - 112 = 36 points from the mean. We express that 36 as a number of standard deviations: 36/16 = 2.25, so your z-score is 2.25.
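
The arithmetic of this example, as a minimal sketch (the function name `z_score` is made up for illustration):

```python
def z_score(value, mean, std_dev):
    """Distance of value from the mean, in units of the standard deviation."""
    return (value - mean) / std_dev

# The IQ example from the text: mean 112, standard deviation 16, measured 148.
z = z_score(148, 112, 16)  # 36 / 16 = 2.25
```

A value below the mean gives a negative z-score, for example `z_score(96, 112, 16)` is -1.0.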
     109
     110Now one thing that we know (again, this is for normally distributed data) is that about 68% of all values lie within one standard deviation of the mean and about 95% lie within two standard deviations of the mean. That leaves about 5% beyond two standard deviations, split between the two tails, so your z-score of 2.25 puts you in the top 2.5%. You may recall those tables that statistics students would always refer to. One of those, the unimaginatively titled positive z-score table, tells you that on selecting one person at random the chance that they would have that z-score or less is 98.78%, so the chance of that z-score or greater is only about 1.22%.
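
The table lookup can be reproduced without a printed table. The sketch below uses the standard normal cumulative distribution function, written with `math.erf` from the Python standard library; the function name `normal_cdf` is made up for this example, and it only applies if the data really is normally distributed.

```python
import math

def normal_cdf(z):
    """Probability that a standard normal value is at or below z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

below = normal_cdf(2.25)  # share of people at or below z = 2.25, roughly 0.9878
above = 1.0 - below       # share of people above z = 2.25, roughly 0.0122
```

The positive z-score table lists the `below` value; the chance of meeting someone at least that unusual is the `above` value.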
     111
     112=== Outliers ===
     113Another term I used was outlier: this is a value that is so far from the usual (the mean) that it could be wrong. All outliers should be double-checked. When looking for unusual data, anything with a z-score greater than 2 or less than -2 would be considered unusual. So this could be used as the cut-off point, but obviously it is something that can be parametrised. I want to use the z-score analysis to prioritise the markers on the map. The same code can be used to help identify outliers, which can help to identify typing errors or just poor data: little cost, big gain.
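
A minimal sketch of such an outlier check with the cut-off as a parameter. The function name `find_outliers`, the use of the population standard deviation (`statistics.pstdev`), and the sample household sizes are all assumptions made for the example.

```python
from statistics import mean, pstdev

def find_outliers(values, threshold=2.0):
    """Return (value, z-score) pairs whose z-score lies beyond +/- threshold."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    if sigma == 0:
        return []  # all values identical, nothing can be unusual
    scored = [(v, (v - mu) / sigma) for v in values]
    return [(v, z) for v, z in scored if abs(z) > threshold]

# Hypothetical household sizes; 40 is probably a typing error for 4.
household_sizes = [4, 5, 3, 4, 6, 5, 4, 40, 5, 4]
outliers = find_outliers(household_sizes)
```

Note that in a small sample a single extreme value also inflates the standard deviation, so the cut-off may need tuning per data set, which is one more reason to keep it a parameter.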