wiki:BluePrint/SurveyTool/Statistics

Version 2 (modified by Michael Howden, 10 years ago) ( diff )

--

Survey Tool Blue Print - Assessment Data Analysis Tool (ADAT)

Statistics

Statistics Package

Your data may look like:

Question A (4 Options) Question B (3 Options)

Answers:

Option A-1 | Option B-1
Option A-2 | Option B-2
Option A-1 | Option B-3

...

To display a diagram, you need a table like:

Option B-1 Option B-2 Option B-3
Option A-1 <count> <count> <count>
Option A-2 <count> <count> <count>
Option A-3 <count> <count> <count>
Option A-4 <count> <count> <count>

where <count> represents the "formula" here.

Certainly, for options the <count> formula is the simplest one. There are plenty more of these - e.g. <sum>, <average>, <relative frequency> etc etc.

To make this more comprehensible, say that one of the fields is a numeric value, like in:

Family size (4 Options) Family Income (numeric)

Answers:

Single 25000
Single 32000
2 Persons 38000
Single 92000
3-5 Persons 80000
more than 5 Persons 123000

....and so forth

Now the analysis could give a table like:

Income
Single <sum>
2 Persons <sum>
3-5 Persons <sum>
more than five Persons <sum>

...which can be displayed as a barchart.

Instead of <sum> you can have other formulas of interest here: <average>, <maximum>, <minimum>, <median>, ....or you want to analyze in <percentiles> or <distribution> like:

50% or less average 150% or more
Single <count> <count> <count>
2 Persons <count> <count> <count>
3-5 Persons <count> <count> <count>
more than five Persons <count> <count> <count>

To come to those output tables (which contain a label column on the left, and several purely numeric columns right of it), you need to do the statistical analysis using a *formula* (that is the term I'm used to).

Once you have these tables, *then* you can choose the best fitting diagram.

In a UI you would choose the "label question", which should generally be an option field or at least a field with a limited number of discrete values, then the "question to analyze", which can be options, numeric or other things, then you choose one or more formulas (<count>, <average>, <sum>, <min>, <max>, <median>, <distribution>, <percentiles> etc.) and *then* the diagram to display the results (or no diagram if you want to see the results in a table).

Actually, you would at first choose the "question to analyze" and then the "formula", and then - only if apropriate or required - the labels column. That is because in simple distribution formulas (for histograms or pie charts) for example you would not need an additional labels column.

A simple statistics package for Eden would include just a limited number of formulas (which we would have to choose, but I think that <count> and <sum> are not enough to produce real value), and it would take the raw data with the answers (i.e. the DB rows) as input and spit out that numeric table.

The chart package would then take that numeric table as input and produce the chart, whatever type of.

As a next step beyond this basic analysis, you would certainly be interested in analyzing trends and predicting developments, which is kinda "advanced" statistics. Generally, you would still run the base formulas over surveys, but then compare the results of multiple surveys and calculate trends and forecasts. This would really go beyond the capabilities of a simple spreadsheet and therefore be of extremely high value (is this a seller?)

Anyway - the architecture would include a "statistics package" (the "formulas") and a "chart package" (the graph representation).

Z Scores

Basically they are a simple statistical tool to measure how unusual a particular value is within the data set. If data is normally distributed then when you map it using a histogram it will give you a bell curve with the mean through the apex of the curve. A curve that is tall & thin would represent data with little variance, a curve that is short and fat would represent one that has a lot of variance. A common measure of variance is the standard deviation. So a data set with little variance will have a small standard deviation and a data set with a lot of variation will have a larger standard deviation.

As I said at the beginning the z-score is a measure of how unusual the data item is. It assumes that most data should be in or around the mean which is true for normally distributed data (that bell curve) and so it counts how far the data is from the mean in terms of the standard deviation.

Time for an example:

Let us assume that the mean IQ of people living in Bangkok is 112 with a standard deviation of 16 and when we measured you IQ we found that it's 148, wow that's good, but just how good is it? That's where the z-score comes in. You are 148 - 112 points from the mean or 36 points we convert that 36 in terms of how many standard deviation or 36/16 = 2.25 so your z-score is 2.25.

Now one thing that we know is that (again this is for normally distributed data) about 68% of all values will lie within one standard deviation of the mean and 95% will lie within two standard deviations of the mean. So your z-score of 2.25 means that you are in the top 5%. You may recall those tables that statistics students would always refer to, one of those the unimaginatively titled positive z-score table would be able to tell you that on selecting one person at random the chance that they would have that z-score or greater would be 98.78%.

Outliers

Another term I used was outlier this is a value that is so far from the usual (mean) that it could be wrong. All outliers should be double checked. When looking for unusual data anything with a z-score greater than 2 or less than -2 would be considered unusual. So this could be used as the cut-off point but obviously it is something than can be parametrised. I want to use the z-score analysis to prioritise the markers on the map, the same code can be used to help identify outliers which can help to identify typing errors, or just poor data, little cost big gain.

Note: See TracWiki for help on using the wiki.