<http://lib.cnfolio.com/ExploratoryDataAnalysisIntroduction>

Understanding data and descriptive statistics



The workshop is for participants who want a hands-on:

The workshop will help participants to:





Sample data set


This sample data set will be used to define terms that describe the key characteristics of typical data sets:

Exam marks for 68 students
Encoded Student Number  Q1  Q2  Q3  Q4a  Q4b  Q5  Q6  Q7  Q8  Q9a  Q9b  Q10  Total mark
07D10810378109882992
A931081034101089811091
04810883798810811090
B45961034101010881786
2E99101034997782886
FEB979341097981985
A6B106101710109781281
6CA51043410107882879
22876934466882871
3AC788008107581870
04392534899841769
E8210510301087081769
40284734666981668
BB7848371009082867
8FE1041032576481767
B8A86800799301657
3EB84700678482256
8E492900969721256
D835410149107021356
24463737382081755
DA342337749561354
A3A5410070103281252
3CC46434257362652
29C84521899201352
2E236234348782050
1E80820410109120349
88991833476020245
6341061032452021045
FFB60704288401242
DEE20437247082342
A5234534365212341
FE032337345341341
03950637661601041
604801002262082040
90B527008103201240
CC440234557181040
DD972504783301040
AC880930692001038
B3342434244081238
20F1001004651001037
5D264404035521337
0E171501377202136
D9451330346401434
E5924200273181333
F2B81800092201132
A23100900180002232
75C70700652101130
5B426200472112330
D0740300156221125
43802234024061024
4FE54200350100222
8C130300067001020
D0B52401051100019
FE424200064000119
ACD50500250000118
4CC70400120011218
41B30130030061017
A9042400200021116
4EA50200025000014
0C32010000200139
1774020000002109
6233010203000009
A0B1040010020019
4A31010011000105
6021010000000013
B600000000001012
0780200000000002
8151000000000001





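There are 68 elements in the sample data set, one for each student.

Student  Q1  Q2  Q3  Q4a  Q4b  Q5  Q6  Q7  Q8  Q9a  Q9b  Q10  Total
07D      10   8  10    3    7   8  10   9   8    8    2    9     92
A93      10   8  10    3    4  10  10   8   9    8    1   10     91
048      10   8   8    3    7   9   8   8  10    8    1   10     90
B45       9   6  10    3    4  10  10  10   8    8    1    7     86
• • •
602       1   0   1    0    0   0   0   0   0    0    0    1      3
B60       0   0   0    0    0   0   0   0   0    1    0    1      2
078       0   2   0    0    0   0   0   0   0    0    0    0      2
815       1   0   0    0    0   0   0   0   0    0    0    0      1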





There are 12 variables in the sample data set: the marks for questions Q1 to Q10, with Q4 and Q9 each split into parts a and b.






There are 68 observations in the sample data set, one row of marks per student.






There are 816 data values in the sample data set (68 observations × 12 variables).
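
The count of data values is the number of observations multiplied by the number of variables. A minimal Python sketch of that arithmetic, using only the first four rows of the table as an illustration (the names `variables` and `rows` are assumptions made for this example, not part of the workshop material):

```python
# Question columns of the sample data set. The student ID and the total mark
# are not counted among the 12 variables.
variables = ["Q1", "Q2", "Q3", "Q4a", "Q4b", "Q5",
             "Q6", "Q7", "Q8", "Q9a", "Q9b", "Q10"]

# First four observations only; the full data set has 68 rows.
rows = {
    "07D": [10, 8, 10, 3, 7, 8, 10, 9, 8, 8, 2, 9],
    "A93": [10, 8, 10, 3, 4, 10, 10, 8, 9, 8, 1, 10],
    "048": [10, 8, 8, 3, 7, 9, 8, 8, 10, 8, 1, 10],
    "B45": [9, 6, 10, 3, 4, 10, 10, 10, 8, 8, 1, 7],
}

values_shown = sum(len(marks) for marks in rows.values())
values_full = 68 * len(variables)
print(len(variables), values_shown, values_full)  # 12 48 816

# Each total mark is the sum of the 12 question marks in its row.
totals = {student: sum(marks) for student, marks in rows.items()}
print(totals)  # {'07D': 92, 'A93': 91, '048': 90, 'B45': 86}
```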






Types of variables


The sample data set contains discrete variables because each mark can only take one of a limited set of separate values (whole numbers in this case).

By comparison, continuous variables may take on any value within a range, so the number of possible values is infinite. For example, temperature and distance are continuous variables.

When a variable has only two possible values, it is called a dichotomous variable. For example, logic values can only be true or false.





Scales of measurement


The sample data set uses a ratio scale of measurement. Below are more types of scales.

          Meaningful order  Meaningful difference  Natural zero
Nominal   No                No                     No
Ordinal   YES               No                     No
Interval  YES               YES                    No
Ratio     YES               YES                    YES






Time and purpose of data collection


Cross sectional data are collected for many variables at the same point in time.

In contrast, time series data are repeatedly collected for the same variables over multiple points in time.

Observational data are collected by recording what happens, without controlling or changing any conditions.

In contrast, experimental data are collected for specific variables to measure responses to controlled conditions.

The sample data set contains a cross section of observational data.





• Does the graph contain discrete, continuous or dichotomous variables?
• Does the graph use nominal, ordinal, interval or ratio scales?
• Does the graph have cross sectional or time series data?



Source: Orcutt, M. (2013). Humans generate most of the world's data, but machines are catching up. MIT Technology Review. Retrieved from http://www.technologyreview.com/view/509656/consumers-generate-most-of-the-worlds-data-but-machines-are-catching-up





• Does the table contain discrete, continuous or dichotomous variables?
• Does the table use nominal, ordinal, interval or ratio scales?
• Does the table have cross sectional or time series data?



Source: MIT Technology Review. Retrieved from http://www.technologyreview.com/news/428049/trust-us-were-google





• Do the graphs contain discrete, continuous or dichotomous variables?
• Do the graphs use nominal, ordinal, interval or ratio scales?
• Do the graphs have cross sectional or time series data?



Source: Leber, J. (2012). Trust us, we're Google! MIT Technology Review. Retrieved from http://www.technologyreview.com/news/427787/are-smart-phones-spreading-faster-than-any-technology-in-human-history





• Does the graph contain discrete, continuous or dichotomous variables?
• Does the graph use nominal, ordinal, interval or ratio scales?
• Does the graph have cross sectional or time series data?



Source: Belluz, J. (2014). There were more measles cases in 2014 than any year during the last two decades. Vox. Retrieved from http://www.vox.com/2014/11/3/7149289/there-were-more-measles-cases-in-2014-than-any-year-during-the-last





• Does the graph contain discrete, continuous or dichotomous variables?
• Does the graph use nominal, ordinal, interval or ratio scales?
• Does the graph have cross sectional or time series data?



Source: Orcutt, M. (2013). Humans generate most of the world's data, but machines are catching up. MIT Technology Review. Retrieved from http://www.technologyreview.com/view/509656/consumers-generate-most-of-the-worlds-data-but-machines-are-catching-up





Measures of location


The most common approach to summarizing data sets is to find a single numerical quantity that reflects the typical observed value. These quantities are called measures of location or measures of central tendency. Although hundreds of such measures have been proposed, three are frequently used: the average, the median and the mode.





Average


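The average value is the arithmetic mean of all data values for one variable: the sum of the values divided by the number of values. The average exam mark is 41.99 for the sample data set (2,855 marks in total divided by 68 students).

Student  Q1  Q2  Q3  Q4a  Q4b  Q5  Q6  Q7  Q8  Q9a  Q9b  Q10  Total  Rank
07D      10   8  10    3    7   8  10   9   8    8    2    9     92     1
A93      10   8  10    3    4  10  10   8   9    8    1   10     91     2
• • •
DEE       2   0   4    3    7   2   4   7   0    8    2    3     42    30
A52       3   4   5    3    4   3   6   5   2    1    2    3     41    31
• • •
078       0   2   0    0    0   0   0   0   0    0    0    0      2    67
815       1   0   0    0    0   0   0   0   0    0    0    0      1    68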
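
The average can be checked with a few lines of Python. This is a minimal sketch: the list `totals` holds the 68 total marks as read from the full sample data set above, and the variable names are assumptions made for this example.

```python
from statistics import mean

# Total marks for the 68 students, highest to lowest.
totals = [
    92, 91, 90, 86, 86, 85, 81, 79, 71, 70, 69, 69, 68, 67, 67, 57, 56,
    56, 56, 55, 54, 52, 52, 52, 50, 49, 45, 45, 42, 42, 41, 41, 41, 40,
    40, 40, 40, 38, 38, 37, 37, 36, 34, 33, 32, 32, 30, 30, 25, 24, 22,
    20, 19, 19, 18, 18, 17, 16, 14, 9, 9, 9, 9, 5, 3, 2, 2, 1,
]

# Arithmetic mean: sum of the values divided by the number of values.
average = mean(totals)
print(len(totals), sum(totals), round(average, 2))  # 68 2855 41.99
```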





Median


For data sets with an odd number of elements, the median is the middle data value when all values are arranged from lowest to highest.
For data sets with an even number of elements, the median is the average of the two middle data values when all values are arranged from lowest to highest.

The median exam mark is 40 for the sample data set.

The Nth percentile of a data set is a value that is equal to or greater than N percent of all values in the data set.

The 50th percentile is the same as the median value.

Student  Q1  Q2  Q3  Q4a  Q4b  Q5  Q6  Q7  Q8  Q9a  Q9b  Q10  Total  Rank
07D      10   8  10    3    7   8  10   9   8    8    2    9     92     1
A93      10   8  10    3    4  10  10   8   9    8    1   10     91     2
• • •
90B       5   2   7    0    0   8  10   3   2    0    1    2     40    35
CC4       4   0   2    3    4   5   5   7   1    8    1    0     40    36
• • •
078       0   2   0    0    0   0   0   0   0    0    0    0      2    67
815       1   0   0    0    0   0   0   0   0    0    0    0      1    68
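
Because the sample data set has an even number of elements (68), the median is the average of the 34th and 35th values in sorted order. The sketch below assumes the list `totals` holds the 68 total marks from the full table. It uses `statistics.quantiles` with its default method for the 50th percentile; other tools may define percentiles slightly differently.

```python
from statistics import median, quantiles

# Total marks for the 68 students, highest to lowest.
totals = [
    92, 91, 90, 86, 86, 85, 81, 79, 71, 70, 69, 69, 68, 67, 67, 57, 56,
    56, 56, 55, 54, 52, 52, 52, 50, 49, 45, 45, 42, 42, 41, 41, 41, 40,
    40, 40, 40, 38, 38, 37, 37, 36, 34, 33, 32, 32, 30, 30, 25, 24, 22,
    20, 19, 19, 18, 18, 17, 16, 14, 9, 9, 9, 9, 5, 3, 2, 2, 1,
]

ordered = sorted(totals)                    # lowest to highest
middle_pair = (ordered[33], ordered[34])    # 34th and 35th values
middle = median(totals)                     # average of the middle pair
p50 = quantiles(totals, n=100)[49]          # 50th of the 99 percentile cut points

print(middle_pair, middle, p50)  # (40, 40) 40.0 40.0
```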





Mode


The mode of a data set is the value that occurs most frequently. If there are two modes, the data set is said to be bimodal. If there are more than two modes, the data set is said to be multimodal.

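Two marks are tied as the most frequent in the sample data set: 40 and 9 each occur four times, so the total marks are bimodal. The rows below show the four students with a total mark of 40.

Student  Q1  Q2  Q3  Q4a  Q4b  Q5  Q6  Q7  Q8  Q9a  Q9b  Q10  Total  Rank
07D      10   8  10    3    7   8  10   9   8    8    2    9     92     1
A93      10   8  10    3    4  10  10   8   9    8    1   10     91     2
• • •
039       5   0   6    3    7   6   6   1   6    0    1    0     41    33
604       8   0  10    0    2   2   6   2   0    8    2    0     40    34
90B       5   2   7    0    0   8  10   3   2    0    1    2     40    35
CC4       4   0   2    3    4   5   5   7   1    8    1    0     40    36
DD9       7   2   5    0    4   7   8   3   3    0    1    0     40    37
AC8       8   0   9    3    0   6   9   2   0    0    1    0     38    38
• • •
078       0   2   0    0    0   0   0   0   0    0    0    0      2    67
815       1   0   0    0    0   0   0   0   0    0    0    0      1    68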
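
Counting how often each total mark occurs is a quick way to find the mode. The sketch below assumes the list `totals` holds the 68 total marks from the full table. Counting them this way suggests that 40 and 9 each occur four times, so both values are reported.

```python
from collections import Counter
from statistics import multimode

# Total marks for the 68 students, highest to lowest.
totals = [
    92, 91, 90, 86, 86, 85, 81, 79, 71, 70, 69, 69, 68, 67, 67, 57, 56,
    56, 56, 55, 54, 52, 52, 52, 50, 49, 45, 45, 42, 42, 41, 41, 41, 40,
    40, 40, 40, 38, 38, 37, 37, 36, 34, 33, 32, 32, 30, 30, 25, 24, 22,
    20, 19, 19, 18, 18, 17, 16, 14, 9, 9, 9, 9, 5, 3, 2, 2, 1,
]

counts = Counter(totals)      # frequency of each total mark
modes = multimode(totals)     # every value tied for the highest frequency

print(modes, counts[40], counts[9])  # [40, 9] 4 4
```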





Which type of reader reads the most books?

Number of books read in the last 12 months, among book readers:

Community             Mean  Median
Urban communities       17       8
Suburban communities    17       8
Rural communities       19       7

Source: Miller, C., Purcell, K., Rainie, L. (2012). Reading habits in different communities. Retrieved from http://libraries.pewinternet.org/2012/12/20/reading-habits-in-different-communities





Which is a better investment to make?

Returns per year over a period of 10 years:

Investment      Average return per year  Median return per year
1st investment  zero                     £750
2nd investment  £1,500                   £60
3rd investment  £300                     £275





Measures of variability


Another common approach to summarizing data sets is to find a single numerical quantity that reflects how spread out the data are, that is, the amount of variability among the observed values. Two frequently used measures are the range and the standard deviation.





Range


The range is a quick and easy way to indicate the variation in a data set. It is calculated by subtracting the lowest observed value from the highest observed value, which makes it very sensitive to outlying values.

The exam marks in the sample data set have a range of 91 (the highest mark of 92 minus the lowest mark of 1).

Student  Q1  Q2  Q3  Q4a  Q4b  Q5  Q6  Q7  Q8  Q9a  Q9b  Q10  Total  Rank
07D      10   8  10    3    7   8  10   9   8    8    2    9     92     1
A93      10   8  10    3    4  10  10   8   9    8    1   10     91     2
• • •
604       8   0  10    0    2   2   6   2   0    8    2    0     40    34
90B       5   2   7    0    0   8  10   3   2    0    1    2     40    35
• • •
078       0   2   0    0    0   0   0   0   0    0    0    0      2    67
815       1   0   0    0    0   0   0   0   0    0    0    0      1    68
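
The range only depends on the two extreme values, which is why a single outlier can change it so much. A minimal sketch, assuming the list `totals` holds the 68 total marks from the full table:

```python
# Total marks for the 68 students, highest to lowest.
totals = [
    92, 91, 90, 86, 86, 85, 81, 79, 71, 70, 69, 69, 68, 67, 67, 57, 56,
    56, 56, 55, 54, 52, 52, 52, 50, 49, 45, 45, 42, 42, 41, 41, 41, 40,
    40, 40, 40, 38, 38, 37, 37, 36, 34, 33, 32, 32, 30, 30, 25, 24, 22,
    20, 19, 19, 18, 18, 17, 16, 14, 9, 9, 9, 9, 5, 3, 2, 2, 1,
]

highest = max(totals)
lowest = min(totals)
mark_range = highest - lowest   # highest observed value minus lowest

print(highest, lowest, mark_range)  # 92 1 91
```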





Standard deviation


Standard deviation is a common, but more complex, method of measuring variability. The steps to calculate standard deviation are:
  1. Subtract the mean value from each observed data value to calculate its deviation score
  2. Sum the squared value of all deviation scores
  3. If the data set contains a complete population, then divide the sum from step 2 by the number of elements in the data set to yield the variance
  4. If the data set contains a sample subset, then divide the sum from step 2 by one less than the number of elements in the sample to yield the variance
  5. The standard deviation is the positive square root of the variance

The exam marks in the sample data set have a standard deviation of 24.76 marks, calculated with the sample formula from step 4.
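
The steps above can be sketched in Python as follows. The list `totals` is assumed to hold the 68 total marks from the full table, and the variable names are chosen for this example only. Both the population and the sample versions are computed so that the effect of dividing by n or by n - 1 is visible.

```python
from math import sqrt
from statistics import pstdev, stdev

# Total marks for the 68 students, highest to lowest.
totals = [
    92, 91, 90, 86, 86, 85, 81, 79, 71, 70, 69, 69, 68, 67, 67, 57, 56,
    56, 56, 55, 54, 52, 52, 52, 50, 49, 45, 45, 42, 42, 41, 41, 41, 40,
    40, 40, 40, 38, 38, 37, 37, 36, 34, 33, 32, 32, 30, 30, 25, 24, 22,
    20, 19, 19, 18, 18, 17, 16, 14, 9, 9, 9, 9, 5, 3, 2, 2, 1,
]

n = len(totals)
mean_mark = sum(totals) / n

deviations = [x - mean_mark for x in totals]   # step 1: deviation scores
sum_squares = sum(d * d for d in deviations)   # step 2: sum of squared deviations
population_variance = sum_squares / n          # step 3: complete population
sample_variance = sum_squares / (n - 1)        # step 4: sample subset
population_sd = sqrt(population_variance)      # step 5: positive square root
sample_sd = sqrt(sample_variance)

print(round(sample_sd, 2), round(population_sd, 2))  # 24.76 24.58

# Cross-check against the standard library implementations.
assert abs(sample_sd - stdev(totals)) < 1e-9
assert abs(population_sd - pstdev(totals)) < 1e-9
```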

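For data that follow an approximately normal distribution, approximately 68% of data values will be within one standard deviation of the average value.

Student  Q1  Q2  Q3  Q4a  Q4b  Q5  Q6  Q7  Q8  Q9a  Q9b  Q10  Total  Rank
07D      10   8  10    3    7   8  10   9   8    8    2    9     92     1
A93      10   8  10    3    4  10  10   8   9    8    1   10     91     2
• • •
8FE      10   4  10    3    2   5   7   6   4    8    1    7     67    15
B8A       8   6   8    0    0   7   9   9   3    0    1    6     57    16
3EB       8   4   7    0    0   6   7   8   4    8    2    2     56    17
• • •
ACD       5   0   5    0    0   2   5   0   0    0    0    1     18    55
4CC       7   0   4    0    0   1   2   0   0    1    1    2     18    56
41B       3   0   1    3    0   0   3   0   0    6    1    0     17    57
• • •
078       0   2   0    0    0   0   0   0   0    0    0    0      2    67
815       1   0   0    0    0   0   0   0   0    0    0    0      1    68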





Usage of standard deviation


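For data that follow an approximately normal distribution, approximately 95% of data values will be within two standard deviations of the average value.

Student  Q1  Q2  Q3  Q4a  Q4b  Q5  Q6  Q7  Q8  Q9a  Q9b  Q10  Total  Rank
07D      10   8  10    3    7   8  10   9   8    8    2    9     92     1
A93      10   8  10    3    4  10  10   8   9    8    1   10     91     2
048      10   8   8    3    7   9   8   8  10    8    1   10     90     3
B45       9   6  10    3    4  10  10  10   8    8    1    7     86     4
• • •
078       0   2   0    0    0   0   0   0   0    0    0    0      2    67
815       1   0   0    0    0   0   0   0   0    0    0    0      1    68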
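
The 68% and 95% figures are rules of thumb, so it can be useful to count how many values actually fall within one and two standard deviations. A minimal sketch, assuming the list `totals` holds the 68 total marks from the full table and using the sample standard deviation; `within` is a helper name made up for this example.

```python
from statistics import mean, stdev

# Total marks for the 68 students, highest to lowest.
totals = [
    92, 91, 90, 86, 86, 85, 81, 79, 71, 70, 69, 69, 68, 67, 67, 57, 56,
    56, 56, 55, 54, 52, 52, 52, 50, 49, 45, 45, 42, 42, 41, 41, 41, 40,
    40, 40, 40, 38, 38, 37, 37, 36, 34, 33, 32, 32, 30, 30, 25, 24, 22,
    20, 19, 19, 18, 18, 17, 16, 14, 9, 9, 9, 9, 5, 3, 2, 2, 1,
]

average = mean(totals)
sd = stdev(totals)   # sample standard deviation (divides by n - 1)


def within(k):
    """Count the marks that lie within k standard deviations of the average."""
    low, high = average - k * sd, average + k * sd
    return sum(1 for x in totals if low <= x <= high)


one_sd = within(1)
two_sd = within(2)
print(f"{one_sd} of {len(totals)} ({one_sd / len(totals):.1%})")  # 41 of 68 (60.3%)
print(f"{two_sd} of {len(totals)} ({two_sd / len(totals):.1%})")  # 67 of 68 (98.5%)
```

For these exam marks the observed shares are close to, but not exactly, 68% and 95%, which is expected when the data are only roughly bell-shaped.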




A small survey was conducted with 10 people. Below is the list of ratings each person gave regarding the performance of the current Prime Minister. Calculate the standard deviation of the sample data set.

    3, 9, 10, 4, 7, 8, 9, 5, 7, 8





Suppose the average price of a mobile phone is £120 and the standard deviation is £40.
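
Applying the 68% and 95% rules of thumb to these figures gives two price bands. The sketch below is only an illustration and assumes that phone prices are roughly normally distributed, which the example does not state; the dictionary name `bands` is made up for this example.

```python
mean_price = 120   # average price in pounds
sd_price = 40      # standard deviation in pounds

# Price band within k standard deviations of the average, for k = 1 and k = 2.
bands = {
    k: (mean_price - k * sd_price, mean_price + k * sd_price)
    for k in (1, 2)
}

print(bands[1])  # (80, 160): roughly 68% of prices under the assumption above
print(bands[2])  # (40, 200): roughly 95% of prices under the assumption above
```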






Graphical techniques of exploratory data analysis (EDA)


Descriptive statistics can be supplemented with graphical techniques to further explore and summarize the key characteristics of data sets. This visual approach is often associated with the work of John Tukey. Below are several examples of graphical techniques commonly used with EDA: box plots, balloon charts and scatter plots.





Box plots


Box plots provide a visual summary of the distribution of a variable. They are considered to be robust because the box is drawn from the median and the quartiles, so a few outliers have minimal impact on the shape of the chart.
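
The numbers behind a box plot can be computed without drawing anything. The sketch below assumes the list `totals` holds the 68 total marks from the sample data set and uses a common convention, often attributed to Tukey, of flagging values more than 1.5 times the interquartile range beyond the quartiles. Quartile definitions vary between tools, so other software may report slightly different values, and the charts in this workshop may have been drawn with different settings.

```python
from statistics import quantiles

# Total marks for the 68 students, highest to lowest.
totals = [
    92, 91, 90, 86, 86, 85, 81, 79, 71, 70, 69, 69, 68, 67, 67, 57, 56,
    56, 56, 55, 54, 52, 52, 52, 50, 49, 45, 45, 42, 42, 41, 41, 41, 40,
    40, 40, 40, 38, 38, 37, 37, 36, 34, 33, 32, 32, 30, 30, 25, 24, 22,
    20, 19, 19, 18, 18, 17, 16, 14, 9, 9, 9, 9, 5, 3, 2, 2, 1,
]

# Lower quartile, median and upper quartile (default "exclusive" method).
q1, q2, q3 = quantiles(totals, n=4)
iqr = q3 - q1                      # height of the box

# Fences used to flag possible outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in totals if x < lower_fence or x > upper_fence]

print(min(totals), q1, q2, q3, max(totals))  # 1 20.5 40.0 56.0 92
print(lower_fence, upper_fence, outliers)    # -32.75 109.25 []
```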








Box plots to show the relationship between year level and convergence of student feedback
















Balloon charts to show the relationship between marks and peer review participation












Scatter plot to show the relationship between marks and the number of days that the students used their individual coursework pages








Scatter plot to show the number of units that have a threshold number of student feedback responses








Questions and suggestions


Please contact Chi Nguyen with questions or suggestions.