Understanding data and descriptive statistics
The workshop is for participants who want a hands-on:
 review of descriptive statistics
 introduction to exploratory data analysis
The workshop will help participants to:
 calculate summary statistics when the data set is available
 evaluate summary statistics that have been provided when the data set is not available
 combine graphs with summary statistics to explore data from multiple perspectives
Sample data set
This sample data set will be used to define terms that describe the key characteristics of typical data sets:
 elements
 variables
 observations
 data values
 type of variable
 scale of measurement
 time of data collection
 purpose of data collection
Exam marks for 68 students 
Encoded Student Number  Q1  Q2  Q3  Q4a  Q4b  Q5  Q6  Q7  Q8  Q9a  Q9b  Q10  Total mark 
07D  10  8  10  3  7  8  10  9  8  8  2  9  92 
A93  10  8  10  3  4  10  10  8  9  8  1  10  91 
048  10  8  8  3  7  9  8  8  10  8  1  10  90 
B45  9  6  10  3  4  10  10  10  8  8  1  7  86 
2E9  9  10  10  3  4  9  9  7  7  8  2  8  86 
FEB  9  7  9  3  4  10  9  7  9  8  1  9  85 
A6B  10  6  10  1  7  10  10  9  7  8  1  2  81 
6CA  5  10  4  3  4  10  10  7  8  8  2  8  79 
228  7  6  9  3  4  4  6  6  8  8  2  8  71 
3AC  7  8  8  0  0  8  10  7  5  8  1  8  70 
043  9  2  5  3  4  8  9  9  8  4  1  7  69 
E82  10  5  10  3  0  10  8  7  0  8  1  7  69 
402  8  4  7  3  4  6  6  6  9  8  1  6  68 
BB7  8  4  8  3  7  10  0  9  0  8  2  8  67 
8FE  10  4  10  3  2  5  7  6  4  8  1  7  67 
B8A  8  6  8  0  0  7  9  9  3  0  1  6  57 
3EB  8  4  7  0  0  6  7  8  4  8  2  2  56 
8E4  9  2  9  0  0  9  6  9  7  2  1  2  56 
D83  5  4  10  1  4  9  10  7  0  2  1  3  56 
244  6  3  7  3  7  3  8  2  0  8  1  7  55 
DA3  4  2  3  3  7  7  4  9  5  6  1  3  54 
A3A  5  4  10  0  7  0  10  3  2  8  1  2  52 
3CC  4  6  4  3  4  2  5  7  3  6  2  6  52 
29C  8  4  5  2  1  8  9  9  2  0  1  3  52 
2E2  3  6  2  3  4  3  4  8  7  8  2  0  50 
1E8  0  8  2  0  4  10  10  9  1  2  0  3  49 
889  9  1  8  3  3  4  7  6  0  2  0  2  45 
634  10  6  10  3  2  4  5  2  0  2  1  0  45 
FFB  6  0  7  0  4  2  8  8  4  0  1  2  42 
DEE  2  0  4  3  7  2  4  7  0  8  2  3  42 
A52  3  4  5  3  4  3  6  5  2  1  2  3  41 
FE0  3  2  3  3  7  3  4  5  3  4  1  3  41 
039  5  0  6  3  7  6  6  1  6  0  1  0  41 
604  8  0  10  0  2  2  6  2  0  8  2  0  40 
90B  5  2  7  0  0  8  10  3  2  0  1  2  40 
CC4  4  0  2  3  4  5  5  7  1  8  1  0  40 
DD9  7  2  5  0  4  7  8  3  3  0  1  0  40 
AC8  8  0  9  3  0  6  9  2  0  0  1  0  38 
B33  4  2  4  3  4  2  4  4  0  8  1  2  38 
20F  10  0  10  0  4  6  5  1  0  0  1  0  37 
5D2  6  4  4  0  4  0  3  5  5  2  1  3  37 
0E1  7  1  5  0  1  3  7  7  2  0  2  1  36 
D94  5  1  3  3  0  3  4  6  4  0  1  4  34 
E59  2  4  2  0  0  2  7  3  1  8  1  3  33 
F2B  8  1  8  0  0  0  9  2  2  0  1  1  32 
A23  10  0  9  0  0  1  8  0  0  0  2  2  32 
75C  7  0  7  0  0  6  5  2  1  0  1  1  30 
5B4  2  6  2  0  0  4  7  2  1  1  2  3  30 
D07  4  0  3  0  0  1  5  6  2  2  1  1  25 
438  0  2  2  3  4  0  2  4  0  6  1  0  24 
4FE  5  4  2  0  0  3  5  0  1  0  0  2  22 
8C1  3  0  3  0  0  0  6  7  0  0  1  0  20 
D0B  5  2  4  0  1  0  5  1  1  0  0  0  19 
FE4  2  4  2  0  0  0  6  4  0  0  0  1  19 
ACD  5  0  5  0  0  2  5  0  0  0  0  1  18 
4CC  7  0  4  0  0  1  2  0  0  1  1  2  18 
41B  3  0  1  3  0  0  3  0  0  6  1  0  17 
A90  4  2  4  0  0  2  0  0  0  2  1  1  16 
4EA  5  0  2  0  0  0  2  5  0  0  0  0  14 
0C3  2  0  1  0  0  0  0  2  0  0  1  3  9 
177  4  0  2  0  0  0  0  0  0  2  1  0  9 
623  3  0  1  0  2  0  3  0  0  0  0  0  9 
A0B  1  0  4  0  0  1  0  0  2  0  0  1  9 
4A3  1  0  1  0  0  1  1  0  0  0  1  0  5 
602  1  0  1  0  0  0  0  0  0  0  0  1  3 
B60  0  0  0  0  0  0  0  0  0  1  0  1  2 
078  0  2  0  0  0  0  0  0  0  0  0  0  2 
815  1  0  0  0  0  0  0  0  0  0  0  0  1 
There are 68 elements in the sample data set. An element is an entity on which data are collected; here, each student is an element.
Student  Q1  Q2  Q3  Q4a  Q4b  Q5  Q6  Q7  Q8  Q9a  Q9b  Q10  Total 
07D  10  8  10  3  7  8  10  9  8  8  2  9  92 
A93  10  8  10  3  4  10  10  8  9  8  1  10  91 
048  10  8  8  3  7  9  8  8  10  8  1  10  90 
B45  9  6  10  3  4  10  10  10  8  8  1  7  86 
• • • 
602  1  0  1  0  0  0  0  0  0  0  0  1  3 
B60  0  0  0  0  0  0  0  0  0  1  0  1  2 
078  0  2  0  0  0  0  0  0  0  0  0  0  2 
815  1  0  0  0  0  0  0  0  0  0  0  0  1 
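There are 12 variables in the sample data set. A variable is a characteristic recorded for each element; here, the variables are the marks for the 12 questions and question parts (Q1 to Q10, with Q4 and Q9 each split into parts a and b). The total mark is derived from these 12 marks.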
There are 68 observations in the sample data set. An observation is the set of data values recorded for one element, shown as one row of the table.
There are 816 data values in the sample data set: 68 observations multiplied by 12 variables.
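As a minimal sketch of how these counts relate, the Python snippet below types in the first two observations from the table and counts elements, variables and data values. The names `variables` and `observations` are illustrative choices and are not part of the workshop material.

```python
# Variables: the 12 question marks recorded for every student.
variables = ["Q1", "Q2", "Q3", "Q4a", "Q4b", "Q5",
             "Q6", "Q7", "Q8", "Q9a", "Q9b", "Q10"]

# Observations: one row of data values per element (student).
# Only the first two rows of the table are typed in here.
observations = {
    "07D": [10, 8, 10, 3, 7, 8, 10, 9, 8, 8, 2, 9],
    "A93": [10, 8, 10, 3, 4, 10, 10, 8, 9, 8, 1, 10],
}

n_elements = len(observations)            # students in this excerpt
n_variables = len(variables)              # marks recorded per student
n_data_values = n_elements * n_variables  # cells in this excerpt
print(n_elements, n_variables, n_data_values)

# The same multiplication for the full table of 68 students.
full_data_values = 68 * n_variables
print(full_data_values)

# The total mark is derived from the 12 data values of one observation.
total_07d = sum(observations["07D"])
print(total_07d)
```

The excerpt holds 2 × 12 = 24 data values, and the same multiplication with 68 students gives the 816 data values of the full table.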
Types of variables
The sample data set contains discrete variables because each variable can only take on a countable number of distinct values (here, whole-number marks).
By comparison, continuous variables may take on any value within a range. For example, temperature and distance are continuous variables.
When a variable has only two possible values, it is called a dichotomous variable. For example, logic values can only be true or false.
Scales of measurement
The sample data set uses a ratio scale of measurement. The table below compares the four scales of measurement.
Scale  Meaningful order  Meaningful difference  Natural zero 
Nominal  No  No  No 
Ordinal  YES  No  No 
Interval  YES  YES  No 
Ratio  YES  YES  YES 
Time and purpose of data collection
Cross sectional data are collected for many variables at the same point in time.
In contrast, time series data are repeatedly collected for the same variables at multiple points in time.
Observational data are collected without imposing any controlled conditions.
In contrast, experimental data are collected for specific variables to measure responses to controlled conditions.
The sample data set contains a cross section of observational data.
• Does the graph contain discrete, continuous or dichotomous variables?
• Does the graph use nominal, ordinal, interval or ratio scales?
• Does the graph have cross sectional or time series data?
Source: Orcutt, M. (2013). Humans generate most of the world's data, but machines are catching up. MIT Technology Review. Retrieved from http://www.technologyreview.com/view/509656/consumersgeneratemostoftheworldsdatabutmachinesarecatchingup
• Does the table contain discrete, continuous or dichotomous variables?
• Does the table use nominal, ordinal, interval or ratio scales?
• Does the table have cross sectional or time series data?
Source: MIT Technology Review. Retrieved from http://www.technologyreview.com/news/428049/trustusweregoogle
• Do the graphs contain discrete, continuous or dichotomous variables?
• Do the graphs use nominal, ordinal, interval or ratio scales?
• Do the graphs have cross sectional or time series data?
Source: Leber, J. (2012). Trust us, we're google! MIT Technology Review. Retrieved from http://www.technologyreview.com/news/427787/aresmartphonesspreadingfasterthananytechnologyinhumanhistory
• Does the graph contain discrete, continuous or dichotomous variables?
• Does the graph use nominal, ordinal, interval or ratio scales?
• Does the graph have cross sectional or time series data?
Source: Belluz, J. (2014). There were more measles cases in 2014 than any year during the last two decades. Vox. Retrieved from http://www.vox.com/2014/11/3/7149289/thereweremoremeaslescasesin2014thananyyearduringthelast
• Does the graph contain discrete, continuous or dichotomous variables?
• Does the graph use nominal, ordinal, interval or ratio scales?
• Does the graph have cross sectional or time series data?
Source: Orcutt, M. (2013). Humans generate most of the world's data, but machines are catching up. MIT Technology Review. Retrieved from http://www.technologyreview.com/view/509656/consumersgeneratemostoftheworldsdatabutmachinesarecatchingup
Measures of location
The most common approach to summarizing a data set is to find a single numerical quantity that reflects the typical observed value. These quantities are called measures of location or measures of central tendency. Although many such measures have been proposed, three are frequently used: the average, the median and the mode.
Average
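The average value is the arithmetic mean of all data values for one variable: the sum of the values divided by the number of values. The average exam mark is 41.99 for the sample data set (2,855 marks in total divided by 68 students).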
Student  Q1  Q2  Q3  Q4a  Q4b  Q5  Q6  Q7  Q8  Q9a  Q9b  Q10  Total  Rank 
07D  10  8  10  3  7  8  10  9  8  8  2  9  92  1 
A93  10  8  10  3  4  10  10  8  9  8  1  10  91  2 
• • • 
DEE  2  0  4  3  7  2  4  7  0  8  2  3  42  30 
A52  3  4  5  3  4  3  6  5  2  1  2  3  41  31 
• • • 
078  0  2  0  0  0  0  0  0  0  0  0  0  2  67 
815  1  0  0  0  0  0  0  0  0  0  0  0  1  68 
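The average can be checked with a few lines of Python. This is a minimal sketch: the list `totals` is simply the Total mark column of the sample data set typed in by hand, and the variable names are illustrative.

```python
from statistics import mean

# Total mark column of the sample data set, highest to lowest (68 values).
totals = [
    92, 91, 90, 86, 86, 85, 81, 79, 71, 70, 69, 69, 68, 67, 67, 57, 56,
    56, 56, 55, 54, 52, 52, 52, 50, 49, 45, 45, 42, 42, 41, 41, 41, 40,
    40, 40, 40, 38, 38, 37, 37, 36, 34, 33, 32, 32, 30, 30, 25, 24, 22,
    20, 19, 19, 18, 18, 17, 16, 14, 9, 9, 9, 9, 5, 3, 2, 2, 1,
]

# Arithmetic mean = sum of the values / number of values.
total_marks = sum(totals)
average = total_marks / len(totals)
print(total_marks, len(totals), round(average, 2))

# The standard library gives the same result.
library_average = mean(totals)
print(round(library_average, 2))
```

Both calculations round to the average exam mark of 41.99.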
Median
For data sets with an odd number of elements, the median is the middle data value when all values are arranged from lowest to highest.
For data sets with an even number of elements, the median is the average of the two middle data values when all values are arranged from lowest to highest.
The median exam mark is 40 for the sample data set.
The Nth percentile of a data set is a value that is equal to or greater than N percent of all values in the data set. To find it:
 Arrange the data values from lowest to highest.
 Calculate the index using this formula: index = (N / 100) * (number of elements)
 If the index is not a whole number, round it up. The Nth percentile is the data value found at the position indicated by the rounded index.
 If the index is a whole number, the Nth percentile is the average of the values at the positions indicated by the index and index + 1.
The 50th percentile is the same as the median value.
Student  Q1  Q2  Q3  Q4a  Q4b  Q5  Q6  Q7  Q8  Q9a  Q9b  Q10  Total  Rank 
07D  10  8  10  3  7  8  10  9  8  8  2  9  92  1 
A93  10  8  10  3  4  10  10  8  9  8  1  10  91  2 
• • • 
90B  5  2  7  0  0  8  10  3  2  0  1  2  40  35 
CC4  4  0  2  3  4  5  5  7  1  8  1  0  40  36 
• • • 
078  0  2  0  0  0  0  0  0  0  0  0  0  2  67 
815  1  0  0  0  0  0  0  0  0  0  0  0  1  68 
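The percentile procedure above can be sketched in Python as follows. The function name `percentile` is a hypothetical helper written for this illustration, and it assumes N is a whole number between 1 and 99. Statistical packages often use other interpolation rules, so their results may differ slightly from this method.

```python
from statistics import median

# Total mark column of the sample data set (68 values).
totals = [
    92, 91, 90, 86, 86, 85, 81, 79, 71, 70, 69, 69, 68, 67, 67, 57, 56,
    56, 56, 55, 54, 52, 52, 52, 50, 49, 45, 45, 42, 42, 41, 41, 41, 40,
    40, 40, 40, 38, 38, 37, 37, 36, 34, 33, 32, 32, 30, 30, 25, 24, 22,
    20, 19, 19, 18, 18, 17, 16, 14, 9, 9, 9, 9, 5, 3, 2, 2, 1,
]


def percentile(values, n):
    """Nth percentile using the index rule described in the text."""
    if not 0 < n < 100:
        raise ValueError("n must be between 1 and 99")
    ordered = sorted(values)  # step 1: lowest to highest
    # Step 2: index = (n / 100) * number of elements, kept as whole part
    # and remainder so that no floating point rounding is involved.
    whole, remainder = divmod(n * len(ordered), 100)
    if remainder:
        # Index is not a whole number: round up to position whole + 1,
        # which is list index `whole` because Python counts from zero.
        return ordered[whole]
    # Index is a whole number: average positions index and index + 1.
    return (ordered[whole - 1] + ordered[whole]) / 2


p25 = percentile(totals, 25)
p50 = percentile(totals, 50)
p75 = percentile(totals, 75)
p90 = percentile(totals, 90)
print(p25, p50, p75, p90)
print(median(totals))
```

For the exam marks the index for the 50th percentile is 34, a whole number, so the result is the average of the 34th and 35th values, which agrees with the median of 40.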
Mode
The mode of a data set is the value that occurs most frequently. If there are two modes, the data set is said to be bimodal. If there are more than two modes, the data set is said to be multimodal.
The modal exam mark is 40 for the sample data set, where it occurs four times. A mark of 9 also occurs four times, so strictly speaking the exam marks are bimodal.
Student  Q1  Q2  Q3  Q4a  Q4b  Q5  Q6  Q7  Q8  Q9a  Q9b  Q10  Total  Rank 
07D  10  8  10  3  7  8  10  9  8  8  2  9  92  1 
A93  10  8  10  3  4  10  10  8  9  8  1  10  91  2 
• • • 
039  5  0  6  3  7  6  6  1  6  0  1  0  41  33 
604  8  0  10  0  2  2  6  2  0  8  2  0  40  34 
90B  5  2  7  0  0  8  10  3  2  0  1  2  40  35 
CC4  4  0  2  3  4  5  5  7  1  8  1  0  40  36 
DD9  7  2  5  0  4  7  8  3  3  0  1  0  40  37 
AC8  8  0  9  3  0  6  9  2  0  0  1  0  38  38 
• • • 
078  0  2  0  0  0  0  0  0  0  0  0  0  2  67 
815  1  0  0  0  0  0  0  0  0  0  0  0  1  68 
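A frequency count is an easy way to find the mode and to check for ties. The sketch below uses `collections.Counter` and `statistics.multimode` (available from Python 3.8) on the Total mark column; the variable names are illustrative.

```python
from collections import Counter
from statistics import multimode

# Total mark column of the sample data set (68 values).
totals = [
    92, 91, 90, 86, 86, 85, 81, 79, 71, 70, 69, 69, 68, 67, 67, 57, 56,
    56, 56, 55, 54, 52, 52, 52, 50, 49, 45, 45, 42, 42, 41, 41, 41, 40,
    40, 40, 40, 38, 38, 37, 37, 36, 34, 33, 32, 32, 30, 30, 25, 24, 22,
    20, 19, 19, 18, 18, 17, 16, 14, 9, 9, 9, 9, 5, 3, 2, 2, 1,
]

# How often does each mark occur?
counts = Counter(totals)
highest_count = max(counts.values())

# multimode returns every value that shares the highest count.
modes = multimode(totals)
print(highest_count, sorted(modes))

# One mode, two modes (bimodal) or more than two (multimodal)?
if len(modes) == 1:
    description = "one mode"
elif len(modes) == 2:
    description = "bimodal"
else:
    description = "multimodal"
print(description)
```

Run on the Total mark column, the count shows that the marks 40 and 9 each occur four times, more often than any other mark.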
Which type of reader reads the most books?
Among book readers, number of books read in the last 12 months:
Community  Mean  Median 
Urban  17  8 
Suburban  17  8 
Rural  19  7 
Source: Miller, C., Purcell, K., Rainie, L. (2012). Reading habits in different communities. Retrieved from http://libraries.pewinternet.org/2012/12/20/readinghabitsindifferentcommunities
Which is a better investment to make?
Return per year over a period of 10 years:
Investment  Average  Median 
1st  zero  £750 
2nd  £1,500  £60 
3rd  £300  £275 
Measures of variability
Another common approach to summarizing a data set is to find a single numerical quantity that reflects how spread out the data are, that is, the amount of variability among the observed values. Two frequently used measures are the range and the standard deviation.
Range
The range is a quick and easy way to indicate the variation in a data set. It is calculated by subtracting the lowest observed value from the highest observed value, which makes it very sensitive to outlying values.
The exam marks in the sample data set have a range of 91 (the highest mark of 92 minus the lowest mark of 1).
Student  Q1  Q2  Q3  Q4a  Q4b  Q5  Q6  Q7  Q8  Q9a  Q9b  Q10  Total  Rank 
07D  10  8  10  3  7  8  10  9  8  8  2  9  92  1 
A93  10  8  10  3  4  10  10  8  9  8  1  10  91  2 
• • • 
604  8  0  10  0  2  2  6  2  0  8  2  0  40  34 
90B  5  2  7  0  0  8  10  3  2  0  1  2  40  35 
• • • 
078  0  2  0  0  0  0  0  0  0  0  0  0  2  67 
815  1  0  0  0  0  0  0  0  0  0  0  0  1  68 
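The sensitivity of the range to outlying values can be seen in a short Python sketch. The helper `value_range` and the list `typical_marks` are hypothetical and were made up for this illustration; only the marks 92 and 1 come from the sample data set.

```python
def value_range(values):
    """Range = highest observed value minus lowest observed value."""
    return max(values) - min(values)


# Highest and lowest total marks in the sample data set.
exam_range = value_range([92, 1])
print(exam_range)

# Hypothetical marks that sit close together.
typical_marks = [38, 40, 41, 42, 45]
range_without_outlier = value_range(typical_marks)

# Adding a single outlying mark changes the range dramatically.
range_with_outlier = value_range(typical_marks + [95])
print(range_without_outlier, range_with_outlier)
```

In the hypothetical list, one extra value moves the range from 7 to 57 even though the other five marks are unchanged.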
Standard deviation
Standard deviation is a common, but more complex, method of measuring variability. The steps to calculate standard deviation are:
 1. Subtract the mean value from each observed data value to calculate its deviation score.
 2. Square each deviation score and sum the squares.
 3. If the data set contains a complete population, divide the sum from step 2 by the number of elements in the data set to yield the variance.
 4. If the data set contains a sample subset instead, divide the sum from step 2 by one less than the number of elements in the sample to yield the variance.
 5. The standard deviation is the positive square root of the variance.
The exam marks in the sample data set have a standard deviation of 24.76 marks, calculated with the sample formula.
For data that are approximately normally distributed, approximately 68% of data values will be within one standard deviation of the average value.

Student  Q1  Q2  Q3  Q4a  Q4b  Q5  Q6  Q7  Q8  Q9a  Q9b  Q10  Total  Rank 
07D  10  8  10  3  7  8  10  9  8  8  2  9  92  1 
A93  10  8  10  3  4  10  10  8  9  8  1  10  91  2 
• • • 
8FE  10  4  10  3  2  5  7  6  4  8  1  7  67  15 
B8A  8  6  8  0  0  7  9  9  3  0  1  6  57  16 
3EB  8  4  7  0  0  6  7  8  4  8  2  2  56  17 
• • • 
ACD  5  0  5  0  0  2  5  0  0  0  0  1  18  55 
4CC  7  0  4  0  0  1  2  0  0  1  1  2  18  56 
41B  3  0  1  3  0  0  3  0  0  6  1  0  17  57 
• • • 
078  0  2  0  0  0  0  0  0  0  0  0  0  2  67 
815  1  0  0  0  0  0  0  0  0  0  0  0  1  68 
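The calculation steps for the standard deviation can be sketched in Python as follows. The function `standard_deviation` and its `is_sample` parameter are hypothetical names chosen for this illustration. The sketch also counts how many exam marks fall within one and two standard deviations of the average, for comparison with the 68% and 95% rules of thumb that apply to approximately normally distributed data.

```python
import math
from statistics import mean, stdev

# Total mark column of the sample data set (68 values).
totals = [
    92, 91, 90, 86, 86, 85, 81, 79, 71, 70, 69, 69, 68, 67, 67, 57, 56,
    56, 56, 55, 54, 52, 52, 52, 50, 49, 45, 45, 42, 42, 41, 41, 41, 40,
    40, 40, 40, 38, 38, 37, 37, 36, 34, 33, 32, 32, 30, 30, 25, 24, 22,
    20, 19, 19, 18, 18, 17, 16, 14, 9, 9, 9, 9, 5, 3, 2, 2, 1,
]


def standard_deviation(values, is_sample=True):
    """Standard deviation following the steps listed in the text."""
    average = mean(values)
    deviations = [x - average for x in values]       # deviation scores
    sum_of_squares = sum(d * d for d in deviations)  # squared and summed
    # Divide by n for a population or by n - 1 for a sample.
    divisor = len(values) - 1 if is_sample else len(values)
    variance = sum_of_squares / divisor
    return math.sqrt(variance)                       # positive square root


sample_sd = standard_deviation(totals)
population_sd = standard_deviation(totals, is_sample=False)
print(round(sample_sd, 2), round(population_sd, 2))

# How many marks lie within one and two standard deviations of the mean?
average = mean(totals)
within_one = sum(1 for x in totals if abs(x - average) <= sample_sd)
within_two = sum(1 for x in totals if abs(x - average) <= 2 * sample_sd)
print(within_one, within_two, len(totals))
```

The sample formula reproduces the 24.76 marks quoted for the exam marks, while the population formula gives about 24.58. For this data set, 41 of the 68 marks (about 60%) fall within one standard deviation and 67 of the 68 marks (about 99%) fall within two, which shows that the 68% and 95% figures are approximations rather than guarantees.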
Usage of standard deviation
For data that are approximately normally distributed, approximately 95% of data values will be within two standard deviations of the average value.

Student  Q1  Q2  Q3  Q4a  Q4b  Q5  Q6  Q7  Q8  Q9a  Q9b  Q10  Total  Rank 
07D  10  8  10  3  7  8  10  9  8  8  2  9  92  1 
A93  10  8  10  3  4  10  10  8  9  8  1  10  91  2 
048  10  8  8  3  7  9  8  8  10  8  1  10  90  3 
B45  9  6  10  3  4  10  10  10  8  8  1  7  86  4 
• • • 
078  0  2  0  0  0  0  0  0  0  0  0  0  2  67 
815  1  0  0  0  0  0  0  0  0  0  0  0  1  68 
A small survey was conducted with 10 people. Below is the list of ratings each person gave regarding the performance of the current Prime Minister. Calculate the standard deviation of the sample data set.
3, 9, 10, 4, 7, 8, 9, 5, 7, 8
Let the average price for a mobile phone be £120 and the standard deviation be £40.
 What are the lowest and highest prices that would include 95% of all mobile phones for sale?
 Would it be better for consumers if the standard deviation was lower or higher? Why?
Graphical techniques of exploratory data analysis (EDA)
Descriptive statistics can be supplemented with graphical techniques to further explore and summarize the key characteristics of data sets. This visual approach is often associated with the work of John Tukey. Below are several examples of graphical techniques commonly used with EDA:
 box plot
 balloon chart
 scatter plot
Box plots
Box plots provide a visual summary of the distribution of a variable. They are considered to be robust because outliers have minimal impact on the shape of the chart.
 Each box represents the middle half of the observed values for a variable, from the 25th percentile to the 75th percentile.
 The bold black line inside each box is the median value.
 The vertical bar (whisker) at the bottom extends to the lowest observed value that is equal to or higher than the 25th percentile minus 1.5 times the length of the box.
 The whisker at the top extends to the highest observed value that is equal to or lower than the 75th percentile plus 1.5 times the length of the box.
 The circular symbols represent observed values found outside of the whiskers, often called outliers.
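The numbers behind a box plot can be calculated without drawing anything. The sketch below does this for the exam marks from the sample data set, which is an assumption made for illustration since the box plots in this workshop show other data. The helper `percentile` is hypothetical and uses the index rule from the median section; plotting software may use a slightly different quartile rule.

```python
# Total mark column of the sample data set (68 values).
totals = [
    92, 91, 90, 86, 86, 85, 81, 79, 71, 70, 69, 69, 68, 67, 67, 57, 56,
    56, 56, 55, 54, 52, 52, 52, 50, 49, 45, 45, 42, 42, 41, 41, 41, 40,
    40, 40, 40, 38, 38, 37, 37, 36, 34, 33, 32, 32, 30, 30, 25, 24, 22,
    20, 19, 19, 18, 18, 17, 16, 14, 9, 9, 9, 9, 5, 3, 2, 2, 1,
]


def percentile(values, n):
    """Nth percentile (whole number n from 1 to 99), index rule."""
    ordered = sorted(values)
    whole, remainder = divmod(n * len(ordered), 100)
    if remainder:
        return ordered[whole]  # index rounded up
    return (ordered[whole - 1] + ordered[whole]) / 2


# The box: 25th percentile, median and 75th percentile.
q1 = percentile(totals, 25)
box_median = percentile(totals, 50)
q3 = percentile(totals, 75)
box_length = q3 - q1  # also called the interquartile range

# Limits for the whiskers: 1.5 box lengths beyond each end of the box.
lower_limit = q1 - 1.5 * box_length
upper_limit = q3 + 1.5 * box_length

# Whiskers end at the most extreme observed values inside the limits.
lower_whisker = min(x for x in totals if x >= lower_limit)
upper_whisker = max(x for x in totals if x <= upper_limit)

# Anything beyond the whiskers would be drawn as a circle (outlier).
outliers = [x for x in totals if x < lower_whisker or x > upper_whisker]

print(q1, box_median, q3, box_length)
print(lower_whisker, upper_whisker, outliers)
```

For the exam marks the box runs from 21 to 56 with the median at 40, the whiskers reach the lowest and highest marks (1 and 92), and no marks are flagged as outliers.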
Box plots to show the relationship between year level and convergence of student feedback
Balloon charts to show the relationship between marks and peer review participation
Scatter plot to show the relationship between marks and the number of days that the students used their individual coursework pages
Scatter plot to show the number of units that have threshold number of student feedback responses
Questions and suggestions
Please contact Chi Nguyen with questions or suggestions.