
Summarizing Data

DATA 606 - Statistics & Probability for Data Analytics

Jason Bryer, Ph.D. and Angela Lui, Ph.D.

February 9, 2022

1 / 64

Agenda

  • Questions
  • Data wrangling
    • Data types
    • Descriptive statistics
  • Data visualization
    • Grammar of graphics
    • Types of graphics

Data Wrangler Image source: @allison_horst

2 / 64

One Minute Paper Results

What was the most important thing you learned during this class?

What important question remains unanswered for you?

3 / 64

Workflow

Data Science Workflow

Source: Wickham & Grolemund, 2017

4 / 64

Tidy Data

See Wickham (2014) Tidy data.

5 / 64

Types of Data

  • Numerical (quantitative)
    • Continuous
    • Discrete
  • Categorical (qualitative)
    • Regular categorical
    • Ordinal
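
As a rough illustration, these four types are commonly represented in R as shown below. This is only a sketch: the variable names and values are hypothetical and do not come from the course data.

```r
# Hypothetical example values; none of these come from the course data sets.
height_cm  <- c(170.2, 165.5, 180.1)               # continuous  -> numeric (double)
n_siblings <- c(0L, 2L, 1L)                        # discrete    -> integer
eye_color  <- factor(c("brown", "blue", "brown"))  # categorical -> factor
rating     <- factor(c("low", "high", "medium"),
                     levels = c("low", "medium", "high"),
                     ordered = TRUE)               # ordinal     -> ordered factor

class(height_cm)   # "numeric"
class(n_siblings)  # "integer"
class(eye_color)   # "factor"
class(rating)      # "ordered" "factor"

# Ordered factors support comparisons between levels.
rating < "high"    # TRUE FALSE TRUE
```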

6 / 64

Data Types in R

7 / 64

Data Types / Descriptives / Visualizations

Data Type | Descriptive Stats | Visualization
Continuous | mean, median, mode, standard deviation, IQR | histogram, density, box plot
Discrete | contingency table, proportional table, median | bar plot
Categorical | contingency table, proportional table | bar plot
Ordinal | contingency table, proportional table, median | bar plot
Two quantitative | correlation | scatter plot
Two qualitative | contingency table, chi-squared | mosaic plot, bar plot
Quantitative & Qualitative | grouped summaries, ANOVA, t-test | box plot
8 / 64

Variance

Population Variance: $S^2 = \frac{\sum (x_i - \bar{x})^2}{N}$

Consider a dataset with five values (black points in the figure). For the largest value, the deviance is represented by the blue line ($x_i - \bar{x}$).

See also: https://shiny.rit.albany.edu/stat/visualizess/
https://github.com/jbryer/VisualStats/

9 / 64

Variance (cont.)

Population Variance: $S^2 = \frac{\sum (x_i - \bar{x})^2}{N}$

In the numerator, we square each of these deviances. We can conceptualize this as a square. Here, we add the deviance in the y direction.

10 / 64

Variance (cont.)

Population Variance: $S^2 = \frac{\sum (x_i - \bar{x})^2}{N}$

We end up with a square.

11 / 64

Variance (cont.)

Population Variance: $S^2 = \frac{\sum (x_i - \bar{x})^2}{N}$

We can plot the squared deviance for all the data points. That is, each component in the numerator is the area of one of these squares.

12 / 64

Variance (cont.)

Population Variance: $S^2 = \frac{\sum (x_i - \bar{x})^2}{N}$

The variance is therefore the average of the area of all these squares, here represented by the orange square.

13 / 64

Population versus Sample Variance

Typically we want the sample variance. The difference is that we divide by $n - 1$ instead of $n$ to calculate the sample variance. This results in a slightly larger area (variance) than if we divide by $n$.

Population Variance (yellow): $S^2 = \frac{\sum (x_i - \bar{x})^2}{N}$

Sample Variance (green): $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$
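
The two formulas can be checked by hand in R. The following is a minimal sketch using five made-up values (an assumption for illustration, not the points in the figure). It also shows that R's built-in `var()` uses the n - 1 denominator.

```r
# Five made-up values (assumed for illustration only).
x <- c(2, 4, 6, 8, 10)
n <- length(x)

squared_dev <- (x - mean(x))^2          # area of each square
pop_var  <- sum(squared_dev) / n        # population variance: divide by n
samp_var <- sum(squared_dev) / (n - 1)  # sample variance: divide by n - 1

pop_var                      # 8
samp_var                     # 10
all.equal(samp_var, var(x))  # TRUE: var() uses the n - 1 denominator
```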

14 / 64

Robust Statistics

Consider the following data randomly selected from the normal distribution:

set.seed(41)
x <- rnorm(30, mean = 100, sd = 15)
mean(x); sd(x)
## [1] 103.1934
## [1] 16.8945
median(x); IQR(x)
## [1] 103.9947
## [1] 25.68004

15 / 64

Robust Statistics

16 / 64

Robust Statistics

Let's add an extreme value:

x <- c(x, 1000)

16 / 64

Robust Statistics

Median and IQR are more robust to skewness and outliers than mean and SD. Therefore,

  • for skewed distributions it is often more helpful to use median and IQR to describe the center and spread

  • for symmetric distributions it is often more helpful to use the mean and SD to describe the center and spread
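
The effect of a single extreme value can be seen with a few numbers. This sketch uses a small made-up vector (not the `rnorm()` sample above), so the results can be verified by hand.

```r
# Small made-up sample (not the rnorm() data used above).
y     <- c(2, 4, 6, 8, 10)
y_out <- c(y, 1000)        # add one extreme value

mean(y);   mean(y_out)     # 6 versus about 171.7
median(y); median(y_out)   # 6 versus 7
sd(y) < sd(y_out)          # TRUE: the SD grows substantially
IQR(y);    IQR(y_out)      # 4 versus 5
```

The mean and SD change dramatically, while the median and IQR barely move.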

17 / 64

About legosets

To install the brickset package:

remotes::install_github('jbryer/brickset')

To load the legosets dataset:

data('legosets', package = 'brickset')

The legosets data has 16355 observations of 34 variables.

names(legosets)
## [1] "setID" "name" "year"
## [4] "theme" "themeGroup" "subtheme"
## [7] "category" "released" "pieces"
## [10] "minifigs" "bricksetURL" "rating"
## [13] "reviewCount" "packagingType" "availability"
## [16] "agerange_min" "US_retailPrice" "US_dateFirstAvailable"
## [19] "US_dateLastAvailable" "UK_retailPrice" "UK_dateFirstAvailable"
## [22] "UK_dateLastAvailable" "CA_retailPrice" "CA_dateFirstAvailable"
## [25] "CA_dateLastAvailable" "DE_retailPrice" "DE_dateFirstAvailable"
## [28] "DE_dateLastAvailable" "height" "width"
## [31] "depth" "weight" "thumbnailURL"
## [34] "imageURL"
18 / 64

Structure (str)

str(legosets)
## 'data.frame': 16355 obs. of 34 variables:
## $ setID : int 7693 7695 7697 7698 25534 7418 7419 6020 22704 7421 ...
## $ name : chr "Small house set" "Medium house set" "Medium house set" "Large house set" ...
## $ year : int 1970 1970 1970 1970 1970 1970 1970 1970 1970 1970 ...
## $ theme : chr "Minitalia" "Minitalia" "Minitalia" "Minitalia" ...
## $ themeGroup : chr "Vintage" "Vintage" "Vintage" "Vintage" ...
## $ subtheme : chr NA NA NA NA ...
## $ category : chr "Normal" "Normal" "Normal" "Normal" ...
## $ released : logi TRUE TRUE TRUE TRUE TRUE TRUE ...
## $ pieces : int 67 109 158 233 NA 1 1 60 65 NA ...
## $ minifigs : int NA NA NA NA NA NA NA NA NA NA ...
## $ bricksetURL : chr "https://brickset.com/sets/1-8" "https://brickset.com/sets/2-8" "https://brickset.com/sets/3-6" "https://brickset.com/sets/4-4" ...
## $ rating : num 0 0 0 0 0 0 0 0 0 0 ...
## $ reviewCount : int 0 0 1 0 0 0 0 1 0 0 ...
## $ packagingType : chr "{Not specified}" "{Not specified}" "{Not specified}" "{Not specified}" ...
## $ availability : chr "{Not specified}" "{Not specified}" "{Not specified}" "{Not specified}" ...
## $ agerange_min : int NA NA NA NA NA NA NA NA NA NA ...
## $ US_retailPrice : num NA NA NA NA NA 1.99 NA NA 4.99 NA ...
## $ US_dateFirstAvailable: Date, format: NA NA ...
## $ US_dateLastAvailable : Date, format: NA NA ...
## $ UK_retailPrice : num NA NA NA NA NA NA NA NA NA NA ...
## $ UK_dateFirstAvailable: Date, format: NA NA ...
## $ UK_dateLastAvailable : Date, format: NA NA ...
## $ CA_retailPrice : num NA NA NA NA NA NA NA NA NA NA ...
## $ CA_dateFirstAvailable: Date, format: NA NA ...
## $ CA_dateLastAvailable : Date, format: NA NA ...
## $ DE_retailPrice : num NA NA NA NA NA NA NA NA NA NA ...
## $ DE_dateFirstAvailable: Date, format: NA NA ...
## $ DE_dateLastAvailable : Date, format: NA NA ...
## $ height : num NA NA NA NA NA ...
## $ width : num NA NA NA NA NA ...
## $ depth : num NA NA NA NA NA NA NA NA 5.08 NA ...
## $ weight : num NA NA NA NA NA NA NA NA NA NA ...
## $ thumbnailURL : chr "https://images.brickset.com/sets/small/1-8.jpg" "https://images.brickset.com/sets/small/2-8.jpg" "https://images.brickset.com/sets/small/3-6.jpg" "https://images.brickset.com/sets/small/4-4.jpg" ...
## $ imageURL : chr "https://images.brickset.com/sets/images/1-8.jpg" "https://images.brickset.com/sets/images/2-8.jpg" "https://images.brickset.com/sets/images/3-6.jpg" "https://images.brickset.com/sets/images/4-4.jpg" ...
19 / 64

RStudio Environment tab can help

20 / 64

Data Wrangling Cheat Sheet

22 / 64

Tidyverse vs Base R

23 / 64

Pipes %>%

The pipe operator (%>%) introduced with the magrittr R package allows for the chaining of R operations. It takes the output from the left-hand side and passes it as the first parameter to the function on the right-hand side. In base R, to get the output of a proportional table, you need to first call table then prop.table.

You can do this in two steps:

tab_out <- table(legosets$category)
prop.table(tab_out)

Or as nested function calls.

prop.table(table(legosets$category))

Using the pipe (%>%) operator we can chain these calls in what is arguably a more readable format:

table(legosets$category) %>% prop.table()

##
## Book Collection Extended Gear Normal Other
## 0.028798533 0.032100275 0.025191073 0.143564659 0.713420972 0.054050749
## Random
## 0.002873739
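
Since R 4.1.0, base R also provides a native pipe (`|>`) that behaves similarly for simple chains like this one. The sketch below uses the built-in `mtcars` data instead of `legosets` (an assumption, so that it runs without extra packages) to show that the nested and piped forms give the same table.

```r
# Uses the built-in mtcars data instead of legosets so no packages are needed.
# Requires R >= 4.1.0 for the native pipe.
nested <- prop.table(table(mtcars$cyl))
piped  <- table(mtcars$cyl) |> prop.table()

identical(nested, piped)  # TRUE: both forms compute the same table
round(piped, 3)
```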
24 / 64

Filter

25 / 64

Logical Operators

  • !a - TRUE if a is FALSE
  • a == b - TRUE if a and b are equal
  • a != b - TRUE if a and b are not equal
  • a > b - TRUE if a is larger than b, but not equal
  • a >= b - TRUE if a is larger than or equal to b
  • a < b - TRUE if a is smaller than b, but not equal
  • a <= b - TRUE if a is smaller than or equal to b
  • a %in% b - TRUE if a is in b where b is a vector
which( letters %in% c('a','e','i','o','u') )
## [1] 1 5 9 15 21
  • a | b - TRUE if a or b are TRUE
  • a & b - TRUE if a and b are TRUE
  • isTRUE(a) - TRUE if a is TRUE
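
These operators are vectorized, and comparisons involving NA return NA rather than FALSE. A short sketch with made-up vectors (the values are assumptions for illustration):

```r
# Made-up vectors for illustration.
a <- c(1, 5, 10, NA)
b <- 5

a == b              # FALSE TRUE FALSE NA
a >= b              # FALSE TRUE TRUE NA
a %in% c(5, 10)     # FALSE TRUE TRUE FALSE (%in% never returns NA)
a > 1 & a < 10      # FALSE TRUE FALSE NA
!is.na(a) & a > 1   # FALSE TRUE TRUE FALSE
isTRUE(a[4] == b)   # FALSE: the comparison is NA, not TRUE
```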
26 / 64

Filter

dplyr

mylego <- legosets %>% filter(themeGroup == 'Educational' & year > 2015)

Base R

mylego <- legosets[legosets$themeGroup == 'Educational' & legosets$year > 2015,]

nrow(mylego)
## [1] 61
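
One difference between the two approaches is worth keeping in mind: `filter()` drops rows where the condition evaluates to NA, whereas base R's `[` returns an all-NA row for each of them. The sketch below uses a hypothetical toy data frame (not `legosets`) and base R only. `which()` and `subset()` are two ways to get the `filter()`-like behavior.

```r
# Hypothetical toy data frame (not the legosets data).
toy <- data.frame(group = c("A", "B", NA, "A"),
                  year  = c(2016, 2018, 2020, 2010))

cond <- toy$group == "A" & toy$year > 2015     # TRUE FALSE NA FALSE

nrow(toy[cond, ])                              # 2: the NA condition yields an all-NA row
nrow(toy[which(cond), ])                       # 1: which() drops NA conditions
nrow(subset(toy, group == "A" & year > 2015))  # 1: subset() also drops them
```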
27 / 64

Select

dplyr

mylego <- mylego %>% select(setID, pieces, theme, availability, US_retailPrice, minifigs)

Base R

mylego <- mylego[,c('setID', 'pieces', 'theme', 'availability', 'US_retailPrice', 'minifigs')]

head(mylego, n = 4)
## setID pieces theme availability US_retailPrice minifigs
## 1 26803 103 Education {Not specified} NA 6
## 2 26689 142 Education {Not specified} NA 4
## 3 26804 98 Education {Not specified} NA 6
## 4 26277 188 Education Educational 78.95 NA
28 / 64

Relocate

29 / 64

Relocate

dplyr

mylego %>% relocate(where(is.numeric), .after = where(is.character)) %>% head(n = 3)
## theme availability setID pieces US_retailPrice minifigs
## 1 Education {Not specified} 26803 103 NA 6
## 2 Education {Not specified} 26689 142 NA 4
## 3 Education {Not specified} 26804 98 NA 6

Base R

mylego2 <- mylego[,c('theme', 'availability', 'setID', 'pieces', 'US_retailPrice', 'minifigs')]
head(mylego2, n = 3)
## theme availability setID pieces US_retailPrice minifigs
## 1 Education {Not specified} 26803 103 NA 6
## 2 Education {Not specified} 26689 142 NA 4
## 3 Education {Not specified} 26804 98 NA 6
30 / 64

Rename

31 / 64

Rename

dplyr

mylego %>% dplyr::rename(USD = US_retailPrice) %>% head(n = 3)
## setID pieces theme availability USD minifigs
## 1 26803 103 Education {Not specified} NA 6
## 2 26689 142 Education {Not specified} NA 4
## 3 26804 98 Education {Not specified} NA 6

Base R

names(mylego2)[5] <- 'USD'
head(mylego2, n = 3)
## theme availability setID pieces USD minifigs
## 1 Education {Not specified} 26803 103 NA 6
## 2 Education {Not specified} 26689 142 NA 4
## 3 Education {Not specified} 26804 98 NA 6
32 / 64

Mutate

33 / 64

Mutate

dplyr

mylego %>% filter(!is.na(pieces) & !is.na(US_retailPrice)) %>%
mutate(Price_per_piece = US_retailPrice / pieces) %>% head(n = 3)
## setID pieces theme availability US_retailPrice minifigs Price_per_piece
## 1 26277 188 Education Educational 78.95 NA 0.4199468
## 2 25949 280 Education Educational 224.95 NA 0.8033929
## 3 25954 1 Education Educational 14.95 NA 14.9500000

Base R

mylego2 <- mylego[!is.na(mylego$pieces) & !is.na(mylego$US_retailPrice),]
mylego2$Price_per_piece <- mylego2$US_retailPrice / mylego2$pieces
head(mylego2, n = 3)
## (same rows and Price_per_piece values as the dplyr output above; base R keeps the original row names)
34 / 64

Group By and Summarize

legosets %>% group_by(themeGroup) %>% summarize(mean_price = mean(US_retailPrice, na.rm = TRUE),
sd_price = sd(US_retailPrice, na.rm = TRUE),
median_price = median(US_retailPrice, na.rm = TRUE),
n = n(),
missing = sum(is.na(US_retailPrice)))
## # A tibble: 15 × 6
## themeGroup mean_price sd_price median_price n missing
## <chr> <dbl> <dbl> <dbl> <int> <int>
## 1 Action/Adventure 31.3 29.9 20.0 1280 462
## 2 Basic 13.1 12.8 7.99 843 473
## 3 Constraction 15.1 14.0 9.99 501 125
## 4 Educational 89.0 107. 59.7 452 294
## 5 Girls 23.4 22.6 15.0 677 225
## 6 Historical 25.5 27.7 15.0 473 125
## 7 Junior 18.6 13.2 17.8 228 93
## 8 Licensed 42.9 58.3 25.0 2060 467
## 9 Miscellaneous 14.3 20.8 6.99 4925 2117
## 10 Model making 52.8 65.1 30.0 582 166
## 11 Modern day 31.2 33.7 20.0 1723 763
## 12 Pre-school 23.8 19.4 20.0 1487 699
## 13 Racing 24.8 30.2 10 270 59
## 14 Technical 60.8 68.1 40.0 550 137
## 15 Vintage 9.71 9.56 7.50 304 264
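
For comparison, grouped summaries can also be computed in base R. This is a sketch using `aggregate()` and `tapply()` on the built-in `mtcars` data, which stands in for `legosets` here (an assumption, so that the example runs without extra packages). Note that the formula interface of `aggregate()` omits rows with missing values by default.

```r
# Built-in mtcars data stands in for legosets (an assumption for portability).
# Mean, SD, and count of mpg within each level of cyl.
by_cyl <- aggregate(mpg ~ cyl, data = mtcars,
                    FUN = function(v) c(mean = mean(v), sd = sd(v), n = length(v)))
by_cyl

# tapply() gives the same group means as a named vector.
group_means <- tapply(mtcars$mpg, mtcars$cyl, mean)
group_means
```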
35 / 64

Describe and Describe By

library(psych)
describe(legosets$US_retailPrice)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 9886 28.52 42 14.99 20.14 14.83 0 799.99 799.99 5.62 58.91 0.42
describeBy(legosets$US_retailPrice, group = legosets$availability, mat = TRUE, skew = FALSE)
## item group1 vars n mean sd min max
## X11 1 {Not specified} 1 3197 24.24484 36.282072 0.60 789.99
## X12 2 Educational 1 9 140.95000 86.358265 14.95 244.95
## X13 3 LEGO exclusive 1 1066 28.79797 70.954538 0.00 799.99
## X14 4 LEGOLAND exclusive 1 7 12.70429 6.447591 4.99 19.99
## X15 5 Not sold 1 1 12.99000 NA 12.99 12.99
## X16 6 Promotional 1 167 9.19485 23.667555 0.00 249.99
## X17 7 Promotional (Airline) 1 11 15.79455 6.614819 5.00 28.00
## X18 8 Retail 1 4824 29.82030 33.270049 1.95 399.99
## X19 9 Retail - limited 1 600 44.64837 57.391438 0.40 379.99
## X110 10 Unknown 1 4 2.24750 1.253671 1.00 3.99
## range se
## X11 789.39 0.6416833
## X12 230.00 28.7860885
## X13 799.99 2.1732094
## X14 15.00 2.4369603
## X15 0.00 NA
## X16 249.99 1.8314504
## X17 23.00 1.9944429
## X18 398.04 0.4790158
## X19 379.59 2.3429956
## X110 2.99 0.6268356
36 / 64

Grammar of Graphics

37 / 64

Data Visualizations with ggplot2

  • ggplot2 is an R package that provides an alternative framework based upon Wilkinson’s (2005) Grammar of Graphics.

  • ggplot2 is, in general, more flexible for creating "prettier" and complex plots.

  • Works by creating layers of different types of objects/geometries (e.g. bars, points, lines, polygons).

  • ggplot2 has at least three ways of creating plots:

    1. qplot
    2. ggplot(...) + geom_XXX(...) + ...
    3. ggplot(...) + layer(...)
  • We will focus only on the second.

38 / 64

Parts of a ggplot2 Statement

  • Data
    ggplot(myDataFrame, aes(x=x, y=y))

  • Layers
    geom_point(), geom_histogram()

  • Facets
    facet_wrap(~ cut), facet_grid(~ cut)

  • Scales
    scale_y_log10()

  • Other options
    ggtitle('my title'), ylim(c(0, 10000)), xlab('x-axis label')
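
Putting the parts above together, a complete statement might look like the following sketch. It assumes the `diamonds` data set that ships with ggplot2 (suggested by the `~ cut` facet examples, but an assumption nonetheless). The title and axis label are made up.

```r
library(ggplot2)

# diamonds ships with ggplot2; its use here is an assumption.
p <- ggplot(diamonds, aes(x = carat, y = price)) +  # data and aesthetics
  geom_point(alpha = 0.2) +                         # layer
  facet_wrap(~ cut) +                               # facets
  scale_y_log10() +                                 # scale
  ggtitle('Diamond price by carat') +               # other options
  xlab('Carat')

p  # printing the object draws the plot
```

Each `+` adds one component, so parts can be added or removed independently while exploring the data.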

39 / 64

Lots of geoms

ls('package:ggplot2')[grep('^geom_', ls('package:ggplot2'))]
## [1] "geom_abline" "geom_area" "geom_bar"
## [4] "geom_bin_2d" "geom_bin2d" "geom_blank"
## [7] "geom_boxplot" "geom_col" "geom_contour"
## [10] "geom_contour_filled" "geom_count" "geom_crossbar"
## [13] "geom_curve" "geom_density" "geom_density_2d"
## [16] "geom_density_2d_filled" "geom_density2d" "geom_density2d_filled"
## [19] "geom_dotplot" "geom_errorbar" "geom_errorbarh"
## [22] "geom_freqpoly" "geom_function" "geom_hex"
## [25] "geom_histogram" "geom_hline" "geom_jitter"
## [28] "geom_label" "geom_line" "geom_linerange"
## [31] "geom_map" "geom_path" "geom_point"
## [34] "geom_pointrange" "geom_polygon" "geom_qq"
## [37] "geom_qq_line" "geom_quantile" "geom_raster"
## [40] "geom_rect" "geom_ribbon" "geom_rug"
## [43] "geom_segment" "geom_sf" "geom_sf_label"
## [46] "geom_sf_text" "geom_smooth" "geom_spoke"
## [49] "geom_step" "geom_text" "geom_tile"
## [52] "geom_violin" "geom_vline"
40 / 64

Data Visualization Cheat Sheet

41 / 64

Scatterplot

ggplot(legosets, aes(x=pieces, y=US_retailPrice)) + geom_point()

42 / 64

Scatterplot (cont.)

ggplot(legosets, aes(x=pieces, y=US_retailPrice, color=availability)) + geom_point()

43 / 64

Scatterplot (cont.)

ggplot(legosets, aes(x=pieces, y=US_retailPrice, size=minifigs, color=availability)) + geom_point()

44 / 64

Scatterplot (cont.)

ggplot(legosets, aes(x=pieces, y=US_retailPrice, size=minifigs)) + geom_point() + facet_wrap(~ availability)

45 / 64

Boxplots

ggplot(legosets, aes(x='Lego', y=US_retailPrice)) + geom_boxplot()

46 / 64

Boxplots (cont.)

ggplot(legosets, aes(x=availability, y=US_retailPrice)) + geom_boxplot()

47 / 64

Boxplots (cont.)

ggplot(legosets, aes(x=availability, y=US_retailPrice)) + geom_boxplot() + coord_flip()

48 / 64

Histograms

ggplot(legosets, aes(x = US_retailPrice)) + geom_histogram()

49 / 64

Histograms (cont.)

ggplot(legosets, aes(x = US_retailPrice)) + geom_histogram() + scale_x_log10()

50 / 64

Histograms (cont.)

ggplot(legosets, aes(x = US_retailPrice)) + geom_histogram() + facet_wrap(~ availability)

51 / 64

Density Plots

ggplot(legosets, aes(x = US_retailPrice, color = availability)) + geom_density()

52 / 64

ggplot2 aesthetics

53 / 64

Likert Scales

Likert scales are a type of questionnaire where respondents are asked to rate items on scales usually ranging from four to seven levels (e.g. strongly disagree to strongly agree).

library(likert)
library(reshape)
data(pisaitems)
items24 <- pisaitems[,substr(names(pisaitems), 1,5) == 'ST24Q']
items24 <- rename(items24, c(
ST24Q01="I read only if I have to.",
ST24Q02="Reading is one of my favorite hobbies.",
ST24Q03="I like talking about books with other people.",
ST24Q04="I find it hard to finish books.",
ST24Q05="I feel happy if I receive a book as a present.",
ST24Q06="For me, reading is a waste of time.",
ST24Q07="I enjoy going to a bookstore or a library.",
ST24Q08="I read only to get information that I need.",
ST24Q09="I cannot sit still and read for more than a few minutes.",
ST24Q10="I like to express my opinions about books I have read.",
ST24Q11="I like to exchange books with my friends."))
54 / 64

likert R Package

l24 <- likert(items24)
summary(l24)
## Item low neutral
## 10 I like to express my opinions about books I have read. 41.07516 0
## 5 I feel happy if I receive a book as a present. 46.93475 0
## 8 I read only to get information that I need. 50.39874 0
## 7 I enjoy going to a bookstore or a library. 51.21231 0
## 3 I like talking about books with other people. 54.99129 0
## 11 I like to exchange books with my friends. 55.54115 0
## 2 Reading is one of my favorite hobbies. 56.64470 0
## 1 I read only if I have to. 58.72868 0
## 4 I find it hard to finish books. 65.35125 0
## 9 I cannot sit still and read for more than a few minutes. 76.24524 0
## 6 For me, reading is a waste of time. 82.88729 0
## high mean sd
## 10 58.92484 2.604913 0.9009968
## 5 53.06525 2.466751 0.9446590
## 8 49.60126 2.484616 0.9089688
## 7 48.78769 2.428508 0.9164136
## 3 45.00871 2.328049 0.9090326
## 11 44.45885 2.343193 0.9609234
## 2 43.35530 2.344530 0.9277495
## 1 41.27132 2.291811 0.9369023
## 4 34.64875 2.178299 0.8991628
## 9 23.75476 1.974736 0.8793028
## 6 17.11271 1.810093 0.8611554
55 / 64

likert Plots

plot(l24)

56 / 64

likert Plots

plot(l24, type='heat')

57 / 64

likert Plots

plot(l24, type='density')

58 / 64

Dual Scales

Some problems:

  • The designer has to make choices about scales and this can have a big impact on the viewer
  • "Cross-over points" where one series crosses another are results of the design choices, not intrinsic to the data, and viewers (particularly unsophisticated viewers) may not appreciate this
  • They make it easier to lazily associate correlation with causation, not taking into account autocorrelation and other time-series issues
  • Because of the issues above, in malicious hands they make it possible to deliberately mislead

This example looks at the relationship between NZ dollar exchange rate and trade weighted index.

DATA606::shiny_demo('DualScales', package='DATA606')

My advice:

  • Avoid using them. You can usually do better with other plot types.
  • When you must use them (or are compelled to), rescale the series, for example using z-scores (we'll discuss these in a few weeks)
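
As a minimal sketch of the rescaling idea, the following converts two series to z-scores so they can share one axis. The series names and values are made up for illustration and are not the NZ data used in the demo.

```r
# Two made-up series on very different scales (hypothetical values).
exchange_rate <- c(0.65, 0.68, 0.70, 0.72, 0.69)
trade_index   <- c(71, 74, 77, 79, 75)

# z-score: subtract the mean and divide by the standard deviation.
z_rate  <- (exchange_rate - mean(exchange_rate)) / sd(exchange_rate)
z_index <- as.numeric(scale(trade_index))   # scale() does the same thing

round(c(mean(z_rate), sd(z_rate)), 10)      # 0 1

# Both series can now share a single y-axis.
matplot(cbind(z_rate, z_index), type = 'l', lty = 1,
        xlab = 'Time', ylab = 'z-score')
```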
59 / 64

Pie Charts

There is only one pie chart in OpenIntro Statistics (Diez, Barr, & Çetinkaya-Rundel, 2015, p. 48). Consider the following three pie charts that represent the preference of five different colors. Is there a difference between the three pie charts? This is probably difficult to answer.

Source: https://en.wikipedia.org/wiki/Pie_chart.

61 / 64

Just say NO to pie charts!

"There is no data that can be displayed in a pie chart that cannot better be displayed in some other type of chart"

John Tukey
62 / 64

Additional Resources

For data wrangling:

For data visualization:

63 / 64

One Minute Paper

Complete the one minute paper: https://forms.gle/qxRnsCyydx1nf8sXA

  1. What was the most important thing you learned during this class?

  2. What important question remains unanswered for you?

64 / 64
