Summarizing Data

class: center, middle, inverse, title-slide

# Summarizing Data
## DATA 606 - Statistics & Probability for Data Analytics
### Jason Bryer, Ph.D. and Angela Lui, Ph.D.
### February 9, 2022

---

# Agenda

.pull-left[.font130[
* Questions
* Data wrangling
	* Data types
	* Descriptive statistics
* Data visualization
	* Grammar of graphics
	* Types of graphics
]]
.pull-right[
<img src='images/data_wrangler.png' alt='Data Wrangler' width='100%' />
.right[.font60[ Image source: [@allison_horst](https://twitter.com/allison_horst) ]]
]

---
# One Minute Paper Results

.pull-left[
**What was the most important thing you learned during this class?**
<img src="02-Summarizing_Data_files/figure-html/unnamed-chunk-2-1.png" style="display: block; margin: auto;" />
]
.pull-right[
**What important question remains unanswered for you?**
<img src="02-Summarizing_Data_files/figure-html/unnamed-chunk-3-1.png" style="display: block; margin: auto;" />
]

---
# Workflow

.center[
<img src='images/data-science-wrangle.png' alt = 'Data Science Workflow' width='1000' />
]

.font80[Source: [Wickham & Grolemund, 2017](https://r4ds.had.co.nz)]

---
# Tidy Data

.center[
<img src='images/tidydata_1.jpg' height='500' />
]

See Wickham (2014) [Tidy data](https://vita.had.co.nz/papers/tidy-data.html).

---
# Types of Data

.pull-left[
* Numerical (quantitative)
	* Continuous
	* Discrete
]
.pull-right[
* Categorical (qualitative)
	* Regular categorical
	* Ordinal
]
.center[
<img src='images/continuous_discrete.png' height='400' />
]

---
# Data Types in R
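
As a rough illustration, the four data types from the previous slide usually map onto R classes as sketched below. The example vectors are invented for illustration and are not from the course data.

```r
# Hypothetical example vectors, one per data type
height_cm <- c(170.2, 165.5, 180.1)                 # continuous  -> numeric (double)
n_siblings <- c(0L, 2L, 1L)                         # discrete    -> integer
eye_color <- factor(c('brown', 'blue', 'brown'))    # categorical -> factor
rating <- factor(c('low', 'high', 'medium'),
                 levels = c('low', 'medium', 'high'),
                 ordered = TRUE)                    # ordinal     -> ordered factor

class(height_cm)   # "numeric"
class(n_siblings)  # "integer"
class(eye_color)   # "factor"
class(rating)      # "ordered" "factor"

# Ordered factors support comparisons; unordered factors do not
rating < 'high'    # TRUE FALSE TRUE
```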

---
# Data Types / Descriptives / Visualizations

Data Type    |  Descriptive Stats                            | Visualization
-------------|-----------------------------------------------|-------------------|
Continuous   | mean, median, mode, standard deviation, IQR   | histogram, density, box plot
Discrete     | contingency table, proportional table, median | bar plot
Categorical  | contingency table, proportional table         | bar plot
Ordinal      | contingency table, proportional table, median | bar plot
Two quantitative | correlation                               | scatter plot
Two qualitative  | contingency table, chi-squared            | mosaic plot, bar plot
Quantitative & Qualitative | grouped summaries, ANOVA, t-test | box plot
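
A minimal sketch of the base R functions behind a few rows of this table. It uses the built-in `mtcars` data rather than the course data; treating `mpg` and `wt` as continuous and `cyl` and `am` as categorical is an assumption made for illustration.

```r
data(mtcars)

# Continuous: center and spread
mean(mtcars$mpg); median(mtcars$mpg); sd(mtcars$mpg); IQR(mtcars$mpg)

# Categorical: contingency table and proportional table
cyl_tab <- table(mtcars$cyl)
cyl_tab
prop.table(cyl_tab)

# Two quantitative variables: correlation
cor(mtcars$mpg, mtcars$wt)

# Two qualitative variables: two-way contingency table
table(mtcars$cyl, mtcars$am)

# Quantitative by qualitative: grouped summary
tapply(mtcars$mpg, mtcars$cyl, mean)
```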

---
# Variance

.pull-left[
Population Variance:
$$ S^2 = \frac{\Sigma (x_i - \bar{x})^2}{N}$$
Consider a dataset with five values (black points in the figure). For the largest value, the deviance is represented by the blue line ( `$x_i - \bar{x}$` ).

See also:
https://shiny.rit.albany.edu/stat/visualizess/  
https://github.com/jbryer/VisualStats/

]
.pull-right[

<img src="02-Summarizing_Data_files/figure-html/unnamed-chunk-5-1.png" style="display: block; margin: auto;" />
]

---
# Variance (cont.)

.pull-left[
Population Variance:
$$ S^2 = \frac{\Sigma (x_i - \bar{x})^2}{N}$$
In the numerator, we square each of these deviances. We can conceptualize this as a square. Here, we add the deviance in the *y* direction.
]
.pull-right[
<img src="02-Summarizing_Data_files/figure-html/unnamed-chunk-6-1.png" style="display: block; margin: auto;" />
]

---
# Variance (cont.)

.pull-left[
Population Variance:
$$ S^2 = \frac{\Sigma (x_i - \bar{x})^2}{N}$$

We end up with a square.
]
.pull-right[
<img src="02-Summarizing_Data_files/figure-html/unnamed-chunk-7-1.png" style="display: block; margin: auto;" />
]

---
# Variance (cont.)

.pull-left[
Population Variance:
$$ S^2 = \frac{\Sigma (x_i - \bar{x})^2}{N}$$
We can plot the squared deviance for all the data points. That is, each component in the numerator is the area of each of these squares.
]
.pull-right[
<img src="02-Summarizing_Data_files/figure-html/unnamed-chunk-8-1.png" style="display: block; margin: auto;" />
]

---
# Variance (cont.)

.pull-left[
Population Variance:
$$ S^2 = \frac{\Sigma (x_i - \bar{x})^2}{N}$$
The variance is therefore the average of the area of all these squares, here represented by the orange square.
]
.pull-right[

<img src="02-Summarizing_Data_files/figure-html/unnamed-chunk-9-1.png" style="display: block; margin: auto;" />
]
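
The same idea can be sketched numerically. The five values below are made up for illustration (they are not the points in the figure): each squared deviance is the area of one square, and the population variance is the average area.

```r
x <- c(2, 4, 4, 5, 10)            # hypothetical data, N = 5
deviance <- x - mean(x)           # signed distance of each point from the mean
area <- deviance^2                # area of each square
pop_var <- sum(area) / length(x)  # average area

deviance       # -3 -1 -1  0  5
area           #  9  1  1  0 25
pop_var        # 7.2
sqrt(pop_var)  # side length of the average square (population standard deviation)
```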

---
# Population versus Sample Variance

.pull-left[
Typically we want the sample variance. The difference is that we divide by `$n - 1$` rather than `$N$`, which results in a slightly larger area (variance).

Population Variance (yellow):
$$ S^2 = \frac{\Sigma (x_i - \bar{x})^2}{N}$$

Sample Variance (green):
$$ s^2 = \frac{\Sigma (x_i - \bar{x})^2}{n-1}$$

]
.pull-right[

]
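
Note that R's `var()` and `sd()` compute the sample versions (dividing by `$n - 1$`). A small sketch with made-up values showing how the two variances relate:

```r
x <- c(12, 15, 9, 20, 14, 11)         # hypothetical sample
n <- length(x)

sample_var <- var(x)                  # divides by n - 1
pop_var <- sum((x - mean(x))^2) / n   # divides by n

sample_var
pop_var

# The two differ only by the factor (n - 1) / n
all.equal(pop_var, sample_var * (n - 1) / n)  # TRUE
```

As `n` grows, the factor `(n - 1) / n` approaches 1, so the distinction matters most for small samples.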

---
# Robust Statistics

Consider the following data randomly selected from the normal distribution:

.pull-left[

```r
set.seed(41)
x <- rnorm(30, mean = 100, sd = 15)
mean(x); sd(x)
```

```
## [1] 103.1934
```

```
## [1] 16.8945
```

```r
median(x); IQR(x)
```

```
## [1] 103.9947
```

```
## [1] 25.68004
```
]
.pull-right[
<img src="02-Summarizing_Data_files/figure-html/unnamed-chunk-12-1.png" style="display: block; margin: auto;" />
]

---
# Robust Statistics

Let's add an extreme value:

```r
x <- c(x, 1000)
```

---
# Robust Statistics

Median and IQR are more robust to skewness and outliers than mean and SD. Therefore,

* for skewed distributions it is often more helpful to use median and IQR to describe the center and spread

* for symmetric distributions it is often more helpful to use the mean and SD to describe the center and spread
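
A minimal sketch of this comparison, reusing the simulated data from the earlier slide and the added extreme value of 1000. The helper `summarize_x` is a hypothetical name introduced here for illustration.

```r
set.seed(41)
x <- rnorm(30, mean = 100, sd = 15)
x_out <- c(x, 1000)   # same data plus one extreme value

# Hypothetical helper: center and spread, non-robust and robust
summarize_x <- function(v) {
	c(mean = mean(v), sd = sd(v), median = median(v), IQR = IQR(v))
}

# One row without the outlier, one row with it
rbind(original = summarize_x(x), with_outlier = summarize_x(x_out))
```

The mean and SD shift substantially once the outlier is added, while the median and IQR change only a little.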

---
class: font80
# About `legosets` <img src="images/hex/brickset.png" class="title-hex">

To install the `brickset` package:

```r
remotes::install_github('jbryer/brickset')
```

To load the `legosets` dataset:

```r
data('legosets', package = 'brickset')
```

The `legosets` data has 16355 observations of 34 variables.

.code70[

```r
names(legosets)
```

```
##  [1] "setID"                 "name"                  "year"                 
##  [4] "theme"                 "themeGroup"            "subtheme"             
##  [7] "category"              "released"              "pieces"               
## [10] "minifigs"              "bricksetURL"           "rating"               
## [13] "reviewCount"           "packagingType"         "availability"         
## [16] "agerange_min"          "US_retailPrice"        "US_dateFirstAvailable"
## [19] "US_dateLastAvailable"  "UK_retailPrice"        "UK_dateFirstAvailable"
## [22] "UK_dateLastAvailable"  "CA_retailPrice"        "CA_dateFirstAvailable"
## [25] "CA_dateLastAvailable"  "DE_retailPrice"        "DE_dateFirstAvailable"
## [28] "DE_dateLastAvailable"  "height"                "width"                
## [31] "depth"                 "weight"                "thumbnailURL"         
## [34] "imageURL"
```
]

---
# Structure (`str`) <img src="images/hex/brickset.png" class="title-hex">

.code50[

```r
str(legosets)
```

```
## 'data.frame':	16355 obs. of  34 variables:
##  $ setID                : int  7693 7695 7697 7698 25534 7418 7419 6020 22704 7421 ...
##  $ name                 : chr  "Small house set" "Medium house set" "Medium house set" "Large house set" ...
##  $ year                 : int  1970 1970 1970 1970 1970 1970 1970 1970 1970 1970 ...
##  $ theme                : chr  "Minitalia" "Minitalia" "Minitalia" "Minitalia" ...
##  $ themeGroup           : chr  "Vintage" "Vintage" "Vintage" "Vintage" ...
##  $ subtheme             : chr  NA NA NA NA ...
##  $ category             : chr  "Normal" "Normal" "Normal" "Normal" ...
##  $ released             : logi  TRUE TRUE TRUE TRUE TRUE TRUE ...
##  $ pieces               : int  67 109 158 233 NA 1 1 60 65 NA ...
##  $ minifigs             : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ bricksetURL          : chr  "https://brickset.com/sets/1-8" "https://brickset.com/sets/2-8" "https://brickset.com/sets/3-6" "https://brickset.com/sets/4-4" ...
##  $ rating               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ reviewCount          : int  0 0 1 0 0 0 0 1 0 0 ...
##  $ packagingType        : chr  "{Not specified}" "{Not specified}" "{Not specified}" "{Not specified}" ...
##  $ availability         : chr  "{Not specified}" "{Not specified}" "{Not specified}" "{Not specified}" ...
##  $ agerange_min         : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ US_retailPrice       : num  NA NA NA NA NA 1.99 NA NA 4.99 NA ...
##  $ US_dateFirstAvailable: Date, format: NA NA ...
##  $ US_dateLastAvailable : Date, format: NA NA ...
##  $ UK_retailPrice       : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ UK_dateFirstAvailable: Date, format: NA NA ...
##  $ UK_dateLastAvailable : Date, format: NA NA ...
##  $ CA_retailPrice       : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ CA_dateFirstAvailable: Date, format: NA NA ...
##  $ CA_dateLastAvailable : Date, format: NA NA ...
##  $ DE_retailPrice       : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ DE_dateFirstAvailable: Date, format: NA NA ...
##  $ DE_dateLastAvailable : Date, format: NA NA ...
##  $ height               : num  NA NA NA NA NA ...
##  $ width                : num  NA NA NA NA NA ...
##  $ depth                : num  NA NA NA NA NA NA NA NA 5.08 NA ...
##  $ weight               : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ thumbnailURL         : chr  "https://images.brickset.com/sets/small/1-8.jpg" "https://images.brickset.com/sets/small/2-8.jpg" "https://images.brickset.com/sets/small/3-6.jpg" "https://images.brickset.com/sets/small/4-4.jpg" ...
##  $ imageURL             : chr  "https://images.brickset.com/sets/images/1-8.jpg" "https://images.brickset.com/sets/images/2-8.jpg" "https://images.brickset.com/sets/images/3-6.jpg" "https://images.brickset.com/sets/images/4-4.jpg" ...
```

]

---
# RStudio Environment tab can help <img src="images/hex/rstudio.png" class="title-hex">

---
class: hide-logo
# Table View

.font60[

<div id="htmlwidget-4c138408b3df193278db" style="width:100%;height:auto;" class="datatables html-widget"></div>
<script type="application/json" data-for="htmlwidget-4c138408b3df193278db">{"x":{"filter":"none","vertical":false,"fillContainer":false,"data":[["1","2","3","4","5","6","7","8","9","10","11","12","13","14","15","16","17","18","19","20","21","22","23","24","25","26","27","28","29","30","31","32","33","34","35","36","37","38","39","40","41","42","43","44","45","46","47","48","49","50","51","52","53","54","55","56","57","58","59","60","61","62","63","64","65","66","67","68","69","70","71","72","73","74","75","76","77","78","79","80","81","82","83","84","85","86","87","88","89","90","91","92","93","94","95","96","97","98","99","100"],[29512,7650,9937,6565,28932,24433,2741,23821,22536,29301,672,644,4343,410,8934,6106,29871,7222,26113,95,710,30637,7619,4737,30280,7270,26140,1282,10222,25635,27017,3514,3392,25,29481,29460,27565,26300,7563,6034,23716,9950,24996,7878,15465,5626,6869,29248,23913,4447,2018,24125,2049,30009,23488,4368,1040,9348,31073,2339,1775,7847,1423,4216,24320,24548,24784,23870,7000,27857,23195,13309,789,24303,23757,7296,30163,2071,25828,25651,2916,9953,310,2771,6491,29397,2396,23621,2001,3937,27192,24172,1730,28135,25693,29518,1812,23193,28247,27785],["Mummy Queen","Von Nebula","Ninjago Kai ZX Kids' Watch","Aeroplane","Princess Leia Key Chain","Dressing table","Crane and Digger Accessories","Azari and the Magical Bakery","Small Freestyle Bucket","Lady Liberty","Windsurfer","Scout Patrol Ship","Onewa","Maersk Truck and Trailer Unit","Patrol Car","Road Sweeper","Cake","LEGO Castle: Brickmaster","Dog on stage","TIE Fighter Collection","Sun Set","The Flaming Foundry","Mobile Crane","Points","LEGO 8 Stud Aqua Light Blue Storage Brick Drawer","Monster Crab Clash","Monsters Army-Building Set","Busy City","Corporate Alliance Tank Droid","Aaron Fox's Aero-Striker V2","The Hulk","Hook & Haul Wrecker","Police 4 x 4","Santa Fe Cars - Set II","Supernatural Race Car","Dr. 
Wu's Lab: Baby Dinosaurs Breakout","The Joker Manor","Black Widow Key Chain","Space Mini-Figures","Build a Farm","THE LEGO MOVIE Blu ray Combo Pack","8-stud Red Storage Brick","Skybound Plane","Playhouse Set","Bonus/Value Pack","DINO ATTACK Chalk Eggs","Medieval Market Village","Panda","Ferrari F138","Rascus","Umbrella Minifigure","Lady Cyclops","Basic Box 5+","Triceratops","Emmet's Car/Fly Car","Onua Nuva","Fire Engine","Mini Republic Dropship Mini AT-TE Brickmaster Pack (SDCC 2009 exclusive)","BrickJournal Issue 8","Alpha Team Aquatic Mech","2x4 Tan Bricks","Basic Pack","Bob's Workshop","Off-Roader","Stormtrooper Sergeant","Doc Brown Fun Pack","Resistance X-wing Fighter Microfighter","Enter the Serpent","Thomas at Morgan's Mine","Party Time","Krader","Bonus/Value Pack","Nanas","The IMP for the Enterprise 2","Unikitty -- CuteseyKitty","Snowman","Blocks magazine issue 54","Rings","Great LEGO Sets: A Visual History","Harry Potter Key Chain","Nursery","The LEGO Story","Precision Training","Light and Sound Stacker","In-flight Helicopter and Raft","Avengers Speeder Bike Attack","2 Keys for Wind-Up Motor","Santa's Workshop","Building Stories with Nana Bird","Tiny's Lift Cart","Kendo Lloyd","T. rex Tracker","Fire Station","Park Playmat","The LEGO Adventure Book, Vol. 
1: Cars, Castles, Dinosaurs & More!","Galactic Bounty Hunter","Hugo Hog the Tinker","Vulk","LEGO Minifigures - Harry Potter and Fantastic Beasts Series 1 - Sealed box","Firstbourne"],[2019,2010,2012,2004,2019,2015,1998,2015,1996,2019,1993,1992,2001,1985,2012,2007,2020,2009,2016,2004,1998,2020,2010,2005,2020,2010,2016,2000,2013,2016,2017,1989,1992,2002,2020,2020,2017,2016,1979,2007,2014,2012,2016,2010,2004,2006,2009,2019,2014,2004,2003,2015,1998,2020,2014,2002,1995,2009,2009,2002,2000,1984,2001,2003,2015,2016,2016,2015,2009,2018,2014,2007,1997,2010,2014,2009,2019,1980,2015,2006,1997,2012,2002,2002,2008,2020,1987,2014,2001,2004,2017,2015,1982,2018,2012,2019,1982,2014,2018,2018],["Collectable Minifigures","HERO Factory","Gear","Creator","Gear","Friends","Service Packs","Elves","Freestyle","BrickHeadz","Town","Space","Bionicle","Town","City","Duplo","Friends","Books","Friends","Star Wars","Primo","Monkie Kid","Technic","Duplo","Gear","Atlantis","Nexo Knights","Books","Star Wars","Nexo Knights","BrickHeadz","Town","Town","Trains","Hidden Side","Jurassic World","The LEGO Batman Movie","Gear","Space","Duplo","Gear","Gear","Ninjago","Education","Star Wars","Gear","Castle","Promotional","Promotional","Castle","Gear","Collectable Minifigures","Basic","Jurassic World","The LEGO Movie","Bionicle","Duplo","Star Wars","Books","Alpha Team","Bulk Bricks","Dacta","Duplo","Racers","Star Wars","Dimensions","Star Wars","Ninjago","Duplo","Unikitty","Mixels","Duplo","Basic","Serious Play","The LEGO Movie","Creator","Books","Scala","Books","Gear","Belville","Gear","Sports","Explore","City","Marvel Super Heroes","Service Packs","Creator Expert","Creator","Explore","The LEGO Ninjago Movie","Jurassic World","Fabuland","Xtra","Books","Collectable Minifigures","Fabuland","Mixels","Collectable Minifigures","Ninjago"],["Miscellaneous","Constraction","Miscellaneous","Model making","Miscellaneous","Girls","Miscellaneous","Action/Adventure","Basic","Licensed","Modern 
day","Action/Adventure","Constraction","Modern day","Modern day","Pre-school","Girls","Miscellaneous","Girls","Licensed","Pre-school","Action/Adventure","Technical","Pre-school","Miscellaneous","Action/Adventure","Action/Adventure","Miscellaneous","Licensed","Action/Adventure","Licensed","Modern day","Modern day","Modern day","Action/Adventure","Licensed","Licensed","Miscellaneous","Action/Adventure","Pre-school","Miscellaneous","Miscellaneous","Action/Adventure","Educational","Licensed","Miscellaneous","Historical","Miscellaneous","Miscellaneous","Historical","Miscellaneous","Miscellaneous","Basic","Licensed","Licensed","Constraction","Pre-school","Licensed","Miscellaneous","Action/Adventure","Basic","Educational","Pre-school","Racing","Licensed","Miscellaneous","Licensed","Action/Adventure","Pre-school","Licensed","Licensed","Pre-school","Basic","Educational","Licensed","Model making","Miscellaneous","Girls","Miscellaneous","Miscellaneous","Girls","Miscellaneous","Modern day","Pre-school","Modern day","Licensed","Miscellaneous","Model making","Model 
making","Pre-school","Licensed","Licensed","Junior","Miscellaneous","Miscellaneous","Miscellaneous","Junior","Licensed","Miscellaneous","Action/Adventure"],["Normal","Normal","Gear","Normal","Gear","Other","Normal","Normal","Normal","Normal","Normal","Normal","Normal","Normal","Normal","Normal","Other","Book","Other","Normal","Normal","Normal","Normal","Normal","Gear","Normal","Extended","Book","Normal","Normal","Normal","Normal","Normal","Normal","Normal","Normal","Normal","Gear","Normal","Normal","Gear","Gear","Normal","Normal","Collection","Gear","Normal","Other","Normal","Normal","Gear","Normal","Normal","Other","Other","Normal","Normal","Other","Book","Normal","Normal","Normal","Normal","Normal","Extended","Normal","Normal","Normal","Normal","Normal","Normal","Collection","Normal","Normal","Other","Normal","Book","Normal","Book","Gear","Normal","Gear","Normal","Normal","Normal","Normal","Normal","Normal","Normal","Normal","Normal","Normal","Normal","Extended","Book","Normal","Normal","Normal","Collection","Normal"],[3.99,19.99,24.99,null,4.99,null,4,29.99,null,9.99,null,null,3,null,11.99,14.99,null,null,null,70,7,139.99,99.99,6.99,null,6.99,14.99,20,19.99,29.99,9.99,6,4.75,35,29.99,19.99,269.99,4.99,null,29.99,29.99,41.99,null,null,null,null,99.99,null,null,null,null,3.99,null,null,null,8,18.75,50,8.95,null,7,null,6,4,null,14.99,9.99,59.99,29.99,19.99,4.99,null,null,null,null,3.5,null,null,null,null,null,null,null,null,null,19.99,null,69.99,30,null,null,69.99,null,7.99,null,3.99,null,4.99,3.99,69.99],[6,156,null,null,null,22,14,324,null,153,21,30,30,352,97,11,35,null,20,682,3,1427,1289,2,null,68,30,73,271,301,93,52,62,411,244,164,3444,null,18,107,null,null,30,125,363,null,1601,null,41,46,null,6,355,65,40,41,12,94,null,162,50,396,10,26,5,69,87,529,22,214,66,42,13,null,16,44,null,15,102,null,62,null,16,10,115,226,2,883,373,3,7,520,33,11,null,8,4,69,null,882],[1,null,null,null,null,null,null,2,null,null,1,1,null,1,2,1,null,2,null,4,null,7,null,null,null,1,4,null,3
,3,null,1,1,null,3,2,10,null,3,1,null,null,1,null,null,null,8,null,null,null,null,1,2,null,null,null,1,null,null,1,null,6,1,null,1,1,1,5,null,4,null,null,null,null,1,null,null,null,null,null,3,null,1,null,2,3,null,6,3,1,1,3,3,null,null,1,1,null,null,6],[3.7,4.3,0,0,0,0,0,3.8,0,4,0,0,3.8,0,3.7,0,0,3.9,0,4.2,0,0,4.1,0,0,3.7,4.1,0,4,3.9,3.8,4,3.4,4.2,4,4,4.3,0,0,0,0,0,3.5,0,0,0,4.7,0,3.7,3.7,0,3.2,0,0,3.5,4.1,0,0,0,0,0,0,0,0,4,4.1,3.6,4,0,3.6,3.8,0,0,0,0,3.7,0,0,4.4,0,0,0,0,0,0,3.5,0,4.1,0,0,3.2,3.9,0,0,0,4.1,0,3.7,0,4.5]],"container":"<table class=\"display\">\n  <thead>\n    <tr>\n      <th> <\/th>\n      <th>setID<\/th>\n      <th>name<\/th>\n      <th>year<\/th>\n      <th>theme<\/th>\n      <th>themeGroup<\/th>\n      <th>category<\/th>\n      <th>US_retailPrice<\/th>\n      <th>pieces<\/th>\n      <th>minifigs<\/th>\n      <th>rating<\/th>\n    <\/tr>\n  <\/thead>\n<\/table>","options":{"pageLength":10,"columnDefs":[{"className":"dt-right","targets":[1,3,7,8,9,10]},{"orderable":false,"targets":0}],"order":[],"autoWidth":false,"orderClasses":false}},"evals":[],"jsHooks":[]}</script>

]

---
# Data Wrangling Cheat Sheet <img src="images/hex/dplyr.png" class="title-hex">

.center[
<a href='https://github.com/rstudio/cheatsheets/raw/master/data-transformation.pdf' target='_new'><img src='images/data-transformation.png' width='700' /></a>
]

---
# Tidyverse vs Base R <img src="images/hex/tidyverse.png" class="title-hex"><img src="images/hex/pipe.png" class="title-hex">

.center[
<a href='images/R_Syntax_Comparison.jpeg' target='_new'><img src="images/R_Syntax_Comparison.jpeg" width='700' /></a>
]

---
# Pipes `%>%` <img src="images/hex/magrittr.png" class="title-hex">

.font90[
The pipe operator (`%>%`) introduced with the `magrittr` R package allows for the chaining of R operations. It takes the output from the left-hand side and passes it as the first parameter to the function on the right-hand side. In base R, to get the output of a proportional table, you need to first call `table` then `prop.table`. 
]

.pull-left[
You can do this in two steps:

```r
tab_out <- table(legosets$category)
prop.table(tab_out)
```

Or as nested function calls.

```r
prop.table(table(legosets$category))
```
]
.pull-right[
Using the pipe (`%>%`) operator we can chain these calls in what is arguably a more readable format:

```r
table(legosets$category) %>% prop.table()
```
]

<hr />

```
## 
##        Book  Collection    Extended        Gear      Normal       Other 
## 0.028798533 0.032100275 0.025191073 0.143564659 0.713420972 0.054050749 
##      Random 
## 0.002873739
```
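
To make the "first parameter" rule concrete, here is a small sketch using the built-in `mtcars` data (an assumption; any numeric vector would do). It also shows the native pipe `|>`, available in base R 4.1 and later, which behaves the same way for simple calls like these.

```r
library(magrittr)

# x %>% f(y) is evaluated as f(x, y)
nested <- round(mean(mtcars$mpg), 1)
piped <- mtcars$mpg %>% mean() %>% round(1)
identical(nested, piped)   # TRUE

# The same chain with the native pipe (R >= 4.1)
native <- mtcars$mpg |> mean() |> round(1)
identical(nested, native)  # TRUE
```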

---
# Filter <img src="images/hex/tidyverse.png" class="title-hex"><img src="images/hex/dplyr.png" class="title-hex">

.center[
<img src='images/dplyr_filter_sm.png' width='800' />
]

---
# Logical Operators

* `!a` - TRUE if a is FALSE
* `a == b` - TRUE if a and b are equal
* `a != b` - TRUE if a and b are not equal
* `a > b` - TRUE if a is strictly larger than b
* `a >= b` - TRUE if a is larger than or equal to b
* `a < b` - TRUE if a is strictly smaller than b
* `a <= b` - TRUE if a is smaller than or equal to b
* `a %in% b` - TRUE if a is in b where b is a vector

```r
which( letters %in% c('a','e','i','o','u') )
```

```
## [1]  1  5  9 15 21
```
* `a | b` - TRUE if a *or* b is TRUE
* `a & b` - TRUE if both a *and* b are TRUE
* `isTRUE(a)` - TRUE if a is a single (length-one) TRUE value
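
These operators are vectorized, which is what makes them useful for filtering. A brief sketch with made-up vectors, including how `NA` propagates:

```r
a <- c(1, 5, 10, NA)
b <- 5

a > b              # FALSE FALSE  TRUE    NA
a >= b             # FALSE  TRUE  TRUE    NA
a %in% c(5, 10)    # FALSE  TRUE  TRUE FALSE  (%in% never returns NA)
a > 2 & a < 8      # FALSE  TRUE FALSE    NA
a < 2 | a > 8      #  TRUE FALSE  TRUE    NA

# isTRUE() returns TRUE only for a single, non-missing TRUE
isTRUE(TRUE)           # TRUE
isTRUE(c(TRUE, TRUE))  # FALSE
isTRUE(NA)             # FALSE
```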

---
# Filter <img src="images/hex/tidyverse.png" class="title-hex"><img src="images/hex/dplyr.png" class="title-hex">

### `dplyr`

```r
mylego <- legosets %>% filter(themeGroup == 'Educational' & year > 2015)
```

### Base R

```r
mylego <- legosets[legosets$themeGroup == 'Educational' & legosets$year > 2015,]
```

<hr />

```r
nrow(mylego)
```

```
## [1] 61
```
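
One practical difference between the two approaches is how missing values are handled: `filter()` drops rows where the condition evaluates to `NA`, whereas logical indexing with `[` returns a row of `NA`s for each of them. A small sketch with a made-up data frame (`toy` is hypothetical, not part of `legosets`):

```r
library(dplyr)

# Hypothetical toy data with missing values
toy <- data.frame(group = c('a', NA, 'a', 'b'),
                  value = c(1, 2, NA, 4))

via_dplyr <- toy %>% filter(group == 'a' & value > 0)
via_base <- toy[toy$group == 'a' & toy$value > 0, ]
via_which <- toy[which(toy$group == 'a' & toy$value > 0), ]

nrow(via_dplyr)  # 1
nrow(via_base)   # 3 (one real match plus two all-NA rows)
nrow(via_which)  # 1
```

Wrapping the condition in `which()` is one way to make the base R version drop the `NA` cases as `filter()` does.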

---
# Select <img src="images/hex/tidyverse.png" class="title-hex"><img src="images/hex/dplyr.png" class="title-hex">

### `dplyr`

```r
mylego <- mylego %>% select(setID, pieces, theme, availability, US_retailPrice, minifigs)
```

### Base R

```r
mylego <- mylego[,c('setID', 'pieces', 'theme', 'availability', 'US_retailPrice', 'minifigs')]
```

<hr />

```r
head(mylego, n = 4)
```

```
##   setID pieces     theme    availability US_retailPrice minifigs
## 1 26803    103 Education {Not specified}             NA        6
## 2 26689    142 Education {Not specified}             NA        4
## 3 26804     98 Education {Not specified}             NA        6
## 4 26277    188 Education     Educational          78.95       NA
```

---
# Relocate <img src="images/hex/tidyverse.png" class="title-hex"><img src="images/hex/dplyr.png" class="title-hex">

.center[
<img src='images/dplyr_relocate.png' width='800' />
]

---
# Relocate <img src="images/hex/tidyverse.png" class="title-hex"><img src="images/hex/dplyr.png" class="title-hex">

### `dplyr`

```r
mylego %>% relocate(where(is.numeric), .after = where(is.character)) %>% head(n = 3)
```

```
##       theme    availability setID pieces US_retailPrice minifigs
## 1 Education {Not specified} 26803    103             NA        6
## 2 Education {Not specified} 26689    142             NA        4
## 3 Education {Not specified} 26804     98             NA        6
```

### Base R

```r
mylego2 <- mylego[,c('theme', 'availability', 'setID', 'pieces', 'US_retailPrice', 'minifigs')]
head(mylego2, n = 3)
```

---
# Rename <img src="images/hex/tidyverse.png" class="title-hex"><img src="images/hex/dplyr.png" class="title-hex">

.center[
<img src='images/rename_sm.jpg' width='1000' />
]

---
# Rename <img src="images/hex/tidyverse.png" class="title-hex"><img src="images/hex/dplyr.png" class="title-hex">

### `dplyr`

```r
mylego %>% dplyr::rename(USD = US_retailPrice) %>% head(n = 3)
```

```
##   setID pieces     theme    availability USD minifigs
## 1 26803    103 Education {Not specified}  NA        6
## 2 26689    142 Education {Not specified}  NA        4
## 3 26804     98 Education {Not specified}  NA        6
```

### Base R

```r
names(mylego2)[5] <- 'USD'
head(mylego2, n = 3)
```

```
##       theme    availability setID pieces USD minifigs
## 1 Education {Not specified} 26803    103  NA        6
## 2 Education {Not specified} 26689    142  NA        4
## 3 Education {Not specified} 26804     98  NA        6
```

---
# Mutate <img src="images/hex/tidyverse.png" class="title-hex"><img src="images/hex/dplyr.png" class="title-hex">

.center[
<img src='images/dplyr_mutate.png' width='700' />
]

---
# Mutate <img src="images/hex/tidyverse.png" class="title-hex"><img src="images/hex/dplyr.png" class="title-hex">

### `dplyr`

```r
mylego %>% filter(!is.na(pieces) & !is.na(US_retailPrice)) %>% 
	mutate(Price_per_piece = US_retailPrice / pieces) %>% head(n = 3)
```

```
##   setID pieces     theme availability US_retailPrice minifigs Price_per_piece
## 1 26277    188 Education  Educational          78.95       NA       0.4199468
## 2 25949    280 Education  Educational         224.95       NA       0.8033929
## 3 25954      1 Education  Educational          14.95       NA      14.9500000
```

### Base R

```r
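mylego2 <- mylego[!is.na(mylego$pieces) & !is.na(mylego$US_retailPrice),]
mylego2$Price_per_piece <- mylego2$US_retailPrice / mylego2$pieces
rownames(mylego2) <- NULL # reset row names so the output matches the dplyr version
head(mylego2, n = 3)
```

```
##   setID pieces     theme availability US_retailPrice minifigs Price_per_piece
## 1 26277    188 Education  Educational          78.95       NA       0.4199468
## 2 25949    280 Education  Educational         224.95       NA       0.8033929
## 3 25954      1 Education  Educational          14.95       NA      14.9500000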
```

---
# Group By and Summarize <img src="images/hex/tidyverse.png" class="title-hex"><img src="images/hex/dplyr.png" class="title-hex">

.code80[

```r
legosets %>%
	group_by(themeGroup) %>%
	summarize(
		mean_price = mean(US_retailPrice, na.rm = TRUE),
		sd_price = sd(US_retailPrice, na.rm = TRUE),
		median_price = median(US_retailPrice, na.rm = TRUE),
		n = n(),
		missing = sum(is.na(US_retailPrice))
	)
```

```
## # A tibble: 15 × 6
##    themeGroup       mean_price sd_price median_price     n missing
##    <chr>                 <dbl>    <dbl>        <dbl> <int>   <int>
##  1 Action/Adventure      31.3     29.9         20.0   1280     462
##  2 Basic                 13.1     12.8          7.99   843     473
##  3 Constraction          15.1     14.0          9.99   501     125
##  4 Educational           89.0    107.          59.7    452     294
##  5 Girls                 23.4     22.6         15.0    677     225
##  6 Historical            25.5     27.7         15.0    473     125
##  7 Junior                18.6     13.2         17.8    228      93
##  8 Licensed              42.9     58.3         25.0   2060     467
##  9 Miscellaneous         14.3     20.8          6.99  4925    2117
## 10 Model making          52.8     65.1         30.0    582     166
## 11 Modern day            31.2     33.7         20.0   1723     763
## 12 Pre-school            23.8     19.4         20.0   1487     699
## 13 Racing                24.8     30.2         10      270      59
## 14 Technical             60.8     68.1         40.0    550     137
## 15 Vintage                9.71     9.56         7.50   304     264
```
]
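
For comparison with the other slides, a rough base R counterpart can be built with `aggregate()`. This sketch uses the built-in `mtcars` data as a stand-in (an assumption, to keep it self-contained), summarizing `mpg` within each level of `cyl`.

```r
data(mtcars)

# One row per group, with several summary statistics of mpg.
# Note: the formula interface drops rows with missing values before summarizing.
grouped <- aggregate(mpg ~ cyl, data = mtcars,
                     FUN = function(v) c(mean = mean(v), sd = sd(v),
                                         median = median(v), n = length(v)))

# aggregate() stores the statistics as a matrix column; flatten it into
# ordinary columns named mpg.mean, mpg.sd, mpg.median, and mpg.n
grouped <- do.call(data.frame, grouped)
grouped
```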

---
# Describe and Describe By

```r
library(psych)
describe(legosets$US_retailPrice)
```

```
##    vars    n  mean sd median trimmed   mad min    max  range skew kurtosis   se
## X1    1 9886 28.52 42  14.99   20.14 14.83   0 799.99 799.99 5.62    58.91 0.42
```

```r
describeBy(legosets$US_retailPrice, group = legosets$availability, mat = TRUE, skew = FALSE)
```

```
##      item                group1 vars    n      mean        sd   min    max
## X11     1       {Not specified}    1 3197  24.24484 36.282072  0.60 789.99
## X12     2           Educational    1    9 140.95000 86.358265 14.95 244.95
## X13     3        LEGO exclusive    1 1066  28.79797 70.954538  0.00 799.99
## X14     4    LEGOLAND exclusive    1    7  12.70429  6.447591  4.99  19.99
## X15     5              Not sold    1    1  12.99000        NA 12.99  12.99
## X16     6           Promotional    1  167   9.19485 23.667555  0.00 249.99
## X17     7 Promotional (Airline)    1   11  15.79455  6.614819  5.00  28.00
## X18     8                Retail    1 4824  29.82030 33.270049  1.95 399.99
## X19     9      Retail - limited    1  600  44.64837 57.391438  0.40 379.99
## X110   10               Unknown    1    4   2.24750  1.253671  1.00   3.99
##       range         se
## X11  789.39  0.6416833
## X12  230.00 28.7860885
## X13  799.99  2.1732094
## X14   15.00  2.4369603
## X15    0.00         NA
## X16  249.99  1.8314504
## X17   23.00  1.9944429
## X18  398.04  0.4790158
## X19  379.59  2.3429956
## X110   2.99  0.6268356
```

---
class: middle
# Grammar of Graphics

.center[
<img src="images/ggplot2_masterpiece.png" height="550" />
]

---
# Data Visualizations with ggplot2 <img src="images/hex/ggplot2.png" class="title-hex">

* `ggplot2` is an R package that provides an alternative framework based upon Wilkinson’s (2005) Grammar of Graphics.

* `ggplot2` is, in general, more flexible for creating "prettier" and more complex plots.

* It works by creating layers of different types of objects/geometries (e.g. bars, points, lines, polygons).

* `ggplot2` has at least three ways of creating plots:
     1. `qplot`
     2. `ggplot(...) + geom_XXX(...) + ...`
     3. `ggplot(...) + layer(...)`

* We will focus only on the second.

---
# Parts of a `ggplot2` Statement <img src="images/hex/ggplot2.png" class="title-hex">

* Data  
`ggplot(myDataFrame, aes(x=x, y=y))`

* Layers  
`geom_point()`, `geom_histogram()`

* Facets  
`facet_wrap(~ cut)`, `facet_grid(~ cut)`

* Scales  
`scale_y_log10()`

* Other options  
`ggtitle('my title')`, `ylim(c(0, 10000))`, `xlab('x-axis label')`
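
Putting the parts together, here is a sketch that combines one of each. It assumes the `diamonds` dataset that ships with `ggplot2` (which has the `cut` variable used in the facet examples above); the title and axis label are placeholders.

```r
library(ggplot2)

p <- ggplot(diamonds, aes(x = carat, y = price)) +  # data and aesthetics
	geom_point(alpha = 0.1) +                       # layer
	facet_wrap(~ cut) +                             # facets
	scale_y_log10() +                               # scale
	ggtitle('Diamond price by carat and cut') +     # other options
	xlab('Carat')
p
```

Because each part is added with `+`, any line after the first can be removed or swapped without changing the rest of the statement.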

---
# Lots of geoms <img src="images/hex/ggplot2.png" class="title-hex">

```r
ls('package:ggplot2')[grep('^geom_', ls('package:ggplot2'))]
```

```
##  [1] "geom_abline"            "geom_area"              "geom_bar"              
##  [4] "geom_bin_2d"            "geom_bin2d"             "geom_blank"            
##  [7] "geom_boxplot"           "geom_col"               "geom_contour"          
## [10] "geom_contour_filled"    "geom_count"             "geom_crossbar"         
## [13] "geom_curve"             "geom_density"           "geom_density_2d"       
## [16] "geom_density_2d_filled" "geom_density2d"         "geom_density2d_filled" 
## [19] "geom_dotplot"           "geom_errorbar"          "geom_errorbarh"        
## [22] "geom_freqpoly"          "geom_function"          "geom_hex"              
## [25] "geom_histogram"         "geom_hline"             "geom_jitter"           
## [28] "geom_label"             "geom_line"              "geom_linerange"        
## [31] "geom_map"               "geom_path"              "geom_point"            
## [34] "geom_pointrange"        "geom_polygon"           "geom_qq"               
## [37] "geom_qq_line"           "geom_quantile"          "geom_raster"           
## [40] "geom_rect"              "geom_ribbon"            "geom_rug"              
## [43] "geom_segment"           "geom_sf"                "geom_sf_label"         
## [46] "geom_sf_text"           "geom_smooth"            "geom_spoke"            
## [49] "geom_step"              "geom_text"              "geom_tile"             
## [52] "geom_violin"            "geom_vline"
```

---
# Data Visualization Cheat Sheet <img src="images/hex/ggplot2.png" class="title-hex">

.center[
<a href='https://github.com/rstudio/cheatsheets/raw/master/data-visualization-2.1.pdf'><img src='images/data-visualization-2.1.png' width='700' /></a>
]

---
# Scatterplot  <img src="images/hex/ggplot2.png" class="title-hex">

```r
ggplot(legosets, aes(x=pieces, y=US_retailPrice)) + geom_point()
```

---
# Scatterplot (cont.)  <img src="images/hex/ggplot2.png" class="title-hex">

```r
ggplot(legosets, aes(x=pieces, y=US_retailPrice, color=availability)) + geom_point()
```

---
# Scatterplot (cont.)  <img src="images/hex/ggplot2.png" class="title-hex">

```r
ggplot(legosets, aes(x=pieces, y=US_retailPrice, size=minifigs, color=availability)) + geom_point()
```

---
# Scatterplot (cont.)  <img src="images/hex/ggplot2.png" class="title-hex">

```r
ggplot(legosets, aes(x=pieces, y=US_retailPrice, size=minifigs)) + geom_point() + facet_wrap(~ availability)
```

---
# Boxplots  <img src="images/hex/ggplot2.png" class="title-hex">

```r
ggplot(legosets, aes(x='Lego', y=US_retailPrice)) + geom_boxplot()
```

---
# Boxplots (cont.)  <img src="images/hex/ggplot2.png" class="title-hex">

```r
ggplot(legosets, aes(x=availability, y=US_retailPrice)) + geom_boxplot()
```

---
# Boxplots (cont.)  <img src="images/hex/ggplot2.png" class="title-hex">

```r
ggplot(legosets, aes(x=availability, y=US_retailPrice)) + geom_boxplot() + coord_flip()
```

---
# Histograms <img src="images/hex/ggplot2.png" class="title-hex">

```r
ggplot(legosets, aes(x = US_retailPrice)) + geom_histogram()
```

---
# Histograms (cont.) <img src="images/hex/ggplot2.png" class="title-hex">

```r
ggplot(legosets, aes(x = US_retailPrice)) + geom_histogram() + scale_x_log10()
```

---
# Histograms (cont.) <img src="images/hex/ggplot2.png" class="title-hex">

```r
ggplot(legosets, aes(x = US_retailPrice)) + geom_histogram() + facet_wrap(~ availability)
```

---
# Density Plots <img src="images/hex/ggplot2.png" class="title-hex">

```r
ggplot(legosets, aes(x = US_retailPrice, color = availability)) + geom_density()
```

---
# `ggplot2` aesthetics <img src="images/hex/ggplot2.png" class="title-hex">

.center[
<a href='images/ggplot_aesthetics_cheatsheet.png' target='_new'> <img src='images/ggplot_aesthetics_cheatsheet.png' height='550' /></a>
]

---
# Likert Scales <img src="images/hex/likert.png" class="title-hex">

Likert scales are a type of questionnaire where respondents are asked to rate items on scales usually ranging from four to seven levels (e.g. strongly disagree to strongly agree).

```r
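library(likert)
library(reshape)  # provides rename(), used below to relabel the columns
data(pisaitems)
# Keep only the columns whose names start with 'ST24Q'
items24 <- pisaitems[, substr(names(pisaitems), 1, 5) == 'ST24Q']
# Replace the item codes with the question text so tables and plots are readable
items24 <- rename(items24, c(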
			ST24Q01="I read only if I have to.",
			ST24Q02="Reading is one of my favorite hobbies.",
			ST24Q03="I like talking about books with other people.",
			ST24Q04="I find it hard to finish books.",
			ST24Q05="I feel happy if I receive a book as a present.",
			ST24Q06="For me, reading is a waste of time.",
			ST24Q07="I enjoy going to a bookstore or a library.",
			ST24Q08="I read only to get information that I need.",
			ST24Q09="I cannot sit still and read for more than a few minutes.",
			ST24Q10="I like to express my opinions about books I have read.",
			ST24Q11="I like to exchange books with my friends."))
```

---
# `likert` R Package <img src="images/hex/likert.png" class="title-hex">

```r
l24 <- likert(items24)
summary(l24)
```

```
##                                                        Item      low neutral
## 10   I like to express my opinions about books I have read. 41.07516       0
## 5            I feel happy if I receive a book as a present. 46.93475       0
## 8               I read only to get information that I need. 50.39874       0
## 7                I enjoy going to a bookstore or a library. 51.21231       0
## 3             I like talking about books with other people. 54.99129       0
## 11                I like to exchange books with my friends. 55.54115       0
## 2                    Reading is one of my favorite hobbies. 56.64470       0
## 1                                 I read only if I have to. 58.72868       0
## 4                           I find it hard to finish books. 65.35125       0
## 9  I cannot sit still and read for more than a few minutes. 76.24524       0
## 6                       For me, reading is a waste of time. 82.88729       0
##        high     mean        sd
## 10 58.92484 2.604913 0.9009968
## 5  53.06525 2.466751 0.9446590
## 8  49.60126 2.484616 0.9089688
## 7  48.78769 2.428508 0.9164136
## 3  45.00871 2.328049 0.9090326
## 11 44.45885 2.343193 0.9609234
## 2  43.35530 2.344530 0.9277495
## 1  41.27132 2.291811 0.9369023
## 4  34.64875 2.178299 0.8991628
## 9  23.75476 1.974736 0.8793028
## 6  17.11271 1.810093 0.8611554
```

---
# `likert` Plots  <img src="images/hex/likert.png" class="title-hex">

```r
plot(l24)
```

---
# `likert` Plots  <img src="images/hex/likert.png" class="title-hex">

```r
plot(l24, type='heat')
```

---
# `likert` Plots  <img src="images/hex/likert.png" class="title-hex">

```r
plot(l24, type='density')
```

---
class: font90
# Dual Scales <img src="images/hex/shiny.png" class="title-hex">

Some problems<sup>1</sup>:

* The designer has to make choices about scales, and this can have a big impact on the viewer
* "Cross-over points" where one series crosses another are results of the design choices, not intrinsic to the data, yet viewers (particularly unsophisticated viewers) may read meaning into them
* They make it easier to lazily associate correlation with causation, not taking into account autocorrelation and other time-series issues
* Because of the issues above, in malicious hands they make it possible to deliberately mislead

This example looks at the relationship between the NZ dollar exchange rate and the trade-weighted index.

```r
DATA606::shiny_demo('DualScales', package='DATA606')
```

My advice:

* Avoid using them. You can usually do better with other plot types.
* When you must use them (out of necessity or because you are compelled to), rescale the series first, for example with z-scores (we'll discuss these in a few weeks).

.font50[
<sup>1</sup> http://blog.revolutionanalytics.com/2016/08/dual-axis-time-series.html  
<sup>2</sup> http://ellisp.github.io/blog/2016/08/18/dualaxes
]
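
---
# Dual Scales (cont.)

A minimal sketch of the rescaling idea, using made-up series rather than the NZ data from the demo. Converting each series to z-scores puts both on one axis measured in standard deviations. The `demo` data frame, its column names, and the `z_score()` helper are hypothetical.

```r
library(ggplot2)

# Made-up illustrative series on very different scales (not the NZ data)
set.seed(606)
demo <- data.frame(
	month         = 1:24,
	exchange_rate = 0.70 + cumsum(rnorm(24, sd = 0.01)),
	trade_index   = 75 + cumsum(rnorm(24, sd = 1))
)

# z-score: subtract the mean, then divide by the standard deviation
z_score <- function(x) (x - mean(x)) / sd(x)

# Stack the two rescaled series into one long data frame
demo_z <- data.frame(
	month  = rep(demo$month, times = 2),
	series = rep(c('exchange_rate', 'trade_index'), each = nrow(demo)),
	z      = c(z_score(demo$exchange_rate), z_score(demo$trade_index))
)

# Both series now share a single y-axis
ggplot(demo_z, aes(x = month, y = z, color = series)) + geom_line()
```

Each rescaled series has mean 0 and standard deviation 1, so no second axis is needed.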

---
# Pie Charts

There is only one pie chart in *OpenIntro Statistics* (Diez, Barr, & Çetinkaya-Rundel, 2015, p. 48). Consider the following three pie charts, which represent preferences for five different colors. Is there a difference between the three pie charts? This is probably difficult to answer.

---
# Pie Charts

Source: [https://en.wikipedia.org/wiki/Pie_chart](https://en.wikipedia.org/wiki/Pie_chart).

---
class: middle
# Just say NO to pie charts!

.font150[
"There is no data that can be displayed in a pie chart that cannot better be displayed in some other type of chart"]
.right[.font130[John Tukey]]
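
---
# Pie Charts (cont.)

One common alternative is a bar chart, where small differences show up as differences in bar height. A minimal sketch with made-up percentages for five colors; the `color_pref` data frame is hypothetical and is not the data behind the Wikipedia figure.

```r
library(ggplot2)

# Hypothetical preference shares for five colors (illustrative only)
color_pref <- data.frame(
	color   = c('Red', 'Blue', 'Green', 'Yellow', 'Purple'),
	percent = c(17, 18, 20, 22, 23)
)

# Pie chart: a single stacked bar drawn in polar coordinates
ggplot(color_pref, aes(x = '', y = percent, fill = color)) +
	geom_col(width = 1) + coord_polar(theta = 'y')

# Bar chart of the same data, with bars ordered by percent
ggplot(color_pref, aes(x = reorder(color, percent), y = percent)) +
	geom_col() + xlab('color')
```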

---
# Additional Resources

For data wrangling:

* `dplyr` website: https://dplyr.tidyverse.org
* R for Data Science book: https://r4ds.had.co.nz/wrangle-intro.html
* Wrangling penguins tutorial: https://allisonhorst.shinyapps.io/dplyr-learnr/#section-welcome
* Data transformation cheat sheet: https://github.com/rstudio/cheatsheets/raw/master/data-transformation.pdf

For data visualization:

* `ggplot2` website: https://ggplot2.tidyverse.org
* R for Data Science book: https://r4ds.had.co.nz/data-visualisation.html
* R Graphics Cookbook: https://r-graphics.org
* Data visualization cheat sheet: https://github.com/rstudio/cheatsheets/raw/master/data-visualization-2.1.pdf

---
# One Minute Paper

Complete the one minute paper: https://forms.gle/qxRnsCyydx1nf8sXA

1. What was the most important thing you learned during this class?

2. What important question remains unanswered for you?