Image source: @allison_horst
What was the most important thing you learned during this class?
What important question remains unanswered for you?
Data Type | Descriptive Stats | Visualization |
---|---|---|
Continuous | mean, median, mode, standard deviation, IQR | histogram, density, box plot |
Discrete | contingency table, proportional table, median | bar plot |
Categorical | contingency table, proportional table | bar plot |
Ordinal | contingency table, proportional table, median | bar plot |
Two quantitative | correlation | scatter plot |
Two qualitative | contingency table, chi-squared | mosaic plot, bar plot |
Quantitative & Qualitative | grouped summaries, ANOVA, t-test | box plot |
Population Variance: $S^2 = \frac{\sum (x_i - \bar{x})^2}{N}$
Consider a dataset with five values (black points in the figure). For the largest value, the deviation from the mean is represented by the blue line ($x_i - \bar{x}$).
See also:
https://shiny.rit.albany.edu/stat/visualizess/
https://github.com/jbryer/VisualStats/
Population Variance: $S^2 = \frac{\sum (x_i - \bar{x})^2}{N}$
In the numerator, we square each of these deviations. We can conceptualize this as a square. Here, we add the deviation in the y direction.
Population Variance: $S^2 = \frac{\sum (x_i - \bar{x})^2}{N}$
We end up with a square.
Population Variance: $S^2 = \frac{\sum (x_i - \bar{x})^2}{N}$
We can plot the squared deviation for all the data points. That is, each component in the numerator is the area of one of these squares.
Population Variance: $S^2 = \frac{\sum (x_i - \bar{x})^2}{N}$
The variance is therefore the average of the areas of all these squares, here represented by the orange square.
Typically we want the sample variance. The difference is that we divide by n−1 to calculate the sample variance. This results in a slightly larger area (variance) than if we divide by n.
Population Variance (yellow): $S^2 = \frac{\sum (x_i - \bar{x})^2}{N}$
Sample Variance (green): $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$
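The two formulas can be checked by hand in R. This is a minimal sketch with five made-up values (they are not the points plotted in the figure), using only base R; `var()` uses the n−1 denominator.

```r
# Five hypothetical values (not the ones shown in the figure)
x <- c(2, 4, 4, 7, 8)
n <- length(x)

# Squared deviations from the mean: the "areas of the squares"
sq_dev <- (x - mean(x))^2

pop_var    <- sum(sq_dev) / n        # population variance: divide by N
sample_var <- sum(sq_dev) / (n - 1)  # sample variance: divide by n - 1

pop_var
sample_var

# Base R's var() computes the sample variance
all.equal(sample_var, var(x))
```

Because the numerator is the same in both cases, the sample variance is always the larger of the two.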
Consider the following data randomly selected from the normal distribution:
set.seed(41)
x <- rnorm(30, mean = 100, sd = 15)
mean(x); sd(x)
## [1] 103.1934
## [1] 16.8945
median(x); IQR(x)
## [1] 103.9947
## [1] 25.68004
Let's add an extreme value:
x <- c(x, 1000)
Median and IQR are more robust to skewness and outliers than mean and SD. Therefore:
- for skewed distributions, it is often more helpful to use the median and IQR to describe the center and spread
- for symmetric distributions, it is often more helpful to use the mean and SD to describe the center and spread
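One way to see this robustness is to compute all four statistics before and after adding the extreme value. The sketch below reuses the seed and the value 1000 from the example above; the names `x_out` and `compare` are introduced here for illustration, and the exact numbers may vary with the R version.

```r
set.seed(41)
x <- rnorm(30, mean = 100, sd = 15)
x_out <- c(x, 1000)  # same data plus one extreme value

# One row per dataset, one column per statistic
compare <- rbind(
  original     = c(mean = mean(x), sd = sd(x),
                   median = median(x), IQR = IQR(x)),
  with_outlier = c(mean = mean(x_out), sd = sd(x_out),
                   median = median(x_out), IQR = IQR(x_out))
)
round(compare, 2)

# Absolute change in each statistic caused by the single extreme value
round(abs(compare["with_outlier", ] - compare["original", ]), 2)
```

The mean and SD shift far more than the median and IQR, which is the sense in which the latter are robust.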
legosets
To install the brickset package:
remotes::install_github('jbryer/brickset')
To load the legosets dataset:
data('legosets', package = 'brickset')
The legosets data has 16355 observations of 34 variables.
names(legosets)
##  [1] "setID"                 "name"                  "year"
##  [4] "theme"                 "themeGroup"            "subtheme"
##  [7] "category"              "released"              "pieces"
## [10] "minifigs"              "bricksetURL"           "rating"
## [13] "reviewCount"           "packagingType"         "availability"
## [16] "agerange_min"          "US_retailPrice"        "US_dateFirstAvailable"
## [19] "US_dateLastAvailable"  "UK_retailPrice"        "UK_dateFirstAvailable"
## [22] "UK_dateLastAvailable"  "CA_retailPrice"        "CA_dateFirstAvailable"
## [25] "CA_dateLastAvailable"  "DE_retailPrice"        "DE_dateFirstAvailable"
## [28] "DE_dateLastAvailable"  "height"                "width"
## [31] "depth"                 "weight"                "thumbnailURL"
## [34] "imageURL"
str(legosets)
## 'data.frame': 16355 obs. of 34 variables:
##  $ setID                : int 7693 7695 7697 7698 25534 7418 7419 6020 22704 7421 ...
##  $ name                 : chr "Small house set" "Medium house set" "Medium house set" "Large house set" ...
##  $ year                 : int 1970 1970 1970 1970 1970 1970 1970 1970 1970 1970 ...
##  $ theme                : chr "Minitalia" "Minitalia" "Minitalia" "Minitalia" ...
##  $ themeGroup           : chr "Vintage" "Vintage" "Vintage" "Vintage" ...
##  $ subtheme             : chr NA NA NA NA ...
##  $ category             : chr "Normal" "Normal" "Normal" "Normal" ...
##  $ released             : logi TRUE TRUE TRUE TRUE TRUE TRUE ...
##  $ pieces               : int 67 109 158 233 NA 1 1 60 65 NA ...
##  $ minifigs             : int NA NA NA NA NA NA NA NA NA NA ...
##  $ bricksetURL          : chr "https://brickset.com/sets/1-8" "https://brickset.com/sets/2-8" "https://brickset.com/sets/3-6" "https://brickset.com/sets/4-4" ...
##  $ rating               : num 0 0 0 0 0 0 0 0 0 0 ...
##  $ reviewCount          : int 0 0 1 0 0 0 0 1 0 0 ...
##  $ packagingType        : chr "{Not specified}" "{Not specified}" "{Not specified}" "{Not specified}" ...
##  $ availability         : chr "{Not specified}" "{Not specified}" "{Not specified}" "{Not specified}" ...
##  $ agerange_min         : int NA NA NA NA NA NA NA NA NA NA ...
##  $ US_retailPrice       : num NA NA NA NA NA 1.99 NA NA 4.99 NA ...
##  $ US_dateFirstAvailable: Date, format: NA NA ...
##  $ US_dateLastAvailable : Date, format: NA NA ...
##  $ UK_retailPrice       : num NA NA NA NA NA NA NA NA NA NA ...
##  $ UK_dateFirstAvailable: Date, format: NA NA ...
##  $ UK_dateLastAvailable : Date, format: NA NA ...
##  $ CA_retailPrice       : num NA NA NA NA NA NA NA NA NA NA ...
##  $ CA_dateFirstAvailable: Date, format: NA NA ...
##  $ CA_dateLastAvailable : Date, format: NA NA ...
##  $ DE_retailPrice       : num NA NA NA NA NA NA NA NA NA NA ...
##  $ DE_dateFirstAvailable: Date, format: NA NA ...
##  $ DE_dateLastAvailable : Date, format: NA NA ...
##  $ height               : num NA NA NA NA NA ...
##  $ width                : num NA NA NA NA NA ...
##  $ depth                : num NA NA NA NA NA NA NA NA 5.08 NA ...
##  $ weight               : num NA NA NA NA NA NA NA NA NA NA ...
##  $ thumbnailURL         : chr "https://images.brickset.com/sets/small/1-8.jpg" "https://images.brickset.com/sets/small/2-8.jpg" "https://images.brickset.com/sets/small/3-6.jpg" "https://images.brickset.com/sets/small/4-4.jpg" ...
##  $ imageURL             : chr "https://images.brickset.com/sets/images/1-8.jpg" "https://images.brickset.com/sets/images/2-8.jpg" "https://images.brickset.com/sets/images/3-6.jpg" "https://images.brickset.com/sets/images/4-4.jpg" ...
 | setID | name | year | theme | themeGroup | category | US_retailPrice | pieces | minifigs | rating |
---|---|---|---|---|---|---|---|---|---|---|
1 | 29512 | Mummy Queen | 2019 | Collectable Minifigures | Miscellaneous | Normal | 3.99 | 6 | 1 | 3.7 |
2 | 7650 | Von Nebula | 2010 | HERO Factory | Constraction | Normal | 19.99 | 156 | 4.3 | |
3 | 9937 | Ninjago Kai ZX Kids' Watch | 2012 | Gear | Miscellaneous | Gear | 24.99 | 0 | ||
4 | 6565 | Aeroplane | 2004 | Creator | Model making | Normal | 0 | |||
5 | 28932 | Princess Leia Key Chain | 2019 | Gear | Miscellaneous | Gear | 4.99 | 0 | ||
6 | 24433 | Dressing table | 2015 | Friends | Girls | Other | 22 | 0 | ||
7 | 2741 | Crane and Digger Accessories | 1998 | Service Packs | Miscellaneous | Normal | 4 | 14 | 0 | |
8 | 23821 | Azari and the Magical Bakery | 2015 | Elves | Action/Adventure | Normal | 29.99 | 324 | 2 | 3.8 |
9 | 22536 | Small Freestyle Bucket | 1996 | Freestyle | Basic | Normal | 0 | |||
10 | 29301 | Lady Liberty | 2019 | BrickHeadz | Licensed | Normal | 9.99 | 153 | 4 |
%>%
The pipe operator (%>%), introduced with the magrittr R package, allows for the chaining of R operations. It takes the output from the left-hand side and passes it as the first parameter to the function on the right-hand side. In base R, to get the output of a proportional table, you need to first call table, then prop.table.
You can do this in two steps:
tab_out <- table(legosets$category)
prop.table(tab_out)
Or as nested function calls.
prop.table(table(legosets$category))
Using the pipe (%>%) operator, we can chain these calls in what is arguably a more readable format:
table(legosets$category) %>% prop.table()
##
##        Book  Collection    Extended        Gear      Normal       Other
## 0.028798533 0.032100275 0.025191073 0.143564659 0.713420972 0.054050749
##      Random
## 0.002873739
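The equivalence between the nested and the piped form can be checked on a small vector. The sketch below is an illustration under two assumptions that are not part of the slides: it uses made-up category labels instead of legosets, and it uses the native pipe `|>` (available in R 4.1 and later) so that no package is needed. For this call, `%>%` from magrittr behaves the same way.

```r
# Hypothetical categorical data standing in for legosets$category
category <- c("Normal", "Gear", "Normal", "Book", "Normal", "Gear")

# Nested calls: read from the inside out
nested <- prop.table(table(category))

# Piped calls: read left to right; the left-hand side becomes
# the first argument of the function on the right-hand side
piped <- table(category) |> prop.table()

identical(nested, piped)
piped
```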
!a - TRUE if a is FALSE
a == b - TRUE if a and b are equal
a != b - TRUE if a and b are not equal
a > b - TRUE if a is larger than b, but not equal
a >= b - TRUE if a is larger than or equal to b
a < b - TRUE if a is smaller than b, but not equal
a <= b - TRUE if a is smaller than or equal to b
a %in% b - TRUE if a is in b, where b is a vector
which( letters %in% c('a','e','i','o','u') )
## [1] 1 5 9 15 21
a | b - TRUE if a or b is TRUE
a & b - TRUE if a and b are both TRUE
isTRUE(a) - TRUE if a is TRUE
dplyr
mylego <- legosets %>% filter(themeGroup == 'Educational' & year > 2015)
mylego <- legosets[legosets$themeGroup == 'Educational' & legosets$year > 2015,]
nrow(mylego)
## [1] 61
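The two filtering styles can differ when the condition evaluates to NA: dplyr's filter() drops those rows, whereas base R logical indexing returns them as rows of NA. The sketch below shows the base R behavior on a toy data frame (all values are hypothetical, not taken from legosets) and one way to drop the NA cases with which().

```r
# Toy data frame with missing values (hypothetical, not from legosets)
sets <- data.frame(
  name       = c("A", "B", "C", "D", "E"),
  themeGroup = c("Educational", "Educational", "Basic", NA, "Educational"),
  year       = c(2014, 2018, 2019, 2020, NA),
  stringsAsFactors = FALSE
)

# Combine two comparisons with &; NA appears where the answer is unknown
keep <- sets$themeGroup == "Educational" & sets$year > 2015
keep

# Logical indexing keeps the NA positions as all-NA rows
nrow(sets[keep, ])

# which() returns only the positions that are TRUE
nrow(sets[which(keep), ])
sets[which(keep), ]
```

If the row counts from the dplyr and base R versions disagree on real data, missing values in the filtering columns are a likely cause.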
dplyr
mylego <- mylego %>% select(setID, pieces, theme, availability, US_retailPrice, minifigs)
mylego <- mylego[,c('setID', 'pieces', 'theme', 'availability', 'US_retailPrice', 'minifigs')]
head(mylego, n = 4)
##   setID pieces     theme    availability US_retailPrice minifigs
## 1 26803    103 Education {Not specified}             NA        6
## 2 26689    142 Education {Not specified}             NA        4
## 3 26804     98 Education {Not specified}             NA        6
## 4 26277    188 Education     Educational          78.95       NA
dplyr
mylego %>% relocate(where(is.numeric), .after = where(is.character)) %>% head(n = 3)
##       theme    availability setID pieces US_retailPrice minifigs
## 1 Education {Not specified} 26803    103             NA        6
## 2 Education {Not specified} 26689    142             NA        4
## 3 Education {Not specified} 26804     98             NA        6
mylego2 <- mylego[,c('theme', 'availability', 'setID', 'pieces', 'US_retailPrice', 'minifigs')]
head(mylego2, n = 3)
##       theme    availability setID pieces US_retailPrice minifigs
## 1 Education {Not specified} 26803    103             NA        6
## 2 Education {Not specified} 26689    142             NA        4
## 3 Education {Not specified} 26804     98             NA        6
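The base R version above lists the column names by hand. A more general sketch, closer in spirit to selecting columns by type, is shown below. It uses a toy data frame with made-up column names; unlike relocate(), this simple version would drop any column that is neither character nor numeric.

```r
# Toy data frame (hypothetical columns, not from legosets)
df <- data.frame(
  id    = 1:3,
  size  = c(10.5, 20, 7),
  theme = c("x", "y", "z"),
  kind  = c("a", "b", "c"),
  price = c(1.99, NA, 4.99),
  stringsAsFactors = FALSE
)

# Identify columns by type
is_chr <- vapply(df, is.character, logical(1))
is_num <- vapply(df, is.numeric, logical(1))

# Character columns first, numeric columns after them
reordered <- df[, c(names(df)[is_chr], names(df)[is_num])]
names(reordered)
```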
dplyr
mylego %>% dplyr::rename(USD = US_retailPrice) %>% head(n = 3)
##   setID pieces     theme    availability USD minifigs
## 1 26803    103 Education {Not specified}  NA        6
## 2 26689    142 Education {Not specified}  NA        4
## 3 26804     98 Education {Not specified}  NA        6
names(mylego2)[5] <- 'USD'
head(mylego2, n = 3)
##       theme    availability setID pieces USD minifigs
## 1 Education {Not specified} 26803    103  NA        6
## 2 Education {Not specified} 26689    142  NA        4
## 3 Education {Not specified} 26804     98  NA        6
dplyr
mylego %>% filter(!is.na(pieces) & !is.na(US_retailPrice)) %>% mutate(Price_per_piece = US_retailPrice / pieces) %>% head(n = 3)
##   setID pieces     theme availability US_retailPrice minifigs Price_per_piece
## 1 26277    188 Education  Educational          78.95       NA       0.4199468
## 2 25949    280 Education  Educational         224.95       NA       0.8033929
## 3 25954      1 Education  Educational          14.95       NA      14.9500000
mylego2 <- mylego[!is.na(mylego$pieces) & !is.na(mylego$US_retailPrice),]
mylego2$Price_per_piece <- mylego2$US_retailPrice / mylego2$pieces
head(mylego2, n = 3)
The result matches the dplyr output above; base R keeps the original row names instead of renumbering them.
legosets %>%
  group_by(themeGroup) %>%
  summarize(mean_price = mean(US_retailPrice, na.rm = TRUE),
            sd_price = sd(US_retailPrice, na.rm = TRUE),
            median_price = median(US_retailPrice, na.rm = TRUE),
            n = n(),
            missing = sum(is.na(US_retailPrice)))
## # A tibble: 15 × 6
##    themeGroup       mean_price sd_price median_price     n missing
##    <chr>                 <dbl>    <dbl>        <dbl> <int>   <int>
##  1 Action/Adventure      31.3     29.9         20.0   1280     462
##  2 Basic                 13.1     12.8          7.99   843     473
##  3 Constraction          15.1     14.0          9.99   501     125
##  4 Educational           89.0    107.          59.7    452     294
##  5 Girls                 23.4     22.6         15.0    677     225
##  6 Historical            25.5     27.7         15.0    473     125
##  7 Junior                18.6     13.2         17.8    228      93
##  8 Licensed              42.9     58.3         25.0   2060     467
##  9 Miscellaneous         14.3     20.8          6.99  4925    2117
## 10 Model making          52.8     65.1         30.0    582     166
## 11 Modern day            31.2     33.7         20.0   1723     763
## 12 Pre-school            23.8     19.4         20.0   1487     699
## 13 Racing                24.8     30.2         10      270      59
## 14 Technical             60.8     68.1         40.0    550     137
## 15 Vintage                9.71     9.56         7.50   304     264
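For comparison, the same kind of grouped summary can be sketched in base R by splitting the price vector by group and applying each statistic to the pieces. The data frame `toy` and its values are hypothetical; as with n() in the dplyr version, `n` counts all rows in the group, including those with a missing price.

```r
# Toy data (hypothetical values, not from legosets)
toy <- data.frame(
  themeGroup = c("Basic", "Basic", "Basic", "Technical", "Technical"),
  price      = c(5, NA, 15, 40, 60)
)

# One numeric vector of prices per group
groups <- split(toy$price, toy$themeGroup)

summary_tab <- data.frame(
  themeGroup   = names(groups),
  mean_price   = sapply(groups, mean, na.rm = TRUE),
  sd_price     = sapply(groups, sd, na.rm = TRUE),
  median_price = sapply(groups, median, na.rm = TRUE),
  n            = sapply(groups, length),                     # rows per group
  missing      = sapply(groups, function(p) sum(is.na(p))),  # NA prices
  row.names    = NULL
)
summary_tab
```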
library(psych)
describe(legosets$US_retailPrice)
##    vars    n  mean sd median trimmed   mad min    max  range skew kurtosis   se
## X1    1 9886 28.52 42  14.99   20.14 14.83   0 799.99 799.99 5.62    58.91 0.42
describeBy(legosets$US_retailPrice, group = legosets$availability, mat = TRUE, skew = FALSE)
##      item                group1 vars    n      mean        sd   min    max
## X11     1       {Not specified}    1 3197  24.24484 36.282072  0.60 789.99
## X12     2           Educational    1    9 140.95000 86.358265 14.95 244.95
## X13     3        LEGO exclusive    1 1066  28.79797 70.954538  0.00 799.99
## X14     4    LEGOLAND exclusive    1    7  12.70429  6.447591  4.99  19.99
## X15     5              Not sold    1    1  12.99000        NA 12.99  12.99
## X16     6           Promotional    1  167   9.19485 23.667555  0.00 249.99
## X17     7 Promotional (Airline)    1   11  15.79455  6.614819  5.00  28.00
## X18     8                Retail    1 4824  29.82030 33.270049  1.95 399.99
## X19     9      Retail - limited    1  600  44.64837 57.391438  0.40 379.99
## X110   10               Unknown    1    4   2.24750  1.253671  1.00   3.99
##       range         se
## X11  789.39  0.6416833
## X12  230.00 28.7860885
## X13  799.99  2.1732094
## X14   15.00  2.4369603
## X15    0.00         NA
## X16  249.99  1.8314504
## X17   23.00  1.9944429
## X18  398.04  0.4790158
## X19  379.59  2.3429956
## X110   2.99  0.6268356
ggplot2 is an R package that provides an alternative framework based upon Wilkinson’s (2005) Grammar of Graphics.
ggplot2 is, in general, more flexible for creating "prettier" and more complex plots.
It works by creating layers of different types of objects/geometries (e.g. bars, points, lines, polygons).
ggplot2 has at least three ways of creating plots:
1. qplot
2. ggplot(...) + geom_XXX(...) + ...
3. ggplot(...) + layer(...)
We will focus only on the second.
ggplot2 Statement
Data: ggplot(myDataFrame, aes(x=x, y=y))
Layers: geom_point(), geom_histogram()
Facets: facet_wrap(~ cut), facet_grid(~ cut)
Scales: scale_y_log10()
Other options: ggtitle('my title'), ylim(c(0, 10000)), xlab('x-axis label')
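The parts listed above can be combined into a single statement, one part per line. The following is a sketch on simulated data: the data frame `toy` and its columns are made up for illustration, and it assumes the ggplot2 package is installed.

```r
library(ggplot2)

# Simulated data; column names are hypothetical
set.seed(1)
toy <- data.frame(
  pieces = round(runif(60, 10, 1000)),
  group  = rep(c("A", "B", "C"), each = 20)
)
# Strictly positive prices so the log scale is well defined
toy$price <- 0.1 * toy$pieces * exp(rnorm(60, sd = 0.2)) + 1

p <- ggplot(toy, aes(x = pieces, y = price)) +  # data and aesthetics
  geom_point() +                                # layer
  facet_wrap(~ group) +                         # facets
  scale_y_log10() +                             # scale
  ggtitle("Price by number of pieces") +        # other options
  xlab("Number of pieces")
p
```

Each `+` adds one component, so a plot can be built up or simplified by adding or removing single lines.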
ls('package:ggplot2')[grep('^geom_', ls('package:ggplot2'))]
##  [1] "geom_abline"            "geom_area"              "geom_bar"
##  [4] "geom_bin_2d"            "geom_bin2d"             "geom_blank"
##  [7] "geom_boxplot"           "geom_col"               "geom_contour"
## [10] "geom_contour_filled"    "geom_count"             "geom_crossbar"
## [13] "geom_curve"             "geom_density"           "geom_density_2d"
## [16] "geom_density_2d_filled" "geom_density2d"         "geom_density2d_filled"
## [19] "geom_dotplot"           "geom_errorbar"          "geom_errorbarh"
## [22] "geom_freqpoly"          "geom_function"          "geom_hex"
## [25] "geom_histogram"         "geom_hline"             "geom_jitter"
## [28] "geom_label"             "geom_line"              "geom_linerange"
## [31] "geom_map"               "geom_path"              "geom_point"
## [34] "geom_pointrange"        "geom_polygon"           "geom_qq"
## [37] "geom_qq_line"           "geom_quantile"          "geom_raster"
## [40] "geom_rect"              "geom_ribbon"            "geom_rug"
## [43] "geom_segment"           "geom_sf"                "geom_sf_label"
## [46] "geom_sf_text"           "geom_smooth"            "geom_spoke"
## [49] "geom_step"              "geom_text"              "geom_tile"
## [52] "geom_violin"            "geom_vline"
ggplot(legosets, aes(x=pieces, y=US_retailPrice)) + geom_point()
ggplot(legosets, aes(x=pieces, y=US_retailPrice, color=availability)) + geom_point()
ggplot(legosets, aes(x=pieces, y=US_retailPrice, size=minifigs, color=availability)) + geom_point()
ggplot(legosets, aes(x=pieces, y=US_retailPrice, size=minifigs)) + geom_point() + facet_wrap(~ availability)
ggplot(legosets, aes(x='Lego', y=US_retailPrice)) + geom_boxplot()
ggplot(legosets, aes(x=availability, y=US_retailPrice)) + geom_boxplot()
ggplot(legosets, aes(x=availability, y=US_retailPrice)) + geom_boxplot() + coord_flip()
ggplot(legosets, aes(x = US_retailPrice)) + geom_histogram()
ggplot(legosets, aes(x = US_retailPrice)) + geom_histogram() + scale_x_log10()
ggplot(legosets, aes(x = US_retailPrice)) + geom_histogram() + facet_wrap(~ availability)
ggplot(legosets, aes(x = US_retailPrice, color = availability)) + geom_density()
Likert scales are a type of questionnaire where respondents are asked to rate items on scales usually ranging from four to seven levels (e.g. strongly disagree to strongly agree).
library(likert)
library(reshape)
data(pisaitems)
items24 <- pisaitems[,substr(names(pisaitems), 1,5) == 'ST24Q']
items24 <- rename(items24, c(
    ST24Q01="I read only if I have to.",
    ST24Q02="Reading is one of my favorite hobbies.",
    ST24Q03="I like talking about books with other people.",
    ST24Q04="I find it hard to finish books.",
    ST24Q05="I feel happy if I receive a book as a present.",
    ST24Q06="For me, reading is a waste of time.",
    ST24Q07="I enjoy going to a bookstore or a library.",
    ST24Q08="I read only to get information that I need.",
    ST24Q09="I cannot sit still and read for more than a few minutes.",
    ST24Q10="I like to express my opinions about books I have read.",
    ST24Q11="I like to exchange books with my friends."))
likert R Package
l24 <- likert(items24)
summary(l24)
##                                                        Item      low neutral
## 10   I like to express my opinions about books I have read. 41.07516       0
## 5            I feel happy if I receive a book as a present. 46.93475       0
## 8               I read only to get information that I need. 50.39874       0
## 7                I enjoy going to a bookstore or a library. 51.21231       0
## 3             I like talking about books with other people. 54.99129       0
## 11                I like to exchange books with my friends. 55.54115       0
## 2                    Reading is one of my favorite hobbies. 56.64470       0
## 1                                 I read only if I have to. 58.72868       0
## 4                           I find it hard to finish books. 65.35125       0
## 9  I cannot sit still and read for more than a few minutes. 76.24524       0
## 6                       For me, reading is a waste of time. 82.88729       0
##        high     mean        sd
## 10 58.92484 2.604913 0.9009968
## 5  53.06525 2.466751 0.9446590
## 8  49.60126 2.484616 0.9089688
## 7  48.78769 2.428508 0.9164136
## 3  45.00871 2.328049 0.9090326
## 11 44.45885 2.343193 0.9609234
## 2  43.35530 2.344530 0.9277495
## 1  41.27132 2.291811 0.9369023
## 4  34.64875 2.178299 0.8991628
## 9  23.75476 1.974736 0.8793028
## 6  17.11271 1.810093 0.8611554
likert Plots
plot(l24)
likert Plots
plot(l24, type='heat')
likert Plots
plot(l24, type='density')
Some problems1:
This example looks at the relationship between NZ dollar exchange rate and trade weighted index.
DATA606::shiny_demo('DualScales', package='DATA606')
My advice:
1 http://blog.revolutionanalytics.com/2016/08/dual-axis-time-series.html
2 http://ellisp.github.io/blog/2016/08/18/dualaxes
There is only one pie chart in OpenIntro Statistics (Diez, Barr, & Çetinkaya-Rundel, 2015, p. 48). Consider the following three pie charts that represent the preference of five different colors. Is there a difference between the three pie charts? This is probably difficult to answer.
Source: https://en.wikipedia.org/wiki/Pie_chart.
"There is no data that can be displayed in a pie chart that cannot better be displayed in some other type of chart"
For data wrangling:
dplyr website: https://dplyr.tidyverse.org
For data visualization:
ggplot2 website: https://ggplot2.tidyverse.org
Complete the one minute paper: https://forms.gle/qxRnsCyydx1nf8sXA
What was the most important thing you learned during this class?
What important question remains unanswered for you?
Image source: @allison_horst