Foundation for Inference Part 2

class: center, middle, inverse, title-slide

# Foundation for Inference Part 2
## DATA 606 - Statistics & Probability for Data Analytics
### Jason Bryer, Ph.D. and Angela Lui, Ph.D.
### March 9, 2022

---

# Data Project Proposal

Due around April 3rd. Select a dataset that interests you. For the proposal, you need to answer the questions below.

.font80[
* Research question
* What type of statistical test do you plan to do (e.g. t-test, ANOVA, regression, logistic regression, chi-squared, etc.)?
* What are the cases, and how many are there?
* Describe the method of data collection.
* What type of study is this (observational/experiment)?
* Data Source: If you collected the data, state self-collected. If not, provide a citation/link.
* Response: What is the response variable, and what type is it (numerical/categorical)?
* Explanatory: What is the explanatory variable(s), and what type is it (numerical/categorical)?
* Relevant summary statistics
]

More information, including a template and suggested datasets, is located here: https://spring2022.data606.net/assignments/project/

---
# One Minute Paper Results

.pull-left[
**What was the most important thing you learned during this class?**
<img src="05-Foundation_for_Inference2_files/figure-html/unnamed-chunk-2-1.png" style="display: block; margin: auto;" />
]
.pull-right[
**What important question remains unanswered for you?**
<img src="05-Foundation_for_Inference2_files/figure-html/unnamed-chunk-3-1.png" style="display: block; margin: auto;" />
]

---
# Population Distribution (Uniform)

```r
n <- 1e5
pop <- runif(n, 0, 1)
mean(pop)
```

```
## [1] 0.5003282
```

---
# Random Sample (n=30)

```r
samp2 <- sample(pop, size=30)
mean(samp2)
```

```
## [1] 0.4169682
```

```r
hist(samp2)
```

---
class: center, middle, inverse
# Null Hypothesis Testing

---
# Hypothesis Testing

* We start with a null hypothesis ( `$H_0$` ) that represents the status quo.

* We also have an alternative hypothesis ( `$H_A$` ) that represents our research question, i.e. what we're testing for.

* We conduct a hypothesis test under the assumption that the null hypothesis is true, either via simulation or traditional methods based on the central limit theorem.

* If the test results suggest that the data do not provide convincing evidence for the alternative hypothesis, we stick with the null hypothesis. If they do, then we reject the null hypothesis in favor of the alternative.
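
A minimal sketch of the simulation route, assuming a uniform(0, 1) population and a null value of 0.5. The seed, the variable names, and the choice of a shifted bootstrap are illustrative assumptions, not the course's prescribed method.

```r
set.seed(2022)                      # assumed seed, for reproducibility only

# Assumed setup: a uniform(0, 1) population, so the null value 0.5 is true
pop <- runif(1e5, 0, 1)
null_value <- 0.5
samp <- sample(pop, size = 30)      # the one sample we actually observe
obs_mean <- mean(samp)

# Simulate the null distribution: shift the sample so its mean equals the
# null value, then resample with replacement and record each mean
shifted <- samp - obs_mean + null_value
null_means <- replicate(5000, mean(sample(shifted, replace = TRUE)))

# Two-sided p-value: share of simulated means at least as far from the
# null value as the observed mean
p_value <- mean(abs(null_means - null_value) >= abs(obs_mean - null_value))
p_value
```

A small `p_value` would mean the observed mean is unusual under the null hypothesis; a large one means we stick with the null.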

---
# Hypothesis Testing (using CI)

`$H_0$`: The mean of the population from which `samp2` was drawn = 0.5  
`$H_A$`: The mean of the population from which `samp2` was drawn `$\ne$` 0.5

Using confidence intervals, if the *null* value is within the confidence interval, then we *fail* to reject the *null* hypothesis.

```r
(samp2.ci <- c(mean(samp2) - 1.96 * sd(samp2) / sqrt(length(samp2)),
			 mean(samp2) + 1.96 * sd(samp2) / sqrt(length(samp2))))
```

```
## [1] 0.3101259 0.5238106
```

Since 0.5 falls within the interval (0.3101259, 0.5238106), we *fail* to reject the null hypothesis.
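
The same check can be wrapped in a small function. This is a sketch only: `mean_ci` is a hypothetical helper, and the sample `x` is freshly simulated with an assumed seed, so its numbers will not match the output above.

```r
# Hypothetical helper: normal-approximation confidence interval for a mean
mean_ci <- function(x, conf_level = 0.95) {
  z <- qnorm(1 - (1 - conf_level) / 2)   # about 1.96 when conf_level = 0.95
  se <- sd(x) / sqrt(length(x))          # standard error of the mean
  c(lower = mean(x) - z * se, upper = mean(x) + z * se)
}

set.seed(2022)                           # assumed seed
x <- runif(30, 0, 1)                     # assumed sample, not the slide's samp2
null_value <- 0.5

ci <- mean_ci(x)
# Reject only if the null value lies outside the interval
reject <- null_value < ci["lower"] || null_value > ci["upper"]

# For comparison, t.test() builds its interval from the t distribution,
# which is slightly wider for n = 30
t_ci <- t.test(x, mu = null_value)$conf.int
```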

---
# Hypothesis Testing (using *p*-values)

$$ \bar { x } \sim N\left( mean=0.49,SE=\frac { 0.27 }{ \sqrt { 30 } } = 0.049 \right)  $$

$$ Z=\frac { \bar { x } -null }{ SE } =\frac { 0.49-0.50 }{ 0.049 } \approx -0.204 $$

```r
pnorm(-.204) * 2
```

```
## [1] 0.8383535
```
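
The arithmetic above can also be done in R from the summary statistics. This is a sketch using the slide's values (mean 0.49, standard deviation 0.27, n = 30); the variable names are illustrative.

```r
x_bar <- 0.49        # sample mean used on this slide
s <- 0.27            # sample standard deviation used on this slide
n <- 30
null_value <- 0.5

se <- s / sqrt(n)                    # standard error of the mean
z <- (x_bar - null_value) / se       # test statistic
p_value <- 2 * pnorm(-abs(z))        # two-sided p-value
round(c(se = se, z = z, p_value = p_value), 3)
```

Using `-abs(z)` keeps the doubling correct whether the statistic is positive or negative. Because `se` is not rounded to 0.049 here, `z` comes out near -0.203 rather than -0.204; the conclusion is unchanged.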

---
# Hypothesis Testing (using *p*-values)

```r
DATA606::normal_plot(cv = c(.204), tails = 'two.sided')
```

---
# Type I and II Errors

There are two competing hypotheses: the null and the alternative. In a hypothesis test, we make a decision about which might be true, but our choice might be incorrect.

| | fail to reject H0 | reject H0 |
|--------------------|:----------------------------:|:--------------------:|
| H0 true | 	&#10004; | Type I Error |
| HA true | Type II Error | 	&#10004; |

* Type I Error: **Rejecting** the null hypothesis when it is **true**.
* Type II Error: **Failing to reject** the null hypothesis when it is **false**.
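
One way to see the Type I error rate is to simulate many studies in which the null hypothesis really is true and count how often it is rejected. A sketch under assumed settings: uniform(0, 1) data, n = 30, a two-sided z-test at the 0.05 level, and a hypothetical helper `rejects_null`.

```r
set.seed(606)                        # assumed seed
null_value <- 0.5                    # true mean of uniform(0, 1), so H0 holds

# One simulated study: draw n values and run the two-sided z-test
rejects_null <- function(n = 30) {
  x <- runif(n, 0, 1)
  z <- (mean(x) - null_value) / (sd(x) / sqrt(n))
  abs(z) > qnorm(0.975)              # TRUE means H0 was (wrongly) rejected
}

# Share of studies that reject a true null hypothesis
type1_rate <- mean(replicate(5000, rejects_null()))
type1_rate
```

The rate should land near 0.05. It tends to run slightly higher because the normal critical value is paired with an estimated standard deviation.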

---
# Hypothesis Test

If we again think of a hypothesis test as a criminal trial, then it
makes sense to frame the verdict in terms of the null and
alternative hypotheses:

H0 : Defendant is innocent 
HA : Defendant is guilty

Which type of error is being committed in the following
circumstances?

* Declaring the defendant innocent when they are actually guilty 
<center>Type II error</center>

* Declaring the defendant guilty when they are actually innocent 
<center>Type I error</center>

Which error do you think is the worse error to make?

---
# Null Distribution

```r
(cv <- qnorm(0.05, mean=0, sd=1, lower.tail=FALSE))
```

```
## [1] 1.644854
```

---
# Alternative Distribution

```r
pnorm(cv, mean=cv, lower.tail = FALSE)
```

```
## [1] 0.5
```

---
# Another Example (mu = 2.5)

.pull-left[

```r
mu <- 2.5
(cv <- qnorm(0.05, 
			 mean=0, 
			 sd=1, 
			 lower.tail=FALSE))
```

```
## [1] 1.644854
```
]
.pull-right[
<img src="05-Foundation_for_Inference2_files/figure-html/unnamed-chunk-16-1.png" style="display: block; margin: auto;" /><img src="05-Foundation_for_Inference2_files/figure-html/unnamed-chunk-16-2.png" style="display: block; margin: auto;" />
]

---
# Numeric Values

Type I Error

```r
pnorm(cv, mean=0, sd=1, lower.tail=FALSE)  # area under the null beyond the critical value
```

```
## [1] 0.05
```

Type II Error

```r
pnorm(cv, mean=mu, lower.tail = TRUE)
```

```
## [1] 0.1962351
```
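
Power is one minus the Type II error rate, so it depends on how far the alternative mean sits from the null. A sketch under this slide's setup (null mean 0, sd 1, one-sided test at 0.05); `type2_error` and `test_power` are hypothetical helper names.

```r
alpha <- 0.05
cv <- qnorm(alpha, mean = 0, sd = 1, lower.tail = FALSE)   # one-sided cutoff

# Hypothetical helpers: error rates when the true mean is `mu` (sd = 1)
type2_error <- function(mu) pnorm(cv, mean = mu, sd = 1, lower.tail = TRUE)
test_power <- function(mu) 1 - type2_error(mu)

# Power grows as the alternative mean moves away from the null mean of 0
mus <- c(0, cv, 2.5, 4)
round(data.frame(mu = mus,
                 type2 = type2_error(mus),
                 power = test_power(mus)), 4)
```

At `mu = 2.5` the Type II error matches the 0.1962351 shown above. At `mu = cv` the power is exactly 0.5, and at `mu = 0` it equals `alpha`.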

---
# Shiny Application

Visualizing Type I and Type II errors: [https://bcdudek.net/betaprob/](https://bcdudek.net/betaprob/)

---
# Why p < 0.05?

Check out this page: https://r.bryer.org/shiny/Why05/