Machine Learning in R

# Machine Learning in R
### Introduction to the Tidyverse

___

**Simon Schölzel**

Winter Term 2021/2022  
.small[(updated: 2021-09-25)]

---

## Agenda

**1 Learning Objectives**

**2 Introduction to the `tidyverse`**  
> 2.1 What is the `tidyverse`?  
2.2 The Concept of Tidy Data
  
**3 `palmerpenguins`: Palmer Archipelago (Antarctica) Penguin Data**

**4 The Core `tidyverse` Packages**  
> 4.1 `magrittr`: A Forward-Pipe Operator for `R`  
4.2 `tibble`: Simple Data Frames  
4.3 `readr`: Read Rectangular Text Data  
4.4 `tidyr`: Tidy Messy Data  
4.5 `dplyr`: A Grammar of Data Manipulation  
4.6 `purrr`: Functional Programming Tools  
4.7 `ggplot2`: Create Elegant Data Visualisations Using the Grammar of Graphics

---

## 1 Learning Objectives 💡

This lecture teaches you important tools for working with tabular data sets in `R`. It introduces and showcases a suite of packages which ease your data science workflow in terms of data import, data cleaning, data transformation and data visualization.

More specifically, after this lecture you will
- be familiar with the main tools of the `tidyverse` and how it differs from `base R`, 
- know your way around the core packages of the `tidyverse` for importing, tidying, transforming and visualizing data, 
- be proficient in processing (*non-tidy*) data of any shape and quality, 
- be able to produce high-quality, fully customizable visualizations, 
- have improved your overall data literacy.

???
especially highlight the last point: how you think about data, how you approach working with data whenever you open a new data set, build a mental model for data transformation operations

---

# 2 Introduction to the `tidyverse`

---

background-image: url(https://www.tidyverse.org/images/hex-tidyverse.png)
background-position: 97.5% 2.5%
background-size: 7.5%
layout: true

---

## 2.1 What is the `tidyverse`?

> The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures. ~ [tidyverse.org](https://www.tidyverse.org/)

> Its primary goal is to facilitate a conversation between a human and a computer about data.  
~ [Wickham et al. (2019)](https://joss.theoj.org/papers/10.21105/joss.01686)

.pull-left[.center[
<img src="https://www.tidyverse.org/images/hex-tidyverse.png" width="45%" height="45%" />

Official `tidyverse` [Hex Sticker](https://github.com/rstudio/hex-stickers)
]]
.pull-right[.center[
<img src="https://pbs.twimg.com/profile_images/905186381995147264/7zKAG5sY.jpg" width="50%" height="50%" />

Hadley Wickham - Chief Scientist @ RStudio,  
[Founding Father](https://twitter.com/hadleywickham/status/959507805282582528?s=20) of the `tidyverse`
]]

???
- Can also be seen as a philosophy of how to write code in R. It's a dialect.
- Many people in the community argue that this dialect should be incorporated into base R.
- When googling for specific solutions and reading Stack Overflow answers, you will often find solutions implemented either in plain `base R` or in `tidyverse` syntax.

---

## 2.1 What is the `tidyverse`?

> The tidyverse is an opinionated **collection of R packages** designed for data science. All packages share an underlying design philosophy, grammar, and data structures. ~ [tidyverse.org](https://www.tidyverse.org/)

.pull-left[
**`tidyverse` core packages:**
- `readr`: data import
- `tibble`: modern data frame object
- `stringr`: working with strings
- `forcats`: working with factors
- `tidyr`: data tidying
- `dplyr`: data manipulation
- `ggplot2`: data visualization
- `purrr`: functional programming
]
.pull-right[
<img src="./img/tidyverse-hex.PNG" width="90%" height="90%" style="display: block; margin: auto;" />
]

???
- The `tidyverse` can be viewed as a meta-package
- each package has its own goal, which makes the `tidyverse` a modular collection of packages
- these are the core packages (there are many others for special purposes which integrate seamlessly, e.g., `lubridate`, `readxl`, `haven`, ...)

---

## 2.1 What is the `tidyverse`?

```r
install.packages("tidyverse")
library(tidyverse)
```
```
-- Attaching packages --------------------------------------- tidyverse 1.3.1 --
v ggplot2 3.3.5     v purrr   0.3.4
v tibble  3.1.4     v dplyr   1.0.7
v tidyr   1.1.3     v stringr 1.4.0
v readr   2.0.1     v forcats 0.5.1
-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()
```

.footnote[
*Note that `install.packages("tidyverse")` is essentially equivalent to running `install.packages("ggplot2")`, `install.packages("tibble")`, `install.packages("tidyr")`, `install.packages("readr")`, etc. individually.*
]

???
- the startup message shows the version of the `tidyverse` package as well as the versions of the eight core packages
- the eight core packages are attached by loading the `tidyverse` package
- note that some `base R` functions (`stats` namespace) are masked by `dplyr` functions of the same name; they remain available via `stats::filter()` and `stats::lag()`
- when working with a new or rarely used package, I prefer to explicitly state the namespace (e.g., `dplyr::filter()`) to remember where the function is coming from
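
A minimal sketch of the masking behaviour, assuming `dplyr` is installed. The toy tibble `toy` is hypothetical and only serves to show that both functions stay reachable through their namespace prefix.

```r
library(dplyr)

toy <- tibble(id = 1:4, value = c(10, 20, 30, 40))  # hypothetical toy data

# After attaching dplyr, the unqualified name resolves to the dplyr version
identical(filter, dplyr::filter)

# dplyr::filter() keeps the rows that satisfy a condition
kept <- dplyr::filter(toy, value > 15)

# stats::filter() is a linear filter for time series (here: a 2-point moving average)
smoothed <- stats::filter(toy$value, rep(1 / 2, 2))
```

Writing the prefix costs a few characters but removes any doubt about which `filter()` is being called.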

---

## 2.1 What is the `tidyverse`?

> The tidyverse is an opinionated collection of R packages designed **for data science**. All packages share an underlying design philosophy, grammar, and data structures. ~ [tidyverse.org](https://www.tidyverse.org/)

These packages are geared towards facilitating the day-to-day data science workflow:

**Import:** `readr`  
**Tidy:** `tidyr`  
**Transform:** `dplyr`, `forcats`, `stringr`  
**Visualize:** `ggplot2`  
**Model:** `tidymodels`  
**Communicate:** `rmarkdown`  
**Program:** `magrittr`, `purrr`, `tibble`

.footnote[
*Note: For the communication and modeling part of the workflow refer to the `rmarkdown` and `tidymodels` videos.*
]

???
The workflow is shown here as a stylized example.

---

## 2.1 What is the `tidyverse`?

> The tidyverse is an opinionated collection of R packages designed for data science. All packages share an **underlying design philosophy, grammar, and data structures**. ~ [tidyverse.org](https://www.tidyverse.org/)

This underlying design philosophy and grammar boil down to a consistent and easy-to-use API:

- The `tibble` as the core underlying data structure
- Extensive use of the `%>%`-operator for gluing together multiple function calls
- Consistently applied naming conventions (e.g., function names in [*snakecase*](https://raw.githubusercontent.com/allisonhorst/stats-illustrations/master/other-stats-artwork/coding_cases.png))
- Consistent order of function arguments (e.g., `fn(arg1 = data, arg2 = col names, ...)`)
- ...

The `tidyverse` syntax can be viewed as a "dialect" of `R`. Once you have familiarized yourself with it, you can easily transfer your knowledge about one function or package to other components of the `tidyverse`, just like when learning a new language.

.footnote[
*Note: For further information see [Tidyverse Team (2020)](https://design.tidyverse.org/) and [Wickham (2019)](https://cran.r-project.org/web/packages/tidyverse/vignettes/manifesto.html).*
]

???
- API: application programming interface
- R data structures: atomic vector (logical, integer, double, character, complex), list, matrix, data frame, factor -> the tibble is simply an extension/better version of the data frame
- snake case: lowercase characters, numbers and underscores

---

## 2.2 The Concept of Tidy Data

> Tidy data sets are all alike; but every messy data set is messy in its own way.  
~ [Wickham/Grolemund (2017)](https://r4ds.had.co.nz/tidy-data.html)

.pull-left[
**Tidy Data Principles:** The concept of tidy data was coined by Hadley Wickham in his 2014 paper ["Tidy Data"](https://www.jstatsoft.org/article/view/v059i10). The concept formulates principles for structuring rectangular, tabular data sets consisting of rows and columns:

1. Each variable forms a column.

2. Each observation forms a row.

3. Each type of observational unit forms a table.
]
.pull-right[

???
- 3: relates to the storage of one data set per table (analogy to principles in database design) -> here the type of observational unit might be the citizen who receives a policy treatment, e.g., a tax reduction (hence information about firms might be stored in a different data frame)
- all the upcoming tools are geared towards bringing data into this tabular shape (conversely, we will not work with text or image data)

---

## 2.2 The Concept of Tidy Data

**Violations of the Tidy Data Principles:**
1. Column headers are values, not variable names.  
2. Multiple variables are stored in one column.  
3. Variables are stored in both rows and columns.  
4. Multiple types of observational units are stored in the same table.  
5. A single observational unit is stored in multiple tables.

.panelset[
.panel[.panel-name[Example 1]

```
> # A tibble: 3 x 4
> species Biscoe Dream Torgersen
> <fct> <int> <int> <int>
> 1 Adelie 44 56 52
> 2 Chinstrap NA 68 NA
> 3 Gentoo 124 NA NA
```
]
.panel[.panel-name[Example 2]

```
> # A tibble: 5 x 3
> col island year
> <chr> <fct> <int>
> 1 Gentoo_NA Biscoe 2007
> 2 Adelie_male Torgersen 2007
> 3 Gentoo_female Biscoe 2008
> 4 Chinstrap_male Dream 2008
> 5 Adelie_male Torgersen 2009
```
]
.panel[.panel-name[Example 3]

```
> # A tibble: 3 x 4
> term bill_length_mm bill_depth_mm flipper_length_mm
> <chr> <dbl> <dbl> <dbl>
> 1 bill_length_mm NA -0.235 0.656
> 2 bill_depth_mm -0.235 NA -0.584
> 3 flipper_length_mm 0.656 -0.584 NA
```
]
.panel[.panel-name[Example 4]

```
> # A tibble: 6 x 6
> species island sex model mpg cyl
> <fct> <fct> <fct> <chr> <dbl> <dbl>
> 1 Chinstrap Dream female <NA> NA NA
> 2 Gentoo Biscoe female <NA> NA NA
> 3 Gentoo Biscoe male <NA> NA NA
> 4 <NA> <NA> <NA> Merc 450SLC 15.2 8
> 5 <NA> <NA> <NA> Dodge Challenger 15.5 8
> 6 <NA> <NA> <NA> Pontiac Firebird 19.2 8
```
]
.panel[.panel-name[Example 5]
```
# A tibble: 4 x 2
  species island
  <chr>   <chr>
1 Adelie  Torgersen
2 Adelie  Torgersen
3 Adelie  Torgersen
4 Adelie  Torgersen

# A tibble: 4 x 4
  species bill_length_mm bill_depth_mm flipper_length_mm
  <chr>            <dbl>         <dbl>             <dbl>
1 Adelie            39.1          18.7               181
2 Adelie            39.5          17.4               186
3 Adelie            40.3          18                 195
4 Adelie            NA            NA                  NA
```
]
]
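
As a rough sketch of how the first two violations might be repaired, assuming `tibble` and `tidyr` are installed: the small tables are retyped by hand from Examples 1 and 2, and the object names (`counts_wide`, `mixed`, ...) are hypothetical.

```r
library(tibble)
library(tidyr)

# Violation 1: island names are stored as column headers
counts_wide <- tribble(
  ~species,    ~Biscoe, ~Dream, ~Torgersen,
  "Adelie",         44,     56,         52,
  "Chinstrap",      NA,     68,         NA,
  "Gentoo",        124,     NA,         NA
)

counts_tidy <- counts_wide %>%
  pivot_longer(
    cols = c(Biscoe, Dream, Torgersen),
    names_to = "island",      # former column headers become values
    values_to = "n",
    values_drop_na = TRUE     # drop species/island pairs without observations
  )

# Violation 2: species and sex share one column
mixed <- tibble(
  col = c("Gentoo_NA", "Adelie_male", "Gentoo_female"),
  island = c("Biscoe", "Torgersen", "Biscoe"),
  year = c(2007L, 2007L, 2008L)
)

separated <- mixed %>%
  # convert = TRUE also turns the literal string "NA" into a missing value
  separate(col, into = c("species", "sex"), sep = "_", convert = TRUE)
```

In both cases the result has one column per variable and one row per observation.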

---

# 3 `palmerpenguins`: Palmer Archipelago (Antarctica) Penguin Data

---

background-image: url(https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/man/figures/logo.png)
background-position: 97.5% 2.5%
background-size: 7.5%
layout: true

---

## 3 `palmerpenguins`: Palmer Archipelago (Antarctica) Penguin Data
.pull-left[
From here on, to illustrate the features of the `tidyverse` core packages we use data from the `palmerpenguins` package by [Allison Horst](https://allisonhorst.github.io/palmerpenguins/).

The package comes with data about penguins observed on islands in the Palmer Archipelago near Palmer Station, Antarctica.
]
.pull-right[
<img src="https://tenor.com/view/penguin-fat-the-struggle-is-real-lazy-gif-4242854.gif" width="60%" style="display: block; margin: auto;" />
]

---

## 3 `palmerpenguins`: Palmer Archipelago (Antarctica) Penguin Data
.pull-left[

```r
library(palmerpenguins)

penguins
```

```
> # A tibble: 344 x 8
> species island bill_length_mm bill_depth_mm
> <fct> <fct> <dbl> <dbl>
> 1 Adelie Torgersen 39.1 18.7
> 2 Adelie Torgersen 39.5 17.4
> 3 Adelie Torgersen 40.3 18 
> 4 Adelie Torgersen NA NA 
> 5 Adelie Torgersen 36.7 19.3
> 6 Adelie Torgersen 39.3 20.6
> 7 Adelie Torgersen 38.9 17.8
> 8 Adelie Torgersen 39.2 19.6
> 9 Adelie Torgersen 34.1 18.1
> 10 Adelie Torgersen 42 20.2
> # ... with 334 more rows, and 4 more variables:
> # flipper_length_mm <int>, body_mass_g <int>,
> # sex <fct>, year <int>
```
]
.pull-right[
<img src="https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/man/figures/lter_penguins.png" width="65%" height="65%" style="display: block; margin: auto;" /><img src="https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/man/figures/culmen_depth.png" width="65%" height="65%" style="display: block; margin: auto;" />
]

---

# 4.1 `magrittr`: A Forward-Pipe Operator for `R`

---

background-image: url(https://raw.githubusercontent.com/tidyverse/magrittr/master/man/figures/logo.png)
background-position: 95% 5%
background-size: 7.5%
layout: true

---

## 4.1 `magrittr`: The Forward-Pipe Operator

`magrittr` comes with a set of operators:
- **Pipe Operator:** `%>%` 
- **Assignment Operator:** `%<>%` 
- **"Tee" Operator:** `%T>%` 
- **Exposition Operator:** `%$%`

Essentially, these operators aim to improve the readability of your code in multiple ways:
- arrange operations into an easily readable pipeline of chained commands (left-to-right),
- avoid nested function calls (inside-out), 
- minimize the use of local variable assignments (`<-`) and function definitions, and
- easily add and/or delete steps in your pipeline without breaking the code.

???
The exposition operator: %$% (explodes out variables in a data frame, no need to use pull())

---

## 4.1 `magrittr`: The Forward-Pipe Operator

**Basic Piping:** forward a value or object (LHS) into the next function call (RHS) as **first** argument

```r
x %>% f                            # equivalent to: f(x)
x %>% f(y)                         # equivalent to: f(x, y)
x %>% f %>% g %>% h                # equivalent to: h(g(f(x)))
```

**Piping with placeholders:** forward a value or object (LHS) into the next function call (RHS) as **any** argument

```r
x %>% f(.)                         # equivalent to: x %>% f
x %>% f(y, .)                      # equivalent to: f(y, x)
x %>% f(y, z = .)                  # equivalent to: f(y, z = x)
x %>% f(y = nrow(.), z = ncol(.))  # equivalent to: f(x, y = nrow(x), z = ncol(x))
```

**Building functions and pipelines:** a sequence of code starting with the placeholder (`.`) returns a function which can be used to later apply the pipeline to concrete values

```r
f <- . %>% cos %>% sin # equivalent to: f <- function(.) sin(cos(.))
f(20) # equivalent to: the pipeline 20 %>% cos %>% sin
```

.footnote[
*Note: Find out more about `%>%` by running `vignette("magrittr")`. In RStudio, type `%>%` using the shortcut Ctrl + Shift + M.*
]
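
The abstract patterns on this slide can be checked with concrete values. The following is a small sketch, assuming only that `magrittr` is installed; the vector `x` is a hypothetical example.

```r
library(magrittr)

x <- c(1, 4, 9)

# Basic piping: the LHS becomes the first argument of each call
piped  <- x %>% sqrt %>% sum
nested <- sum(sqrt(x))

# Placeholder: the LHS is passed as the `by` argument instead of the first one
steps <- 2 %>% seq(1, 9, by = .)   # same as seq(1, 9, by = 2)

# Functional sequence: a pipeline starting with `.` is a reusable function
f <- . %>% cos %>% sin
all.equal(f(20), sin(cos(20)))
```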

---

## 4.1 `magrittr`: The Forward-Pipe Operator

**Question:** What is the average body mass in grams for all penguins observed in the year 2007 (after excluding missing values)?

**In a pipeless world:**

```r
mean(subset(penguins, year == 2007)$body_mass_g, na.rm = TRUE)

# alternatively, store the intermediate result first:
peng_bm_2007 <- subset(penguins, year == 2007)$body_mass_g
mean(peng_bm_2007, na.rm = TRUE)
```

**With the pipe:**

.pull-left[

```r
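penguins %>% 
  subset(year == 2007) %>%   # keep observations from 2007
  .$body_mass_g %>%          # extract the body mass column as a vector
  mean(na.rm = TRUE)         # average, ignoring missing values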
```
]
.pull-right[
 
- Sequential style improves readability!
- Less deciphering of nested function calls!
- No need to store intermediate results!
- Modular modification of pipeline steps!
]

.footnote[
*Note: As of version `4.1.0`, base `R` comes with a native pipe operator as well (`|>`).*
]
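
For comparison, a sketch of the same question answered with the native pipe, assuming R >= 4.1.0 and the `palmerpenguins` package. The `.` placeholder belongs to `magrittr` and is not available with `|>`, so the column is extracted with base `getElement()` here; this is one possible workaround, not the only one.

```r
library(palmerpenguins)

# With the native pipe, the RHS must be a function call with parentheses
mean_mass_2007 <- penguins |>
  subset(year == 2007) |>
  getElement("body_mass_g") |>
  mean(na.rm = TRUE)
```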

???
- Add or remove individual steps easily in your pipeline
- The `magrittr` forward pipe is imported by the `tidyverse`, no need to load it separately

---

## 4.1 `magrittr`: The Forward-Pipe Operator

**Advanced piping:** Use the more advanced pipe operators to further streamline your workflow.

.panelset[
.panel[.panel-name[Tee Pipe]
`%T>%` can be used to trigger the side-effect of a function, e.g., for plotting or printing results, and let the original data bypass the respective step.

```r
penguins[1:5, c("island", "bill_length_mm")] %T>% 
  print %>% 
  .$bill_length_mm %>% 
  mean(na.rm = TRUE)
```

```
> # A tibble: 5 x 2
> island bill_length_mm
> <fct> <dbl>
> 1 Torgersen 39.1
> 2 Torgersen 39.5
> 3 Torgersen 40.3
> 4 Torgersen NA 
> 5 Torgersen 36.7
```

```
> [1] 38.9
```
]
.panel[.panel-name[Exposition Pipe]
`%$%` exposes the names in the LHS object to the RHS expression. This is useful if the RHS function does not have a `data` argument.

```r
penguins %$% plot(species, bill_length_mm)  # equivalent to: plot(penguins$species, penguins$bill_length_mm)
```

<img src="index_files/figure-html/unnamed-chunk-35-1.png" width="432" />
]
.panel[.panel-name[Assignment Pipe]
`%<>%` pipes the LHS variable into the RHS and assigns the result of the pipeline back to that variable, i.e. it combines `%>%` with the base `R` assignment operator (`<-`).

```r
var <- penguins$bill_length_mm

var %<>% mean(na.rm = TRUE)  # same as: var <- var %>% mean(na.rm = TRUE)

var
```

```
> [1] 43.92193
```
]
]
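
Unlike `%>%`, the three advanced operators are not re-exported by the core `tidyverse` packages, so `magrittr` has to be attached explicitly. A compact sketch on hypothetical toy values:

```r
library(magrittr)

# Tee pipe: print() is called for its side effect, the numbers flow on to sum()
total <- c(1, 2, 3) %T>% print %>% sum

# Exposition pipe: the list elements `a` and `b` are visible inside the RHS
dot <- list(a = 1:3, b = 4:6) %$% sum(a * b)

# Assignment pipe: sort `x` and write the result back to `x`
x <- c(3, 1, 2)
x %<>% sort
```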

---

# 4.2 `tibble`: Simple Data Frames

---

background-image: url(https://raw.githubusercontent.com/tidyverse/tibble/master/man/figures/logo.png)
background-position: 97.5% 2.5%
background-size: 7.5%
layout: true

---

## 4.2 `tibble`: Simple Data Frames

`tibble` provides an enhanced data frame object of class `tbl_df`, a so-called `tibble`. A `tibble` can be created in four different ways.

.panelset[
.panel[.panel-name[tibble()]
Create a `tibble` from column vectors with `tibble()`.

```r
tibble(
  x = c("a", "b"),
  y = c(1, 2),
  z = c(T, F)
)
```

```
> # A tibble: 2 x 3
> x y z 
> <chr> <dbl> <lgl>
> 1 a 1 TRUE 
> 2 b 2 FALSE
```
]
.panel[.panel-name[tribble()]
Create a *transposed* `tibble` row by row with `tribble()`.

```r
tribble(
  ~x, ~y,  ~z,
  "a", 1,  T,
  "b", 2,  F
)
```

```
> # A tibble: 2 x 3
> x y z 
> <chr> <dbl> <lgl>
> 1 a 1 TRUE 
> 2 b 2 FALSE
```
]
.panel[.panel-name[as_tibble()]
Create a `tibble` from an existing data frame with `as_tibble()`.

```r
data.frame(
  x = c("a", "b"),
  y = c(1, 2),
  z = c(T, F)
) %>% 
as_tibble
```

```
> # A tibble: 2 x 3
> x y z 
> <chr> <dbl> <lgl>
> 1 a 1 TRUE 
> 2 b 2 FALSE
```
]
.panel[.panel-name[enframe()]
Create a `tibble` from named vectors with `enframe()`.

```r
c(x = "a", y = "b", z = 1) %>%
  enframe(name = "x", value = "y")
```

```
> # A tibble: 3 x 2
> x y 
> <chr> <chr>
> 1 x a 
> 2 y b 
> 3 z 1
```
]
]

There are three important differences between a `tibble` and a `data.frame` object.

???
- named vector: a vector of key-value pairs

---

## 4.2 `tibble`: Simple Data Frames

**Printing:** By default, a `tibble` prints only the first ten rows and all the columns that fit on the screen, along with the data type of each column. This gives you a much more concise view of your data.

```r
penguins
```

```
> # A tibble: 344 x 8
> species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex 
> <fct> <fct> <dbl> <dbl> <int> <int> <fct>
> 1 Adelie Torgersen 39.1 18.7 181 3750 male 
> 2 Adelie Torgersen 39.5 17.4 186 3800 fema~
> 3 Adelie Torgersen 40.3 18 195 3250 fema~
> 4 Adelie Torgersen NA NA NA NA <NA> 
> 5 Adelie Torgersen 36.7 19.3 193 3450 fema~
> 6 Adelie Torgersen 39.3 20.6 190 3650 male 
> 7 Adelie Torgersen 38.9 17.8 181 3625 fema~
> 8 Adelie Torgersen 39.2 19.6 195 4675 male 
> 9 Adelie Torgersen 34.1 18.1 193 3475 <NA> 
> 10 Adelie Torgersen 42 20.2 190 4250 <NA> 
> # ... with 334 more rows, and 1 more variable: year <int>
```

???
- you will never again have the problem that `R` takes minutes to print a large data frame entirely to your console (`reached 'max' / getOption("max.print")`)
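
If the default view is too narrow or too short, the print method can be adjusted per call. A minimal sketch, assuming `tibble` and `palmerpenguins` are installed:

```r
library(tibble)
library(palmerpenguins)

# Show only three rows but every column, regardless of the console width
print(penguins, n = 3, width = Inf)
```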

---

## 4.2 `tibble`: Simple Data Frames

**Printing:** By default, a `tibble` prints only the first ten rows and all the columns that fit on the screen, along with the data type of each column.

.panelset[
.panel[.panel-name[data.frame()]

```r
data.frame(penguins)
```

```
> species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
> 1 Adelie Torgersen 39.1 18.7 181 3750
> 2 Adelie Torgersen 39.5 17.4 186 3800
> 3 Adelie Torgersen 40.3 18.0 195 3250
> 4 Adelie Torgersen NA NA NA NA
> 5 Adelie Torgersen 36.7 19.3 193 3450
> sex year
> 1 male 2007
> 2 female 2007
> 3 female 2007
> 4 <NA> 2007
> 5 female 2007
> [ reached 'max' / getOption("max.print") -- omitted 339 rows ]
```

]
.panel[.panel-name[tibble() (Option 1)]

```r
penguins
```
]
.panel[.panel-name[tibble() (Option 2)]

```r
penguins %>% glimpse
```

```
> Rows: 344
> Columns: 8
> $ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, A~
> $ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgersen, To~
> $ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, 42.0,~
> $ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, 20.2,~
> $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186, 180~
> $ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, 4250,~
> $ sex <fct> male, female, female, NA, female, male, female, male, NA,~
> $ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 200~
```
]
]

???
- in contrast to a `data.frame`, which prints a large number of rows
- `glimpse()`: a transposed version of `print()`
 
---

## 4.2 `tibble`: Simple Data Frames

**Subsetting:** Subsetting a `tibble` (`[]`) always returns another `tibble` and never a vector (in contrast to standard `data.frame` objects).

.panelset[
.panel[.panel-name[data.frame()]

```r
data.frame(penguins) %>% .[,"species"] %>% class
```

```
> [1] "factor"
```
]
.panel[.panel-name[tibble()]

```r
penguins[,"species"] %>% class
```

```
> [1] "tbl_df"     "tbl"        "data.frame"
```
]
]
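
When a plain vector is needed after all, a column can still be extracted explicitly. A short sketch on a hypothetical toy tibble `tb`, assuming `tibble` is installed:

```r
library(tibble)

tb <- tibble(x = c("a", "b"), y = c(1, 2))  # hypothetical toy tibble

class(tb[, "x"])                 # still a tibble

col_a <- tb[["x"]]               # character vector
col_b <- tb$x                    # character vector
col_c <- tb[, "x", drop = TRUE]  # character vector, requested explicitly
```

The extraction is always a deliberate choice, so the return type never depends on how many columns happen to be selected.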

---

## 4.2 `tibble`: Simple Data Frames

**Partial Matching:** Subsetting a `tibble` does not allow for partial matching, i.e. you must always provide the whole column name.

.panelset[
.panel[.panel-name[data.frame()]

```r
data.frame(penguins)$spec
```

```
>  [1] Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie
> [12] Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie
> [23] Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie
> [34] Adelie Adelie Adelie Adelie Adelie Adelie Adelie
>  [ reached getOption("max.print") -- omitted 304 entries ]
> Levels: Adelie Chinstrap Gentoo
```

]
.panel[.panel-name[tibble()]

```r
penguins$spec
```

```
> Warning: Unknown or uninitialised column: `spec`.
```

```
> NULL
```
]
]

???
- another advantage of tibbles: they give you clearer warning messages that confront you with problems early on.

---

# 4.3 `readr`: Read Rectangular Text Data

???
- not data in the form of texts, but data stored in a text file (e.g., `.txt`, `.csv`, `.tsv`)

---

background-image: url(https://raw.githubusercontent.com/tidyverse/readr/master/man/figures/logo.png)
background-position: 97.5% 2.5%
background-size: 7.5%
layout: true

---

## 4.3 `readr`: Read Rectangular Text Data

`readr` provides read and write functions for multiple different file formats:
- `read_delim()`: general delimited files
- `read_csv()`: comma separated files
- `read_csv2()`: semicolon separated files
- `read_tsv()`: tab separated files
- `read_fwf()`: fixed width files
- `read_table()`: white-space separated files
- `read_log()`: web log files

Conveniently, the `write_*()` functions work analogously. In addition, use the `readxl` package for Excel files, the `haven` package for Stata files, the `googlesheets4` package for Google Sheets or the `rvest` package for HTML files.

.footnote[
*Note: In most European countries, Microsoft Excel uses `;` as the default delimiter, which can be accounted for by using the `read_csv2()` function.*
]
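
To illustrate how `read_delim()` generalizes the other readers, the sketch below parses the same semicolon-separated text twice. It assumes `readr` >= 2.0.0, where literal data can be passed via `I()`; the text `csv2_text` is a hypothetical example.

```r
library(readr)

# Hypothetical semicolon-separated text with decimal commas
csv2_text <- "x;y\n1,5;a\n2,5;b\n"

via_csv2 <- read_csv2(I(csv2_text), show_col_types = FALSE)

# The same result, spelled out with the general reader
via_delim <- read_delim(
  I(csv2_text),
  delim = ";",
  locale = locale(decimal_mark = ",", grouping_mark = "."),
  show_col_types = FALSE
)
```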

???
- `read_delim()` as a generalization of the other functions
- `rvest` as the go-to package in the context of web scraping with `R`

---

## 4.3 `readr`: Read Rectangular Text Data

Let's try it out by reading in the penguins data. To illustrate the `readr` package, the `penguins` data was written to a csv file beforehand using `write_csv(penguins, file = "./data/penguins.csv")`.

```r
data <- read_csv(file = "./data/penguins.csv")
```

```
> Rows: 344 Columns: 8
```

```
> -- Column specification -------------------------------------------------------------
> Delimiter: ","
> chr (3): species, island, sex
> dbl (5): bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g, year
```

```
> 
> i Use `spec()` to retrieve the full column specification for this data.
> i Specify the column types or set `show_col_types = FALSE` to quiet this message.
```
]
.panel[.panel-name[Select Columns]

```r
data <- read_csv(file = "./data/penguins.csv", col_select = c(species, island))
```

```
> Rows: 344 Columns: 2
```

```
> -- Column specification -------------------------------------------------------------
> Delimiter: ","
> chr (2): species, island
```

```r
data <- read_csv(file = "./data/penguins.csv", col_names = paste("Var", 1:8, sep = "_"))
```

```
> Rows: 345 Columns: 8
```

```
> -- Column specification -------------------------------------------------------------
> Delimiter: ","
> chr (8): Var_1, Var_2, Var_3, Var_4, Var_5, Var_6, Var_7, Var_8
```

```r
data <- read_csv(file = "./data/penguins.csv", skip = 5)
```

```
> Rows: 339 Columns: 8
```

```
> -- Column specification -------------------------------------------------------------
> Delimiter: ","
> chr (3): Adelie, Torgersen, female
> dbl (5): 36.7, 19.3, 193, 3450, 2007
```

```
> 
> i Use `spec()` to retrieve the full column specification for this data.
> i Specify the column types or set `show_col_types = FALSE` to quiet this message.
```
]
]

.pull-right[.pull-right[
.footnote[
Note: The output of any `read_*()` function is a `tibble` object.
]]]

---

## 4.3 `readr`: Read Rectangular Text Data

`readr` prints the column specifications after importing. By default, it guesses each column type (e.g., `dbl`, `chr`, `lgl`, `date`) from up to 1,000 rows and parses the columns accordingly. Other types, such as `int` or `fct`, must be requested explicitly.

Try to make column specifications explicit! You will become more familiar with your data and see warnings if something changes unexpectedly.

.panelset[
.panel[.panel-name[Option 1]

```r
read_csv(
  file = "./data/penguins.csv",
  col_types = cols(
    species = col_character(),
    year = col_datetime(format = "%Y"),
    island = col_skip())
  )
```
]
.panel[.panel-name[Option 2]

```r
read_csv(
  file = "./data/penguins.csv",
  col_types =  "_f?di"  # skip, factor, guess, double, integer, ...
  )
```
]
]

Guessing from only 1,000 rows is efficient but can lead to erroneous guesses. If necessary, raise the limit via `guess_max`:

```r
read_csv(file = "./data/penguins.csv", guess_max = 2000)
```

.footnote[
*Note: Find more information and functions on the `readr` [cheat sheet](https://raw.githubusercontent.com/rstudio/cheatsheets/master/data-import.pdf).*
]
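
A self-contained sketch of an explicit column specification, assuming `readr` >= 2.0.0 (for literal data via `I()`). The three-row text `csv_text` is a hypothetical extract in the layout of `penguins.csv`, not the actual file.

```r
library(readr)

# Hypothetical three-row extract in the layout of penguins.csv
csv_text <- "species,body_mass_g,year\nAdelie,3750,2007\nGentoo,5000,2008\nAdelie,NA,2009\n"

typed <- read_csv(
  I(csv_text),
  col_types = cols(
    species = col_factor(levels = c("Adelie", "Chinstrap", "Gentoo")),
    body_mass_g = col_integer(),
    year = col_integer()
  )
)

spec(typed)  # shows the column specification that was applied
```

Fixing the factor levels up front means that an unexpected species name surfaces as a parsing problem instead of silently becoming a new category.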

???
- Hint: sometimes you may have trouble when reading in text data (type character): special characters such as ö, ä or ü may show up as cryptic symbols -> in those cases you must specify the encoding of your data in the `read_csv()` call (e.g., `locale = locale(encoding = "UTF-8")`)

---

## 4.3 `readr`: Read Rectangular Text Data

.pull-left[
Eventually, you may want to stop using `.xlsx` and `.csv` files, as they cannot reliably store your metadata (e.g., data types).

<img src="./img/excel.jpg" width="60%" height="60%" style="display: block; margin: auto;" />
]

.pull-right[
`write_rds()` and `read_rds()` provide a nice alternative for [serializing](https://en.wikipedia.org/wiki/Serialization) your `R` objects (e.g., `tibbles`, models) and storing them as `.rds` files.

```r
penguins %>% 
  write_rds(file = "./data/penguins.rds")
```

```r
penguins <- read_rds(file = "./data/penguins.rds")
```

Note that
- `write_rds()` can only be used to save one object at a time,
- the object returned by `read_rds()` must be assigned to a variable, i.e. you choose its name when loading,
- `read_rds()` preserves data types!
]
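
The difference can be seen in a small round trip through temporary files. This is a sketch on a hypothetical two-row tibble `df`, assuming `readr` and `tibble` are installed.

```r
library(readr)
library(tibble)

df <- tibble(
  species = factor(c("Adelie", "Gentoo")),
  year = c(2007L, 2008L)
)

rds_path <- tempfile(fileext = ".rds")
csv_path <- tempfile(fileext = ".csv")

write_rds(df, rds_path)
write_csv(df, csv_path)

from_rds <- read_rds(rds_path)
from_csv <- read_csv(csv_path, show_col_types = FALSE)

class(from_rds$species)  # the factor survives the round trip
class(from_csv$species)  # the type is guessed again on import
```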

???
- serialization: the process of translating a data structure or object state into a format that can be stored, transmitted and reconstructed later (possibly in a different computer environment).

---

# 4.4 `tidyr`: Tidy Messy Data

---

background-image: url(https://raw.githubusercontent.com/tidyverse/tidyr/master/man/figures/logo.png)
background-position: 97.5% 2.5%
background-size: 7.5%
layout: true

---

## 4.4 `tidyr`: Tidy Messy Data

`tidyr` provides several functions that help you bring your data into the *tidy data* format (e.g., reshaping data, splitting columns, handling missing values or nesting data).

```r
penguins
```

???
- Let's again start with our `penguins` data set, which already is in *tidy data* format
- in the following, I highlight the dimensionality of the data to show you what happens

DIM: 344 x 8

---

## 4.4 `tidyr`: Tidy Messy Data

**Pivoting:** Convert between long and wide format using `pivot_longer()` and `pivot_wider()`.

.panelset[
.panel[.panel-name[pivot_longer()]

```r
long_penguins <- penguins %>% 
 pivot_longer(
 cols = c(species, island),
 names_to = "variable", values_to = "value"
 )

long_penguins %>% glimpse
```

```
> Rows: 688
> Columns: 8
> $ bill_length_mm <dbl> 39.1, 39.1, 39.5, 39.5, 40.3, 40.3, NA, NA, 36.7, 36.7, 3~
> $ bill_depth_mm <dbl> 18.7, 18.7, 17.4, 17.4, 18.0, 18.0, NA, NA, 19.3, 19.3, 2~
> $ flipper_length_mm <int> 181, 181, 186, 186, 195, 195, NA, NA, 193, 193, 190, 190,~
> $ body_mass_g <int> 3750, 3750, 3800, 3800, 3250, 3250, NA, NA, 3450, 3450, 3~
> $ sex <fct> male, male, female, female, female, female, NA, NA, femal~
> $ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 200~
> $ variable <chr> "species", "island", "species", "island", "species", "isl~
> $ value <fct> Adelie, Torgersen, Adelie, Torgersen, Adelie, Torgersen, ~
```
]
.panel[.panel-name[pivot_wider()]

```r
long_penguins %>% 
  pivot_wider(
    names_from = "variable", values_from = "value"
  ) %>%
  glimpse
```

```
> Rows: 344
> Columns: 8
> $ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, 42.0,~
> $ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, 20.2,~
> $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186, 180~
> $ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, 4250,~
> $ sex <fct> male, female, female, NA, female, male, female, male, NA,~
> $ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 200~
> $ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, A~
> $ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgersen, To~
```
]
]

???
`pivot_longer()`:
- now we have two rows per observation, one for each pivoted variable -> no longer in tidy format
- DIM: 688 x 8

`pivot_wider()`
- inverts `pivot_longer()`
- DIM: 344 x 8

---

## 4.4 `tidyr`: Tidy Messy Data

.right[
<img src="https://raw.githubusercontent.com/apreshill/teachthat/master/pivot/pivot_longer_smaller.gif" width="80%" height="80%" />
]

.footnote[.pull-left[
*Source: [Allison Hill](https://github.com/apreshill/teachthat/blob/master/pivot/pivot_longer_smaller.gif)*

Note: Find more information about `pivot_*()` in the [pivoting vignette](https://tidyr.tidyverse.org/articles/pivot.html).
]]

---

## 4.4 `tidyr`: Tidy Messy Data

**Nesting:** Groups similar data such that each group becomes a single row in a data frame.

```r
nested_penguins <- 
 penguins %>% 
 nest(nested_data = c(island, bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g, sex))

nested_penguins
```

```
> # A tibble: 9 x 3
> species year nested_data 
> <fct> <int> <list> 
> 1 Adelie 2007 <tibble [50 x 6]>
> 2 Adelie 2008 <tibble [50 x 6]>
> 3 Adelie 2009 <tibble [52 x 6]>
> 4 Gentoo 2007 <tibble [34 x 6]>
> 5 Gentoo 2008 <tibble [46 x 6]>
> 6 Gentoo 2009 <tibble [44 x 6]>
> 7 Chinstrap 2007 <tibble [26 x 6]>
> 8 Chinstrap 2008 <tibble [18 x 6]>
> 9 Chinstrap 2009 <tibble [24 x 6]>
```

???
- note that `nest()` produces a nested data frame with one row per species and year
- note that the `nested_data` column contains `tibbles` with six columns each and a varying number of observations
- working with nested data can be particularly helpful if you would like to apply functions to each subset of the data (e.g., fit a model for each year or for each species)
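
One way the "model per group" idea might look in practice is sketched here, assuming `dplyr`, `tidyr`, `purrr` and `palmerpenguins` are installed. The regression of body mass on flipper length is only a hypothetical illustration, not a model used elsewhere in this lecture.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(palmerpenguins)

models <- penguins %>%
  nest(data = -species) %>%   # one row per species, all other columns nested
  mutate(
    # fit one linear model per nested tibble
    fit = map(data, ~ lm(body_mass_g ~ flipper_length_mm, data = .x)),
    # pull the slope out of each fitted model
    slope = map_dbl(fit, ~ coef(.x)[["flipper_length_mm"]])
  )

models %>% select(species, slope)
```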

---

## 4.4 `tidyr`: Tidy Messy Data

**Rectangling:** Disentangles nested data structures (e.g., JSON, HTML) and brings them into *tidy data* format.

.panelset[
.panel[.panel-name[pluck()]
Extract individual objects from a nested data structure via `purrr::pluck()`.

```r
nested_penguins %>% purrr::pluck("nested_data", 1)
```

```
> # A tibble: 50 x 6
> island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex 
> <fct> <dbl> <dbl> <int> <int> <fct> 
> 1 Torgersen 39.1 18.7 181 3750 male 
> 2 Torgersen 39.5 17.4 186 3800 female
> 3 Torgersen 40.3 18 195 3250 female
> 4 Torgersen NA NA NA NA <NA> 
> 5 Torgersen 36.7 19.3 193 3450 female
> 6 Torgersen 39.3 20.6 190 3650 male 
> 7 Torgersen 38.9 17.8 181 3625 female
> 8 Torgersen 39.2 19.6 195 4675 male 
> 9 Torgersen 34.1 18.1 193 3475 <NA> 
> 10 Torgersen 42 20.2 190 4250 <NA> 
> # ... with 40 more rows
```
]
.panel[.panel-name[unnest()]
Flatten nested data structures via `tidyr::unnest()`.

```r
nested_penguins %>% unnest(cols = c(nested_data)) 
```

```
> # A tibble: 344 x 8
> species year island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
> <fct> <int> <fct> <dbl> <dbl> <int> <int>
> 1 Adelie 2007 Torgersen 39.1 18.7 181 3750
> 2 Adelie 2007 Torgersen 39.5 17.4 186 3800
> 3 Adelie 2007 Torgersen 40.3 18 195 3250
> 4 Adelie 2007 Torgersen NA NA NA NA
> 5 Adelie 2007 Torgersen 36.7 19.3 193 3450
> 6 Adelie 2007 Torgersen 39.3 20.6 190 3650
> 7 Adelie 2007 Torgersen 38.9 17.8 181 3625
> 8 Adelie 2007 Torgersen 39.2 19.6 195 4675
> 9 Adelie 2007 Torgersen 34.1 18.1 193 3475
> 10 Adelie 2007 Torgersen 42 20.2 190 4250
> # ... with 334 more rows, and 1 more variable: sex <fct>
```
]
.panel[.panel-name[hoist()]
Selectively extract individual components from an object in a nested data structure via `tidyr::hoist()`.

```r
nested_penguins %>% hoist(nested_data, hoisted_col = "bill_length_mm")
```

```
> # A tibble: 9 x 4
> species year hoisted_col nested_data 
> <fct> <int> <list> <list> 
> 1 Adelie 2007 <dbl [50]> <tibble [50 x 5]>
> 2 Adelie 2008 <dbl [50]> <tibble [50 x 5]>
> 3 Adelie 2009 <dbl [52]> <tibble [52 x 5]>
> 4 Gentoo 2007 <dbl [34]> <tibble [34 x 5]>
> 5 Gentoo 2008 <dbl [46]> <tibble [46 x 5]>
> 6 Gentoo 2009 <dbl [44]> <tibble [44 x 5]>
> 7 Chinstrap 2007 <dbl [26]> <tibble [26 x 5]>
> 8 Chinstrap 2008 <dbl [18]> <tibble [18 x 5]>
> 9 Chinstrap 2009 <dbl [24]> <tibble [24 x 5]>
```
]
]

???
Alternatively use `unnest_wider()` or `unnest_longer()` for more control over the rectangling operation.
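
A small sketch of both functions, assuming `tibble` and `tidyr` are installed. The toy objects `records` and `measurements` are hypothetical and only serve to show the two directions.

```r
library(tibble)
library(tidyr)

# Hypothetical toy data: each `info` element is a named list (a JSON-like record)
records <- tibble(
  id   = 1:2,
  info = list(
    list(name = "a", score = 1),
    list(name = "b", score = 2)
  )
)
# unnest_wider(): every element of the list becomes its own column
wide <- records %>% unnest_wider(info)

# Hypothetical toy data: each `values` element is a vector of varying length
measurements <- tibble(id = 1:2, values = list(c(1, 2), 3))
# unnest_longer(): every element of the vector becomes its own row
long <- measurements %>% unnest_longer(values)
```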

---

## 4.4 `tidyr`: Tidy Messy Data

**Splitting** and **Combining:** Transforms a single character column into multiple columns and vice versa.

.panelset[
.panel[.panel-name[unite()]

```r
penguins %>% unite(col = "species_gender", c(species, sex), sep = "_", remove = T)
```

```
> # A tibble: 344 x 7
> species_gender island bill_length_mm bill_depth_mm flipper_length_~ body_mass_g
> <chr> <fct> <dbl> <dbl> <int> <int>
> 1 Adelie_male Torgersen 39.1 18.7 181 3750
> 2 Adelie_female Torgersen 39.5 17.4 186 3800
> 3 Adelie_female Torgersen 40.3 18 195 3250
> 4 Adelie_NA Torgersen NA NA NA NA
> 5 Adelie_female Torgersen 36.7 19.3 193 3450
> 6 Adelie_male Torgersen 39.3 20.6 190 3650
> 7 Adelie_female Torgersen 38.9 17.8 181 3625
> 8 Adelie_male Torgersen 39.2 19.6 195 4675
> 9 Adelie_NA Torgersen 34.1 18.1 193 3475
> 10 Adelie_NA Torgersen 42 20.2 190 4250
> # ... with 334 more rows, and 1 more variable: year <int>
```
]
.panel[.panel-name[separate()]
Separate a single column, containing multiple values, into multiple columns.

```r
penguins %>% separate(bill_length_mm, sep = 2, into = c("cm", "mm"))
```

```
> # A tibble: 344 x 9
> species island cm mm bill_depth_mm flipper_length_~ body_mass_g sex year
> <fct> <fct> <chr> <chr> <dbl> <int> <int> <fct> <int>
> 1 Adelie Torger~ 39 ".1" 18.7 181 3750 male 2007
> 2 Adelie Torger~ 39 ".5" 17.4 186 3800 fema~ 2007
> 3 Adelie Torger~ 40 ".3" 18 195 3250 fema~ 2007
> 4 Adelie Torger~ <NA> <NA> NA NA NA <NA> 2007
> 5 Adelie Torger~ 36 ".7" 19.3 193 3450 fema~ 2007
> 6 Adelie Torger~ 39 ".3" 20.6 190 3650 male 2007
> 7 Adelie Torger~ 38 ".9" 17.8 181 3625 fema~ 2007
> 8 Adelie Torger~ 39 ".2" 19.6 195 4675 male 2007
> 9 Adelie Torger~ 34 ".1" 18.1 193 3475 <NA> 2007
> 10 Adelie Torger~ 42 "" 20.2 190 4250 <NA> 2007
> # ... with 334 more rows
```
]
.panel[.panel-name[separate_rows()]
Separate a single column, containing multiple values, into multiple rows.

```r
penguins %>% separate_rows(island, sep = "s", convert = T)
```

```
> # A tibble: 564 x 8
> species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex 
> <fct> <chr> <dbl> <dbl> <int> <int> <fct> 
> 1 Adelie Torger 39.1 18.7 181 3750 male 
> 2 Adelie en 39.1 18.7 181 3750 male 
> 3 Adelie Torger 39.5 17.4 186 3800 female
> 4 Adelie en 39.5 17.4 186 3800 female
> 5 Adelie Torger 40.3 18 195 3250 female
> 6 Adelie en 40.3 18 195 3250 female
> 7 Adelie Torger NA NA NA NA <NA> 
> 8 Adelie en NA NA NA NA <NA> 
> 9 Adelie Torger 36.7 19.3 193 3450 female
> 10 Adelie en 36.7 19.3 193 3450 female
> # ... with 554 more rows, and 1 more variable: year <int>
```
]
]

???
`separate()` can also split a column on a character match (a character `sep` is treated as a regular expression) instead of on a position
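
A minimal sketch of splitting on a character match, assuming `tibble` and `tidyr` are installed. The toy tibble `toy` and its column `label` are hypothetical.

```r
library(tibble)
library(tidyr)

# Hypothetical toy data: two values glued together by an underscore
toy <- tibble(label = c("Adelie_male", "Gentoo_female"))

# A character `sep` is matched as a regular expression,
# whereas a numeric `sep` is interpreted as a position
split_toy <- toy %>%
  separate(label, into = c("species", "sex"), sep = "_")

split_toy
```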

---

## 4.4 `tidyr`: Tidy Messy Data

**Handling missing values:** Drop or replace explicit or implicit missing values (`NA`).

```r
incompl_penguins
```

```
> # A tibble: 4 x 3
> species year measurement
> <chr> <dbl> <dbl>
> 1 Adelie 2007 31.0
> 2 Adelie 2008 39.7
> 3 Gentoo 2008 43.3
> 4 Chinstrap 2007 NA
```
]
.panel[.panel-name[complete()]
Make implicit missing values explicit.

```r
incompl_penguins %>% 
  complete(species, year, fill = list(measurement = NA))
```

```
> # A tibble: 6 x 3
> species year measurement
> <chr> <dbl> <dbl>
> 1 Adelie 2007 31.0
> 2 Adelie 2008 39.7
> 3 Chinstrap 2007 NA 
> 4 Chinstrap 2008 NA 
> 5 Gentoo 2007 NA 
> 6 Gentoo 2008 43.3
```
]
.panel[.panel-name[drop_na()]
Make explicit missing values implicit.

```r
incompl_penguins %>% 
  drop_na(measurement)
```

```
> # A tibble: 3 x 3
> species year measurement
> <chr> <dbl> <dbl>
> 1 Adelie 2007 31.0
> 2 Adelie 2008 39.7
> 3 Gentoo 2008 43.3
```
]
.panel[.panel-name[fill()]
Replace missing values with the next/previous value.

```r
incompl_penguins %>% 
  fill(measurement, .direction = "down")
```

```
> # A tibble: 4 x 3
> species year measurement
> <chr> <dbl> <dbl>
> 1 Adelie 2007 31.0
> 2 Adelie 2008 39.7
> 3 Gentoo 2008 43.3
> 4 Chinstrap 2007 43.3
```
]
.panel[.panel-name[replace_na()]
Replace missing values with a pre-defined value.

```r
incompl_penguins %>%
  replace_na(replace = list(measurement = mean(.$measurement, na.rm = T)))
```

```
> # A tibble: 4 x 3
> species year measurement
> <chr> <dbl> <dbl>
> 1 Adelie 2007 31.0
> 2 Adelie 2008 39.7
> 3 Gentoo 2008 43.3
> 4 Chinstrap 2007 38.0
```
]
]

.footnote[
*Note: Find more information and functions on the `tidyr` [cheat sheet](https://raw.githubusercontent.com/rstudio/cheatsheets/master/data-import.pdf).*
]

???
Note: function arguments in the tidyverse are prefixed with a dot for one of two reasons:
- the function is still experimental, i.e. the developers are still deciding on the best way to implement and name it
- the function is regularly applied together with another function, and the dot keeps you from confusing the arguments of the outer function with those of the inner function

---

# 4.5 `dplyr`: A Grammar of Data Manipulation

---

background-image: url(https://raw.githubusercontent.com/tidyverse/dplyr/master/man/figures/logo.png)
background-position: 97.5% 2.5%
background-size: 7.5%
layout: true

---

## 4.5 `dplyr`: A Grammar of Data Manipulation

`dplyr` provides a set of functions for manipulating data frame objects (e.g., `tibbles`) while relying on a consistent grammar. Functions are intuitively represented by "verbs" that reflect the underlying operations and always output a new or modified `tibble`.

**Operations on rows:**
- `filter()` picks rows that meet one or several logical criteria
- `slice()` picks rows based on their location in the data
- `arrange()` changes the order of rows

**Operations on columns:**
- `select()` picks or drops certain columns
- `rename()` changes the column names
- `relocate()` changes the order of columns
- `mutate()` transforms the column values and/or creates new columns

**Operations on grouped data:**
- `group_by()` partitions data based on one or several columns
- `summarise()` reduces a group of data into a single row
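
As a rough illustration of how these verbs chain together, the following sketch combines several of them in one pipeline. It assumes `dplyr` and `palmerpenguins` are installed; the names `mass_by_species`, `bm_kg`, `avg_bm_kg` and `count` are illustrative choices.

```r
library(dplyr)
library(palmerpenguins)

mass_by_species <- penguins %>%
  filter(!is.na(body_mass_g)) %>%              # rows: drop missing body mass
  select(species, island, body_mass_g) %>%     # columns: keep three variables
  mutate(bm_kg = body_mass_g / 1000) %>%       # columns: add body mass in kg
  group_by(species) %>%                        # groups: one partition per species
  summarise(avg_bm_kg = mean(bm_kg), count = n(), .groups = "drop") %>%
  arrange(desc(avg_bm_kg))                     # rows: heaviest species first

mass_by_species
```

Each verb takes a data frame as its first argument and returns a data frame, which is what makes the steps composable with the pipe.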

---

## 4.5 `dplyr`: A Grammar of Data Manipulation

**Operations on rows:** `filter()` picks rows that meet one or several logical criteria

Filter for all penguins of `species` "Adelie".

```r
penguins %>% 
  filter(species == "Adelie")
```

Filter for all penguins with a missing value in the `bill_length_mm` measurement.

```r
penguins %>% 
  filter(is.na(bill_length_mm) == T)  # shorter: filter(is.na(bill_length_mm))
  # equivalent: filter(!is.na(bill_length_mm) == F)
```

Filter for all penguins observed prior to `year` 2008 or subsequent to `year` 2008 and where the body mass (`body_mass_g`) lies between 3,800 and 4,000 grams.

```r
penguins %>% 
 filter(between(body_mass_g, 3800, 4000) & (year < 2008 | year > 2008))
```

???
 - Note that using `=` instead of `==` is a common mistake among beginners: `=` assigns a value (like `<-`), whereas `==` tests for equality.

---

## 4.5 `dplyr`: A Grammar of Data Manipulation

**Operations on rows:** `slice()` picks rows based on their location in the data

.panelset[
.panel[.panel-name[slice()]

```r
penguins %>% 
  slice(23:27)
```

```
> # A tibble: 5 x 8
> species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex 
> <fct> <fct> <dbl> <dbl> <int> <int> <fct> 
> 1 Adelie Biscoe 35.9 19.2 189 3800 female
> 2 Adelie Biscoe 38.2 18.1 185 3950 male 
> 3 Adelie Biscoe 38.8 17.2 180 3800 male 
> 4 Adelie Biscoe 35.3 18.9 187 3800 female
> 5 Adelie Biscoe 40.6 18.6 183 3550 male 
> # ... with 1 more variable: year <int>
```
]
.panel[.panel-name[slice_head()]
Pick the first `n` rows (vice versa for `slice_tail()`).

```r
penguins %>% 
  slice_head(n = 5)  # alternatively: slice_head(prop = 0.05)
```

```
> # A tibble: 5 x 8
> species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex 
> <fct> <fct> <dbl> <dbl> <int> <int> <fct> 
> 1 Adelie Torgersen 39.1 18.7 181 3750 male 
> 2 Adelie Torgersen 39.5 17.4 186 3800 female
> 3 Adelie Torgersen 40.3 18 195 3250 female
> 4 Adelie Torgersen NA NA NA NA <NA> 
> 5 Adelie Torgersen 36.7 19.3 193 3450 female
> # ... with 1 more variable: year <int>
```
]
.panel[.panel-name[slice_sample()]
Pick a random sample of `n` rows (with or without replacement).

```r
penguins %>% 
  slice_sample(n = 5)
```

```
> # A tibble: 5 x 8
> species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex 
> <fct> <fct> <dbl> <dbl> <int> <int> <fct>
> 1 Adelie Dream 35.6 17.5 191 3175 fema~
> 2 Chinstrap Dream 51.4 19 201 3950 male 
> 3 Adelie Biscoe 35.3 18.9 187 3800 fema~
> 4 Adelie Torgers~ 38.9 17.8 181 3625 fema~
> 5 Chinstrap Dream 50.2 18.7 198 3775 fema~
> # ... with 1 more variable: year <int>
```
]
.panel[.panel-name[slice_max()]
Pick the `n` rows with the largest value (vice versa for `slice_min()`).

```r
penguins %>% 
  slice_max(bill_length_mm, n = 5)
```

```
> # A tibble: 5 x 8
> species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex 
> <fct> <fct> <dbl> <dbl> <int> <int> <fct> 
> 1 Gentoo Biscoe 59.6 17 230 6050 male 
> 2 Chinstrap Dream 58 17.8 181 3700 female
> 3 Gentoo Biscoe 55.9 17 228 5600 male 
> 4 Chinstrap Dream 55.8 19.8 207 4000 male 
> 5 Gentoo Biscoe 55.1 16 230 5850 male 
> # ... with 1 more variable: year <int>
```
]
]

???
- use `slice_sample()` with replacement to generate bootstrap samples
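
A minimal sketch of bootstrapping with `slice_sample()`, assuming `dplyr` and `palmerpenguins` are installed. The seed, the number of resamples (200) and the names `boot_sample` and `boot_means` are arbitrary choices for illustration.

```r
library(dplyr)
library(palmerpenguins)

set.seed(2021)  # arbitrary seed, only for reproducibility

# One bootstrap sample: draw as many rows as the data has, with replacement
boot_sample <- penguins %>%
  slice_sample(prop = 1, replace = TRUE)

# Bootstrap distribution of the mean body mass from 200 resamples
boot_means <- replicate(
  200,
  penguins %>%
    slice_sample(prop = 1, replace = TRUE) %>%
    summarise(avg = mean(body_mass_g, na.rm = TRUE)) %>%
    pull(avg)
)

sd(boot_means)  # bootstrap estimate of the standard error of the mean
```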

---

## 4.5 `dplyr`: A Grammar of Data Manipulation

**Operations on rows:** `arrange()` changes the order of rows

.panelset[
.panel[.panel-name[Ascending]

```r
penguins %>% 
  arrange(body_mass_g) %>% 
  slice_head(n = 5)  # similar to: slice_min(body_mass_g, n = 5, with_ties = FALSE)
```

```
> # A tibble: 5 x 8
> species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex 
> <fct> <fct> <dbl> <dbl> <int> <int> <fct> 
> 1 Chinstrap Dream 46.9 16.6 192 2700 female
> 2 Adelie Biscoe 36.5 16.6 181 2850 female
> 3 Adelie Biscoe 36.4 17.1 184 2850 female
> 4 Adelie Biscoe 34.5 18.1 187 2900 female
> 5 Adelie Dream 33.1 16.1 178 2900 female
> # ... with 1 more variable: year <int>
```
]
.panel[.panel-name[Descending]
Return the five penguins with the highest body mass.

```r
penguins %>% 
  arrange(desc(body_mass_g)) %>% 
  slice_head(n = 5)  # similar to: slice_max(body_mass_g, n = 5, with_ties = FALSE)
```

```
> # A tibble: 5 x 8
> species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex 
> <fct> <fct> <dbl> <dbl> <int> <int> <fct>
> 1 Gentoo Biscoe 49.2 15.2 221 6300 male 
> 2 Gentoo Biscoe 59.6 17 230 6050 male 
> 3 Gentoo Biscoe 51.1 16.3 220 6000 male 
> 4 Gentoo Biscoe 48.8 16.2 222 6000 male 
> 5 Gentoo Biscoe 45.2 16.4 223 5950 male 
> # ... with 1 more variable: year <int>
```
]
]

???
- by default, `arrange()` sorts from smallest to largest
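
One subtlety when swapping `arrange()` plus `slice_head()` for `slice_min()` is the treatment of ties. The sketch below assumes `dplyr` and `tibble` are installed and uses a hypothetical toy tibble `toy`.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data with a tie at x = 2
toy <- tibble(x = c(3, 2, 1, 2))

# Sorting and cutting always returns exactly n rows
by_arrange <- toy %>% arrange(x) %>% slice_head(n = 2)

# slice_min() keeps ties by default and may return more than n rows
by_min <- toy %>% slice_min(x, n = 2)

# with_ties = FALSE restores the "exactly n rows" behaviour
by_min_strict <- toy %>% slice_min(x, n = 2, with_ties = FALSE)

c(nrow(by_arrange), nrow(by_min), nrow(by_min_strict))
```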

---

## 4.5 `dplyr`: A Grammar of Data Manipulation

**Operations on columns:** `select()` picks or drops certain columns

```r
penguins %>% 
  select(1:3) %>% 
  glimpse
```

```
> Rows: 344
> Columns: 3
> $ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel~
> $ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgersen, Torge~
> $ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, 42.0, 37~
```
]
.panel[.panel-name[select() by name]

```r
penguins %>% 
  select(species, island, bill_length_mm) %>% 
  glimpse
```

---

## 4.5 `dplyr`: A Grammar of Data Manipulation

**Operations on columns:** `select()` picks or drops certain columns (using `tidyselect` helpers)

```r
penguins %>% 
  select(everything()) %>% 
  glimpse
```

```r
penguins %>% 
  select(last_col()) %>% 
  glimpse
```

```
> Rows: 344
> Columns: 1
> $ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007~
```
]
.panel[.panel-name[starts_with()]
Select columns whose names start with a certain string.

```r
penguins %>% 
  select(starts_with("bill")) %>% 
  glimpse
```

```
> Rows: 344
> Columns: 2
> $ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, 42.0, 37~
> $ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, 20.2, 17~
```
]
.panel[.panel-name[ends_with()]
Select columns whose names end with a certain string.

```r
penguins %>% 
  select(ends_with("mm")) %>% 
  glimpse
```

```
> Rows: 344
> Columns: 3
> $ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, 42.0,~
> $ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, 20.2,~
> $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186, 180~
```
]
.panel[.panel-name[contains()]
Select columns whose names contain a certain string.

```r
penguins %>% 
  select(contains("e") & contains("a")) %>% 
  glimpse
```

```
> Rows: 344
> Columns: 1
> $ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007~
```
]
.panel[.panel-name[matches()]
Select columns based on a regular expression ([regex](https://www.rexegg.com/regex-quickstart.html)).

```r
penguins %>% 
  select(matches("_\\w*_mm$")) %>% 
  glimpse
```

```
> Rows: 344
> Columns: 3
> $ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, 42.0,~
> $ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, 20.2,~
> $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186, 180~
```
]
.panel[.panel-name[where()]
Select columns for which a function evaluates to `TRUE`.

```r
penguins %>% 
  select(where(is.numeric)) %>% 
  glimpse
```

```
> Rows: 344
> Columns: 5
> $ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, 42.0,~
> $ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, 20.2,~
> $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186, 180~
> $ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, 4250,~
> $ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 200~
```
]
]

---

## 4.5 `dplyr`: A Grammar of Data Manipulation

**Operations on columns:** `select()` picks or drops certain columns

Which columns are returned by the following queries?

```r
penguins %>% 
  select(starts_with("s"))
```

```r
penguins %>% 
  select(ends_with("mm"))
```

```r
penguins %>% 
  select(contains("mm"))
```

```r
penguins %>% 
  select(-contains("mm"))
```

```r
penguins %>% 
  select(where(~ is.numeric(.))) %>%  # equivalent to: select(where(is.numeric))
  select(where(~ mean(., na.rm = T) > 1000))
```

???
deselect:
- to deselect columns, put a minus sign in front of the selection
where:
- supply a function that takes a column vector and returns `TRUE` or `FALSE`
- to customize a function inside another function you usually need the formula (`~`) notation (see the `purrr` part); a function that only needs the column as its single argument can be passed by name
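
For reference, the quiz queries can be checked by printing only the column names. This sketch assumes `dplyr` and `palmerpenguins` are installed; the object names `cols_s`, `cols_not_mm` and `cols_large` are illustrative.

```r
library(dplyr)
library(palmerpenguins)

# columns starting with "s"
cols_s <- penguins %>% select(starts_with("s")) %>% names()
cols_s        # "species" "sex"

# all columns that do NOT contain "mm"
cols_not_mm <- penguins %>% select(-contains("mm")) %>% names()
cols_not_mm   # "species" "island" "body_mass_g" "sex" "year"

# numeric columns whose mean exceeds 1000
cols_large <- penguins %>%
  select(where(is.numeric)) %>%
  select(where(~ mean(., na.rm = TRUE) > 1000)) %>%
  names()
cols_large    # "body_mass_g" "year"
```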

---

## 4.5 `dplyr`: A Grammar of Data Manipulation

**Operations on columns:** `rename()` changes the column names

Rename the column `body_mass_g` to `bm` and the column `sex` to `gender`.

```r
penguins %>% rename(bm = body_mass_g, gender = sex) %>% 
  colnames()
```

```
> [1] "species"           "island"            "bill_length_mm"    "bill_depth_mm"    
> [5] "flipper_length_mm" "bm"                "gender"            "year"
```
Convert the names of the columns that include the string `"mm"` to upper case.

```r
penguins %>% rename_with(.fn = toupper, .cols = contains("mm")) %>% 
  colnames()
```

```
> [1] "species"           "island"            "BILL_LENGTH_MM"    "BILL_DEPTH_MM"    
> [5] "FLIPPER_LENGTH_MM" "body_mass_g"       "sex"               "year"
```

---

## 4.5 `dplyr`: A Grammar of Data Manipulation

**Operations on columns:** `relocate()` changes the order of columns

Change the order of columns in the `tibble` according to the following scheme:
1. place `species` after `body_mass_g`
2. place `sex` before `species`
3. place `island` at the end

```r
penguins %>% 
  relocate(species, .after = body_mass_g) %>%
  relocate(sex, .before = species) %>%
  relocate(island, .after = last_col()) %>%
  colnames()
```

```
> [1] "bill_length_mm"    "bill_depth_mm"     "flipper_length_mm" "body_mass_g"      
> [5] "sex"               "species"           "year"              "island"
```

---

## 4.5 `dplyr`: A Grammar of Data Manipulation

**Operations on columns:** `mutate()` transforms the column values and/or creates new columns

Create a new `bm_kg` variable which reflects `body_mass_g` measured in kilograms.

```r
penguins %>% 
  mutate(bm_kg = body_mass_g / 1000, .keep = "all", .after = island) %>% 
  slice_head(n = 5)
```

```
> # A tibble: 5 x 9
> species island bm_kg bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
> <fct> <fct> <dbl> <dbl> <dbl> <int> <int>
> 1 Adelie Torgersen 3.75 39.1 18.7 181 3750
> 2 Adelie Torgersen 3.8 39.5 17.4 186 3800
> 3 Adelie Torgersen 3.25 40.3 18 195 3250
> 4 Adelie Torgersen NA NA NA NA NA
> 5 Adelie Torgersen 3.45 36.7 19.3 193 3450
> # ... with 2 more variables: sex <fct>, year <int>
```

- Use the `.keep` argument to specify which columns to keep after manipulation.
- Use the `.before`/`.after` arguments to specify the position of the new column.
- To overwrite an existing column, simply reuse its column name.
- To keep only the new column, use `dplyr::transmute()`.
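
A short sketch of how the `.keep` options differ, assuming `dplyr` (version 1.0.0 or later) and `palmerpenguins` are installed. The object names are illustrative.

```r
library(dplyr)
library(palmerpenguins)

# "used": keep only the columns that were used to create the new column
kept_used <- penguins %>%
  mutate(bm_kg = body_mass_g / 1000, .keep = "used")
names(kept_used)   # "body_mass_g" "bm_kg"

# "none": keep only the new column (plus grouping variables, if any)
kept_none <- penguins %>%
  mutate(bm_kg = body_mass_g / 1000, .keep = "none")
names(kept_none)   # "bm_kg"

# transmute() gives the same result as .keep = "none" here
only_new <- penguins %>%
  transmute(bm_kg = body_mass_g / 1000)
```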

---

## 4.5 `dplyr`: A Grammar of Data Manipulation

**Operations on columns:** `mutate()` transforms the column values and/or creates new columns

Create a *one-hot encoded* variable for `sex`.

```r
penguins %>% 
  mutate(
    sex_binary = case_when(
      sex == "male" ~ 1,
      sex == "female" ~ 0),
    .keep = "all", .after = island
  ) %>% 
  slice_head(n = 3)
```

```
> # A tibble: 3 x 9
> species island sex_binary bill_length_mm bill_depth_mm flipper_length_~ body_mass_g
> <fct> <fct> <dbl> <dbl> <dbl> <int> <int>
> 1 Adelie Torgersen 1 39.1 18.7 181 3750
> 2 Adelie Torgersen 0 39.5 17.4 186 3800
> 3 Adelie Torgersen 0 40.3 18 195 3250
> # ... with 2 more variables: sex <fct>, year <int>
```

.footnote[
_**One-hot Encoding:** Encoding a categorical variable with `C` factor levels as `C` dummy variables (in modeling you often create only `C-1` dummies, because the `C` dummies always sum to one and are therefore perfectly collinear with the intercept)._
]
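
As a sketch of the difference between `C` and `C-1` dummies, the following uses base R's `model.matrix()` on `species` instead of the `case_when()` approach from the slide. It assumes `palmerpenguins` is installed and relies on R's default treatment contrasts.

```r
library(palmerpenguins)

# C dummies: dropping the intercept (`- 1`) yields one column per species
one_hot <- model.matrix(~ species - 1, data = penguins)
colnames(one_hot)   # "speciesAdelie" "speciesChinstrap" "speciesGentoo"

# C - 1 dummies: with an intercept, the first level (Adelie) is the reference
dummies <- model.matrix(~ species, data = penguins)
colnames(dummies)   # "(Intercept)" "speciesChinstrap" "speciesGentoo"

# every row of the one-hot matrix contains exactly one 1
all(rowSums(one_hot) == 1)
```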

???
case_when:
- vectorizes multiple `if_else()` statements
- two-sided formulas: LHS tests the condition, RHS specifies the replacement value
- for unmatched cases, the function returns NA
- use LHS `TRUE` to capture all cases not explicitly specified beforehand
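
A minimal sketch of the last two points, assuming `dplyr` is installed. The vector `sex` is a hypothetical toy input and the fallback code `-1` is an arbitrary choice.

```r
library(dplyr)

# Hypothetical toy input with one missing value
sex <- c("male", "female", NA)

# Unmatched cases (here: the NA) are returned as NA
without_fallback <- case_when(
  sex == "male"   ~ 1,
  sex == "female" ~ 0
)

# A final `TRUE` condition captures everything not matched before
with_fallback <- case_when(
  sex == "male"   ~ 1,
  sex == "female" ~ 0,
  TRUE            ~ -1
)
```

Conditions are evaluated in order, so the `TRUE` fallback has to come last.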

---

## 4.5 `dplyr`: A Grammar of Data Manipulation

**Operations on columns:** `mutate()` transforms the column values and/or creates new columns

Transform measurement variables to meters.

```r
penguins %>% 
  mutate(
    across(contains("mm"), ~ . / 1000),
    .keep = "all"
  ) %>% 
  slice_head(n = 3)
```

```
> # A tibble: 3 x 8
> species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex 
> <fct> <fct> <dbl> <dbl> <dbl> <int> <fct> 
> 1 Adelie Torgersen 0.0391 0.0187 0.181 3750 male 
> 2 Adelie Torgersen 0.0395 0.0174 0.186 3800 female
> 3 Adelie Torgersen 0.0403 0.018 0.195 3250 female
> # ... with 1 more variable: year <int>
```

???
across:
- apply same transformation across multiple columns
- allows you to use the semantics you know from the `select()` function
- does not require you to explicitly specify a column name as it only transforms existing columns
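
If the original columns should be kept, `across()` can write its results to new columns through the `.names` argument. A sketch, assuming `dplyr` and `palmerpenguins` are installed; the suffix `_in_cm` and the object name `penguins_cm` are illustrative choices.

```r
library(dplyr)
library(palmerpenguins)

penguins_cm <- penguins %>%
  mutate(
    # divide every millimeter column by 10 and store the result under a
    # new name, so the original columns stay untouched
    across(ends_with("_mm"), ~ .x / 10, .names = "{.col}_in_cm")
  )

names(penguins_cm)
```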

---

## 4.5 `dplyr`: A Grammar of Data Manipulation

**Operations on columns:** `mutate()` transforms the column values and/or creates new columns

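Define `species`, `island` and `sex` as categorical variables, i.e. *factors*, using `across()`. (In `penguins` these columns are already factors, so `where(is.character)` matches no column and the data is returned unchanged.)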

```r
penguins %>% 
  mutate(
    across(where(is.character), as.factor),
    .keep = "all"
  ) %>% 
  slice_head(n = 3)
```

```
> # A tibble: 3 x 8
> species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex 
> <fct> <fct> <dbl> <dbl> <int> <int> <fct> 
> 1 Adelie Torgersen 39.1 18.7 181 3750 male 
> 2 Adelie Torgersen 39.5 17.4 186 3800 female
> 3 Adelie Torgersen 40.3 18 195 3250 female
> # ... with 1 more variable: year <int>
```

---

## 4.5 `dplyr`: A Grammar of Data Manipulation

**Operations on grouped data:** `group_by()` partitions data based on one or several columns

```r
penguins %>% group_by(species)
```

```
> # A tibble: 344 x 8
> # Groups: species [3]
> species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex 
> <fct> <fct> <dbl> <dbl> <int> <int> <fct>
> 1 Adelie Torgersen 39.1 18.7 181 3750 male 
> 2 Adelie Torgersen 39.5 17.4 186 3800 fema~
> 3 Adelie Torgersen 40.3 18 195 3250 fema~
> 4 Adelie Torgersen NA NA NA NA <NA> 
> 5 Adelie Torgersen 36.7 19.3 193 3450 fema~
> 6 Adelie Torgersen 39.3 20.6 190 3650 male 
> 7 Adelie Torgersen 38.9 17.8 181 3625 fema~
> 8 Adelie Torgersen 39.2 19.6 195 4675 male 
> 9 Adelie Torgersen 34.1 18.1 193 3475 <NA> 
> 10 Adelie Torgersen 42 20.2 190 4250 <NA> 
> # ... with 334 more rows, and 1 more variable: year <int>
```

Use `group_keys()`, `group_indices()` and `group_vars()` to access grouping keys, group indices per row and grouping variables.
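
A short sketch of the three helpers, assuming `dplyr` and `palmerpenguins` are installed. The object names are illustrative.

```r
library(dplyr)
library(palmerpenguins)

by_species <- penguins %>% group_by(species)

vars    <- group_vars(by_species)     # names of the grouping columns
keys    <- group_keys(by_species)     # one row per group
indices <- group_indices(by_species)  # group number of every row

vars            # "species"
nrow(keys)      # 3
length(indices) # 344
```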

---

## 4.5 `dplyr`: A Grammar of Data Manipulation

**Operations on grouped data:** `group_by()` partitions data based on one or several columns

Under the hood `group_by()` changes the representation of the `tibble` and transforms it into a grouped data frame (`grouped_df`). This allows us to operate on the subgroups individually using `summarise()`.

**Operations on grouped data:** `summarise()` reduces a group of data into a single row

.panelset[
.panel[.panel-name[univariate]

```r
penguins %>% group_by(species) %>% summarise(count = n(), .groups = "drop")
```

```
> # A tibble: 3 x 2
> species count
> <fct> <int>
> 1 Adelie 152
> 2 Chinstrap 68
> 3 Gentoo 124
```
]
.panel[.panel-name[bivariate]

```r
penguins %>% group_by(species, sex) %>% summarise(count = n(), .groups = "drop")
```

```
> # A tibble: 8 x 3
> species sex count
> <fct> <fct> <int>
> 1 Adelie female 73
> 2 Adelie male 73
> 3 Adelie <NA> 6
> 4 Chinstrap female 34
> 5 Chinstrap male 34
> 6 Gentoo female 58
> 7 Gentoo male 61
> 8 Gentoo <NA> 5
```
]
]

???
- use `.groups = ` to indicate what happens to the groups after summarising them
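
The effect of the `.groups` argument can be checked with `group_vars()`. A sketch, assuming `dplyr` (version 1.0.0 or later) and `palmerpenguins` are installed; the object names are illustrative.

```r
library(dplyr)
library(palmerpenguins)

by_two <- penguins %>% group_by(species, sex)

# "drop_last": peel off the last grouping variable
g_drop_last <- by_two %>%
  summarise(count = n(), .groups = "drop_last") %>%
  group_vars()

# "keep": retain the full grouping structure
g_keep <- by_two %>%
  summarise(count = n(), .groups = "keep") %>%
  group_vars()

# "drop": return an ungrouped tibble
g_drop <- by_two %>%
  summarise(count = n(), .groups = "drop") %>%
  group_vars()
```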

---

## 4.5 `dplyr`: A Grammar of Data Manipulation

**Operations on grouped data:** `group_by()` partitions data based on one or several columns and `summarise()` reduces a group of data into a single row

```r
penguins %>%
  group_by(species) %>%
  summarise(
    across(contains("mm"), ~ mean(., na.rm = T), .names = "{.col}_avg"),
    .groups = "drop"
  )
```

```
> # A tibble: 3 x 4
> species bill_length_mm_avg bill_depth_mm_avg flipper_length_mm_avg
> <fct> <dbl> <dbl> <dbl>
> 1 Adelie 38.8 18.3 190.
> 2 Chinstrap 48.8 18.4 196.
> 3 Gentoo 47.5 15.0 217.
```

Using `group_by()` followed by `summarise()` and `ungroup()` reflects the **split-apply-combine paradigm** in data analysis: split the data into partitions, apply some function to each partition and then combine the results.
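
To make the three steps visible, the sketch below spells them out by hand with base R's `split()` and `lapply()` and compares the result to the `dplyr` version. It assumes `dplyr`, `tibble` and `palmerpenguins` are installed; the manual variant is for illustration only, and the names `manual`, `grouped` and `avg_bm` are assumptions.

```r
library(dplyr)
library(tibble)
library(palmerpenguins)

manual <- penguins %>%
  split(.$species) %>%                                   # split
  lapply(function(d) {                                   # apply
    tibble(
      species = d$species[1],
      avg_bm  = mean(d$body_mass_g, na.rm = TRUE)
    )
  }) %>%
  bind_rows()                                            # combine

grouped <- penguins %>%
  group_by(species) %>%
  summarise(avg_bm = mean(body_mass_g, na.rm = TRUE), .groups = "drop")

all.equal(manual$avg_bm, grouped$avg_bm)
```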

???
- the true potential is unleashed if you combine `group_by` and `summarise`
- split-apply-combine paradigm particularly useful in parallel processing

---

## 4.5 `dplyr`: A Grammar of Data Manipulation

**Operations on grouped data:** `group_by()` partitions data based on one or several columns and `summarise()` reduces a group of data into a single row

<img src="https://raw.githubusercontent.com/allisonhorst/stats-illustrations/master/rstats-artwork/group_by_ungroup.png" width="60%" height="60%" style="float:left; padding:10px" />
 
*Note: Instead of using `ungroup()` you may also set the `.groups` argument in `summarise()` equal to "drop".*

*But never forget to ungroup your data, otherwise you may run into errors later on in your analysis!*
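
A small sketch of what can go wrong, assuming `dplyr` and `palmerpenguins` are installed: a verb that is meant to act on the whole data silently acts on each group instead.

```r
library(dplyr)
library(palmerpenguins)

by_species <- penguins %>% group_by(species)

# Still grouped: slice_head() returns the first row of EACH species
n_grouped <- by_species %>%
  slice_head(n = 1) %>%
  nrow()

# Ungrouped: slice_head() returns the first row of the whole data
n_ungrouped <- by_species %>%
  ungroup() %>%
  slice_head(n = 1) %>%
  nrow()

c(n_grouped, n_ungrouped)
```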

???
- now let's look at some more advanced use cases

---

## 4.5 `dplyr`: A Grammar of Data Manipulation

**Stacked `group_by()`:** Use `.add = T` to add new grouping variables (otherwise the first is overridden)

```r
penguins %>% 
  group_by(species) %>% 
  group_by(year, .add = T)   # equivalent to: group_by(species, year)
```

```
> # A tibble: 344 x 8
> # Groups: species, year [9]
> species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex 
> <fct> <fct> <dbl> <dbl> <int> <int> <fct>
> 1 Adelie Torgersen 39.1 18.7 181 3750 male 
> 2 Adelie Torgersen 39.5 17.4 186 3800 fema~
> 3 Adelie Torgersen 40.3 18 195 3250 fema~
> 4 Adelie Torgersen NA NA NA NA <NA> 
> 5 Adelie Torgersen 36.7 19.3 193 3450 fema~
> 6 Adelie Torgersen 39.3 20.6 190 3650 male 
> 7 Adelie Torgersen 38.9 17.8 181 3625 fema~
> 8 Adelie Torgersen 39.2 19.6 195 4675 male 
> 9 Adelie Torgersen 34.1 18.1 193 3475 <NA> 
> 10 Adelie Torgersen 42 20.2 190 4250 <NA> 
> # ... with 334 more rows, and 1 more variable: year <int>
```

---

## 4.5 `dplyr`: A Grammar of Data Manipulation

**Apply multiple summary functions:** Provide a list of `purrr`-style functions to `across()`

```r
penguins %>%
  group_by(species) %>%
  summarise(
    across(
      contains("mm"),
      list(avg = ~ mean(., na.rm = T), sd = ~ sd(., na.rm = T)),
      .names = "{.col}_{.fn}"
    ),
    .groups = "drop"
  )
```

```
> # A tibble: 3 x 7
> species bill_length_mm_avg bill_length_mm_sd bill_depth_mm_avg bill_depth_mm_sd
> <fct> <dbl> <dbl> <dbl> <dbl>
> 1 Adelie 38.8 2.66 18.3 1.22 
> 2 Chinstrap 48.8 3.34 18.4 1.14 
> 3 Gentoo 47.5 3.08 15.0 0.981
> # ... with 2 more variables: flipper_length_mm_avg <dbl>, flipper_length_mm_sd <dbl>
```

---

## 4.5 `dplyr`: A Grammar of Data Manipulation

**Changed behavior of `mutate()`:** Summary functions, e.g., `mean()` or `sd()` now operate on partitions of the data instead of on the whole data

```r
penguins %>%
  group_by(species) %>% 
  mutate(stand_bm = (body_mass_g - mean(body_mass_g, na.rm = T)) / sd(body_mass_g, na.rm = T)) %>% 
  glimpse
```

```
> Rows: 344
> Columns: 9
> Groups: species [3]
> $ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, A~
> $ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgersen, To~
> $ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, 42.0,~
> $ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, 20.2,~
> $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186, 180~
> $ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, 4250,~
> $ sex <fct> male, female, female, NA, female, male, female, male, NA,~
> $ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 200~
> $ stand_bm <dbl> 0.107591350, 0.216626878, -0.982763938, NA, -0.546621823,~
```

???
- here example of the z-transformation on a group level

---

## 4.5 `dplyr`: A Grammar of Data Manipulation

**`group_by()` a transformed column:** Provide a `mutate()`-like expression in your `group_by()` statement

```r
bm_breaks <- mean(penguins$body_mass_g, na.rm = T) - (-3:3) * sd(penguins$body_mass_g, na.rm = T)

penguins %>% 
  group_by(species, bm_bin = cut(body_mass_g, breaks = bm_breaks)) %>%
  summarise(count = n(), .groups = "drop")
```

```
> # A tibble: 12 x 3
> species bm_bin count
> <fct> <fct> <int>
> 1 Adelie (2.6e+03,3.4e+03] 39
> 2 Adelie (3.4e+03,4.2e+03] 87
> 3 Adelie (4.2e+03,5e+03] 25
> 4 Adelie <NA> 1
> 5 Chinstrap (2.6e+03,3.4e+03] 11
> 6 Chinstrap (3.4e+03,4.2e+03] 50
> 7 Chinstrap (4.2e+03,5e+03] 7
> 8 Gentoo (3.4e+03,4.2e+03] 6
> 9 Gentoo (4.2e+03,5e+03] 56
> 10 Gentoo (5e+03,5.81e+03] 52
> 11 Gentoo (5.81e+03,6.61e+03] 9
> 12 Gentoo <NA> 1
```

???
1. compute the bin edges for body mass as multiples of the standard deviation around the mean (from -3 to +3 standard deviations)
2. group the data by these bins (the bins are created directly in the `group_by()` call)

---

## 4.5 `dplyr`: A Grammar of Data Manipulation

**Changed behavior of `filter()`:** Filters now operate on partitions of the data instead of on the whole data

```r
penguins %>% 
  group_by(species, island) %>% 
  filter(flipper_length_mm == max(flipper_length_mm, na.rm = T))
```

```
> # A tibble: 5 x 8
> # Groups: species, island [5]
> species island bill_length_mm bill_depth_mm flipper_length_~ body_mass_g sex 
> <fct> <fct> <dbl> <dbl> <int> <int> <fct>
> 1 Adelie Dream 40.8 18.9 208 4300 male 
> 2 Adelie Biscoe 41 20 203 4725 male 
> 3 Adelie Torgersen 44.1 18 210 4000 male 
> 4 Gentoo Biscoe 54.3 15.7 231 5650 male 
> 5 Chinstrap Dream 49 19.6 212 4300 male 
> # ... with 1 more variable: year <int>
```

???
- Group by all unique `species`-`island` combinations and filter for the penguins with the maximal flipper length per combination

---

## 4.5 `dplyr`: A Grammar of Data Manipulation

**Nesting of grouped data:** Usually, you will find it more intuitive to use `group_by()` followed by `nest()` to produce a nested data frame compared to the example in [section 4.4](#tidyr_nest).

```r
penguins %>% 
  group_by(species, year) %>% 
  tidyr::nest()
```

```
> # A tibble: 9 x 3
> # Groups: species, year [9]
> species year data 
> <fct> <int> <list> 
> 1 Adelie 2007 <tibble [50 x 6]>
> 2 Adelie 2008 <tibble [50 x 6]>
> 3 Adelie 2009 <tibble [52 x 6]>
> 4 Gentoo 2007 <tibble [34 x 6]>
> 5 Gentoo 2008 <tibble [46 x 6]>
> 6 Gentoo 2009 <tibble [44 x 6]>
> 7 Chinstrap 2007 <tibble [26 x 6]>
> 8 Chinstrap 2008 <tibble [18 x 6]>
> 9 Chinstrap 2009 <tibble [24 x 6]>
```

---

## 4.5 `dplyr`: A Grammar of Data Manipulation

**Other selected `dplyr` operations:**

.panelset[
.panel[.panel-name[distinct()]

```r
penguins %>% 
  distinct(species, island)
```

```
> # A tibble: 5 x 2
> species island 
> <fct> <fct> 
> 1 Adelie Torgersen
> 2 Adelie Biscoe 
> 3 Adelie Dream 
> 4 Gentoo Biscoe 
> 5 Chinstrap Dream
```
]
.panel[.panel-name[pull()]
`pull()` extracts single columns as vectors.

```r
penguins %>% 
  pull(year)  # equivalent to: penguins$year
```

```
>   [1] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007
>  [17] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007
>  [33] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007
>  [49] 2007 2007 2008 2008 2008 2008 2008 2008 2008 2008 2008 2008 2008 2008 2008 2008
>  [65] 2008 2008 2008 2008 2008 2008 2008 2008 2008 2008 2008 2008 2008 2008 2008 2008
>  [81] 2008 2008 2008 2008 2008 2008 2008 2008 2008 2008 2008 2008 2008 2008 2008 2008
>  [97] 2008 2008 2008 2008
>  [ reached getOption("max.print") -- omitted 244 entries ]
```

]
.panel[.panel-name[if_else()]
`if_else()` applies a vectorized if-else-statement.

```r
penguins %>% select(species, island, body_mass_g) %>% 
 mutate(penguin_size = if_else(body_mass_g < 3500, "tiny penguin", "big penguin"))
```

```
> # A tibble: 344 x 4
> species island body_mass_g penguin_size
> <fct> <fct> <int> <chr> 
> 1 Adelie Torgersen 3750 big penguin 
> 2 Adelie Torgersen 3800 big penguin 
> 3 Adelie Torgersen 3250 tiny penguin
> 4 Adelie Torgersen NA <NA> 
> 5 Adelie Torgersen 3450 tiny penguin
> 6 Adelie Torgersen 3650 big penguin 
> 7 Adelie Torgersen 3625 big penguin 
> 8 Adelie Torgersen 4675 big penguin 
> 9 Adelie Torgersen 3475 tiny penguin
> 10 Adelie Torgersen 4250 big penguin 
> # ... with 334 more rows
```
]
.panel[.panel-name[lag()]
`lag()` shifts column values forward by an offset of `n` rows, so each row receives the value from `n` rows earlier (the first `n` entries become `NA`).

```r
penguins %>% select(species, body_mass_g) %>% 
  mutate(lagged_bm = lag(body_mass_g, n = 1))
```

```
> # A tibble: 344 x 3
> species body_mass_g lagged_bm
> <fct> <int> <int>
> 1 Adelie 3750 NA
> 2 Adelie 3800 3750
> 3 Adelie 3250 3800
> 4 Adelie NA 3250
> 5 Adelie 3450 NA
> 6 Adelie 3650 3450
> 7 Adelie 3625 3650
> 8 Adelie 4675 3625
> 9 Adelie 3475 4675
> 10 Adelie 4250 3475
> # ... with 334 more rows
```
]
.panel[.panel-name[lead()]
`lead()` shifts column values backward by an offset of `n` rows, so each row receives the value from `n` rows later (the last `n` entries become `NA`).

```r
penguins %>% select(species, body_mass_g) %>% 
  mutate(lead_bm = lead(body_mass_g, n = 2))
```

```
> # A tibble: 344 x 3
> species body_mass_g lead_bm
> <fct> <int> <int>
> 1 Adelie 3750 3250
> 2 Adelie 3800 NA
> 3 Adelie 3250 3450
> 4 Adelie NA 3650
> 5 Adelie 3450 3625
> 6 Adelie 3650 4675
> 7 Adelie 3625 3475
> 8 Adelie 4675 4250
> 9 Adelie 3475 3300
> 10 Adelie 4250 3700
> # ... with 334 more rows
```
]
.panel[.panel-name[join()]
`left_join()`, `right_join()`, `inner_join()` and `full_join()` merge two data frames by matching rows on key columns (similar to joins in SQL).
]
]
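
The four joins differ only in which unmatched rows they keep. The following is a minimal sketch on two made-up tables: `sightings` and `stations`, including the `station` codes, are hypothetical and not part of `palmerpenguins`.

```r
library(dplyr)

# Toy observations, loosely modelled on the penguins columns
sightings <- tibble::tibble(
  species = c("Adelie", "Gentoo", "Adelie"),
  island  = c("Torgersen", "Biscoe", "Dream")
)

# Hypothetical lookup table; the station codes are invented for illustration
stations <- tibble::tibble(
  island  = c("Biscoe", "Dream", "Anvers"),
  station = c("B-1", "D-1", "A-1")
)

left  <- left_join(sightings, stations, by = "island")   # all rows of `sightings`
right <- right_join(sightings, stations, by = "island")  # all rows of `stations`
inner <- inner_join(sightings, stations, by = "island")  # matching rows only
full  <- full_join(sightings, stations, by = "island")   # all rows of both tables
```

Rows without a match are filled with `NA`, e.g., `station` for the Torgersen sighting in `left`.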

.pull-right[.pull-right[.footnote[
*Note: Find more information about `dplyr` by running `vignette("dplyr")` and consulting the official [cheat sheet](https://raw.githubusercontent.com/rstudio/cheatsheets/master/data-transformation.pdf).*
]]]

---

## 4.5 `dplyr`: A Grammar of Data Manipulation

.center[
*Src: [Steves (2021)](https://www.rstudio.com/resources/rstudioglobal-2021/the-dynamic-duo-sql-and-r/)*
]

---

# 4.6 `purrr`: Functional Programming Tools

---

background-image: url(https://raw.githubusercontent.com/tidyverse/purrr/master/man/figures/logo.png)
background-position: 97.5% 2.5%
background-size: 7.5%
layout: true

---

## 4.6 `purrr`: Functional Programming Tools

`purrr` facilitates [*functional programming*](https://en.wikipedia.org/wiki/Functional_programming) (FP) with vectors, lists and data frames in `R`. Whenever you would normally resort to a `for`-loop to solve an iterative problem, the family of `map_*()` functions allows you to rephrase the problem as a `tidyverse` pipeline.

**Four main types of `map_*()` functions:**
- `map(.x, .f, ...)` takes the input `.x` and applies `.f` to each element in `.x`.
- `map2(.x, .y, .f, ...)` takes the inputs `.x` and `.y` and applies `.f` to `.x` and `.y` in parallel.
- `pmap(.l, .f, ...)` takes a list `.l` of inputs and applies `.f` to each element in `.l` in parallel.
- `group_map(.data, .f, ...)` takes a grouped `tibble` and applies `.f` to each subgroup (provided by `dplyr`, but it follows the same logic).

.pull-left[
By default, `map()` returns a list. If you want to be more explicit about the output, you may use
- `map_lgl()` to receive a logical output type,
- `map_chr()` to receive a character output type,
- `map_int()` to receive an integer output type,
- `map_dbl()` to receive a double output type,
- `map_df()` to receive a data frame output.
]

.pull-right[
The input `.x` to any `map_*()` function can be a vector, a list or a data frame.
- **Vector:** Iteration over vector elements
- **List:** Iteration over list elements
- **Data frame:** Iteration over columns
]
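
To make the function families and input types above concrete, here is a small sketch on toy inputs. All values and the data frame `toy_df` are made up for illustration.

```r
library(purrr)

# Vector input: iterate over elements, return a list
roots <- map(c(1, 4, 9), sqrt)

# List input: iterate over list elements, return a named double vector
means <- map_dbl(list(a = 1:3, b = 4:6), mean)

# Data frame input: iterate over columns, return a character vector
toy_df <- data.frame(id = 1:2, label = c("a", "b"), stringsAsFactors = FALSE)
col_classes <- map_chr(toy_df, class)

# Two inputs processed in parallel (first with first, second with second)
sums <- map2_dbl(c(1, 2), c(10, 20), function(x, y) x + y)

# Any number of inputs: list names are matched to the function's arguments
totals <- pmap_dbl(
  list(a = 1:2, b = 3:4, c = 5:6),
  function(a, b, c) a + b + c
)
```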

???
In functional programming, your code is organised into functions that perform the operations you need.

Your scripts will only be a sequence of calls to these functions, making them easier to understand.

---

## 4.6 `purrr`: Functional Programming Tools

.center[
_Src: [Rodrigues (2010)](https://b-rodrigues.github.io/modern_R/functional-programming.html)_
]

---

## 4.6 `purrr`: Functional Programming Tools

**Use Case:** Let's assume we have multiple data samples and require each of the samples to be `$z$`-normalized for further modeling. First, we would probably write a *named function* for performing `$z$`-normalization which takes our sample `.x` as input.

```r
z_transform <- function(.x) {
  # Ignore missing values when estimating mean and standard deviation
  sample_mean <- mean(.x, na.rm = TRUE)
  sample_sd <- sd(.x, na.rm = TRUE)
  (.x - sample_mean) / sample_sd
}
```

Second, we draw samples from the `penguins` data set and store them as numeric vectors in a list.

```r
samples <- list(
 sample1 = slice_sample(penguins, n = 10)$bill_length_mm,
 sample2 = slice_sample(penguins, n = 10)$bill_depth_mm,
 sample3 = slice_sample(penguins, n = 10)$flipper_length_mm
)

samples[1]
```

```
> $sample1
>  [1] 55.9 43.2 37.2 34.6 36.0 52.2 32.1 37.2 38.1 50.5
```

???
- here: different means and sd

---

## 4.6 `purrr`: Functional Programming Tools

Third, perform the `$z$`-normalization.

.panelset[
.panel[.panel-name[for-loop]

```r
for (sample in samples) {
  print(z_transform(.x = sample)) 
}
```

```
>  [1]  1.7107192  0.1807098 -0.5421293 -0.8553596 -0.6866971  1.2649684 -1.1565426
>  [8] -0.5421293 -0.4337035  1.0601640
>  [1]  1.4295925  1.1598581 -0.1888141  0.2967079  0.6203892 -0.7822299 -1.3216988
>  [8] -1.5374863  0.5664423 -0.2427610
>  [1] -0.31304243  0.08829402 -0.71437887 -0.07224056  2.09497625  1.37257064
>  [7] -1.03544803 -0.15250785 -0.95518074 -0.31304243
```
]
.panel[.panel-name[map()]

```r
map(.x = samples, .f = ~ z_transform(.x))
```

```
> $sample1
>  [1]  1.7107192  0.1807098 -0.5421293 -0.8553596 -0.6866971  1.2649684 -1.1565426
>  [8] -0.5421293 -0.4337035  1.0601640
> 
> $sample2
>  [1]  1.4295925  1.1598581 -0.1888141  0.2967079  0.6203892 -0.7822299 -1.3216988
>  [8] -1.5374863  0.5664423 -0.2427610
> 
> $sample3
>  [1] -0.31304243  0.08829402 -0.71437887 -0.07224056  2.09497625  1.37257064
>  [7] -1.03544803 -0.15250785 -0.95518074 -0.31304243
```
]
]
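
The sketch below contrasts the bookkeeping of both approaches when the results need to be stored rather than printed. It uses made-up `toy_samples` (an assumption, not the samples drawn above) and also checks the defining property of the `$z$`-normalization: mean 0 and standard deviation 1.

```r
library(purrr)

z_transform <- function(.x) {
  (.x - mean(.x, na.rm = TRUE)) / sd(.x, na.rm = TRUE)
}

# Toy samples (made up), one of them with a missing value
toy_samples <- list(a = c(1, 2, 3, 4), b = c(10, 20, NA, 40))

# for-loop: the output list has to be created, named and filled by hand
loop_result <- vector("list", length(toy_samples))
names(loop_result) <- names(toy_samples)
for (i in seq_along(toy_samples)) {
  loop_result[[i]] <- z_transform(toy_samples[[i]])
}

# map(): same computation, output list and names are handled for us
map_result <- map(toy_samples, z_transform)

# Each normalized sample should have mean 0 and standard deviation 1
sample_means <- map_dbl(map_result, mean, na.rm = TRUE)
sample_sds   <- map_dbl(map_result, sd, na.rm = TRUE)
```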

???
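Oftentimes, `map()` statements are more concise than `for`-loops (they are not inherently faster, but they spare you manual bookkeeping such as pre-allocating the output).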

---

## Excursus: Tilde-Shorthand

Within the `tidyverse`, the tilde-shorthand occurs whenever a function is passed as an argument to one of the `tidyverse` functions. In general, there are several ways to supply that function, one of which is the tilde-shorthand notation.

.panelset[
.panel[.panel-name[Option 1]
Passing a named function.

```r
map(.x = samples, .f = z_transform)
```
Note that further arguments can be passed on to the function after `.f` (via `...`), e.g., `map(.x = samples, .f = mean, na.rm = T)`.
]
.panel[.panel-name[Option 2]
Defining an anonymous function inline.

```r
map(
 .x = samples,
 .f = function(.x) { (.x - mean(.x, na.rm = T)) / sd(.x, na.rm = T) }
)
```
 Note that here you could also omit `{ }`, since there is only a single expression involved in the function.
]
.panel[.panel-name[Option 3]
Defining an anonymous function inline using the tilde-shorthand.

```r
map(
 .x = samples,
 .f = ~ (.x - mean(.x, na.rm = T)) / sd(.x, na.rm = T)
)
```
 Note that whenever we use the tilde-shorthand, we refer to the argument of the anonymous function by `.x` or simply by `.` (if it only requires one input).
]
]
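
The placeholders of the tilde-shorthand depend on how many inputs the `map_*()` variant provides. A short sketch on toy integer vectors (chosen only for illustration):

```r
library(purrr)

# One input: refer to the current element as `.x` (or `.`)
squares <- map_dbl(1:3, ~ .x^2)

# Two inputs (map2): `.x` and `.y`
products <- map2_dbl(1:3, 4:6, ~ .x * .y)

# More than two inputs (pmap): positional placeholders `..1`, `..2`, `..3`
sums <- pmap_dbl(list(1:3, 4:6, 7:9), ~ ..1 + ..2 + ..3)

# The shorthand is just a compact way of writing an anonymous function
same_squares <- map_dbl(1:3, function(x) x^2)
```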

???
The tilde indicates: what comes next should be considered as a function

Most of the time, explicitly defining a named function and choosing option 1 only makes sense if you need the function more than once. Otherwise, I would strongly recommend using an anonymous function, i.e. option 2 or 3.

---

## 4.6 `purrr`: Functional Programming Tools

<img src="https://media1.tenor.com/images/f72cb542d6b3e3c3421889e0a3d9628d/tenor.gif" width="50%" style="display: block; margin: auto;" />
 
.center[🕺 Now let us look at some other practical use cases! 💃]

---

## 4.6 `purrr`: Functional Programming Tools

.pull-left[
Check the class of each column.

```r
penguins %>%
  map_df(class) %>% 
  glimpse
```

```
> Rows: 1
> Columns: 8
> $ species <chr> "factor"
> $ island <chr> "factor"
> $ bill_length_mm <chr> "numeric"
> $ bill_depth_mm <chr> "numeric"
> $ flipper_length_mm <chr> "integer"
> $ body_mass_g <chr> "integer"
> $ sex <chr> "factor"
> $ year <chr> "integer"
```
]
.pull-right[
Check the number of missing values per column.

```r
penguins %>%
  map_df(~ .x %>% is.na %>% sum) %>% 
  glimpse
```

```
> Rows: 1
> Columns: 8
> $ species <int> 0
> $ island <int> 0
> $ bill_length_mm <int> 2
> $ bill_depth_mm <int> 2
> $ flipper_length_mm <int> 2
> $ body_mass_g <int> 2
> $ sex <int> 11
> $ year <int> 0
```
]

???
1: I give `map` a data frame as input (`penguins`), so it iterates over each column. And to each column I apply the `class()` function. I want the output to be returned as a data frame (`map_df`)

---

## 4.6 `purrr`: Functional Programming Tools

Check the number of distinct values per column.

```r
penguins %>%
  map_df(dplyr::n_distinct) %>% 
  glimpse
```

```
> Rows: 1
> Columns: 8
> $ species <int> 3
> $ island <int> 3
> $ bill_length_mm <int> 165
> $ bill_depth_mm <int> 81
> $ flipper_length_mm <int> 56
> $ body_mass_g <int> 95
> $ sex <int> 3
> $ year <int> 3
```

---

## 4.6 `purrr`: Functional Programming Tools

Check the highest value in each subset of the data (e.g., largest `flipper_length_mm` per `sex`).

```r
penguins %>%
  tidyr::drop_na() %>% 
  dplyr::group_by(sex) %>%
  group_map(~ dplyr::slice_max(.x, flipper_length_mm, n = 1), .keep = T)
```

```
> [[1]]
> # A tibble: 1 x 8
> species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex 
> <fct> <fct> <dbl> <dbl> <int> <int> <fct> 
> 1 Gentoo Biscoe 46.9 14.6 222 4875 female
> # ... with 1 more variable: year <int>
> 
> [[2]]
> # A tibble: 1 x 8
> species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex 
> <fct> <fct> <dbl> <dbl> <int> <int> <fct>
> 1 Gentoo Biscoe 54.3 15.7 231 5650 male 
> # ... with 1 more variable: year <int>
```
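
For comparison, a minimal sketch on a made-up `toy` tibble (hypothetical values) shows how `group_map()` relates to two neighbouring `dplyr` verbs: `group_modify()` binds the per-group results back into one tibble, and `summarise()` is usually sufficient when each group reduces to a single value.

```r
library(dplyr)

toy <- tibble::tibble(
  sex = c("female", "female", "male", "male", "male"),
  flipper_length_mm = c(190, 205, 210, 198, 221)
)

# group_map(): returns a list with one element per group
per_group_list <- toy %>%
  group_by(sex) %>%
  group_map(~ slice_max(.x, flipper_length_mm, n = 1))

# group_modify(): same computation, results are bound into one grouped tibble
per_group_tbl <- toy %>%
  group_by(sex) %>%
  group_modify(~ slice_max(.x, flipper_length_mm, n = 1))

# summarise(): simplest option when each group collapses to one value
per_group_max <- toy %>%
  group_by(sex) %>%
  summarise(max_flipper = max(flipper_length_mm))
```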

???
- drop_na: because otherwise I would also have a subgroup of NA

---

## 4.6 `purrr`: Functional Programming Tools

Produce a series of identical plots, each depicting a separate subset of the underlying data.

```r
species <- penguins %>%
  dplyr::distinct(species, year) %>%
  dplyr::pull(species) # .x argument for map2()

years <- penguins %>%
  dplyr::distinct(species, year) %>%
  dplyr::pull(year) # .y argument for map2()

penguin_plots <- map2(
  .x = species,
  .y = years,
  .f = ~ {
    penguins %>%
      tidyr::drop_na() %>% 
      dplyr::filter(species == .x, year == .y) %>% 
      ggplot2::ggplot() +
      geom_point(aes(x = bill_length_mm, y = body_mass_g)) +
      labs(title = glue::glue("Scatter Plot Bill Length vs. Body Mass ({.x}, {.y})"))
  }
)
```
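
Note that `map2()` pairs its inputs by position: the i-th element of `.x` is processed together with the i-th element of `.y`, and no cross-combinations are formed. The following sketch illustrates this on a made-up `toy` tibble (hypothetical values), building only the plot titles to stay lightweight.

```r
library(dplyr)
library(purrr)

toy <- tibble::tibble(
  species = c("Adelie", "Adelie", "Gentoo", "Gentoo", "Gentoo"),
  year    = c(2007, 2008, 2007, 2007, 2009)
)

# One row per observed combination; both columns share length and order
combos <- toy %>% distinct(species, year)

# The i-th species is paired with the i-th year
titles <- map2_chr(
  .x = combos$species,
  .y = combos$year,
  .f = ~ paste0("Bill Length vs. Body Mass (", .x, ", ", .y, ")")
)
```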

---

## 4.6 `purrr`: Functional Programming Tools

.pull-left[

```r
penguin_plots[[1]]
```

<img src="index_files/figure-html/unnamed-chunk-148-1.png" width="576" style="display: block; margin: auto;" />
]
.pull-right[

```r
penguin_plots[[4]]
```

<img src="index_files/figure-html/unnamed-chunk-149-1.png" width="576" style="display: block; margin: auto;" />
]

---

## 4.6 `purrr`: Functional Programming Tools

Finally, `map()` is really powerful in the context of modeling. In the following, we fit a linear regression model for each `species`-`island` subset.

First, we create a nested data frame that contains one `tibble` for each `species`-`island` combination.

```r
nested_penguins <- penguins %>% 
 tidyr::drop_na() %>% 
 dplyr::group_by(species, island) %>% 
 tidyr::nest()

nested_penguins
```

```
> # A tibble: 5 x 3
> # Groups: species, island [5]
> species island data 
> <fct> <fct> <list> 
> 1 Adelie Torgersen <tibble [47 x 6]> 
> 2 Adelie Biscoe <tibble [44 x 6]> 
> 3 Adelie Dream <tibble [55 x 6]> 
> 4 Gentoo Biscoe <tibble [119 x 6]>
> 5 Chinstrap Dream <tibble [68 x 6]>
```

.pull-right[.footnote[
*Note: For accessing elements in a nested `tibble` you may use the `pluck()` function. For example, for accessing the first `tibble` in the column `data`, you may run `nested_penguins %>% pluck("data", 1)` (also see [section 4.4](#nested-data)).*
]]
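
A short sketch of `pluck()` on a nested version of the built-in `mtcars` data (an assumption made to keep the example self-contained) could look as follows. Additional arguments descend one level deeper each.

```r
library(dplyr)
library(tidyr)
library(purrr)

nested_cars <- mtcars %>%
  group_by(cyl) %>%
  nest()

# First tibble in the list-column `data`
first_tbl <- nested_cars %>% pluck("data", 1)

# One level deeper: the column `mpg` inside that first tibble
first_mpg <- nested_cars %>% pluck("data", 1, "mpg")

# Base R equivalent of the first call
first_tbl_base <- nested_cars$data[[1]]
```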

---

## 4.6 `purrr`: Functional Programming Tools

Second, we fit a linear model to each data subset. In our model, `body_mass_g` is regressed (`~`) on all other variables (denoted by a dot in the `lm()` formula).

```r
nested_penguins <- nested_penguins %>% 
  dplyr::mutate(lin_reg = map(
    .x = data,
    .f = ~ lm(body_mass_g ~ ., data = .x)
  ))

nested_penguins
```

```
> # A tibble: 5 x 4
> # Groups: species, island [5]
> species island data lin_reg
> <fct> <fct> <list> <list> 
> 1 Adelie Torgersen <tibble [47 x 6]> <lm> 
> 2 Adelie Biscoe <tibble [44 x 6]> <lm> 
> 3 Adelie Dream <tibble [55 x 6]> <lm> 
> 4 Gentoo Biscoe <tibble [119 x 6]> <lm> 
> 5 Chinstrap Dream <tibble [68 x 6]> <lm>
```

---

## 4.6 `purrr`: Functional Programming Tools

Third, for each linear model, we generate a model summary using `summary()` and extract the model coefficients as a `tibble`.

```r
nested_penguins <- nested_penguins %>% 
  dplyr::mutate(coefs = map(
    .x = lin_reg,
    .f = ~ summary(.) %>% .$coefficients %>% as_tibble(rownames = "variable")
  ))

nested_penguins
```

```
> # A tibble: 5 x 5
> # Groups: species, island [5]
> species island data lin_reg coefs 
> <fct> <fct> <list> <list> <list> 
> 1 Adelie Torgersen <tibble [47 x 6]> <lm> <tibble [6 x 5]>
> 2 Adelie Biscoe <tibble [44 x 6]> <lm> <tibble [6 x 5]>
> 3 Adelie Dream <tibble [55 x 6]> <lm> <tibble [6 x 5]>
> 4 Gentoo Biscoe <tibble [119 x 6]> <lm> <tibble [6 x 5]>
> 5 Chinstrap Dream <tibble [68 x 6]> <lm> <tibble [6 x 5]>
```
---

## 4.6 `purrr`: Functional Programming Tools

Fourth, we use `unnest()` on the `coefs` column to obtain a tidy data frame with one row per model coefficient.

```r
nested_penguins %>% tidyr::unnest(coefs)
```

```
> # A tibble: 30 x 9
> # Groups: species, island [5]
> species island data lin_reg variable Estimate `Std. Error` `t value` `Pr(>|t|)`
> <fct> <fct> <lis> <list> <chr> <dbl> <dbl> <dbl> <dbl>
> 1 Adelie Torgersen <tib~ <lm> (Interc~ 4.49e5 130401. 3.45 0.00133 
> 2 Adelie Torgersen <tib~ <lm> bill_le~ 4.20e0 17.3 0.243 0.809 
> 3 Adelie Torgersen <tib~ <lm> bill_de~ -6.20e1 54.6 -1.14 0.263 
> 4 Adelie Torgersen <tib~ <lm> flipper~ 1.55e1 8.74 1.77 0.0838 
> 5 Adelie Torgersen <tib~ <lm> sexmale 6.48e2 149. 4.33 0.0000926
> 6 Adelie Torgersen <tib~ <lm> year -2.23e2 64.9 -3.44 0.00136 
> 7 Adelie Biscoe <tib~ <lm> (Interc~ 6.37e4 140556. 0.454 0.653 
> 8 Adelie Biscoe <tib~ <lm> bill_le~ 3.78e1 24.1 1.57 0.125 
> 9 Adelie Biscoe <tib~ <lm> bill_de~ 1.16e2 44.3 2.62 0.0124 
> 10 Adelie Biscoe <tib~ <lm> flipper~ 2.41e1 8.21 2.94 0.00553 
> # ... with 20 more rows
```

.footnote[.pull-right[
*Note: There are specific packages (e.g., `broom`) for tidying model outputs. These provide convenient functions that help you achieve the same thing with much less code.*
]]
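
As a rough sketch of the `broom` route, the pipeline below fits one model per group and tidies the coefficients with `broom::tidy()`. It assumes that `broom` is installed and uses the built-in `mtcars` data with a deliberately simple formula (`mpg ~ wt`), both chosen only for illustration.

```r
library(dplyr)
library(tidyr)
library(purrr)

tidy_coefs <- mtcars %>%
  group_by(cyl) %>%
  nest() %>%
  mutate(
    lin_reg = map(data, ~ lm(mpg ~ wt, data = .x)),  # one model per group
    coefs   = map(lin_reg, broom::tidy)               # one coefficient table per model
  ) %>%
  select(cyl, coefs) %>%
  unnest(coefs) %>%
  ungroup()
```

The result has one row per group and model term, with the columns `term`, `estimate`, `std.error`, `statistic` and `p.value`.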

---

## 4.6 `purrr`: Functional Programming Tools

.pull-left[
.center[
🤔 How you probably feel right now 
<img src="https://tenor.com/view/matg-calculate-confusing-figure-out-gif-6237717.gif" style="display: block; margin: auto;" />
]]

.pull-right[
.center[
🤓 After having mastered the intricacies of FP 
<img src="https://tenor.com/view/cat-computer-gif-5368357.gif" style="display: block; margin: auto;" />
]]

.footnote[
*Note: Find more information about `purrr` by consulting the official [cheat sheet](https://raw.githubusercontent.com/rstudio/cheatsheets/master/purrr.pdf). For a great tutorial that helps you master the notion of functional programming with `R` see [this blogpost](http://www.rebeccabarter.com/blog/2019-08-19_purrr/#simplest-usage-repeated-looping-with-map) by Rebecca Barter.*
]

---

## 4.6 `purrr`: Functional Programming Tools

Finally, `purrr` also provides convenient [wrapper functions](https://en.wikipedia.org/wiki/Wrapper_function) for **error handling**. These come in handy when you iterate over a very large data set, where a single error would otherwise stop your program. This is particularly frustrating because you would lose all progress made up to that point.

For example, at some point you might want to train a separate prediction model (`lm`) for each unique value of `species` (Adelie, Gentoo, Chinstrap). Unfortunately, the following code is throwing an error ...

```r
grouped_penguins <- penguins %>% 
 dplyr::mutate(across(c(sex, island), as.factor)) %>% 
 dplyr::group_by(species)
```

```r
grouped_penguins %>% 
 group_map(.f = ~ lm(flipper_length_mm ~ bill_length_mm + island, data = .x))
```
```
> Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
> contrasts can be applied only to factors with 2 or more levels
```

🤔 **Which group is ultimately responsible for the error?**

???
- wrapper functions: wrap a function around another function, i.e. you call a function when applying another function

---

## 4.6 `purrr`: Functional Programming Tools

`purrr::possibly()` wraps a function so that it returns either the function's result or, if an error occurs, a user-defined value (`otherwise`).

```r
possibly_lm <- possibly(.f = lm, otherwise = "Error message")

grouped_penguins %>% 
  group_map(.f = ~ possibly_lm(flipper_length_mm ~ bill_length_mm + island, data = .x))
```

```
> [[1]]
> 
> Call:
> .f(formula = ..1, data = ..2)
> 
> Coefficients:
>     (Intercept)   bill_length_mm      islandDream  islandTorgersen  
>        157.5591           0.8014           1.3159           2.4199  
> 
> 
> [[2]]
> [1] "Error message"
> 
> [[3]]
> [1] "Error message"
```

.footnote[.pull-right[
*Note: Use `purrr::discard(. == "Error message")` (`purrr::keep()`) at the end of the pipeline to drop (keep) function calls that yielded an error. These work like `dplyr::select()` and `dplyr::filter()` in the context of `tibbles`.*
]]
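
The same pattern can be sketched on a smaller scale. Here, `log()` serves as a stand-in for a function that fails on some inputs; the `inputs` list is made up for illustration.

```r
library(purrr)

# log() fails for character input, so a plain map() would stop at the third element
inputs <- list(1, 10, "a")

# Wrap log() so that an error yields NA instead of aborting the iteration
possibly_log <- possibly(log, otherwise = NA_real_)
logs <- map_dbl(inputs, possibly_log)

# Positions of the inputs that failed
failed <- which(is.na(logs))

# keep() / discard() filter list elements with a predicate function
numeric_inputs <- keep(inputs, is.numeric)
non_numeric    <- discard(inputs, is.numeric)
```

Choosing an `otherwise` value of the expected output type (here `NA_real_`) keeps the result compatible with the typed `map_*()` variants.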

---

## 4.6 `purrr`: Functional Programming Tools

`purrr::safely()` wraps a function so that it returns a named list containing the function's result (or `otherwise` if an error occurs) as well as an error object that captures the error message (`NULL` if no error occurred).

```r
safely_lm <- safely(.f = lm, otherwise = NULL)

grouped_penguins %>% 
 group_map(.f = ~ safely_lm(flipper_length_mm ~ bill_length_mm + island, data = .x)) 
```
 
- Use `purrr::map(., "result")` at the end of the pipeline to access the results of each function call stored in the list. 
- Use `purrr::map(., "error")` at the end of the pipeline to access the errors of each function call stored in the list.
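
A minimal sketch of this workflow, again with `log()` and a made-up `inputs` list as stand-ins for the model fitting above:

```r
library(purrr)

inputs <- list(1, 10, "a")

safely_log <- safely(log, otherwise = NA_real_)
outcomes <- map(inputs, safely_log)  # each element is a list with `result` and `error`

results <- map_dbl(outcomes, "result")
errors  <- map(outcomes, "error")

# Which calls failed, and with what message?
failed   <- which(!map_lgl(errors, is.null))
messages <- map_chr(errors[failed], conditionMessage)
```

In contrast to `possibly()`, the error objects are retained, so you can inspect afterwards why a particular call failed.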

.footnote[
*Note: Similarly, use `purrr::quietly()` to return a named list containing not only the function's result but also other kinds of output, such as printed output, warnings or messages. Unlike `safely()`, it does not capture errors.*
]

???
- quietly: useful to capture warnings and messages that the code emits, e.g., `summarise()` frequently emits a message if you do not specify the `.groups` argument

---

# 4.7 `ggplot2`: Create Elegant Data Visualisations Using the Grammar of Graphics

---

background-image: url(https://raw.githubusercontent.com/tidyverse/ggplot2/master/man/figures/logo.png)
background-position: 97.5% 2.5%
background-size: 7.5%
layout: true

---