class: center, middle, hide-count count: false # Machine Learning in R ### Introduction to the Tidyverse ___ **Simon Schölzel** Winter Term 2021/2022 .small[(updated: 2021-09-25)] <br><br> <a href="https://www.wiwi.uni-muenster.de/"><img src="https://www.wiwi.uni-muenster.de/fakultaet/sites/all/themes/wwucd/assets/images/logos/secondary_wiwi_aacsb_german.jpg" alt="fb4-logo" height="45"></a> <a href="https://www.wiwi.uni-muenster.de/ctrl/aktuelles"><img src="https://www.wiwi.uni-muenster.de/ctrl/sites/all/themes/wwucd/assets/images/logos/berenslogo5.jpg" alt="ftb-logo" height="45"></a> <a href="https://www.wiwi.uni-muenster.de/iff2/de/news"><img src="https://www.wiwi.uni-muenster.de/iff2/sites/all/themes/wwucd/assets/images/logos/logo_iff2_en2.jpg" alt="ipb-logo" height="45"></a> --- ## Agenda **1 Learning Objectives** **2 Introduction to the `tidyverse`** > 2.1 What is the `tidyverse` 2.2 The Concept of Tidy Data **3 `palmerpenguins`: Palmer Archipelago (Antarctica) Penguin Data** **4 The Core `tidyverse` Packages** > 4.1 `magrittr`: A Forward-Pipe Operator for `R` 4.2 `tibble`: Simple Data Frames 4.3 `readr`: Read Rectangular Text Data 4.4 `tidyr`: Tidy Messy Data 4.5 `dplyr`: A Grammar of Data Manipulation 4.6 `purrr`: Functional Programming Tools 4.7 `ggplot2`: Create Elegant Data Visualisations Using the Grammar of Graphics --- ## 1 Learning Objectives 💡 This lecture teaches you important tools for working with tabular data sets in `R`. It introduces and showcases a suite of packages which ease your data science workflow in terms of data import, data cleaning, data transformation and data visualization. 
More specifically, after this lecture you will - be familiar with the main tools of the `tidyverse` and how it differs from `base R`,<br><br> - know your way around the core packages of the `tidyverse` for importing, tidying, transforming and visualizing data,<br><br> - be proficient in processing (*non-tidy*) data of any shape and quality,<br><br> - be able to produce high-quality, fully customizable visualizations,<br><br> - have improved your overall data literacy. ??? especially highlight the last point: how you think about data, how you approach working with data whenever you open a new data set, build a mental model for data transformation operations --- class: middle, center, inverse, hide-logo # 2 Introduction to the `tidyverse` --- background-image: url(https://www.tidyverse.org/images/hex-tidyverse.png) background-position: 97.5% 2.5% background-size: 7.5% layout: true --- ## 2.1 What is the `tidyverse`? > The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures. ~ [tidyverse.org](https://www.tidyverse.org/) > Its primary goal is to facilitate a conversation between a human and a computer about data. ~ [Wickham et al. (2019)](https://joss.theoj.org/papers/10.21105/joss.01686) .pull-left[.center[ <img src="https://www.tidyverse.org/images/hex-tidyverse.png" width="45%" height="45%" /> Official `tidyverse` [Hex Sticker](https://github.com/rstudio/hex-stickers) ]] .pull-right[.center[ <img src="https://pbs.twimg.com/profile_images/905186381995147264/7zKAG5sY.jpg" width="50%" height="50%" /> Hadley Wickham - Chief Scientist @ RStudio, [Founding Father](https://twitter.com/hadleywickham/status/959507805282582528?s=20) of the `tidyverse` ]] ??? - Can also be seen as a philosophy of how to write code in R. It's a dialect. - Many people in the community argue that this dialect should be incorporated into base R. 
- Often when googling for specific solutions and reading the Stack Overflow answers, you may find solutions which can be implemented using plain `base R` or using the `tidyverse` syntax. --- ## 2.1 What is the `tidyverse`? > The tidyverse is an opinionated **collection of R packages** designed for data science. All packages share an underlying design philosophy, grammar, and data structures. ~ [tidyverse.org](https://www.tidyverse.org/) .pull-left[ **`tidyverse` core packages:** - `readr`: data import - `tibble`: modern data frame object - `stringr`: working with strings - `forcats`: working with factors - `tidyr`: data tidying - `dplyr`: data manipulation - `ggplot2`: data visualization - `purrr`: functional programming ] .pull-right[ <img src="./img/tidyverse-hex.PNG" width="90%" height="90%" style="display: block; margin: auto;" /> ] ??? - Tidyverse can be viewed as a meta-package - each package has its own goal which makes the tidyverse a modular collection of packages - these are the core packages (there are many others for special purposes which integrate seamlessly, e.g., lubridate, readxl, haven, ...) --- ## 2.1 What is the `tidyverse`? > The tidyverse is an opinionated **collection of R packages** designed for data science. All packages share an underlying design philosophy, grammar, and data structures. 
~ [tidyverse.org](https://www.tidyverse.org/) ```r install.packages("tidyverse") library(tidyverse) ``` ``` -- Attaching packages --------------------------------------- tidyverse 1.3.1 -- v ggplot2 3.3.5 v purrr 0.3.4 v tibble 3.1.4 v dplyr 1.0.7 v tidyr 1.1.3 v stringr 1.4.0 v readr 2.0.1 v forcats 0.5.1 -- Conflicts ------------------------------------------ tidyverse_conflicts() -- x dplyr::filter() masks stats::filter() x dplyr::lag() masks stats::lag() ``` .footnote[ *Note that `install.packages("tidyverse")` is essentially equivalent to running `install.packages("ggplot2")`, `install.packages("tibble")`, `install.packages("tidyr")`, `install.packages("readr")`, etc. individually.* ] ??? - see tidyverse package version as well as the version of the eight core packages - the eight core packages are loaded by loading the `tidyverse` package - note that some `base R` functions (`stats` namespace) are masked by their `tidyverse` counterparts (`dplyr` namespace); the masked versions remain available via `stats::filter()` and `stats::lag()` - when working with a new or rarely used package, I prefer to explicitly state the namespace to remember where the function is coming from --- ## 2.1 What is the `tidyverse`? > The tidyverse is an opinionated collection of R packages designed **for data science**. All packages share an underlying design philosophy, grammar, and data structures. 
~ [tidyverse.org](https://www.tidyverse.org/) These packages are geared towards facilitating the day-to-day data science workflow: <img src="https://d33wubrfki0l68.cloudfront.net/571b056757d68e6df81a3e3853f54d3c76ad6efc/32d37/diagrams/data-science.png" width="60%" height="60%" style="float:left; padding:20px" /> <br> **Import:** `readr` **Tidy:** `tidyr` **Transform:** `dplyr`, `forcats`, `stringr` **Visualize:** `ggplot2` **Model:** `tidymodels` **Communicate:** `rmarkdown` **Program:** `magrittr`, `purrr`, `tibble` .footnote[ *Source: [Wickham/Grolemund (2017)](https://r4ds.had.co.nz/tidy-data.html).* *Note: For the communication and modeling part of the workflow refer to the `rmarkdown` and `tidymodels` videos.* ] ??? workflow here as stylized example --- ## 2.1 What is the `tidyverse`? > The tidyverse is an opinionated collection of R packages designed for data science. All packages share an **underlying design philosophy, grammar, and data structures**. ~ [tidyverse.org](https://www.tidyverse.org/) This underlying design philosophy and grammar boil down to a consistent and easy-to-use API: - The `tibble` as the core underlying data structure - Extensive use of the `%>%`-operator for gluing together multiple function calls - Consistently applied naming conventions (e.g., function names in [*snake case*](https://raw.githubusercontent.com/allisonhorst/stats-illustrations/master/other-stats-artwork/coding_cases.png)) - Consistent order of function arguments (e.g., `fn(arg1 = data, arg2 = column names, ...)`) - ... -- The `tidyverse` syntax can be viewed as a "dialect" of `R`. Once you have familiarized yourself with it, you will be able to easily transfer your knowledge about one function or package to other components of the `tidyverse`, just like learning a new language. 
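The conventions listed above can be sketched in a few lines. This is a minimal illustration, assuming the `tidyverse` and `palmerpenguins` packages are installed; the object name `adelie_mass` is made up for this example. Each function takes the data as its first argument, is named in snake case, and returns a `tibble`, so the calls chain naturally with `%>%`.

```r
library(tidyverse)
library(palmerpenguins)

# Data-first functions with snake_case names, glued together by the pipe.
# `adelie_mass` is an illustrative name, not part of any package.
adelie_mass <- penguins %>%
  filter(species == "Adelie") %>%           # dplyr: keep matching rows
  drop_na(body_mass_g) %>%                  # tidyr: drop rows with missing body mass
  summarise(mean_mass = mean(body_mass_g))  # dplyr: collapse to a single row

adelie_mass  # a tibble with one row and one column
```

Because every step accepts and returns a `tibble`, steps from different packages (`dplyr`, `tidyr`) can be mixed freely in one pipeline.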
.footnote[ *Note: For further information see [Tidyverse Team (2020)](https://design.tidyverse.org/) and [Wickham (2019)](https://cran.r-project.org/web/packages/tidyverse/vignettes/manifesto.html).* ] ??? - API: application programming interface - R data structures: atomic vector (character, integer, numeric, logical, complex), list, matrix, data frame, factors -> tibble simply an extension/better version of the data frame - snakecase: underscores, numbers and lowercase characters --- name: tidy data ## 2.2 The Concept of Tidy Data > Tidy data sets are all alike; but every messy data set is messy in its own way. ~ [Wickham/Grolemund (2017)](https://r4ds.had.co.nz/tidy-data.html) .pull-left[ **Tidy Data Principles:** The concept of tidy data has been coined by Hadley Wickham in his 2014 paper ["Tidy Data"](https://www.jstatsoft.org/article/view/v059i10). The concept formulates principles for structuring rectangular, tabular data sets consisting of rows and columns: 1. Each variable forms a column. 2. Each observation forms a row. 3. Each type of observational unit forms a table. ] .pull-right[ ``` > # A tibble: 344 x 8 > species island bill_length_mm bill_depth_mm > <fct> <fct> <dbl> <dbl> > 1 Adelie Torgersen 39.1 18.7 > 2 Adelie Torgersen 39.5 17.4 > 3 Adelie Torgersen 40.3 18 > 4 Adelie Torgersen NA NA > 5 Adelie Torgersen 36.7 19.3 > 6 Adelie Torgersen 39.3 20.6 > 7 Adelie Torgersen 38.9 17.8 > 8 Adelie Torgersen 39.2 19.6 > 9 Adelie Torgersen 34.1 18.1 > 10 Adelie Torgersen 42 20.2 > # ... with 334 more rows, and 4 more variables: > # flipper_length_mm <int>, body_mass_g <int>, > # sex <fct>, year <int> ``` ] ??? 
- 3: relates to the storage of one data set per table (analogy to principles in database design) -> here the type of observational unit might be the citizen who receives a policy treatment, e.g., tax reduction (hence information about firms might be stored in a different data frame) - all the upcoming tools are geared towards bringing data into this tabular shape (conversely, we will not work with text or image data) --- ## 2.2 The Concept of Tidy Data **Violations of the Tidy Data Principles:** 1. Column headers are values, not variable names. 2. Multiple variables are stored in one column. 3. Variables are stored in both rows and columns. 4. Multiple types of observational units are stored in the same table. 5. A single observational unit is stored in multiple tables. .panelset[ .panel[.panel-name[Example 1] ``` > # A tibble: 3 x 4 > species Biscoe Dream Torgersen > <fct> <int> <int> <int> > 1 Adelie 44 56 52 > 2 Chinstrap NA 68 NA > 3 Gentoo 124 NA NA ``` ] .panel[.panel-name[Example 2] ``` > # A tibble: 5 x 3 > col island year > <chr> <fct> <int> > 1 Gentoo_NA Biscoe 2007 > 2 Adelie_male Torgersen 2007 > 3 Gentoo_female Biscoe 2008 > 4 Chinstrap_male Dream 2008 > 5 Adelie_male Torgersen 2009 ``` ] .panel[.panel-name[Example 3] ``` > # A tibble: 3 x 4 > term bill_length_mm bill_depth_mm flipper_length_mm > <chr> <dbl> <dbl> <dbl> > 1 bill_length_mm NA -0.235 0.656 > 2 bill_depth_mm -0.235 NA -0.584 > 3 flipper_length_mm 0.656 -0.584 NA ``` ] .panel[.panel-name[Example 4] ``` > # A tibble: 6 x 6 > species island sex model mpg cyl > <fct> <fct> <fct> <chr> <dbl> <dbl> > 1 Chinstrap Dream female <NA> NA NA > 2 Gentoo Biscoe female <NA> NA NA > 3 Gentoo Biscoe male <NA> NA NA > 4 <NA> <NA> <NA> Merc 450SLC 15.2 8 > 5 <NA> <NA> <NA> Dodge Challenger 15.5 8 > 6 <NA> <NA> <NA> Pontiac Firebird 19.2 8 ``` ] .panel[.panel-name[Example 5] ``` # A tibble: 4 x 2 # A tibble: 4 x 4 species island species bill_length_mm bill_depth_mm flipper_length_mm <chr> <chr> <chr> 
<dbl> <dbl> <dbl> 1 Adelie Torgersen 1 Adelie 39.1 18.7 181 2 Adelie Torgersen 2 Adelie 39.5 17.4 186 3 Adelie Torgersen 3 Adelie 40.3 18 195 4 Adelie Torgersen 4 Adelie NA NA NA ``` ] ] --- ## 2.2 The Concept of Tidy Data <img src="https://raw.githubusercontent.com/allisonhorst/stats-illustrations/master/rstats-artwork/tidydata_3.jpg" width="80%" height="80%" style="display: block; margin: auto;" /> --- class: middle, center, inverse layout: false # 3 `palmerpenguins`:<br><br>Palmer Archipelago (Antarctica) Penguin Data --- background-image: url(https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/man/figures/logo.png) background-position: 97.5% 2.5% background-size: 7.5% layout: true --- ## 3 `palmerpenguins`: Palmer Archipelago<br>(Antarctica) Penguin Data .pull-left[ From here on, to illustrate the features of the `tidyverse` core packages we use data from the `palmerpenguins` package by [Allison Horst](https://allisonhorst.github.io/palmerpenguins/). The package comes with data about penguins observed on islands in the Palmer Archipelago near Palmer Station, Antarctica. ] .pull-right[ <img src="https://tenor.com/view/penguin-fat-the-struggle-is-real-lazy-gif-4242854.gif" width="60%" style="display: block; margin: auto;" /> ] --- ## 3 `palmerpenguins`: Palmer Archipelago<br>(Antarctica) Penguin Data .pull-left[ ```r library(palmerpenguins) penguins ``` ``` > # A tibble: 344 x 8 > species island bill_length_mm bill_depth_mm > <fct> <fct> <dbl> <dbl> > 1 Adelie Torgersen 39.1 18.7 > 2 Adelie Torgersen 39.5 17.4 > 3 Adelie Torgersen 40.3 18 > 4 Adelie Torgersen NA NA > 5 Adelie Torgersen 36.7 19.3 > 6 Adelie Torgersen 39.3 20.6 > 7 Adelie Torgersen 38.9 17.8 > 8 Adelie Torgersen 39.2 19.6 > 9 Adelie Torgersen 34.1 18.1 > 10 Adelie Torgersen 42 20.2 > # ... 
with 334 more rows, and 4 more variables: > # flipper_length_mm <int>, body_mass_g <int>, > # sex <fct>, year <int> ``` ] .pull-right[ <img src="https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/man/figures/lter_penguins.png" width="65%" height="65%" style="display: block; margin: auto;" /><img src="https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/man/figures/culmen_depth.png" width="65%" height="65%" style="display: block; margin: auto;" /> ] --- class: middle, center, inverse layout: false # 4.1 `magrittr`:<br><br>A Forward-Pipe Operator for `R` --- background-image: url(https://raw.githubusercontent.com/tidyverse/magrittr/master/man/figures/logo.png) background-position: 95% 5% background-size: 7.5% layout: true --- ## 4.1 `magrittr`: The Forward-Pipe Operator `magrittr` comes with a set of operators: - **Pipe Operator:** `%>%`<br><br> - **Assignment Operator:** `%<>%`<br><br> - **"Tee" Operator:** `%T>%`<br><br> - **Exposition Operator:** `%$%` -- <br> Essentially, these operators aim to improve the readability of your code in multiple ways: - arrange operations into an easily readable pipeline of chained commands (left-to-right), - avoid nested function calls (inside-out), - minimize the use of local variable assignments (`<-`) and function definitions, and - easily add and/or delete steps in your pipeline without breaking the code. ??? The exposition operator: %$% (explodes out variables in a data frame, no need to use pull()) --- ## 4.1 `magrittr`: The Forward-Pipe Operator **Basic Piping:** forward a value or object (LHS) into the next function call (RHS) as **first** argument ```r x %>% f # equivalent to: f(x) x %>% f(y) # equivalent to: f(x, y) x %>% f %>% g %>% h # equivalent to: h(g(f(x))) ``` -- **Piping with placeholders:** forward a value or object (LHS) into the next function call (RHS) as **any** argument ```r x %>% f(.) # equivalent to: x %>% f x %>% f(y, .) # equivalent to: f(y, x) x %>% f(y, z = .) 
# equivalent to: f(y, z = x) x %>% f(y = nrow(.), z = ncol(.)) # equivalent to: f(x, y = nrow(x), z = ncol(x)) ``` -- **Building functions and pipelines:** a sequence of code starting with the placeholder (`.`) returns a function which can be used to later apply the pipeline to concrete values ```r f <- . %>% cos %>% sin # equivalent to: f <- function(.) sin(cos(.)) f(20) # equivalent to: the pipeline 20 %>% cos %>% sin ``` .footnote[ *Note: Find out more about `%>%` by running `vignette("magrittr")`. Type `%>%` using the shortcut: Ctrl + Shift + M.* ] --- ## 4.1 `magrittr`: The Forward-Pipe Operator **Question:** What is the average body mass in grams for all penguins observed in the year 2007 (after excluding missing values)? **In a pipeless world:** ```r mean(subset(penguins, year == 2007)$body_mass_g, na.rm = T) # alternatively: peng_bm_2007 <- subset(penguins, year == 2007)$body_mass_g mean(peng_bm_2007, na.rm = T) ``` -- .pull-left[ **In a world full of pipes:** ```r penguins %>% subset(year == 2007) %>% .$body_mass_g %>% mean(na.rm = T) ``` ] .pull-right[ <br> - Sequential style improves readability! - Less deciphering of nested function calls! - No need to store intermediate results! - Modular modification of pipeline steps! ] .footnote[ *Note: As of version `4.1.0`, base `R` comes with a native pipe operator as well (`|>`).* ] ??? - Add or remove individual steps easily in your pipeline - The `magrittr` forward pipe is imported by the `tidyverse`, no need to load it separately --- ## 4.1 `magrittr`: The Forward-Pipe Operator **Advanced piping:** Use the more advanced pipe operators to further streamline your workflow. .panelset[ .panel[.panel-name[Tee Pipe] `%T>%` can be used to trigger the side-effect of a function, e.g., for plotting or printing results, and let the original data bypass the respective step. 
```r penguins[1:5, c("island", "bill_length_mm")] %T>% print %>% .$bill_length_mm %>% mean(na.rm=T) ``` ``` > # A tibble: 5 x 2 > island bill_length_mm > <fct> <dbl> > 1 Torgersen 39.1 > 2 Torgersen 39.5 > 3 Torgersen 40.3 > 4 Torgersen NA > 5 Torgersen 36.7 ``` ``` > [1] 38.9 ``` ] .panel[.panel-name[Exposition Pipe] `%$%` exposes the names in the LHS object to the RHS expression. This is useful if the RHS function does not provide a separate `data` argument. ```r penguins %$% plot(species, bill_length_mm) # equivalent to: plot(penguins$species, penguins$bill_length_mm) ``` <img src="index_files/figure-html/unnamed-chunk-35-1.png" width="432" /> ] .panel[.panel-name[Assignment Pipe] `%<>%` combines piping with the base `R` assignment operator (`<-`). It pipes the starting variable into the pipeline and reassigns the result of the pipeline to that variable, i.e., `var %<>% f` is shorthand for `var <- var %>% f`. ```r var <- penguins$bill_length_mm var %<>% mean(na.rm=T) var ``` ``` > [1] 43.92193 ``` ] ] --- class: middle, center, inverse layout: false # 4.2 `tibble`:<br><br>Simple Data Frames --- background-image: url(https://raw.githubusercontent.com/tidyverse/tibble/master/man/figures/logo.png) background-position: 97.5% 2.5% background-size: 7.5% layout: true --- ## 4.2 `tibble`: Simple Data Frames `tibble` provides an enhanced data frame object of class `tbl_df`, a so-called `tibble`. A `tibble` can be created in four different ways. .panelset[ .panel[.panel-name[tibble()] Create a `tibble` from column vectors with `tibble()`. ```r tibble( x = c("a", "b"), y = c(1, 2), z = c(T, F) ) ``` ``` > # A tibble: 2 x 3 > x y z > <chr> <dbl> <lgl> > 1 a 1 TRUE > 2 b 2 FALSE ``` ] .panel[.panel-name[tribble()] Create a *transposed* `tibble` row by row with `tribble()`. ```r tribble( ~x, ~y, ~z, "a", 1, T, "b", 2, F ) ``` ``` > # A tibble: 2 x 3 > x y z > <chr> <dbl> <lgl> > 1 a 1 TRUE > 2 b 2 FALSE ``` ] .panel[.panel-name[as_tibble()] Create a `tibble` from an existing data frame with `as_tibble()`. 
```r data.frame( x = c("a", "b"), y = c(1, 2), z = c(T, F) ) %>% as_tibble ``` ``` > # A tibble: 2 x 3 > x y z > <chr> <dbl> <lgl> > 1 a 1 TRUE > 2 b 2 FALSE ``` ] .panel[.panel-name[enframe()] Create a `tibble` from a named vector with `enframe()`. ```r c(x = "a", y = "b", z = 1) %>% enframe(name = "x", value = "y") ``` ``` > # A tibble: 3 x 2 > x y > <chr> <chr> > 1 x a > 2 y b > 3 z 1 ``` ] ] -- There are three important differences between a `tibble` and a `data.frame` object. ??? - named vector: a vector of key-value pairs --- ## 4.2 `tibble`: Simple Data Frames **Printing:** By default, a `tibble` prints only the first ten rows and all the columns that fit on the screen, together with the data type of each column. This gives you a much more concise view of your data. ```r penguins ``` ``` > # A tibble: 344 x 8 > species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex > <fct> <fct> <dbl> <dbl> <int> <int> <fct> > 1 Adelie Torgersen 39.1 18.7 181 3750 male > 2 Adelie Torgersen 39.5 17.4 186 3800 fema~ > 3 Adelie Torgersen 40.3 18 195 3250 fema~ > 4 Adelie Torgersen NA NA NA NA <NA> > 5 Adelie Torgersen 36.7 19.3 193 3450 fema~ > 6 Adelie Torgersen 39.3 20.6 190 3650 male > 7 Adelie Torgersen 38.9 17.8 181 3625 fema~ > 8 Adelie Torgersen 39.2 19.6 195 4675 male > 9 Adelie Torgersen 34.1 18.1 193 3475 <NA> > 10 Adelie Torgersen 42 20.2 190 4250 <NA> > # ... with 334 more rows, and 1 more variable: year <int> ``` ??? - you will never again have the problem that `R` takes minutes to print a large data frame entirely to your console (`reached 'max' / getOption("max.print")`) --- ## 4.2 `tibble`: Simple Data Frames **Printing:** By default, a `tibble` prints only the first ten rows and all the columns that fit on the screen, together with the data type of each column. 
.panelset[ .panel[.panel-name[data.frame()] ```r data.frame(penguins) ``` ``` > species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g > 1 Adelie Torgersen 39.1 18.7 181 3750 > 2 Adelie Torgersen 39.5 17.4 186 3800 > 3 Adelie Torgersen 40.3 18.0 195 3250 > 4 Adelie Torgersen NA NA NA NA > 5 Adelie Torgersen 36.7 19.3 193 3450 > sex year > 1 male 2007 > 2 female 2007 > 3 female 2007 > 4 <NA> 2007 > 5 female 2007 > [ reached 'max' / getOption("max.print") -- omitted 339 rows ] ``` ] .panel[.panel-name[tibble() (Option 1)] ```r penguins ``` ``` > # A tibble: 344 x 8 > species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex > <fct> <fct> <dbl> <dbl> <int> <int> <fct> > 1 Adelie Torgersen 39.1 18.7 181 3750 male > 2 Adelie Torgersen 39.5 17.4 186 3800 fema~ > 3 Adelie Torgersen 40.3 18 195 3250 fema~ > 4 Adelie Torgersen NA NA NA NA <NA> > 5 Adelie Torgersen 36.7 19.3 193 3450 fema~ > 6 Adelie Torgersen 39.3 20.6 190 3650 male > 7 Adelie Torgersen 38.9 17.8 181 3625 fema~ > 8 Adelie Torgersen 39.2 19.6 195 4675 male > 9 Adelie Torgersen 34.1 18.1 193 3475 <NA> > 10 Adelie Torgersen 42 20.2 190 4250 <NA> > # ... with 334 more rows, and 1 more variable: year <int> ``` ] .panel[.panel-name[tibble() (Option 2)] ```r penguins %>% glimpse ``` ``` > Rows: 344 > Columns: 8 > $ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, A~ > $ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgersen, To~ > $ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, 42.0,~ > $ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, 20.2,~ > $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186, 180~ > $ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, 4250,~ > $ sex <fct> male, female, female, NA, female, male, female, male, NA,~ > $ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 200~ ``` ] ] ??? 
- in contrast to `data.frame()`, which prints an extensive number of rows - `glimpse()`: transposed version of `print()` --- ## 4.2 `tibble`: Simple Data Frames **Subsetting:** Subsetting a `tibble` (`[]`) always returns another `tibble` and never a vector (in contrast to standard `data.frame` objects). .panelset[ .panel[.panel-name[data.frame()] ```r data.frame(penguins) %>% .[,"species"] %>% class ``` ``` > [1] "factor" ``` ] .panel[.panel-name[tibble()] ```r penguins[,"species"] %>% class ``` ``` > [1] "tbl_df" "tbl" "data.frame" ``` ] ] --- ## 4.2 `tibble`: Simple Data Frames **Partial Matching:** Subsetting a `tibble` does not allow for partial matching, i.e., you must always provide the whole column name. .panelset[ .panel[.panel-name[data.frame()] ```r data.frame(penguins)$spec ``` ``` > [1] Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie > [12] Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie > [23] Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie > [34] Adelie Adelie Adelie Adelie Adelie Adelie Adelie > [ reached getOption("max.print") -- omitted 304 entries ] > Levels: Adelie Chinstrap Gentoo ``` ] .panel[.panel-name[tibble()] ```r penguins$spec ``` ``` > Warning: Unknown or uninitialised column: `spec`. ``` ``` > NULL ``` ] ] ??? - also an advantage of tibbles: they give you better warning messages that confront you with problems early on. --- class: middle, center, inverse layout: false # 4.3 `readr`:<br><br>Read Rectangular Text Data ??? 
- not data in the form of texts, but data as stored in a text file (e.g., txt, csv) --- background-image: url(https://raw.githubusercontent.com/tidyverse/readr/master/man/figures/logo.png) background-position: 97.5% 2.5% background-size: 7.5% layout: true --- ## 4.3 `readr`: Read Rectangular Text Data `readr` provides read and write functions for several file formats: - `read_delim()`: general delimited files - `read_csv()`: comma separated files - `read_csv2()`: semicolon separated files - `read_tsv()`: tab separated files - `read_fwf()`: fixed width files - `read_table()`: white-space separated files - `read_log()`: web log files Conveniently, the corresponding `write_*()` functions work analogously. In addition, use the `readxl` package for Excel files, the `haven` package for Stata files, the `googlesheets4` package for Google Sheets or the `rvest` package for HTML files. .footnote[ *Note: In most European countries Microsoft Excel uses `;` as the default delimiter, which can be accounted for by using the `read_csv2()` function.* ] ??? - `read_delim()` as a generalization of the other functions - `rvest` as the go-to package in the context of web scraping with `R` --- ## 4.3 `readr`: Read Rectangular Text Data Let's try it out by reading in the penguins data. For the purpose of illustrating the `readr` package, the `penguins` data is written to a csv-file beforehand using `write_csv(penguins, file = "./data/penguins.csv")`. .panelset[ .panel[.panel-name[Base Case] ```r data <- read_csv(file = "./data/penguins.csv") ``` ``` > Rows: 344 Columns: 8 ``` ``` > -- Column specification ------------------------------------------------------------- > Delimiter: "," > chr (3): species, island, sex > dbl (5): bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g, year ``` ``` > > i Use `spec()` to retrieve the full column specification for this data. > i Specify the column types or set `show_col_types = FALSE` to quiet this message. 
``` ] .panel[.panel-name[Select Columns] ```r data <- read_csv(file = "./data/penguins.csv", col_select = c(species, island)) ``` ``` > Rows: 344 Columns: 2 ``` ``` > -- Column specification ------------------------------------------------------------- > Delimiter: "," > chr (2): species, island ``` ``` > > i Use `spec()` to retrieve the full column specification for this data. > i Specify the column types or set `show_col_types = FALSE` to quiet this message. ``` ] .panel[.panel-name[Name Columns] ```r data <- read_csv(file = "./data/penguins.csv", col_names = paste("Var", 1:8, sep = "_")) ``` ``` > Rows: 345 Columns: 8 ``` ``` > -- Column specification ------------------------------------------------------------- > Delimiter: "," > chr (8): Var_1, Var_2, Var_3, Var_4, Var_5, Var_6, Var_7, Var_8 ``` ``` > > i Use `spec()` to retrieve the full column specification for this data. > i Specify the column types or set `show_col_types = FALSE` to quiet this message. ``` ] .panel[.panel-name[Skip Rows] ```r data <- read_csv(file = "./data/penguins.csv", skip = 5) ``` ``` > Rows: 339 Columns: 8 ``` ``` > -- Column specification ------------------------------------------------------------- > Delimiter: "," > chr (3): Adelie, Torgersen, female > dbl (5): 36.7, 19.3, 193, 3450, 2007 ``` ``` > > i Use `spec()` to retrieve the full column specification for this data. > i Specify the column types or set `show_col_types = FALSE` to quiet this message. ``` ] ] .pull-right[.pull-right[ .footnote[ <i>Note: The output of any `read_*()` function is a `tibble` object.</i> ]]] --- ## 4.3 `readr`: Read Rectangular Text Data `readr` prints the column specifications after importing. By default, it tries to infer the column type (e.g., `dbl`, `chr`, `lgl`, `date`; factors are never guessed) from the first 1,000 rows and parses the columns accordingly. Try to make column specifications explicit! You will likely become more familiar with your data and will see warnings if something changes unexpectedly. 
.panelset[ .panel[.panel-name[Option 1] ```r read_csv( file = "./data/penguins.csv", col_types = cols( species = col_character(), year = col_datetime(format = "%Y"), island = col_skip()) ) ``` ] .panel[.panel-name[Option 2] ```r read_csv( file = "./data/penguins.csv", col_types = "_f?di" # skip, factor, guess, double, integer, ... ) ``` ] ] Guessing from only the first 1,000 rows is efficient but can lead to erroneous guesses; increase `guess_max` if necessary: ```r read_csv(file = "./data/penguins.csv", guess_max = 2000) ``` .footnote[ *Note: Find more information and functions on the `readr` [cheat sheet](https://raw.githubusercontent.com/rstudio/cheatsheets/master/data-import.pdf).* ] ??? - Hint: sometimes you may have trouble when reading in text data (type character): special characters such as ö, ä or ü may be strangely encoded as cryptic symbols -> in those cases you must specify the encoding of your data in the `read_csv()` call (e.g., UTF-8) --- ## 4.3 `readr`: Read Rectangular Text Data .pull-left[ Eventually, you will want to stop relying on `.xlsx` and `.csv` files as they are not capable of reliably storing your metadata (e.g., data types). <img src="./img/excel.jpg" width="60%" height="60%" style="display: block; margin: auto;" /> ] -- .pull-right[ `write_rds()` and `read_rds()` provide a nice alternative for [serializing](https://en.wikipedia.org/wiki/Serialization) your `R` objects (e.g., `tibbles`, models) and storing them as `.rds` files. ```r penguins %>% write_rds(file = "./data/penguins.rds") ``` ```r penguins <- read_rds(file = "./data/penguins.rds") ``` <br> Note that - `write_rds()` can only be used to save one object at a time, - the object returned by `read_rds()` must be assigned to a variable, i.e., it is not restored under its original name, - `read_rds()` preserves data types! ] ??? - serialization: the process of translating a data structure or object state into a format that can be stored, transmitted and reconstructed later (possibly in a different computer environment). 
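- a small round-trip sketch to demo the last point, assuming the `tidyverse` and `palmerpenguins` packages are installed; the temporary file paths and the names `from_csv` and `from_rds` are illustrative assumptions and not part of the slides:

```r
library(tidyverse)
library(palmerpenguins)

# Illustrative temporary files (assumption: any writable path works)
csv_path <- tempfile(fileext = ".csv")
rds_path <- tempfile(fileext = ".rds")

write_csv(penguins, csv_path)
write_rds(penguins, rds_path)

from_csv <- read_csv(csv_path, show_col_types = FALSE)
from_rds <- read_rds(rds_path)

class(from_csv$species)  # the factor is read back as a character column
class(from_rds$species)  # the factor survives the .rds round trip
```

- the `.csv` round trip loses the factor and integer types unless `col_types` is specified, while the `.rds` round trip returns the object as it was saved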
--- class: middle, center, inverse layout: false # 4.4 `tidyr`:<br><br>Tidy Messy Data --- background-image: url(https://raw.githubusercontent.com/tidyverse/tidyr/master/man/figures/logo.png) background-position: 97.5% 2.5% background-size: 7.5% layout: true --- ## 4.4 `tidyr`: Tidy Messy Data `tidyr` provides several functions that help you bring your data into the *tidy data* format (e.g., reshaping data, splitting columns, handling missing values or nesting data). ```r penguins ``` ``` > # A tibble: 344 x 8 > species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex > <fct> <fct> <dbl> <dbl> <int> <int> <fct> > 1 Adelie Torgersen 39.1 18.7 181 3750 male > 2 Adelie Torgersen 39.5 17.4 186 3800 fema~ > 3 Adelie Torgersen 40.3 18 195 3250 fema~ > 4 Adelie Torgersen NA NA NA NA <NA> > 5 Adelie Torgersen 36.7 19.3 193 3450 fema~ > 6 Adelie Torgersen 39.3 20.6 190 3650 male > 7 Adelie Torgersen 38.9 17.8 181 3625 fema~ > 8 Adelie Torgersen 39.2 19.6 195 4675 male > 9 Adelie Torgersen 34.1 18.1 193 3475 <NA> > 10 Adelie Torgersen 42 20.2 190 4250 <NA> > # ... with 334 more rows, and 1 more variable: year <int> ``` ??? - Let's again start with our `penguins` data set, which is already in *tidy data* format - in the following I highlight the dimensionality of the data to show you what happens DIM: 344 x 8 --- ## 4.4 `tidyr`: Tidy Messy Data **Pivoting:** Converts between long and wide format using `pivot_longer()` and `pivot_wider()`. 
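As a small illustration of how pivoting repairs a tidy data violation, the following sketch rebuilds the wide count table from Example 1 in Section 2.2 (column headers are values) and brings it into tidy format. It assumes the `tidyverse` is installed; the object names `wide_counts` and `tidy_counts` and the column name `n` are chosen for this example only.

```r
library(tidyverse)

# Wide table in which the column headers (islands) are values, not variables
wide_counts <- tribble(
  ~species,    ~Biscoe, ~Dream, ~Torgersen,
  "Adelie",         44,     56,         52,
  "Chinstrap",      NA,     68,         NA,
  "Gentoo",        124,     NA,         NA
)

# Move the island names into their own column and the counts into `n`;
# species/island combinations that were never observed are dropped
tidy_counts <- wide_counts %>%
  pivot_longer(
    cols = c(Biscoe, Dream, Torgersen),
    names_to = "island",
    values_to = "n",
    values_drop_na = TRUE
  )

tidy_counts
```

After pivoting, each row is one species-island combination and each column is one variable (`species`, `island`, `n`).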
.panelset[ .panel[.panel-name[pivot_longer()] ```r long_penguins <- penguins %>% pivot_longer( cols = c(species, island), names_to = "variable", values_to = "value" ) long_penguins %>% glimpse ``` ``` > Rows: 688 > Columns: 8 > $ bill_length_mm <dbl> 39.1, 39.1, 39.5, 39.5, 40.3, 40.3, NA, NA, 36.7, 36.7, 3~ > $ bill_depth_mm <dbl> 18.7, 18.7, 17.4, 17.4, 18.0, 18.0, NA, NA, 19.3, 19.3, 2~ > $ flipper_length_mm <int> 181, 181, 186, 186, 195, 195, NA, NA, 193, 193, 190, 190,~ > $ body_mass_g <int> 3750, 3750, 3800, 3800, 3250, 3250, NA, NA, 3450, 3450, 3~ > $ sex <fct> male, male, female, female, female, female, NA, NA, femal~ > $ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 200~ > $ variable <chr> "species", "island", "species", "island", "species", "isl~ > $ value <fct> Adelie, Torgersen, Adelie, Torgersen, Adelie, Torgersen, ~ ``` ] .panel[.panel-name[pivot_wider()] ```r long_penguins %>% pivot_wider( names_from = "variable", values_from = "value" ) %>% glimpse ``` ``` > Rows: 344 > Columns: 8 > $ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, 42.0,~ > $ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, 20.2,~ > $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186, 180~ > $ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, 4250,~ > $ sex <fct> male, female, female, NA, female, male, female, male, NA,~ > $ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 200~ > $ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, A~ > $ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgersen, To~ ``` ] ] ??? 
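A smaller sketch of the same round trip may help to follow how the dimensions change. The `toy` tibble is hypothetical, and its `id` column is an added assumption that makes each observation uniquely identifiable when pivoting back.

```r
library(tidyr)
library(tibble)

# Hypothetical toy data with an explicit `id` column per observation
toy <- tibble(
  id      = 1:2,
  species = c("Adelie", "Gentoo"),
  island  = c("Torgersen", "Biscoe")
)

# Long format: one row per observation and pivoted variable
long <- toy %>%
  pivot_longer(cols = c(species, island), names_to = "variable", values_to = "value")
dim(long) # 4 rows, 3 columns (id, variable, value)

# Wide format: back to one row per observation
wide <- long %>%
  pivot_wider(names_from = variable, values_from = value)
dim(wide) # 2 rows, 3 columns
```

Pivoting two columns doubles the number of rows, just as the 344 rows of `penguins` become 688 rows in `long_penguins`.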
`pivot_longer()`: - now for each observation we have two rows, one row per pivoted variable -> no longer in tidy format - DIM: 688 x 8 `pivot_wider()` - inverts `pivot_longer()` - DIM: 344 x 8 --- ## 4.4 `tidyr`: Tidy Messy Data .right[ <img src="https://raw.githubusercontent.com/apreshill/teachthat/master/pivot/pivot_longer_smaller.gif" width="80%" height="80%" /> ] .footnote[.pull-left[ *Source: [Alison Hill](https://github.com/apreshill/teachthat/blob/master/pivot/pivot_longer_smaller.gif)* <i>Note: Find more information about `pivot_*()` in the [pivoting vignette](https://tidyr.tidyverse.org/articles/pivot.html).</i> ]] --- name: tidyr_nest ## 4.4 `tidyr`: Tidy Messy Data **Nesting:** Groups similar data such that each group becomes a single row in a data frame. ```r nested_penguins <- penguins %>% nest(nested_data = c(island, bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g, sex)) nested_penguins ``` ``` > # A tibble: 9 x 3 > species year nested_data > <fct> <int> <list> > 1 Adelie 2007 <tibble [50 x 6]> > 2 Adelie 2008 <tibble [50 x 6]> > 3 Adelie 2009 <tibble [52 x 6]> > 4 Gentoo 2007 <tibble [34 x 6]> > 5 Gentoo 2008 <tibble [46 x 6]> > 6 Gentoo 2009 <tibble [44 x 6]> > 7 Chinstrap 2007 <tibble [26 x 6]> > 8 Chinstrap 2008 <tibble [18 x 6]> > 9 Chinstrap 2009 <tibble [24 x 6]> ``` ??? - note that `nest()` produces a nested data frame with one row per species and year - note that the `nested_data` column contains `tibbles` with six columns each and a varying number of observations - working with nested data can be particularly helpful if you would like to apply functions to each subset of the data (e.g., fit a model for each year or for each species) --- name: nested-data ## 4.4 `tidyr`: Tidy Messy Data **Rectangling:** Disentangles nested data structures (e.g., JSON, HTML) and brings them into *tidy data* format. .panelset[ .panel[.panel-name[pluck()] Extract individual objects from a nested data structure via `purrr::pluck()`.
```r nested_penguins %>% purrr::pluck("nested_data", 1) ``` ``` > # A tibble: 50 x 6 > island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex > <fct> <dbl> <dbl> <int> <int> <fct> > 1 Torgersen 39.1 18.7 181 3750 male > 2 Torgersen 39.5 17.4 186 3800 female > 3 Torgersen 40.3 18 195 3250 female > 4 Torgersen NA NA NA NA <NA> > 5 Torgersen 36.7 19.3 193 3450 female > 6 Torgersen 39.3 20.6 190 3650 male > 7 Torgersen 38.9 17.8 181 3625 female > 8 Torgersen 39.2 19.6 195 4675 male > 9 Torgersen 34.1 18.1 193 3475 <NA> > 10 Torgersen 42 20.2 190 4250 <NA> > # ... with 40 more rows ``` ] .panel[.panel-name[unnest()] Flatten nested data structures via `tidyr::unnest()`. ```r nested_penguins %>% unnest(cols = c(nested_data)) ``` ``` > # A tibble: 344 x 8 > species year island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g > <fct> <int> <fct> <dbl> <dbl> <int> <int> > 1 Adelie 2007 Torgersen 39.1 18.7 181 3750 > 2 Adelie 2007 Torgersen 39.5 17.4 186 3800 > 3 Adelie 2007 Torgersen 40.3 18 195 3250 > 4 Adelie 2007 Torgersen NA NA NA NA > 5 Adelie 2007 Torgersen 36.7 19.3 193 3450 > 6 Adelie 2007 Torgersen 39.3 20.6 190 3650 > 7 Adelie 2007 Torgersen 38.9 17.8 181 3625 > 8 Adelie 2007 Torgersen 39.2 19.6 195 4675 > 9 Adelie 2007 Torgersen 34.1 18.1 193 3475 > 10 Adelie 2007 Torgersen 42 20.2 190 4250 > # ... with 334 more rows, and 1 more variable: sex <fct> ``` ] .panel[.panel-name[hoist()] Selectively extract individual components from an object in a nested data structure via `tidyr::hoist()`. 
```r nested_penguins %>% hoist(nested_data, hoisted_col = "bill_length_mm") ``` ``` > # A tibble: 9 x 4 > species year hoisted_col nested_data > <fct> <int> <list> <list> > 1 Adelie 2007 <dbl [50]> <tibble [50 x 5]> > 2 Adelie 2008 <dbl [50]> <tibble [50 x 5]> > 3 Adelie 2009 <dbl [52]> <tibble [52 x 5]> > 4 Gentoo 2007 <dbl [34]> <tibble [34 x 5]> > 5 Gentoo 2008 <dbl [46]> <tibble [46 x 5]> > 6 Gentoo 2009 <dbl [44]> <tibble [44 x 5]> > 7 Chinstrap 2007 <dbl [26]> <tibble [26 x 5]> > 8 Chinstrap 2008 <dbl [18]> <tibble [18 x 5]> > 9 Chinstrap 2009 <dbl [24]> <tibble [24 x 5]> ``` ] ] ??? Alternatively use `unnest_wider()` or `unnest_longer()` for more control over the rectangling operation. --- ## 4.4 `tidyr`: Tidy Messy Data **Splitting** and **Combining:** Transforms a single character column into multiple columns and vice versa. .panelset[ .panel[.panel-name[unite()] Collapse multiple columns into a single column. ```r penguins %>% unite(col = "species_gender", c(species, sex), sep = "_", remove = T) ``` ``` > # A tibble: 344 x 7 > species_gender island bill_length_mm bill_depth_mm flipper_length_~ body_mass_g > <chr> <fct> <dbl> <dbl> <int> <int> > 1 Adelie_male Torgersen 39.1 18.7 181 3750 > 2 Adelie_female Torgersen 39.5 17.4 186 3800 > 3 Adelie_female Torgersen 40.3 18 195 3250 > 4 Adelie_NA Torgersen NA NA NA NA > 5 Adelie_female Torgersen 36.7 19.3 193 3450 > 6 Adelie_male Torgersen 39.3 20.6 190 3650 > 7 Adelie_female Torgersen 38.9 17.8 181 3625 > 8 Adelie_male Torgersen 39.2 19.6 195 4675 > 9 Adelie_NA Torgersen 34.1 18.1 193 3475 > 10 Adelie_NA Torgersen 42 20.2 190 4250 > # ... with 334 more rows, and 1 more variable: year <int> ``` ] .panel[.panel-name[separate()] Separate a single column, containing multiple values, into multiple columns. 
```r penguins %>% separate(bill_length_mm, sep = 2, into = c("cm", "mm")) ``` ``` > # A tibble: 344 x 9 > species island cm mm bill_depth_mm flipper_length_~ body_mass_g sex year > <fct> <fct> <chr> <chr> <dbl> <int> <int> <fct> <int> > 1 Adelie Torger~ 39 ".1" 18.7 181 3750 male 2007 > 2 Adelie Torger~ 39 ".5" 17.4 186 3800 fema~ 2007 > 3 Adelie Torger~ 40 ".3" 18 195 3250 fema~ 2007 > 4 Adelie Torger~ <NA> <NA> NA NA NA <NA> 2007 > 5 Adelie Torger~ 36 ".7" 19.3 193 3450 fema~ 2007 > 6 Adelie Torger~ 39 ".3" 20.6 190 3650 male 2007 > 7 Adelie Torger~ 38 ".9" 17.8 181 3625 fema~ 2007 > 8 Adelie Torger~ 39 ".2" 19.6 195 4675 male 2007 > 9 Adelie Torger~ 34 ".1" 18.1 193 3475 <NA> 2007 > 10 Adelie Torger~ 42 "" 20.2 190 4250 <NA> 2007 > # ... with 334 more rows ``` ] .panel[.panel-name[separate_rows()] Separate a single column, containing multiple values, into multiple rows. ```r penguins %>% separate_rows(island, sep = "s", convert = T) ``` ``` > # A tibble: 564 x 8 > species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex > <fct> <chr> <dbl> <dbl> <int> <int> <fct> > 1 Adelie Torger 39.1 18.7 181 3750 male > 2 Adelie en 39.1 18.7 181 3750 male > 3 Adelie Torger 39.5 17.4 186 3800 female > 4 Adelie en 39.5 17.4 186 3800 female > 5 Adelie Torger 40.3 18 195 3250 female > 6 Adelie en 40.3 18 195 3250 female > 7 Adelie Torger NA NA NA NA <NA> > 8 Adelie en NA NA NA NA <NA> > 9 Adelie Torger 36.7 19.3 193 3450 female > 10 Adelie en 36.7 19.3 193 3450 female > # ... with 554 more rows, and 1 more variable: year <int> ``` ] ] ??? can also `separate` based on character match --- ## 4.4 `tidyr`: Tidy Messy Data **Handling missing values:** Drop or replace explicit or implicit missing values (`NA`). 
.panelset[ .panel[.panel-name[Base Case] ```r incompl_penguins ``` ``` > # A tibble: 4 x 3 > species year measurement > <chr> <dbl> <dbl> > 1 Adelie 2007 31.0 > 2 Adelie 2008 39.7 > 3 Gentoo 2008 43.3 > 4 Chinstrap 2007 NA ``` ] .panel[.panel-name[complete()] Make implicit missing values explicit. ```r incompl_penguins %>% complete(species, year, fill = list(measurement = NA)) ``` ``` > # A tibble: 6 x 3 > species year measurement > <chr> <dbl> <dbl> > 1 Adelie 2007 31.0 > 2 Adelie 2008 39.7 > 3 Chinstrap 2007 NA > 4 Chinstrap 2008 NA > 5 Gentoo 2007 NA > 6 Gentoo 2008 43.3 ``` .pull-right[.pull-right[.footnote[ ]]] ] .panel[.panel-name[drop_na()] Make explicit missing values implicit. ```r incompl_penguins %>% drop_na(measurement) ``` ``` > # A tibble: 3 x 3 > species year measurement > <chr> <dbl> <dbl> > 1 Adelie 2007 31.0 > 2 Adelie 2008 39.7 > 3 Gentoo 2008 43.3 ``` ] .panel[.panel-name[fill()] Replace missing values with the next/previous value. ```r incompl_penguins %>% fill(measurement, .direction = "down") ``` ``` > # A tibble: 4 x 3 > species year measurement > <chr> <dbl> <dbl> > 1 Adelie 2007 31.0 > 2 Adelie 2008 39.7 > 3 Gentoo 2008 43.3 > 4 Chinstrap 2007 43.3 ``` ] .panel[.panel-name[replace_na()] Replace missing values with a pre-defined value. ```r incompl_penguins %>% replace_na(replace = list(measurement = mean(.$measurement, na.rm = T))) ``` ``` > # A tibble: 4 x 3 > species year measurement > <chr> <dbl> <dbl> > 1 Adelie 2007 31.0 > 2 Adelie 2008 39.7 > 3 Gentoo 2008 43.3 > 4 Chinstrap 2007 38.0 ``` ] ] .footnote[ *Note: Find more information and functions on the `tidyr` [cheat sheet](https://raw.githubusercontent.com/rstudio/cheatsheets/master/data-import.pdf).* ] ??? Note: function arguments preceded by a dot in the tidyverse may have one of two reasons: - the function is still pre-mature, i.e. 
developers still think about the best way of implementing and naming the function - the function is regularly applied within another function so that you don't confuse function arguments between the inner and outer function --- class: middle, center, inverse layout: false # 4.5 `dplyr`:<br><br>A Grammar of Data Manipulation --- background-image: url(https://raw.githubusercontent.com/tidyverse/dplyr/master/man/figures/logo.png) background-position: 97.5% 2.5% background-size: 7.5% layout: true --- ## 4.5 `dplyr`: A Grammar of Data Manipulation `dplyr` provides a set of functions for manipulating data frame objects (e.g., `tibbles`) while relying on a consistent grammar. Functions are intuitively represented by "verbs" that reflect the underlying operations and always output a new or modified `tibble`. **Operations on rows:** - `filter()` picks rows that meet one or several logical criteria - `slice()` picks rows based on their location in the data - `arrange()` changes the order of rows **Operations on columns:** - `select()` picks or drops certain columns - `rename()` changes the column names - `relocate()` changes the order of columns - `mutate()` transforms the column values and/or creates new columns **Operations on grouped data:** - `group_by()` partitions data based on one or several columns - `summarise()` reduces a group of data into a single row --- ## 4.5 `dplyr`: A Grammar of Data Manipulation <img src="https://raw.githubusercontent.com/allisonhorst/stats-illustrations/master/rstats-artwork/dplyr_filter.jpg" width="40%" height="40%" style="float:right; padding:10px" /> **Operations on rows:** `filter()` picks rows that meet one or several logical criteria Filter for all penguins of `species` "Adelie". ```r penguins %>% filter(species == "Adelie") ``` Filter for all penguins with a missing value in the `bill_length_mm` measurement.
```r penguins %>% filter(is.na(bill_length_mm) == T) # filter(!is.na(bill_length_mm) == F) ``` Filter for all penguins observed prior to `year` 2008 or subsequent to `year` 2008 and where the body mass (`body_mass_g`) lies between 3,800 and 4,000 grams. ```r penguins %>% filter(between(body_mass_g, 3800, 4000) & (year < 2008 | year > 2008)) ``` ??? - Note that using `=` instead of `==` is a common mistake for beginners (`<-` = `=`). --- ## 4.5 `dplyr`: A Grammar of Data Manipulation **Operations on rows:** `slice()` picks rows based on their location in the data .panelset[ .panel[.panel-name[slice()] Pick rows based on their index. ```r penguins %>% slice(23:27) ``` ``` > # A tibble: 5 x 8 > species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex > <fct> <fct> <dbl> <dbl> <int> <int> <fct> > 1 Adelie Biscoe 35.9 19.2 189 3800 female > 2 Adelie Biscoe 38.2 18.1 185 3950 male > 3 Adelie Biscoe 38.8 17.2 180 3800 male > 4 Adelie Biscoe 35.3 18.9 187 3800 female > 5 Adelie Biscoe 40.6 18.6 183 3550 male > # ... with 1 more variable: year <int> ``` ] .panel[.panel-name[slice_head()] Pick the first `n` rows (vice versa for `slice_tail()`). ```r penguins %>% slice_head(n = 5) # alternatively: slice_head(frac = 0.05) ``` ``` > # A tibble: 5 x 8 > species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex > <fct> <fct> <dbl> <dbl> <int> <int> <fct> > 1 Adelie Torgersen 39.1 18.7 181 3750 male > 2 Adelie Torgersen 39.5 17.4 186 3800 female > 3 Adelie Torgersen 40.3 18 195 3250 female > 4 Adelie Torgersen NA NA NA NA <NA> > 5 Adelie Torgersen 36.7 19.3 193 3450 female > # ... with 1 more variable: year <int> ``` ] .panel[.panel-name[slice_sample()] Pick a random sample of `n` rows (with or without replacement).
```r penguins %>% slice_sample(n = 5) ``` ``` > # A tibble: 5 x 8 > species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex > <fct> <fct> <dbl> <dbl> <int> <int> <fct> > 1 Adelie Dream 35.6 17.5 191 3175 fema~ > 2 Chinstrap Dream 51.4 19 201 3950 male > 3 Adelie Biscoe 35.3 18.9 187 3800 fema~ > 4 Adelie Torgers~ 38.9 17.8 181 3625 fema~ > 5 Chinstrap Dream 50.2 18.7 198 3775 fema~ > # ... with 1 more variable: year <int> ``` ] .panel[.panel-name[slice_max()] Pick the `n` rows with the largest value (vice versa for `slice_min()`). ```r penguins %>% slice_max(bill_length_mm, n = 5) ``` ``` > # A tibble: 5 x 8 > species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex > <fct> <fct> <dbl> <dbl> <int> <int> <fct> > 1 Gentoo Biscoe 59.6 17 230 6050 male > 2 Chinstrap Dream 58 17.8 181 3700 female > 3 Gentoo Biscoe 55.9 17 228 5600 male > 4 Chinstrap Dream 55.8 19.8 207 4000 male > 5 Gentoo Biscoe 55.1 16 230 5850 male > # ... with 1 more variable: year <int> ``` ] ] ??? - slice_sample to generate bootstrapped samples --- ## 4.5 `dplyr`: A Grammar of Data Manipulation **Operations on rows:** `arrange()` changes the order of rows .panelset[ .panel[.panel-name[Ascending] Return the five penguins with the smallest body mass. ```r penguins %>% arrange(body_mass_g) %>% slice_head(n = 5) # equivalent to: slice_min(body_mass_g, n = 5) ``` ``` > # A tibble: 5 x 8 > species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex > <fct> <fct> <dbl> <dbl> <int> <int> <fct> > 1 Chinstrap Dream 46.9 16.6 192 2700 female > 2 Adelie Biscoe 36.5 16.6 181 2850 female > 3 Adelie Biscoe 36.4 17.1 184 2850 female > 4 Adelie Biscoe 34.5 18.1 187 2900 female > 5 Adelie Dream 33.1 16.1 178 2900 female > # ... with 1 more variable: year <int> ``` ] .panel[.panel-name[Descending] Return the five penguins with the highest body mass.
```r penguins %>% arrange(desc(body_mass_g)) %>% slice_head(n = 5) # equivalent to: slice_max(body_mass_g, n = 5) ``` ``` > # A tibble: 5 x 8 > species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex > <fct> <fct> <dbl> <dbl> <int> <int> <fct> > 1 Gentoo Biscoe 49.2 15.2 221 6300 male > 2 Gentoo Biscoe 59.6 17 230 6050 male > 3 Gentoo Biscoe 51.1 16.3 220 6000 male > 4 Gentoo Biscoe 48.8 16.2 222 6000 male > 5 Gentoo Biscoe 45.2 16.4 223 5950 male > # ... with 1 more variable: year <int> ``` ] ] ??? - arrange by default always sorts from smallest to largest --- ## 4.5 `dplyr`: A Grammar of Data Manipulation **Operations on columns:** `select()` picks or drops certain columns .panelset[ .panel[.panel-name[select() by index] ```r penguins %>% select(1:3) %>% glimpse ``` ``` > Rows: 344 > Columns: 3 > $ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel~ > $ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgersen, Torge~ > $ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, 42.0, 37~ ``` ] .panel[.panel-name[select() by name] ```r penguins %>% select(species, island, bill_length_mm) %>% glimpse ``` ``` > Rows: 344 > Columns: 3 > $ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel~ > $ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgersen, Torge~ > $ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, 42.0, 37~ ``` ] ] --- ## 4.5 `dplyr`: A Grammar of Data Manipulation **Operations on columns:** `select()` picks or drops certain columns (using `tidyselect` helpers) .panelset[ .panel[.panel-name[everything()] Select all columns.
```r penguins %>% select(everything()) %>% glimpse ``` ``` > Rows: 344 > Columns: 8 > $ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, A~ > $ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgersen, To~ > $ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, 42.0,~ > $ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, 20.2,~ > $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186, 180~ > $ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, 4250,~ > $ sex <fct> male, female, female, NA, female, male, female, male, NA,~ > $ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 200~ ``` ] .panel[.panel-name[last_col()] Select the last column in the data frame. ```r penguins %>% select(last_col()) %>% glimpse ``` ``` > Rows: 344 > Columns: 1 > $ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007~ ``` ] .panel[.panel-name[starts_with()] Select columns which names start with a certain string. ```r penguins %>% select(starts_with("bill")) %>% glimpse ``` ``` > Rows: 344 > Columns: 2 > $ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, 42.0, 37~ > $ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, 20.2, 17~ ``` ] .panel[.panel-name[ends_with()] Select columns which names end with a certain string. ```r penguins %>% select(ends_with("mm")) %>% glimpse ``` ``` > Rows: 344 > Columns: 3 > $ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, 42.0,~ > $ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, 20.2,~ > $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186, 180~ ``` ] .panel[.panel-name[contains()] Select columns which name contains a certain string. 
```r penguins %>% select(contains("e") & contains("a")) %>% glimpse ``` ``` > Rows: 344 > Columns: 1 > $ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007~ ``` ] .panel[.panel-name[matches()] Select columns based on a regular expression ([regex](https://www.rexegg.com/regex-quickstart.html)). ```r penguins %>% select(matches("_\\w*_mm$")) %>% glimpse ``` ``` > Rows: 344 > Columns: 3 > $ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, 42.0,~ > $ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, 20.2,~ > $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186, 180~ ``` ] .panel[.panel-name[where()] Select columns for which a function evaluates to `TRUE`. ```r penguins %>% select(where(is.numeric)) %>% glimpse ``` ``` > Rows: 344 > Columns: 5 > $ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, 42.0,~ > $ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, 20.2,~ > $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186, 180~ > $ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, 4250,~ > $ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 200~ ``` ] ] --- ## 4.5 `dplyr`: A Grammar of Data Manipulation **Operations on columns:** `select()` picks or drops certain columns Which columns are returned by the following queries? ```r penguins %>% select(starts_with("s")) ``` ```r penguins %>% select(ends_with("mm")) ``` ```r penguins %>% select(contains("mm")) ``` ```r penguins %>% select(-contains("mm")) ``` ```r penguins %>% select(where(~ is.numeric(.))) %>% # equivalent to: select(where(is.numeric)) select(where(~ mean(., na.rm = T) > 1000)) ``` ???
deselect: - if you want to deselect something put a minus in front where: - feed a function that takes a vector and returns T or F - when using a function within another function you usually require the formula (~) notation (see `purrr` part), except when only using a function with one argument --- ## 4.5 `dplyr`: A Grammar of Data Manipulation **Operations on columns:** `rename()` changes the column names Change the name of the column `body_mass_g` (`sex`) to `bm` (`gender`). ```r penguins %>% rename(bm = body_mass_g, gender = sex) %>% colnames() ``` ``` > [1] "species" "island" "bill_length_mm" "bill_depth_mm" > [5] "flipper_length_mm" "bm" "gender" "year" ``` Convert the names of the columns that include the string `"mm"` to upper case. ```r penguins %>% rename_with(.fn = toupper, .cols = contains("mm")) %>% colnames() ``` ``` > [1] "species" "island" "BILL_LENGTH_MM" "BILL_DEPTH_MM" > [5] "FLIPPER_LENGTH_MM" "body_mass_g" "sex" "year" ``` --- ## 4.5 `dplyr`: A Grammar of Data Manipulation <img src="https://raw.githubusercontent.com/allisonhorst/stats-illustrations/master/rstats-artwork/dplyr_relocate.png" width="40%" height="40%" style="float:right; padding:10px" /> **Operations on columns:** `relocate()` changes the order of columns Change the order of columns in the `tibble` according to the following scheme: 1. place `species` after `body_mass_g` 2. place `sex` before `species` 3. place `island` at the end ```r penguins %>% relocate(species, .after = body_mass_g) %>% relocate(sex, .before = species) %>% relocate(island, .after = last_col()) %>% colnames() ``` ``` > [1] "bill_length_mm" "bill_depth_mm" "flipper_length_mm" "body_mass_g" > [5] "sex" "species" "year" "island" ``` --- ## 4.5 `dplyr`: A Grammar of Data Manipulation **Operations on columns:** `mutate()` transforms the column values and/or creates new columns Create a new `bm_kg` variable which reflects `body_mass_g` measured in kilograms.
```r penguins %>% mutate(bm_kg = body_mass_g / 1000, .keep = "all", .after = island) %>% slice_head(n = 5) ``` ``` > # A tibble: 5 x 9 > species island bm_kg bill_length_mm bill_depth_mm flipper_length_mm body_mass_g > <fct> <fct> <dbl> <dbl> <dbl> <int> <int> > 1 Adelie Torgersen 3.75 39.1 18.7 181 3750 > 2 Adelie Torgersen 3.8 39.5 17.4 186 3800 > 3 Adelie Torgersen 3.25 40.3 18 195 3250 > 4 Adelie Torgersen NA NA NA NA NA > 5 Adelie Torgersen 3.45 36.7 19.3 193 3450 > # ... with 2 more variables: sex <fct>, year <int> ``` - Use the `.keep` argument to specify which columns to keep after manipulation. - Use the `.before`/`.after` arguments to specify the position of the new column. - For overriding a given column simply use the same column name. - For keeping only the new column use `dplyr::transmute()`. --- ## 4.5 `dplyr`: A Grammar of Data Manipulation <img src="https://raw.githubusercontent.com/allisonhorst/stats-illustrations/master/rstats-artwork/dplyr_case_when.png" width="40%" height="40%" style="float:right; padding:10px" /> **Operations on columns:** `mutate()` transforms the column values and/or creates new columns Create a *one-hot encoded* variable for `sex`. ```r penguins %>% mutate( sex_binary = case_when( sex == "male" ~ 1, sex == "female" ~ 0), .keep = "all", .after = island ) %>% slice_head(n = 3) ``` ``` > # A tibble: 3 x 9 > species island sex_binary bill_length_mm bill_depth_mm flipper_length_~ body_mass_g > <fct> <fct> <dbl> <dbl> <dbl> <int> <int> > 1 Adelie Torgersen 1 39.1 18.7 181 3750 > 2 Adelie Torgersen 0 39.5 17.4 186 3800 > 3 Adelie Torgersen 0 40.3 18 195 3250 > # ... with 2 more variables: sex <fct>, year <int> ``` .footnote[ _**One-hot Encoding:** Encoding a categorical variable with `C` factor levels into `C` dummies (often in modeling you create `C-1` dummies otherwise you have a perfect linear combination of the variables)._ ] ??? 
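The `case_when()` call on this slide leaves penguins with a missing `sex` as `NA`, because no condition matches them. A minimal sketch of the catch-all pattern follows; the `toy` tibble and the fallback value `-1` are hypothetical choices for illustration only.

```r
library(dplyr)

# Hypothetical toy data with one missing value
toy <- tibble(sex = c("male", "female", NA))

toy <- toy %>%
  mutate(
    # without a catch-all, unmatched cases become NA
    sex_binary = case_when(sex == "male" ~ 1, sex == "female" ~ 0),
    # with `TRUE` on the left-hand side, all remaining cases are captured
    sex_filled = case_when(sex == "male" ~ 1, sex == "female" ~ 0, TRUE ~ -1)
  )

toy$sex_binary # 1 0 NA
toy$sex_filled # 1 0 -1
```

The conditions are evaluated in order, so the `TRUE` case has to come last.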
case_when: - vectorized equivalent of multiple nested `if_else()` statements - two-sided formulas: LHS tests the condition, RHS specifies the replacement value - for unmatched cases, the function returns NA - use LHS `TRUE` to capture all cases not explicitly specified beforehand --- ## 4.5 `dplyr`: A Grammar of Data Manipulation <img src="https://raw.githubusercontent.com/allisonhorst/stats-illustrations/master/rstats-artwork/dplyr_across.png" width="40%" height="40%" style="float:right; padding:10px" /> **Operations on columns:** `mutate()` transforms the column values and/or creates new columns Transform measurement variables to meters. ```r penguins %>% mutate( across(contains("mm"), ~ . / 1000), .keep = "all" ) %>% slice_head(n = 3) ``` ``` > # A tibble: 3 x 8 > species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex > <fct> <fct> <dbl> <dbl> <dbl> <int> <fct> > 1 Adelie Torgersen 0.0391 0.0187 0.181 3750 male > 2 Adelie Torgersen 0.0395 0.0174 0.186 3800 female > 3 Adelie Torgersen 0.0403 0.018 0.195 3250 female > # ... with 1 more variable: year <int> ``` ??? across: - apply same transformation across multiple columns - allows you to use the semantics you know from the `select()` function - does not require you to explicitly specify a column name as it only transforms existing columns --- ## 4.5 `dplyr`: A Grammar of Data Manipulation <img src="https://raw.githubusercontent.com/allisonhorst/stats-illustrations/master/rstats-artwork/dplyr_across.png" width="40%" height="40%" style="float:right; padding:10px" /> **Operations on columns:** `mutate()` transforms the column values and/or creates new columns Define `species`, `island` and `sex` as categorical variables, i.e. *factors*, using `across()`.
```r penguins %>% mutate( across(where(is.character), as.factor), .keep = "all" ) %>% slice_head(n = 3) ``` ``` > # A tibble: 3 x 8 > species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex > <fct> <fct> <dbl> <dbl> <int> <int> <fct> > 1 Adelie Torgersen 39.1 18.7 181 3750 male > 2 Adelie Torgersen 39.5 17.4 186 3800 female > 3 Adelie Torgersen 40.3 18 195 3250 female > # ... with 1 more variable: year <int> ``` --- ## 4.5 `dplyr`: A Grammar of Data Manipulation **Operations on grouped data:** `group_by()` partitions data based on one or several columns ```r penguins %>% group_by(species) ``` ``` > # A tibble: 344 x 8 > # Groups: species [3] > species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex > <fct> <fct> <dbl> <dbl> <int> <int> <fct> > 1 Adelie Torgersen 39.1 18.7 181 3750 male > 2 Adelie Torgersen 39.5 17.4 186 3800 fema~ > 3 Adelie Torgersen 40.3 18 195 3250 fema~ > 4 Adelie Torgersen NA NA NA NA <NA> > 5 Adelie Torgersen 36.7 19.3 193 3450 fema~ > 6 Adelie Torgersen 39.3 20.6 190 3650 male > 7 Adelie Torgersen 38.9 17.8 181 3625 fema~ > 8 Adelie Torgersen 39.2 19.6 195 4675 male > 9 Adelie Torgersen 34.1 18.1 193 3475 <NA> > 10 Adelie Torgersen 42 20.2 190 4250 <NA> > # ... with 334 more rows, and 1 more variable: year <int> ``` Use `group_keys()`, `group_indices()` and `group_vars()` to access grouping keys, group indices per row and grouping variables. --- ## 4.5 `dplyr`: A Grammar of Data Manipulation **Operations on grouped data:** `group_by()` partitions data based on one or several columns Under the hood `group_by()` changes the representation of the `tibble` and transforms it into a grouped data frame (`grouped_df`). This allows us to operate on the subgroups individually using `summarise()`. 
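The change of representation can be inspected directly. The following is a minimal sketch on a hypothetical `toy` tibble rather than on `penguins`:

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(
  species     = c("Adelie", "Adelie", "Gentoo"),
  body_mass_g = c(3750, 3800, 5000)
)

grouped <- toy %>% group_by(species)

class(grouped)      # now includes "grouped_df"
group_vars(grouped) # "species"
n_groups(grouped)   # 2

# ungroup() removes the grouping again
ungrouped <- grouped %>% ungroup()
class(ungrouped)    # no "grouped_df" any more
```

The data itself is untouched; only the grouping metadata attached to the `tibble` changes.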
-- **Operations on grouped data:** `summarise()` reduces a group of data into a single row .panelset[ .panel[.panel-name[univariate] ```r penguins %>% group_by(species) %>% summarise(count = n(), .groups = "drop") ``` ``` > # A tibble: 3 x 2 > species count > <fct> <int> > 1 Adelie 152 > 2 Chinstrap 68 > 3 Gentoo 124 ``` ] .panel[.panel-name[bivariate] ```r penguins %>% group_by(species, sex) %>% summarise(count = n(), .groups = "drop") ``` ``` > # A tibble: 8 x 3 > species sex count > <fct> <fct> <int> > 1 Adelie female 73 > 2 Adelie male 73 > 3 Adelie <NA> 6 > 4 Chinstrap female 34 > 5 Chinstrap male 34 > 6 Gentoo female 58 > 7 Gentoo male 61 > 8 Gentoo <NA> 5 ``` ] ] ??? - use `.groups = ` to indicate what happens to the groups after summarising them --- ## 4.5 `dplyr`: A Grammar of Data Manipulation **Operations on grouped data:** `group_by()` partitions data based on one or several columns and `summarise()` reduces a group of data into a single row ```r penguins %>% group_by(species) %>% summarise( across(contains("mm"), ~ mean(., na.rm = T), .names = "{.col}_avg"), .groups = "drop" ) ``` ``` > # A tibble: 3 x 4 > species bill_length_mm_avg bill_depth_mm_avg flipper_length_mm_avg > <fct> <dbl> <dbl> <dbl> > 1 Adelie 38.8 18.3 190. > 2 Chinstrap 48.8 18.4 196. > 3 Gentoo 47.5 15.0 217. ``` Using `group_by()`, followed by `summarise()` and `ungroup()` reflects the **split-apply-combine paradigm** in data analysis: Split the data into partitions, apply some function to the data and then merge the results. ??? 
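To see what split-apply-combine means step by step, here is a hedged sketch that performs the three steps by hand and compares the result with the `group_by()`/`summarise()` pipeline. The `toy` tibble is hypothetical, and the manual version is only an illustration, not how `dplyr` is implemented internally.

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(
  species     = c("Adelie", "Adelie", "Gentoo", "Gentoo"),
  body_mass_g = c(3700, 3800, 5000, 5200)
)

# 1. Split: one tibble per species
parts <- split(toy, toy$species)

# 2. Apply: compute the mean body mass within each part
means <- lapply(parts, function(df) {
  tibble(species = df$species[1], avg = mean(df$body_mass_g))
})

# 3. Combine: stack the per-group results into one tibble
manual <- bind_rows(means)

# The same result with dplyr verbs
piped <- toy %>%
  group_by(species) %>%
  summarise(avg = mean(body_mass_g), .groups = "drop")

manual$avg # 3750 5100
piped$avg  # 3750 5100
```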
- the true potential is unleashed if you combine `group_by` and `summarise` - split-apply-combine paradigm particularly useful in parallel processing --- ## 4.5 `dplyr`: A Grammar of Data Manipulation **Operations on grouped data:** `group_by()` partitions data based on one or several columns and `summarise()` reduces a group of data into a single row <img src="https://raw.githubusercontent.com/allisonhorst/stats-illustrations/master/rstats-artwork/group_by_ungroup.png" width="60%" height="60%" style="float:left; padding:10px" /> <br> *Note: Instead of using `ungroup()` you may also set the `.groups` argument in `summarise()` equal to "drop".* *But never forget to ungroup your data, otherwise you may run into errors later on in your analysis!* ??? - now let's look at some more advanced use cases --- ## 4.5 `dplyr`: A Grammar of Data Manipulation **Stacked `group_by()`:** Use `.add = T` to add new grouping variables (otherwise the existing grouping is overridden) ```r penguins %>% group_by(species) %>% group_by(year, .add = T) # equivalent to: group_by(species, year) ``` ``` > # A tibble: 344 x 8 > # Groups: species, year [9] > species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex > <fct> <fct> <dbl> <dbl> <int> <int> <fct> > 1 Adelie Torgersen 39.1 18.7 181 3750 male > 2 Adelie Torgersen 39.5 17.4 186 3800 fema~ > 3 Adelie Torgersen 40.3 18 195 3250 fema~ > 4 Adelie Torgersen NA NA NA NA <NA> > 5 Adelie Torgersen 36.7 19.3 193 3450 fema~ > 6 Adelie Torgersen 39.3 20.6 190 3650 male > 7 Adelie Torgersen 38.9 17.8 181 3625 fema~ > 8 Adelie Torgersen 39.2 19.6 195 4675 male > 9 Adelie Torgersen 34.1 18.1 193 3475 <NA> > 10 Adelie Torgersen 42 20.2 190 4250 <NA> > # ...
with 334 more rows, and 1 more variable: year <int> ``` --- ## 4.5 `dplyr`: A Grammar of Data Manipulation **Apply multiple summary functions:** Provide a list of `purrr`-style functions to `across()` ```r penguins %>% group_by(species) %>% summarise( across( contains("mm"), list(avg = ~ mean(., na.rm = T), sd = ~ sd(., na.rm = T)), .names = "{.col}_{.fn}" ), .groups = "drop" ) ``` ``` > # A tibble: 3 x 7 > species bill_length_mm_avg bill_length_mm_sd bill_depth_mm_avg bill_depth_mm_sd > <fct> <dbl> <dbl> <dbl> <dbl> > 1 Adelie 38.8 2.66 18.3 1.22 > 2 Chinstrap 48.8 3.34 18.4 1.14 > 3 Gentoo 47.5 3.08 15.0 0.981 > # ... with 2 more variables: flipper_length_mm_avg <dbl>, flipper_length_mm_sd <dbl> ``` --- ## 4.5 `dplyr`: A Grammar of Data Manipulation **Changed behavior of `mutate()`:** Summary functions, e.g., `mean()` or `sd()` now operate on partitions of the data instead of on the whole data ```r penguins %>% group_by(species) %>% mutate(stand_bm = (body_mass_g - mean(body_mass_g, na.rm = T)) / sd(body_mass_g, na.rm = T)) %>% glimpse ``` ``` > Rows: 344 > Columns: 9 > Groups: species [3] > $ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, A~ > $ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgersen, To~ > $ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, 42.0,~ > $ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, 20.2,~ > $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186, 180~ > $ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, 4250,~ > $ sex <fct> male, female, female, NA, female, male, female, male, NA,~ > $ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 200~ > $ stand_bm <dbl> 0.107591350, 0.216626878, -0.982763938, NA, -0.546621823,~ ``` ??? 
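- To make the changed behavior visible, the following sketch computes the same `\(z\)`-score once on the ungrouped and once on the grouped data. It assumes `dplyr` and `palmerpenguins` are installed; the helper `z()` and the column names `z_overall` and `z_species` are hypothetical and not part of the slides.

```r
library(dplyr)
library(palmerpenguins)

# Hypothetical helper: z-score of a numeric vector
z <- function(x) (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)

compared <- penguins %>%
  mutate(z_overall = z(body_mass_g)) %>%   # mean and sd over all rows
  group_by(species) %>%
  mutate(z_species = z(body_mass_g)) %>%   # mean and sd per species
  ungroup() %>%
  select(species, body_mass_g, z_overall, z_species)

# Per species, z_species averages to (numerically) zero, z_overall does not
compared %>%
  group_by(species) %>%
  summarise(across(starts_with("z_"), ~ mean(.x, na.rm = TRUE)))
```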
- here example of the z-transformation on a group level --- ## 4.5 `dplyr`: A Grammar of Data Manipulation **`group_by()` a transformed column:** Provide a `mutate()`-like expression in your `group_by()` statement ```r bm_breaks <- mean(penguins$body_mass_g, na.rm = T) - (-3:3) * sd(penguins$body_mass_g, na.rm = T) penguins %>% group_by(species, bm_bin = cut(body_mass_g, breaks = bm_breaks)) %>% summarise(count = n(), .groups = "drop") ``` ``` > # A tibble: 12 x 3 > species bm_bin count > <fct> <fct> <int> > 1 Adelie (2.6e+03,3.4e+03] 39 > 2 Adelie (3.4e+03,4.2e+03] 87 > 3 Adelie (4.2e+03,5e+03] 25 > 4 Adelie <NA> 1 > 5 Chinstrap (2.6e+03,3.4e+03] 11 > 6 Chinstrap (3.4e+03,4.2e+03] 50 > 7 Chinstrap (4.2e+03,5e+03] 7 > 8 Gentoo (3.4e+03,4.2e+03] 6 > 9 Gentoo (4.2e+03,5e+03] 56 > 10 Gentoo (5e+03,5.81e+03] 52 > 11 Gentoo (5.81e+03,6.61e+03] 9 > 12 Gentoo <NA> 1 ``` ??? 1. compute bins for body mass, the amount of standard deviations from the mean 2. group by data according to these bins (create bins in `group_by()` command) --- ## 4.5 `dplyr`: A Grammar of Data Manipulation **Changed behavior of `filter()`:** Filters now operate on partitions of the data instead of on the whole data ```r penguins %>% group_by(species, island) %>% filter(flipper_length_mm == max(flipper_length_mm, na.rm = T)) ``` ``` > # A tibble: 5 x 8 > # Groups: species, island [5] > species island bill_length_mm bill_depth_mm flipper_length_~ body_mass_g sex > <fct> <fct> <dbl> <dbl> <int> <int> <fct> > 1 Adelie Dream 40.8 18.9 208 4300 male > 2 Adelie Biscoe 41 20 203 4725 male > 3 Adelie Torgersen 44.1 18 210 4000 male > 4 Gentoo Biscoe 54.3 15.7 231 5650 male > 5 Chinstrap Dream 49 19.6 212 4300 male > # ... with 1 more variable: year <int> ``` ??? 
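- A short sketch contrasting the ungrouped and the grouped `filter()`, assuming `dplyr` and `palmerpenguins` are installed. The object names are made up for illustration; `slice_max()` is shown as one possible alternative, not as the approach used on the slide.

```r
library(dplyr)
library(palmerpenguins)

# Ungrouped: max() is taken over the whole data set
overall_max <- penguins %>%
  filter(flipper_length_mm == max(flipper_length_mm, na.rm = TRUE))

# Grouped: max() is evaluated separately per species-island combination
group_max <- penguins %>%
  group_by(species, island) %>%
  filter(flipper_length_mm == max(flipper_length_mm, na.rm = TRUE)) %>%
  ungroup()

# slice_max() expresses the same intent more directly (ties are kept by default)
group_max_alt <- penguins %>%
  group_by(species, island) %>%
  slice_max(flipper_length_mm, n = 1) %>%
  ungroup()

nrow(overall_max)
nrow(group_max)
```

The grouped variant returns at least one row per `species`-`island` combination, whereas the ungrouped variant only returns the penguin(s) with the overall longest flippers.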
- Group by all unique `species`-`island` combinations and filter for the penguins with the maximal flipper length per combination --- ## 4.5 `dplyr`: A Grammar of Data Manipulation **Nesting of grouped data:** Usually, you will find it more intuitive to use `group_by()` followed by `nest()` to produce a nested data frame compared to the example in [section 4.4](#tidyr_nest). ```r penguins %>% group_by(species, year) %>% tidyr::nest() ``` ``` > # A tibble: 9 x 3 > # Groups: species, year [9] > species year data > <fct> <int> <list> > 1 Adelie 2007 <tibble [50 x 6]> > 2 Adelie 2008 <tibble [50 x 6]> > 3 Adelie 2009 <tibble [52 x 6]> > 4 Gentoo 2007 <tibble [34 x 6]> > 5 Gentoo 2008 <tibble [46 x 6]> > 6 Gentoo 2009 <tibble [44 x 6]> > 7 Chinstrap 2007 <tibble [26 x 6]> > 8 Chinstrap 2008 <tibble [18 x 6]> > 9 Chinstrap 2009 <tibble [24 x 6]> ``` .footnote[ *Note: Find more information about `group_by()` by running `vignette("grouping")`.* ] --- ## 4.5 `dplyr`: A Grammar of Data Manipulation **Other selected `dplyr` operations:** .panelset[ .panel[.panel-name[distinct()] `distinct()` selects only unique rows. ```r penguins %>% distinct(species, island) ``` ``` > # A tibble: 5 x 2 > species island > <fct> <fct> > 1 Adelie Torgersen > 2 Adelie Biscoe > 3 Adelie Dream > 4 Gentoo Biscoe > 5 Chinstrap Dream ``` ] .panel[.panel-name[pull()] `pull()` extracts single columns as vectors. 
```r penguins %>% pull(year) # equivalent to: penguins$year ``` ``` > [1] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 > [17] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 > [33] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 > [49] 2007 2007 2008 2008 2008 2008 2008 2008 2008 2008 2008 2008 2008 2008 2008 2008 > [65] 2008 2008 2008 2008 2008 2008 2008 2008 2008 2008 2008 2008 2008 2008 2008 2008 > [81] 2008 2008 2008 2008 2008 2008 2008 2008 2008 2008 2008 2008 2008 2008 2008 2008 > [97] 2008 2008 2008 2008 > [ reached getOption("max.print") -- omitted 244 entries ] ``` ] .panel[.panel-name[if_else()] `if_else()` applies a vectorized if-else-statement. ```r penguins %>% select(species, island, body_mass_g) %>% mutate(penguin_size = if_else(body_mass_g < 3500, "tiny penguin", "big penguin")) ``` ``` > # A tibble: 344 x 4 > species island body_mass_g penguin_size > <fct> <fct> <int> <chr> > 1 Adelie Torgersen 3750 big penguin > 2 Adelie Torgersen 3800 big penguin > 3 Adelie Torgersen 3250 tiny penguin > 4 Adelie Torgersen NA <NA> > 5 Adelie Torgersen 3450 tiny penguin > 6 Adelie Torgersen 3650 big penguin > 7 Adelie Torgersen 3625 big penguin > 8 Adelie Torgersen 4675 big penguin > 9 Adelie Torgersen 3475 tiny penguin > 10 Adelie Torgersen 4250 big penguin > # ... with 334 more rows ``` ] .panel[.panel-name[lag()] `lag()` shifts column values by an offset of `n` forward. ```r penguins %>% select(species, body_mass_g) %>% mutate(lagged_bm = lag(body_mass_g, n = 1)) ``` ``` > # A tibble: 344 x 3 > species body_mass_g lagged_bm > <fct> <int> <int> > 1 Adelie 3750 NA > 2 Adelie 3800 3750 > 3 Adelie 3250 3800 > 4 Adelie NA 3250 > 5 Adelie 3450 NA > 6 Adelie 3650 3450 > 7 Adelie 3625 3650 > 8 Adelie 4675 3625 > 9 Adelie 3475 4675 > 10 Adelie 4250 3475 > # ... with 334 more rows ``` ] .panel[.panel-name[lead()] `lead()` shifts column values by an offset of `n` backward. 
```r
penguins %>%
  select(species, body_mass_g) %>%
  mutate(lead_bm = lead(body_mass_g, n = 2))
```

```
> # A tibble: 344 x 3
>    species body_mass_g lead_bm
>    <fct>         <int>   <int>
>  1 Adelie         3750    3250
>  2 Adelie         3800      NA
>  3 Adelie         3250    3450
>  4 Adelie           NA    3650
>  5 Adelie         3450    3625
>  6 Adelie         3650    4675
>  7 Adelie         3625    3475
>  8 Adelie         4675    4250
>  9 Adelie         3475    3300
> 10 Adelie         4250    3700
> # ... with 334 more rows
```
]

.panel[.panel-name[join()]

`left_join()`, `right_join()`, `inner_join()` and `full_join()` merge different data frames by matching rows on key columns (similar to joins performed in SQL).
]
]

.pull-right[.pull-right[.footnote[
*Note: Find more information about `dplyr` by running `vignette("dplyr")` and consulting the official [cheat sheet](https://raw.githubusercontent.com/rstudio/cheatsheets/master/data-transformation.pdf).*
]]]

---

## 4.5 `dplyr`: A Grammar of Data Manipulation

.center[**Similarities between `dplyr` and `SQL` statements:**]

<img src="./img/sql-tidyverse.png" width="70%" height="70%" style="display: block; margin: auto;" />

.center[
*Src: [Steves (2021)](https://www.rstudio.com/resources/rstudioglobal-2021/the-dynamic-duo-sql-and-r/)*
]

---
class: middle, center, inverse
layout: false

# 4.6 `purrr`:<br><br>Functional Programming Tools

---
background-image: url(https://raw.githubusercontent.com/tidyverse/purrr/master/man/figures/logo.png)
background-position: 97.5% 2.5%
background-size: 7.5%
layout: true

---

## 4.6 `purrr`: Functional Programming Tools

`purrr` facilitates [*functional programming*](https://en.wikipedia.org/wiki/Functional_programming) (FP) with data frame objects in `R`. Whenever you would normally resort to a `for`-loop for solving an iterative problem, the family of `map_*()` functions allows you to rephrase your problem as a `tidyverse` pipeline.

**Four main types of `map_*()` functions:**

- `map(.x, .f, ...)` takes the input `.x` and applies `.f` to each element in `.x`.
- `map2(.x, .y, .f, ...)` takes the inputs `.x` and `.y` and applies `.f` to the elements of `.x` and `.y` in parallel.
- `pmap(.l, .f, ...)` takes a list `.l` of inputs and applies `.f` to the elements of all inputs in `.l` in parallel.
- `group_map(.data, .f, ...)` takes a grouped `tibble` and applies `.f` to each subgroup.

--

.pull-left[
By default `map()` returns a list. If you want to be more explicit about the output you may use

- `map_lgl()` to receive a logical output type,
- `map_chr()` to receive a character output type,
- `map_int()` to receive an integer output type,
- `map_dbl()` to receive a double output type,
- `map_df()` to receive a data frame output.
]

--

.pull-right[
The input `.x` to any `map_*()` function can be either a vector, a list or a data frame.

- **Vector:** Iteration over vector elements
- **List:** Iteration over list elements
- **Data frame:** Iteration over columns
]

???

In functional programming, your code is organised into functions that perform the operations you need. Your scripts are then only a sequence of calls to these functions, which makes them easier to understand.

---

## 4.6 `purrr`: Functional Programming Tools

<img src="https://upload.wikimedia.org/wikipedia/commons/0/06/Mapping-steps-loillibe-new.gif" width="60%" style="display: block; margin: auto;" />

.center[
_Src: [Rodrigues (2010)](https://b-rodrigues.github.io/modern_R/functional-programming.html)_
]

---

## 4.6 `purrr`: Functional Programming Tools

**Use Case:** Let's assume we have multiple data samples and require each of the samples to be `\(z\)`-normalized for further modeling.

First, we would probably write a *named function* for performing `\(z\)`-normalization which takes our sample `.x` as input.

```r
z_transform <- function(.x) {
  mean <- mean(.x, na.rm = T)
  sd <- sd(.x, na.rm = T)
  return( (.x - mean) / sd )
}
```

--

Second, we draw samples from the `penguins` data set and store them as numeric vectors in a list.
```r samples <- list( sample1 = slice_sample(penguins, n = 10)$bill_length_mm, sample2 = slice_sample(penguins, n = 10)$bill_depth_mm, sample3 = slice_sample(penguins, n = 10)$flipper_length_mm ) samples[1] ``` ``` > $sample1 > [1] 55.9 43.2 37.2 34.6 36.0 52.2 32.1 37.2 38.1 50.5 ``` ??? - here: different means and sd --- ## 4.6 `purrr`: Functional Programming Tools Third, perform the `\(z\)`-normalization. .panelset[ .panel[.panel-name[for-loop] ```r for (sample in samples) { print(z_transform(.x = sample)) } ``` ``` > [1] 1.7107192 0.1807098 -0.5421293 -0.8553596 -0.6866971 1.2649684 -1.1565426 > [8] -0.5421293 -0.4337035 1.0601640 > [1] 1.4295925 1.1598581 -0.1888141 0.2967079 0.6203892 -0.7822299 -1.3216988 > [8] -1.5374863 0.5664423 -0.2427610 > [1] -0.31304243 0.08829402 -0.71437887 -0.07224056 2.09497625 1.37257064 > [7] -1.03544803 -0.15250785 -0.95518074 -0.31304243 ``` ] .panel[.panel-name[map()] ```r map(.x = samples, .f = ~ z_transform(.x)) ``` ``` > $sample1 > [1] 1.7107192 0.1807098 -0.5421293 -0.8553596 -0.6866971 1.2649684 -1.1565426 > [8] -0.5421293 -0.4337035 1.0601640 > > $sample2 > [1] 1.4295925 1.1598581 -0.1888141 0.2967079 0.6203892 -0.7822299 -1.3216988 > [8] -1.5374863 0.5664423 -0.2427610 > > $sample3 > [1] -0.31304243 0.08829402 -0.71437887 -0.07224056 2.09497625 1.37257064 > [7] -1.03544803 -0.15250785 -0.95518074 -0.31304243 ``` ] ] ??? often times, `map` statements are more efficient than for for-loops --- ## Excursus: Tilde-Shorthand Within the `tidyverse`, the tilde-shorthand regularly occurs whenever an external function is required as an argument to one of the `tidyverse` functions. In general, you have different ways of including the second function call, one of which is the tilde-shorthand notation. .panelset[ .panel[.panel-name[Option 1] Referring to the external function using its name. 
```r map(.x = samples, .f = z_transform) ``` <br> Note that other function arguments can be passed on to the function as additional positional arguments beyond `.f`, e.g., `map(.x = samples, .f = mean, na.rm = T)` ] .panel[.panel-name[Option 2] Defining an anonymous function inline. ```r map( .x = samples, .f = function(.x) { (.x - mean(.x, na.rm = T)) / sd(.x, na.rm = T) } ) ``` <br> Note that here you could also omit `{ }`, since there is only a single expression involved in the function. ] .panel[.panel-name[Option 3] Defining an anonymous function inline using the tilde-shorthand. ```r map( .x = samples, .f = ~ (.x - mean(.x, na.rm = T)) / sd(.x, na.rm = T) ) ``` <br> Note that whenever we use the tilde-shorthand, we refer to the argument of the anonymous function by `.x` or simply by `.` (if it only requires one input). ] ] ??? The tilde indicates: what comes next should be considered as a function Most of the time, explicitly defining named functions and then choosing option 1 only makes sense if you require them at least more than once. Otherwise, I would strongly recommend using anonymous function, i.e. option 2. or 3. --- ## 4.6 `purrr`: Functional Programming Tools <img src="https://media1.tenor.com/images/f72cb542d6b3e3c3421889e0a3d9628d/tenor.gif" width="50%" style="display: block; margin: auto;" /> <br><br> .center[🕺 Now let us look at some other practical use cases! 💃] --- ## 4.6 `purrr`: Functional Programming Tools .pull-left[ Check the columns' data types. ```r penguins %>% map_df(class) %>% glimpse ``` ``` > Rows: 1 > Columns: 8 > $ species <chr> "factor" > $ island <chr> "factor" > $ bill_length_mm <chr> "numeric" > $ bill_depth_mm <chr> "numeric" > $ flipper_length_mm <chr> "integer" > $ body_mass_g <chr> "integer" > $ sex <chr> "factor" > $ year <chr> "integer" ``` ] .pull-right[ Check the number of missing values per column. 
```r penguins %>% map_df(~ .x %>% is.na %>% sum) %>% glimpse ``` ``` > Rows: 1 > Columns: 8 > $ species <int> 0 > $ island <int> 0 > $ bill_length_mm <int> 2 > $ bill_depth_mm <int> 2 > $ flipper_length_mm <int> 2 > $ body_mass_g <int> 2 > $ sex <int> 11 > $ year <int> 0 ``` ] ??? 1: I give `map` a data frame as input (`penguins`), so it iterates over each column. And to each column I apply the `class()` function. I want the output to be returned as a data frame (`map_df`) --- ## 4.6 `purrr`: Functional Programming Tools Check the number of distinct values per column. ```r penguins %>% map_df(dplyr::n_distinct) %>% glimpse ``` ``` > Rows: 1 > Columns: 8 > $ species <int> 3 > $ island <int> 3 > $ bill_length_mm <int> 165 > $ bill_depth_mm <int> 81 > $ flipper_length_mm <int> 56 > $ body_mass_g <int> 95 > $ sex <int> 3 > $ year <int> 3 ``` --- ## 4.6 `purrr`: Functional Programming Tools Check the highest value in each subset of the data (e.g., largest `flipper_length_mm` per `sex`). ```r penguins %>% tidyr::drop_na() %>% dplyr::group_by(sex) %>% group_map(~ dplyr::slice_max(.x, flipper_length_mm, n = 1), .keep = T) ``` ``` > [[1]] > # A tibble: 1 x 8 > species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex > <fct> <fct> <dbl> <dbl> <int> <int> <fct> > 1 Gentoo Biscoe 46.9 14.6 222 4875 female > # ... with 1 more variable: year <int> > > [[2]] > # A tibble: 1 x 8 > species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex > <fct> <fct> <dbl> <dbl> <int> <int> <fct> > 1 Gentoo Biscoe 54.3 15.7 231 5650 male > # ... with 1 more variable: year <int> ``` ??? - drop_na: because otherwise I would also have a subgroup of NA --- ## 4.6 `purrr`: Functional Programming Tools Produce a series of identical plots, each depicting a separate subset of the underlying data. 
```r
species <- penguins %>%
  dplyr::distinct(species, year) %>%
  dplyr::pull(species) # .x argument for map2()

years <- penguins %>%
  dplyr::distinct(species, year) %>%
  dplyr::pull(year) # .y argument for map2()

penguin_plots <- map2(
  .x = species,
  .y = years,
  .f = ~ {
    penguins %>%
      tidyr::drop_na() %>%
      dplyr::filter(species == .x, year == .y) %>%
      ggplot2::ggplot() +
        geom_point(aes(x = bill_length_mm, y = body_mass_g)) +
        labs(title = glue::glue("Scatter Plot Bill Length vs. Body Mass ({.x}, {.y})"))
  }
)
```

---

## 4.6 `purrr`: Functional Programming Tools

.pull-left[
```r
penguin_plots[[1]]
```

<img src="index_files/figure-html/unnamed-chunk-148-1.png" width="576" style="display: block; margin: auto;" />
]

.pull-right[
```r
penguin_plots[[4]]
```

<img src="index_files/figure-html/unnamed-chunk-149-1.png" width="576" style="display: block; margin: auto;" />
]

---

## 4.6 `purrr`: Functional Programming Tools

Finally, `map()` is really powerful in the context of modeling. In the following, we fit a linear regression model for each `species`-`island` subset.

First, we create a nested data frame that contains one `tibble` for each `species`-`island` combination.

```r
nested_penguins <- penguins %>%
  tidyr::drop_na() %>%
  dplyr::group_by(species, island) %>%
  tidyr::nest()

nested_penguins
```

```
> # A tibble: 5 x 3
> # Groups:   species, island [5]
>   species   island    data
>   <fct>     <fct>     <list>
> 1 Adelie    Torgersen <tibble [47 x 6]>
> 2 Adelie    Biscoe    <tibble [44 x 6]>
> 3 Adelie    Dream     <tibble [55 x 6]>
> 4 Gentoo    Biscoe    <tibble [119 x 6]>
> 5 Chinstrap Dream     <tibble [68 x 6]>
```

.pull-right[.footnote[
*Note: For accessing elements in a nested `tibble` you may use the `pluck()` function. For example, for accessing the first `tibble` in the column `data`, you may run `nested_penguins %>% pluck("data", 1)` (also see [section 4.4](#nested-data)).*
]]

---

## 4.6 `purrr`: Functional Programming Tools

Second, we fit a linear model to each data subset.
In our model, `body_mass_g` is regressed (`~`) on all other variables (denoted by a dot in the `lm()` formula). ```r nested_penguins <- nested_penguins %>% dplyr::mutate(lin_reg = map( .x = data, .f = ~ lm(body_mass_g ~ ., data = .x) )) nested_penguins ``` ``` > # A tibble: 5 x 4 > # Groups: species, island [5] > species island data lin_reg > <fct> <fct> <list> <list> > 1 Adelie Torgersen <tibble [47 x 6]> <lm> > 2 Adelie Biscoe <tibble [44 x 6]> <lm> > 3 Adelie Dream <tibble [55 x 6]> <lm> > 4 Gentoo Biscoe <tibble [119 x 6]> <lm> > 5 Chinstrap Dream <tibble [68 x 6]> <lm> ``` --- ## 4.6 `purrr`: Functional Programming Tools Third, for each linear model, we generate a model summary using `summary()` and extract the model coefficients as a `tibble`. Finally, we use `unnest()` to receive a tidy data frame. ```r nested_penguins <- nested_penguins %>% dplyr::mutate(coefs = map( .x = lin_reg, .f = ~ summary(.) %>% .$coefficients %>% as_tibble(rownames = "variable") )) nested_penguins ``` ``` > # A tibble: 5 x 5 > # Groups: species, island [5] > species island data lin_reg coefs > <fct> <fct> <list> <list> <list> > 1 Adelie Torgersen <tibble [47 x 6]> <lm> <tibble [6 x 5]> > 2 Adelie Biscoe <tibble [44 x 6]> <lm> <tibble [6 x 5]> > 3 Adelie Dream <tibble [55 x 6]> <lm> <tibble [6 x 5]> > 4 Gentoo Biscoe <tibble [119 x 6]> <lm> <tibble [6 x 5]> > 5 Chinstrap Dream <tibble [68 x 6]> <lm> <tibble [6 x 5]> ``` --- ## 4.6 `purrr`: Functional Programming Tools Third, for each linear model, we generate a model summary using `summary()` and extract the model coefficients as a `tibble`. Finally, we use `unnest()` to receive a tidy data frame. ```r nested_penguins %>% tidyr::unnest(coefs) ``` ``` > # A tibble: 30 x 9 > # Groups: species, island [5] > species island data lin_reg variable Estimate `Std. Error` `t value` `Pr(>|t|)` > <fct> <fct> <lis> <list> <chr> <dbl> <dbl> <dbl> <dbl> > 1 Adelie Torgersen <tib~ <lm> (Interc~ 4.49e5 130401. 
3.45 0.00133 > 2 Adelie Torgersen <tib~ <lm> bill_le~ 4.20e0 17.3 0.243 0.809 > 3 Adelie Torgersen <tib~ <lm> bill_de~ -6.20e1 54.6 -1.14 0.263 > 4 Adelie Torgersen <tib~ <lm> flipper~ 1.55e1 8.74 1.77 0.0838 > 5 Adelie Torgersen <tib~ <lm> sexmale 6.48e2 149. 4.33 0.0000926 > 6 Adelie Torgersen <tib~ <lm> year -2.23e2 64.9 -3.44 0.00136 > 7 Adelie Biscoe <tib~ <lm> (Interc~ 6.37e4 140556. 0.454 0.653 > 8 Adelie Biscoe <tib~ <lm> bill_le~ 3.78e1 24.1 1.57 0.125 > 9 Adelie Biscoe <tib~ <lm> bill_de~ 1.16e2 44.3 2.62 0.0124 > 10 Adelie Biscoe <tib~ <lm> flipper~ 2.41e1 8.21 2.94 0.00553 > # ... with 20 more rows ``` .footnote[.pull-right[ *Note: There are specific packages (e.g., `broom`) for tidying model outputs. These provide convenient functions that help you achieve the same thing with much less code.* ]] --- ## 4.6 `purrr`: Functional Programming Tools .pull-left[ .center[ 🤔 How you may probably feel right now<br><br> <img src="https://tenor.com/view/matg-calculate-confusing-figure-out-gif-6237717.gif" style="display: block; margin: auto;" /> ]] -- .pull-right[ .center[ 🤓 After having mastered the intricacies of FP<br><br> <img src="https://tenor.com/view/cat-computer-gif-5368357.gif" style="display: block; margin: auto;" /> ]] .footnote[ *Note: Find more information about `purrr` by consulting the official [cheat sheet](https://raw.githubusercontent.com/rstudio/cheatsheets/master/purrr.pdf). For a great tutorial that helps you master the notion of functional programming with `R` see [this blogpost](http://www.rebeccabarter.com/blog/2019-08-19_purrr/#simplest-usage-repeated-looping-with-map) by Rebecca Barter.* ] --- ## 4.6 `purrr`: Functional Programming Tools Finally, `purrr` also provides convenient [wrapper functions](https://en.wikipedia.org/wiki/Wrapper_function) for **error handling**. These come in handy if you are iterating over a very large data set and your program would simply stop if an error occurs. 
This is particularly frustrating as you would lose all of your progress.

For example, at some point you might want to train a separate prediction model (`lm`) for each unique value of `species` (Adelie, Gentoo, Chinstrap). Unfortunately, the following code throws an error ...

```r
grouped_penguins <- penguins %>%
  dplyr::mutate(across(c(sex, island), as.factor)) %>%
  dplyr::group_by(species)
```

```r
grouped_penguins %>%
  group_map(.f = ~ lm(flipper_length_mm ~ bill_length_mm + island, data = .x))
```

```
> Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
>   contrasts can be applied only to factors with 2 or more levels
```

--

<br><br>

🤔 **Which group is ultimately responsible for the error?**

???

- wrapper functions: wrap a function around another function, i.e. you call a function when applying another function

---

## 4.6 `purrr`: Functional Programming Tools

`purrr::possibly()` wraps a function so that it returns the function's result or, if an error occurs, a user-defined value (`otherwise`).

```r
possibly_lm <- possibly(.f = lm, otherwise = "Error message")

grouped_penguins %>%
  group_map(.f = ~ possibly_lm(flipper_length_mm ~ bill_length_mm + island, data = .x))
```

```
> [[1]]
>
> Call:
> .f(formula = ..1, data = ..2)
>
> Coefficients:
>     (Intercept)   bill_length_mm      islandDream  islandTorgersen
>        157.5591           0.8014           1.3159           2.4199
>
>
> [[2]]
> [1] "Error message"
>
> [[3]]
> [1] "Error message"
```

.footnote[.pull-right[
*Note: Use `purrr::discard(. == "Error message")` (`purrr::keep()`) at the end of the pipeline to drop (keep) function calls that yielded an error.<br>These work like `dplyr::select()` and `dplyr::filter()` in the context of `tibbles`.*
]]

---

## 4.6 `purrr`: Functional Programming Tools

`purrr::safely()` returns a named list containing the function's result (or `otherwise` if an error occurs) as well as an error object that captures the error message.
```r
safely_lm <- safely(.f = lm, otherwise = NULL)

grouped_penguins %>%
  group_map(.f = ~ safely_lm(flipper_length_mm ~ bill_length_mm + island, data = .x))
```

<br><br>

- Use `purrr::map(., "result")` at the end of the pipeline to access the results of each function call stored in the list.<br><br>
- Use `purrr::map(., "error")` at the end of the pipeline to access the errors of each function call stored in the list.

.footnote[
*Note: Similarly, use `purrr::quietly()` to return a named list containing the function's result together with its printed output, warnings and messages (errors are not captured).*
]

???

- quietly: useful to capture messages and warnings that the code emits, e.g., `summarise()` frequently emits a message if you do not specify the `.groups` argument

---
class: middle, center, inverse
layout: false

# 4.7 `ggplot2`:<br><br>Create Elegant Data Visualisations<br>Using the Grammar of Graphics

---
background-image: url(https://raw.githubusercontent.com/tidyverse/ggplot2/master/man/figures/logo.png)
background-position: 97.5% 2.5%
background-size: 7.5%
layout: true

---

## 4.7 `ggplot2`: Elegant Data Visualisations

`ggplot2` is Hadley Wickham's [reimplementation](https://www.tandfonline.com/doi/abs/10.1198/jcgs.2009.07098) of Leland Wilkinson's *The Grammar of Graphics* (2005). It provides a large number of functions for generating high-quality graphs in a layer-based fashion and has even sparked a whole ecosystem of 'gg'-style visualization packages.

<br>

<img src="./img/grammar-of-graphic-layers.png" width="75%" style="display: block; margin: auto;" />

.center[
*Src: [towardsdatascience](https://towardsdatascience.com/a-comprehensive-guide-to-the-grammar-of-graphics-for-effective-visualization-of-multi-dimensional-1f92b4ed4149)*
]

???
- Where `dplyr` provides a grammar for data manipulation, `ggplot2` does the same for plotting - Up to now, most likely the `graphics` package included in base `R` was your go-to address for crafting visualisations (`plot()`, `hist()`, `boxplot()`). --- ## 4.7 `ggplot2`: Elegant Data Visualisations <img src="https://raw.githubusercontent.com/allisonhorst/stats-illustrations/master/rstats-artwork/ggplot2_masterpiece.png" width="60%" height="60%" style="display: block; margin: auto;" /> --- ## 4.7 `ggplot2`: Elegant Data Visualisations **Data:** The data set (usually a `tibble`) from which to select the variables that are about to be plotted. It is specified by the first argument in `ggplot()` and thus predestined to be piped into our plot pipeline. .pull-left[ **Univariate example:** ```r penguins %>% ggplot(data = .) # equivalent to: ggplot ``` ] .pull-right[ <img src="index_files/figure-html/unnamed-chunk-162-1.png" width="576" style="display: block; margin: auto;" /> ] .footnote[ *Note: A compact go-to-guide for data visualisations with `ggplot2` is the official [cheat sheet](https://raw.githubusercontent.com/rstudio/cheatsheets/master/data-visualization.pdf).* ] --- ## 4.7 `ggplot2`: Elegant Data Visualisations **Aesthetics:** Mappings that describe how variables in the data are mapped to aesthetic attributes in the plot, such as axes, shapes, sizes or colors. .pull-left[ **Univariate example:** ```r penguins %>% ggplot(aes(x = flipper_length_mm)) ``` ] .pull-right[ <img src="index_files/figure-html/unnamed-chunk-163-1.png" width="576" style="display: block; margin: auto;" /> ] ??? - You can already see that `ggplot2` extracts the ranges in your variables. --- ## 4.7 `ggplot2`: Elegant Data Visualisations **Geoms:** Geometric objects that determine your overall plot type, e.g., bar, lines, points or boxplots. They specify the graphical representation of your data. 
.pull-left[ **Univariate example:** ```r penguins %>% ggplot(aes(x = flipper_length_mm)) + geom_histogram(na.rm = TRUE) ``` `ggplot2` comes with decent default settings. Each `geom_*()` has its own options for customizing the geom, e.g., - change the number of bins with the `bins` argument, - change the width of the bins with `binwidth` argument. ] .pull-right[ <img src="index_files/figure-html/unnamed-chunk-164-1.png" width="576" style="display: block; margin: auto;" /> ] ??? Note that ggplots are constructed by adding layers with `+` instead of ` %>% ` --- ## 4.7 `ggplot2`: Elegant Data Visualisations **Geoms:** Geometric objects that determine your overall plot type, e.g., bar, lines, points or boxplots. They specify the graphical representation of your data. .pull-left[ **Univariate example:** ```r penguins %>% ggplot(aes(x = flipper_length_mm)) + geom_bar(na.rm = TRUE) ``` Eventually, you may realize the beauty of the `geom_*()` layers. They do all the required calculations for you! This is due to the frequently overlooked `stat` argument (which defaults to `stat = "count"` for the `geom_bar()` layer). ] .pull-right[ <img src="index_files/figure-html/unnamed-chunk-165-1.png" width="576" style="display: block; margin: auto;" /> ] --- ## 4.7 `ggplot2`: Elegant Data Visualisations **Stats:** Statistical transformations provide a summary of the data. They can be used to transform a given variable without changing the plot type (i.e. geom). .pull-left[ **Univariate example:** ```r penguins %>% ggplot(aes(x = flipper_length_mm)) + geom_bar(na.rm = TRUE, stat = "density") ``` Most of the time you will just plot the data as-is (`stat = "identity"`). As soon as you require some form of statistical transformation (e.g., count, density or unique) before plotting, the `stat` argument can handle this for you. ] .pull-right[ <img src="index_files/figure-html/unnamed-chunk-166-1.png" width="576" style="display: block; margin: auto;" /> ] ??? 
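- A minimal sketch of doing the statistical transformation yourself and then plotting the result as-is, assuming `dplyr`, `ggplot2` and `palmerpenguins` are installed. The names `flipper_counts`, `n_penguins`, `p_manual` and `p_stat` are made up for illustration.

```r
library(dplyr)
library(ggplot2)
library(palmerpenguins)

# Count the observations per flipper length ourselves ...
flipper_counts <- penguins %>%
  filter(!is.na(flipper_length_mm)) %>%
  count(flipper_length_mm, name = "n_penguins")

# ... and plot the precomputed values without any further transformation.
# geom_col() is the bar geom that uses stat = "identity".
p_manual <- ggplot(flipper_counts, aes(x = flipper_length_mm, y = n_penguins)) +
  geom_col()

# Same picture, with the counting left to the default stat of geom_bar()
p_stat <- ggplot(penguins, aes(x = flipper_length_mm)) +
  geom_bar(na.rm = TRUE)

p_manual
p_stat
```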
- you can also do all the transformations beforehand using `group_by` and `summarise` --- ## 4.7 `ggplot2`: Elegant Data Visualisations **Stats:** Statistical transformations provide a summary of the data. They can be used to transform a given variable without changing the plot type (i.e. geom). .pull-left[ **Univariate example:** ```r penguins %>% ggplot(aes(x = flipper_length_mm)) + geom_density(na.rm = TRUE) ``` If you have to manually change the default setting of the `stat` argument, it is likely that `ggplot2` has implemented a corresponding `geom_*()` already. ] .pull-right[ <img src="index_files/figure-html/unnamed-chunk-167-1.png" width="576" style="display: block; margin: auto;" /> ] .footnote[ .pull-left[ *Note: For a great explanation of the inner workings of the `stat` layer, see this [blog post](https://yjunechoe.github.io/posts/2020-09-26-demystifying-stat-layers-ggplot2/) by June Choe.* ]] --- ## 4.7 `ggplot2`: Elegant Data Visualisations **Data:** The data set (usually a `tibble`) from which to select the variables that are about to be plotted. It is specified by the first argument in `ggplot()` and thus predestined to be piped into our plot pipeline. .pull-left[ **Bivariate example:** ```r penguins %>% ggplot ``` ] .pull-right[ <img src="index_files/figure-html/unnamed-chunk-168-1.png" width="576" style="display: block; margin: auto;" /> ] --- ## 4.7 `ggplot2`: Elegant Data Visualisations **Aesthetics:** Mappings that describe how variables in the data are mapped to aesthetic attributes in the plot, such as axes, shapes, sizes or colors. .pull-left[ **Bivariate example:** ```r penguins %>% ggplot(aes(x = flipper_length_mm, y = body_mass_g)) ``` ] .pull-right[ <img src="index_files/figure-html/unnamed-chunk-169-1.png" width="576" style="display: block; margin: auto;" /> ] --- ## 4.7 `ggplot2`: Elegant Data Visualisations **Geoms:** Geometric objects that determine your overall plot type, e.g., bar, lines, points or boxplots. 
They specify the graphical representation of your data. .pull-left[ **Bivariate example:** ```r penguins %>% ggplot(aes(x = flipper_length_mm, y = body_mass_g)) + geom_point(na.rm = TRUE) ``` ] .pull-right[ <img src="index_files/figure-html/unnamed-chunk-170-1.png" width="576" style="display: block; margin: auto;" /> ] --- ## 4.7 `ggplot2`: Elegant Data Visualisations There are multiple ways of changing the color, shape or size aesthetics. Remember that using the `aes()` argument **maps** variable values to your aesthetic. The behavior differs for discrete vs. continuous variables. .pull-left[ **Bivariate example:** ```r penguins %>% ggplot(aes(x = flipper_length_mm, y = body_mass_g)) + geom_point(aes(color = species), na.rm = TRUE) ``` ] .pull-right[ <img src="index_files/figure-html/unnamed-chunk-171-1.png" width="576" style="display: block; margin: auto;" /> ] --- ## 4.7 `ggplot2`: Elegant Data Visualisations There are multiple ways of changing the color, shape or size aesthetics. Remember that using the `aes()` argument **maps** variable values to your aesthetic. The behavior differs for discrete vs. continuous variables. .pull-left[ **Bivariate example:** ```r penguins %>% ggplot(aes(x = flipper_length_mm, y = body_mass_g)) + geom_point(aes(color = bill_depth_mm), na.rm = TRUE) ``` ] .pull-right[ <img src="index_files/figure-html/unnamed-chunk-172-1.png" width="576" style="display: block; margin: auto;" /> ] --- ## 4.7 `ggplot2`: Elegant Data Visualisations By specifying the `color` argument outside of the `aes()` argument, we **set** the color without considering the values of any other variable. 
.pull-left[
**Bivariate example:**

```r
penguins %>% 
  ggplot(aes(x = flipper_length_mm, y = body_mass_g)) +
    geom_point(color = "red", na.rm = TRUE)
```

For truly customized colors you may use [HTML color codes](https://www.w3schools.com/colors/colors_picker.asp) (also called *hex codes*, e.g., `#ff0000` for red) instead of specifying colors by their [predefined name](http://sape.inf.usi.ch/quick-reference/ggplot2/colour) in `R`.
]

.pull-right[
<img src="index_files/figure-html/unnamed-chunk-173-1.png" width="576" style="display: block; margin: auto;" />
]

???

Remember:
- if something is specified inside of `aes()`, it is mapped, i.e. the characteristic depends on the data
- if something is specified outside of `aes()`, it is set manually
- hex codes: codes specifying the intensity of red (first two digits), green (middle two digits) and blue (last two digits)

---

## 4.7 `ggplot2`: Elegant Data Visualisations

We can do the same to change the `shape` and `size` of our data points, either by mapping them to the values of another variable or by setting them manually outside of `aes()`.

.pull-left[
**Bivariate example:**

```r
penguins %>% 
  ggplot(aes(x = flipper_length_mm, y = body_mass_g)) +
    geom_point(aes(shape = species), size = 4, na.rm = TRUE)
```

`ggplot2` supports the 26 standard point shapes of `R`, numbered 0 to 25, for customizing your plot (see [shape overview](https://ggplot2.tidyverse.org/reference/scale_shape-6.png)).
]

.pull-right[
<img src="index_files/figure-html/unnamed-chunk-174-1.png" width="576" style="display: block; margin: auto;" />
]

---

## 4.7 `ggplot2`: Elegant Data Visualisations

We can do the same to change the `shape` and `size` of our data points, either by mapping them to the values of another variable or by setting them manually outside of `aes()`.

.pull-left[
**Bivariate example:**

```r
penguins %>% 
  ggplot(aes(x = flipper_length_mm, y = body_mass_g)) +
    geom_point(aes(shape = species, size = bill_depth_mm), na.rm = TRUE)
```
]

.pull-right[
<img src="index_files/figure-html/unnamed-chunk-175-1.png" width="576" style="display: block; margin: auto;" />
]

---

## 4.7 `ggplot2`: Elegant Data Visualisations

**Facets:** Facets split the plot into multiple subplots based on the levels of one or more factor variables.

.pull-left[
**Bivariate example:**

```r
penguins %>% 
  ggplot(aes(x = flipper_length_mm, y = body_mass_g)) +
    geom_point(aes(shape = species), na.rm = TRUE) +
    facet_wrap(~ year)
```
]

.pull-right[
<img src="index_files/figure-html/unnamed-chunk-176-1.png" width="576" style="display: block; margin: auto;" />
]

---

## 4.7 `ggplot2`: Elegant Data Visualisations

**Facets:** Facets split the plot into multiple subplots based on the levels of one or more factor variables.

.pull-left[
**Bivariate example:**

```r
penguins %>% 
  ggplot(aes(x = flipper_length_mm, y = body_mass_g)) +
    geom_point(aes(shape = species), na.rm = TRUE) +
    facet_wrap(~ year + island)
```
]

.pull-right[
<img src="index_files/figure-html/unnamed-chunk-177-1.png" width="576" style="display: block; margin: auto;" />
]

???

- let's go back to one of the previous plots

---

## 4.7 `ggplot2`: Elegant Data Visualisations

**Scales:** Scales control the aesthetic mappings by overriding the *default* settings. For example, they allow you to refine the presentation of the x- and y-axes, labels or color palettes after the fact.

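
As a hedged sketch of what overriding the defaults can look like, the block below replaces both the default color palette and the default x-axis breaks and labels. It assumes that `ggplot2`, `scales` and `palmerpenguins` are installed; the color choices in `species_colors` and the object name `p_scaled` are arbitrary and only serve this illustration.

```r
library(ggplot2)
library(palmerpenguins)

# Hypothetical color choices, keyed by species name
species_colors <- c(Adelie = "darkorange", Chinstrap = "purple", Gentoo = "cyan4")

p_scaled <- ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(aes(color = species), size = 3, na.rm = TRUE) +
  # Override the default discrete color palette with manually chosen values
  scale_colour_manual(values = species_colors) +
  # Override the default x-axis breaks and append a unit to the tick labels
  scale_x_continuous(
    breaks = seq(from = 170, to = 230, by = 10),
    labels = scales::label_number(suffix = " mm")
  )

p_scaled
```

The mappings in `aes()` stay untouched; only the way the mapped values are translated into colors and axis ticks changes.
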
.pull-left[
**Bivariate example:**

```r
penguins %>% 
  ggplot(aes(x = flipper_length_mm, y = body_mass_g)) +
    geom_point(aes(color = species), size = 3, na.rm = TRUE) +
    scale_colour_brewer(palette = "Set3")
```

The family of `scale_colour_*()` functions enables you to adjust the values of your `color` aesthetic (e.g., `scale_colour_brewer()` selects a palette from the famous [ColorBrewer](https://colorbrewer2.org/#type=sequential&scheme=BuGn&n=3) project).
]

.pull-right[
<img src="index_files/figure-html/unnamed-chunk-178-1.png" width="576" style="display: block; margin: auto;" />
]

???

- use scales to change the default mappings of `ggplot2` afterwards

---

## 4.7 `ggplot2`: Elegant Data Visualisations

**Scales:** Scales control the aesthetic mappings by overriding the *default* settings. For example, they allow you to refine the presentation of the x- and y-axes, labels or color palettes after the fact.

.pull-left[
**Bivariate example:**

```r
penguins %>% 
  ggplot(aes(x = flipper_length_mm, y = body_mass_g)) +
    geom_point(aes(color = species), size = 3, na.rm = TRUE) +
    scale_y_log10()
```

Alternatively, use the `scale_*_log10()` functions to improve the readability of your plot when a variable is highly skewed or spans several orders of magnitude.
]

.pull-right[
<img src="index_files/figure-html/unnamed-chunk-179-1.png" width="576" style="display: block; margin: auto;" />
]

???

- here it does not change much because there are no outliers and the variation is rather low

---

## 4.7 `ggplot2`: Elegant Data Visualisations

.pull-left[
**Other examples:** Boxplots for numeric variables

```r
penguins_long <- penguins %>% 
  tidyr::pivot_longer(
    cols = contains("mm"),
    names_to = "var", values_to = "val"
  ) %>% 
  tidyr::drop_na()

penguins_long
```
]

.pull-right[
<br>

- Use `tidyr::pivot_longer()` to bring the data frame into *long* format.

- Take care of missing values using `tidyr::drop_na()` to avoid warnings about missing values when plotting.
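
The reshaping step described in the bullets above can be tried on a tiny table first. This is a minimal sketch, assuming the `tibble` and `tidyr` packages are installed; the toy data in `wide` and all object names are invented for this illustration.

```r
library(tibble)
library(tidyr)

# Toy data in wide format: one column per measurement, one missing value
wide <- tibble(
  id = 1:2,
  bill_length_mm = c(39.1, NA),
  bill_depth_mm = c(18.7, 17.4)
)

# Long format: one row per (id, measurement) pair
long <- pivot_longer(
  wide,
  cols = contains("mm"),
  names_to = "var", values_to = "val"
)

# Drop all rows that contain at least one missing value
long_complete <- drop_na(long)

long
long_complete
```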
]

```
> # A tibble: 999 x 7
>   species island    body_mass_g sex     year var                 val
>   <fct>   <fct>           <int> <fct>  <int> <chr>             <dbl>
> 1 Adelie  Torgersen        3750 male    2007 bill_length_mm     39.1
> 2 Adelie  Torgersen        3750 male    2007 bill_depth_mm      18.7
> 3 Adelie  Torgersen        3750 male    2007 flipper_length_mm 181  
> 4 Adelie  Torgersen        3800 female  2007 bill_length_mm     39.5
> 5 Adelie  Torgersen        3800 female  2007 bill_depth_mm      17.4
> 6 Adelie  Torgersen        3800 female  2007 flipper_length_mm 186  
> # ... with 993 more rows
```

---

## 4.7 `ggplot2`: Elegant Data Visualisations

.pull-left[
**Other examples:** Boxplots for numeric variables

```r
penguins_long %>% 
  ggplot(aes(x = var, y = val)) +
    geom_boxplot(na.rm = TRUE) +
    geom_jitter(alpha = 0.2, width = 0.3)
```

- Use `geom_jitter()` to add a small amount of random noise to the data points and thereby reduce overplotting (alternative to `geom_point()`).

- Control the transparency of the respective plot element via the `alpha` aesthetic.
]

.pull-right[
<img src="index_files/figure-html/unnamed-chunk-182-1.png" width="576" style="display: block; margin: auto;" />
]

---

## 4.7 `ggplot2`: Elegant Data Visualisations

.pull-left[
**Other examples:** Ordered bar chart

```r
plot <- penguins %>% 
  dplyr::count(species) %>% 
  dplyr::mutate(prop = n / sum(n)) %>% 
  ggplot()

plot + geom_col(aes(x = prop, y = species))
```

- You can easily store a `ggplot` object in a user-defined variable.

- Use `dplyr::count()` as a shortcut for `group_by()` and `summarise(n = n())`.

- You can set `aes()` and `data` either in `ggplot()` (*global*) or in `geom_*()` (*local*). In the latter case, the data and mappings only apply to that particular *geom*.
]

.pull-right[
<img src="index_files/figure-html/unnamed-chunk-183-1.png" width="576" style="display: block; margin: auto;" />
]

???

- global vs. local: local settings are useful if you want to use a different data set for individual layers.
- see that you can easily add new layers to a preexisting `ggplot` object

---

## 4.7 `ggplot2`: Elegant Data Visualisations

.pull-left[
**Other examples:** Ordered bar chart

```r
plot + 
  geom_col(
    aes(x = prop, y = forcats::fct_reorder(species, prop))) +
  scale_x_continuous(
    labels = scales::label_percent(accuracy = 1))
```

- Use `fct_reorder()` from the `forcats` package to reorder the levels of `species` by their relative frequency (`prop`).

- Finally, `scales::label_percent(accuracy = 1)` formats the axis labels as percentages, rounded to whole percentage points.
]

.pull-right[
<img src="index_files/figure-html/unnamed-chunk-184-1.png" width="576" style="display: block; margin: auto;" />
]

---

## 4.7 `ggplot2`: Elegant Data Visualisations

.pull-left[
**Other examples:** Adjacent bar chart

```r
penguins %>% 
  ggplot(aes(x = species)) +
    geom_bar(aes(fill = island), position = "dodge")
```
]

.pull-right[
<img src="index_files/figure-html/unnamed-chunk-185-1.png" width="576" style="display: block; margin: auto;" />
]

.footnote[.pull-left[
*Note: `geom_col()` takes an `x` and a `y` aesthetic, whereas `geom_bar()` only takes an `x` aesthetic and computes the `y`-quantity internally (e.g., the frequency using `stat = "count"`).*
]]

---

## 4.7 `ggplot2`: Elegant Data Visualisations

.pull-left[
**Other examples:** Stacked bar chart

```r
penguins %>% 
  ggplot(aes(x = species)) +
    geom_bar(aes(fill = island), position = "stack")
```
]

.pull-right[
<img src="index_files/figure-html/unnamed-chunk-186-1.png" width="576" style="display: block; margin: auto;" />
]

---

## 4.7 `ggplot2`: Elegant Data Visualisations

.pull-left[
**Other examples:** Stacked bar chart

```r
penguins %>% 
  ggplot(
    aes(x = forcats::fct_lump(species, n = 1))) +
    geom_bar(
      aes(fill = island), position = "stack")
```

In this crude example, `forcats::fct_lump()` lumps together all factor levels except the `n = 1` level(s) with the highest number of observations.
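
To see what the lumping does before it reaches the plot, the factor can be inspected on its own. This is a small sketch, assuming `forcats` and `palmerpenguins` are installed; the object name `species_lumped` is made up here. Adelie is the most frequent species in `penguins`, so it is the one level that survives, while the remaining species end up in the level `"Other"`.

```r
library(forcats)
library(palmerpenguins)

# Number of observations per species before lumping
table(penguins$species)

# Keep the single most frequent level and lump all others into "Other"
species_lumped <- fct_lump(penguins$species, n = 1)

levels(species_lumped)
table(species_lumped)
```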
]

.pull-right[
<img src="index_files/figure-html/unnamed-chunk-187-1.png" width="576" style="display: block; margin: auto;" />
]

---

## 4.7 `ggplot2`: Elegant Data Visualisations

.pull-left[
**Other examples:** High-quality density plot

```r
p <- penguins %>% 
  ggplot(aes(x = body_mass_g)) +
    geom_density(aes(fill = species), na.rm = TRUE, alpha = 0.4)

p
```
]

.pull-right[
<img src="index_files/figure-html/unnamed-chunk-188-1.png" width="576" style="display: block; margin: auto;" />
]

---

## 4.7 `ggplot2`: Elegant Data Visualisations

.pull-left[
**Other examples:** High-quality density plot

```r
breaks <- seq(from = 3000, to = 6000, by = 500)
scales <- scales::label_comma(accuracy = 0.0001)

p <- p + 
  scale_x_continuous(breaks = breaks, limits = c(2000, 7000)) +
  scale_y_continuous(labels = scales)

p
```
]

.pull-right[
<img src="index_files/figure-html/unnamed-chunk-189-1.png" width="576" style="display: block; margin: auto;" />
]

---

## 4.7 `ggplot2`: Elegant Data Visualisations

.pull-left[
**Other examples:** High-quality density plot

```r
p <- p + 
  labs(
    title = "Density Function for Three Penguin Species of Palmer Penguins",
    subtitle = "Palmer Archipelago (2007-2009)",
    caption = "Data: https://github.com/allisonhorst/palmerpenguins",
    x = "Body mass [grams]",
    y = "Statistical density"
  )

p
```
]

.pull-right[
<img src="index_files/figure-html/unnamed-chunk-190-1.png" width="576" style="display: block; margin: auto;" />
]

---

## 4.7 `ggplot2`: Elegant Data Visualisations

.pull-left[
**Other examples:** High-quality density plot

```r
p <- p + 
  theme_classic() # otherwise: theme_minimal()

p
```

The `theme()` function allows you to customize all elements of your plot that are not directly related to your data, e.g., titles, labels, fonts, background, or legends. `ggplot2` also comes with a set of [predefined themes](https://ggplot2.tidyverse.org/reference/ggtheme.html) (`theme_*()`).
]

.pull-right[
<img src="index_files/figure-html/unnamed-chunk-191-1.png" width="576" style="display: block; margin: auto;" />
]

---

## 4.7 `ggplot2`: Elegant Data Visualisations

.pull-left[
**Other examples:** High-quality density plot

```r
p <- p + 
  theme(
    legend.position = "top",
    plot.title = element_text(size = 14, face = "bold"),
    plot.subtitle = element_text(size = 12),
    plot.caption = element_text(size = 10, face = "italic"),
    axis.text.x = element_text(size = 10),
    axis.text.y = element_blank(),
    axis.title = element_text(size = 10)
  )

p
```
]

.pull-right[
<img src="index_files/figure-html/unnamed-chunk-192-1.png" width="576" style="display: block; margin: auto;" />
]

---

## 4.7 `ggplot2`: Elegant Data Visualisations

.pull-left[
**Other examples:** Violin Plot

```r
penguins %>% 
  ggplot(aes(x = species, y = body_mass_g)) +
    geom_violin(aes(fill = species), na.rm = TRUE) +
    theme_classic()

ggsave("./img/violin-plot.PNG", device = "png", dpi = 300)
```

- `geom_violin()` creates a hybrid of a box plot and a density plot, which is particularly suitable for visualizing continuous variables.

- `ggsave()` writes the most recent plot to disk.
]

.pull-right[
<img src="index_files/figure-html/unnamed-chunk-193-1.png" width="576" style="display: block; margin: auto;" />
]

---

## 4.7 `ggplot2`: Elegant Data Visualisations

.pull-left[
**Other examples:** Lines of Best Fit

```r
penguins %>% 
  tidyr::drop_na() %>% 
  ggplot(aes(x = flipper_length_mm, y = body_mass_g)) +
    geom_point(aes(color = species)) +
    geom_smooth(method = "lm", se = TRUE)
```

- Use `geom_smooth()` to fit a smooth line that depicts the relationship between `x` and `y`.

- For the `method` argument specify one of:
  - *lm* (linear model),
  - *glm* (generalized linear model),
  - *gam* (generalized additive model),
  - *loess* (local regression).

- Set the `se` argument to `TRUE` to obtain standard error bands (i.e. confidence intervals).
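
A variation on the example above, as a minimal sketch: it assumes `ggplot2`, `tidyr` and `palmerpenguins` are installed, and the object names `penguins_complete` and `p_by_species` are made up here. Because `color = species` is mapped globally in `ggplot()` rather than locally in `geom_point()`, the smoothing layer inherits the grouping and fits one linear model per species instead of a single line for all observations.

```r
library(ggplot2)
library(palmerpenguins)

penguins_complete <- tidyr::drop_na(penguins)

p_by_species <- ggplot(
    penguins_complete,
    # Global mapping: color is inherited by every layer, including the smoother
    aes(x = flipper_length_mm, y = body_mass_g, color = species)
  ) +
  geom_point() +
  # One linear fit per species, with confidence bands
  geom_smooth(method = "lm", formula = y ~ x, se = TRUE)

p_by_species
```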
]

.pull-right[
.panelset[
.panel[.panel-name[lm]
<img src="index_files/figure-html/unnamed-chunk-194-1.png" width="576" style="display: block; margin: auto;" />
]
.panel[.panel-name[glm]
<img src="index_files/figure-html/unnamed-chunk-195-1.png" width="576" style="display: block; margin: auto;" />
]
.panel[.panel-name[gam]
<img src="index_files/figure-html/unnamed-chunk-196-1.png" width="576" style="display: block; margin: auto;" />
]
.panel[.panel-name[loess]
<img src="index_files/figure-html/unnamed-chunk-197-1.png" width="576" style="display: block; margin: auto;" />
]
]
]

???

- lm: y follows the normal distribution
- glm: y follows a distribution other than the normal (e.g., binomial or Poisson), but also includes lm as a special case (generalization)
- gam: y and x can exhibit non-linear relationships
- loess: local regression fits the relationship between y and x locally and allows for substantial non-linearities

---

## 4.7 `ggplot2`: Elegant Data Visualisations

By now, there is a whole ecosystem (aka the [ggverse](https://github.com/erikgahner/awesome-ggplot2)) of amazing packages, all created in the spirit of `ggplot2`, which further extend its capabilities:

<img src="https://tenor.com/view/shocked-po-kung-fu-panda-gif-4255877.gif" width="40%" height="40%" style="float:right; padding:10px" />

- `scales`: Scale Functions for Visualization
- `ggtext`: Improved Text Rendering Support for `ggplot2`
- `ggraph`: An Implementation of Grammar of Graphics for Graphs and Networks
- `ggstatsplot`: `ggplot2` Based Plots with Statistical Details
- `plotly`: Create Interactive Web Graphics via `plotly.js`
- `patchwork`: The Composer of Plots
- `ggforce`: Accelerating `ggplot2`
- etc.

???
- ggforce: extension of the `ggplot2` functionality that allows for more customization

---
background-image: url(https://raw.githubusercontent.com/ropensci/plotly/master/man/figures/plotly.png)
background-position: 97.5% 2.5%
background-size: 15%
layout: false

## 4.7 `plotly`: Interactive Web Graphics

```r
plotly::ggplotly(p)
```

---
background-image: url(https://raw.githubusercontent.com/thomasp85/patchwork/master/man/figures/logo.png)
background-position: 97.5% 2.5%
background-size: 7.5%
layout: false

## 4.7 `patchwork`: The Composer of Plots

.pull-left[
```r
library(patchwork)
p + p + p
```

<img src="index_files/figure-html/unnamed-chunk-200-1.png" width="576" style="display: block; margin: auto;" />
]

.pull-right[
```r
library(patchwork)
p + (p / p)
```

<img src="index_files/figure-html/unnamed-chunk-201-1.png" width="576" style="display: block; margin: auto;" />
]

---
background-image: url(https://raw.githubusercontent.com/thomasp85/ggforce/master/man/figures/logo.png)
background-position: 97.5% 2.5%
background-size: 7.5%
layout: false

## 4.7 `ggforce`: Accelerating `ggplot2`

.pull-left[
```r
penguins %>% 
  tidyr::drop_na() %>% 
  ggplot(aes(x = .panel_x, y = .panel_y, col = sex, fill = sex)) +
    ggforce::geom_autopoint(alpha = 0.5) +
    ggforce::geom_autohistogram(alpha = 0.5) +
    ggforce::facet_matrix(
      rows = vars(species, island, body_mass_g, flipper_length_mm),
      switch = "both", layer.diag = 2) +
    theme_bw() +
    theme(axis.text.x = element_text(angle = 90))
```
]

.pull-right[
<img src="index_files/figure-html/unnamed-chunk-202-1.png" width="576" style="display: block; margin: auto;" />
]

---

## Thank You!

.pull-left[
.center[🤔 **Right now**]<br>
<img src="https://tenor.com/view/homer-daydreaming-thinking-simpsons-gif-8949118.gif" style="display: block; margin: auto;" />
]

.pull-right[
.center[🤓 **After having mastered the `tidyverse`**]<br><br>
<img src="https://tenor.com/view/homer-gif-10571731.gif" style="display: block; margin: auto;" />
]

.footnote[
*Note: Admittedly, not everything is great in the `tidyverse`. You should always be aware of its [downsides](https://github.com/matloff/TidyverseSkeptic/blob/master/READMEFull.md) and know when to return to using `base R`.*
]

---

## Further Resources

**Wickham, H./Grolemund, G.
(2017):** R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. URL: https://r4ds.had.co.nz/tidy-data.html. (*Best read for getting started with the `tidyverse`*)

**Wickham, H./Navarro, D./Lin Pedersen, T. (2020):** ggplot2: Elegant Graphics for Data Analysis. 3rd edition, Online Publication 2020. URL: https://ggplot2-book.org/. (*Additional resource for diving deeper into the world of `ggplot2`*)

Stay up-to-date with recent developments in the `tidyverse`: https://www.tidyverse.org/blog/

Watch live-coding sessions related to the [TidyTuesday](https://github.com/rfordatascience/tidytuesday) Project, e.g., the episodes by David Robinson: https://www.youtube.com/user/safe4democracy/videos

## Credits

Educational resources are inspired by [workshop materials](https://github.com/rstudio-education/remaster-the-tidyverse) of Garrett Grolemund and [blog posts](https://education.rstudio.com/blog/2020/07/teaching-the-tidyverse-in-2020-part-1-getting-started/) by Mine Çetinkaya-Rundel of the RStudio Education team.

`tidyverse` [artwork and illustrations](https://github.com/allisonhorst/stats-illustrations) are provided by Allison Horst.