Tidy data and data wrangling

---

<div class="my-footer">
<span>
<a href="http://datasciencebox.org" target="_blank">datasciencebox.org</a>
</span>
</div>

---

# Tidy data

---

## Tidy data

>Happy families are all alike; every unhappy family is unhappy in its own way. 
>
>Leo Tolstoy
--

- Each variable forms a column.
- Each observation forms a row.
- Each type of observational unit forms a table.
]
--
.pull-right[
**Characteristics of untidy data:**

!@#$%^&*()
]

---

<div class="figure" style="text-align: center">
<img src="img/04/untidy-data/hyperwar-airplanes-on-hand.png" alt="WW2 Army Air Force combat aircraft from [Army Air Forces Statistical Digest, WW II](https://www.ibiblio.org/hyperwar/AAF/StatDigest/aafsd-3.html)" width="80%" />
<p class="caption">WW2 Army Air Force combat aircraft from [Army Air Forces Statistical Digest, WW II](https://www.ibiblio.org/hyperwar/AAF/StatDigest/aafsd-3.html)</p>
</div>

---

<br>

<div class="figure" style="text-align: center">
<img src="img/04/untidy-data/us-general-economic-characteristic-acs-2017.png" alt="[US Census Fact Finder, General Economic Characteristics, ACS 2017](https://factfinder.census.gov/faces/tableservices/jsf/pages/productview.xhtml?pid=ACS_17_5YR_DP03&amp;src=pt)" width="90%" />
<p class="caption">[US Census Fact Finder, General Economic Characteristics, ACS 2017](https://factfinder.census.gov/faces/tableservices/jsf/pages/productview.xhtml?pid=ACS_17_5YR_DP03&src=pt)</p>
</div>

---

## Summary tables

```
## # A tibble: 87 × 3
##    name               height  mass
##    <chr>               <int> <dbl>
##  1 Luke Skywalker        172    77
##  2 C-3PO                 167    75
##  3 R2-D2                  96    32
##  4 Darth Vader           202   136
##  5 Leia Organa           150    49
##  6 Owen Lars             178   120
##  7 Beru Whitesun lars    165    75
##  8 R5-D4                  97    32
##  9 Biggs Darklighter     183    84
## 10 Obi-Wan Kenobi        182    77
## # … with 77 more rows
```
]
.pull-right[

```
## # A tibble: 3 × 2
##   gender    avg_height
##   <chr>          <dbl>
## 1 feminine        165.
## 2 masculine       177.
## 3 <NA>            181.
```
]
]

---

## Displaying data

```r
starwars %>%
  select(name, height, mass)
```

<br>

## Summarizing data

```r
starwars %>%
  group_by(gender) %>%
  summarize(
    avg_height = mean(height, na.rm = TRUE) %>% round(2)
  )
```

---

# Grammar of data wrangling

---

## A grammar of data wrangling...

... based on the concepts of functions as verbs that manipulate data frames

.pull-left[
<img src="img/04/dplyr-part-of-tidyverse.png" width="70%" style="display: block; margin: auto;" />
]
.pull-right[
.small[
- `select`: pick columns by name
- `arrange`: reorder rows
- `slice`: pick rows using index(es)
- `filter`: pick rows matching criteria
- `distinct`: filter for unique rows
- `mutate`: add new variables
- `summarise`: reduce variables to values
- `group_by`: for grouped operations
- ... (many more)
]
]

---

## Rules of **dplyr** functions

- First argument is *always* a data frame

- Subsequent arguments say what to do with that data frame

- Always return a data frame

---

## Data: Data Science Survey

- Data from Kaggle: In 2017, Kaggle conducted an industry-wide survey to establish a comprehensive view of the state of data science and machine learning.

- https://www.kaggle.com/datasets/kaggle/kaggle-survey-2017/
  
- Observations: Each row represents a different survey respondent

```r
datascience <- read_csv("data/kaggle_survey_subset.csv", show_col_types = F) 
```

---

## First look: Variables

```r
names(datascience)
```

```
##  [1] "...1"                   "Country"                "Gender"                
##  [4] "Age"                    "EmploymentStatus"       "EmployerIndustry"      
##  [7] "FormalEducation"        "Major"                  "CompensationAmount"    
## [10] "CompensationCurrency"   "CurrentJobTitle"        "TitleFit"              
## [13] "LanguageRecommendation" "DataScienceIdentity"    "WorkDataVisualizations"
## [16] "JobSatisfaction"        "JobSatisfaction2"
```

---

## Second look: Overview

```r
glimpse(datascience)
```

```
## Rows: 2,618
## Columns: 17
## $ ...1                   <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, …
## $ Country                <chr> "United States", "Russia", "Taiwan", "United St…
## $ Gender                 <chr> "Male", "Male", "Male", "Male", "Male", "Male",…
## $ Age                    <dbl> 56, 33, 26, 25, 33, 21, 22, 29, 35, 37, 31, 40,…
## $ EmploymentStatus       <chr> "Independent contractor, freelancer, or self-em…
## $ EmployerIndustry       <chr> "Mix of fields", "Internet-based", "Financial",…
## $ FormalEducation        <chr> "Master's degree", "Bachelor's degree", "Master…
## $ Major                  <chr> "Mathematics or statistics", "Other", "Computer…
## $ CompensationAmount     <dbl> 250000, 1200000, 1100000, 20000, 100000, 20000,…
## $ CompensationCurrency   <chr> "USD", "RUB", "TWD", "USD", "USD", "CZK", "RUB"…
## $ CurrentJobTitle        <chr> "Operations Research Practitioner", "Software D…
## $ TitleFit               <chr> "Poorly", "Fine", "Fine", "Fine", "Perfectly", …
## $ LanguageRecommendation <chr> "Python", "Python", "Python", "Python", "Matlab…
## $ DataScienceIdentity    <chr> "Yes", "Sort of (Explain more)", "No", "Yes", "…
## $ WorkDataVisualizations <chr> "100% of projects", "Less than 10% of projects"…
## $ JobSatisfaction        <chr> "10 - Highly Satisfied", "7", "5", "6", "7", "5…
## $ JobSatisfaction2       <dbl> 10, 7, 5, 6, 7, 5, 8, 9, 8, 9, 6, 8, 6, 7, 7, 1…
```

---

## Select a single column

View only the `Age`:

--
.pull-left[

```r
datascience %>%
  select(Age)
```

```
## # A tibble: 2,618 × 1
##      Age
##    <dbl>
##  1    56
##  2    33
##  3    26
##  4    25
##  5    33
##  6    21
##  7    22
##  8    29
##  9    35
## 10    37
## # … with 2,608 more rows
```
]
--
.pull-right[
- Start with a data frame
- Pass it to the `select()` function.
- Argument is variable we want to select: `Age`
- The result is a data frame with 2,657 and 1 column: --dplyr functions always 
expect a data frame and always yield a data frame.
]

---

## Select multiple columns

View only the `Age` type and `Major`:

```r
datascience %>%
  select(Age, Major)
```

```
## # A tibble: 2,618 × 2
##      Age Major                                                       
##    <dbl> <chr>                                                       
##  1    56 Mathematics or statistics                                   
##  2    33 Other                                                       
##  3    26 Computer Science                                            
##  4    25 Physics                                                     
##  5    33 Electrical Engineering                                      
##  6    21 Computer Science                                            
##  7    22 Information technology, networking, or system administration
##  8    29 Computer Science                                            
##  9    35 Physics                                                     
## 10    37 Electrical Engineering                                      
## # … with 2,608 more rows
```
]
--
.pull-right[
.discussion[
What if we wanted to select these columns, and then arrange the data in descending order of Age?
]
]

---

## Data wrangling, step-by-step

```r
datascience %>%
  select(Age, Major)
```

--
.pull-right[
.vocab[Select, then arrange:]

```r
datascience %>%
  select(Age, Major) %>%
  arrange(desc(Age))
```

```
## # A tibble: 2,618 × 2
##      Age Major                                                       
##    <dbl> <chr>                                                       
##  1   100 Engineering (non-computer focused)                          
##  2    99 Information technology, networking, or system administration
##  3    78 Electrical Engineering                                      
##  4    74 A social science                                            
##  5    73 Electrical Engineering                                      
##  6    70 Mathematics or statistics                                   
##  7    70 Engineering (non-computer focused)                          
##  8    70 Engineering (non-computer focused)                          
##  9    69 Physics                                                     
## 10    68 Mathematics or statistics                                   
## # … with 2,608 more rows
```
]

---

## What's happening here?

```r
datascience %>%
  select(Age, Major) %>%
  arrange(Major)
```

```
## # A tibble: 2,618 × 2
##      Age Major           
##    <dbl> <chr>           
##  1    26 A health science
##  2    28 A health science
##  3    28 A health science
##  4    49 A health science
##  5    34 A health science
##  6    30 A health science
##  7    30 A health science
##  8    43 A health science
##  9    39 A health science
## 10    31 A health science
## # … with 2,608 more rows
```

---

# Pipes

---

## What is a pipe?

In programming, a pipe is a technique for passing information from one process to another.

.pull-left[
- Start with the data frame `datasccience`, and pass it to the `select()` function,
]
.pull-right[
.small[

```r
*datascience %>%
  select(Age, Major) %>%
  arrange(desc(Age))
```

---

## What is a pipe?

In programming, a pipe is a technique for passing information from one process to another.

.pull-left[
- Start with the data frame `datascience`, and pass it to the `select()` function,
- then we select the variables `Age` and `Major`,
]
.pull-right[
.small[

```r
datascience %>%
* select(Age, Major) %>%
  arrange(desc(Age))
```

---

## What is a pipe?

In programming, a pipe is a technique for passing information from one process to another.

.pull-left[
- Start with the data frame `datascience`, and pass it to the `select()` function,
- then we select the variables `Age` and `Major`,
- and then we arrange the data frame by `Age` in descending order.
]
.pull-right[
.small[

```r
datascience %>%
  select(Age, Major) %>% 
* arrange(desc(Age))
```

---

## How does a pipe work?

- You can think about the following sequence of actions - find keys,  start car, drive to work, park.

--
- Expressed as a set of nested functions in R pseudocode this would look like:

```r
park(drive(start_car(find("keys")), to = "work"))
```

--
- Writing it out using pipes give it a more natural (and easier to read)
structure:
  - Read the pipe as "and then"

```r
find("keys") %>%
  start_car() %>%
  drive(to = "work") %>%
  park()
```

---

## What about other arguments?

Use the dot to

- send results to a function argument other than first one or 
- use the previous result for multiple arguments

```r
datascience %>%
  filter(Major == "Computer Science") %>%
* ggplot(., aes(x = Age)) +
  geom_histogram(bins = 10)
```

---

# Working with a single data frame

---

### You have a single data frame, and you want to process it and prepare it for anlaysis!

---

## `select` to keep variables

```r
datascience %>%
* select(Age, Major)
```

---

## `select` to exclude variables

```r
datascience %>%
* select(-Age)
```

```
## # A tibble: 2,618 × 16
##     ...1 Country    Gender Emplo…¹ Emplo…² Forma…³ Major Compe…⁴ Compe…⁵ Curre…⁶
##    <dbl> <chr>      <chr>  <chr>   <chr>   <chr>   <chr>   <dbl> <chr>   <chr>  
##  1     1 United St… Male   Indepe… Mix of… Master… Math…  250000 USD     Operat…
##  2     2 Russia     Male   Employ… Intern… Bachel… Other 1200000 RUB     Softwa…
##  3     3 Taiwan     Male   Employ… Financ… Master… Comp… 1100000 TWD     Softwa…
##  4     4 United St… Male   Employ… Academ… Bachel… Phys…   20000 USD     Resear…
##  5     5 United St… Male   Employ… Teleco… Doctor… Elec…  100000 USD     Scient…
##  6     6 Czech Rep… Male   Employ… Academ… Some c… Comp…   20000 CZK     Other  
##  7     7 Russia     Male   Employ… Mix of… Bachel… Info…  624000 RUB     Data A…
##  8     8 Poland     Male   Employ… Pharma… Master… Comp…  126000 PLN     Softwa…
##  9     9 United St… Male   Employ… Techno… Doctor… Phys…  133000 USD     Engine…
## 10    10 Sweden     Male   Employ… Techno… Master… Elec…   80000 USD     Machin…
## # … with 2,608 more rows, 6 more variables: TitleFit <chr>,
## #   LanguageRecommendation <chr>, DataScienceIdentity <chr>,
## #   WorkDataVisualizations <chr>, JobSatisfaction <chr>,
## #   JobSatisfaction2 <dbl>, and abbreviated variable names ¹EmploymentStatus,
## #   ²EmployerIndustry, ³FormalEducation, ⁴CompensationAmount,
## #   ⁵CompensationCurrency, ⁶CurrentJobTitle
```
]

---

## `select` a range of variables

```r
datascience %>%
* select(Gender:EmploymentStatus)
```

```
## # A tibble: 2,618 × 3
##    Gender   Age EmploymentStatus                                    
##    <chr>  <dbl> <chr>                                               
##  1 Male      56 Independent contractor, freelancer, or self-employed
##  2 Male      33 Employed full-time                                  
##  3 Male      26 Employed full-time                                  
##  4 Male      25 Employed part-time                                  
##  5 Male      33 Employed full-time                                  
##  6 Male      21 Employed part-time                                  
##  7 Male      22 Employed full-time                                  
##  8 Male      29 Employed full-time                                  
##  9 Male      35 Employed full-time                                  
## 10 Male      37 Employed full-time                                  
## # … with 2,608 more rows
```

---

## `arrange` in ascending / descending order

```r
datascience %>%
  select(Age, Gender) %>%
* arrange(Age)
```

```
## # A tibble: 2,618 × 2
##      Age Gender              
##    <dbl> <chr>               
##  1     0 Male                
##  2     1 Male                
##  3    19 Female              
##  4    19 Female              
##  5    19 Male                
##  6    20 Male                
##  7    20 Male                
##  8    20 Female              
##  9    20 A different identity
## 10    20 Male                
## # … with 2,608 more rows
```
]
.pull-right[

```r
datascience %>%
  select(Age, Gender) %>%
* arrange(desc(Age))
```

```
## # A tibble: 2,618 × 2
##      Age Gender              
##    <dbl> <chr>               
##  1   100 A different identity
##  2    99 A different identity
##  3    78 Male                
##  4    74 Male                
##  5    73 Male                
##  6    70 Male                
##  7    70 Male                
##  8    70 Male                
##  9    69 Male                
## 10    68 A different identity
## # … with 2,608 more rows
```
]

---

## `slice` for certain row numbers

```r
# first five
datascience %>%
* slice(1:5)
```

```
## # A tibble: 5 × 17
##    ...1 Country       Gender   Age Emplo…¹ Emplo…² Forma…³ Major Compe…⁴ Compe…⁵
##   <dbl> <chr>         <chr>  <dbl> <chr>   <chr>   <chr>   <chr>   <dbl> <chr>  
## 1     1 United States Male      56 Indepe… Mix of… Master… Math…  250000 USD    
## 2     2 Russia        Male      33 Employ… Intern… Bachel… Other 1200000 RUB    
## 3     3 Taiwan        Male      26 Employ… Financ… Master… Comp… 1100000 TWD    
## 4     4 United States Male      25 Employ… Academ… Bachel… Phys…   20000 USD    
## 5     5 United States Male      33 Employ… Teleco… Doctor… Elec…  100000 USD    
## # … with 7 more variables: CurrentJobTitle <chr>, TitleFit <chr>,
## #   LanguageRecommendation <chr>, DataScienceIdentity <chr>,
## #   WorkDataVisualizations <chr>, JobSatisfaction <chr>,
## #   JobSatisfaction2 <dbl>, and abbreviated variable names ¹EmploymentStatus,
## #   ²EmployerIndustry, ³FormalEducation, ⁴CompensationAmount,
## #   ⁵CompensationCurrency
```
]

---

.tip[
In R, you can use the `#` (hashtag or pound sign, depending on your age for adding comments to your code. 
Any text following `#` will be printed as is, and won't be run as R code.
This is useful for leaving comments in your code and for temporarily disabling 
certain lines of code while debugging.
]

```r
datascience %>%
  # slice the first five rows  # this line is a comment
  #select(Age) %>%           # this one doesn't run
  slice(1:5)                   # this line runs
```

---

## `slice` for certain row numbers

```r
# last five
last_row <- nrow(datascience) # nrow() gives the number of rows in a data frame
datascience %>%
* slice((last_row - 4):last_row)
```

```
## # A tibble: 5 × 17
##    ...1 Country Gender   Age EmploymentS…¹ Emplo…² Forma…³ Major Compe…⁴ Compe…⁵
##   <dbl> <chr>   <chr>  <dbl> <chr>         <chr>   <chr>   <chr>   <dbl> <chr>  
## 1  2614 Japan   Male      32 Employed par… Retail  Bachel… Comp…   3.5e6 JPY    
## 2  2615 Other   Male      26 Independent … Financ… Bachel… Other   1.2e7 UGX    
## 3  2616 Other   Male      56 Employed ful… Other   Bachel… Math…   5  e4 USD    
## 4  2617 Germany Male      34 Employed ful… Teleco… Bachel… Comp…   8.4e4 EUR    
## 5  2618 Norway  Male      32 Employed ful… Financ… Master… Math…   7.5e4 EUR    
## # … with 7 more variables: CurrentJobTitle <chr>, TitleFit <chr>,
## #   LanguageRecommendation <chr>, DataScienceIdentity <chr>,
## #   WorkDataVisualizations <chr>, JobSatisfaction <chr>,
## #   JobSatisfaction2 <dbl>, and abbreviated variable names ¹EmploymentStatus,
## #   ²EmployerIndustry, ³FormalEducation, ⁴CompensationAmount,
## #   ⁵CompensationCurrency
```
]

---

## `filter` to select a subset of rows

```r
# respondents majoring in CS
datascience %>%
* filter(Major  == "Computer Science")
```

```
## # A tibble: 801 × 17
##     ...1 Country      Gender   Age Emplo…¹ Emplo…² Forma…³ Major Compe…⁴ Compe…⁵
##    <dbl> <chr>        <chr>  <dbl> <chr>   <chr>   <chr>   <chr>   <dbl> <chr>  
##  1     3 Taiwan       Male      26 Employ… Financ… Master… Comp… 1100000 TWD    
##  2     6 Czech Repub… Male      21 Employ… Academ… Some c… Comp…   20000 CZK    
##  3     8 Poland       Male      29 Employ… Pharma… Master… Comp…  126000 PLN    
##  4    11 Iran         Male      31 Employ… Techno… Doctor… Comp…   15000 USD    
##  5    19 Brazil       Male      25 Employ… Academ… Master… Comp…    1500 BRL    
##  6    20 Brazil       Male      32 Employ… Academ… Master… Comp…   25000 BRL    
##  7    21 Hungary      Male      38 Employ… Techno… Bachel… Comp…  372000 HUF    
##  8    23 Russia       Male      31 Indepe… CRM/Ma… Some c… Comp…  840000 RUB    
##  9    29 India        Male      23 Employ… Techno… Master… Comp…  360000 INR    
## 10    32 New Zealand  Male      30 Employ… Govern… Bachel… Comp…   90000 NZD    
## # … with 791 more rows, 7 more variables: CurrentJobTitle <chr>,
## #   TitleFit <chr>, LanguageRecommendation <chr>, DataScienceIdentity <chr>,
## #   WorkDataVisualizations <chr>, JobSatisfaction <chr>,
## #   JobSatisfaction2 <dbl>, and abbreviated variable names ¹EmploymentStatus,
## #   ²EmployerIndustry, ³FormalEducation, ⁴CompensationAmount,
## #   ⁵CompensationCurrency
```
]

---

## `filter` for many conditions at once

```r
datascience %>%
  filter( 
*   Major == "Computer Science",
*   Age == 30
    ) %>% 
  select(Major, Age)
```

```
## # A tibble: 41 × 2
##    Major              Age
##    <chr>            <dbl>
##  1 Computer Science    30
##  2 Computer Science    30
##  3 Computer Science    30
##  4 Computer Science    30
##  5 Computer Science    30
##  6 Computer Science    30
##  7 Computer Science    30
##  8 Computer Science    30
##  9 Computer Science    30
## 10 Computer Science    30
## # … with 31 more rows
```

---

## `filter` for more complex conditions

```r
datascience %>%
  filter( 
*   Major == "Computer Science" | Major == "Other",   # | means or
*   Age == 30
    ) %>% 
  select(Major, Age)
```

```
## # A tibble: 50 × 2
##    Major              Age
##    <chr>            <dbl>
##  1 Computer Science    30
##  2 Computer Science    30
##  3 Computer Science    30
##  4 Computer Science    30
##  5 Computer Science    30
##  6 Computer Science    30
##  7 Computer Science    30
##  8 Computer Science    30
##  9 Computer Science    30
## 10 Other               30
## # … with 40 more rows
```

---

## Logical operators in R

<br>

operator    | definition                   || operator     | definition
------------|------------------------------||--------------|----------------
`<`         | less than                   ||`x`&nbsp;&#124;&nbsp;`y`     | `x` OR `y` 
`<=`        |	less than or equal to        ||`is.na(x)`    | test if `x` is `NA`
`>`         | greater than                 ||`!is.na(x)`   | test if `x` is not `NA`
`>=`        |	greater than or equal to     ||`x %in% y`    | test if `x` is in `y`
`==`        |	exactly equal to             ||`!(x %in% y)` | test if `x` is not in `y`
`!=`        |	not equal to                 ||`!x`          | not `x`
`x & y`     | `x` AND `y`                  ||              |

---

## Demo

---

## `distinct` to filter for unique rows

... and `arrange` to order alphabetically. Note which variables are kept in the resulting data frame!

```r
datascience %>% 
* distinct(EmployerIndustry) %>%
  arrange(EmployerIndustry)
```

```
## # A tibble: 16 × 1
##    EmployerIndustry                
##    <chr>                           
##  1 Academic                        
##  2 CRM/Marketing                   
##  3 Financial                       
##  4 Government                      
##  5 Hospitality/Entertainment/Sports
##  6 Insurance                       
##  7 Internet-based                  
##  8 Manufacturing                   
##  9 Military/Security               
## 10 Mix of fields                   
## 11 Non-profit                      
## 12 Other                           
## 13 Pharmaceutical                  
## 14 Retail                          
## 15 Technology                      
## 16 Telecommunications
```
]

```r
datascience %>% 
* distinct(Country, EmployerIndustry) %>%
  arrange(Country, EmployerIndustry)
```

```
## # A tibble: 479 × 2
##    Country   EmployerIndustry
##    <chr>     <chr>           
##  1 Argentina Government      
##  2 Argentina Mix of fields   
##  3 Argentina Other           
##  4 Argentina Retail          
##  5 Argentina Technology      
##  6 Australia Academic        
##  7 Australia CRM/Marketing   
##  8 Australia Financial       
##  9 Australia Government      
## 10 Australia Insurance       
## # … with 469 more rows
```
]
]

---

## `count` to create frequency tables

```r
# alphabetical order by default
datascience %>%
* count(EmployerIndustry)
```

```
## # A tibble: 16 × 2
##    EmployerIndustry                     n
##    <chr>                            <int>
##  1 Academic                           555
##  2 CRM/Marketing                       80
##  3 Financial                          245
##  4 Government                         161
##  5 Hospitality/Entertainment/Sports    31
##  6 Insurance                           72
##  7 Internet-based                     153
##  8 Manufacturing                       82
##  9 Military/Security                   37
## 10 Mix of fields                      220
## 11 Non-profit                          41
## 12 Other                              234
## 13 Pharmaceutical                      58
## 14 Retail                              69
## 15 Technology                         500
## 16 Telecommunications                  80
```
]
--
.pull-right[

```r
# descending frequency order
datascience %>%
  count(EmployerIndustry, 
*       sort = TRUE)
```

```
## # A tibble: 16 × 2
##    EmployerIndustry                     n
##    <chr>                            <int>
##  1 Academic                           555
##  2 Technology                         500
##  3 Financial                          245
##  4 Other                              234
##  5 Mix of fields                      220
##  6 Government                         161
##  7 Internet-based                     153
##  8 Manufacturing                       82
##  9 CRM/Marketing                       80
## 10 Telecommunications                  80
## 11 Insurance                           72
## 12 Retail                              69
## 13 Pharmaceutical                      58
## 14 Non-profit                          41
## 15 Military/Security                   37
## 16 Hospitality/Entertainment/Sports    31
```
]

---

## `count` and `arrange`

```r
# ascending frequency order
datascience  %>%
  count(EmployerIndustry) %>%
* arrange(n)
```

```
## # A tibble: 16 × 2
##    EmployerIndustry                     n
##    <chr>                            <int>
##  1 Hospitality/Entertainment/Sports    31
##  2 Military/Security                   37
##  3 Non-profit                          41
##  4 Pharmaceutical                      58
##  5 Retail                              69
##  6 Insurance                           72
##  7 CRM/Marketing                       80
##  8 Telecommunications                  80
##  9 Manufacturing                       82
## 10 Internet-based                     153
## 11 Government                         161
## 12 Mix of fields                      220
## 13 Other                              234
## 14 Financial                          245
## 15 Technology                         500
## 16 Academic                           555
```
]
.pull-right[

```r
# descending frequency order
# just like adding sort = TRUE
datascience %>%
  count(EmployerIndustry) %>%
* arrange(desc(n))
```

---

## `count` for multiple variables

```r
datascience %>%
* count(EmploymentStatus, EmployerIndustry)
```

```
## # A tibble: 47 × 3
##    EmploymentStatus   EmployerIndustry                     n
##    <chr>              <chr>                            <int>
##  1 Employed full-time Academic                           479
##  2 Employed full-time CRM/Marketing                       66
##  3 Employed full-time Financial                          222
##  4 Employed full-time Government                         141
##  5 Employed full-time Hospitality/Entertainment/Sports    26
##  6 Employed full-time Insurance                           66
##  7 Employed full-time Internet-based                     122
##  8 Employed full-time Manufacturing                       78
##  9 Employed full-time Military/Security                   29
## 10 Employed full-time Mix of fields                      180
## # … with 37 more rows
```

---

## order matters when you `count`

```r
datascience %>%
* count(Country, Age)
```

```
## # A tibble: 879 × 3
##    Country     Age     n
##    <chr>     <dbl> <int>
##  1 Argentina    24     1
##  2 Argentina    25     2
##  3 Argentina    28     1
##  4 Argentina    37     1
##  5 Argentina    38     1
##  6 Argentina    40     1
##  7 Argentina    46     1
##  8 Argentina    48     1
##  9 Argentina    50     1
## 10 Australia     0     1
## # … with 869 more rows
```
]
.pull-right[

```r
datascience %>%
* count(Age, Country)
```

```
## # A tibble: 879 × 3
##      Age Country           n
##    <dbl> <chr>         <int>
##  1     0 Australia         1
##  2     1 United States     1
##  3    19 Canada            1
##  4    19 Other             1
##  5    19 Philippines       1
##  6    20 Hungary           1
##  7    20 Indonesia         1
##  8    20 Nigeria           1
##  9    20 Russia            2
## 10    20 United States     2
## # … with 869 more rows
```
]
]

---

## Demo

---

## `mutate` to add a new variable

```r
datascience %>%
* mutate(compensation_1k = CompensationAmount/1000) %>%
  select(compensation_1k, CompensationAmount)
```

```
## # A tibble: 2,618 × 2
##    compensation_1k CompensationAmount
##              <dbl>              <dbl>
##  1             250             250000
##  2            1200            1200000
##  3            1100            1100000
##  4              20              20000
##  5             100             100000
##  6              20              20000
##  7             624             624000
##  8             126             126000
##  9             133             133000
## 10              80              80000
## # … with 2,608 more rows
```

---

```r
datascience %>%
  count(FormalEducation, EmploymentStatus) %>%
  mutate(prop = n / sum(n))
```

```
## # A tibble: 14 × 4
##    FormalEducation                                         Emplo…¹     n    prop
##    <chr>                                                   <chr>   <int>   <dbl>
##  1 Bachelor's degree                                       Employ…   631 2.41e-1
##  2 Bachelor's degree                                       Employ…    43 1.64e-2
##  3 Bachelor's degree                                       Indepe…    72 2.75e-2
##  4 Doctoral degree                                         Employ…   537 2.05e-1
##  5 Doctoral degree                                         Employ…    21 8.02e-3
##  6 Doctoral degree                                         Indepe…    34 1.30e-2
##  7 I prefer not to answer                                  Employ…     5 1.91e-3
##  8 I prefer not to answer                                  Employ…     1 3.82e-4
##  9 Master's degree                                         Employ…  1041 3.98e-1
## 10 Master's degree                                         Employ…    48 1.83e-2
## 11 Master's degree                                         Indepe…   107 4.09e-2
## 12 Some college/university study without earning a bachel… Employ…    52 1.99e-2
## 13 Some college/university study without earning a bachel… Employ…    11 4.20e-3
## 14 Some college/university study without earning a bachel… Indepe…    15 5.73e-3
## # … with abbreviated variable name ¹EmploymentStatus
```
]

---

## `summarise` for summary stats

```r
# mean age of all respondents
datascience %>%
* summarise(mean_age = mean(Age))
```

```
## # A tibble: 1 × 1
##   mean_age
##      <dbl>
## 1     34.2
```

.tip[
`summarise()` changes the data frame entirely, it collapses rows down to a single 
summary statistics, and removes all columns that are irrelevant to the calculation.
]

---

.tip[
`summarise()` also lets you get away with being sloppy and not naming your new 
column, but that's not recommended!
]

```r
datascience %>%
  summarise(mean(Age)) 
```

```
## # A tibble: 1 × 1
##   `mean(Age)`
##         <dbl>
## 1        34.2
```
]

```r
datascience %>%
  summarise(mean_age = mean(Age)) 
```

```
## # A tibble: 1 × 1
##   mean_age
##      <dbl>
## 1     34.2
```
]
]
---

## `group_by` for grouped operations

```r
# mean age for each category of employment status
datascience %>%
* group_by(EmploymentStatus) %>%
  summarise(mean_age = mean(Age))
```

```
## # A tibble: 3 × 2
##   EmploymentStatus                                     mean_age
##   <chr>                                                   <dbl>
## 1 Employed full-time                                       34.1
## 2 Employed part-time                                       28.8
## 3 Independent contractor, freelancer, or self-employed     38.2
```

---

## Calculating frequencies

The following two give the same result, so `count` is simply short for `group_by` then determine frequencies

```r
datascience %>%
  group_by(EmployerIndustry) %>%
  summarise(n = n())
```

```r
datascience %>%
  count(EmployerIndustry)
```

---

## Multiple summary statistics

`summarise` can be used for multiple summary statistics as well

```r
datascience %>%
  summarise(
    mean_age = mean(Age),
    median_age = median(Age),
    max_age = max(Age)
    )
```

```
## # A tibble: 1 × 3
##   mean_age median_age max_age
##      <dbl>      <dbl>   <dbl>
## 1     34.2         32     100
```

---

## Demo