# STATS 220
## 🚀Factors & date-times📆

---

???

* forcats: my fav logo
* lubridate: my fav pkg name

---

## Atomic vectors

.pull-left[
 
 
.center[<img src="img/lego-vector.png" width="100%">]
]
.pull-right[
.center[<img src="https://d33wubrfki0l68.cloudfront.net/baa19d0ebf9b97949a7ad259b29a1c4ae031c8e2/8e9b8/diagrams/vectors/summary-tree-s3-1.png" width="85%">]
]

---

.left-column[
.center[[<img src="https://raw.githubusercontent.com/rstudio/hex-stickers/master/PNG/forcats.png" width="60%">](http://forcats.tidyverse.org)]
]
.right-column[
## Coerce to factors from one type

```r
dept <- c("Physics", "Mathematics", "Statistics",
 "Computer Science")
dept
```

```
#> [1] "Physics"          "Mathematics"      "Statistics"      
#> [4] "Computer Science"
```

```r
library(tidyverse) # library(forcats)
dept_fct <- as_factor(dept)
dept_fct
```

```
#> [1] Physics          Mathematics      Statistics      
#> [4] Computer Science
*#> 4 Levels: Physics Mathematics ... Computer Science
```
]

???

* character vectors and factors print differently
* a factor carries extra metadata: its levels

---

.pull-left[

```r
typeof(dept)
```

```
#> [1] "character"
```

```r
class(dept)
```

```
#> [1] "character"
```

```r
as.integer(dept)
```

```
#> [1] NA NA NA NA
```

```r
sort(dept)
```

```
#> [1] "Computer Science"
#> [2] "Mathematics"     
#> [3] "Physics"         
#> [4] "Statistics"
```
]
.pull-right[

```r
typeof(dept_fct)
```

```
#> [1] "integer"
```

```r
class(dept_fct)
```

```
#> [1] "factor"
```

```r
as.integer(dept_fct)
```

```
#> [1] 1 2 3 4
```

```r
sort(dept_fct)
```

```
#> [1] Physics          Mathematics     
#> [3] Statistics       Computer Science
#> 4 Levels: Physics ... Computer Science
```
]
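
A minimal sketch of what the comparison above implies: a factor is stored as integer codes plus a `levels` attribute. It also contrasts `as_factor()` with base `factor()`, which sorts its levels by default instead of keeping the order of first appearance.

```r
library(forcats)

dept <- c("Physics", "Mathematics", "Statistics", "Computer Science")
dept_fct <- as_factor(dept)

# Dropping the class reveals the integer codes and the levels attribute
unclass(dept_fct)

# Indexing the levels by the codes recovers the original strings
levels(dept_fct)[as.integer(dept_fct)]

# as_factor() keeps levels in order of first appearance ...
levels(as_factor(dept))
# ... whereas base factor() sorts them
levels(factor(dept))
```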

---

## Factors

* Factors are used to represent a **categorical variable** in R.
* There is a fixed and known set of possible values.
* The fixed set of values is called the **levels** of the factor.

```r
dept_fct
```

```
#> [1] Physics          Mathematics      Statistics       Computer Science
#> Levels: Physics Mathematics Statistics Computer Science
```

```r
levels(dept_fct)
```

```
#> [1] "Physics"          "Mathematics"      "Statistics"       "Computer Science"
```

```r
rep(dept_fct, 3)
```

```
#>  [1] Physics          Mathematics      Statistics       Computer Science
#>  [5] Physics          Mathematics      Statistics       Computer Science
#>  [9] Physics          Mathematics      Statistics       Computer Science
#> Levels: Physics Mathematics Statistics Computer Science
```

---

## Create factors

Set the levels explicitly to:

* change the base level for modelling
* display character values in a non-alphabetical order

```r
dist_dept <- unique(dept)
factor(dept, levels = dist_dept) # in first appearance order
```

```
#> [1] Physics          Mathematics      Statistics       Computer Science
*#> Levels: Physics Mathematics Statistics Computer Science
```

```r
factor(dept, levels = rev(dist_dept)) # in reverse order
```

```
#> [1] Physics          Mathematics      Statistics       Computer Science
*#> Levels: Computer Science Statistics Mathematics Physics
```
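
For the first bullet (changing the base level), one option is `fct_relevel()`, which moves the named level to the front and leaves the rest in place. This is a sketch only; choosing "Statistics" as the base level is an arbitrary assumption for illustration.

```r
library(forcats)

dept <- c("Physics", "Mathematics", "Statistics", "Computer Science")
dept_fct <- factor(dept, levels = unique(dept))

# Move "Statistics" to the front so it becomes the base (reference) level
stats_first <- fct_relevel(dept_fct, "Statistics")
levels(stats_first)

# Base R equivalent
levels(relevel(dept_fct, ref = "Statistics"))

# The data values themselves are unchanged
as.character(stats_first)
```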

???

* data values not changing, only meta info levels changed

---

## Reorder factor levels to easily perceive patterns
## .center[`sci_tbl` ]

.pull-left[
<img src="figure/fct-vis-1.png" width="420" style="display: block; margin: auto;" />
]
.pull-right[
<img src="figure/fct-ror-1.png" width="420" style="display: block; margin: auto;" />
]

???

* in default order, takes time to process info

---

## Reorder factor levels to easily perceive patterns
## .center[`movies` ]

.pull-left[
<img src="figure/box-movies-1.png" width="540" style="display: block; margin: auto;" />
]
.pull-right[
<img src="figure/box-movies-med-1.png" width="540" style="display: block; margin: auto;" />
]

---

## `fct_reorder()` by sorting along another variable

.pull-left[
.small[

```r
sci_tbl %>% 
* mutate(dept = fct_reorder(dept, count)) %>%
  ggplot(aes(dept, count)) +
  geom_col()
```

<img src="figure/fct-ror-p2-1.png" width="420" style="display: block; margin: auto;" />
]
]
.pull-right[
.small[

```r
sci_tbl %>% 
* mutate(dept = fct_reorder(dept, -count)) %>%
  ggplot(aes(dept, count)) +
  geom_col()
```

<img src="figure/fct-ror2-1.png" width="420" style="display: block; margin: auto;" />
]
]

---

## `fct_reorder()` by sorting along another variable

```r
fct_reorder(sci_tbl$dept, sci_tbl$count)
```

```
#> [1] Physics          Mathematics      Statistics       Computer Science
*#> Levels: Mathematics Physics Statistics Computer Science
```

```r
fct_reorder(sci_tbl$dept, -sci_tbl$count)
```

```
#> [1] Physics          Mathematics      Statistics       Computer Science
*#> Levels: Computer Science Statistics Physics Mathematics
```
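
The slides do not show how `sci_tbl` is built, so the sketch below uses a hypothetical stand-in with made-up counts, chosen only so that the level order matches the output above. It also shows `.desc = TRUE` as an alternative to negating the sorting variable.

```r
library(forcats)

# Hypothetical data: the counts are invented for illustration
dept <- c("Physics", "Mathematics", "Statistics", "Computer Science")
sci_tbl <- tibble::tibble(
  dept = factor(dept, levels = dept),
  count = c(12, 8, 66, 173)
)

# Levels sorted by increasing count
by_count <- fct_reorder(sci_tbl$dept, sci_tbl$count)
levels(by_count)

# Decreasing order, without negating the variable
by_desc <- fct_reorder(sci_tbl$dept, sci_tbl$count, .desc = TRUE)
levels(by_desc)
```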

---

## `fct_reorder()` by sorting along another variable with `.fun`

.pull-left[

```r
movies %>% 
  mutate(
    Major_Genre = fct_reorder(
      Major_Genre, 
      Rotten_Tomatoes_Rating,
*     .fun = median, na.rm = TRUE)) %>% 
  ggplot(aes(
    Rotten_Tomatoes_Rating, 
    Major_Genre)) +
  geom_boxplot()
```
]
.pull-right[
<img src="figure/box-movies-med3-1.png" width="540" style="display: block; margin: auto;" />
]
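
A small sketch of how `.fun` works: `fct_reorder()` applies the summary function to the values in each level and then sorts the levels by the results. The `genre` and `rating` vectors are made up for illustration and are not taken from `movies`. Depending on the installed forcats version, a warning about missing values may appear.

```r
library(forcats)

# Hypothetical data with one missing rating
genre  <- c("Drama", "Drama", "Comedy", "Comedy", "Horror", "Horror")
rating <- c(80, NA, 60, 70, 30, 50)

# Medians per level: Horror 40, Comedy 65, Drama 80 (NA dropped by na.rm)
by_median <- fct_reorder(genre, rating, .fun = median, na.rm = TRUE)
levels(by_median)
```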

---

## `fct_infreq()` by counting observations in each level (largest first)

.pull-left[

```r
movies %>% 
  mutate(Major_Genre = fct_infreq(
    Major_Genre)) %>% 
  ggplot(aes(y = Major_Genre)) +
  geom_bar()
```
]
.pull-right[
<img src="figure/unnamed-chunk-1-1.png" width="540" style="display: block; margin: auto;" />
]

---

## `fct_lump()` by lumping together factor levels into "Other"

.pull-left[

```r
movies %>% 
  mutate(Major_Genre = fct_infreq(
*   fct_lump(Major_Genre, n = 6))) %>% 
  ggplot(aes(y = Major_Genre)) +
  geom_bar()
```
]
.pull-right[
<img src="figure/unnamed-chunk-2-1.png" width="540" style="display: block; margin: auto;" />
]
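
To see what the two functions do without a plot, here is a sketch on a small made-up `genre` vector (not the `movies` data). In forcats 0.5.0 and later, `fct_lump_n()` is the more specific spelling of `fct_lump(n = )`.

```r
library(forcats)

# Hypothetical data: 5 Drama, 4 Comedy, 3 Action, 1 Horror, 1 Western
genre <- c(rep("Drama", 5), rep("Comedy", 4), rep("Action", 3),
           "Horror", "Western")

# Keep the 3 most frequent levels, lump the rest into "Other"
lumped <- fct_lump(genre, n = 3)
levels(lumped)

# Then order the levels by frequency, largest first
ordered_lvls <- fct_infreq(lumped)
fct_count(ordered_lvls)
```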

---

## Convert numerics to factors: UoA grade scales

.pull-left[

```r
set.seed(220)
scores_sim <- round(
  rnorm(309, mean = 70, sd = 10),
  digits = 2)
scores_tbl <- tibble(score = scores_sim)
scores_tbl
```

```
#> # A tibble: 309 x 1
#>   score
#>   <dbl>
#> 1  58.2
#> 2  80.1
#> 3  51.4
#> 4  80.5
#> 5  63.8
#> 6  51.0
#> # … with 303 more rows
```
]
.pull-right[

```r
scores_tbl %>% 
  ggplot(aes(x = score)) +
  geom_histogram() +
  geom_vline(xintercept = 70, 
    colour = "red")
```

<img src="figure/hist-1.png" width="540" style="display: block; margin: auto;" />
]

---

## `cut()` numerics to factors

```r
(rng <- c(0, seq(39, 89, by = 5), 100))
```

```
#>  [1]   0  39  44  49  54  59  64  69  74  79  84  89 100
```

```r
scores_tbl %>% 
  mutate(range = cut(score, breaks = rng, include.lowest = TRUE))
```

```
#> # A tibble: 309 x 2
#>   score range  
#>   <dbl> <fct>  
#> 1  58.2 (54,59]
#> 2  80.1 (79,84]
#> 3  51.4 (49,54]
#> 4  80.5 (79,84]
#> 5  63.8 (59,64]
#> 6  51.0 (49,54]
#> # … with 303 more rows
```
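
The interval labels follow from two `cut()` defaults: intervals are closed on the right (`right = TRUE`), and the lowest break is excluded unless `include.lowest = TRUE`. A quick check on a few boundary scores, which are chosen here for illustration:

```r
rng <- c(0, seq(39, 89, by = 5), 100)

# Boundary values: 39 belongs to the first interval, 39.01 to the second
edge <- cut(c(0, 39, 39.01, 100), breaks = rng, include.lowest = TRUE)
as.character(edge)

# Without include.lowest = TRUE, a score of exactly 0 falls outside (0,39]
cut(0, breaks = rng)
```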

???

* underappreciated `cut()`, built-in function
* include `0`, but doesn't matter for this data

---

## `fct_recode()` changes factor levels by hand

```r
scores_schemes <- scores_tbl %>% 
  mutate(
    range = cut(score, breaks = rng, include.lowest = TRUE),
    grade = fct_recode(range, # new_lvl = old_lvl
      "D-" = "[0,39]",  "D" = "(39,44]", "D+" = "(44,49]",
      "C-" = "(49,54]", "C" = "(54,59]", "C+" = "(59,64]",
      "B-" = "(64,69]", "B" = "(69,74]", "B+" = "(74,79]",
      "A-" = "(79,84]", "A" = "(84,89]", "A+" = "(89,100]"))
scores_schemes
```

```
#> # A tibble: 309 x 3
#>   score range   grade
#>   <dbl> <fct>   <fct>
#> 1  58.2 (54,59] C    
#> 2  80.1 (79,84] A-   
#> 3  51.4 (49,54] C-   
#> 4  80.5 (79,84] A-   
#> 5  63.8 (59,64] C+   
#> 6  51.0 (49,54] C-   
#> # … with 303 more rows
```

???

* `fct_recode()` as shown here is manual work
* what's another way to do this quickly?

---

.pull-left[

```r
scores_schemes %>% 
  ggplot(aes(x = range)) +
  geom_bar()
```

<img src="figure/bar-range-1.png" width="540" style="display: block; margin: auto;" />
]
.pull-right[

```r
scores_schemes %>% 
  ggplot(aes(x = grade)) +
  geom_bar()
```

<img src="figure/bar-grade-1.png" width="540" style="display: block; margin: auto;" />
]

---

## Your turn

> What function can we use to replace `fct_recode()` for the `scores_tbl` data?

???

live demo: `fct_rev()`

---

.left-column[
.center[[<img src="https://raw.githubusercontent.com/rstudio/hex-stickers/master/PNG/lubridate.png" width="60%">](http://lubridate.tidyverse.org)]
]
.right-column[
⬇️ {lubridate} joined the core {tidyverse} only in tidyverse 2.0.0; with earlier versions, load it separately with

```r
library(lubridate)
```
Relative and exact time units:
1. An **instant** is a specific moment in time, such as January 1st, 2012.
2. An **interval** is a period of time that occurs between two specific instants.
3. A **duration** records a time span in seconds. It has an exact length, since seconds always have the same length.
4. A **period** records a time span in units larger than seconds, such as years, months, weeks, days, hours, and minutes.
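
The four concepts can be sketched with two instants that straddle the end of daylight saving in Auckland (4 April 2021). The chosen timestamps are an assumption for illustration; the point is that the duration counts 25 real hours while the period counts one calendar day.

```r
library(lubridate)

# Two instants, one on each side of the clock change
start <- ymd_hms("2021-04-03 12:00:00", tz = "Pacific/Auckland")
end   <- ymd_hms("2021-04-04 12:00:00", tz = "Pacific/Auckland")

# An interval is anchored to its two instants
span <- interval(start, end)

as.duration(span)  # exact length in seconds (25 hours)
as.period(span)    # clock-time length (1 day)

# Adding a duration vs a period to the same instant
start + ddays(1)   # 24 hours later: 11:00 on the clock
start + days(1)    # same clock time on the next day: 12:00
```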
]

???

* temporal data (recorded over time) is everywhere. Your phone tracks your daily life.
* instant: timestamp
* leap seconds/years, time zones, DST
* lab04, tz

---

.pull-left[
## Dates

```r
(td <- today())
```

```
#> [1] "2021-03-31"
```

```r
class(td)
```

```
#> [1] "Date"
```

```r
typeof(td)
```

```
#> [1] "double"
```

```r
as.integer(td) # days since 1970-01-01
```

```
#> [1] 18717
```
]
.pull-right[
## ⌚Date-times

```r
(current <- now())
```

```
#> [1] "2021-03-31 12:22:35 NZDT"
```

```r
class(current)
```

```
#> [1] "POSIXct" "POSIXt"
```

```r
typeof(current)
```

```
#> [1] "double"
```

```r
as.integer(current) # seconds since 1970-01-01 00:00:00 UTC
```

```
#> [1] 1617146555
```
]
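
Going the other way makes the storage explicit: the integers printed above are counts from the Unix epoch, and `as_date()` / `as_datetime()` turn them back into a date and a date-time. A short sketch using those two numbers:

```r
library(lubridate)

# 18717 days after 1970-01-01
d <- as_date(18717)
d

# 1617146555 seconds after 1970-01-01 00:00:00 UTC, shown in Auckland time
dt <- as_datetime(1617146555, tz = "Pacific/Auckland")
dt
```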

---

## Create date-times

```r
make_date(2021, c(3, 6), c(31, 4))
```

```
#> [1] "2021-03-31" "2021-06-04"
```

```r
make_datetime(2021, c(3, 6), c(31, 4), c(16, 10))
```

```
#> [1] "2021-03-31 16:00:00 UTC" "2021-06-04 10:00:00 UTC"
```

```r
make_datetime(2021, c(3, 6), c(31, 4), c(16, 10), tz = "Pacific/Auckland")
```

```
#> [1] "2021-03-31 16:00:00 NZDT" "2021-06-04 10:00:00 NZST"
```

---

## Available time zones (~ 600‼️)

```r
set.seed(220)
sample(OlsonNames(), 32)
```

```
#>  [1] "Pacific/Midway"                 "Africa/Asmera"                 
#>  [3] "Africa/Lusaka"                  "ROK"                           
#>  [5] "America/Montreal"               "Europe/Dublin"                 
#>  [7] "Asia/Irkutsk"                   "Africa/Cairo"                  
#>  [9] "Asia/Dubai"                     "America/Yellowknife"           
#> [11] "Asia/Tbilisi"                   "America/Menominee"             
#> [13] "Atlantic/Azores"                "GMT-0"                         
#> [15] "America/Louisville"             "Europe/Astrakhan"              
#> [17] "Pacific/Fakaofo"                "America/Nome"                  
#> [19] "Etc/GMT+10"                     "Pacific/Efate"                 
#> [21] "GB-Eire"                        "Asia/Thimphu"                  
#> [23] "US/Eastern"                     "Europe/Busingen"               
#> [25] "Australia/NSW"                  "America/Hermosillo"            
#> [27] "MET"                            "Pacific/Enderbury"             
#> [29] "America/Argentina/Rio_Gallegos" "Asia/Ashgabat"                 
#> [31] "Africa/Dakar"                   "Canada/Atlantic"
```
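
Once a time zone name is known, lubridate offers two different conversions. The sketch below, using the Auckland example from the previous slide, contrasts `with_tz()` (same instant, different clock) with `force_tz()` (same clock, different instant).

```r
library(lubridate)

x <- make_datetime(2021, 3, 31, 16, tz = "Pacific/Auckland")

# Same moment in time, displayed in UTC (NZDT is UTC+13)
in_utc <- with_tz(x, "UTC")
in_utc

# Same clock reading, relabelled as UTC: a different moment in time
relabelled <- force_tz(x, "UTC")
relabelled

# Time zone names should come from OlsonNames()
"Pacific/Auckland" %in% OlsonNames()
```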

---

## Parse date-times

```r
ymd(c("2021/03/31", "2021-June-04"))
```

```
#> [1] "2021-03-31" "2021-06-04"
```

```r
ymd_h(c("2021-03-31 16", "2021-June-04 10"))
```

```
#> [1] "2021-03-31 16:00:00 UTC" "2021-06-04 10:00:00 UTC"
```

```r
(dttm <- ymd_h(c("2021-03-31 16", "2021-June-04 10"), tz = "Pacific/Auckland"))
```

```
#> [1] "2021-03-31 16:00:00 NZDT" "2021-06-04 10:00:00 NZST"
```
.pull-left[
* `ymd()`, `ymd_h()`, `ymd_hm()`, `ymd_hms()`
* `dmy()`, `dmy_h()`, `dmy_hm()`, `dmy_hms()`
]
.pull-right[
* `mdy()`, `mdy_h()`, `mdy_hm()`, `mdy_hms()`
]
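
The function name spells out the order of the components in the input. As a rough sketch, the same date written three ways parses to the same value; the input strings are made up for illustration, and English month names are assumed.

```r
library(lubridate)

target <- ymd("2021-03-31")

from_dmy <- dmy("31/03/2021")       # day, month, year
from_mdy <- mdy("March 31, 2021")   # month, day, year
from_num <- ymd(20210331)           # unquoted numbers also work

c(from_dmy, from_mdy, from_num)
```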

---

## Extract components of date-times

.pull-left[

```r
date(dttm)
```

```
#> [1] "2021-03-31" "2021-06-04"
```

```r
year(dttm)
```

```
#> [1] 2021 2021
```

```r
yday(dttm)
```

```
#> [1]  90 155
```

```r
week(dttm)
```

```
#> [1] 13 23
```
]
.pull-right[

```r
day(dttm) # mday(dttm)
```

```
#> [1] 31  4
```

```r
hour(dttm)
```

```
#> [1] 16 10
```

```r
minute(dttm)
```

```
#> [1] 0 0
```

```r
second(dttm)
```

```
#> [1] 0 0
```
]

---

## Extract months/weekdays of date-times

.pull-left[
* month

```r
month(dttm)
```

```
#> [1] 3 6
```

```r
month(dttm, label = TRUE)
```

```
#> [1] Mar Jun
#> 12 Levels: Jan < Feb < Mar < ... < Dec
```
]
.pull-right[
* weekday

```r
wday(dttm, week_start = 1)
```

```
#> [1] 3 5
```

```r
wday(dttm, label = TRUE)
```

```
#> [1] Wed Fri
#> 7 Levels: Sun < Mon < Tue < ... < Sat
```

```r
wday(dttm, label = TRUE, week_start = 1)
```

```
#> [1] Wed Fri
#> 7 Levels: Mon < Tue < Wed < ... < Sun
```
]

---

## Round, floor and ceiling date-times

```r
floor_date(dttm, "3 hours")
```

```
#> [1] "2021-03-31 15:00:00 NZDT" "2021-06-04 09:00:00 NZST"
```

```r
ceiling_date(dttm, "2 days")
```

```
#> [1] "2021-04-02 NZDT" "2021-06-05 NZST"
```

```r
round_date(dttm, "1 month")
```

```
#> [1] "2021-04-01 NZDT" "2021-06-01 NZST"
```
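
A side-by-side sketch of the three functions on the same unit may make the difference clearer. It reuses `dttm` from the parsing slide; the `format()` calls are only there to print the dates compactly.

```r
library(lubridate)

dttm <- ymd_h(c("2021-03-31 16", "2021-June-04 10"), tz = "Pacific/Auckland")

down    <- floor_date(dttm, "month")    # start of the current month
up      <- ceiling_date(dttm, "month")  # start of the next month
nearest <- round_date(dttm, "month")    # whichever of the two is closer

format(down, "%Y-%m-%d")
format(up, "%Y-%m-%d")
format(nearest, "%Y-%m-%d")
```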

---

## Perform accurate math on date-times

.pull-left[

```r
dttm + 1
```

```
#> [1] "2021-03-31 16:00:01 NZDT"
#> [2] "2021-06-04 10:00:01 NZST"
```

```r
dttm + minutes(2)
```

```
#> [1] "2021-03-31 16:02:00 NZDT"
#> [2] "2021-06-04 10:02:00 NZST"
```

```r
dttm + hours(3)
```

```
#> [1] "2021-03-31 19:00:00 NZDT"
#> [2] "2021-06-04 13:00:00 NZST"
```

```r
dttm + days(4)
```

```
*#> [1] "2021-04-04 16:00:00 NZST"
#> [2] "2021-06-08 10:00:00 NZST"
```
]
.pull-right[

```r
dttm + weeks(5)
```

```
#> [1] "2021-05-05 16:00:00 NZST"
#> [2] "2021-07-09 10:00:00 NZST"
```

```r
dttm + months(6)
```

```
*#> [1] NA                        
#> [2] "2021-12-04 10:00:00 NZDT"
```

```r
dttm + years(7)
```

```
#> [1] "2028-03-31 16:00:00 NZDT"
#> [2] "2028-06-04 10:00:00 NZST"
```
]
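
Two of the highlighted results deserve a closer look. Adding `months(6)` to 31 March gives `NA` because 31 September does not exist, and adding `days(4)` keeps the clock time across the daylight saving change. The sketch below shows lubridate's `%m+%` operator, which rolls back to the last valid day of the month, and `ddays()`, which adds an exact number of seconds instead.

```r
library(lubridate)

dttm <- ymd_h(c("2021-03-31 16", "2021-June-04 10"), tz = "Pacific/Auckland")

# Plain addition: NA for the impossible date 2021-09-31
plain <- dttm + months(6)

# %m+% rolls back to the last day of September instead
rolled <- dttm %m+% months(6)
rolled

# Period: same clock time four calendar days later
by_period <- dttm + days(4)
# Duration: exactly 4 * 86400 seconds later, so the clock reads one hour
# earlier for the first element, which crosses the end of daylight saving
by_duration <- dttm + ddays(4)
by_duration
```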

---

## Format date-times (also coerce to characters)

.pull-left[

```r
format(dttm)
```

```
#> [1] "2021-03-31 16:00:00" "2021-06-04 10:00:00"
```

```r
format(dttm, "%Y/%b/%d")
```

```
#> [1] "2021/Mar/31" "2021/Jun/04"
```

```r
format(dttm, "%y/%b/%d %H:%M:%S")
```

```
#> [1] "21/Mar/31 16:00:00" "21/Jun/04 10:00:00"
```

```r
format(dttm, "on %d %B (%a)")
```

```
#> [1] "on 31 March (Wed)" "on 04 June (Fri)"
```
]
.pull-right[
* `a`/`A`: Abbreviated/full weekday name.
* `b`/`B`: Abbreviated/full month name.
* `m`: Month as decimal number (01-12 or 1-12).
* `d`: Day of the month as decimal number (01-31 or 1-31).
* `w`: Weekday as decimal number (0-6, Sunday is 0).
* `y`/`Y`: Year without/with century.
* more on [`?parse_date_time()`](https://lubridate.tidyverse.org/reference/parse_date_time.html)
]

---

.pull-left[

```r
movies$Release_Date[c(38:39, 268)]
```

```
#> [1] "18-Oct-06"  "1963-01-01" NA
```

```r
movies %>% 
  mutate(
*   Release_Date = parse_date_time(
*     Release_Date, c("%d-%b-%y", "%Y-%m-%d")),
*   Year = year(Release_Date)
  ) %>% 
  filter(Year < 2012) %>% 
  ggplot(aes(Year, IMDB_Rating)) +
  geom_hex()
```
]
.pull-right[
<img src="figure/unnamed-chunk-3-1.png" width="540" style="display: block; margin: auto;" />
]
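
To check the parsing step on its own, the three `Release_Date` values printed above can be fed straight to `parse_date_time()`. This is a sketch on those three strings only, not on the full `movies` data; each value is matched against the supplied orders, and missing input stays missing.

```r
library(lubridate)

release <- c("18-Oct-06", "1963-01-01", NA)

# Try day-month-year first, then year-month-day
parsed <- parse_date_time(release, orders = c("%d-%b-%y", "%Y-%m-%d"))
parsed

year(parsed)
```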

---

## Reading

.pull-left[
.center[[<img src="https://d33wubrfki0l68.cloudfront.net/b88ef926a004b0fce72b2526b0b5c4413666a4cb/24a30/cover.png" height="520px">](https://r4ds.had.co.nz)]
]
.pull-right[
* [Factors](https://r4ds.had.co.nz/factors.html)
* [Dates and times](https://r4ds.had.co.nz/dates-and-times.html)
* [{forcats} cheatsheet](https://github.com/rstudio/cheatsheets/raw/master/factors.pdf)
* [{lubridate} cheatsheet](https://rawgit.com/rstudio/cheatsheets/master/lubridate.pdf)
]