class: center, middle, inverse, title-slide

# STATS 220
## 🚀Factors & date-times📆

---
class: middle center inverse

<img src="https://github.com/rstudio/hex-stickers/raw/master/SVG/forcats.svg" height="420px">
<img src="https://github.com/rstudio/hex-stickers/raw/master/SVG/lubridate.svg" height="420px">

???

* forcats: my fav logo
* lubridate: my fav pkg name

---

## Atomic vectors

.pull-left[
<br>
<br>
.center[<img src="img/lego-vector.png" width="100%">]
]
.pull-right[
.center[<img src="https://d33wubrfki0l68.cloudfront.net/baa19d0ebf9b97949a7ad259b29a1c4ae031c8e2/8e9b8/diagrams/vectors/summary-tree-s3-1.png" width="85%">]
]

---

.left-column[
.center[[<img src="https://raw.githubusercontent.com/rstudio/hex-stickers/master/PNG/forcats.png" width="60%">](http://forcats.tidyverse.org)]
]
.right-column[
## Coerce to factors from one type

```r
dept <- c("Physics", "Mathematics", "Statistics", "Computer Science")
dept
```
```
#> [1] "Physics" "Mathematics" "Statistics"
#> [4] "Computer Science"
```
```r
library(tidyverse) # library(forcats)
dept_fct <- as_factor(dept)
dept_fct
```
```
#> [1] Physics Mathematics Statistics
#> [4] Computer Science
*#> 4 Levels: Physics Mathematics ... Computer Science
```
]

???

* diff displays
* fct: w meta info

---

.pull-left[
```r
typeof(dept)
```
```
#> [1] "character"
```
```r
class(dept)
```
```
#> [1] "character"
```
```r
as.integer(dept)
```
```
#> [1] NA NA NA NA
```
```r
sort(dept)
```
```
#> [1] "Computer Science"
#> [2] "Mathematics"
#> [3] "Physics"
#> [4] "Statistics"
```
]
.pull-right[
```r
typeof(dept_fct)
```
```
#> [1] "integer"
```
```r
class(dept_fct)
```
```
#> [1] "factor"
```
```r
as.integer(dept_fct)
```
```
#> [1] 1 2 3 4
```
```r
sort(dept_fct)
```
```
#> [1] Physics Mathematics
#> [3] Statistics Computer Science
#> 4 Levels: Physics ... Computer Science
```
]

---

## Factors

* Factors are used to represent a **categorical variable** in R.
* A categorical variable takes its values from a fixed and known set.
* That fixed set of values is called the **levels** of the factor.

.small[
```r
dept_fct
```
```
#> [1] Physics Mathematics Statistics Computer Science
#> Levels: Physics Mathematics Statistics Computer Science
```
```r
levels(dept_fct)
```
```
#> [1] "Physics" "Mathematics" "Statistics" "Computer Science"
```
```r
rep(dept_fct, 3)
```
```
#> [1] Physics Mathematics Statistics Computer Science
#> [5] Physics Mathematics Statistics Computer Science
#> [9] Physics Mathematics Statistics Computer Science
#> Levels: Physics Mathematics Statistics Computer Science
```
]

---

## Create factors

* change the base level for modelling
* display characters in a non-alphabetical order

```r
dist_dept <- unique(dept)
factor(dept, levels = dist_dept) # in first appearance order
```
```
#> [1] Physics Mathematics Statistics Computer Science
*#> Levels: Physics Mathematics Statistics Computer Science
```
```r
factor(dept, levels = rev(dist_dept)) # in reverse order
```
```
#> [1] Physics Mathematics Statistics Computer Science
*#> Levels: Computer Science Statistics Mathematics Physics
```

???

* data values not changing, only meta info levels changed

---

## Reorder factor levels to easily perceive patterns

## .center[`sci_tbl` <i class="fas fa-table"></i>]

.pull-left[
<img src="figure/fct-vis-1.png" width="420" style="display: block; margin: auto;" />
]
.pull-right[
<img src="figure/fct-ror-1.png" width="420" style="display: block; margin: auto;" />
]

???

* in default order, takes time to process info

---

## Reorder factor levels to easily perceive patterns

## .center[`movies` <i class="fas fa-table"></i>]

.pull-left[
<img src="figure/box-movies-1.png" width="540" style="display: block; margin: auto;" />
]
.pull-right[
<img src="figure/box-movies-med-1.png" width="540" style="display: block; margin: auto;" />
]

---

## `fct_reorder()` by sorting along another variable

.pull-left[
.small[
```r
sci_tbl %>%
* mutate(dept = fct_reorder(dept, count)) %>%
  ggplot(aes(dept, count)) +
  geom_col()
```
<img src="figure/fct-ror-p2-1.png" width="420" style="display: block; margin: auto;" />
]
]
.pull-right[
.small[
```r
sci_tbl %>%
* mutate(dept = fct_reorder(dept, -count)) %>%
  ggplot(aes(dept, count)) +
  geom_col()
```
<img src="figure/fct-ror2-1.png" width="420" style="display: block; margin: auto;" />
]
]

---

## `fct_reorder()` by sorting along another variable

```r
fct_reorder(sci_tbl$dept, sci_tbl$count)
```
```
#> [1] Physics Mathematics Statistics Computer Science
*#> Levels: Mathematics Physics Statistics Computer Science
```
```r
fct_reorder(sci_tbl$dept, -sci_tbl$count)
```
```
#> [1] Physics Mathematics Statistics Computer Science
*#> Levels: Computer Science Statistics Physics Mathematics
```

---

## `fct_reorder()` by sorting along another variable with `.fun`

.pull-left[
<br>
```r
movies %>%
  mutate(
    Major_Genre = fct_reorder(
      Major_Genre, Rotten_Tomatoes_Rating,
*     .fun = median, na.rm = TRUE)) %>%
  ggplot(aes(
    Rotten_Tomatoes_Rating, Major_Genre)) +
  geom_boxplot()
```
]
.pull-right[
<img src="figure/box-movies-med3-1.png" width="540" style="display: block; margin: auto;" />
]

---

## `fct_infreq()` by counting obs with each level (largest first)

.pull-left[
<br>
<br>
<br>
```r
movies %>%
  mutate(Major_Genre = fct_infreq(
    Major_Genre)) %>%
  ggplot(aes(y = Major_Genre)) +
  geom_bar()
```
]
.pull-right[
<img src="figure/unnamed-chunk-1-1.png" width="540" style="display: block; margin: auto;" />
]

---

## `fct_lump()` by lumping together factor levels into "Other"

.pull-left[
<br>
<br>
<br>
```r
movies %>%
  mutate(Major_Genre = fct_infreq(
*   fct_lump(Major_Genre, n = 6))) %>%
  ggplot(aes(y = Major_Genre)) +
  geom_bar()
```
]
.pull-right[
<img src="figure/unnamed-chunk-2-1.png" width="540" style="display: block; margin: auto;" />
]

---

## Convert numerics to factors: UoA grade scales

.pull-left[
```r
set.seed(220)
scores_sim <- round(
  rnorm(309, mean = 70, sd = 10), digits = 2)
scores_tbl <- tibble(score = scores_sim)
scores_tbl
```
```
#> # A tibble: 309 x 1
#> score
#> <dbl>
#> 1 58.2
#> 2 80.1
#> 3 51.4
#> 4 80.5
#> 5 63.8
#> 6 51.0
#> # … with 303 more rows
```
]
.pull-right[
```r
scores_tbl %>%
  ggplot(aes(x = score)) +
  geom_histogram() +
  geom_vline(xintercept = 70, colour = "red")
```
<img src="figure/hist-1.png" width="540" style="display: block; margin: auto;" />
]

---

## `cut()` numerics to factors

```r
(rng <- c(0, seq(39, 89, by = 5), 100))
```
```
#> [1] 0 39 44 49 54 59 64 69 74 79 84 89 100
```
```r
scores_tbl %>%
  mutate(range = cut(score, breaks = rng, include.lowest = TRUE))
```
```
#> # A tibble: 309 x 2
#> score range
#> <dbl> <fct>
#> 1 58.2 (54,59]
#> 2 80.1 (79,84]
#> 3 51.4 (49,54]
#> 4 80.5 (79,84]
#> 5 63.8 (59,64]
#> 6 51.0 (49,54]
#> # … with 303 more rows
```

???

* underappreciated `cut()`, built-in function
* include `0`, but doesn't matter for this data

---

## `fct_recode()` changes factor levels by hand

```r
scores_schemes <- scores_tbl %>%
  mutate(
    range = cut(score, breaks = rng, include.lowest = TRUE),
    grade = fct_recode(range, # new_lvl = old_lvl
      "D-" = "[0,39]", "D" = "(39,44]", "D+" = "(44,49]",
      "C-" = "(49,54]", "C" = "(54,59]", "C+" = "(59,64]",
      "B-" = "(64,69]", "B" = "(69,74]", "B+" = "(74,79]",
      "A-" = "(79,84]", "A" = "(84,89]", "A+" = "(89,100]"))
scores_schemes
```
```
#> # A tibble: 309 x 3
#> score range grade
#> <dbl> <fct> <fct>
#> 1 58.2 (54,59] C
#> 2 80.1 (79,84] A-
#> 3 51.4 (49,54] C-
#> 4 80.5 (79,84] A-
#> 5 63.8 (59,64] C+
#> 6 51.0 (49,54] C-
#> # … with 303 more rows
```

???

* Here I show `fct_recode()`, manual work
* what's another way to do this quickly?

---

.pull-left[
```r
scores_schemes %>%
  ggplot(aes(x = range)) +
  geom_bar()
```
<img src="figure/bar-range-1.png" width="540" style="display: block; margin: auto;" />
]
.pull-right[
```r
scores_schemes %>%
  ggplot(aes(x = grade)) +
  geom_bar()
```
<img src="figure/bar-grade-1.png" width="540" style="display: block; margin: auto;" />
]

---
class: middle

## Your turn

> What function can we use to replace `fct_recode()` for the `scores_tbl` data?
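
---

## One possible answer: `cut()` with `labels`

A minimal sketch of one candidate answer, which may not be the one intended for the live demo: `cut()` has a `labels` argument, so the intervals can be named as they are created. The vector `grade_lbls` is a hypothetical helper, not part of the original slides, and the six scores are the first rows of `scores_tbl` shown earlier.

```r
# Break points and the first six scores, copied from the earlier slides.
rng <- c(0, seq(39, 89, by = 5), 100)
score <- c(58.2, 80.1, 51.4, 80.5, 63.8, 51.0)

# Hypothetical helper vector: one label per interval, lowest grade first.
grade_lbls <- c("D-", "D", "D+", "C-", "C", "C+",
                "B-", "B", "B+", "A-", "A", "A+")

# `labels` names the 12 intervals directly, so no recoding step is needed.
grade <- cut(score, breaks = rng, labels = grade_lbls, include.lowest = TRUE)
grade
```

The six scores map to `C`, `A-`, `C-`, `A-`, `C+`, `C-`, the same grades as in `scores_schemes`. If the highest grade should come first in a plot, `fct_rev(grade)` reverses the level order.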
???

live demo: `fct_rev()`

---

.left-column[
.center[[<img src="https://raw.githubusercontent.com/rstudio/hex-stickers/master/PNG/lubridate.png" width="60%">](http://lubridate.tidyverse.org)]
]
.right-column[
⬇️ {lubridate} is NOT part of the core {tidyverse}, so load with

```r
library(lubridate)
```

Relative and exact time units:

1. An **instant** is a specific moment in time, such as January 1st, 2012.
2. An **interval** is the span of time between two specific instants.
3. A **duration** records a time span in seconds; it has an exact length because seconds always have the same length.
4. A **period** records a time span in units larger than seconds, such as years, months, weeks, days, hours, and minutes.
]

???

* temporal data (recorded over time) is everywhere. Your phone tracks your daily life.
* instant: timestamp
* leap seconds/years, time zones, DST
* lab04, tz

---

.pull-left[
## 📆Dates

```r
(td <- today())
```
```
#> [1] "2021-03-31"
```
```r
class(td)
```
```
#> [1] "Date"
```
```r
typeof(td)
```
```
#> [1] "double"
```
```r
as.integer(td) # days since 1970-01-01
```
```
#> [1] 18717
```
]
.pull-right[
## ⌚Date-times

```r
(current <- now())
```
```
#> [1] "2021-03-31 12:22:35 NZDT"
```
```r
class(current)
```
```
#> [1] "POSIXct" "POSIXt"
```
```r
typeof(current)
```
```
#> [1] "double"
```
```r
as.integer(current) # seconds since 1970-01-01 00:00:00 UTC
```
```
#> [1] 1617146555
```
]

---

## Create date-times

```r
make_date(2021, c(3, 6), c(31, 4))
```
```
#> [1] "2021-03-31" "2021-06-04"
```
```r
make_datetime(2021, c(3, 6), c(31, 4), c(16, 10))
```
```
#> [1] "2021-03-31 16:00:00 UTC" "2021-06-04 10:00:00 UTC"
```
```r
make_datetime(2021, c(3, 6), c(31, 4), c(16, 10), tz = "Pacific/Auckland")
```
```
#> [1] "2021-03-31 16:00:00 NZDT" "2021-06-04 10:00:00 NZST"
```

---

## Available time zones (~ 600‼️)

```r
set.seed(220)
OlsonNames()[sample(1:length(OlsonNames()), 32)]
```
```
#> [1] "Pacific/Midway" "Africa/Asmera"
#> [3] "Africa/Lusaka" "ROK"
#> [5] "America/Montreal" "Europe/Dublin"
#> [7] "Asia/Irkutsk" "Africa/Cairo"
#> [9] "Asia/Dubai" "America/Yellowknife"
#> [11] "Asia/Tbilisi" "America/Menominee"
#> [13] "Atlantic/Azores" "GMT-0"
#> [15] "America/Louisville" "Europe/Astrakhan"
#> [17] "Pacific/Fakaofo" "America/Nome"
#> [19] "Etc/GMT+10" "Pacific/Efate"
#> [21] "GB-Eire" "Asia/Thimphu"
#> [23] "US/Eastern" "Europe/Busingen"
#> [25] "Australia/NSW" "America/Hermosillo"
#> [27] "MET" "Pacific/Enderbury"
#> [29] "America/Argentina/Rio_Gallegos" "Asia/Ashgabat"
#> [31] "Africa/Dakar" "Canada/Atlantic"
```

---

## Parse date-times

```r
ymd(c("2021/03/31", "2021-June-04"))
```
```
#> [1] "2021-03-31" "2021-06-04"
```
```r
ymd_h(c("2021-03-31 16", "2021-June-04 10"))
```
```
#> [1] "2021-03-31 16:00:00 UTC" "2021-06-04 10:00:00 UTC"
```
```r
(dttm <- ymd_h(c("2021-03-31 16", "2021-June-04 10"),
  tz = "Pacific/Auckland"))
```
```
#> [1] "2021-03-31 16:00:00 NZDT" "2021-06-04 10:00:00 NZST"
```

.pull-left[
* `ymd()`, `ymd_h()`, `ymd_hm()`, `ymd_hms()`
* `dmy()`, `dmy_h()`, `dmy_hm()`, `dmy_hms()`
]
.pull-right[
* `mdy()`, `mdy_h()`, `mdy_hm()`, `mdy_hms()`
]

---

## Extract components of date-times

.pull-left[
```r
date(dttm)
```
```
#> [1] "2021-03-31" "2021-06-04"
```
```r
year(dttm)
```
```
#> [1] 2021 2021
```
```r
yday(dttm)
```
```
#> [1] 90 155
```
```r
week(dttm)
```
```
#> [1] 13 23
```
]
.pull-right[
```r
day(dttm) # mday(dttm)
```
```
#> [1] 31 4
```
```r
hour(dttm)
```
```
#> [1] 16 10
```
```r
minute(dttm)
```
```
#> [1] 0 0
```
```r
second(dttm)
```
```
#> [1] 0 0
```
]

---

## Extract months/weekdays of date-times

.pull-left[
* month

```r
month(dttm)
```
```
#> [1] 3 6
```
```r
month(dttm, label = TRUE)
```
```
#> [1] Mar Jun
#> 12 Levels: Jan < Feb < Mar < ... < Dec
```
]
.pull-right[
* weekday

```r
wday(dttm, week_start = 1)
```
```
#> [1] 3 5
```
```r
wday(dttm, label = TRUE)
```
```
#> [1] Wed Fri
#> 7 Levels: Sun < Mon < Tue < ... < Sat
```
```r
wday(dttm, label = TRUE, week_start = 1)
```
```
#> [1] Wed Fri
#> 7 Levels: Mon < Tue < Wed < ... < Sun
```
]

---

## Round, floor and ceiling date-times

```r
floor_date(dttm, "3 hours")
```
```
#> [1] "2021-03-31 15:00:00 NZDT" "2021-06-04 09:00:00 NZST"
```
```r
ceiling_date(dttm, "2 days")
```
```
#> [1] "2021-04-02 NZDT" "2021-06-05 NZST"
```
```r
round_date(dttm, "1 month")
```
```
#> [1] "2021-04-01 NZDT" "2021-06-01 NZST"
```

---

## Perform accurate math on date-times

.small[
.pull-left[
```r
dttm + 1
```
```
#> [1] "2021-03-31 16:00:01 NZDT"
#> [2] "2021-06-04 10:00:01 NZST"
```
```r
dttm + minutes(2)
```
```
#> [1] "2021-03-31 16:02:00 NZDT"
#> [2] "2021-06-04 10:02:00 NZST"
```
```r
dttm + hours(3)
```
```
#> [1] "2021-03-31 19:00:00 NZDT"
#> [2] "2021-06-04 13:00:00 NZST"
```
```r
dttm + days(4)
```
```
*#> [1] "2021-04-04 16:00:00 NZST"
#> [2] "2021-06-08 10:00:00 NZST"
```
]
.pull-right[
```r
dttm + weeks(5)
```
```
#> [1] "2021-05-05 16:00:00 NZST"
#> [2] "2021-07-09 10:00:00 NZST"
```
```r
dttm + months(6)
```
```
*#> [1] NA
#> [2] "2021-12-04 10:00:00 NZDT"
```
```r
dttm + years(7)
```
```
#> [1] "2028-03-31 16:00:00 NZDT"
#> [2] "2028-06-04 10:00:00 NZST"
```
]
]

---

## Format date-times (also coerce to characters)

.pull-left[
.small[
```r
format(dttm)
```
```
#> [1] "2021-03-31 16:00:00" "2021-06-04 10:00:00"
```
```r
format(dttm, "%Y/%b/%d")
```
```
#> [1] "2021/Mar/31" "2021/Jun/04"
```
```r
format(dttm, "%y/%b/%d %H:%M:%S")
```
```
#> [1] "21/Mar/31 16:00:00" "21/Jun/04 10:00:00"
```
```r
format(dttm, "on %d %B (%a)")
```
```
#> [1] "on 31 March (Wed)" "on 04 June (Fri)"
```
]
]
.pull-right[
* `a`/`A`: Abbreviated/full weekday name.
* `b`/`B`: Abbreviated/full month name.
* `m`: Month as decimal number (01-12 or 1-12).
* `d`: Day of the month as decimal number (01-31 or 0-31).
* `w`: Weekday as decimal number (0-6, Sunday is 0).
* `y`/`Y`: Year without/with century.
* more on [`?parse_date_time()`](https://lubridate.tidyverse.org/reference/parse_date_time.html)
]

---
class: middle

.pull-left[
## 📽`movies`

.small[
```r
movies$Release_Date[c(38:39, 268)]
```
```
#> [1] "18-Oct-06" "1963-01-01" NA
```
```r
movies %>%
  mutate(
*   Release_Date = parse_date_time(
*     Release_Date, c("%d-%b-%y", "%Y-%m-%d")),
*   Year = year(Release_Date)
  ) %>%
  filter(Year < 2012) %>%
  ggplot(aes(Year, IMDB_Rating)) +
  geom_hex()
```
]
]
.pull-right[
<img src="figure/unnamed-chunk-3-1.png" width="540" style="display: block; margin: auto;" />
]

---

## Reading

.pull-left[
.center[[<img src="https://d33wubrfki0l68.cloudfront.net/b88ef926a004b0fce72b2526b0b5c4413666a4cb/24a30/cover.png" height="520px">](https://r4ds.had.co.nz)]
]
.pull-right[
* [Factors](https://r4ds.had.co.nz/factors.html)
* [Dates and times](https://r4ds.had.co.nz/dates-and-times.html)
* [{forcats} cheatsheet](https://github.com/rstudio/cheatsheets/raw/master/factors.pdf)
* [{lubridate} cheatsheet](https://rawgit.com/rstudio/cheatsheets/master/lubridate.pdf)
]
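
---

## Appendix: durations vs periods

The definitions of interval, duration and period given earlier, and the `NA` returned by `dttm + months(6)`, can be illustrated with a short sketch. The variable names below are hypothetical and not part of the original slides. The example assumes the "Pacific/Auckland" time zone is available and uses the end of New Zealand daylight saving on 4 April 2021.

```r
library(lubridate)

# 4 pm on the day before daylight saving ended in New Zealand.
start <- ymd_h("2021-04-03 16", tz = "Pacific/Auckland")

# Period: the same clock time on the next calendar day.
by_period <- start + days(1)
# Duration: exactly 86400 seconds later, one clock hour earlier that day.
by_duration <- start + ddays(1)

# Interval between two instants, measured in exact hours.
span_hours <- interval(start, by_period) / dhours(1)

# Adding months to the 31st can land on a date that does not exist.
na_date <- ymd("2021-03-31") + months(6)
# `%m+%` rolls back to the last valid day of the target month instead.
rolled <- ymd("2021-03-31") %m+% months(6)
```

Here `by_period` keeps the 16:00 clock time while `by_duration` lands at 15:00, because that calendar day is 25 hours long. `%m+%` is one way to avoid the `NA`; whether rolling back to 30 September is the right behaviour depends on the analysis.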