Lab 11 - STATS 220 Data Technology

This lab exercise is due 23:59 Monday 7 June (NZST).

You should submit an R file (i.e. file extension .R) containing R code that assigns the appropriate values to the appropriate symbols.
Your R file will be executed in order and checked against the values that have been assigned to the symbols using an automatic grading system. Marks will be fully deducted for non-identical results.
Intermediate steps to achieve the final results will NOT be checked.
Each question is worth 0.2 points.
You should submit your R file on Canvas.
Late assignments are NOT accepted unless prior arrangement for medical/compassionate reasons.

In this lab exercise, you are going to work with bus status data that was collected on 2017-03-09 from Auckland Transport. You shall use the following code snippet (and include them upfront in your R file) for this lab session:

library(fs)
library(lubridate)
library(tidyverse)
csv_files <- dir_ls("data/aklbus2017", glob = "*.csv")
head(csv_files)

#> data/aklbus2017/2017-03-09-05-14.csv
#> data/aklbus2017/2017-03-09-05-24.csv
#> data/aklbus2017/2017-03-09-05-34.csv
#> data/aklbus2017/2017-03-09-05-44.csv
#> data/aklbus2017/2017-03-09-05-54.csv
#> data/aklbus2017/2017-03-09-06-04.csv

Suppose that you have created an Rproj for this course. You need to download and unzip aklbus2017.zip here to data/ under your Rproj folder.

Each file represents a snapshot of bus status information, where each snapshot was taken approximately 10 minutes apart. Within each file, each row of data gives the status of one bus. The delay is how many seconds ahead or behind schedule the bus is (a positive value is behind schedule and a negative value is ahead of schedule). The stop.id identifies which bus stop the delay applies to. The stop.sequence is the order of the bus stop within the bus route. The route identifies the bus route.

You’re required to use csv_files to import these data. NO marks will be given to this lab for using URL links or different file paths.

Question 1

Import all CSV files from the data/aklbus2017, with their unique identifiers as "path".

You should end up with a tibble called aklbus.

aklbus

#> # A tibble: 43,793 x 5
#>    path                   delay stop.id stop.sequence route           
#>    <chr>                  <dbl>   <dbl>         <dbl> <chr>           
#>  1 data/aklbus2017/2017-…   150    5050            28 15401-201702141…
#>  2 data/aklbus2017/2017-…    97    3093             6 83401-201702141…
#>  3 data/aklbus2017/2017-…    50    4810            13 98105-201702141…
#>  4 data/aklbus2017/2017-…    58    6838             5 36201-201702141…
#>  5 data/aklbus2017/2017-…  -186    2745            11 37301-201702141…
#>  6 data/aklbus2017/2017-…   183    2635            14 03301-201702141…
#>  7 data/aklbus2017/2017-…  -144    3360             4 10001-201702141…
#>  8 data/aklbus2017/2017-…  -151    2206            20 38011-201702141…
#>  9 data/aklbus2017/2017-…   -46    2419            38 38011-201702141…
#> 10 data/aklbus2017/2017-…  -513    6906            42 38012-201702141…
#> # … with 43,783 more rows

Question 2

Transform aklbus:

replace route with their first 5 digits as the route identifier
create a new column datetime that contains the recorded date-times, derived from the path
remove the path column
change the column positions as shown in aklbus_time

You should end up with a tibble called aklbus_time.

aklbus_time

#> # A tibble: 43,793 x 5
#>    datetime            delay stop.id stop.sequence route
#>    <dttm>              <dbl>   <dbl>         <dbl> <chr>
#>  1 2017-03-09 05:14:00   150    5050            28 15401
#>  2 2017-03-09 05:14:00    97    3093             6 83401
#>  3 2017-03-09 05:14:00    50    4810            13 98105
#>  4 2017-03-09 05:14:00    58    6838             5 36201
#>  5 2017-03-09 05:14:00  -186    2745            11 37301
#>  6 2017-03-09 05:14:00   183    2635            14 03301
#>  7 2017-03-09 05:14:00  -144    3360             4 10001
#>  8 2017-03-09 05:14:00  -151    2206            20 38011
#>  9 2017-03-09 05:14:00   -46    2419            38 38011
#> 10 2017-03-09 05:14:00  -513    6906            42 38012
#> # … with 43,783 more rows

mean(aklbus_time$datetime)

#> [1] "2017-03-09 13:30:23 NZDT"

Question 3

Compute the percentage of on-time bus status (as ontime_prop) for each route given that day. If the bus is ahead or behind schedule within 5 minutes, it is classified as on-time; otherwise late. Then arrange their rows in ascending order by ontime_prop.

You should end up with a tibble called aklbus_ontime.

aklbus_ontime

#> # A tibble: 1,082 x 2
#>    route ontime_prop
#>    <chr>       <dbl>
#>  1 00152           0
#>  2 00154           0
#>  3 00164           0
#>  4 00170           0
#>  5 00452           0
#>  6 00553           0
#>  7 00651           0
#>  8 00652           0
#>  9 00653           0
#> 10 00660           0
#> # … with 1,072 more rows

mean(aklbus_ontime$ontime_prop)

#> [1] 0.5356871

Question 4

Exclude these bus routes from aklbus_time, if their on-time percentages are either 0 or 1 during the day. Then split the rest of bus routes into separate tibbles in a list.

You should end up with a list of tibbles called aklbus_lst.

aklbus_lst[[220]]

#> # A tibble: 32 x 5
#>    datetime            delay stop.id stop.sequence route
#>    <dttm>              <dbl>   <dbl>         <dbl> <chr>
#>  1 2017-03-09 05:54:00     6    4935            13 06001
#>  2 2017-03-09 06:04:00   154    4890            18 06001
#>  3 2017-03-09 06:14:00    97    4872            27 06001
#>  4 2017-03-09 06:14:00  -104    4904             2 06001
#>  5 2017-03-09 06:24:00   208    4862            32 06001
#>  6 2017-03-09 06:24:00   121    4908            14 06001
#>  7 2017-03-09 06:34:00   802    4885            34 06001
#>  8 2017-03-09 06:34:00   197    4890            18 06001
#>  9 2017-03-09 06:34:00   -31    4904             2 06001
#> 10 2017-03-09 06:45:00   907    5640            42 06001
#> # … with 22 more rows

length(aklbus_lst)

#> [1] 817

Question 5

Get the dimension (the number of rows and colums) of each tibble in the aklbus_lst.

You should end up with a list of integers called aklbus_size. Each element of the list is an integer vector of length two.

head(aklbus_size)

#> [[1]]
#> [1] 64  5
#> 
#> [[2]]
#> [1] 88  5
#> 
#> [[3]]
#> [1] 12  5
#> 
#> [[4]]
#> [1] 24  5
#> 
#> [[5]]
#> [1] 5 5
#> 
#> [[6]]
#> [1] 3 5

typeof(aklbus_size[[1]])

#> [1] "integer"