Assignment 1 Solution
This assignment is due 23:59 Friday 26 March (NZDT).
- You should submit an R file (i.e. file extension
.R
) containing R code that assigns the appropriate values to the appropriate symbols. - Your R file will be executed in order and checked against the values that have been assigned to the relevant symbols using an automatic grading system. Marks will be fully deducted for non-identical results.
- Intermediate steps to achieve the final results will NOT be checked.
- Each question is worth 1 point.
- You should submit your R file on Canvas.
- Late assignments are NOT accepted unless prior arrangement for medical/compassionate reasons.
In this assignment, your are going to work with 2018 Citi Bike trip data
in New York City (2018-citibike-tripdata.csv
). The data includes:
- Trip Duration (seconds)
- Start Time and Date
- Stop Time and Date
- Start Station Name
- End Station Name
- Station ID
- Station Lat/Long
- Bike ID
- User Type (Customer = 24-hour pass or 3-day pass user; Subscriber = Annual Member)
- Gender (Zero=unknown; 1=male; 2=female)
- Year of Birth
You shall use the following packages for this assignment:
library(tidyverse)
- Make sure to include the snippet above upfront in your R file.
- DO NOT include
install.packages()
in your R file.
Suppose that you have created an Rproj
for this course. You need to
download 2018-citibike-tripdata.csv
here
to data/
under your Rproj
.
- You’re required to use relative file paths
data/2018-citibike-tripdata.csv
to import the data. - NO marks will be given for using URL links or different file paths.
- DO NOT apply any
theme()
and aesthetics other than what I asked to your plots. - DO NOT print any R objects and plots.
Question 1
Read data/2018-citibike-tripdata.csv
into R. You should end up with a
tibble called nycbikes18_raw
.
nycbikes18_raw <- read_csv("data/2018-citibike-tripdata.csv")
nycbikes18_raw
#> # A tibble: 333,687 x 15
#> tripduration starttime stoptime
#> <dbl> <dttm> <dttm>
#> 1 932 2018-01-01 07:06:17 2018-01-01 07:21:50
#> 2 550 2018-01-01 17:06:18 2018-01-01 17:15:28
#> 3 510 2018-01-01 17:06:56 2018-01-01 17:15:27
#> 4 354 2018-01-01 19:53:10 2018-01-01 19:59:05
#> 5 250 2018-01-01 22:34:30 2018-01-01 22:38:40
#> 6 613 2018-01-02 03:05:05 2018-01-02 03:15:19
#> 7 290 2018-01-02 17:13:51 2018-01-02 17:18:42
#> 8 381 2018-01-02 17:50:03 2018-01-02 17:56:24
#> 9 318 2018-01-02 18:55:58 2018-01-02 19:01:16
#> 10 1852 2018-01-02 21:55:29 2018-01-02 22:26:22
#> # … with 333,677 more rows, and 12 more variables:
#> # start_station_id <dbl>, start_station_name <chr>,
#> # start_station_latitude <dbl>, start_station_longitude <dbl>,
#> # end_station_id <dbl>, end_station_name <chr>,
#> # end_station_latitude <dbl>, end_station_longitude <dbl>,
#> # bikeid <dbl>, usertype <chr>, birth_year <dbl>, gender <dbl>
Question 2
Regarding nycbikes18_raw
, you are interested in the total number of
bike trips ridden by each age group and user type. Plot a bar chart to
address the question of interest. You should end up with a ggplot
object called p1
, with
colour = "white"
.
p1 <- nycbikes18_raw %>%
ggplot(aes(x = birth_year, fill = usertype)) +
geom_bar(colour = "white")
p1
Question 3
From the above plot p1
, it is noted that there are a few trips done by
users who were born before 1900. These users possibly don’t want to
reveal their ages. You need to remove these observations with
birth_year
greater than 1900
for the rest of the analysis. You
should end up with a tibble called nycbikes18
.
#nycbikes18 <- nycbikes18_raw[nycbikes18_raw$birth_year > 1900,]
nycbikes18 <- nycbikes18_raw %>%
filter(birth_year > 1900)
nycbikes18
#> # A tibble: 333,557 x 15
#> tripduration starttime stoptime
#> <dbl> <dttm> <dttm>
#> 1 932 2018-01-01 07:06:17 2018-01-01 07:21:50
#> 2 550 2018-01-01 17:06:18 2018-01-01 17:15:28
#> 3 510 2018-01-01 17:06:56 2018-01-01 17:15:27
#> 4 354 2018-01-01 19:53:10 2018-01-01 19:59:05
#> 5 250 2018-01-01 22:34:30 2018-01-01 22:38:40
#> 6 613 2018-01-02 03:05:05 2018-01-02 03:15:19
#> 7 290 2018-01-02 17:13:51 2018-01-02 17:18:42
#> 8 381 2018-01-02 17:50:03 2018-01-02 17:56:24
#> 9 318 2018-01-02 18:55:58 2018-01-02 19:01:16
#> 10 1852 2018-01-02 21:55:29 2018-01-02 22:26:22
#> # … with 333,547 more rows, and 12 more variables:
#> # start_station_id <dbl>, start_station_name <chr>,
#> # start_station_latitude <dbl>, start_station_longitude <dbl>,
#> # end_station_id <dbl>, end_station_name <chr>,
#> # end_station_latitude <dbl>, end_station_longitude <dbl>,
#> # bikeid <dbl>, usertype <chr>, birth_year <dbl>, gender <dbl>
‼️ You shall work with nycbikes18
for the rest of the assignment.
Question 4
Calculate the total trip durations over the year. You should end up with
a double called ttl_tripd
.
ttl_tripd <- sum(nycbikes18$tripduration)
ttl_tripd
#> [1] 226912929
Question 5
Find out the number of Citi bikes used in 2018. You should end up with
an integer called n_bikes
. (HINTS: You may find unique()
useful.)
n_bikes <- length(unique(nycbikes18$bikeid))
n_bikes
#> [1] 900
Question 6
You’d like to know if Citi bike subscribers ride more often than
one-time customers. Present a bar chart for the tallies of trips by each
usertype
. You should end up with a ggplot object called p2
.
p2 <- nycbikes18 %>%
ggplot(aes(x = usertype)) +
geom_bar()
p2
Question 7
You’re interested in riding behaviours of users of different genders
based on their user types. Produce a side-by-side bar charts to display
the tallies of trips by each gender
, grouped by usertype
. You should
end up with a ggplot object called p3
.
p3 <- nycbikes18 %>%
ggplot(aes(x = gender, fill = usertype)) +
geom_bar(position = "dodge")
p3
Question 8
Do younger users ride for longer trips? Generate a scatter plot with
birth_year
on x axis and tripduration
on y axis, faceted by
usertype
on rows and gender
on columns. You should end up with a
ggplot object named p4
with
size = 0.5
.
p4 <- nycbikes18 %>%
ggplot(aes(x = birth_year, y = tripduration)) +
geom_point(size = 0.5) +
facet_grid(vars(usertype), vars(gender))
p4
Question 9
Let’s take a look at where Citi bike stations are located. Plot the
following layered graphics: (1) overlaying points indicate all
geographical locations of start stations; (2) one more layer of points
represent all end stations, on top of the first layer. You should end up
with a ggplot object named p5
.
p5 <- nycbikes18 %>%
ggplot() +
geom_point(aes(
start_station_longitude,
start_station_latitude)) +
geom_point(aes(
end_station_longitude,
end_station_latitude))
p5
Question 10
To get a picture of how Citi bikes flow from one place to another, you
need to plot all bike trips with arrows pointing from start to end
stations. You should end up with a ggplot object named p6
, with
arrow = arrow(length = unit(0.01, "npc"))
,alpha = 0.3
.
HINTS: check out the
?geom_segment
for examples.
p6 <- nycbikes18 %>%
ggplot(aes(start_station_longitude, start_station_latitude)) +
geom_segment(aes(
xend = end_station_longitude,
yend = end_station_latitude),
arrow = arrow(length = unit(0.01, "npc")), alpha = 0.3)
p6