Lab 09 Solution
This lab exercise is due 23:59 Monday 24 May (NZST).
- You should submit an R file (i.e. file extension .R) containing R code that assigns the appropriate values to the appropriate symbols.
- Your R file will be executed in order, and the values assigned to the symbols will be checked by an automatic grading system. Marks will be fully deducted for non-identical results.
- Intermediate steps to achieve the final results will NOT be checked.
- Each question is worth 0.2 points.
- You should submit your R file on Canvas.
- Late assignments are NOT accepted unless arranged in advance for medical or compassionate reasons.
In this lab exercise, you are going to scrape the top 50 horror films, as rated by users, from IMDb. Use the following code snippet for this lab session, and include it at the top of your R file:
library(rvest)
library(tidyverse)
link <- "https://www.imdb.com/search/title/?title_type=feature&num_votes=25000,&genres=horror&sort=user_rating,desc&view=simple&sort=user_rating"
horror <- read_html(link)
horror
#> {html_document}
#> <html xmlns:og="http://ogp.me/ns#" xmlns:fb="http://www.facebook.com/2008/fbml">
#> [1] <head>\n<meta http-equiv="Content-Type" content="text/html; cha ...
#> [2] <body id="styleguide-v2" class="fixed">\n <img heigh ...
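Every question below follows the same pattern: select nodes with a CSS selector, then pull out either their text or one of their attributes. The following is a minimal offline sketch of that pattern. The HTML fragment, the example.com URLs, and the names toy, toy_titles and toy_posters are made up for illustration and are not taken from the IMDb page.

```r
library(rvest)

# Hypothetical two-film page, written inline so no network access is needed.
toy <- minimal_html('
  <div class="col-title"><a href="/title/a">Film A</a></div>
  <div class="col-title"><a href="/title/b">Film B</a></div>
  <img class="loadlate" loadlate="https://example.com/a.jpg">
  <img class="loadlate" loadlate="https://example.com/b.jpg">
')

# html_elements() selects nodes; html_text2() returns their visible text.
toy_titles <- toy %>%
  html_elements(".col-title a") %>%
  html_text2()
toy_titles
#> [1] "Film A" "Film B"

# html_attr() returns the value of a named attribute instead of the text.
toy_posters <- toy %>%
  html_elements(".loadlate") %>%
  html_attr("loadlate")
toy_posters
```

Trying a selector on a small fragment like this first makes it easier to see whether it matches one node per film before running it against the full page.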
Question 1
Scrape the poster image URLs of the top 50 horror films.
You should end up with a character vector of length 50, called film_poster.
film_poster <- horror %>%
  html_elements(".loadlate") %>%
  html_attr("loadlate")
head(film_poster)
#> [1] "https://m.media-amazon.com/images/M/MV5BNTQwNDM1YzItNDAxZC00NWY2LTk0M2UtNDIwNWI5OGUyNWUxXkEyXkFqcGdeQXVyNzkwMjQ5NzM@._V1_UX34_CR0,0,34,50_AL_.jpg"
#> [2] "https://m.media-amazon.com/images/M/MV5BZWFlYmY2MGEtZjVkYS00YzU4LTg0YjQtYzY1ZGE3NTA5NGQxXkEyXkFqcGdeQXVyMTQxNzMzNDI@._V1_UX34_CR0,0,34,50_AL_.jpg"
#> [3] "https://m.media-amazon.com/images/M/MV5BMmQ2MmU3NzktZjAxOC00ZDZhLTk4YzEtMDMyMzcxY2IwMDAyXkEyXkFqcGdeQXVyNzkwMjQ5NzM@._V1_UX34_CR0,0,34,50_AL_.jpg"
#> [4] "https://m.media-amazon.com/images/M/MV5BYmQxNmU4ZjgtYzE5Mi00ZDlhLTlhOTctMzJkNjk2ZGUyZGEwXkEyXkFqcGdeQXVyMzgxMDA0Nzk@._V1_UY50_CR0,0,34,50_AL_.jpg"
#> [5] "https://m.media-amazon.com/images/M/MV5BNDkxMzk2ODU4N15BMl5BanBnXkFtZTgwNTM4NjIzMjE@._V1_UY50_CR0,0,34,50_AL_.jpg"
#> [6] "https://m.media-amazon.com/images/M/MV5BNGViZWZmM2EtNGYzZi00ZDAyLTk3ODMtNzIyZTBjN2Y1NmM1XkEyXkFqcGdeQXVyNTAyODkwOQ@@._V1_UX34_CR0,0,34,50_AL_.jpg"
Question 2
Scrape the titles of the top 50 horror films.
You should end up with a character vector of length 50, called movie.
movie <- horror %>%
  html_elements(".col-title a") %>%
  html_text2()
head(movie)
#> [1] "Psycho" "The Shining" "Alien"
#> [4] "Tumbbad" "The Blue Elephant" "The Thing"
Question 3
Scrape the release years of the top 50 horror films.
You should end up with a double vector of length 50, called year.
HINTS
- You may find one of {readr}’s parse_*() functions useful for extracting numbers.
year <- horror %>%
  html_elements(".text-muted") %>%
  html_text2() %>%
  parse_number()
year
#> [1] 1960 1980 1979 2018 2014 1982 1962 1955 1920 1973 1968 2008 2004
#> [14] 1978 1968 1933 1932 1922 2010 1961 1935 1931 2017 2014 2000 1987
#> [27] 1978 1965 1963 1960 1960 1956 1933 2018 2016 2011 2009 2004 2002
#> [40] 2001 1986 1975 1954 2019 2018 2010 2013 2007 1994 1992
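The solution above relies on parse_number() to strip the non-numeric characters around each year. The sketch below shows that behaviour on a few strings. The inputs and the names raw_years, raw_ratings, parsed_years and parsed_ratings are hypothetical stand-ins, assumed to resemble the scraped text rather than copied from it.

```r
library(readr)

# Hypothetical strings shaped like scraped year and rating text.
raw_years <- c("(1960)", "(1980)", "(2018)")
raw_ratings <- c("8.5", " 8.4 ")

# parse_number() keeps the first number it finds and drops the rest.
parsed_years <- parse_number(raw_years)
parsed_years
#> [1] 1960 1980 2018

parsed_ratings <- parse_number(raw_ratings)
parsed_ratings
#> [1] 8.5 8.4

# The result is a double vector, which is the type the question asks for.
typeof(parsed_years)
#> [1] "double"
```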
Question 4
Scrape the user ratings of the top 50 horror films.
You should end up with a double vector of length 50, called rating.
rating <- horror %>%
  html_elements(".col-imdb-rating") %>%
  html_text2() %>%
  parse_number()
rating
#> [1] 8.5 8.4 8.4 8.3 8.2 8.1 8.1 8.1 8.1 8.0 8.0 7.9 7.9 7.9 7.9 7.9
#> [17] 7.9 7.9 7.8 7.8 7.8 7.8 7.7 7.7 7.7 7.7 7.7 7.7 7.7 7.7 7.7 7.7
#> [33] 7.7 7.6 7.6 7.6 7.6 7.6 7.6 7.6 7.6 7.6 7.6 7.5 7.5 7.5 7.5 7.5
#> [49] 7.5 7.5
Question 5
Create a tibble that contains the scraped information about these films, ordered by rank. The column names are Rank, Poster, Movie, Year, and Rating, respectively.
You should end up with a tibble, called top50_horror.
NOTE: the Rank column is of type integer.
top50_horror <- tibble(
  Rank = seq_along(film_poster),
  Poster = film_poster,
  Movie = movie,
  Year = year,
  Rating = rating
)
top50_horror
#> # A tibble: 50 x 5
#> Rank Poster Movie Year Rating
#> <int> <chr> <chr> <dbl> <dbl>
#> 1 1 https://m.media-amazon.com/ima… Psycho 1960 8.5
#> 2 2 https://m.media-amazon.com/ima… The Shining 1980 8.4
#> 3 3 https://m.media-amazon.com/ima… Alien 1979 8.4
#> 4 4 https://m.media-amazon.com/ima… Tumbbad 2018 8.3
#> 5 5 https://m.media-amazon.com/ima… The Blue Elepha… 2014 8.2
#> 6 6 https://m.media-amazon.com/ima… The Thing 1982 8.1
#> 7 7 https://m.media-amazon.com/ima… What Ever Happe… 1962 8.1
#> 8 8 https://m.media-amazon.com/ima… Les diaboliques 1955 8.1
#> 9 9 https://m.media-amazon.com/ima… Das Cabinet des… 1920 8.1
#> 10 10 https://m.media-amazon.com/ima… The Exorcist 1973 8
#> # … with 40 more rows
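The note about Rank being an integer column matters because the grader checks for identical results, and in R two vectors with equal values but different storage types are not identical. The sketch below illustrates this with a short stand-in vector. The names n, rank_int and rank_dbl are illustrative only.

```r
n <- 3L  # small stand-in for the 50 films

rank_int <- seq_len(n)   # integer, like seq_along() in the solution
rank_dbl <- c(1, 2, 3)   # double, as produced by typing the numbers by hand

typeof(rank_int)
#> [1] "integer"
typeof(rank_dbl)
#> [1] "double"

# Equal in value, but not identical because the storage types differ.
all(rank_int == rank_dbl)
#> [1] TRUE
identical(rank_int, rank_dbl)
#> [1] FALSE
```

This is why the solution builds Rank with seq_along(film_poster) rather than from numeric literals.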
Question4fun (NO marks)
Turn top50_horror into a searchable, paged HTML table as follows.
library(reactable)
library(htmltools)
# Map a rating on a 0-10 scale to a hex colour between light and dark orange.
get_rating_colour <- function(rating) {
  orange_pal <- function(x)
    rgb(colorRamp(c("#fdae6b", "#d94801"))(x), maxColorValue = 255)
  normalized <- rating / 10
  orange_pal(normalized)
}
# Column definition that renders each cell value as an image.
col_img <- function() {
  colDef(
    maxWidth = 70,
    cell = function(value) { div(img(src = value)) })
}
# Narrow, centred column.
col_text <- function() {
  colDef(minWidth = 70, maxWidth = 90, align = "center")
}
# Rating column drawn in the browser as an SVG donut with the rating in the centre.
col_rate <- function() {
  colDef(
    defaultSortOrder = "desc",
    cell = JS("function(cellInfo) {
      const sliceColor = cellInfo.row['rating_colour']
      const sliceLength = 2 * Math.PI * 24
      const sliceOffset = sliceLength * (1 - cellInfo.value / 10)
      const donutChart = (
        '<svg width=60 height=60 style=\"transform: rotate(-90deg)\">' +
          '<circle cx=30 cy=30 r=24 fill=none stroke-width=4 stroke=rgba(0,0,0,0.1)></circle>' +
          '<circle cx=30 cy=30 r=24 fill=none stroke-width=4 stroke=' + sliceColor +
          ' stroke-dasharray=' + sliceLength + ' stroke-dashoffset=' + sliceOffset + '></circle>' +
        '</svg>'
      )
      const label = '<div style=\"position: absolute; top: 50%; left: 50%; ' +
        'transform: translate(-50%, -50%)\">' + cellInfo.value + '</div>'
      return '<div style=\"display: inline-flex; position: relative\">' + donutChart + label + '</div>'
    }"),
    html = TRUE,
    align = "center",
    width = 140)
}
top50_horror %>%
  mutate(
    Year = as.character(Year),
    # Helper column read by the JavaScript cell renderer; hidden from display below.
    rating_colour = get_rating_colour(Rating)
  ) %>%
  reactable(
    defaultColDef = colDef(headerStyle = list(background = "#f7f7f8")),
    defaultSorted = "Rank",
    searchable = TRUE,
    columns = list(
      Rank = col_text(),
      Poster = col_img(),
      Movie = colDef(maxWidth = 300),
      Year = col_text(),
      Rating = col_rate(),
      rating_colour = colDef(show = FALSE)
    ),
    highlight = TRUE,
    width = 690,
    theme = reactableTheme(
      cellStyle = list(
        display = "flex",
        flexDirection = "column",
        justifyContent = "center")
    ))
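The donut in the Rating column comes from two numbers computed in the JavaScript renderer: the circumference of the circle and a dash offset that hides the unfilled part. The same arithmetic is reproduced in R below as a rough check. The rating of 8.5 and the names radius, rating_value, slice_length, slice_offset and visible_arc are chosen here for illustration.

```r
radius <- 24          # matches r=24 in the SVG circle
rating_value <- 8.5   # example rating on a 0-10 scale

# Circumference of the circle, used as the dash length.
slice_length <- 2 * pi * radius

# Offset that hides the share of the circle not covered by the rating.
slice_offset <- slice_length * (1 - rating_value / 10)

# The visible arc is the remainder, so its share equals rating / 10.
visible_arc <- slice_length - slice_offset
round(visible_arc / slice_length, 2)
#> [1] 0.85
```

A rating of 10 would give an offset of 0 and a full ring, while a rating of 0 would offset the whole dash and leave only the grey background circle.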