In this lab exercise, you will scrape the top 50 user-rated horror films from IMDb. Use the following code snippet (and include it at the top of your R file) for this lab session:

library(tidyverse)
library(rvest)

link <- ",&genres=horror&sort=user_rating,desc&view=simple&sort=user_rating"
horror <- read_html(link)
#> {html_document}
#> <html xmlns:og="" xmlns:fb="">
#> [1] <head>\n<meta http-equiv="Content-Type" content="text/html; cha ...
#> [2] <body id="styleguide-v2" class="fixed">\n            <img heigh ...

Question 1

Scrape the top 50 horror films’ posters.

You should end up with a character vector of length 50, called film_poster.
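As a small illustration of the idea (a sketch, not the lab page itself), the snippet below builds a hypothetical two-image page with minimal_html() and reads an attribute from each matched element with html_attr(). The file names and the assumption that the real image URL lives in a `loadlate` attribute are made up for the example.

```r
library(rvest)

# Hypothetical two-poster page, built inline so nothing is downloaded.
page <- minimal_html('
  <img class="loadlate" src="placeholder.png" loadlate="poster-1.jpg">
  <img class="loadlate" src="placeholder.png" loadlate="poster-2.jpg">
')

imgs <- html_elements(page, ".loadlate")

# html_attr() returns one string per matched element.
poster_src  <- html_attr(imgs, "src")       # the placeholder shown before loading
poster_real <- html_attr(imgs, "loadlate")  # the (assumed) lazily loaded URL
```

The result is always a character vector with one entry per matched node, which is why a selector matching 50 images gives a vector of length 50.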

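film_poster <- horror %>% 
  html_elements(".loadlate") %>% 
  # assumed completion: read the poster URL from the lazy-loading `loadlate` attribute
  html_attr("loadlate")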
#> [1] ",0,34,50_AL_.jpg"
#> [2] ",0,34,50_AL_.jpg"
#> [3] ",0,34,50_AL_.jpg"
#> [4] ",0,34,50_AL_.jpg"
#> [5] ",0,34,50_AL_.jpg"                                
#> [6] ",0,34,50_AL_.jpg"

Question 2

Scrape the top 50 horror films’ titles.

You should end up with a character vector of length 50, called movie.

movie <- horror %>% 
  html_elements(".col-title a") %>% 
  html_text2()  # assumed completion: keep only the link text (the film title)
#> [1] "Psycho"            "The Shining"       "Alien"            
#> [4] "Tumbbad"           "The Blue Elephant" "The Thing"

Question 3

Scrape the top 50 horror films’ release years.

You should end up with a double vector of length 50, called year.

  1. You may find one of {readr}’s parse_*() functions useful for extracting numbers.
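To make the hint concrete, here is a minimal sketch using parse_number(), one of the parse_*() functions in {readr}. The input strings are made up for illustration; the text scraped from the page may look different.

```r
library(readr)

# Hypothetical scraped text: years wrapped in parentheses.
raw_years <- c("(1960)", "(1980)", "(2018)")

# parse_number() drops the non-numeric characters around the first number
# it finds and returns a double vector.
years <- parse_number(raw_years)
typeof(years)
#> [1] "double"
```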

year <- horror %>% 
  html_elements(".text-muted") %>% 
  html_text2() %>% 
  parse_number()  # assumed completion, following the hint above: "(1960)" becomes 1960
#>  [1] 1960 1980 1979 2018 2014 1982 1962 1955 1920 1973 1968 2008 2004
#> [14] 1978 1968 1933 1932 1922 2010 1961 1935 1931 2017 2014 2000 1987
#> [27] 1978 1965 1963 1960 1960 1956 1933 2018 2016 2011 2009 2004 2002
#> [40] 2001 1986 1975 1954 2019 2018 2010 2013 2007 1994 1992

Question 4

Scrape the top 50 horror films’ user ratings.

You should end up with a double vector of length 50, called rating.

rating <- horror %>% 
  html_elements(".col-imdb-rating") %>% 
  html_text2() %>% 
  parse_number()  # assumed completion: convert the rating text to a double
#>  [1] 8.5 8.4 8.4 8.3 8.2 8.1 8.1 8.1 8.1 8.0 8.0 7.9 7.9 7.9 7.9 7.9
#> [17] 7.9 7.9 7.8 7.8 7.8 7.8 7.7 7.7 7.7 7.7 7.7 7.7 7.7 7.7 7.7 7.7
#> [33] 7.7 7.6 7.6 7.6 7.6 7.6 7.6 7.6 7.6 7.6 7.6 7.5 7.5 7.5 7.5 7.5
#> [49] 7.5 7.5

Question 5

Create a tibble that contains the scraped films’ information, ordered by rank. The columns are named Rank, Poster, Movie, Year and Rating, in that order.

You should end up with a tibble, called top50_horror.

NOTE: the Rank column must be of integer type.
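The difference between integer and double is easy to miss, so here is a short base R sketch. The vectors are made up for illustration.

```r
# Numbers typed without an L suffix are doubles.
ranks_dbl <- c(1, 2, 3)

# seq_along() always returns an integer vector.
ranks_int <- seq_along(c("a", "b", "c"))

typeof(ranks_dbl)
#> [1] "double"
typeof(ranks_int)
#> [1] "integer"
```

A tibble prints the first kind as <dbl> and the second as <int>, so the column header is a quick way to check the type.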

top50_horror <- tibble(
  Rank = seq_along(film_poster),
  Poster = film_poster,
  Movie = movie,
  Year = year,
  Rating = rating)
#> # A tibble: 50 x 5
#>     Rank Poster                          Movie             Year Rating
#>    <int> <chr>                           <chr>            <dbl>  <dbl>
#>  1     1… Psycho            1960    8.5
#>  2     2… The Shining       1980    8.4
#>  3     3… Alien             1979    8.4
#>  4     4… Tumbbad           2018    8.3
#>  5     5… The Blue Elepha…  2014    8.2
#>  6     6… The Thing         1982    8.1
#>  7     7… What Ever Happe…  1962    8.1
#>  8     8… Les diaboliques   1955    8.1
#>  9     9… Das Cabinet des…  1920    8.1
#> 10    10… The Exorcist      1973    8  
#> # … with 40 more rows

Question4fun (NO marks)

Turn top50_horror into a searchable paged HTML table as follows.


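library(reactable)
library(htmltools)

# Map a rating on the 0-10 scale to a colour on an orange gradient.
get_rating_colour <- function(rating) {
  orange_pal <- function(x) 
    rgb(colorRamp(c("#fdae6b", "#d94801"))(x), maxColorValue = 255)
  normalized <- rating / 10
  orange_pal(normalized)
}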

# Column definition that renders the poster URL as an image.
col_img <- function() {
  colDef(
    maxWidth = 70,
    cell = function(value) { div(img(src = value)) })
}

col_text <- function() {
  colDef(minWidth = 70, maxWidth = 90, align = "center")
}

# Column definition that draws the rating as a donut chart (rendered in JavaScript).
col_rate <- function() {
  colDef(
    defaultSortOrder = "desc",
    cell = JS("function(cellInfo) {
      const sliceColor = cellInfo.row['rating_colour']
      const sliceLength = 2 * Math.PI * 24
      const sliceOffset = sliceLength * (1 - cellInfo.value / 10)
      const donutChart = (
        '<svg width=60 height=60 style=\"transform: rotate(-90deg)\">' +
          '<circle cx=30 cy=30 r=24 fill=none stroke-width=4 stroke=rgba(0,0,0,0.1)></circle>' +
          '<circle cx=30 cy=30 r=24 fill=none stroke-width=4 stroke=' + sliceColor +
          ' stroke-dasharray=' + sliceLength + ' stroke-dashoffset=' + sliceOffset + '></circle>' +
        '</svg>'
      )
      const label = '<div style=\"position: absolute; top: 50%; left: 50%; ' +
        'transform: translate(-50%, -50%)\">' + cellInfo.value + '</div>'
      return '<div style=\"display: inline-flex; position: relative\">' + donutChart + label + '</div>'
    }"),
    html = TRUE,
    align = "center",
    width = 140)
}

top50_horror %>% 
  # Add the helper colour column used by the donut chart.
  mutate(
    Year = as.character(Year),
    rating_colour = get_rating_colour(Rating)
  ) %>% 
  reactable(
    defaultColDef = colDef(headerStyle = list(background = "#f7f7f8")),
    defaultSorted = "Rank",
    searchable = TRUE,
    columns = list(
      Rank = col_text(),
      Poster = col_img(),
      Movie = colDef(maxWidth = 300),
      Year = col_text(),
      Rating = col_rate(),
      rating_colour = colDef(show = FALSE)
    ),
    highlight = TRUE,
    width = 690,
    theme = reactableTheme(
      cellStyle = list(
        display = "flex", 
        flexDirection = "column", 
        justifyContent = "center")
    )
  )