---
title: "STATS 220"
subtitle: "Web scraping`r emo::ji('globe_with_meridians')`"
type: "lecture"
date: ""
output:
  xaringan::moon_reader:
    css: ["assets/remark.css"]
    lib_dir: libs
    nature:
      ratio: 16:9
      highlightStyle: github
      highlightLines: true
      countIncrementalSlides: false
---

class: inverse middle

```{r initial, echo = FALSE, cache = FALSE, results = 'hide'}
library(knitr)
options(htmltools.dir.version = FALSE, htmltools.preserve.raw = FALSE,
  tibble.width = 60, tibble.print_min = 6)
opts_chunk$set(
  echo = TRUE, warning = FALSE, message = FALSE, comment = "#>",
  fig.path = 'figure/', cache.path = 'cache/', cache = FALSE,
  fig.retina = 3, fig.align = 'center', fig.width = 4.5, fig.height = 4,
  fig.show = 'hold', dpi = 120
)
```

```{r xaringan-panelset, echo = FALSE}
xaringanExtra::use_panelset()
```

```{r external, include = FALSE, cache = FALSE}
read_chunk('R/09-web-scrape.R')
```

## Web technology

.footnote[I thank [Dr Emi Tanaka](https://emitanaka.org/about.html) for this part, adapted from her "Communicating with Data" course.]

---

class: middle

## World Wide Web (WWW)

WWW (or the **Web**) is the information system where documents (web pages) are identified by Uniform Resource Locators (**URL**s).

A web page consists of:

* **HTML** provides the basic structure of the web page
* **CSS** controls the look of the web page (optional)
* **JS** is a programming language that can modify the behaviour of elements of the web page (optional)

---

## Hypertext Markup Language (HTML)

* with the extension `.html`.
* rendered using a web browser via a URL.
* text files that follow a special syntax that tells web browsers how to render them.

.pull-left[
.center[**via a web browser** ]
]

.pull-right[
.center[**via a text editor** ]
]

---

## HTML structure

```html
STATS 220 Data Technology

I'm a first level header

This is a paragraph.

``` ??? * servr::httd() to serve * HTML: hier str: elements (``) and optional attributes, and contents * > 100 elements: each html page must have `` and ``. (rich format -> md) * block tags: h1, p * inline tags: bold a --- ## HTML syntax .center[`Author content` Author content]

* start tag: `<span style="color:blue;">`
* end tag: `</span>`
* content: `Author content`
* element name: `span`
* attribute: `style="color:blue;"`
* attribute name: `style`
* attribute value: `color:blue;`

.center[Not all HTML tags have an end tag:] .center[ `` ] --- ## HTML elements
block element:<div>content</div>
inline element:<span>content</span>
paragraph:<p>content</p>
header level 1:<h1>content</h1>
header level 2:<h2>content</h2>
italic:<i>content</i>
emphasised text:<em>content</em>
strong importance:<strong>content</strong>
link:<a href="https://stats220.earo.me/">content</a>
unordered list:<ul>
<li>item 1</li>
<li>item 2</li>
</ul>
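
The elements listed above can also be read back from R. The following is only a minimal sketch, assuming {rvest} `>= v1.0.0` (the version used in the web scraping part of these slides); the object names `page`, `link` and `items` are made up for illustration.

```r
library(rvest)

# Build a small in-memory page from two of the elements listed above.
page <- minimal_html('
  <a href="https://stats220.earo.me/">content</a>
  <ul>
    <li>item 1</li>
    <li>item 2</li>
  </ul>
')

# The link: element name, attribute value, and content.
link <- html_element(page, "a")
html_name(link)
html_attr(link, "href")
html_text2(link)

# The list items: one text value per <li>.
items <- html_elements(page, "li")
html_text2(items)
```

`html_element()` returns the first match only, whereas `html_elements()` returns every match, which is why the `<li>` items use the plural form.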
???

How these are rendered in the browser depends on the browser's default style values, the `style` attribute, or CSS...

---

## Cascading Style Sheets (CSS)

* with the extension `.css`
* 3 ways to style elements in HTML:
  + **inline** by using the `style` attribute inside the HTML start tag:
<h1 style="color:blue;">Blue Header</h1>
  + **externally** by using the `<link>` element:
<link rel="stylesheet" href="styles.css">
+ **internally** by defining within ` ``` By convention, the `

This is a header

``` ]

This is a header

* selector: `h1`
* property: `color: blue;`
* property name: `color`
* property value: `blue`
.pull-left[ You may have multiple properties for a single selector.`r emo::ji("arrow_right")` ] .pull-right[ ```css h1 { color: blue; font-size: 16pt; } ``` ] --- ## CSS properties .center[ ```html
Sample text
``` ]
background color: div { background-color: yellow; }
Sample text
text color: div { color: purple; }
Sample text
border: div { border: 1px dashed brown; }
Sample text
left border only: div { border-left: 10px solid pink; }
Sample text
text size: div { font-size: 10pt; }
Sample text
padding: div { background-color: yellow;
    padding: 10px; }
Sample text
margin: div { background-color: yellow;
    margin: 10px; }
Sample text
--- ## CSS properties .center[ ```html
Sample text
``` ]
center align text: div { background-color: yellow;
    padding-top: 20px;
    text-align: center; }
Sample text
font family: div { font-family: Marker Felt, times; }
Sample text
strike: div { text-decoration: line-through; }
Sample text
underline: div { text-decoration: underline; }
Sample text
opacity: div { opacity: 0.3 }
Sample text
--- count: false .pull-left[ ## CSS selector
*   selects all elements
div  selects all <div> elements
div, p  selects all <div> and <p> elements
div p  selects all <p> within <div>
div > p  selects all <p> one level deep in <div>
div + p  selects all <p> immediately after a <div>
div ~ p  selects all <p> preceded by a <div>
] .pull-right[
<h1>This is a sample html</h1>

<blockquote>
<p>Maybe stories are just data with a soul.</p>
<footer>—Brene Brown</footer>
</blockquote>

<div id="p1" class="parent">
Hmm
<p>Hi!</p>
How are you? 
<div class="child nice">
  <p>Hello!</p>
</div>
</div>

<p>Household 1</p>

<div class="parent">
<p>Hi!</p>
<blockquote class="child rebel">
  <p>Don't talk to me!</p>
</blockquote>
</div>

<span class="child">
<span class="parent child rebel">
  <p>Clean your room!</p>
</span>
</span>

<p>End of households</p>

] --- count: false .pull-left[ ## CSS selector
*   selects all elements
div  selects all <div> elements
div, p  selects all <div> and <p> elements
div p  selects all <p> within <div>
div > p  selects all <p> one level deep in <div>
div + p  selects all <p> immediately after a <div>
div ~ p  selects all <p> preceded by a <div>
] .pull-right[
<h1>This is a sample html</h1>

<blockquote>
<p>Maybe stories are just data with a soul.</p>
<footer>—Brene Brown</footer>
</blockquote>

<div id="p1" class="parent">
Hmm
<p>Hi!</p>
How are you? 
<div class="child nice">
  <p>Hello!</p>
</div>
</div>

<p>Household 1</p>

<div class="parent">
<p>Hi!</p>
<blockquote class="child rebel">
  <p>Don't talk to me!</p>
</blockquote>
</div>

<span class="child">
<span class="parent child rebel">
  <p>Clean your room!</p>
</span>
</span>

<p>End of households</p>

] --- count: false .pull-left[ ## CSS selector
*   selects all elements
blockquote  selects all <blockquote> elements
div, p  selects all <div> and <p> elements
div p  selects all <p> within <div>
div > p  selects all <p> one level deep in <div>
div + p  selects all <p> immediately after a <div>
div ~ p  selects all <p> preceded by a <div>
] .pull-right[
<h1>This is a sample html</h1>

<blockquote>
<p>Maybe stories are just data with a soul.</p>
<footer>—Brene Brown</footer>
</blockquote>

<div id="p1" class="parent">
Hmm
<p>Hi!</p>
How are you? 
<div class="child nice">
  <p>Hello!</p>
</div>
</div>

<p>Household 1</p>

<div class="parent">
<p>Hi!</p>
<blockquote class="child rebel">
  <p>Don't talk to me!</p>
</blockquote>
</div>

<span class="child">
<span class="parent child rebel">
  <p>Clean your room!</p>
</span>
</span>

<p>End of households</p>

] --- count: false .pull-left[ ## CSS selector
*   selects all elements
div  selects all <div> elements
div, p  selects all <div> and <p> elements
div p  selects all <p> within <div>
div > p  selects all <p> one level deep in <div>
div + p  selects all <p> immediately after a <div>
div ~ p  selects all <p> preceded by a <div>
] .pull-right[
<h1>This is a sample html</h1>

<blockquote>
<p>Maybe stories are just data with a soul.</p>
<footer>—Brene Brown</footer>
</blockquote>

<div id="p1" class="parent">
Hmm
<p>Hi!</p>
How are you? 
<div class="child nice">
  <p>Hello!</p>
</div>
</div>

<p>Household 1</p>

<div class="parent">
<p>Hi!</p>
<blockquote class="child rebel">
  <p>Don't talk to me!</p>
</blockquote>
</div>

<span class="child">
<span class="parent child rebel">
  <p>Clean your room!</p>
</span>
</span>

<p>End of households</p>

] --- count: false .pull-left[ ## CSS selector
*   selects all elements
div  selects all <div> elements
div, p  selects all <div> and <p> elements
div p  selects all <p> within <div>
div > p  selects all <p> one level deep in <div>
div + p  selects all <p> immediately after a <div>
div ~ p  selects all <p> preceded by a <div>
] .pull-right[
<h1>This is a sample html</h1>

<blockquote>
<p>Maybe stories are just data with a soul.</p>
<footer>—Brene Brown</footer>
</blockquote>

<div id="p1" class="parent">
Hmm
<p>Hi!</p>
How are you? 
<div class="child nice">
  <p>Hello!</p>
</div>
</div>

<p>Household 1</p>

<div class="parent">
<p>Hi!</p>
<blockquote class="child rebel">
  <p>Don't talk to me!</p>
</blockquote>
</div>

<span class="child">
<span class="parent child rebel">
  <p>Clean your room!</p>
</span>
</span>

<p>End of households</p>

] --- count: false .pull-left[ ## CSS selector
*   selects all elements
div  selects all <div> elements
div, p  selects all <div> and <p> elements
p div  selects all <div> within <p>
div > p  selects all <p> one level deep in <div>
div + p  selects all <p> immediately after a <div>
div ~ p  selects all <p> preceded by a <div>
] .pull-right[
<h1>This is a sample html</h1>

<blockquote>
<p>Maybe stories are just data with a soul.</p>
<footer>—Brene Brown</footer>
</blockquote>

<div id="p1" class="parent">
Hmm
<p>Hi!</p>
How are you? 
<div class="child nice">
  <p>Hello!</p>
</div>
</div>

<p>Household 1</p>

<div class="parent">
<p>Hi!</p>
<blockquote class="child rebel">
  <p>Don't talk to me!</p>
</blockquote>
</div>

<span class="child">
<span class="parent child rebel">
  <p>Clean your room!</p>
</span>
</span>

<p>End of households</p>

] --- count: false .pull-left[ ## CSS selector
*   selects all elements
div  selects all <div> elements
div, p  selects all <div> and <p> elements
div p  selects all <p> within <div>
div > p  selects all <p> one level deep in <div>
div + p  selects all <p> immediately after a <div>
div ~ p  selects all <p> preceded by a <div>
] .pull-right[
<h1>This is a sample html</h1>

<blockquote>
<p>Maybe stories are just data with a soul.</p>
<footer>—Brene Brown</footer>
</blockquote>

<div id="p1" class="parent">
Hmm
<p>Hi!</p>
How are you? 
<div class="child nice">
  <p>Hello!</p>
</div>
</div>

<p>Household 1</p>

<div class="parent">
<p>Hi!</p>
<blockquote class="child rebel">
  <p>Don't talk to me!</p>
</blockquote>
</div>

<span class="child">
<span class="parent child rebel">
  <p>Clean your room!</p>
</span>
</span>

<p>End of households</p>

]
Ignores inline elements like span, i, b,...
--- count: false .pull-left[ ## CSS selector
*   selects all elements
div  selects all <div> elements
div, p  selects all <div> and <p> elements
div p  selects all <p> within <div>
div > p  selects all <p> one level deep in <div>
div + p  selects all <p> immediately after a <div>
div ~ p  selects all <p> preceded by a <div>
] .pull-right[
<h1>This is a sample html</h1>

<blockquote>
<p>Maybe stories are just data with a soul.</p>
<footer>—Brene Brown</footer>
</blockquote>

<div id="p1" class="parent">
Hmm
<p>Hi!</p>
How are you? 
<div class="child nice">
  <p>Hello!</p>
</div>
</div>

<p>Household 1</p>

<div class="parent">
<p>Hi!</p>
<blockquote class="child rebel">
  <p>Don't talk to me!</p>
</blockquote>
</div>

<span class="child">
<span class="parent child rebel">
  <p>Clean your room!</p>
</span>
</span>

<p>End of households</p>

]
Ignores inline elements like span, i, b,...
--- count: false .pull-left[ ## CSS selector
*   selects all elements
div  selects all <div> elements
div, p  selects all <div> and <p> elements
div p  selects all <p> within <div>
div > p  selects all <p> one level deep in <div>
div + p  selects all <p> immediately after a <div>
div ~ p  selects all <p> preceded by a <div>
] .pull-right[
<h1>This is a sample html</h1>

<blockquote>
<p>Maybe stories are just data with a soul.</p>
<footer>—Brene Brown</footer>
</blockquote>

<div id="p1" class="parent">
Hmm
<p>Hi!</p>
How are you? 
<div class="child nice">
  <p>Hello!</p>
</div>
</div>

<p>Household 1</p>

<div class="parent">
<p>Hi!</p>
<blockquote class="child rebel">
  <p>Don't talk to me!</p>
</blockquote>
</div>

<span class="child">
<span class="parent child rebel">
  <p>Clean your room!</p>
</span>
</span>

<p>End of households</p>

] --- count: false .pull-left[ ## CSS selector
.classname  selects all elements with the attribute class="classname".
.c1.c2  selects all elements with both c1 and c2 in their class attribute.
.c1 .c2  selects all elements with class c2 that are descendants of an element with class c1.
#idname  selects all elements with the attribute id="idname".
] .pull-right[
<h1>This is a sample html</h1>

<blockquote>
<p>Maybe stories are just data with a soul.</p>
<footer>—Brene Brown</footer>
</blockquote>

<div id="p1" class="parent">
Hmm
<p>Hi!</p>
How are you? 
<div class="child nice">
  <p>Hello!</p>
</div>
</div>

<p>Household 1</p>

<div class="parent">
<p>Hi!</p>
<blockquote class="child rebel">
  <p>Don't talk to me!</p>
</blockquote>
</div>

<span class="child">
<span class="parent child rebel">
  <p>Clean your room!</p>
</span>
</span>

<p>End of households</p>

] --- count: false .pull-left[ ## CSS selector
.parent  selects all elements with the attribute class="parent".
.c1.c2  selects all elements with both c1 and c2 in their class attribute.
.c1 .c2  selects all elements with class c2 that are descendants of an element with class c1.
#idname  selects all elements with the attribute id="idname".
Note some offspring do not inherit class from their parents.
] .pull-right[
<h1>This is a sample html</h1>

<blockquote>
<p>Maybe stories are just data with a soul.</p>
<footer>—Brene Brown</footer>
</blockquote>

<div id="p1" class="parent">
Hmm
<p>Hi!</p>
How are you? 
<div class="child nice">
  <p>Hello!</p>
</div>
</div>

<p>Household 1</p>

<div class="parent">
<p>Hi!</p>
<blockquote class="child rebel">
  <p>Don't talk to me!</p>
</blockquote>
</div>

<span class="child">
<span class="parent child rebel">
  <p>Clean your room!</p>
</span>
</span>

<p>End of households</p>

] --- count: false .pull-left[ ## CSS selector
.classname  selects all elements with the attribute class="classname".
.child.rebel  selects all elements with both child and rebel in their class attribute.
.c1 .c2  selects all elements with class c2 that are descendants of an element with class c1.
#idname  selects all elements with the attribute id="idname".
] .pull-right[
<h1>This is a sample html</h1>

<blockquote>
<p>Maybe stories are just data with a soul.</p>
<footer>—Brene Brown</footer>
</blockquote>

<div id="p1" class="parent">
Hmm
<p>Hi!</p>
How are you?
<div class="child nice">
  <p>Hello!</p>
</div>
</div>

<p>Household 1</p>

<div class="parent">
<p>Hi!</p>
<blockquote class="child rebel">
  <p>Don't talk to me!</p>
</blockquote>
</div>

<span class="child">
<span class="parent child rebel">
  <p>Clean your room!</p>
</span>
</span>

<p>End of households</p>

] --- count: false .pull-left[ ## CSS selector
.classname  selects all elements with the attribute class="classname".
.c1.c2  selects all elements with both c1 and c2 in their class attribute.
.parent .rebel  selects all elements with class rebel that are descendants of an element with class parent.
#idname  selects all elements with the attribute id="idname".
] .pull-right[
<h1>This is a sample html</h1>

<blockquote>
<p>Maybe stories are just data with a soul.</p>
<footer>—Brene Brown</footer>
</blockquote>

<div id="p1" class="parent">
Hmm
<p>Hi!</p>
How are you?
<div class="child nice">
  <p>Hello!</p>
</div>
</div>

<p>Household 1</p>

<div class="parent">
<p>Hi!</p>
<blockquote class="child rebel">
  <p>Don't talk to me!</p>
</blockquote>
</div>

<span class="child">
<span class="parent child rebel">
  <p>Clean your room!</p>
</span>
</span>

<p>End of households</p>

] --- count: false .pull-left[ ## CSS selector
.classname  selects all elements with the attribute class="classname".
.c1.c2  selects all elements with both c1 and c2 in their class attribute.
.c1 .c2  selects all elements with class c2 that are descendants of an element with class c1.
#p1  selects all elements with the attribute id="p1".
] .pull-right[
<h1>This is a sample html</h1>

<blockquote>
<p>Maybe stories are just data with a soul.</p>
<footer>—Brene Brown</footer>
</blockquote>

<div id="p1" class="parent">
Hmm
<p>Hi!</p>
How are you?
<div class="child nice">
  <p>Hello!</p>
</div>
</div>

<p>Household 1</p>

<div class="parent">
<p>Hi!</p>
<blockquote class="child rebel">
  <p>Don't talk to me!</p>
</blockquote>
</div>

<span class="child">
<span class="parent child rebel">
  <p>Clean your room!</p>
</span>
</span>

<p>End of households</p>

]
Unlike class, an element can have only one id value, and that value must be unique within the whole HTML document.
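---

The selectors from the previous slides can be tried out in R. This is a rough sketch rather than a definitive recipe: it assumes {rvest} `>= v1.0.0`, uses a trimmed copy of the sample HTML (the two `<div>` "households" only), and `pick()` is a hypothetical helper defined here purely for illustration.

```r
library(rvest)

# A trimmed copy of the sample HTML: the two <div> households and the <p> between them.
page <- minimal_html('
  <div id="p1" class="parent">
    <p>Hi!</p>
    <div class="child nice"><p>Hello!</p></div>
  </div>
  <p>Household 1</p>
  <div class="parent">
    <p>Hi!</p>
    <blockquote class="child rebel"><p>Don\'t talk to me!</p></blockquote>
  </div>
')

# Hypothetical helper: text of every element matched by a CSS selector.
pick <- function(css) html_text2(html_elements(page, css))

pick("div > p")         # <p> elements one level deep in a <div>
pick("div + p")         # <p> elements immediately after a <div>
pick(".child.rebel")    # elements with both classes child and rebel
pick(".parent .nice")   # elements with class nice inside an element with class parent
pick("#p1 > p")         # <p> elements one level deep in the element with id p1
```

Swapping the selector string is the only change needed to test another rule, e.g. `pick("div ~ p")`.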
--- ## JavaScript (JS)* * JS is a programming language that enables interactive components in HTML documents. * 2 ways to insert JS into an HTML document: + **internally** by defining within ` ``` + **externally** by using the `src` attribute to refer to the external file: ```html ``` --- class: inverse middle ## Web scraping `r emo::ji("spider_web")` --- ```{r opts0, echo = FALSE} op <- options(width = 40) ``` .pull-left[ .center[] ] .pull-right[ Use {rvest} `>= v1.0.0` (if not, update) ```{r read-html} ``` ] --- .pull-left[ ## Inspect elements

.center[] ] .pull-right[ ## [CSS selectors](https://selectorgadget.com/)

.center[] ] --- .pull-left[


.center[] ] .pull-right[ `html_element()` select element ```{r html-el} ``` `html_name()` get element name ```{r html-nm} ``` ] --- .pull-left[


.center[] ] .pull-right[ `html_children()` get element children ```{r html-child} ``` `html_name()` get element name ```{r html-child-nm} ``` ] --- .pull-left[


.center[] ] .pull-right[ `html_text2()` get element text ```{r html-text} ``` `html_attr()` get element attributes ```{r html-attr} ``` ] --- ## `url_absolute()`: turn relative URLs into absolute URLs ```{r html-url} ``` --- .pull-left[


.center[] ] .pull-right[ `html_elements()` select elements ```{r html-el-i} ``` `r emo::ji("arrow_up")` [fontawesome](http://fontawesome.com) icons ] --- .pull-left[
```{r html-info} ``` .center[] ] .pull-right[ select all `

` elements ```{r html-h3} ``` select `#timetable` id ```{r html-id} ``` ] --- .pull-left[


.center[] ] .pull-right[ select the first `` element ```{r html-table} ``` ] ??? * table on the web primarily for presentation purpose, not data storage * isn't clean from web scraping --- ## Download pdf slides at once ```{r links} ``` --- ## Download pdf slides at once ```{r pdf-links} ``` ```{r pdf-links-dl, eval = FALSE} ``` --- class: inverse middle ## REST API --- ## [Github REST API](https://docs.github.com/en/rest) .pull-left[ * Each URL is called a **request**. * The data sent back to you is called an HTTP **response** that consists of headers and a body. The root-endpoint of Github's API is . ```{r root-endpoint, eval = FALSE} ``` ] .pull-right[ ```{r ref.label = "root-endpoint", echo = FALSE, highlight.output = 2:5} ``` ] --- ## HTTP methods
* .brown[`GET`] to **retrieve** resource data/information only, with no change in the state of the resource
* .brown[`POST`] to **create** new subordinate resources, e.g. upload a file
* .brown[`PUT`] to **update/replace** an existing resource in its entirety
* .brown[`DELETE`] to **delete** resources
* .brown[`PATCH`] to make a **partial update** to a resource (not all browsers support `PATCH`)

---

## Path

.pull-left[
The **path** determines the resource you’re requesting.

[`GET /repos/{owner}/{repo}`](https://docs.github.com/en/rest/reference/repos#get-a-repository)

```{r endpoint, eval = FALSE}
```
]

.pull-right[
```{r ref.label = "endpoint", echo = FALSE}
```
]

---

## Parse the response

.pull-left[
```{r http-type}
```

Content type of a response

* `"image/png"`
* `"application/text"`
* `"application/csv"`
* `...`
]

.pull-right[
```{r content}
```
]

---

## Status code

```{r status}
```
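
Status codes can be decoded without sending a request, since `http_status()` in {httr} also accepts a bare number. The sketch below is illustrative only: the four codes are an arbitrary selection, and `codes` and `lookup` are names chosen for this example.

```r
library(httr)

# An arbitrary selection of codes, one from each of four status classes.
codes <- c(200L, 301L, 404L, 500L)

# One row per code: the broad category and the standard reason phrase.
lookup <- do.call(rbind, lapply(codes, function(code) {
  st <- http_status(code)
  data.frame(
    code = code,
    category = st$category,
    reason = st$reason,
    stringsAsFactors = FALSE
  )
}))
lookup
```

With a real response object, `status_code()` extracts the numeric code and `stop_for_status()` turns a 4xx or 5xx response into an R error.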
.pull-left[ ```{r status-lst1, eval = FALSE} ``` ] .pull-right[ ```{r status-lst2, eval = FALSE} ``` ] .footnote[] --- ## Auckland Transport Open GIS Data: [bus stop](https://data-atgis.opendata.arcgis.com/datasets/bus-stop?geometry=173.281%2C-37.229%2C176.247%2C-36.459) ```{r bus-stop} ``` --- class: middle .pull-left[ ```{r bus-stop-plot, eval = FALSE} ``` ] .pull-right[ ```{r bus-stop-plot2, ref.label = "bus-stop-plot", echo = FALSE} ``` ] --- ## Reading .pull-left[

.center[ ]] .pull-right[ * [Get started with {rvest}](https://rvest.tidyverse.org/articles/rvest.html) * [{httr} quickstart guide](https://httr.r-lib.org/articles/quickstart.html) ]