---
title: "STATS 220"
subtitle: "Web scraping`r emo::ji('globe_with_meridians')`"
type: "lecture"
date: ""
output:
  xaringan::moon_reader:
    css: ["assets/remark.css"]
    lib_dir: libs
    nature:
      ratio: 16:9
      highlightStyle: github
      highlightLines: true
      countIncrementalSlides: false
---
class: inverse middle
```{r initial, echo = FALSE, cache = FALSE, results = 'hide'}
library(knitr)
options(htmltools.dir.version = FALSE, htmltools.preserve.raw = FALSE,
  tibble.width = 60, tibble.print_min = 6)
opts_chunk$set(
  echo = TRUE, warning = FALSE, message = FALSE, comment = "#>",
  fig.path = 'figure/', cache.path = 'cache/', cache = FALSE, fig.retina = 3,
  fig.align = 'center', fig.width = 4.5, fig.height = 4, fig.show = 'hold',
  dpi = 120
)
```
```{r xaringan-panelset, echo = FALSE}
xaringanExtra::use_panelset()
```
```{r external, include = FALSE, cache = FALSE}
read_chunk('R/09-web-scrape.R')
```
## Web technology
.footnote[I thank [Dr Emi Tanaka](https://emitanaka.org/about.html) for this part, adapted from her "Communicating with Data" course.]
---
class: middle
## World Wide Web (WWW)
The WWW (or the **Web**) is an information system in which documents (web pages) are identified by Uniform Resource Locators (**URL**s).

A web page consists of:
* **HTML** provides the basic structure of the web page
* **CSS** controls the look of the web page (optional)
* **JS** is a programming language that can modify the behaviour of elements of the web page (optional)
---
## Hypertext Markup Language (HTML)
HTML files:

* have the extension `.html`.
* are rendered by a web browser, usually via a URL.
* are text files that follow a special syntax telling web browsers how to render them.

.pull-left[
.center[**via a web browser**
]
]
.pull-right[
.center[**via a text editor**
]
]
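---
## HTML files are plain text

A minimal sketch of the point above, using base R only. The file name `hello.html` and its contents are made up for illustration.

```r
# Hypothetical file written to a temporary directory, for illustration only
path <- file.path(tempdir(), "hello.html")
writeLines(c("<h1>Hello</h1>", "<p>This is a paragraph.</p>"), path)

# Reading the file back as text returns the markup itself, one string per line
html_lines <- readLines(path)
html_lines

# Opening the same file in a web browser renders it instead of showing the markup
# browseURL(path)
```

The same file gives two views: a text editor (or `readLines()`) shows the tags, while a browser interprets them.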
---
## HTML structure
```html
<p>This is a paragraph.</p>
```
???
* `servr::httd()` to serve
* HTML: hier str: elements
---
## HTML syntax

| Part | In `<span style="color:blue;">Author content</span>` |
|:---|:---|
| start tag | `<span style="color:blue;">` |
| end tag | `</span>` |
| content | `Author content` |
| element name | `span` |
| attribute | `style="color:blue;"` |
| attribute name | `style` |
| attribute value | `color:blue;` |
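---
## HTML syntax in R

The parts of an element can also be pulled out programmatically. This is a minimal sketch, assuming the rvest package is installed; the object names `page` and `node` are ours.

```r
library(rvest)

# The <span> element discussed above, wrapped in a minimal HTML document
page <- minimal_html('<span style="color:blue;">Author content</span>')
node <- html_element(page, "span")

html_name(node)           # element name
html_attrs(node)          # all attributes, as a named character vector
html_attr(node, "style")  # value of the attribute named "style"
html_text(node)           # content between the start and end tags
```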
---
## HTML elements

| Element | Example |
|:---|:---|
| block element | `<div>content</div>` |
| inline element | `<span>content</span>` |
| paragraph | `<p>content</p>` |
| header level 1 | `<h1>content</h1>` |
| header level 2 | `<h2>content</h2>` |
| italic | `<i>content</i>` |
| emphasised text | `<em>content</em>` |
| strong importance | `<strong>content</strong>` |
| link | `<a href="https://stats220.earo.me/">content</a>` |
| unordered list | `<ul> <li>item 1</li> <li>item 2</li> </ul>` |
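---
## HTML elements in R

Different elements carry their information in different places: a list item in its content, a link in its `href` attribute. A minimal sketch, assuming the rvest package is installed; the fragment and the object name `page` are made up for illustration.

```r
library(rvest)

# A small fragment combining a few of the elements listed above
page <- minimal_html('
  <h1>content</h1>
  <p>Some <em>emphasised</em> text.</p>
  <a href="https://stats220.earo.me/">content</a>
  <ul> <li>item 1</li> <li>item 2</li> </ul>
')

# Text of every list item
items <- page %>% html_elements("li") %>% html_text2()

# Target of the link, stored in the href attribute rather than the content
link <- page %>% html_element("a") %>% html_attr("href")
```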
---
## CSS syntax

| Part | In `h1 { color: blue; }` |
|:---|:---|
| selector | `h1` |
| property | `color: blue;` |
| property name | `color` |
| property value | `blue` |
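---
## CSS properties

| Effect | CSS | Result |
|:---|:---|:---|
| background color | `div { background-color: yellow; }` | <div style="background-color: yellow;">Sample text</div> |
| text color | `div { color: purple; }` | <div style="color: purple;">Sample text</div> |
| border | `div { border: 1px dashed brown; }` | <div style="border: 1px dashed brown;">Sample text</div> |
| left border only | `div { border-left: 10px solid pink; }` | <div style="border-left: 10px solid pink;">Sample text</div> |
| text size | `div { font-size: 10pt; }` | <div style="font-size: 10pt;">Sample text</div> |
| padding | `div { background-color: yellow; padding: 10px; }` | <div style="background-color: yellow; padding: 10px;">Sample text</div> |
| margin | `div { background-color: yellow; margin: 10px; }` | <div style="background-color: yellow; margin: 10px;">Sample text</div> |
| center align text | `div { background-color: yellow; padding-top: 20px; text-align: center; }` | <div style="background-color: yellow; padding-top: 20px; text-align: center;">Sample text</div> |
| font family | `div { font-family: Marker Felt, times; }` | <div style="font-family: Marker Felt, times;">Sample text</div> |
| strike | `div { text-decoration: line-through; }` | <div style="text-decoration: line-through;">Sample text</div> |
| underline | `div { text-decoration: underline; }` | <div style="text-decoration: underline;">Sample text</div> |
| opacity | `div { opacity: 0.3 }` | <div style="opacity: 0.3;">Sample text</div> |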
---
.pull-left[
## CSS selector

| Selector | Selects |
|:---|:---|
| `*` | all elements |
| `div` | all `<div>` elements |
| `div, p` | all `<div>` and `<p>` elements |
| `div p` | all `<p>` within `<div>` |
| `div > p` | all `<p>` one level deep in `<div>` |
| `div + p` | all `<p>` immediately after a `<div>` |
| `div ~ p` | all `<p>` preceded by a `<div>` |

Any element name works the same way: `blockquote` selects all `<blockquote>` elements, and `p div` selects all `<div>` within `<p>`.
]
.pull-right[
<h1>This is a sample html</h1>
<blockquote>
<p>Maybe stories are just data with a soul.</p>
<footer>—Brene Brown</footer>
</blockquote>
<div id="p1" class="parent">
Hmm
<p>Hi!</p>
How are you?
<div class="child nice">
<p>Hello!</p>
</div>
</div>
<p>Household 1</p>
<div class="parent">
<p>Hi!</p>
<blockquote class="child rebel">
<p>Don't talk to me!</p>
</blockquote>
</div>
<span class="child">
<span class="parent child rebel">
<p>Clean your room!</p>
</span>
</span>
<p>End of households</p>
]
---
.pull-left[
## CSS selector

| Selector | Selects | Example |
|:---|:---|:---|
| `.classname` | all elements with the attribute `class="classname"` | `.parent` |
| `.c1.c2` | all elements with both `c1` and `c2` within their `class` attribute | `.child.rebel` |
| `.c1 .c2` | all elements with class `c2` that are descendants of an element with class `c1` | `.parent .rebel` |
| `#idname` | all elements with the attribute `id="idname"` | `#p1` |
]
.pull-right[
<h1>This is a sample html</h1>
<blockquote>
<p>Maybe stories are just data with a soul.</p>
<footer>—Brene Brown</footer>
</blockquote>
<div id="p1" class="parent">
Hmm
<p>Hi!</p>
How are you?
<div class="child nice">
<p>Hello!</p>
</div>
</div>
<p>Household 1</p>
<div class="parent">
<p>Hi!</p>
<blockquote class="child rebel">
<p>Don't talk to me!</p>
</blockquote>
</div>
<span class="child">
<span class="parent child rebel">
<p>Clean your room!</p>
</span>
</span>
<p>End of households</p>
]

Unlike `class`, an element can have only one `id` value, and that value must be unique in the whole HTML document.
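---
## CSS selectors in R

The selectors above can be tried out from R. This is a minimal sketch, assuming the rvest package is installed; `households` and the other object names are ours, and the fragment is a shortened copy of the sample HTML (the two `<div>` households only).

```r
library(rvest)

# Shortened copy of the sample HTML, wrapped in a minimal document
households <- minimal_html('
<div id="p1" class="parent">
  Hmm
  <p>Hi!</p>
  How are you?
  <div class="child nice">
    <p>Hello!</p>
  </div>
</div>
<p>Household 1</p>
<div class="parent">
  <p>Hi!</p>
  <blockquote class="child rebel">
    <p>Don\'t talk to me!</p>
  </blockquote>
</div>
<p>End of households</p>
')

# div > p: <p> elements that are direct children of a <div>
direct_children <- households %>% html_elements("div > p") %>% html_text2()

# div + p: <p> elements immediately after a <div>
after_div <- households %>% html_elements("div + p") %>% html_text2()

# .parent: every element whose class attribute includes "parent"
parents <- households %>% html_elements(".parent")

# .parent .rebel: elements with class "rebel" inside an element with class "parent"
rebels <- households %>% html_elements(".parent .rebel")

# #p1: the element with id="p1"
first_household <- households %>% html_element("#p1")
```

Note that `div > p` skips the paragraph inside the `<blockquote>`, because its parent is not a `<div>`, whereas `div p` would include it.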