class: center, middle, inverse, title-slide

# STATS 220
## Web scraping 🌐

---
class: inverse middle

## Web technology <i class="fab fa-html5 orange"></i> <i class="fab fa-css3-alt blue"></i> <i class='fab fa-js-square yellow'></i>

.footnote[I thank [Dr Emi Tanaka](https://emitanaka.org/about.html) for this part, adapted from her "Communicating with Data" course.]

---
class: middle

## World Wide Web (WWW)

The WWW (or the **Web**) is an information system in which documents (web pages) are identified by Uniform Resource Locators (**URL**s).

A web page consists of:

* <i class="fab fa-html5 orange"></i> **HTML** provides the basic structure of the web page
* <i class="fab fa-css3-alt blue"></i> **CSS** controls the look of the web page (optional)
* <i class='fab fa-js-square yellow'></i> **JS** (JavaScript) is a programming language that can modify the behaviour of elements of the web page (optional)

---

## <i class="fab fa-html5 orange"></i> Hypertext Markup Language (HTML)

HTML files

* have the extension `.html`.
* are rendered by a web browser via a URL.
* are text files that follow a special syntax telling web browsers how to render them.

.pull-left[
.center[**via a web browser**
<img src="img/browser-220.png" width="100%">
]
]
.pull-right[
.center[**via a text editor**
<img src="img/code-220.png" width="100%">
]
]

---

## <i class="fab fa-html5 orange"></i> HTML structure

```html
<!DOCTYPE html>
<html>
<!--This is a comment and is ignored by the web browser.-->
<head>
  <!--This section contains web page metadata.-->
  <title>STATS 220 Data Technology</title>
  <meta name="author" content="Earo Wang">
  <link rel="stylesheet" href="css/styles.css">
</head>
<body>
  <!--This section contains what you want to display on your web page.-->
  <h1>I'm a first level header</h1>
  <p>This is a <b>paragraph</b>.</p>
</body>
</html>
```

???
* `servr::httd()` to serve the page locally
* HTML has a hierarchical structure: elements (`<tags>`) with optional attributes, and contents
* more than 100 elements; each HTML page must have `<head>` and `<body>` (rich format -> md)
* block tags: h1, p
* inline tags: b, a

---

## <i class="fab fa-html5 orange"></i> HTML syntax

.center[`<span style="color:blue;">Author content</span>` <i class="fas fa-arrow-right"></i> <span style="color:blue;">Author content</span>]

<table style="width:100%">
<tr> <td style="text-align:right;padding-right:30px;">start tag:</td><td><span class="remark-code" style="font-size:16pt"><span class="red">&lt;span style="color:blue;"&gt;</span><span class="gray">Author content&lt;/span&gt;</span></span> </td> </tr>
<tr> <td style="text-align:right;padding-right:30px;">end tag: </td><td> <span class="remark-code" style="font-size:16pt"><span class="gray">&lt;span style="color:blue;"&gt;Author content</span><span class="red">&lt;/span&gt;</span></span> </td> </tr>
<tr> <td style="text-align:right;padding-right:30px;">content: </td><td> <span class="remark-code" style="font-size:16pt"><span class="gray">&lt;span style="color:blue;"&gt;</span><span class="red">Author content</span><span class="gray">&lt;/span&gt;</span></span> </td> </tr>
<tr> <td style="text-align:right;padding-right:30px;">element name: </td><td> <span class="remark-code" style="font-size:16pt"><span class="gray">&lt;</span><span class="red">span</span><span class="gray"> style="color:blue;"&gt;Author content&lt;/span&gt;</span></span> </td> </tr>
<tr> <td style="text-align:right;padding-right:30px;">attribute: </td><td> <span class="remark-code" style="font-size:16pt"><span class="gray">&lt;span </span><span class="red">style="color:blue;"</span><span class="gray">&gt;Author content&lt;/span&gt;</span></span> </td> </tr>
<tr> <td style="text-align:right;padding-right:30px;">attribute name: </td><td> <span class="remark-code" style="font-size:16pt"><span class="gray">&lt;span </span><span class="red">style</span><span class="gray">="color:blue;"&gt;Author content&lt;/span&gt;</span></span> </td> </tr>
<tr> <td style="text-align:right;padding-right:30px;">attribute value: </td><td> <span class="remark-code" style="font-size:16pt"><span class="gray">&lt;span style=</span><span class="red">"color:blue;"</span><span class="gray">&gt;Author content&lt;/span&gt;</span></span> </td> </tr>
</table>
<hr>

.center[Not all HTML elements have an end tag:]

.center[
<span style="font-size:18pt;">`<img height="40px" src="https://tinyurl.com/rlogo-svg">`</span> <i class="fas fa-arrow-right"></i> <img height="40px" src="https://tinyurl.com/rlogo-svg">
]

---

## <i class="fab fa-html5 orange"></i> HTML elements

<table style="width:100%">
<tr> <td style="text-align:right;padding-right:30px;">block element:</td><td><span class="remark-code red" style="font-size:16pt">&lt;div&gt;<span class="gray">content</span>&lt;/div&gt;</span></td> </tr>
<tr> <td style="text-align:right;padding-right:30px;">inline element:</td><td><span class="remark-code red" style="font-size:16pt">&lt;span&gt;<span class="gray">content</span>&lt;/span&gt;</span></td> </tr>
<tr> <td style="text-align:right;padding-right:30px;">paragraph:</td><td><span class="remark-code red" style="font-size:16pt">&lt;p&gt;<span class="gray">content</span>&lt;/p&gt;</span></td> </tr>
<tr> <td style="text-align:right;padding-right:30px;">header level 1:</td><td><span class="remark-code red" style="font-size:16pt">&lt;h1&gt;<span class="gray">content</span>&lt;/h1&gt;</span></td> </tr>
<tr> <td style="text-align:right;padding-right:30px;">header level 2:</td><td><span class="remark-code red" style="font-size:16pt">&lt;h2&gt;<span class="gray">content</span>&lt;/h2&gt;</span></td> </tr>
<tr> <td style="text-align:right;padding-right:30px;">italic:</td><td><span class="remark-code red" style="font-size:16pt">&lt;i&gt;<span class="gray">content</span>&lt;/i&gt;</span></td> </tr>
<tr> <td style="text-align:right;padding-right:30px;">emphasised text:</td><td><span class="remark-code red" style="font-size:16pt">&lt;em&gt;<span class="gray">content</span>&lt;/em&gt;</span></td> </tr>
<tr> <td style="text-align:right;padding-right:30px;">strong
importance:</td><td><span class="remark-code red" style="font-size:16pt">&lt;strong&gt;<span class="gray">content</span>&lt;/strong&gt;</span></td> </tr>
<tr> <td style="text-align:right;padding-right:30px;">link:</td><td><span class="remark-code red" style="font-size:16pt">&lt;a href="https://stats220.earo.me/"&gt;<span class="gray">content</span>&lt;/a&gt;</span></td> </tr>
<tr> <td valign="top" style="text-align:right;padding-right:30px;">unordered list:</td><td><span class="remark-code red" style="font-size:16pt">&lt;ul&gt;<br>&lt;li&gt;<span class="gray">item 1</span>&lt;/li&gt;<br>&lt;li&gt;<span class="gray">item 2</span>&lt;/li&gt;<br>&lt;/ul&gt;</span></td> </tr>
</table>

???

How these are rendered in the browser depends on the browser's default styles, the `style` attribute, or CSS.

---

## <i class="fab fa-css3-alt blue"></i> Cascading Style Sheets (CSS)

* CSS files have the extension `.css`
* 3 ways to style elements in HTML:
  * **inline** by using the `style` attribute inside an HTML start tag:
  <center> <span class="remark-code gray" style="font-size:14pt;">&lt;h1 <span class="red">style="color:blue;"</span>&gt;Blue Header&lt;/h1&gt;</span> </center>
  * **externally** by using the `<link>` element:
  <center> <span class="remark-code red" style="font-size:14pt;">&lt;link rel="stylesheet" href="styles.css"&gt;</span> </center>
  * **internally** by defining rules within a `<style>` element:

<div style="margin-left:35%; width:350px;">

```html
<style type="text/css">
h1 { color: blue; }
</style>
```

</div>

By convention, the `<style>` and `<link>` elements tend to go into the `<head>` section of the HTML document.
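---

## <i class="fab fa-css3-alt blue"></i> CSS: three ways in one page

The three ways of styling listed above can appear together in a single document. The page below is a minimal sketch rather than part of the course material: the title and text are made up for illustration, and `styles.css` is assumed to be a stylesheet sitting next to the HTML file.

```html
<!DOCTYPE html>
<html>
<head>
  <title>Three ways to style (illustrative)</title>
  <!-- external: rules live in a separate file (styles.css is assumed to exist) -->
  <link rel="stylesheet" href="styles.css">
  <!-- internal: rules apply to this page only -->
  <style type="text/css">
    h1 { color: blue; }
  </style>
</head>
<body>
  <h1>Blue header, styled internally</h1>
  <!-- inline: the style attribute applies to this one element -->
  <p style="color: purple;">Purple paragraph, styled inline</p>
</body>
</html>
```

When several rules set the same property on the same element, an inline `style` attribute normally takes precedence over rules from `<style>` or a linked stylesheet.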
--- ## <i class="fab fa-css3-alt blue"></i> CSS syntax .pull-left[ ```html <style type="text/css"> h1 { color: blue; } </style> <h1>This is a header</h1> ``` ] <div style="margin-left:55%; width:350px;"> <br> <h2 style="color:blue">This is a header</h2> </div> <table style="width:100%"> <tr> <td style="text-align:right;padding-right:30px;">selector:</td><td><span class="remark-code" style="font-size:16pt"><span class="red">h1</span><span class="gray"> { color: blue; }</span></span> </td> </tr> <tr> <td style="text-align:right;padding-right:30px;">property:</td><td><span class="remark-code gray" style="font-size:16pt;">h1 { <span class="red">color: blue;</span> }</span> </td> </tr> <tr> <td style="text-align:right;padding-right:30px;">property name:</td><td><span class="remark-code gray" style="font-size:16pt;">h1 { <span class="red">color</span>: blue; } </span></td> </tr> <tr> <td style="text-align:right;padding-right:30px;">property value:</td><td><span class="remark-code gray" style="font-size:16pt;">h1 { color: <span class="red">blue</span>; } </span></td> </tr> </table> .pull-left[ You may have multiple properties for a single selector.➡️ ] .pull-right[ ```css h1 { color: blue; font-size: 16pt; } ``` ] --- ## <i class="fab fa-css3-alt blue"></i> CSS properties .center[ ```html <div>Sample text</div> ``` ] <table style="width:100%"> <tr> <td style="text-align:right;padding-right:30px;">background color:</td> <td><span class="remark-code gray" style="font-size:16pt">div { <span class="red">background-color: yellow;</span> }</span> </td> <td> <div style="background-color: yellow;">Sample text</div> </td> </tr> <tr> <td style="text-align:right;padding-right:30px;">text color:</td> <td><span class="remark-code gray" style="font-size:16pt">div { <span class="red">color: purple;</span> }</span> </td> <td> <div style="color: purple;">Sample text</div> </td> </tr> <tr> <td style="text-align:right;padding-right:30px;">border:</td> <td><span class="remark-code gray" 
style="font-size:16pt">div { <span class="red">border: 1px dashed brown;</span> }</span> </td> <td> <div style="border: 1px dashed brown;">Sample text</div> </td> </tr> <tr> <td style="text-align:right;padding-right:30px;">left border only:</td> <td><span class="remark-code gray" style="font-size:16pt">div { <span class="red">border-left: 10px solid pink;</span> }</span> </td> <td> <div style="border-left: 10px solid pink;">Sample text</div> </td> </tr> <tr> <td style="text-align:right;padding-right:30px;">text size:</td> <td><span class="remark-code gray" style="font-size:16pt">div { <span class="red">font-size: 10pt;</span> }</span> </td> <td> <div style="font-size:10pt;">Sample text</div> </td> </tr> <tr> <td valign="top" style="text-align:right;padding-right:30px;">padding:</td> <td valign="top"><span class="remark-code gray" style="font-size:16pt">div { background-color: yellow; <br>     <span class="red">padding: 10px;</span> }</span> </td> <td> <div style="background-color: yellow;padding:10px;">Sample text</div> </td> </tr> <tr> <td valign="top" style="text-align:right;padding-right:30px;">margin:</td> <td valign="top"><span class="remark-code gray" style="font-size:16pt">div { background-color: yellow; <br>     <span class="red">margin: 10px;</span> }</span> </td> <td> <div style="background-color: yellow;margin:10px;">Sample text</div> </td> </tr> </table> --- ## <i class="fab fa-css3-alt blue"></i> CSS properties .center[ ```html <div>Sample text</div> ``` ] <table style="width:100%"> <tr> <td valign="top" style="text-align:right;padding-right:30px;">center align text:</td> <td valign="top"><span class="remark-code gray" style="font-size:16pt">div { background-color: yellow; <br>     padding-top: 20px;<br>     <span class="red">text-align: center;</span> }</span> </td> <td> <div style="background-color: yellow;text-align: center;padding-top: 20px;">Sample text</div> </td> </tr> <tr> <td valign="top" style="text-align:right;padding-right:30px;">font 
family:</td> <td valign="top"><span class="remark-code gray" style="font-size:16pt">div { <span class="red">font-family: Marker Felt, times;</span> }</span> </td> <td> <div style="font-family: Marker Felt, times;">Sample text</div> </td> </tr> <tr> <td valign="top" style="text-align:right;padding-right:30px;">strike:</td> <td valign="top"><span class="remark-code gray" style="font-size:16pt">div { <span class="red">text-decoration: line-through;</span> }</span> </td> <td> <div style="text-decoration: line-through;">Sample text</div> </td> </tr> <tr> <td valign="top" style="text-align:right;padding-right:30px;">underline:</td> <td valign="top"><span class="remark-code gray" style="font-size:16pt">div { <span class="red">text-decoration: underline;</span> }</span> </td> <td> <div style="text-decoration: underline;">Sample text</div> </td> </tr> <tr> <td valign="top" style="text-align:right;padding-right:30px;">opacity:</td> <td valign="top"><span class="remark-code gray" style="font-size:16pt">div { <span class="red">opacity: 0.3</span> }</span> </td> <td> <div style="opacity: 0.3;">Sample text</div> </td> </tr> </table> --- count: false .pull-left[ ## <i class="fab fa-css3-alt blue"></i> CSS selector <table class="gray" style="width:98%;margin-left:10px;margin-right:10px;"> <tr class="red"> <td class="remark-code">* </td><td> </td><td>selects all elements</td> </tr> <tr> <td class="remark-code">div</td><td> </td><td>selects all <span class="remark-code" style="font-size:16pt"><div></span> elements</td> </tr> <tr> <td class="remark-code">div, p</td><td> </td><td>selects all <span class="remark-code" style="font-size:16pt"><div></span> and <span class="remark-code" style="font-size:16pt"><p></span> elements</td> </tr> <tr> <td class="remark-code">div p</td><td> </td><td>selects all <span class="remark-code" style="font-size:16pt"><p></span> within <span class="remark-code" style="font-size:16pt"><div></span></td> </tr> <tr> <td class="remark-code">div > p</td><td> 
</td><td>selects all <span class="remark-code" style="font-size:16pt"><p></span> one level deep in <span class="remark-code" style="font-size:16pt"><div></span></td> </tr> <tr valign="top"> <td class="remark-code">div + p</td><td> </td><td>selects all <span class="remark-code" style="font-size:16pt"><p></span> immediately after a <span class="remark-code" style="font-size:16pt"><div></span></td> </tr> <tr> <td class="remark-code">div ~ p</td><td> </td><td>selects all <span class="remark-code" style="font-size:16pt"><p></span> preceded by a <span class="remark-code" style="font-size:16pt"><div></span></td> </tr> </table> ] .pull-right[ <pre style="font-size: 13pt;" class="red"> <h1>This is a sample html</h1> <blockquote> <p>Maybe stories are just data with a soul.</p> <footer>—Brene Brown</footer> </blockquote> <div id="p1" class="parent"> Hmm <p>Hi!</p> How are you? <div class="child nice"> <p>Hello!</p> </div> </div> <p>Household 1</p> <div class="parent"> <p>Hi!</p> <blockquote class="child rebel"> <p>Don't talk to me!</p> </blockquote> </div> <span class="child"> <span class="parent child rebel"> <p>Clean your room!</p> </span> </span> <p>End of households</p> </pre> ] --- count: false .pull-left[ ## <i class="fab fa-css3-alt blue"></i> CSS selector <table class="gray" style="width:98%;margin-left:10px;margin-right:10px;"> <tr > <td class="remark-code">* </td><td> </td><td>selects all elements</td> </tr> <tr class="red"> <td class="remark-code">div</td><td> </td><td>selects all <span class="remark-code" style="font-size:16pt"><div></span> elements</td> </tr> <tr> <td class="remark-code">div, p</td><td> </td><td>selects all <span class="remark-code" style="font-size:16pt"><div></span> and <span class="remark-code" style="font-size:16pt"><p></span> elements</td> </tr> <tr> <td class="remark-code">div p</td><td> </td><td>selects all <span class="remark-code" style="font-size:16pt"><p></span> within <span class="remark-code" style="font-size:16pt"><div></span></td> 
</tr> <tr> <td class="remark-code">div > p</td><td> </td><td>selects all <span class="remark-code" style="font-size:16pt"><p></span> one level deep in <span class="remark-code" style="font-size:16pt"><div></span></td> </tr> <tr valign="top"> <td class="remark-code">div + p</td><td> </td><td>selects all <span class="remark-code" style="font-size:16pt"><p></span> immediately after a <span class="remark-code" style="font-size:16pt"><div></span></td> </tr> <tr> <td class="remark-code">div ~ p</td><td> </td><td>selects all <span class="remark-code" style="font-size:16pt"><p></span> preceded by a <span class="remark-code" style="font-size:16pt"><div></span></td> </tr> </table> ] .pull-right[ <pre style="font-size: 13pt"> <h1>This is a sample html</h1> <blockquote> <p>Maybe stories are just data with a soul.</p> <footer>—Brene Brown</footer> </blockquote> <span class="red"><div id="p1" class="parent"> Hmm <p>Hi!</p> How are you? <div class="child nice"> <p>Hello!</p> </div> </div></span> <p>Household 1</p> <span class="red"><div class="parent"> <p>Hi!</p> <blockquote class="child rebel"> <p>Don't talk to me!</p> </blockquote> </div></span> <span class="child"> <span class="parent child rebel"> <p>Clean your room!</p> </span> </span> <p>End of households</p> </pre> ] --- count: false .pull-left[ ## <i class="fab fa-css3-alt blue"></i> CSS selector <table class="gray" style="width:98%;margin-left:10px;margin-right:10px;"> <tr > <td class="remark-code">* </td><td> </td><td>selects all elements</td> </tr> <tr class="red"> <td class="remark-code">blockquote</td><td> </td><td>selects all <span class="remark-code" style="font-size:16pt"><blockquote></span> elements</td> </tr> <tr> <td class="remark-code">div, p</td><td> </td><td>selects all <span class="remark-code" style="font-size:16pt"><div></span> and <span class="remark-code" style="font-size:16pt"><p></span> elements</td> </tr> <tr> <td class="remark-code">div p</td><td> </td><td>selects all <span class="remark-code" 
style="font-size:16pt"><p></span> within <span class="remark-code" style="font-size:16pt"><div></span></td> </tr> <tr> <td class="remark-code">div > p</td><td> </td><td>selects all <span class="remark-code" style="font-size:16pt"><p></span> one level deep in <span class="remark-code" style="font-size:16pt"><div></span></td> </tr> <tr valign="top"> <td class="remark-code">div + p</td><td> </td><td>selects all <span class="remark-code" style="font-size:16pt"><p></span> immediately after a <span class="remark-code" style="font-size:16pt"><div></span></td> </tr> <tr> <td class="remark-code">div ~ p</td><td> </td><td>selects all <span class="remark-code" style="font-size:16pt"><p></span> preceded by a <span class="remark-code" style="font-size:16pt"><div></span></td> </tr> </table> ] .pull-right[ <pre style="font-size: 13pt"> <h1>This is a sample html</h1> <span class="red"><blockquote> <p>Maybe stories are just data with a soul.</p> <footer>—Brene Brown</footer> </blockquote></span> <div id="p1" class="parent"> Hmm <p>Hi!</p> How are you? 
<div class="child nice"> <p>Hello!</p> </div> </div> <p>Household 1</p> <div class="parent"> <p>Hi!</p> <blockquote class="child rebel"> <p>Don't talk to me!</p> </blockquote> </div> <span class="child"> <span class="parent child rebel"> <p>Clean your room!</p> </span> </span> <p>End of households</p> </pre> ] --- count: false .pull-left[ ## <i class="fab fa-css3-alt blue"></i> CSS selector <table class="gray" style="width:98%;margin-left:10px;margin-right:10px;"> <tr > <td class="remark-code">* </td><td> </td><td>selects all elements</td> </tr> <tr> <td class="remark-code">div</td><td> </td><td>selects all <span class="remark-code" style="font-size:16pt"><div></span> elements</td> </tr> <tr class="red"> <td class="remark-code">div, p</td><td> </td><td>selects all <span class="remark-code" style="font-size:16pt"><div></span> and <span class="remark-code" style="font-size:16pt"><p></span> elements</td> </tr> <tr> <td class="remark-code">div p</td><td> </td><td>selects all <span class="remark-code" style="font-size:16pt"><p></span> within <span class="remark-code" style="font-size:16pt"><div></span></td> </tr> <tr> <td class="remark-code">div > p</td><td> </td><td>selects all <span class="remark-code" style="font-size:16pt"><p></span> one level deep in <span class="remark-code" style="font-size:16pt"><div></span></td> </tr> <tr valign="top"> <td class="remark-code">div + p</td><td> </td><td>selects all <span class="remark-code" style="font-size:16pt"><p></span> immediately after a <span class="remark-code" style="font-size:16pt"><div></span></td> </tr> <tr> <td class="remark-code">div ~ p</td><td> </td><td>selects all <span class="remark-code" style="font-size:16pt"><p></span> preceded by a <span class="remark-code" style="font-size:16pt"><div></span></td> </tr> <tr> </table> ] .pull-right[ <pre style="font-size: 13pt"> <h1>This is a sample html</h1> <blockquote> <span class="red"><p>Maybe stories are just data with a soul.</p></span> <footer>—Brene Brown</footer> 
</blockquote> <span class="red"><div id="p1" class="parent"> Hmm <p>Hi!</p> How are you? <div class="child nice"> <p>Hello!</p> </div> </div></span> <span class="red"><p>Household 1</p></span> <span class="red"><div class="parent"> <p>Hi!</p> <blockquote class="child rebel"> <p>Don't talk to me!</p> </span> </div></span> <span class="child"> <span class="parent child rebel"> <span class="red"><p>Clean your room!</p></span> </span> </span> <span class="red"><p>End of households</p></span> </pre> ] --- count: false .pull-left[ ## <i class="fab fa-css3-alt blue"></i> CSS selector <table class="gray" style="width:98%;margin-left:10px;margin-right:10px;"> <tr > <td class="remark-code">* </td><td> </td><td>selects all elements</td> </tr> <tr> <td class="remark-code">div</td><td> </td><td>selects all <span class="remark-code" style="font-size:16pt"><div></span> elements</td> </tr> <tr> <td class="remark-code">div, p</td><td> </td><td>selects all <span class="remark-code" style="font-size:16pt"><div></span> and <span class="remark-code" style="font-size:16pt"><p></span> elements</td> </tr> <tr class="red"> <td class="remark-code">div p</td><td> </td><td>selects all <span class="remark-code" style="font-size:16pt"><p></span> within <span class="remark-code" style="font-size:16pt"><div></span></td> </tr> <tr> <td class="remark-code">div > p</td><td> </td><td>selects all <span class="remark-code" style="font-size:16pt"><p></span> one level deep in <span class="remark-code" style="font-size:16pt"><div></span></td> </tr> <tr valign="top"> <td class="remark-code">div + p</td><td> </td><td>selects all <span class="remark-code" style="font-size:16pt"><p></span> immediately after a <span class="remark-code" style="font-size:16pt"><div></span></td> </tr> <tr> <td class="remark-code">div ~ p</td><td> </td><td>selects all <span class="remark-code" style="font-size:16pt"><p></span> preceded by a <span class="remark-code" style="font-size:16pt"><div></span></td> </tr> <tr> </table> ] 
.pull-right[ <pre style="font-size: 13pt"> <h1>This is a sample html</h1> <blockquote> <p>Maybe stories are just data with a soul.</p> <footer>—Brene Brown</footer> </blockquote> <div id="p1" class="parent"> Hmm <span class="red"><p>Hi!</p></span> How are you? <div class="child nice"> <span class="red"><p>Hello!</p></span> </div> </div> <p>Household 1</p> <div class="parent"> <span class="red"><p>Hi!</p></span> <blockquote class="child rebel"> <span class="red"><p>Don't talk to me!</p></span> </blockquote> </div> <span class="child"> <span class="parent child rebel"> <p>Clean your room!</p> </span> </span> <p>End of households</p> </pre> ] --- count: false .pull-left[ ## <i class="fab fa-css3-alt blue"></i> CSS selector <table class="gray" style="width:98%;margin-left:10px;margin-right:10px;"> <tr > <td class="remark-code">* </td><td> </td><td>selects all elements</td> </tr> <tr> <td class="remark-code">div</td><td> </td><td>selects all <span class="remark-code" style="font-size:16pt"><div></span> elements</td> </tr> <tr> <td class="remark-code">div, p</td><td> </td><td>selects all <span class="remark-code" style="font-size:16pt"><div></span> and <span class="remark-code" style="font-size:16pt"><p></span> elements</td> </tr> <tr class="red"> <td class="remark-code">p div</td><td> </td><td>selects all <span class="remark-code" style="font-size:16pt"><div></span> within <span class="remark-code" style="font-size:16pt"><p></span></td> </tr> <tr> <td class="remark-code">div > p</td><td> </td><td>selects all <span class="remark-code" style="font-size:16pt"><p></span> one level deep in <span class="remark-code" style="font-size:16pt"><div></span></td> </tr> <tr valign="top"> <td class="remark-code">div + p</td><td> </td><td>selects all <span class="remark-code" style="font-size:16pt"><p></span> immediately after a <span class="remark-code" style="font-size:16pt"><div></span></td> </tr> <tr> <td class="remark-code">div ~ p</td><td> </td><td>selects all <span 
class="remark-code" style="font-size:16pt"><p></span> preceded by a <span class="remark-code" style="font-size:16pt"><div></span></td> </tr> <tr> </table> ] .pull-right[ <pre style="font-size: 13pt"> <h1>This is a sample html</h1> <blockquote> <p>Maybe stories are just data with a soul.</p> <footer>—Brene Brown</footer> </blockquote> <div id="p1" class="parent"> Hmm <p>Hi!</p> How are you? <div class="child nice"> <p>Hello!</p> </div> </div> <p>Household 1</p> <div class="parent"> <p>Hi!</p> <blockquote class="child rebel"> <p>Don't talk to me!</p> </blockquote> </div> <span class="child"> <span class="parent child rebel"> <p>Clean your room!</p> </span> </span> <p>End of households</p> </pre> ] --- count: false .pull-left[ ## <i class="fab fa-css3-alt blue"></i> CSS selector <table class="gray" style="width:98%;margin-left:10px;margin-right:10px;"> <tr > <td class="remark-code">* </td><td> </td><td>selects all elements</td> </tr> <tr> <td class="remark-code">div</td><td> </td><td>selects all <span class="remark-code" style="font-size:16pt"><div></span> elements</td> </tr> <tr> <td class="remark-code">div, p</td><td> </td><td>selects all <span class="remark-code" style="font-size:16pt"><div></span> and <span class="remark-code" style="font-size:16pt"><p></span> elements</td> </tr> <tr > <td class="remark-code">div p</td><td> </td><td>selects all <span class="remark-code" style="font-size:16pt"><p></span> within <span class="remark-code" style="font-size:16pt"><div></span></td> </tr> <tr class="red"> <td class="remark-code">div > p</td><td> </td><td>selects all <span class="remark-code" style="font-size:16pt"><p></span> one level deep in <span class="remark-code" style="font-size:16pt"><div></span></td> </tr> <tr valign="top"> <td class="remark-code">div + p</td><td> </td><td>selects all <span class="remark-code" style="font-size:16pt"><p></span> immediately after a <span class="remark-code" style="font-size:16pt"><div></span></td> </tr> <tr> <td 
class="remark-code">div ~ p</td><td> </td><td>selects all <span class="remark-code" style="font-size:16pt"><p></span> preceded by a <span class="remark-code" style="font-size:16pt"><div></span></td> </tr> <tr> </table> ] .pull-right[ <pre style="font-size: 13pt"> <h1>This is a sample html</h1> <blockquote> <p>Maybe stories are just data with a soul.</p> <footer>—Brene Brown</footer> </blockquote> <div id="p1" class="parent"> Hmm <span class="red"><p>Hi!</p></span> How are you? <div class="child nice"> <span class="red"><p>Hello!</p></span> </div> </div> <p>Household 1</p> <div class="parent"> <span class="red"><p>Hi!</p></span> <blockquote class="child rebel"> <p>Don't talk to me!</p> </blockquote> </div> <span class="child"> <span class="parent child rebel"> <p>Clean your room!</p> </span> </span> <p>End of households</p> </pre> ] <div style="position:absolute;top:10px;left:900px;width:300px;background-color:white;border:1px solid black;font-size:16pt;padding:2px;"> <i class="fas fa-exclamation-triangle"></i> Ignores inline elements like <code>span</code>, <code>i</code>, <code>b</code>,... 
</div> --- count: false .pull-left[ ## <i class="fab fa-css3-alt blue"></i> CSS selector <table class="gray" style="width:98%;margin-left:10px;margin-right:10px;"> <tr > <td class="remark-code">* </td><td> </td><td>selects all elements</td> </tr> <tr> <td class="remark-code">div</td><td> </td><td>selects all <span class="remark-code" style="font-size:16pt"><div></span> elements</td> </tr> <tr> <td class="remark-code">div, p</td><td> </td><td>selects all <span class="remark-code" style="font-size:16pt"><div></span> and <span class="remark-code" style="font-size:16pt"><p></span> elements</td> </tr> <tr > <td class="remark-code">div p</td><td> </td><td>selects all <span class="remark-code" style="font-size:16pt"><p></span> within <span class="remark-code" style="font-size:16pt"><div></span></td> </tr> <tr> <td class="remark-code">div > p</td><td> </td><td>selects all <span class="remark-code" style="font-size:16pt"><p></span> one level deep in <span class="remark-code" style="font-size:16pt"><div></span></td> </tr> <tr valign="top" class="red"> <td class="remark-code">div + p</td><td> </td><td>selects all <span class="remark-code" style="font-size:16pt"><p></span> immediately after a <span class="remark-code" style="font-size:16pt"><div></span></td> </tr> <tr> <td class="remark-code">div ~ p</td><td> </td><td>selects all <span class="remark-code" style="font-size:16pt"><p></span> preceded by a <span class="remark-code" style="font-size:16pt"><div></span></td> </tr> <tr> </table> ] .pull-right[ <pre style="font-size: 13pt"> <h1>This is a sample html</h1> <blockquote> <p>Maybe stories are just data with a soul.</p> <footer>—Brene Brown</footer> </blockquote> <div id="p1" class="parent"> Hmm <p>Hi!</p> How are you? 
<div class="child nice"> <p>Hello!</p> </div> </div> <span class="red"><p>Household 1</p></span> <div class="parent"> <p>Hi!</p> <blockquote class="child rebel"> <p>Don't talk to me!</p> </blockquote> </div> <span class="child"> <span class="parent child rebel"> <span class="red"><p>Clean your room!</p></span> </span> </span> <p>End of households</p> </pre> ] <div style="position:absolute;top:10px;left:900px;width:300px;background-color:white;border:1px solid black;font-size:16pt;padding:2px;"> <i class="fas fa-exclamation-triangle"></i> Ignores inline elements like <code>span</code>, <code>i</code>, <code>b</code>,... </div> --- count: false .pull-left[ ## <i class="fab fa-css3-alt blue"></i> CSS selector <table class="gray" style="width:98%;margin-left:10px;margin-right:10px;"> <tr > <td class="remark-code">* </td><td> </td><td>selects all elements</td> </tr> <tr> <td class="remark-code">div</td><td> </td><td>selects all <span class="remark-code" style="font-size:16pt"><div></span> elements</td> </tr> <tr> <td class="remark-code">div, p</td><td> </td><td>selects all <span class="remark-code" style="font-size:16pt"><div></span> and <span class="remark-code" style="font-size:16pt"><p></span> elements</td> </tr> <tr > <td class="remark-code">div p</td><td> </td><td>selects all <span class="remark-code" style="font-size:16pt"><p></span> within <span class="remark-code" style="font-size:16pt"><div></span></td> </tr> <tr> <td class="remark-code">div > p</td><td> </td><td>selects all <span class="remark-code" style="font-size:16pt"><p></span> one level deep in <span class="remark-code" style="font-size:16pt"><div></span></td> </tr> <tr> <td class="remark-code">div + p</td><td> </td><td>selects all <span class="remark-code" style="font-size:16pt"><p></span> immediately after a <span class="remark-code" style="font-size:16pt"><div></span></td> </tr> <tr class="red"> <td class="remark-code">div ~ p</td><td> </td><td>selects all <span class="remark-code" 
style="font-size:16pt"><p></span> preceded by a <span class="remark-code" style="font-size:16pt"><div></span></td> </tr> </table> ] .pull-right[ <pre style="font-size: 13pt"> <h1>This is a sample html</h1> <blockquote> <p>Maybe stories are just data with a soul.</p> <footer>—Brene Brown</footer> </blockquote> <div id="p1" class="parent"> Hmm <p>Hi!</p> How are you? <div class="child nice"> <p>Hello!</p> </div> </div> <span class="red"><p>Household 1</p></span> <div class="parent"> <p>Hi!</p> <blockquote class="child rebel"> <p>Don't talk to me!</p> </blockquote> </div> <span class="child"> <span class="parent child rebel"> <span class="red"><p>Clean your room!</p></span> </span> </span> <span class="red"><p>End of households</p></span> </pre> ] --- count: false .pull-left[ ## <i class="fab fa-css3-alt blue"></i> CSS selector <table class="gray" style="width:98%;margin-left:10px;margin-right:10px;"> <tr> <td class="remark-code" valign="top">.classname</td><td> </td><td>selects all elements with the attribute <span class="remark-code" style="font-size:16pt">class="classname"</span>. </td> </tr> <tr> <td class="remark-code" valign="top">.c1.c2</td><td> </td><td>selects all elements with <em>both</em> <span class="remark-code" style="font-size:16pt">c1</span> and <span class="remark-code" style="font-size:16pt">c2</span> within its class attribute. </td> </tr> <tr> <td class="remark-code" valign="top">.c1 .c2</td><td> </td><td>selects all elements with class <span class="remark-code" style="font-size:16pt">c2</span> that is a descendant of an element with class <span class="remark-code" style="font-size:16pt">c1</span>. </td> </tr> <tr> <td class="remark-code" valign="top">#idname</td><td> </td><td>selects all elements with the attribute <span class="remark-code" style="font-size:16pt">id="idname"</span>. 
</td> </tr> </table> ] .pull-right[ <pre style="font-size: 13pt"> <h1>This is a sample html</h1> <blockquote> <p>Maybe stories are just data with a soul.</p> <footer>—Brene Brown</footer> </blockquote> <div id="p1" class="parent"> Hmm <p>Hi!</p> How are you? <div class="child nice"> <p>Hello!</p> </div> </div> <p>Household 1</p> <div class="parent"> <p>Hi!</p> <blockquote class="child rebel"> <p>Don't talk to me!</p> </blockquote> </div> <span class="child"> <span class="parent child rebel"></span> <p>Clean your room!</p> </span></span> </span> <p>End of households</p> </pre> ] --- count: false .pull-left[ ## <i class="fab fa-css3-alt blue"></i> CSS selector <table class="gray" style="width:98%;margin-left:10px;margin-right:10px;"> <tr class="red"> <td class="remark-code" valign="top">.parent</td><td> </td><td>selects all elements with the attribute <span class="remark-code" style="font-size:16pt">class="parent"</span>. </td> </tr> <tr> <td class="remark-code" valign="top">.c1.c2</td><td> </td><td>selects all elements with <em>both</em> <span class="remark-code" style="font-size:16pt">c1</span> and <span class="remark-code" style="font-size:16pt">c2</span> within its class attribute. </td> </tr> <tr> <td class="remark-code" valign="top">.c1 .c2</td><td> </td><td>selects all elements with class <span class="remark-code" style="font-size:16pt">c2</span> that is a descendant of an element with class <span class="remark-code" style="font-size:16pt">c1</span>. </td> </tr> <tr> <td class="remark-code" valign="top">#idname</td><td> </td><td>selects all elements with the attribute <span class="remark-code" style="font-size:16pt">id="idname"</span>. </td> </tr> </table> <div style="position:absolute;top:10px;left:900px;width:300px;background-color:white;border:1px solid black;font-size:16pt;padding:2px;"> <i class="fas fa-exclamation-triangle"></i> Note some offspring do not inherit class from their parents. 
</div> ] .pull-right[ <pre style="font-size: 13pt"> <h1>This is a sample html</h1> <blockquote> <p>Maybe stories are just data with a soul.</p> <footer>—Brene Brown</footer> </blockquote> <span class="red"><div id="p1" class="parent"> Hmm <p>Hi!</p> How are you? <div class="child nice"> <p>Hello!</p> </div> </div></span> <p>Household 1</p> <span class="red"><div class="parent"> <p>Hi!</p> <blockquote class="child rebel"> <p>Don't talk to me!</p> </blockquote> </div></span> <span class="child"> <span class="red"><span class="parent child rebel"></span> <p>Clean your room!</p> <span class="red"></span></span> </span> <p>End of households</p> </pre> ] --- count: false .pull-left[ ## <i class="fab fa-css3-alt blue"></i> CSS selector <table class="gray" style="width:98%;margin-left:10px;margin-right:10px;"> <td class="remark-code" valign="top">.classname</td><td> </td><td>selects all elements with the attribute <span class="remark-code" style="font-size:16pt">class="classname"</span>. </td> </tr> <tr class="red"> <td class="remark-code" valign="top">.child.rebel</td><td> </td><td>selects all elements with <em>both</em> <span class="remark-code" style="font-size:16pt">child</span> and <span class="remark-code" style="font-size:16pt">rebel</span> within its class attribute. </td> </tr> <tr> <td class="remark-code" valign="top">.c1 .c2</td><td> </td><td>selects all elements with class <span class="remark-code" style="font-size:16pt">c2</span> that is a descendant of an element with class <span class="remark-code" style="font-size:16pt">c1</span>. </td> </tr> <tr> <td class="remark-code" valign="top">#idname</td><td> </td><td>selects all elements with the attribute <span class="remark-code" style="font-size:16pt">id="idname"</span>. 
</td> </tr> </table> ] .pull-right[ <pre style="font-size: 13pt"> <h1>This is a sample html</h1> <blockquote> <p>Maybe stories are just data with a soul.</p> <footer>—Brene Brown</footer> </blockquote> <div id="p1" class="parent"> Hmm <p>Hi!</p> How are you? <div class="child nice"> <p>Hello!</p> </div> </div> <p>Household 1</p> <div class="parent"> <p>Hi!</p> <span class="red"><blockquote class="child rebel"> <p>Don't talk to me!</p> </blockquote></span> </div> <span class="child"> <span class="red"><span class="parent child rebel"></span> <p>Clean your room!</p> <span class="red"></span></span> </span> <p>End of households</p> </pre> ] --- count: false .pull-left[ ## <i class="fab fa-css3-alt blue"></i> CSS selector <table class="gray" style="width:98%;margin-left:10px;margin-right:10px;"> <tr> <td class="remark-code" valign="top">.classname</td><td> </td><td>selects all elements with the attribute <span class="remark-code" style="font-size:16pt">class="classname"</span>. </td> </tr> <tr> <td class="remark-code" valign="top">.c1.c2</td><td> </td><td>selects all elements with <em>both</em> <span class="remark-code" style="font-size:16pt">c1</span> and <span class="remark-code" style="font-size:16pt">c2</span> within its class attribute. </td> </tr> <tr class="red"> <td class="remark-code" valign="top">.parent .rebel</td><td> </td><td>selects all elements with class <span class="remark-code" style="font-size:16pt">rebel</span> that is a descendant of an element with class <span class="remark-code" style="font-size:16pt">parent</span>. </td> </tr> <tr> <td class="remark-code" valign="top">#idname</td><td> </td><td>selects all elements with the attribute <span class="remark-code" style="font-size:16pt">id="idname"</span>. 
</td> </tr> </table> ] .pull-right[ <pre style="font-size: 13pt"> <h1>This is a sample html</h1> <blockquote> <p>Maybe stories are just data with a soul.</p> <footer>—Brene Brown</footer> </blockquote> <div id="p1" class="parent"> Hmm <p>Hi!</p> How are you? <div class="child nice"> <p>Hello!</p> </div> </div> <p>Household 1</p> <div class="parent"> <p>Hi!</p> <span class="red"><blockquote class="child rebel"> <p>Don't talk to me!</p> </blockquote></span> </div> <span class="child"> <span class="parent child rebel"> <p>Clean your room!</p> </span> </span> <p>End of households</p> </pre> ] --- count: false .pull-left[ ## <i class="fab fa-css3-alt blue"></i> CSS selector <table class="gray" style="width:98%;margin-left:10px;margin-right:10px;"> <td class="remark-code" valign="top">.classname</td><td> </td><td>selects all elements with the attribute <span class="remark-code" style="font-size:16pt">class="classname"</span>. </td> </tr> <tr> <td class="remark-code" valign="top">.c1.c2</td><td> </td><td>selects all elements with <em>both</em> <span class="remark-code" style="font-size:16pt">c1</span> and <span class="remark-code" style="font-size:16pt">c2</span> within its class attribute. </td> </tr> <tr> <td class="remark-code" valign="top">.c1 .c2</td><td> </td><td>selects all elements with class <span class="remark-code" style="font-size:16pt">c2</span> that is a descendant of an element with class <span class="remark-code" style="font-size:16pt">c1</span>. </td> </tr> <tr class="red"> <td class="remark-code" valign="top">#p1</td><td> </td><td>selects all elements with the attribute <span class="remark-code" style="font-size:16pt">id="p1"</span>. </td> </tr> </table> ] .pull-right[ <pre style="font-size: 13pt"> <h1>This is a sample html</h1> <blockquote> <p>Maybe stories are just data with a soul.</p> <footer>—Brene Brown</footer> </blockquote> <span class="red"><div id="p1" class="parent"> Hmm <p>Hi!</p> How are you? 
<div class="child nice"> <p>Hello!</p> </div> </div></span> <p>Household 1</p> <div class="parent"> <p>Hi!</p> <blockquote class="child rebel"> <p>Don't talk to me!</p> </blockquote> </div> <span class="child"> <span class="parent child rebel"> <p>Clean your room!</p> </span> </span> <p>End of households</p> </pre>
]

<div style="position:absolute;top:10px;left:900px;width:300px;background-color:white;border:1px solid black;font-size:16pt;padding:2px;">
<i class="fas fa-exclamation-triangle"></i> Unlike <code style="font-size:16pt">class</code>, an element can have only one <code style="font-size:16pt">id</code> value, and that value must be unique in the whole HTML document.
</div>

---

## <i class='fab fa-js-square yellow'></i> JavaScript (JS)*

* JS is a programming language that enables interactive components in HTML documents.
* There are two ways to insert JS into an HTML document:
  + **internally**, by defining it within a `<script>` element:
  ```html
  <script>
  document.getElementById("p1").innerHTML = "content";
  </script>
  ```
  + **externally**, by using the `src` attribute to refer to an external file:
  ```html
  <script src="js/myjs.js"></script>
  ```

---
class: inverse middle

## Web scraping 🕸

---

.pull-left[
.center[<img src = "img/browser-220.png", width = "90%", style = "box-shadow: 3px 5px 3px 1px #00000080;"></img>]
]

.pull-right[
Use {rvest} `>= v1.0.0` (if not, update) <img src="img/rvest.png" width="100%">

```r
library(rvest)
course <- "https://stats220.earo.me"
stats220 <- read_html(course)
stats220
```

```
#> {html_document}
#> <html>
#> [1] <head>\n<meta http-equiv="Content ...
#> [2] <body>\n\n <header class="site ...
``` ] --- .pull-left[ ## <i class="fas fa-search-plus"></i> Inspect elements <br> <br> .center[<img src = "img/rvest/developer-tool.png", width = "90%", style = "box-shadow: 3px 5px 3px 1px #00000080;"></img>] ] .pull-right[ ## <i class="fas fa-puzzle-piece"></i> [CSS selectors](https://selectorgadget.com/) <br> <br> .center[<img src = "img/rvest/css-selector.png", width = "90%", style = "box-shadow: 3px 5px 3px 1px #00000080;"></img>] ] --- .pull-left[ <br> <br> <br> .center[<img src = "img/rvest/navbar-right.png", width = "90%", style = "box-shadow: 3px 5px 3px 1px #00000080;"></img>] ] .pull-right[ `html_element()` select element ```r navbar <- stats220 %>% html_element(".navbar-right") navbar ``` ``` #> {html_node} #> <ul class="nav navbar-nav navbar-right"> #> [1] <li><a href="/pages/info/"><i cla ... #> [2] <li><a href="/pages/data/"><i cla ... #> [3] <li class="dropdown">\n ... ``` `html_name()` get element name ```r navbar %>% html_name() ``` ``` #> [1] "ul" ``` ] --- .pull-left[ <br> <br> <br> .center[<img src = "img/rvest/navbar-right.png", width = "90%", style = "box-shadow: 3px 5px 3px 1px #00000080;"></img>] ] .pull-right[ `html_children()` get element children ```r navbar %>% html_children() ``` ``` #> {xml_nodeset (3)} #> [1] <li><a href="/pages/info/"><i cla ... #> [2] <li><a href="/pages/data/"><i cla ... #> [3] <li class="dropdown">\n ... 
``` `html_name()` get element name ```r navbar %>% html_children() %>% html_name() ``` ``` #> [1] "li" "li" "li" ``` ] --- .pull-left[ <br> <br> <br> .center[<img src = "img/rvest/navbar-right.png", width = "90%", style = "box-shadow: 3px 5px 3px 1px #00000080;"></img>] ] .pull-right[ `html_text2()` get element text ```r navbar %>% html_children() %>% html_text2() ``` ``` #> [1] "Info" #> [2] "Data" #> [3] "Assignment\nAssignment 1\nAssignment 2\nAssignment 3\nAssignment 1\nAssignment 2" ``` `html_attr()` get element attributes ```r navbar %>% html_elements("a") %>% html_attr("href") ``` ``` #> [1] "/pages/info/" #> [2] "/pages/data/" #> [3] "#" #> [4] "/assignments/assignment1/" #> [5] "/assignments/assignment2/" #> [6] "/assignments/assignment3/" #> [7] "/assignments/assignment1-sol/" #> [8] "/assignments/assignment2-sol/" ``` ] --- ## `url_absolute()`: turn relative urls to absolute urls ```r navbar %>% html_elements("a") %>% html_attr("href") %>% url_absolute(course) ``` ``` #> [1] "https://stats220.earo.me/pages/info/" #> [2] "https://stats220.earo.me/pages/data/" #> [3] "https://stats220.earo.me#" #> [4] "https://stats220.earo.me/assignments/assignment1/" #> [5] "https://stats220.earo.me/assignments/assignment2/" #> [6] "https://stats220.earo.me/assignments/assignment3/" #> [7] "https://stats220.earo.me/assignments/assignment1-sol/" #> [8] "https://stats220.earo.me/assignments/assignment2-sol/" ``` --- .pull-left[ <br> <br> <br> .center[<img src = "img/rvest/navbar-right.png", width = "90%", style = "box-shadow: 3px 5px 3px 1px #00000080;"></img>] ] .pull-right[ `html_elements()` select elements ```r navbar %>% html_elements("i") ``` ``` #> {xml_nodeset (8)} #> [1] <i class="fas fa-info-circle"></i> #> [2] <i class="fas fa-database"></i> #> [3] <i class="fas fa-laptop-code"></i> #> [4] <i class="fas fa-laptop-code"></i> #> [5] <i class="fas fa-laptop-code"></i> #> [6] <i class="fas fa-laptop-code"></i> #> [7] <i class="fas fa-key"></i> #> [8] <i class="fas 
fa-key"></i> ``` ⬆️ [fontawesome](http://fontawesome.com) icons ] --- .pull-left[ <br> ```r stats220_info <- read_html( "https://stats220.earo.me/pages/info/") ``` .center[<img src = "img/rvest/info-h3.png", width = "90%", style = "box-shadow: 3px 5px 3px 1px #00000080;"></img>] ] .pull-right[ select all `<h3>` elements ```r stats220_info %>% html_elements("h3") %>% html_text() ``` ``` #> [1] "Timetable" "Software" "Textbook" #> [4] "Reading" "Credits" ``` select `#timetable` id ```r stats220_info %>% html_elements("#timetable") %>% html_text() ``` ``` #> [1] "Timetable" ``` ] --- .pull-left[ <br> <br> <br> .center[<img src = "img/rvest/info-table.png", width = "90%", style = "box-shadow: 3px 5px 3px 1px #00000080;"></img>] ] .pull-right[ select the first `<table>` element ```r stats220_info %>% html_element("table") %>% html_table() ``` ``` #> # A tibble: 8 x 4 #> `` Day Time Venue #> <chr> <chr> <chr> <chr> #> 1 "Lecture" Wed 16-17 Eng1.439 #> 2 "" Fri 10-11 Eng1.439 #> 3 "Lab" Wed 09-10 303S.175 #> 4 "" Wed 12-13 302.G40 #> 5 "" Wed 13-14 302.G40 #> 6 "" Fri 11-12 302.G40 #> 7 "" Thu 16-17 offshore #> 8 "Office hour" Thu 14-15 303.323 ``` ] ??? 
* table on the web primarily for presentation purpose, not data storage * isn't clean from web scraping --- ## Download pdf slides at once ```r stats220_urls <- stats220 %>% html_elements(".panel-body .btn") %>% html_attr("href") stats220_urls ``` ``` #> [1] "/objs/obj01" #> [2] "/R/01-intro.R" #> [3] "/01-intro.Rmd" #> [4] "/01-intro.pdf" #> [5] "/labs/lab01" #> [6] "/labs/lab01-sol" #> [7] "/objs/obj02" #> [8] "/R/02-import-export.R" #> [9] "/02-import-export.Rmd" #> [10] "/02-import-export.pdf" #> [11] "/labs/lab02" #> [12] "/labs/lab02-sol" #> [13] "/objs/obj03" #> [14] "/R/03-data-vis.R" #> [15] "/03-data-vis.Rmd" #> [16] "/03-data-vis.pdf" #> [17] "/labs/lab03" #> [18] "/labs/lab03-sol" #> [19] "/objs/obj04" #> [20] "/R/04-data-wrangle.R" #> [21] "/04-data-wrangle.Rmd" #> [22] "/04-data-wrangle.pdf" #> [23] "/labs/lab04" #> [24] "/labs/lab04-sol" #> [25] "/objs/obj05" #> [26] "/R/05-fcts-dates.R" #> [27] "/05-fcts-dates.Rmd" #> [28] "/05-fcts-dates.pdf" #> [29] "/labs/lab05" #> [30] "/labs/lab05-sol" #> [31] "/objs/obj06" #> [32] "/R/06-tidy-data.R" #> [33] "/06-tidy-data.Rmd" #> [34] "/06-tidy-data.pdf" #> [35] "/labs/lab06" #> [36] "/labs/lab06-sol" #> [37] "/objs/obj07" #> [38] "/R/07-data-vis2.R" #> [39] "/07-data-vis2.Rmd" #> [40] "/07-data-vis2.pdf" #> [41] "/labs/lab07" #> [42] "/objs/obj08" #> [43] "/R/08-rmd.R" #> [44] "/08-rmd.Rmd" #> [45] "/08-rmd.pdf" #> [46] "/labs/lab08" #> [47] "/objs/obj09" #> [48] "/R/09-web-scrape.R" #> [49] "/09-web-scrape.Rmd" #> [50] "/09-web-scrape.pdf" #> [51] "/" #> [52] "/R/" #> [53] "/" #> [54] "/" #> [55] "/R/" #> [56] "/" #> [57] "/" #> [58] "/R/" #> [59] "/" ``` --- ## Download pdf slides at once ```r library(stringr) # manipulate strings in week 10 (pdf_urls <- stats220_urls[str_detect(stats220_urls, "pdf")]) ``` ``` #> [1] "/01-intro.pdf" #> [2] "/02-import-export.pdf" #> [3] "/03-data-vis.pdf" #> [4] "/04-data-wrangle.pdf" #> [5] "/05-fcts-dates.pdf" #> [6] "/06-tidy-data.pdf" #> [7] "/07-data-vis2.pdf" #> [8] 
"/08-rmd.pdf" #> [9] "/09-web-scrape.pdf" ``` ```r pdf_files <- str_remove(pdf_urls, "/") purrr::walk2( # below for week 11 url_absolute(pdf_urls, course), pdf_files, ~ download.file(url = .x, destfile = .y)) ``` --- class: inverse middle ## REST API --- ## <i class="fab fa-github"></i> [Github REST API](https://docs.github.com/en/rest) .pull-left[ * Each URL is called a **request**. * The data sent back to you is called an HTTP **response** that consists of headers and a body. The root-endpoint of Github's API is <https://api.github.com>. ```r library(httr) endpoint <- "https://api.github.com" GET(endpoint) ``` ] .pull-right[ ``` #> Response [https://api.github.com] *#> Date: 2021-05-12 00:20 *#> Status: 200 *#> Content-Type: application/json; charset=utf-8 *#> Size: 2.31 kB #> { #> "current_user_url": "https://api.gi... #> "current_user_authorizations_html_u... #> "authorizations_url": "https://api.... #> "code_search_url": "https://api.git... #> "commit_search_url": "https://api.g... #> "emails_url": "https://api.github.c... #> "emojis_url": "https://api.github.c... #> "events_url": "https://api.github.c... #> "feeds_url": "https://api.github.co... #> ... ``` ] --- ## HTTP methods <br> * .brown[`GET`] to **retrieve** resource data/information only, and NO change in state of the resource * .brown[`POST`] to **create** new subordinate resources, e.g. upload a file * .brown[`PUT`] to **update/replace** an existing resource in its entirety * .brown[`DELETE`] to **delete** resources * .brown[`PATCH`] to make **partial update** on a resource (not all browsers support `PATCH`) --- ## Path .pull-left[ The **path** determines the resource you’re requesting for. 
[`GET /repos/{owner}/{repo}`](https://docs.github.com/en/rest/reference/repos#get-a-repository)

```r
path <- "/repos/STATS-UOA/stats220"
resp <- GET(modify_url(endpoint, path = path))
resp
```
]

.pull-right[
```
#> Response [https://api.github.com/repos/STATS-UOA/stats220]
#> Date: 2021-05-12 00:20
#> Status: 200
#> Content-Type: application/json; charset=utf-8
#> Size: 6.41 kB
#> {
#> "id": 242031925,
#> "node_id": "MDEwOlJlcG9zaXRvcnkyNDI...
#> "name": "stats220",
#> "full_name": "STATS-UOA/stats220",
#> "private": false,
#> "owner": {
#> "login": "STATS-UOA",
#> "id": 62915494,
#> "node_id": "MDEyOk9yZ2FuaXphdGlvb...
#> ...
```
]

---

## Parse the response

.pull-left[
```r
http_type(resp)
```

```
#> [1] "application/json"
```

Other content types a response can have include:

* `"image/png"`
* `"text/plain"`
* `"text/csv"`
* `...`
]

.pull-right[
```r
content(resp)
```

```
#> $id
#> [1] 242031925
#>
#> $node_id
#> [1] "MDEwOlJlcG9zaXRvcnkyNDIwMzE5MjU="
#>
#> $name
#> [1] "stats220"
#>
#> $full_name
#> [1] "STATS-UOA/stats220"
#>
#> $private
#> [1] FALSE
#>
#> $owner
#> $owner$login
#> [1] "STATS-UOA"
#>
#> $owner$id
#> [1] 62915494
#>
#> $owner$node_id
#> [1] "MDEyOk9yZ2FuaXphdGlvbjYyOTE1NDk0"
#>
#> $owner$avatar_url
#> [1] "https://avatars.githubusercontent.com/u/62915494?v=4"
#>
#> $owner$gravatar_id
#> [1] ""
#>
#> $owner$url
#> [1] "https://api.github.com/users/STATS-UOA"
#>
#> $owner$html_url
#> [1] "https://github.com/STATS-UOA"
#>
#> $owner$followers_url
#> [1] "https://api.github.com/users/STATS-UOA/followers"
#>
#> $owner$following_url
#> [1] "https://api.github.com/users/STATS-UOA/following{/other_user}"
#>
#> $owner$gists_url
#> [1] "https://api.github.com/users/STATS-UOA/gists{/gist_id}"
#>
#> $owner$starred_url
#> [1] "https://api.github.com/users/STATS-UOA/starred{/owner}{/repo}"
#>
#> $owner$subscriptions_url
#> [1] "https://api.github.com/users/STATS-UOA/subscriptions"
#>
#> $owner$organizations_url
#> [1] "https://api.github.com/users/STATS-UOA/orgs"
#>
#>
$owner$repos_url #> [1] "https://api.github.com/users/STATS-UOA/repos" #> #> $owner$events_url #> [1] "https://api.github.com/users/STATS-UOA/events{/privacy}" #> #> $owner$received_events_url #> [1] "https://api.github.com/users/STATS-UOA/received_events" #> #> $owner$type #> [1] "Organization" #> #> $owner$site_admin #> [1] FALSE #> #> #> $html_url #> [1] "https://github.com/STATS-UOA/stats220" #> #> $description #> [1] "STATS 220 Data Technology @ the University of Auckland" #> #> $fork #> [1] FALSE #> #> $url #> [1] "https://api.github.com/repos/STATS-UOA/stats220" #> #> $forks_url #> [1] "https://api.github.com/repos/STATS-UOA/stats220/forks" #> #> $keys_url #> [1] "https://api.github.com/repos/STATS-UOA/stats220/keys{/key_id}" #> #> $collaborators_url #> [1] "https://api.github.com/repos/STATS-UOA/stats220/collaborators{/collaborator}" #> #> $teams_url #> [1] "https://api.github.com/repos/STATS-UOA/stats220/teams" #> #> $hooks_url #> [1] "https://api.github.com/repos/STATS-UOA/stats220/hooks" #> #> $issue_events_url #> [1] "https://api.github.com/repos/STATS-UOA/stats220/issues/events{/number}" #> #> $events_url #> [1] "https://api.github.com/repos/STATS-UOA/stats220/events" #> #> $assignees_url #> [1] "https://api.github.com/repos/STATS-UOA/stats220/assignees{/user}" #> #> $branches_url #> [1] "https://api.github.com/repos/STATS-UOA/stats220/branches{/branch}" #> #> $tags_url #> [1] "https://api.github.com/repos/STATS-UOA/stats220/tags" #> #> $blobs_url #> [1] "https://api.github.com/repos/STATS-UOA/stats220/git/blobs{/sha}" #> #> $git_tags_url #> [1] "https://api.github.com/repos/STATS-UOA/stats220/git/tags{/sha}" #> #> $git_refs_url #> [1] "https://api.github.com/repos/STATS-UOA/stats220/git/refs{/sha}" #> #> $trees_url #> [1] "https://api.github.com/repos/STATS-UOA/stats220/git/trees{/sha}" #> #> $statuses_url #> [1] "https://api.github.com/repos/STATS-UOA/stats220/statuses/{sha}" #> #> $languages_url #> [1] 
"https://api.github.com/repos/STATS-UOA/stats220/languages" #> #> $stargazers_url #> [1] "https://api.github.com/repos/STATS-UOA/stats220/stargazers" #> #> $contributors_url #> [1] "https://api.github.com/repos/STATS-UOA/stats220/contributors" #> #> $subscribers_url #> [1] "https://api.github.com/repos/STATS-UOA/stats220/subscribers" #> #> $subscription_url #> [1] "https://api.github.com/repos/STATS-UOA/stats220/subscription" #> #> $commits_url #> [1] "https://api.github.com/repos/STATS-UOA/stats220/commits{/sha}" #> #> $git_commits_url #> [1] "https://api.github.com/repos/STATS-UOA/stats220/git/commits{/sha}" #> #> $comments_url #> [1] "https://api.github.com/repos/STATS-UOA/stats220/comments{/number}" #> #> $issue_comment_url #> [1] "https://api.github.com/repos/STATS-UOA/stats220/issues/comments{/number}" #> #> $contents_url #> [1] "https://api.github.com/repos/STATS-UOA/stats220/contents/{+path}" #> #> $compare_url #> [1] "https://api.github.com/repos/STATS-UOA/stats220/compare/{base}...{head}" #> #> $merges_url #> [1] "https://api.github.com/repos/STATS-UOA/stats220/merges" #> #> $archive_url #> [1] "https://api.github.com/repos/STATS-UOA/stats220/{archive_format}{/ref}" #> #> $downloads_url #> [1] "https://api.github.com/repos/STATS-UOA/stats220/downloads" #> #> $issues_url #> [1] "https://api.github.com/repos/STATS-UOA/stats220/issues{/number}" #> #> $pulls_url #> [1] "https://api.github.com/repos/STATS-UOA/stats220/pulls{/number}" #> #> $milestones_url #> [1] "https://api.github.com/repos/STATS-UOA/stats220/milestones{/number}" #> #> $notifications_url #> [1] "https://api.github.com/repos/STATS-UOA/stats220/notifications{?since,all,participating}" #> #> $labels_url #> [1] "https://api.github.com/repos/STATS-UOA/stats220/labels{/name}" #> #> $releases_url #> [1] "https://api.github.com/repos/STATS-UOA/stats220/releases{/id}" #> #> $deployments_url #> [1] "https://api.github.com/repos/STATS-UOA/stats220/deployments" #> #> $created_at #> [1] 
"2020-02-21T01:53:12Z" #> #> $updated_at #> [1] "2021-05-11T23:22:45Z" #> #> $pushed_at #> [1] "2021-05-11T23:22:42Z" #> #> $git_url #> [1] "git://github.com/STATS-UOA/stats220.git" #> #> $ssh_url #> [1] "git@github.com:STATS-UOA/stats220.git" #> #> $clone_url #> [1] "https://github.com/STATS-UOA/stats220.git" #> #> $svn_url #> [1] "https://github.com/STATS-UOA/stats220" #> #> $homepage #> [1] "https://stats220.earo.me" #> #> $size #> [1] 164407 #> #> $stargazers_count #> [1] 4 #> #> $watchers_count #> [1] 4 #> #> $language #> [1] "HTML" #> #> $has_issues #> [1] TRUE #> #> $has_projects #> [1] TRUE #> #> $has_downloads #> [1] TRUE #> #> $has_wiki #> [1] TRUE #> #> $has_pages #> [1] FALSE #> #> $forks_count #> [1] 0 #> #> $mirror_url #> NULL #> #> $archived #> [1] FALSE #> #> $disabled #> [1] FALSE #> #> $open_issues_count #> [1] 5 #> #> $license #> NULL #> #> $forks #> [1] 0 #> #> $open_issues #> [1] 5 #> #> $watchers #> [1] 4 #> #> $default_branch #> [1] "master" #> #> $temp_clone_token #> NULL #> #> $organization #> $organization$login #> [1] "STATS-UOA" #> #> $organization$id #> [1] 62915494 #> #> $organization$node_id #> [1] "MDEyOk9yZ2FuaXphdGlvbjYyOTE1NDk0" #> #> $organization$avatar_url #> [1] "https://avatars.githubusercontent.com/u/62915494?v=4" #> #> $organization$gravatar_id #> [1] "" #> #> $organization$url #> [1] "https://api.github.com/users/STATS-UOA" #> #> $organization$html_url #> [1] "https://github.com/STATS-UOA" #> #> $organization$followers_url #> [1] "https://api.github.com/users/STATS-UOA/followers" #> #> $organization$following_url #> [1] "https://api.github.com/users/STATS-UOA/following{/other_user}" #> #> $organization$gists_url #> [1] "https://api.github.com/users/STATS-UOA/gists{/gist_id}" #> #> $organization$starred_url #> [1] "https://api.github.com/users/STATS-UOA/starred{/owner}{/repo}" #> #> $organization$subscriptions_url #> [1] "https://api.github.com/users/STATS-UOA/subscriptions" #> #> $organization$organizations_url #> [1] 
"https://api.github.com/users/STATS-UOA/orgs" #> #> $organization$repos_url #> [1] "https://api.github.com/users/STATS-UOA/repos" #> #> $organization$events_url #> [1] "https://api.github.com/users/STATS-UOA/events{/privacy}" #> #> $organization$received_events_url #> [1] "https://api.github.com/users/STATS-UOA/received_events" #> #> $organization$type #> [1] "Organization" #> #> $organization$site_admin #> [1] FALSE #> #> #> $network_count #> [1] 0 #> #> $subscribers_count #> [1] 1 ``` ] --- ## Status code ```r status_code(resp) ``` ``` #> [1] 200 ``` <br> .pull-left[ ```r http_status(200) # OK http_status(201) # Created http_status(204) # NO CONTENT ``` ] .pull-right[ ```r http_status(400) # BAD REQUEST http_status(403) # Forbidden http_status(404) # NOT FOUND ``` ] .footnote[<https://en.wikipedia.org/wiki/List_of_HTTP_status_codes>] --- ## Auckland Transport Open GIS Data: [bus stop](https://data-atgis.opendata.arcgis.com/datasets/bus-stop?geometry=173.281%2C-37.229%2C176.247%2C-36.459) ```r endpoint <- "https://services2.arcgis.com" path <- "/JkPEgZJGxhSjYOo0/arcgis/rest/services/BusService/FeatureServer/0/" query <- "query?where=1%3D1&outFields=*&outSR=4326&f=geojson" bus_url <- parse_url(paste0(endpoint, path, query)) resp <- GET(bus_url) cnt <- geojson::as.geojson(content(resp)) cnt ``` ``` #> <geojson> #> type: FeatureCollection #> features (n): 1000 #> features (geometry / length) [first 5]: #> Point / 2 #> Point / 2 #> Point / 2 #> Point / 2 #> Point / 2 ``` --- class: middle .pull-left[ ```r library(tidyverse) library(sf) bus_sf <- geojsonsf::geojson_sf(cnt) ggplot() + geom_sf(data = bus_sf, pch = 1, colour = "#3182bd") ``` ] .pull-right[ <img src="figure/bus-stop-plot2-1.png" width="540" style="display: block; margin: auto;" /> ] --- ## Reading .pull-left[ <br> <br> .center[ <img src="https://raw.githubusercontent.com/rstudio/hex-stickers/master/PNG/rvest.png" width="240px"> ]] .pull-right[ * [Get started with 
{rvest}](https://rvest.tidyverse.org/articles/rvest.html) * [{httr} quickstart guide](https://httr.r-lib.org/articles/quickstart.html) ]
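
---

## <i class="fab fa-css3-alt blue"></i> Try the CSS selectors in R

The selectors from the household example can be checked directly with {rvest}. This is a minimal sketch, assuming {rvest} `>= v1.0.0` (needed for `minimal_html()`); the HTML is a trimmed copy of the sample, and the names `households` and `texts()` are made up for illustration.

```r
library(rvest)

# A trimmed copy of the household sample (shortened for this demo)
households <- minimal_html('
  <div id="p1" class="parent">
    <p>Hi!</p>
    <div class="child nice"><p>Hello!</p></div>
  </div>
  <p>Household 1</p>
  <div class="parent">
    <p>Hi!</p>
    <blockquote class="child rebel"><p>Don\'t talk to me!</p></blockquote>
  </div>
  <p>End of households</p>
')

# Hypothetical helper: the text of every element matching a CSS selector
texts <- function(css) {
  households %>%
    html_elements(css) %>%
    html_text2()
}

texts("div > p")         # <p> one level deep in a <div>
texts("div ~ p")         # <p> siblings that come after a <div>
texts(".child.rebel")    # elements with both classes
texts(".parent .rebel")  # class rebel inside class parent
texts("#p1 > p")         # <p> directly inside the element with id p1
```

When nothing matches, `html_elements()` returns an empty node set, so a mistyped selector shows up as `character(0)` rather than an error.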