
STATS 220

Data technology

1 / 46

Kia Ora!

  • 🎓 I earned my PhD (Stats) @ Monash University, Australia.
  • ❤️ My research interests lie in exploratory data analysis, data visualisation, software design, ...
  • 👩‍💻 I turn ☕ into > 10 #rstats 📦.
  • Outside of work, I play 🎾 and make ☕.
2 / 46
  • Finished my PhD at the end of 2019; moved to Auckland last Feb.
  • I do research in statistical computing and graphics: I develop new graphical methods, interactive graphics, and software for data scientists.
  • A regular contributor to and developer for R. One of my most popular packages was downloaded millions of times last year.
  • Amateur tennis player; I make my own flat white.

Contact

3 / 46
  • Post any R-related questions on Piazza, so others can benefit.
  • I'll run my office hours every Thursday from this week onwards, using the same Zoom link.
  • If you've got any download or installation issues, please drop by my office hours.
4 / 46

I'm looking for 2 class reps. Please nominate yourself in the chat.

Data + Technology

https://stats220.earo.me

5 / 46
  • This is my 2nd time teaching the course. It has been revamped for modern data and modern tech. It's not an easy A+ course.
  • I've made all course materials available on this website instead of Canvas.
  • Give a tour of the website.

What I mean by "data"


🥫

  • Stale, uninteresting, convenient
  • Highly processed and archived
  • Examples: student tests, titanic, wages


🍅

6 / 46
  • If you studied 20X, you definitely know and have worked with the student test dataset.
  • It's smallish, with only a handful of observations, and highly processed for modelling purposes, like a can of veggies.
  • Real datasets are much more interesting to work with: they relate to our lives, and we can make useful and meaningful decisions from them.
  • Example: predicting arrival times for Auckland buses in real time.

How I learn new technology

  • 🗣 Get your hands dirty‼️
  • 📖 Documentation! Documentation! Documentation!
  • 🔍 (Not surprisingly) Learn to google: what that error message means (I google a lot 🤭)
7 / 46
  • Run the code, see what happens, and tweak it yourself for a small project.
  • Read the documentation!
  • Learn to google.
  • Search or ask questions on Stack Overflow and RStudio Community.

You can't do data science in a GUI

reference: You can't do data science in a GUI

8 / 46
  • The first software you used to work with data is probably Excel. It's a GUI application. GUI stands for ...
  • If I want to sort a column in Excel, a window pops up asking whether to sort the whole dataset or only that column.
  • What's wrong with this kind of point-and-click application?
  • You don't work alone on a data science project. If your fellow students, future colleagues, or even your future self want to know how you processed your data, would you repeat your steps and share them as screen recordings? It's not feasible to share and replicate your process that way.

Why programme for data science?

  • Programming languages are languages.
library(dplyr)
starwars %>%
  group_by(species) %>%
  summarise(
    n = n(),
    mass = mean(mass, na.rm = TRUE)
  ) %>%
  filter(n > 1, mass > 50)
  • It's just text!
    • reproducible, readable, sharable
    • expressive
9 / 46
  • If we can't do data science in a GUI, can we do it by programming?
  • This is an R snippet. Even if you don't know R yet, you can still read the script and probably get a rough sense of what this code block is doing.
  • It's plain text, so we can copy and paste it.
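
The %>% operator in the snippet above is a pipe: it passes the result on its left as the first argument of the function on its right, which is why the code reads top to bottom. A minimal sketch of the same idea, assuming R 4.1 or later for the built-in |> pipe; the vector masses is made up for illustration and is not the starwars data:

```r
# Hypothetical values for illustration only
masses <- c(77, 32, 136, NA, 84)

# Nested calls read inside-out
nested <- round(mean(masses, na.rm = TRUE), 1)

# Piped calls read top to bottom: take masses, average them, then round
piped <- masses |>
  mean(na.rm = TRUE) |>
  round(1)

# Both spellings compute the same thing
identical(nested, piped)
```

The piped version does nothing the nested version can't; it only changes the reading order, which matters more as the number of steps grows.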

Why R?

  • A general-purpose programming language
  • Originated by statisticians, a language for statistical analysis
  • 293964 + packages on CRAN (Comprehensive R Archive Network, the official repository), Github, etc.
  • The tidyverse, a domain specific language in R for data scientists
10 / 46
  • We learnt R for statistical modelling, but it's general-purpose.
  • By contrast, a special-purpose language is built for one job, e.g. SQL for manipulating databases.
  • R originated at UoA (University of Auckland) in 1993, with the goal of doing statistical analysis.
  • R has been thriving over the past decades because of a growing community with so many third-party packages/add-ons.
  • CRAN
  • Hadley Wickham is an Auckland Uni alumnus.

What can R do?

- for fun

📦 {cowsay} for generating ASCII pictures

library(cowsay)
say("Kia Ora!")
#>
#>  --------------
#> Kia Ora!
#>  --------------
#>     \
#>       \
#>         \
#>             |\___/|
#>           ==) ^Y^ (==
#>             \  ^  /
#>              )=*=(
#>             /     \
#>             |     |
#>            /| | | |\
#>            \| | |_|/\
#>       jgs  //_// ___/
#>                \_)
#>
11 / 46

What can R do?

- for fun

- for data

The data science workflow

12 / 46

We'll learn each of these modules through the semester.

What can R do?

- for fun

- for data

- for communication

R Markdown


R Markdown documents are fully reproducible: weaving narrative text and code together.

13 / 46
  • Rmd ecosystem
  • will boost your productivity

What can R do?

- for fun

- for data

- for communication

R shiny dashboard

  • Shiny is an R package that makes it easy to build interactive web apps straight from R.

👆 Clicking the image above takes you to the web app; try interacting with it.
14 / 46
  • A Shiny dashboard developed by the Ministry of Business, Innovation & Employment.

Textbook 📚

15 / 46
  • Available online.
  • One reason I like the R community: people like sharing and make their work open.
  • Open education.
  • Clicking the images will take you to the books.

At first, you may be like this...

16 / 46

At first, you may be like this...

But you can do it!

16 / 46
  • From my own experience and that of other beginners: it's frustrating, like having a cloud over your head.
  • I can teach you the bits and pieces, but it's up to you to compose them to solve real-world problems.
  • Like Lego.
  • The first two weeks may look easy, because I teach the basics.

Assessments

  • 11 weekly labs 10% (best 10 out of 11)
  • 3 assignments 30% (each 10%)
  • 1 mid-term test 10% (TBD, possibly week 8)
  • 1 final exam 50%
17 / 46
  • 300, automatically grade your labs

Project-oriented workflow

18 / 46

If R were an airplane, RStudio would be the airport, providing many, many supporting services that make it easier for you, the pilot, to take off and go to awesome places. Sure, you can fly an airplane without an airport, but having those runways and supporting infrastructure is a game-changer.
-- Julie Lowndes

19 / 46

Hope you've downloaded R and RStudio.

RStudio interface

image credit: Stuart Lee

20 / 46

live

Setting up RStudio (do this once)

Go to Tools > Global Options:





Uncheck the Workspace and History options, which helps keep your R working environment fresh and clean every time you switch between projects.

21 / 46

Your turn

Change the RStudio appearance to your taste

01:00
22 / 46

1 minute to choose your favourite theme

What is a project?

  • Treat each university course as a project to keep your work organised.
  • A self-contained project is a folder that contains all relevant files, for example my stats220/ 📁 includes:
    • stats220.Rproj
    • data/
      • *.csv, *.xlsx
    • lectures/
      • 01-intro.Rmd, 02-import-export.Rmd
    • labs/
      • lab01.R, lab02.R
  • All working files are relative to the project root (i.e. stats220/).
  • The project should just work on a different computer.
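
A small sketch of what "relative to the project root" looks like in code. The file name cities.csv is hypothetical; the point is that the path contains nothing specific to one computer:

```r
# Build a path relative to the project root (hypothetical file name)
rel_path <- file.path("data", "cities.csv")
rel_path
#> [1] "data/cities.csv"

# R resolves relative paths against the working directory, which RStudio
# sets to the project root when you open the .Rproj file
full_path <- file.path(getwd(), rel_path)

# Check that the file is there before trying to read it
if (file.exists(rel_path)) {
  cities <- read.csv(rel_path)
}
```

Because rel_path never mentions a drive letter or user name, the same script works on any computer that has the project folder.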
23 / 46

🛑 STOP DOING THIS!

Jenny Bryan will set your computer on fire 🔥

  1. if the first line of your R script is
    setwd("C:\Users\jenny\path\that\only\I\have")
  2. if the first line of your R script is
    rm(list = ls())
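
One reason to avoid rm(list = ls()) as a "reset": it only removes objects from the workspace and does not give you a fresh R session. A minimal sketch, using the digits option as an arbitrary example of session state that survives:

```r
# Change a global option and create an object
options(digits = 3)
tmp <- pi

rm(list = ls())                       # removes workspace objects such as tmp ...

tmp_exists <- exists("tmp")           # FALSE: the object is gone
digits_after <- getOption("digits")   # still 3: the changed option survives

options(digits = 7)                   # put R's default back
```

Loaded packages survive in the same way. Restarting R (Session > Restart R in RStudio) is what actually gives a clean slate.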
24 / 46
25 / 46

Create an RStudio project .Rproj

  1. Click the Project icon on the top right corner



  2. New Directory/Existing Directory > New Project > Create Project


  3. Open the project

26 / 46

101: syntax and semantics

27 / 46

Get started

- assignment

akl_lon <- 174.76
akl_lat <- -36.85

⬆️ read as "assign the value of 174.76 to an object called akl_lon".

An assignment consists of:

  • left-hand side: variable names or symbols (akl_lon)
  • assignment operator: <- (RStudio shortcut: Alt + -)
  • right-hand side: values (174.76)
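
A few things worth trying with assignment. The Wellington coordinates below are approximate values made up for illustration and are not part of the slides:

```r
# Assignment is silent: nothing is printed
wlg_lon <- 174.78

# Wrapping an assignment in parentheses assigns and prints in one step
(wlg_lat <- -41.29)
#> [1] -41.29

# The right-hand side is evaluated first, then the name is (re)bound,
# so an object can be updated from its own old value
wlg_lon <- round(wlg_lon)
wlg_lon
#> [1] 175
```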
28 / 46

Get started

- assignment

- retrieval

akl_lon
#> [1] 174.76
akl_lat
#> [1] -36.85
  • Names are case sensitive.
akl_Lon
#> Error in eval(expr, envir, enclos): object 'akl_Lon' not found
29 / 46

Get started

- assignment

- retrieval

- operation

Perform calculations and comparisons

  • Infix operators:
    • +, -, *, /, ^, %% (modulo), %/% (integer division)
    • ==, !=, >, <, >=, <=, %in%
akl_lon_region <- akl_lon + c(-1, 1)
akl_lat_region <- akl_lat + c(-.5, .5)
akl_lon_region
#> [1] 173.76 175.76
akl_lat_region
#> [1] -37.35 -36.35
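
A quick sketch of the less familiar operators in the list above. The latitudes are made-up example values. Note that operators work element by element, which is also why akl_lon + c(-1, 1) returns two numbers:

```r
remainder <- 7 %% 3     # modulo: what is left after dividing 7 by 3
remainder
#> [1] 1
quotient <- 7 %/% 3     # integer division: how many times 3 fits into 7
quotient
#> [1] 2

# Comparisons are vectorised and return a logical vector
lat <- c(-36.85, -41.29, -43.53)
north_of_40s <- lat > -40
north_of_40s
#> [1]  TRUE FALSE FALSE

# %in% asks whether the left-hand value appears in the right-hand vector
is_member <- "akl" %in% c("akl", "wlg")
is_member
#> [1] TRUE
```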
30 / 46

Coding style

Good coding style is like correct punctuation: you can manage without it, butitsuremakesthingseasiertoread.
-- The tidyverse style guide

R style guide

  • snake_case
  • camelCase (JavaScript)
  • PascalCase (e.g. Python class names)
31 / 46

101: data structures

32 / 46

Atomic vectors

Scalars: length of 1

  • Logicals: TRUE or FALSE
  • Doubles: 174.76, 1.7476e2, Inf, -Inf, NaN (Not a Number)
  • Integers: 174L
  • Strings: "hello", 'world'

Vectors: values must all be the same type

lgl_vec <- c(TRUE, FALSE)
int_vec <- c(174L, -36L)
dbl_vec <- c(174.76, -36.85)
chr_vec <- c("long", "lat")
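
To check which type a vector has, typeof() is handy. The sketch below also shows what happens when types are mixed inside c(): R silently converts everything to the most flexible type present. The object names are made up for illustration:

```r
typeof(TRUE)
#> [1] "logical"
typeof(174L)
#> [1] "integer"
typeof(174.76)
#> [1] "double"
typeof("long")
#> [1] "character"

# Mixing types: logical and integer are coerced to double
mixed_num <- c(TRUE, 174L, 174.76)
typeof(mixed_num)
#> [1] "double"

# Mixing in a string turns everything into character
mixed_chr <- c(174.76, "lat")
mixed_chr
#> [1] "174.76" "lat"
```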
33 / 46

Special values

Missing values

NA # Not Available
#> [1] NA
c(174.76, NA, -36.85)
#> [1] 174.76 NA -36.85
length(NA)
#> [1] 1

The NULL object

NULL
#> NULL
c(174.76, NULL, -36.85)
#> [1] 174.76 -36.85
length(NULL)
#> [1] 0
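
A short sketch of how the two special values behave in practice: NA is a placeholder for a value we don't know, so it spreads through calculations, while NULL is the absence of any value. The vector obs reuses the numbers from the slide:

```r
obs <- c(174.76, NA, -36.85)

is.na(obs)                          # test for missing values element by element
#> [1] FALSE  TRUE FALSE

total <- sum(obs)                   # one unknown value makes the sum unknown
total
#> [1] NA
total_rm <- sum(obs, na.rm = TRUE)  # drop missing values first
total_rm
#> [1] 137.91

# Comparing with NA gives NA, so use is.na() rather than == NA
na_cmp <- NA == NA
na_cmp
#> [1] NA

is.null(NULL)
#> [1] TRUE
```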
34 / 46

Atomic vectors

35 / 46

Subsetting vectors with []

x <- c(akl_lon_region, akl_lat_region)
x
#> [1] 173.76 175.76 -37.35 -36.35

Positive indices

x[c(1, 3)]
#> [1] 173.76 -37.35

Negative indices

x[-c(3, 1)]
#> [1] 175.76 -36.35
36 / 46

Subsetting vectors with []

Logical indices

x[c(TRUE, FALSE, TRUE, FALSE)]
#> [1] 173.76 -37.35
x[lgl_vec] # recycling
#> [1] 173.76 -37.35
x[x > 0]
#> [1] 173.76 175.76

Special subsetting

x[0]
#> numeric(0)
x[]
#> [1] 173.76 175.76 -37.35 -36.35
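
A few more subsetting patterns that come up often, sketched with the same values as x. The element names in named_x are made up for illustration:

```r
x <- c(173.76, 175.76, -37.35, -36.35)

pos <- which(x > 0)          # positions where the condition is TRUE
pos
#> [1] 1 2
last_value <- x[length(x)]   # the last element
last_value
#> [1] -36.35
beyond <- x[5]               # an out-of-range index gives NA, not an error
beyond
#> [1] NA

# Named vectors can also be subset by name
named_x <- x
names(named_x) <- c("lon_min", "lon_max", "lat_min", "lat_max")
named_x[c("lon_min", "lat_min")]
```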
37 / 46

Modifying vectors with [] on the LHS

y <- x
y
#> [1] 173.76 175.76 -37.35 -36.35
y[1:3] <- y[1:3] %/% 2
y
#> [1] 86.00 87.00 -19.00 -36.35
  • RHS [] subsets vector y
  • LHS [] modifies vector y
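
The same idea works with logical indices, which gives a compact way to change only the elements that meet a condition. A minimal sketch; z is a made-up copy of the values so that y is left alone:

```r
z <- c(173.76, 175.76, -37.35, -36.35)

# RHS picks out the negative values; LHS puts their absolute values back
z[z < 0] <- abs(z[z < 0])
z
#> [1] 173.76 175.76  37.35  36.35

# Assigning past the end grows the vector and fills the gap with NA
z[6] <- 0
z
#> [1] 173.76 175.76  37.35  36.35     NA   0.00
```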
38 / 46

101: functions

39 / 46

Function

A function call consists of the function name followed by zero or more arguments within parentheses.

mean(x = x)
#> [1] 68.955
  • function name: mean(), a built-in R function to compute mean of a vector
  • argument: the first argument (LHS x) to specify the data (RHS x)
40 / 46
  • A function is a tool to do what you ask for.
  • It is like a pipe that takes some input and sends back some output.

Function help page

Check the function's help page with ?mean

mean(x, trim = 0, na.rm = FALSE, ...)
  • Read Usage section
    • What arguments have default values?
  • Read Arguments section
    • What does trim do?
  • Run Example code
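
The default values can also be inspected from the console without opening the help page. A small sketch using mean.default, the method that R runs for ordinary numeric vectors:

```r
# Print the argument list, much like the Usage section
args(mean.default)

# formals() returns the arguments and their defaults as a list
defaults <- formals(mean.default)
names(defaults)
#> [1] "x"     "trim"  "na.rm" "..."
defaults$trim
#> [1] 0
defaults$na.rm
#> [1] FALSE
```

The argument x has no default, so it must always be supplied.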
01:00
41 / 46

Function arguments




Match by positions

mean(x, 0.1, TRUE)
#> [1] 68.955




Match by names

mean(x, na.rm = TRUE, trim = 0.1)
#> [1] 68.955
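
With only four values in x, trim = 0.1 removes nothing, so both calls above give the plain mean. A sketch with made-up scores that include an outlier shows what trim does, and that the result is the same whichever way the arguments are matched:

```r
scores <- c(1, 2, 3, 4, 100)   # hypothetical data with one outlier

plain <- mean(scores)
plain
#> [1] 22

# trim = 0.2 drops the lowest 20% and highest 20% of values before averaging
by_name <- mean(scores, trim = 0.2)
by_name
#> [1] 3

by_position <- mean(scores, 0.2)         # second position is trim
reordered <- mean(trim = 0.2, x = scores) # named arguments can come in any order
```

Matching by name is longer to type but safer to read, since nobody has to remember which position means what.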
42 / 46
  • body implements the algorithm

Use functions from packages

# install.packages("dplyr")
library(dplyr)
cummean(x)
#> [1] 173.7600 174.7600 104.0567 68.9550
first(x)
#> [1] 173.76
last(x)
#> [1] -36.35
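
A function can also be called without attaching its package, by writing package::function. The sketch below uses the stats and utils packages only because they ship with R and so run without installing anything; the same pattern applies to dplyr::first(x):

```r
# Call a function from a specific package without library()
middle <- stats::median(c(3, 1, 2))
middle
#> [1] 2

first_three <- utils::head(letters, 3)
first_three
#> [1] "a" "b" "c"

# Check whether a package is installed before relying on it
has_stats <- requireNamespace("stats", quietly = TRUE)
has_stats
#> [1] TRUE
```

The :: form also makes it obvious where a function comes from, which helps when two packages use the same function name.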




43 / 46

Write your own functions

# function_name <- function(arguments) {
#   function_body
# }
my_mean <- function(x, na.rm = FALSE) {
  # drop missing values so that length(x) counts only the observed ones
  if (na.rm) x <- x[!is.na(x)]
  summation <- sum(x)
  summation / length(x)
}
my_mean(x)
#> [1] 68.955
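
Another small function in the same shape, as a sketch. The name mid_range is hypothetical and not from the slides; it returns the midpoint between the smallest and largest value, and checks its input before doing any work:

```r
# Hypothetical helper: midpoint of the range of a numeric vector
mid_range <- function(x, na.rm = FALSE) {
  stopifnot(is.numeric(x))   # fail early with an error on non-numeric input
  # the value of the last expression is what the function returns
  (min(x, na.rm = na.rm) + max(x, na.rm = na.rm)) / 2
}

lon_mid <- mid_range(c(173.76, 175.76))
lon_mid
#> [1] 174.76

lat_mid <- mid_range(c(-37.35, NA, -36.35), na.rm = TRUE)
lat_mid
#> [1] -36.85
```

Applied to the longitude and latitude regions from earlier, it recovers the Auckland coordinates the regions were built around.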
44 / 46

Follow the #rstats community

45 / 46
  • Keep up to date on Twitter.
  • R-Ladies
  • Follow Hadley and Jenny.
  • Resources: R4DS
  • R has evolved rapidly in the past 5 years, so beware of out-of-date resources.
