STATS 220

Data visualisation 📊

1 / 57

The greatest value of a picture is when it forces us to notice what we never expected to see.
-- John W. Tukey

2 / 57

John Tukey, one of the most well-known statisticians
invented the boxplot and the stem-and-leaf plot (pencil and paper)
coined the term exploratory data analysis (EDA)

numbers vs plots

dino

#> # A tibble: 142 x 2
#>       x     y
#>   <dbl> <dbl>
#> 1  55.4  97.2
#> 2  51.5  96.0
#> 3  46.2  94.5
#> 4  42.8  91.4
#> 5  40.8  88.3
#> 6  38.7  84.9
#> # … with 136 more rows

3 / 57

Humans digest information faster by eye than by reading tables. -> ytb, tt
Numbers on their own don't make sense to us.

numbers vs plots

image credit: Steph Locke

4 / 57

Simple summary statistics cannot reveal the full picture.
All of these x-y data sets share the same mean, SD and correlation, but have very different structures.
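
A quick way to check a claim like this is to compute the summaries yourself. The sketch below uses Anscombe's quartet (the `anscombe` data frame that ships with base R) as a stand-in, since the data sets in the image are not reproduced in these notes.

```r
# Anscombe's quartet: four x-y data sets stored as columns x1..x4, y1..y4.
# It is a stand-in here, not the data shown on the slide.
summaries <- do.call(rbind, lapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  data.frame(
    set = i,
    mean_x = mean(x), mean_y = mean(y),
    sd_x = sd(x), sd_y = sd(y),
    corr = cor(x, y)
  )
}))

# The four rows of summaries agree to two decimal places ...
print(round(summaries, 2))

# ... but the four scatterplots look nothing alike.
old_par <- par(mfrow = c(2, 2))
for (i in 1:4) {
  plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]],
       xlab = "x", ylab = "y", main = paste("Set", i))
}
par(old_par)
```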

Why data visualisation?📊

A picture is worth a thousand words. -- Henrik Ibsen

Data visualisation communicates information much quicker than numerical tables.
Data visualisation can reveal unexpected structures in data; it is not surprising that data visualisation is one of the key tools in exploratory data analysis.
A data plot is usually more eye-catching than a table, even if some accuracy of the information is lost.

5 / 57

Charts 🥊 Graphics

6 / 57

When we talk about data vis, we use plots, charts and graphics interchangeably.
But when it comes to statistics, there is a difference between them.
What do we mean by graphics, and how do we make them?

A toy example

sci_tbl

#> # A tibble: 4 x 2
#>   dept             count
#>   <chr>            <int>
#> 1 Physics             12
#> 2 Mathematics          8
#> 3 Statistics          20
#> 4 Computer Science    23

dept: discrete/categorical
count: quantitative/numeric
What types of plots can we make?

bar plot for counts
pie chart for proportions
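
The slides print `sci_tbl` without showing how it was created. A minimal sketch of one way to rebuild it, with the values copied from the printed output (the name `prop` is introduced here for illustration only):

```r
library(tibble)

# Rebuild the toy table; counts are stored as integers, as in the printout.
sci_tbl <- tibble(
  dept = c("Physics", "Mathematics", "Statistics", "Computer Science"),
  count = c(12L, 8L, 20L, 23L)
)
sci_tbl

# A bar plot shows the counts directly; a pie chart shows these proportions.
prop <- sci_tbl$count / sum(sci_tbl$count)
round(prop, 3)
```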

7 / 57

Named charts

Bar plot

barplot(as.matrix(sci_tbl$count), 
  legend = sci_tbl$dept)

Pie chart

pie(sci_tbl$count, 
  labels = sci_tbl$dept)

8 / 57

default R functions
one-off functions
What's the fundamental difference between a bar plot and a pie chart?

Seems convenient, but ...

a limited set of named charts
single purpose functions
inconsistent inputs

barplot(as.matrix(sci_tbl$count), 
  legend = sci_tbl$dept)

pie(sci_tbl$count, 
  labels = sci_tbl$dept)

9 / 57

Grammar makes language expressive. A language consisting of words and no grammar (statement = word) expresses only as many ideas as there are words. By specifying how words are combined in statements, a grammar expands a language’s scope.

10 / 57

A book that blew my mind and changed how I look at statistical graphics.
We can easily run out of names.
We can generate many new types of graphics by combining components according to the grammar.
This book lays the theoretical foundation for {ggplot2}, {tab}, {vega-lite}.

image credit: Thomas Lin Pedersen

11 / 57

decomposed to

The grammar of graphics takes us beyond a limited set of charts (words) to an almost unlimited world of graphical forms (statements).

{ggplot2} provides a cohesive system for declaratively creating elegant graphics, based on The Grammar of Graphics.

12 / 57

{ggplot2} extends the grammar of graphics into a layered grammar of graphics.
It provides a cohesive and declarative grammar for creating graphics.

library(ggplot2)
ggplot(data = sci_tbl) +
  geom_bar(
    aes(x = "", y = count, fill = dept),
    stat = "identity"
  )

ggplot(data = sci_tbl) +
  geom_bar(
    aes(x = "", y = count, fill = dept),
    stat = "identity"
  ) +
  coord_polar(theta = "y")

13 / 57

Back to our question: what is the difference between a bar plot and a pie chart?
The only difference is that the pie chart draws the bars in polar coordinates.

A graphing template

ggplot(data = <DATA>, mapping = aes(<MAPPINGS>)) +
  layer(geom = <GEOM>, stat = <STAT>, position = <POSITION>) +
  layer(geom = <GEOM>, stat = <STAT>, position = <POSITION>)

data: tibble/data.frame.
mapping: aesthetic mappings between data variables and visual elements, via aes().
layer(): a graphical layer is a combination of data, stat and geom with a potential position adjustment.
- geom: geometric elements to render each data observation.
- stat: statistical transformations applied to the data prior to plotting.
- position: position adjustment, such as "identity", "stack", "dodge" etc.
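
To make the three slots of `layer()` concrete, here is a sketch that fills the template with a non-trivial stat and position. It assumes the `mpg` data that ships with {ggplot2}; the name `p_dodge` is only for illustration.

```r
library(ggplot2)

# class and drv are both categorical variables in mpg.
p_dodge <- ggplot(data = mpg, mapping = aes(x = class, fill = drv)) +
  layer(
    geom = "bar",       # draw one rectangle per group
    stat = "count",     # count the rows in each class/drv group first
    position = "dodge"  # place the bars side by side instead of stacking
  )
p_dodge
```

Changing `position` to `"stack"` keeps the same data, stat and geom but stacks the bars, which is the kind of recombination the grammar is designed for.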

14 / 57

geom: points, bars, lines, text
stat: "identity" leaves the data as is; "boxplot" computes the five-number summary
+: adds one layer to another
When you think about a graphic to make, ask:
- which geom represents the data?
- is any stat needed?
Layers: a bar chart 📊

ggplot(data = sci_tbl, mapping = aes(x = dept, y = count)) +
  layer(geom = "bar", stat = "identity", position = "identity")

15 / 57

Aesthetic mapping: positional

p <- ggplot(sci_tbl, aes(x = dept, y = count))
p

16 / 57

ggplot() initialises the plot
save the ggplot object under a name

Geoms (a shorthand to `layer()`)

p + 
  geom_bar(stat = "identity")

p + 
  geom_col()

stat = "identity" leaves data as is.
geom_col() is a shortcut to geom_bar(stat = "identity").

Generally, we use geom_*() instead of layer() in practice.
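
One way to convince yourself that the two spellings are equivalent is to compare the data each layer computes, using `layer_data()` from {ggplot2}. A small sketch, rebuilding the toy table so that it runs on its own:

```r
library(ggplot2)
library(tibble)

sci_tbl <- tibble(
  dept = c("Physics", "Mathematics", "Statistics", "Computer Science"),
  count = c(12L, 8L, 20L, 23L)
)
p <- ggplot(sci_tbl, aes(x = dept, y = count))

# layer_data() returns the data frame a layer hands to its geom.
bars_identity <- layer_data(p + geom_bar(stat = "identity"))
bars_col <- layer_data(p + geom_col())

# The bar positions and heights agree.
cols <- c("x", "ymin", "ymax")
same_bars <- isTRUE(all.equal(bars_identity[cols], bars_col[cols]))
same_bars
```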

17 / 57

auto complete for geom_*()

Geoms

p +
  geom_point()

p +
  geom_segment(aes(xend = dept, y = 0, yend = count))

18 / 57

We don't have to stick with bars;
we can use points or vertical lines instead.
geom_segment() needs more aesthetics (xend, yend).

Composite geoms: lollipop 🍭 = points + segments

p +
  geom_point() +
  geom_segment(aes(xend = dept, y = 0, yend = count))

19 / 57

Geom catalogue

geom	Description
geom_abline, geom_hline, geom_vline	Reference lines: horizontal, vertical, and diagonal
geom_bar, geom_col	Bar charts
geom_bin2d	Heatmap of 2d bin counts
geom_blank	Draw nothing
geom_boxplot	A box and whiskers plot (in the style of Tukey)


source code: Emi Tanaka

20 / 57

Stats

Aggregated (pre-computed)

sci_tbl

#> # A tibble: 4 x 2
#>   dept             count
#>   <chr>            <int>
#> 1 Physics             12
#> 2 Mathematics          8
#> 3 Statistics          20
#> 4 Computer Science    23

Disaggregated

sci_tbl0

#> # A tibble: 63 x 1
#>   dept   
#>   <chr>  
#> 1 Physics
#> 2 Physics
#> 3 Physics
#> 4 Physics
#> 5 Physics
#> 6 Physics
#> # … with 57 more rows
21 / 57

Stats

ggplot(sci_tbl, aes(x = dept, y = count)) +
  geom_bar(stat = "identity")

ggplot(sci_tbl0, aes(x = dept)) +
  geom_bar(stat = "count")
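
The two tables hold the same information. The sketch below rebuilds both (how `sci_tbl0` was actually created is not shown in the slides, so the `rep()` call is an assumption) and checks that `stat = "identity"` on the aggregated table and `stat = "count"` on the disaggregated one give the same bar heights.

```r
library(ggplot2)
library(tibble)

sci_tbl <- tibble(
  dept = c("Physics", "Mathematics", "Statistics", "Computer Science"),
  count = c(12L, 8L, 20L, 23L)
)

# Disaggregate: repeat each department name `count` times, one row per person.
sci_tbl0 <- tibble(dept = rep(sci_tbl$dept, times = sci_tbl$count))

d_identity <- layer_data(
  ggplot(sci_tbl, aes(x = dept, y = count)) + geom_bar(stat = "identity")
)
d_count <- layer_data(
  ggplot(sci_tbl0, aes(x = dept)) + geom_bar(stat = "count")
)

# Order both by x position before comparing the bar heights.
heights_identity <- as.numeric(d_identity$y[order(d_identity$x)])
heights_count <- as.numeric(d_count$y[order(d_count$x)])
identical(heights_identity, heights_count)
```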

22 / 57

Aesthetic mapping: visual

p +
  geom_col(aes(colour = dept))

p +
  geom_col(aes(fill = dept))

23 / 57

Mapping variables / Setting constants

p +
  geom_col(aes(fill = dept))

p +
  geom_col(fill = "#756bb1")
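
A common slip is to put a constant inside `aes()`. The sketch below shows what happens: {ggplot2} treats the string as a data value and passes it through the fill scale, so the bars are not drawn in the colour you asked for.

```r
library(ggplot2)
library(tibble)

sci_tbl <- tibble(
  dept = c("Physics", "Mathematics", "Statistics", "Computer Science"),
  count = c(12L, 8L, 20L, 23L)
)
p <- ggplot(sci_tbl, aes(x = dept, y = count))

# Setting: the constant goes outside aes() and is used as is.
set_fill <- layer_data(p + geom_col(fill = "#756bb1"))$fill

# Mapping by mistake: inside aes() the string becomes a one-level variable,
# and the default fill scale picks its own colour for that level.
mapped_fill <- layer_data(p + geom_col(aes(fill = "#756bb1")))$fill

unique(set_fill)
unique(mapped_fill)
```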

24 / 57

A bar is a rectangle (2D), so it has two colour aesthetics:
- stroke (colour) + fill

Mapping a variable produces a legend automatically.

Mapping variables + Setting constants

p +
  geom_col(aes(fill = dept), colour = "#000000")

25 / 57

Visual aesthetics

colour/color, fill:
- named colours, e.g. "red"
- RGB specification, e.g. "#756bb1"
alpha: opacity between 0 and 1
shape:
- an integer between 0 and 25
- a single string, e.g. "triangle open"
linetype:
- an integer between 0 and 6
- a single string, e.g. "dashed"
size, radius: a numerical value (in millimetres)
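
A short sketch that sets several of these aesthetics as constants on the lollipop chart from earlier. The particular values (`"grey40"`, size 4, alpha 0.6) are arbitrary choices for illustration.

```r
library(ggplot2)
library(tibble)

sci_tbl <- tibble(
  dept = c("Physics", "Mathematics", "Statistics", "Computer Science"),
  count = c(12L, 8L, 20L, 23L)
)
p <- ggplot(sci_tbl, aes(x = dept, y = count))

p_aes <- p +
  # dashed grey sticks
  geom_segment(
    aes(xend = dept, y = 0, yend = count),
    linetype = "dashed", colour = "grey40"
  ) +
  # semi-transparent red open triangles, 4 mm in size
  geom_point(shape = "triangle open", size = 4, colour = "red", alpha = 0.6)
p_aes
```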

26 / 57

Your turn

Describe a bubble chart in terms of grammar of graphics.

27 / 57

gg: grammar of graphics
{ggplot2}: the second version

Coords

Coordinate systems
- coord_cartesian() (default)
- ~~coord_flip()~~ (superseded; now you can simply swap x and y)
- coord_map()
- coord_polar()

p +
  geom_col(aes(fill = dept)) +
  coord_polar(theta = "y")
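
For horizontal bars, swapping the positional aesthetics does the job without `coord_flip()`. A sketch, assuming {ggplot2} 3.3.0 or later (the release in which geoms began to detect their orientation from the mapping):

```r
library(ggplot2)
library(tibble)

sci_tbl <- tibble(
  dept = c("Physics", "Mathematics", "Statistics", "Computer Science"),
  count = c(12L, 8L, 20L, 23L)
)

# Swap x and y: the bars now grow along the x axis.
p_h <- ggplot(sci_tbl, aes(x = count, y = dept)) +
  geom_col()
p_h

# The older spelling, kept here only for comparison.
p_flip <- ggplot(sci_tbl, aes(x = dept, y = count)) +
  geom_col() +
  coord_flip()
p_flip
```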

28 / 57

live demo:

ggplot()
ggplot(data)
ggplot(data, aes())
inherit aes

layers

swap x and y

Themes: modify the look

Built-in ggplot themes
- theme_grey()/theme_gray()
- theme_bw(), theme_linedraw()
- theme_light(), theme_dark()
- theme_minimal(), theme_classic()
- theme_void()

p +
  geom_col(aes(fill = dept)) +
  theme_bw()

29 / 57

start with p11

Themes: modify the look

Many R packages provide themes.

library(ggthemes)
p +
  geom_col(aes(fill = dept)) +
  theme_economist()

30 / 57

Being able to modify the theme is the first step towards publication-ready plots.
Within an organisation, use one uniform theme across all plots.
If you want to make your first R package, contributing a theme would be a good start.

Modify the look of texts with `element_text()`

image credit: Emi Tanaka

31 / 57

theme() has tons of arguments for fine-tuning.
I'll go through these arguments quickly here, and dive deeper in week 7 on effective data vis.

Modify the look of texts with `element_text()`

p +
  geom_col(aes(fill = dept)) +
  theme(axis.text.x = element_text(angle = 30, vjust = 0.1))

32 / 57

Modify the look of lines with `element_line()`

image credit: Emi Tanaka

33 / 57

Modify the look of regions with `element_rect()`

image credit: Emi Tanaka
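
The two slides on lines and regions show images only, so here is a small sketch of `element_line()` and `element_rect()` in use on the toy bar chart. The theme element names are real {ggplot2} arguments; the colours and line type are arbitrary.

```r
library(ggplot2)
library(tibble)

sci_tbl <- tibble(
  dept = c("Physics", "Mathematics", "Statistics", "Computer Science"),
  count = c(12L, 8L, 20L, 23L)
)
p <- ggplot(sci_tbl, aes(x = dept, y = count))

p_theme <- p +
  geom_col(aes(fill = dept)) +
  theme(
    # lines: dashed horizontal grid lines, no vertical ones
    panel.grid.major.y = element_line(colour = "grey70", linetype = "dashed"),
    panel.grid.major.x = element_blank(),
    # regions: white panel with a black border
    panel.background = element_rect(fill = "white", colour = "black")
  )
p_theme
```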

34 / 57

Small multiples (or trellis/faceting plots)

🌟 the idea of conditioning on the values taken on by one or more of the variables in a data set

35 / 57

mpg data available from {ggplot2}

mpg

#> # A tibble: 234 x 11
#>   manufacturer model displ  year   cyl trans     drv     cty
#>   <chr>        <chr> <dbl> <int> <int> <chr>     <chr> <int>
#> 1 audi         a4      1.8  1999     4 auto(l5)  f        18
#> 2 audi         a4      1.8  1999     4 manual(m… f        21
#> 3 audi         a4      2    2008     4 manual(m… f        20
#> 4 audi         a4      2    2008     4 auto(av)  f        21
#> 5 audi         a4      2.8  1999     6 auto(l5)  f        16
#> 6 audi         a4      2.8  1999     6 manual(m… f        18
#> # … with 228 more rows, and 3 more variables: hwy <int>,
#> #   fl <chr>, class <chr>

36 / 57

p_mpg <- ggplot(mpg, aes(displ, cty)) + 
  geom_point(aes(colour = drv))
p_mpg

37 / 57

- `facet_grid()`

p_mpg +
  facet_grid(rows = vars(drv))
  # facet_grid(drv ~ .)

38 / 57

looking at conditional distribution
grid -> 2d matrix layout

- `facet_grid()`

p_mpg +
  facet_grid(cols = vars(drv))
  # facet_grid(~ drv)

39 / 57

- `facet_grid()`

p_mpg +
  facet_grid(rows = vars(drv), cols = vars(cyl))
  # facet_grid(drv ~ cyl)

40 / 57

- `facet_grid()`

- `facet_wrap()`

p_mpg +
  facet_wrap(vars(drv, cyl), ncol = 3)
  # facet_wrap(~ drv + cyl, ncol = 3)
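
In the formula interface to `facet_grid()`, the left-hand side gives the rows and the right-hand side gives the columns. A sketch that checks this against the `vars()` spelling by comparing the panel each observation is assigned to:

```r
library(ggplot2)

p_mpg <- ggplot(mpg, aes(displ, cty)) +
  geom_point(aes(colour = drv))

by_vars <- p_mpg + facet_grid(rows = vars(drv), cols = vars(cyl))
by_formula <- p_mpg + facet_grid(drv ~ cyl)   # rows ~ cols
transposed <- p_mpg + facet_grid(cyl ~ drv)   # cyl in rows, drv in columns

# PANEL records which panel each point is drawn in.
same_layout <- identical(
  layer_data(by_vars)$PANEL,
  layer_data(by_formula)$PANEL
)
different_layout <- !identical(
  layer_data(by_vars)$PANEL,
  layer_data(transposed)$PANEL
)
c(same_layout, different_layout)
```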

41 / 57

Exploratory data visualisation

image credit: Emi Tanaka

42 / 57

case study
- import
movies <- tibble::as_tibble(jsonlite::read_json(
  "https://vega.github.io/vega-editor/app/data/movies.json",
  simplifyVector = TRUE))
movies

#> # A tibble: 3,201 x 16
#>   Title                US_Gross Worldwide_Gross US_DVD_Sales
#>   <chr>                   <int>           <dbl>        <int>
#> 1 The Land Girls         146083          146083           NA
#> 2 First Love, Last Ri…    10876           10876           NA
#> 3 I Married a Strange…   203134          203134           NA
#> 4 Let's Talk About Sex   373615          373615           NA
#> 5 Slam                  1009819         1087521           NA
#> 6 Mississippi Mermaid     24551         2624551           NA
#> # … with 3,195 more rows, and 12 more variables:
#> #   Production_Budget <int>, Release_Date <chr>,
#> #   MPAA_Rating <chr>, Running_Time_min <int>,
#> #   Distributor <chr>, Source <chr>, Major_Genre <chr>,
#> #   Creative_Type <chr>, Director <chr>,
#> #   Rotten_Tomatoes_Rating <int>, IMDB_Rating <dbl>,
#> #   IMDB_Votes <int>
43 / 57

case study
- import
- skim
skimr::skim(movies)

#> ── Data Summary ────────────────────────
#>                            Values
#> Name                       movies
#> Number of rows             3201  
#> Number of columns          16    
#> _______________________          
#> Column type frequency:           
#>   character                8     
#>   numeric                  8     
#> ________________________         
#> Group variables            None  
#> 
#> ── Variable type: character ────────────────────────────────────────────────────
#>   skim_variable n_missing complete_rate   min   max empty n_unique whitespace
#> 1 Title                 1         1.00      1    66     0     3176          0
#> 2 Release_Date          7         0.998     8    11     0     1603          0
#> 3 MPAA_Rating         605         0.811     1     9     0        7          0
#> 4 Distributor         232         0.928     3    33     0      174          0
#> 5 Source              365         0.886     6    29     0       18          0
#> 6 Major_Genre         275         0.914     5    19     0       12          0
#> 7 Creative_Type       446         0.861     7    23     0        9          0
#> 8 Director           1331         0.584     7    27     0      550          0
#> 
#> ── Variable type: numeric ──────────────────────────────────────────────────────
#>   skim_variable          n_missing complete_rate        mean           sd
#> 1 US_Gross                       7         0.998 44002085.    62555311.  
#> 2 Worldwide_Gross                7         0.998 85343400.   149947343.  
#> 3 US_DVD_Sales                2637         0.176 34901547.    45895122.  
#> 4 Production_Budget              1         1.00  31069171.    35585913.  
#> 5 Running_Time_min            1992         0.378      110.          20.2 
#> 6 Rotten_Tomatoes_Rating       880         0.725       54.3         28.1 
#> 7 IMDB_Rating                  213         0.933        6.28         1.25
#> 8 IMDB_Votes                   213         0.933    29909.       44938.  
#>         p0       p25        p50        p75         p100 hist 
#> 1      0   5493221.  22019466.  56091762.   760167650   ▇▁▁▁▁
#> 2      0   8031285.  31168926.  97283797   2767891499   ▇▁▁▁▁
#> 3 618454   9906211.  20331558.  37794216.   352582053   ▇▁▁▁▁
#> 4    218   6575000   20000000   42000000    300000000   ▇▁▁▁▁
#> 5     46        95        107        121          222   ▁▇▃▁▁
#> 6      1        30         55         80          100   ▅▆▆▇▇
#> 7      1.4       5.6        6.4        7.2          9.2 ▁▁▅▇▂
#> 8     18      4828.     15106      35810.      519541   ▇▁▁▁▁
44 / 57

case study

- import

- skim

- vis

Data analysis starts with questions (a.k.a. curiosity).

45 / 57

EDA is like wandering around a city as a tourist: sometimes you have a destination, sometimes you don't.
Once you start with one question, more questions come up along the way.
There is no single best plot for the data.
Make as many plots as you can, as quickly as you can, to see as many facets of the data as possible.
Sometimes you end up with nothing.
Plots for internal use don't need polishing.

case study

- import

- skim

- vis

Are movie ratings consistent between IMDB and Rotten Tomatoes?

ggplot(movies, aes(x = IMDB_Rating, y = Rotten_Tomatoes_Rating)) +
  geom_point(size = 0.5, alpha = 0.5) +
  geom_smooth(method = "gam") +
  theme(aspect.ratio = 1)

46 / 57

case study

- import

- skim

- vis

Are movie ratings consistent between IMDB and Rotten Tomatoes?

ggplot(movies, aes(x = IMDB_Rating, y = Rotten_Tomatoes_Rating)) +
  geom_hex() +
  theme(aspect.ratio = 1)

47 / 57

live demo: for rendering speed

case study

- import

- skim

- vis

The popularity of each major genre

ggplot(movies, aes(y = Major_Genre)) +
  geom_bar()

48 / 57

case study

- import

- skim

- vis

How well liked is each major genre?

ggplot(movies) +
  geom_boxplot(aes(x = IMDB_Rating, y = Major_Genre))

49 / 57

case study

- import

- skim

- vis

How well liked is each major genre?

ggplot(movies) +
  geom_density(aes(x = IMDB_Rating, fill = Major_Genre))

50 / 57

overlapping
could use facet
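
Two common ways to deal with overlapping densities are transparency and faceting. The sketch below uses the `mpg` data from {ggplot2} as a stand-in for the movies data (which has to be downloaded), with `drv` playing the role of the genre.

```r
library(ggplot2)

p_dens <- ggplot(mpg, aes(x = cty))

# Option 1: make the fills semi-transparent so every curve stays visible.
p_alpha <- p_dens +
  geom_density(aes(fill = drv), alpha = 0.4)
p_alpha

# Option 2: one panel per group, stacked vertically on a shared x axis.
p_facet <- p_dens +
  geom_density(aes(fill = drv)) +
  facet_wrap(vars(drv), ncol = 1)
p_facet
```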

case study

- import

- skim

- vis

How well liked is each major genre?

library(ggridges)
ggplot(movies, aes(x = IMDB_Rating, y = Major_Genre)) +
  geom_density_ridges(aes(fill = Major_Genre))

51 / 57

{ggplot2}-ext 📦

{ggplot2} now has an official extension mechanism. This means that others can now easily create their own stats, geoms and positions, and provide them in other packages. This should allow the ggplot2 community to flourish, even as less development work happens in ggplot2 itself.

➡️ https://exts.ggplot2.tidyverse.org/gallery/

52 / 57

{ggplot2} has been around for more than 10 years.
{ggplot2} is extensible

library(gganimate)
ggplot(mtcars, aes(factor(cyl), mpg)) +
  geom_boxplot() +
  # Here comes the gganimate code
  transition_states(
    gear,
    transition_length = 2,
    state_length = 1
  ) +
  enter_fade() +
  exit_shrink() +
  ease_aes('sine-in-out')

53 / 57

54 / 57

The R Graph Gallery

55 / 57

good resources for d3.js

To be continued ...

NEW: the Thursday 19 March update of our coronavirus mortality trajectories tracker

• Italy now has more Covid-19 deaths than China’s total
• UK remains on a steeper mortality curve than Italy, while Britain remains far from lockdown

Live version here: https://t.co/VcSZISFxzF pic.twitter.com/QvByzSj6QX
— John Burn-Murdoch (@jburnmurdoch) March 19, 2020

56 / 57

A tragic year because of COVID-19.
A big year for data vis.
log scale

Reading

57 / 57
