STATS 220

Data visualisation 📊

1 / 57


The greatest value of a picture is when it forces us to notice what we never expected to see.
-- John W. Tukey


2 / 57
  • John Tukey, one of the most well-known statisticians
  • invented the boxplot and the stem-and-leaf plot (pencil and paper)
  • coined the term EDA (exploratory data analysis)

numbers vs plots

dino
#> # A tibble: 142 x 2
#>       x     y
#>   <dbl> <dbl>
#> 1  55.4  97.2
#> 2  51.5  96.0
#> 3  46.2  94.5
#> 4  42.8  91.4
#> 5  40.8  88.3
#> 6  38.7  84.9
#> # … with 136 more rows
3 / 57

  • Humans digest information more quickly by looking at pictures than by reading tables. -> ytb, tt
  • Numbers on their own don't make sense to us.

numbers vs plots

image credit: Steph Locke

4 / 57
  • Simple summary statistics cannot reveal the full picture.
  • All of these x-y data tables share the same mean, sd and correlation, but the structure in the data differs.
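
A minimal sketch of the same point, assuming Anscombe's quartet (the `anscombe` data frame that ships with base R) as a stand-in for the datasets pictured on the slide. The four x-y sets have nearly identical means, standard deviations and correlations, yet look nothing alike once plotted.

```r
# Anscombe's quartet: columns x1..x4 and y1..y4 (a stand-in, not the slide's data).
summary_tbl <- do.call(rbind, lapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  data.frame(
    set = i,
    mean_x = mean(x), mean_y = mean(y),
    sd_x = sd(x), sd_y = sd(y),
    corr = cor(x, y)
  )
}))
# The rows are almost indistinguishable after rounding.
print(round(summary_tbl, 2))

# Plot the four sets side by side to reveal the different structures.
op <- par(mfrow = c(2, 2))
for (i in 1:4) {
  plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]],
       xlab = "x", ylab = "y", main = paste("Set", i))
}
par(op)
```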

Why data visualisation? 📊

A picture is worth a thousand words. -- Henrik Ibsen

  1. Data visualisation communicates information much quicker than numerical tables.
  2. Data visualisation can reveal unexpected structures in data; it is not surprising that data visualisation is one of the key tools in exploratory data analysis.
  3. A data plot is usually more eye-catching than a table, even if some accuracy of the information is lost.
5 / 57

Charts 🥊 Graphics

6 / 57
  • When we talk about data vis, we use plots, charts and graphics interchangeably.
  • But when it comes to stats, there's a difference between them.
  • What do we mean by graphics, and how do we make them?

A toy example


sci_tbl
#> # A tibble: 4 x 2
#>   dept             count
#>   <chr>            <int>
#> 1 Physics             12
#> 2 Mathematics          8
#> 3 Statistics          20
#> 4 Computer Science    23
  • dept: discrete/categorical
  • count: quantitative/numeric
    What types of plots can we make?
  1. bar plot for counts
  2. pie chart for proportions
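
The slides print `sci_tbl` without showing how it was built. A minimal sketch, assuming a plain `data.frame` with the values copied from the printed output (a tibble would work the same way); `sci_prop` is a hypothetical name used only here to hold the proportions a pie chart would display.

```r
# Recreate the toy table from the printed values.
sci_tbl <- data.frame(
  dept = c("Physics", "Mathematics", "Statistics", "Computer Science"),
  count = c(12L, 8L, 20L, 23L)
)

# A bar plot shows the counts directly; a pie chart shows each count's
# share of the total, so derive the proportions explicitly.
sci_prop <- transform(sci_tbl, prop = count / sum(count))
print(sci_prop)
```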
7 / 57

Named charts

  • Bar plot
barplot(as.matrix(sci_tbl$count),
  legend = sci_tbl$dept)

  • Pie chart
pie(sci_tbl$count,
  labels = sci_tbl$dept)

8 / 57
  • default R functions
  • one-off functions
  • What's the fundamental difference between a bar plot and a pie chart?

Seems convenient, but ...


  • a limited set of named charts
  • single purpose functions
  • inconsistent inputs
barplot(as.matrix(sci_tbl$count),
  legend = sci_tbl$dept)
pie(sci_tbl$count,
  labels = sci_tbl$dept)
9 / 57



Grammar makes language expressive. A language consisting of words and no grammar (statement = word) expresses only as many ideas as there are words. By specifying how words are combined in statements, a grammar expands a language's scope.

10 / 57
  • A book that blew my mind and changed how I look at statistical graphics.
  • We can easily run out of names.
  • We can generate many new types of graphics by combining components according to the grammar.
  • This book lays the theoretical foundation for {ggplot2}, {tab}, {vega-lite}.

image credit: Thomas Lin Pedersen

11 / 57

decomposed to

The grammar of graphics takes us beyond a limited set of charts (words) to an almost unlimited world of graphical forms (statements).


{ggplot2} provides a cohesive system for declaratively creating elegant graphics, based on The Grammar of Graphics.

12 / 57
  • extends gg, and layered gg.
  • provides a cohesive and declarative grammar to create graphics
library(ggplot2)
ggplot(data = sci_tbl) +
  geom_bar(
    aes(x = "", y = count, fill = dept),
    stat = "identity"
  )

ggplot(data = sci_tbl) +
  geom_bar(
    aes(x = "", y = count, fill = dept),
    stat = "identity"
  ) +
  coord_polar(theta = "y")

13 / 57
  • Back to our question: what's the difference?
  • The difference is that the pie chart plots the same bars in polar coordinates.

A graphing template

ggplot(data = <DATA>, mapping = aes(<MAPPINGS>)) +
  layer(geom = <GEOM>, stat = <STAT>, position = <POSITION>) +
  layer(geom = <GEOM>, stat = <STAT>, position = <POSITION>)
  1. data: tibble/data.frame.
  2. mapping: aesthetic mappings between data variables and visual elements, via aes().
  3. layer(): a graphical layer is a combination of data, stat and geom with a potential position adjustment.
    • geom: geometric elements to render each data observation.
    • stat: statistical transformations applied to the data prior to plotting.
    • position: position adjustment, such as "identity", "stack", "dodge" etc.
14 / 57
  • geom: points, bars, lines, text
  • stat: "identity" leaves the data as is; "boxplot" computes the five-number summary
  • +: layer + layer
  • When you think about a graphic to make:
    • which geom to represent the data
    • any stats to be used

Layers: a bar chart 📊

ggplot(data = sci_tbl, mapping = aes(x = dept, y = count)) +
  layer(geom = "bar", stat = "identity", position = "identity")

15 / 57

Aesthetic mapping: positional

p <- ggplot(sci_tbl, aes(x = dept, y = count))
p

16 / 57
  • ggplot() initialises the plot
  • save a ggplot object to a symbol

Geoms (a shorthand to layer())

p +
  geom_bar(stat = "identity")

p +
  geom_col()
  • stat = "identity" leaves data as is.
  • geom_col() is a shortcut to geom_bar(stat = "identity").

Generally, we use geom_*() instead of layer() in practice.
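
One way to check the shorthand is to compare what each version actually draws. This is a sketch under a few assumptions: `sci_tbl` is rebuilt as a plain `data.frame` from the printed values, and `bar_extent()` is a hypothetical helper written for this comparison. `layer_data()` is ggplot2's accessor for the data a layer renders.

```r
library(ggplot2)

# Toy table rebuilt from the printed values (assumption: a data.frame is fine).
sci_tbl <- data.frame(
  dept = c("Physics", "Mathematics", "Statistics", "Computer Science"),
  count = c(12L, 8L, 20L, 23L)
)
p <- ggplot(sci_tbl, aes(x = dept, y = count))

by_layer <- p + layer(geom = "bar", stat = "identity", position = "identity")
by_geom <- p + geom_col()

# Hypothetical helper: where each bar sits and how tall it is, ordered by x.
bar_extent <- function(plot) {
  d <- layer_data(plot)
  d <- d[order(d$x), ]
  cbind(x = as.numeric(d$x), ymin = d$ymin, ymax = d$ymax)
}

# Both specifications should describe the same rectangles.
print(all.equal(bar_extent(by_layer), bar_extent(by_geom)))
```

Note that geom_col() stacks bars by default while the explicit layer uses position = "identity"; with one bar per department the two coincide.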

17 / 57
  • auto complete for geom_*()

Geoms

p +
  geom_point()

p +
  geom_segment(aes(xend = dept, y = 0, yend = count))

18 / 57
  • We don't have to stick with bar
  • we can use points/vertical lines
  • geom_segment(): more aes

Composite geoms: lollipop 🍭 = points + segments

p +
  geom_point() +
  geom_segment(aes(xend = dept, y = 0, yend = count))

19 / 57

Geom catalogue

source code: Emi Tanaka

20 / 57

Stats

  • Aggregated (pre-computed)
sci_tbl
#> # A tibble: 4 x 2
#>   dept             count
#>   <chr>            <int>
#> 1 Physics             12
#> 2 Mathematics          8
#> 3 Statistics          20
#> 4 Computer Science    23
  • Disaggregated
sci_tbl0
#> # A tibble: 63 x 1
#>   dept
#>   <chr>
#> 1 Physics
#> 2 Physics
#> 3 Physics
#> 4 Physics
#> 5 Physics
#> 6 Physics
#> # … with 57 more rows
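
The two tables hold the same information in different shapes. A minimal base-R sketch of moving between them, assuming `sci_tbl` is rebuilt from the printed values; `recount` is a hypothetical name for the re-aggregated result.

```r
# Aggregated form: one row per department with a pre-computed count.
sci_tbl <- data.frame(
  dept = c("Physics", "Mathematics", "Statistics", "Computer Science"),
  count = c(12L, 8L, 20L, 23L)
)

# Disaggregate: repeat each department once per counted unit.
sci_tbl0 <- data.frame(dept = rep(sci_tbl$dept, times = sci_tbl$count))
print(nrow(sci_tbl0))

# Re-aggregate by counting rows per department. This is the kind of
# tally that stat = "count" performs before drawing bars.
recount <- as.data.frame(table(dept = sci_tbl0$dept), responseName = "count")
print(recount)
```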
21 / 57

Stats

ggplot(sci_tbl, aes(x = dept, y = count)) +
  geom_bar(stat = "identity")

ggplot(sci_tbl0, aes(x = dept)) +
  geom_bar(stat = "count")

22 / 57

Aesthetic mapping: visual

p +
  geom_col(aes(colour = dept))

p +
  geom_col(aes(fill = dept))

23 / 57

Mapping variables / Setting constants

p +
  geom_col(aes(fill = dept))

p +
  geom_col(fill = "#756bb1")

24 / 57
  • bar -> rect -> 2d
    • stroke + fill
  • auto legend
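
A common slip, not shown on the slide, is to put a constant inside aes(). The sketch below (assuming `sci_tbl` rebuilt from the printed values) contrasts setting the fill with mapping the string "#756bb1": in the mapped version the string is treated as a one-level categorical variable, so the fill scale picks the colour and a legend appears.

```r
library(ggplot2)

sci_tbl <- data.frame(
  dept = c("Physics", "Mathematics", "Statistics", "Computer Science"),
  count = c(12L, 8L, 20L, 23L)
)
p <- ggplot(sci_tbl, aes(x = dept, y = count))

# Setting: the constant goes outside aes() and is used as given.
set_fill <- p + geom_col(fill = "#756bb1")

# Mapping: inside aes() the same string becomes data, not a colour.
mapped_fill <- p + geom_col(aes(fill = "#756bb1"))

# Inspect the fill each layer would actually draw with.
print(unique(layer_data(set_fill)$fill))
print(unique(layer_data(mapped_fill)$fill))
```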

Mapping variables + Setting constants

p +
  geom_col(aes(fill = dept), colour = "#000000")

25 / 57

Visual aesthetics

  • colour/color, fill:
    • named colours, e.g. "red"
    • RGB specification, e.g. "#756bb1"
  • alpha: opacity between 0 and 1
  • shape:
    • an integer between 0 and 25
    • a single string, e.g. "triangle open"
  • linetype:
    • an integer between 0 and 6
    • a single string, e.g. "dashed"
  • size, radius: a numerical value (in millimetres)
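
To see several of these aesthetics at once, here is a small sketch that restyles the lollipop chart with constants. The data frame is rebuilt from the printed values and the particular colours, shape and sizes are arbitrary choices for illustration.

```r
library(ggplot2)

sci_tbl <- data.frame(
  dept = c("Physics", "Mathematics", "Statistics", "Computer Science"),
  count = c(12L, 8L, 20L, 23L)
)

styled <- ggplot(sci_tbl, aes(x = dept, y = count)) +
  # linetype given as a string; colour as an RGB hex specification
  geom_segment(aes(xend = dept, y = 0, yend = count),
               colour = "#756bb1", linetype = "dashed") +
  # shape given as a string; named colour; alpha between 0 and 1
  geom_point(shape = "triangle open", size = 3, colour = "red", alpha = 0.8)
styled
```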



26 / 57

Your turn

Describe a bubble chart in terms of grammar of graphics.
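
One possible answer, as a sketch: a bubble chart is a point geom with the identity stat, two variables mapped to the positional aesthetics x and y, and a third mapped to size. The example uses the `mpg` data from {ggplot2}; the choice of variables is arbitrary.

```r
library(ggplot2)

# data: mpg; geom: point; stat: identity (the default for geom_point)
# mappings: displ -> x, cty -> y, hwy -> size of each bubble
bubble <- ggplot(mpg, aes(x = displ, y = cty)) +
  geom_point(aes(size = hwy), alpha = 0.3)
bubble
```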

27 / 57

  • gg: grammar of graphics
  • {ggplot2}: the second version

Coords

  • Coordinate systems
    • coord_cartesian() (default)
    • coord_flip() (superseded; in many cases you can simply swap x and y)
    • coord_map()
    • coord_polar()
p +
  geom_col(aes(fill = dept)) +
  coord_polar(theta = "y")

28 / 57

live demo:

  • ggplot()
  • ggplot(data)
  • ggplot(data, aes())
  • inherit aes
  • layers
  • swap x and y
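
A sketch of the last demo item, assuming `sci_tbl` rebuilt from the printed values and a version of {ggplot2} recent enough to detect bar orientation from the mapping (3.3.0 or later). Both plots give horizontal bars; the second needs no coord at all.

```r
library(ggplot2)

sci_tbl <- data.frame(
  dept = c("Physics", "Mathematics", "Statistics", "Computer Science"),
  count = c(12L, 8L, 20L, 23L)
)

# Older idiom: draw vertical bars, then flip the coordinate system.
flipped <- ggplot(sci_tbl, aes(x = dept, y = count)) +
  geom_col() +
  coord_flip()

# Newer idiom: map the categorical variable to y directly.
swapped <- ggplot(sci_tbl, aes(x = count, y = dept)) +
  geom_col()
swapped
```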

Themes: modify the look

  • Built-in ggplot themes
    • theme_grey()/theme_gray()
    • theme_bw(), theme_linedraw()
    • theme_light(), theme_dark()
    • theme_minimal(), theme_classic()
    • theme_void()
p +
  geom_col(aes(fill = dept)) +
  theme_bw()

29 / 57
  • start with p11

Themes: modify the look

library(ggthemes)
p +
  geom_col(aes(fill = dept)) +
  theme_economist()

30 / 57
  • Being able to modify the theme is the first step towards making publication-ready plots.
  • Within an organisation, use a uniform theme throughout.
  • If you want to make your first R package, contributing a theme would be a good start.

Modify the look of texts with element_text()

image credit: Emi Tanaka

31 / 57
  • tons of args in theme() for fine-tuning
  • I'll quickly go through these args now, and dive deeper in week 7 on effective data vis.

Modify the look of texts with element_text()

p +
  geom_col(aes(fill = dept)) +
  theme(axis.text.x = element_text(angle = 30, vjust = 0.1))

32 / 57

Modify the look of lines with element_line()

image credit: Emi Tanaka

33 / 57

Modify the look of regions with element_rect()

image credit: Emi Tanaka
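
The line and region slides are image-only, so here is a small sketch in the same spirit as the element_text() example. It assumes `sci_tbl` rebuilt from the printed values; the theme arguments used are standard theme() elements, and the colours are arbitrary.

```r
library(ggplot2)

sci_tbl <- data.frame(
  dept = c("Physics", "Mathematics", "Statistics", "Computer Science"),
  count = c(12L, 8L, 20L, 23L)
)

themed <- ggplot(sci_tbl, aes(x = dept, y = count)) +
  geom_col(aes(fill = dept)) +
  theme(
    # lines: dashed horizontal grid lines, no vertical ones
    panel.grid.major.y = element_line(colour = "grey70", linetype = "dashed"),
    panel.grid.major.x = element_blank(),
    # regions: white panel with a black border
    panel.background = element_rect(fill = "white", colour = "black")
  )
themed
```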

34 / 57

Small multiples (or trellis/faceting plots)

🌟 the idea of conditioning on the values taken on by one or more of the variables in a data set

35 / 57

Facets

mpg data available from {ggplot2}

mpg
#> # A tibble: 234 x 11
#>   manufacturer model displ  year   cyl trans      drv     cty
#>   <chr>        <chr> <dbl> <int> <int> <chr>      <chr> <int>
#> 1 audi         a4      1.8  1999     4 auto(l5)   f        18
#> 2 audi         a4      1.8  1999     4 manual(m…  f        21
#> 3 audi         a4      2    2008     4 manual(m…  f        20
#> 4 audi         a4      2    2008     4 auto(av)   f        21
#> 5 audi         a4      2.8  1999     6 auto(l5)   f        16
#> 6 audi         a4      2.8  1999     6 manual(m…  f        18
#> # … with 228 more rows, and 3 more variables: hwy <int>,
#> #   fl <chr>, class <chr>
36 / 57

Facets

p_mpg <- ggplot(mpg, aes(displ, cty)) +
  geom_point(aes(colour = drv))
p_mpg

37 / 57

Facets

- facet_grid()

p_mpg +
  facet_grid(rows = vars(drv))
# facet_grid(drv ~ .)

38 / 57
  • looking at conditional distribution
  • grid -> 2d matrix layout

Facets

- facet_grid()

p_mpg +
  facet_grid(cols = vars(drv))
# facet_grid(~ drv)

39 / 57

Facets

- facet_grid()

p_mpg +
  facet_grid(rows = vars(drv), cols = vars(cyl))
# facet_grid(drv ~ cyl)

40 / 57

Facets

- facet_grid()

- facet_wrap()

p_mpg +
  facet_wrap(vars(drv, cyl), ncol = 3)
# facet_wrap(~ drv + cyl, ncol = 3)

41 / 57

Exploratory data visualisation

image credit: Emi Tanaka

42 / 57

case study

- import

movies <- tibble::as_tibble(jsonlite::read_json(
  "https://vega.github.io/vega-editor/app/data/movies.json",
  simplifyVector = TRUE))
movies
#> # A tibble: 3,201 x 16
#>   Title                US_Gross Worldwide_Gross US_DVD_Sales
#>   <chr>                   <int>           <dbl>        <int>
#> 1 The Land Girls         146083          146083           NA
#> 2 First Love, Last Ri…    10876           10876           NA
#> 3 I Married a Strange…   203134          203134           NA
#> 4 Let's Talk About Sex   373615          373615           NA
#> 5 Slam                  1009819         1087521           NA
#> 6 Mississippi Mermaid     24551         2624551           NA
#> # … with 3,195 more rows, and 12 more variables:
#> #   Production_Budget <int>, Release_Date <chr>,
#> #   MPAA_Rating <chr>, Running_Time_min <int>,
#> #   Distributor <chr>, Source <chr>, Major_Genre <chr>,
#> #   Creative_Type <chr>, Director <chr>,
#> #   Rotten_Tomatoes_Rating <int>, IMDB_Rating <dbl>,
#> #   IMDB_Votes <int>
43 / 57

case study

- import

- skim

skimr::skim(movies)
#> ── Data Summary ────────────────────────
#> Values
#> Name movies
#> Number of rows 3201
#> Number of columns 16
#> _______________________
#> Column type frequency:
#> character 8
#> numeric 8
#> ________________________
#> Group variables None
#>
#> ── Variable type: character ────────────────────────────────────────────────────
#> skim_variable n_missing complete_rate min max empty n_unique whitespace
#> 1 Title 1 1.00 1 66 0 3176 0
#> 2 Release_Date 7 0.998 8 11 0 1603 0
#> 3 MPAA_Rating 605 0.811 1 9 0 7 0
#> 4 Distributor 232 0.928 3 33 0 174 0
#> 5 Source 365 0.886 6 29 0 18 0
#> 6 Major_Genre 275 0.914 5 19 0 12 0
#> 7 Creative_Type 446 0.861 7 23 0 9 0
#> 8 Director 1331 0.584 7 27 0 550 0
#>
#> ── Variable type: numeric ──────────────────────────────────────────────────────
#> skim_variable n_missing complete_rate mean sd
#> 1 US_Gross 7 0.998 44002085. 62555311.
#> 2 Worldwide_Gross 7 0.998 85343400. 149947343.
#> 3 US_DVD_Sales 2637 0.176 34901547. 45895122.
#> 4 Production_Budget 1 1.00 31069171. 35585913.
#> 5 Running_Time_min 1992 0.378 110. 20.2
#> 6 Rotten_Tomatoes_Rating 880 0.725 54.3 28.1
#> 7 IMDB_Rating 213 0.933 6.28 1.25
#> 8 IMDB_Votes 213 0.933 29909. 44938.
#>        p0      p25       p50       p75       p100 hist
#> 1       0 5493221. 22019466. 56091762.  760167650 ▇▁▁▁▁
#> 2       0 8031285. 31168926. 97283797  2767891499 ▇▁▁▁▁
#> 3  618454 9906211. 20331558. 37794216.  352582053 ▇▁▁▁▁
#> 4     218 6575000  20000000  42000000   300000000 ▇▁▁▁▁
#> 5      46      95       107       121         222 ▁▇▃▁▁
#> 6       1      30        55        80         100 ▅▆▆▇▇
#> 7     1.4     5.6       6.4       7.2         9.2 ▁▁▅▇▂
#> 8      18   4828.    15106     35810.      519541 ▇▁▁▁▁
44 / 57

case study

- import

- skim

- vis

  • Data analysis starts with questions (a.k.a. curiosity).

45 / 57
  • EDA is like wandering around a city as a tourist: sometimes you have a destination, sometimes you don't.
  • Once you start with one question, more questions come up along the way.
  • There's no single best plot for the data.
  • Make as many plots as possible, as quickly as possible, to see most facets of the data.
  • You may end up with nothing.
  • Plots for internal use don't need polishing.

case study

- import

- skim

- vis

Are movie ratings consistent between IMDB and Rotten Tomatoes?

ggplot(movies, aes(x = IMDB_Rating, y = Rotten_Tomatoes_Rating)) +
  geom_point(size = 0.5, alpha = 0.5) +
  geom_smooth(method = "gam") +
  theme(aspect.ratio = 1)

46 / 57

case study

- import

- skim

- vis

Are movie ratings consistent between IMDB and Rotten Tomatoes?

ggplot(movies, aes(x = IMDB_Rating, y = Rotten_Tomatoes_Rating)) +
  geom_hex() +
  theme(aspect.ratio = 1)

47 / 57
  • live demo: for rendering speed

case study

- import

- skim

- vis

The popularity of major genres

ggplot(movies, aes(y = Major_Genre)) +
  geom_bar()

48 / 57

case study

- import

- skim

- vis

How well liked is each major genre?

ggplot(movies) +
  geom_boxplot(aes(x = IMDB_Rating, y = Major_Genre))

49 / 57

case study

- import

- skim

- vis

How well liked is each major genre?

ggplot(movies) +
  geom_density(aes(x = IMDB_Rating, fill = Major_Genre))

50 / 57
  • overlapping
  • could use facet
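
A sketch of the faceting idea from the notes. To keep it self-contained it uses the `mpg` data from {ggplot2} as a stand-in for `movies`, with `drv` playing the role of Major_Genre; the same pattern should carry over to the movies data.

```r
library(ggplot2)

# Overlaid densities hide each other; transparency only helps a little.
overlaid <- ggplot(mpg, aes(x = cty)) +
  geom_density(aes(fill = drv), alpha = 0.4)

# Faceting gives each group its own panel, so nothing overlaps.
faceted <- ggplot(mpg, aes(x = cty)) +
  geom_density(aes(fill = drv)) +
  facet_wrap(vars(drv), ncol = 1)
faceted
```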

case study

- import

- skim

- vis

How well liked is each major genre?

library(ggridges)
ggplot(movies, aes(x = IMDB_Rating, y = Major_Genre)) +
  geom_density_ridges(aes(fill = Major_Genre))

51 / 57

{ggplot2}-ext 📦

{ggplot2} now has an official extension mechanism. This means that others can now easily create their own stats, geoms and positions, and provide them in other packages. This should allow the ggplot2 community to flourish, even as less development work happens in ggplot2 itself.

➡️ https://exts.ggplot2.tidyverse.org/gallery/

52 / 57
  • {ggplot2} has been around for more than 10 yrs.
  • {ggplot2} is extensible

library(gganimate)
ggplot(mtcars, aes(factor(cyl), mpg)) +
  geom_boxplot() +
  # Here comes the gganimate code
  transition_states(
    gear,
    transition_length = 2,
    state_length = 1
  ) +
  enter_fade() +
  exit_shrink() +
  ease_aes('sine-in-out')




53 / 57




54 / 57

55 / 57
  • good resources for d3.js

To be continued ...

56 / 57
  • a tragic year because of COVID-19
  • a blast of a year for data vis
  • log scale

