Data Representations
Part I

Dr. Mine Dogucu

Different Representations of Data

We can represent data in formats such as the following:

  • visual
  • text
  • sound
  • tactile

Today we will cover data represented visually, but throughout the week we will cover other data representations.

Accessibility

Data visualization is perhaps the most commonly used format for representing data.

Data visualization can convey a lot about data. However, visualizations are not accessible to everyone. For instance, they are not accessible to people who are blind or visually impaired.

Different modes (e.g., sound) of representation are especially important for making the data representation accessible to all.

Data Visualization

Examples

How Common Is Your Birthday?

One Dataset Visualized 25 Ways

Mandatory Paid Vacation

Why are K-pop groups so big?

We will only scratch the surface of data visualization in this class. It is a rich field, and some of you may consider a career in data visualization.

Data Visualizations

  • are graphical representations of data

  • use different colors, shapes, and the coordinate system to summarize data

  • can tell a story or can be useful for exploring data

Data

glimpse(babies)
Rows: 1,236
Columns: 8
$ case      <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
$ bwt       <int> 120, 113, 128, 123, 108, 136, 138, 132, 120, 143, 140, 144, …
$ gestation <int> 284, 282, 279, NA, 282, 286, 244, 245, 289, 299, 351, 282, 2…
$ parity    <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
$ age       <int> 27, 33, 28, 36, 23, 25, 33, 23, 25, 30, 27, 32, 23, 36, 30, …
$ height    <int> 62, 64, 64, 69, 67, 62, 62, 65, 62, 66, 68, 64, 63, 61, 63, …
$ weight    <int> 100, 135, 115, 190, 125, 93, 178, 140, 125, 136, 120, 124, 1…
$ smoke     <lgl> FALSE, FALSE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE,…

?babies

case id number

bwt birthweight, in ounces

gestation length of gestation, in days

parity binary indicator for a first pregnancy (0 = first pregnancy)

age mother’s age in years

height mother’s height in inches

weight mother’s weight in pounds

smoke binary indicator for whether the mother smokes

Bar plot

  • When can we use a bar plot?
  • What does this bar plot convey?

Bar plot

ggplot(babies)

Bar plot

ggplot(babies, aes(x = smoke)) 

Bar plot

ggplot(babies, aes(x = smoke)) +
  geom_bar()
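
To see what the bars encode, it can help to compute the counts directly. The sketch below uses a small made-up data frame (toy_babies is a hypothetical stand-in, not the babies data) and sets a table of counts next to the same kind of geom_bar() call.

```r
library(ggplot2)

# Hypothetical toy data standing in for babies; the values are made up
toy_babies <- data.frame(
  smoke = c(TRUE, FALSE, FALSE, TRUE, FALSE)
)

# Number of rows in each category of smoke
smoke_counts <- table(toy_babies$smoke)
print(smoke_counts)

# geom_bar() counts the rows in each category and draws one bar per category,
# so the bar heights should match smoke_counts
toy_bar <- ggplot(toy_babies, aes(x = smoke)) +
  geom_bar()
```

Printing toy_bar draws the plot. A bar plot suits a categorical variable such as smoke because each bar is simply a count for one category.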

Histogram

  • When can we use a histogram?
  • What does this histogram convey?

Histogram

ggplot(babies)

Histogram

ggplot(babies, aes(x = bwt))

Histogram

ggplot(babies, aes(x = bwt)) +
  geom_histogram()

Histogram

Consider the height distribution in our class.

  • How would the distribution change if Michael Jordan (198.1 cm, 6’ 6’’) were to join our class?

  • How would the distribution change if Tyrion Lannister (Peter Dinklage) (135 cm, 4’ 5’’) were to join our class?

Think 💭 - Pair 👫🏽 - Share 💬

  • In right-skewed distributions mean > median, true or false?

  • In left-skewed distributions mean > median, true or false?

When data display a skewed distribution, we rely on the median rather than the mean to understand the center of the distribution.
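
A small numeric check can make this concrete. The two vectors below are made-up samples (not class data or the babies data): one has a long right tail and the other a long left tail. In these samples the single extreme value pulls the mean toward the tail, while the median barely moves.

```r
# Hypothetical right-skewed sample: mostly small values, one very large value
right_skewed <- c(1, 2, 2, 3, 3, 3, 4, 20)
mean_right   <- mean(right_skewed)    # 4.75
median_right <- median(right_skewed)  # 3

# Hypothetical left-skewed sample: mostly large values, one very small value
left_skewed <- c(1, 17, 18, 18, 18, 19, 19, 20)
mean_left   <- mean(left_skewed)      # 16.25
median_left <- median(left_skewed)    # 18

# The mean is pulled toward the long tail in both cases
print(c(mean = mean_right, median = median_right))
print(c(mean = mean_left, median = median_left))
```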

More on Histograms

There is no “best” number of bins
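
One way to explore this is to draw the same variable with different bin settings. The sketch below uses simulated values (toy_weights is hypothetical, not the babies data) and the bins and binwidth arguments of geom_histogram(). When neither is given, geom_histogram() uses 30 bins and prints a message suggesting you pick a better value.

```r
library(ggplot2)

# Hypothetical simulated birth weights in ounces; not the babies data
set.seed(1)
toy_weights <- data.frame(bwt = rnorm(200, mean = 120, sd = 18))

# Few bins: the overall shape is visible but detail is lost
hist_few <- ggplot(toy_weights, aes(x = bwt)) +
  geom_histogram(bins = 5)

# Many bins: more detail, but also more noise
hist_many <- ggplot(toy_weights, aes(x = bwt)) +
  geom_histogram(bins = 50)

# Alternatively, set the width of each bin in the units of the variable
hist_width <- ggplot(toy_weights, aes(x = bwt)) +
  geom_histogram(binwidth = 10)
```

Printing each object in turn shows how the same data can look quite different, which is why it is worth trying a few settings before settling on one.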

Exploring Histograms Visually

Take a look at these for fun.

Looking at Relationships

So far we have seen bar plots and histograms, which are useful for visualizing categorical and numerical variables, respectively.

We are often interested in looking at relationships between two variables. We have statistical tests to examine such relationships. However, visualizations can often help us explore if such relationships are worth looking into.

Standardized Bar Plots

ggplot(data = babies,
       aes(x = smoke, 
           fill = parity)) + 
  geom_bar(position = "fill")

Note that the y-axis is still labeled as a count, even though the bars now show proportions from 0 to 1. We will learn how to change the axis labels in the next lecture.
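
The proportions behind a standardized bar plot can also be computed by hand. This is a minimal sketch with a made-up data frame (toy_babies is hypothetical, not the babies data); prop.table() with margin = 1 divides each row of the count table by its row total, which mirrors how position = "fill" scales each bar to a height of 1.

```r
# Hypothetical toy data; the values are made up
toy_babies <- data.frame(
  smoke  = c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE),
  parity = c(FALSE, FALSE, FALSE, TRUE, FALSE, TRUE, TRUE, TRUE)
)

# Counts for each combination of smoke (rows) and parity (columns)
counts <- table(smoke = toy_babies$smoke, parity = toy_babies$parity)

# Proportions within each level of smoke; each row sums to 1
props <- prop.table(counts, margin = 1)
print(props)
```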

Dodged Bar Plot

ggplot(data = babies,
       aes(x = smoke, 
           fill = parity)) + 
  geom_bar(position = "dodge")

Side-by-Side Boxplots

ggplot(babies,
       aes(x = smoke,
           y = bwt))  +
  geom_boxplot() 

Understanding Each Box

  • The horizontal line in the box represents the median.
  • The box represents the middle 50% of the data with Q3 on the upper end and Q1 on the lower end.
  • Whiskers extend from the box to the most extreme data points that lie within 1.5 × IQR of the box (i.e., of Q1 and Q3), where IQR = Q3 − Q1.
  • The points beyond the whiskers are potential outliers; here they represent babies with unusually low or high birth weight.
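
The pieces of a box can be worked out numerically. The sketch below uses made-up birth weights (toy_bwt is hypothetical) and R's default quantile() method; geom_boxplot() may compute its quartiles slightly differently in edge cases, so treat this as an illustration of the idea rather than an exact reproduction of the plot.

```r
# Hypothetical birth weights in ounces; the values are made up
toy_bwt <- c(90, 100, 105, 110, 115, 120, 125, 130, 170)

q1  <- quantile(toy_bwt, 0.25, names = FALSE)  # lower edge of the box
med <- median(toy_bwt)                         # line inside the box
q3  <- quantile(toy_bwt, 0.75, names = FALSE)  # upper edge of the box
iqr <- q3 - q1

# Whiskers cannot extend beyond these limits
lower_limit <- q1 - 1.5 * iqr
upper_limit <- q3 + 1.5 * iqr

# Whiskers end at the most extreme observations inside the limits
whisker_low  <- min(toy_bwt[toy_bwt >= lower_limit])
whisker_high <- max(toy_bwt[toy_bwt <= upper_limit])

# Observations outside the limits are drawn as individual points
potential_outliers <- toy_bwt[toy_bwt < lower_limit | toy_bwt > upper_limit]
print(potential_outliers)
```

Here the box runs from 105 to 125, the limits are 75 and 155, the whiskers end at 90 and 130, and 170 is flagged as a potential outlier.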

Scatter plots

ggplot(babies,
       aes(x = gestation,
           y = bwt))  +
  geom_point()

Length of gestation can possibly eXplain a baby’s birth weight. Gestation is the eXplanatory variable and is shown on the x-axis. Birth weight is the response variable and is shown on the y-axis.

Linear Relationship

Later on we will start statistical modeling during which we will numerically define the relationship between gestation and birth weight. For now we can say that this relationship looks positive and moderate.

Meet Palmer Penguins

Data

glimpse(penguins)
Rows: 344
Columns: 8
$ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex               <fct> male, female, female, NA, female, male, female, male…
$ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…

Visualizing Three Variables

ggplot(penguins, 
       aes(x = body_mass_g, 
           y = bill_length_mm,
           color = species)) +
  geom_point()

Code Style

The tidyverse style guide has the following convention for writing ggplot2 code.

The plus sign (+) used for adding layers always has a space before it and is followed by a new line.

The new line is indented by two spaces. RStudio does this automatically for you.
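
As an illustration of this convention, the sketch below formats one plot in the recommended style. The data frame toy_penguins is a made-up stand-in with the same column names as penguins; the values are hypothetical.

```r
library(ggplot2)

# Hypothetical toy data with the same column names as penguins
toy_penguins <- data.frame(
  body_mass_g    = c(3750, 3800, 5000, 4500),
  bill_length_mm = c(39.1, 39.5, 47.5, 46.0),
  species        = c("Adelie", "Adelie", "Gentoo", "Chinstrap")
)

# Follows the convention: a space before +, then a new line indented by two spaces
styled_plot <- ggplot(toy_penguins,
                      aes(x = body_mass_g,
                          y = bill_length_mm,
                          color = species)) +
  geom_point()

# This one-line version runs too, but it is harder to read and does not follow the convention:
# ggplot(toy_penguins, aes(x = body_mass_g, y = bill_length_mm, color = species))+geom_point()
```

Both versions produce the same plot. The layout only affects how easy the code is to read and to extend with further layers.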

Practice

Using the penguins data frame, ask a question that you are interested in answering. Visualize the data to get a visual answer to the question. What is the visual telling you? Note all of this down in your lecture notes.