Describing Data with Numbers

Dr. Mine Dogucu

Getting to Know Data

Dear Mona, Which State Has the Worst Drivers?

library(fivethirtyeight)
library(tidyverse)
data("bad_drivers")

When you are given code like this in your lecture notes or assignments, run it before you start writing your own code. As we progress in the course, you’ll gain a deeper understanding of what the starter code does.

Data Frames

Data Frame

The data frame has 8 variables (state, num_drivers, perc_speeding, perc_alcohol, perc_not_distracted, perc_no_previous, insurance_premiums, losses).
The data frame has 51 cases or observations. Each case represents a US state (or District of Columbia).

Data documentation

?bad_drivers

state State

num_drivers Number of drivers involved in fatal collisions per billion miles

perc_speeding Percentage of drivers involved in fatal collisions who were speeding

perc_alcohol Percentage of drivers involved in fatal collisions who were alcohol-impaired

perc_not_distracted Percentage of drivers involved in fatal collisions who were not distracted

perc_no_previous Percentage of drivers involved in fatal collisions who had not been involved in any previous accidents

insurance_premiums Car insurance premiums ($)

losses Losses incurred by insurance companies for collisions per insured driver ($)

Source National Highway Traffic Safety Administration 2012, National Highway Traffic Safety Administration 2009 & 2012, National Association of Insurance Commissioners 2010 & 2011.

Functions for Data Frames

head(bad_drivers)

# A tibble: 6 × 8
  state      num_drivers perc_speeding perc_alcohol perc_not_distracted
  <chr>            <dbl>         <int>        <int>               <int>
1 Alabama           18.8            39           30                  96
2 Alaska            18.1            41           25                  90
3 Arizona           18.6            35           28                  84
4 Arkansas          22.4            18           26                  94
5 California        12              35           28                  91
6 Colorado          13.6            37           28                  79
# ℹ 3 more variables: perc_no_previous <int>, insurance_premiums <dbl>,
#   losses <dbl>

tail(bad_drivers)

# A tibble: 6 × 8
  state         num_drivers perc_speeding perc_alcohol perc_not_distracted
  <chr>               <dbl>         <int>        <int>               <int>
1 Vermont              13.6            30           30                  96
2 Virginia             12.7            19           27                  87
3 Washington           10.6            42           33                  82
4 West Virginia        23.8            34           28                  97
5 Wisconsin            13.8            36           33                  39
6 Wyoming              17.4            42           32                  81
# ℹ 3 more variables: perc_no_previous <int>, insurance_premiums <dbl>,
#   losses <dbl>

glimpse(bad_drivers)

Rows: 51
Columns: 8
$ state               <chr> "Alabama", "Alaska", "Arizona", "Arkansas", "Calif…
$ num_drivers         <dbl> 18.8, 18.1, 18.6, 22.4, 12.0, 13.6, 10.8, 16.2, 5.…
$ perc_speeding       <int> 39, 41, 35, 18, 35, 37, 46, 38, 34, 21, 19, 54, 36…
$ perc_alcohol        <int> 30, 25, 28, 26, 28, 28, 36, 30, 27, 29, 25, 41, 29…
$ perc_not_distracted <int> 96, 90, 84, 94, 91, 79, 87, 87, 100, 92, 95, 82, 8…
$ perc_no_previous    <int> 80, 94, 96, 95, 89, 95, 82, 99, 100, 94, 93, 87, 9…
$ insurance_premiums  <dbl> 784.55, 1053.48, 899.47, 827.34, 878.41, 835.50, 1…
$ losses              <dbl> 145.08, 133.93, 110.35, 142.39, 165.63, 139.91, 16…

ncol(bad_drivers)

[1] 8

nrow(bad_drivers)

[1] 51
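The same size checks work on any data frame. As a minimal sketch, the small data frame below copies the first three rows of two columns from the head() output above, so it runs without loading any package. The name drivers_small is made up for this illustration and is not part of the course data.

```r
# A tiny data frame built from the first three rows shown by head(bad_drivers).
# The name `drivers_small` is hypothetical; it is not part of the course data.
drivers_small <- data.frame(
  state = c("Alabama", "Alaska", "Arizona"),
  num_drivers = c(18.8, 18.1, 18.6)
)

nrow(drivers_small)   # number of cases (rows)
ncol(drivers_small)   # number of variables (columns)
dim(drivers_small)    # rows and columns together
names(drivers_small)  # variable names
```

Here dim() simply reports what nrow() and ncol() return separately.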

Getting to Know the Data Frame


glimpse(candy_rankings)

Rows: 85
Columns: 6
$ competitorname <chr> "100 Grand", "3 Musketeers", "One dime", "One quarter",…
$ chocolate      <lgl> TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, FAL…
$ fruity         <lgl> FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, …
$ sugarpercent   <dbl> 0.732, 0.604, 0.011, 0.011, 0.906, 0.465, 0.604, 0.313,…
$ pricepercent   <dbl> 0.860, 0.511, 0.116, 0.511, 0.511, 0.767, 0.767, 0.511,…
$ winpercent     <dbl> 66.97173, 67.60294, 32.26109, 46.11650, 52.34146, 50.34…

glimpse(bob_ross)

Rows: 403
Columns: 71
$ episode            <chr> "S01E01", "S01E02", "S01E03", "S01E04", "S01E05", "…
$ season             <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, …
$ episode_num        <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1, 2, 3,…
$ title              <chr> "A WALK IN THE WOODS", "MT. MCKINLEY", "EBONY SUNSE…
$ apple_frame        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ aurora_borealis    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ barn               <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ beach              <int> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, …
$ boat               <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ bridge             <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ building           <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ bushes             <int> 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, …
$ cabin              <int> 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, …
$ cactus             <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ circle_frame       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ cirrus             <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, …
$ cliff              <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ clouds             <int> 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, …
$ conifer            <int> 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, …
$ cumulus            <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, …
$ deciduous          <int> 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, …
$ diane_andre        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ dock               <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ double_oval_frame  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ farm               <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ fence              <int> 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, …
$ fire               <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ florida_frame      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ flowers            <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ fog                <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ framed             <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ grass              <int> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, …
$ guest              <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ half_circle_frame  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ half_oval_frame    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ hills              <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ lake               <int> 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, …
$ lakes              <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ lighthouse         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ mill               <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ moon               <int> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ mountain           <int> 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, …
$ mountains          <int> 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, …
$ night              <int> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ ocean              <int> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, …
$ oval_frame         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ palm_trees         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ path               <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ person             <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ portrait           <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ rectangle_3d_frame <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ rectangular_frame  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ river              <int> 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ rocks              <int> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ seashell_frame     <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ snow               <int> 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, …
$ snowy_mountain     <int> 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, …
$ split_frame        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ steve_ross         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ structure          <int> 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, …
$ sun                <int> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, …
$ tomb_frame         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ tree               <int> 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, …
$ trees              <int> 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, …
$ triple_frame       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ waterfall          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ waves              <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, …
$ windmill           <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ window_frame       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ winter             <int> 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ wood_framed        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …

glimpse(mariokart)

Rows: 143
Columns: 12
$ id          <dbl> 150377422259, 260483376854, 320432342985, 280405224677, 17…
$ duration    <int> 3, 7, 3, 3, 1, 3, 1, 1, 3, 7, 1, 1, 1, 1, 7, 7, 3, 3, 1, 7…
$ n_bids      <int> 20, 13, 16, 18, 20, 19, 13, 15, 29, 8, 15, 15, 13, 16, 6, …
$ cond        <fct> new, used, new, new, new, new, used, new, used, used, new,…
$ start_pr    <dbl> 0.99, 0.99, 0.99, 0.99, 0.01, 0.99, 0.01, 1.00, 0.99, 19.9…
$ ship_pr     <dbl> 4.00, 3.99, 3.50, 0.00, 0.00, 4.00, 0.00, 2.99, 4.00, 4.00…
$ total_pr    <dbl> 51.55, 37.04, 45.50, 44.00, 71.00, 45.00, 37.02, 53.99, 47…
$ ship_sp     <fct> standard, firstClass, firstClass, standard, media, standar…
$ seller_rate <int> 1580, 365, 998, 7, 820, 270144, 7284, 4858, 27, 201, 4858,…
$ stock_photo <fct> yes, yes, no, yes, yes, yes, yes, yes, yes, no, yes, yes, …
$ wheels      <int> 1, 1, 1, 1, 2, 0, 0, 2, 1, 1, 2, 2, 2, 2, 1, 0, 1, 1, 2, 2…
$ title       <fct> "~~ Wii MARIO KART &amp; WHEEL ~ NINTENDO Wii ~ BRAND NEW …

Variables

Variables sugarpercent, pricepercent, and winpercent are numerical variables.

We can do certain analyses on these variables such as finding an average winpercent or the maximum or minimum winpercent.

Note: Not everything represented by numbers is a numeric variable. e.g. Student ID number is not a numeric variable.

Variables

Variables such as competitorname, chocolate, and fruity are categorical variables.

We cannot take averages or find the maximum or minimum of these variables.

Categorical variables have levels. For instance, chocolate and fruity each have two levels: TRUE and FALSE.

Categorical Variables

If the levels of a categorical variable have a natural ordering, then it is called an ordinal variable.

e.g. the variable scholarship_status might have three levels: no scholarship, partial scholarship, and full scholarship. We can order these levels from less to more or vice versa.

If there is no ordering then a categorical variable would be called a nominal variable. e.g. state names.
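One way to represent this distinction in R is with ordered and unordered factors. The sketch below is only an illustration: the five scholarship_status values are hypothetical, and the three state names are taken from the bad_drivers output.

```r
# Hypothetical scholarship_status values for five students (ordinal).
scholarship_status <- factor(
  c("partial scholarship", "no scholarship", "full scholarship",
    "no scholarship", "partial scholarship"),
  levels = c("no scholarship", "partial scholarship", "full scholarship"),
  ordered = TRUE
)

levels(scholarship_status)               # levels in their stated order
scholarship_status < "full scholarship"  # comparisons follow that order
max(scholarship_status)                  # highest level observed

# A nominal variable: state names have levels but no meaningful order.
state <- factor(c("Alabama", "Alaska", "Arizona"))
is.ordered(state)                        # FALSE
```

Comparisons such as < are meaningful only for the ordered factor; R does not treat the levels of an unordered factor as ranked.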

`candy_rankings` vs `bob_ross`

False - 0
True - 1
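In the glimpse() output, candy_rankings stores its yes/no variables as logical values (TRUE/FALSE), while bob_ross stores them as integers (0/1). A short sketch of how the two codings relate, using the first four chocolate values shown earlier:

```r
# The first four chocolate values shown in glimpse(candy_rankings).
chocolate <- c(TRUE, TRUE, FALSE, FALSE)

as.integer(chocolate)  # converts TRUE to 1 and FALSE to 0
sum(chocolate)         # number of TRUE values
mean(chocolate)        # proportion of TRUE values
```

Because R treats TRUE as 1 and FALSE as 0 in arithmetic, sum() counts the TRUE values and mean() gives their proportion.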

Numeric Variables

Consider a variable n_kids which represents the number of kids somebody has. This variable can take the values 0, 1, 2, …. Notice that it can take only integer values. This variable is said to be discrete because its possible values can be counted (listed one by one).

Numeric variables that can take infinitely many numbers are said to be continuous. Consider somebody’s height in cm. This is a continuous variable. Even though we might say somebody is 173 cm, in reality the height could be 170.612476314631 cm. So height can take infinitely many values.

In Statistics

(Some) Variable Types in R

character: takes string values (e.g. a person’s name, address)

integer: whole numbers (stored in 32 bits)

double: floating-point numbers (double precision)

numeric: integer or double

factor: categorical variables with different levels

logical: TRUE (1), FALSE (0)

As a data scientist it is your job to check the type(s) of data that you are working with. Do not assume you will work with clean data frames, with clean names, labels, and types.
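A quick way to check types is with class(), typeof(), and the is.*() functions. The values below are hypothetical and exist only to show one object of each type listed above.

```r
# Hypothetical values, one per type discussed above.
person_name <- "Mona"                       # character
n_kids      <- 2L                           # integer (the L suffix requests an integer)
height_cm   <- 170.6                        # double
is_student  <- TRUE                         # logical
cond        <- factor(c("new", "used", "new"))  # factor with two levels

class(person_name)    # "character"
class(n_kids)         # "integer"
class(height_cm)      # "numeric"
typeof(height_cm)     # "double"
class(is_student)     # "logical"
class(cond)           # "factor"

is.numeric(n_kids)    # TRUE: integers count as numeric
is.numeric(height_cm) # TRUE: doubles count as numeric
levels(cond)          # the levels of the factor
```

Note that class() reports a double as "numeric", while typeof() reports the underlying storage type.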

Describing Data with Numbers

Data

glimpse(hotel_bookings)

Rows: 1,000
Columns: 32
$ hotel                          <fct> City Hotel, Resort Hotel, Resort Hotel,…
$ is_canceled                    <fct> 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, …
$ lead_time                      <dbl> 1, 19, 9, 110, 329, 212, 14, 58, 88, 37…
$ arrival_date_year              <dbl> 2015, 2016, 2017, 2016, 2017, 2017, 201…
$ arrival_date_month             <fct> September, March, August, November, Jul…
$ arrival_date_week_number       <dbl> 40, 12, 31, 46, 30, 35, 5, 41, 20, 14, …
$ arrival_date_day_of_month      <dbl> 30, 19, 1, 11, 27, 31, 29, 3, 17, 5, 22…
$ stays_in_weekend_nights        <dbl> 0, 2, 0, 0, 0, 2, 2, 2, 0, 0, 0, 0, 1, …
$ stays_in_week_nights           <dbl> 2, 4, 4, 1, 2, 8, 0, 5, 3, 3, 1, 1, 1, …
$ adults                         <dbl> 1, 2, 2, 2, 2, 2, 1, 2, 2, 2, 1, 2, 1, …
$ children                       <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, …
$ babies                         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ meal                           <fct> BB, BB, HB, BB, SC, BB, BB, BB, SC, BB,…
$ country                        <fct> PRT, PRT, ESP, BRA, GBR, GBR, PRT, DEU,…
$ market_segment                 <fct> Offline TA/TO, Online TA, Online TA, On…
$ distribution_channel           <fct> TA/TO, TA/TO, TA/TO, TA/TO, TA/TO, TA/T…
$ is_repeated_guest              <dbl> 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, …
$ previous_cancellations         <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ previous_bookings_not_canceled <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, …
$ reserved_room_type             <fct> A, D, A, A, A, A, A, A, A, A, A, A, A, …
$ assigned_room_type             <fct> A, D, A, A, A, A, D, A, A, A, A, A, A, …
$ booking_changes                <dbl> 0, 0, 0, 2, 0, 1, 0, 0, 0, 0, 0, 0, 1, …
$ deposit_type                   <fct> Non Refund, No Deposit, No Deposit, No …
$ agent                          <fct> 50, 240, 241, 314, 9, 143, 240, 240, 9,…
$ company                        <fct> NULL, NULL, NULL, NULL, NULL, NULL, NUL…
$ days_in_waiting_list           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ customer_type                  <fct> Transient, Transient, Transient, Transi…
$ average_daily_rate             <dbl> 98.10, 70.17, 193.40, 36.24, 89.10, 89.…
$ required_car_parking_spaces    <dbl> 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, …
$ total_of_special_requests      <dbl> 0, 1, 1, 0, 1, 0, 2, 1, 0, 0, 1, 0, 1, …
$ reservation_status             <fct> Canceled, Canceled, Check-Out, Check-Ou…
$ reservation_status_date        <date> 2015-09-29, 2016-03-02, 2017-08-05, 20…

Data

Rows: 1,000
Columns: 2
$ lead_time          <dbl> 1, 19, 9, 110, 329, 212, 14, 58, 88, 373, 36, 60, 5…
$ reservation_status <fct> Canceled, Canceled, Check-Out, Check-Out, Check-Out…

What kind of variables are these two?

Categorical data are summarized with counts or proportions

count(hotel_bookings, reservation_status)

# A tibble: 3 × 2
  reservation_status     n
  <fct>              <int>
1 Canceled             357
2 Check-Out            634
3 No-Show                9

count(hotel_bookings, reservation_status, sort = TRUE)

# A tibble: 3 × 2
  reservation_status     n
  <fct>              <int>
1 Check-Out            634
2 Canceled             357
3 No-Show                9

library(janitor)  # provides tabyl()
tabyl(hotel_bookings, reservation_status)

 reservation_status   n percent
           Canceled 357   0.357
          Check-Out 634   0.634
            No-Show   9   0.009
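Counts and proportions can also be obtained in base R with table() and prop.table(). Since the hotel_bookings data may not be loaded, this sketch rebuilds a stand-in vector from the counts reported above (357, 634, and 9); it is not the original data frame.

```r
# Stand-in vector rebuilt from the counts reported above: 357, 634, 9.
reservation_status <- rep(
  c("Canceled", "Check-Out", "No-Show"),
  times = c(357, 634, 9)
)

counts <- table(reservation_status)  # count of each level
counts
prop.table(counts)                   # proportions; they sum to 1
```

The proportions are each count divided by the total of 1,000 bookings, matching the percent column from tabyl().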

Summarizing Numerical Data

Consider the following data which represents the number of hours slept for 10 people who were surveyed.

7, 7.5, 8, 5.5, 10, 7.2, 7, 8, 9, 8

Mean

\[\bar x = \frac{7+7.5+8+5.5+10+7.2+7+8+9+8}{10} = 7.72\]

The mean is calculated by summing the observed values and then dividing by the number of observations.

\[\bar x = \frac{x_1 + x_2 + \dots + x_n}{n}\]

where $\bar x$ represents the mean of the observed values and $x_1, x_2, \dots, x_n$ represent the $n$ observed values.
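The formula can be checked in R with the ten sleep values from the calculation above. The vector name hours_slept is chosen here for illustration.

```r
# Hours slept by the 10 people surveyed (values from the calculation above).
hours_slept <- c(7, 7.5, 8, 5.5, 10, 7.2, 7, 8, 9, 8)

n <- length(hours_slept)   # number of observations
sum(hours_slept) / n       # the formula: sum of values divided by n
mean(hours_slept)          # the built-in function gives the same result
```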

Median

If all the observations are listed from smallest to largest (or vice versa), the median is the observation that falls in the middle.

5.5, 7, 7, 7.2, 7.5, 8, 8, 8, 9, 10

In this case, we have two numbers in the middle: 7.5 and 8. The average of these two numbers is the median, so the median is 7.75.

\[\frac{7.5 + 8}{2} = 7.75\]

The median is also the 50th percentile, indicating that 50% of the data fall below this value.
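As a sketch, the same steps can be carried out in R with the ten sleep values from the mean calculation. The names hours_slept, sorted_hours, and middle_two are illustrative.

```r
# Hours slept by the 10 people surveyed.
hours_slept <- c(7, 7.5, 8, 5.5, 10, 7.2, 7, 8, 9, 8)

sorted_hours <- sort(hours_slept)   # list the values from smallest to largest
sorted_hours

n <- length(sorted_hours)           # n is even, so there are two middle values
middle_two <- sorted_hours[c(n / 2, n / 2 + 1)]
middle_two
mean(middle_two)                    # average of the two middle values

median(hours_slept)                 # the built-in function agrees
```

With an odd number of observations there would be a single middle value, and median() would return it directly.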

Q1, Q3, and Interquartile Range

The first quartile (Q1) is the value below which 25% of the data fall.

The third quartile (Q3) is the value below which 75% of the data fall.

Q1 and Q3 can be considered the 25th and 75th percentiles, respectively.

Interquartile Range (IQR) = Q3 - Q1, which spans the middle 50% of the data.
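For the ten sleep values, the quartiles and the IQR can be sketched in R as follows. Note that quantile() interpolates between observations by default, and other textbooks or software may use a slightly different rule, so hand-computed quartiles can differ a little from these.

```r
# Hours slept by the 10 people surveyed.
hours_slept <- c(7, 7.5, 8, 5.5, 10, 7.2, 7, 8, 9, 8)

q1 <- quantile(hours_slept, 0.25)  # first quartile (25th percentile)
q3 <- quantile(hours_slept, 0.75)  # third quartile (75th percentile)
q1
q3

q3 - q1                            # IQR computed from its definition
IQR(hours_slept)                   # built-in function, same default rule
```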

In Pairs: Discuss

Consider Dr. Dogucu teaching three classes. All of these classes have 5 students. Below are exam results from these classes.

Class 1: 80 80 80 80 80
Class 2: 76 78 80 82 84
Class 3: 60 70 80 90 100

All of these classes have a mean of 80 points. Do you think the mean describes these classes well? Can you think of any other way to describe (in words not in numbers) how these classes differ?

Standard deviation and Variance

$x_i$	$x_i - \bar{x}$	$(x_i - \bar{x})^2$
5.5	5.5-7.72 = -2.22 hr	(-2.22 hr)² = 4.9284 hr ²
7	7-7.72 = -0.72 hr	(-0.72 hr)² = 0.5184 hr ²
7	7-7.72 = -0.72 hr	(-0.72 hr)² = 0.5184 hr ²
7.2	7.2-7.72 = -0.52 hr	(-0.52 hr)² = 0.2704 hr ²
7.5	7.5-7.72 = -0.22 hr	(-0.22 hr)² = 0.0484 hr ²
8	8-7.72 = 0.28 hr	(0.28 hr)² = 0.0784 hr ²
8	8-7.72 = 0.28 hr	(0.28 hr)² = 0.0784 hr ²
8	8-7.72 = 0.28 hr	(0.28 hr)² = 0.0784 hr ²
9	9-7.72 = 1.28 hr	(1.28 hr)² = 1.6384 hr ²
10	10-7.72 = 2.28 hr	(2.28 hr)² = 5.1984 hr ²

Total squared distance from the mean

$\Sigma_{i = 1}^{n} (x_i - \bar x )^2 =$

$4.9284 + 0.5184 + 0.5184 + 0.2704 + 0.0484 +$ $0.0784 + 0.0784 + 0.0784+ 1.6384 + 5.1984 = 13.356 \text{ hr}^2$

Note that $n$ represents the number of observations which means $n = 10$.

Sample variance

\[s^2 = \frac{\Sigma_{i = 1}^{n} (x_i - \bar x )^2}{n-1}\]

\[s^2= \frac{13.356}{10-1} = 1.484\text{ hr}^2\]

Sample standard deviation

\[s = \sqrt{\frac{\Sigma_{i = 1}^{n} (x_i - \bar x )^2}{n-1}}\]

\[s= \sqrt{1.484} = 1.218195 \text{ hr}\]
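The hand calculation can be reproduced step by step in R. This is a sketch with illustrative names (deviations, total_sq, s2, s); var() and sd() use the same n - 1 denominator as the formulas above.

```r
# Hours slept by the 10 people surveyed.
hours_slept <- c(7, 7.5, 8, 5.5, 10, 7.2, 7, 8, 9, 8)

deviations <- hours_slept - mean(hours_slept)  # x_i minus the mean
total_sq   <- sum(deviations^2)                # total squared distance from the mean
total_sq

n  <- length(hours_slept)
s2 <- total_sq / (n - 1)   # sample variance
s  <- sqrt(s2)             # sample standard deviation
s2
s

var(hours_slept)           # built-in sample variance
sd(hours_slept)            # built-in sample standard deviation
```

Taking the square root returns the measure of spread to the original unit (hours) rather than squared hours.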

Mean

summarize(hotel_bookings, 
          mean(lead_time))

# A tibble: 1 × 1
  `mean(lead_time)`
              <dbl>
1              105.

mean(hotel_bookings$lead_time)

[1] 105.065

Median

summarize(hotel_bookings, 
          median(lead_time))

# A tibble: 1 × 1
  `median(lead_time)`
                <dbl>
1                69.5

median(hotel_bookings$lead_time)

[1] 69.5

Minimum

summarize(hotel_bookings, 
          min(lead_time))

# A tibble: 1 × 1
  `min(lead_time)`
             <dbl>
1                0

min(hotel_bookings$lead_time)

[1] 0

In a similar fashion, the maximum can be found by using the max() function.

Standard deviation

summarize(hotel_bookings, 
          sd(lead_time))

# A tibble: 1 × 1
  `sd(lead_time)`
            <dbl>
1            106.

sd(hotel_bookings$lead_time)

[1] 106.4455

Variance

summarize(hotel_bookings, 
          var(lead_time))

# A tibble: 1 × 1
  `var(lead_time)`
             <dbl>
1           11331.

var(hotel_bookings$lead_time)

[1] 11330.64

Quantiles / Percentiles / Quartiles

Quantile	Percentile	Special Name
0.25	25th	First quartile
0.5	50th	Median
0.75	75th	Third quartile

Quantiles

summarize(hotel_bookings, quantile(lead_time, c(0.25, 0.50, 0.75)))

# A tibble: 3 × 1
  `quantile(lead_time, c(0.25, 0.5, 0.75))`
                                      <dbl>
1                                      19  
2                                      69.5
3                                     157

We would expect 25% of the data to be less than 19.

We can get multiple summaries with one summarize() function.

summarize(hotel_bookings,
          mean(lead_time),
          median(lead_time))

# A tibble: 1 × 2
  `mean(lead_time)` `median(lead_time)`
              <dbl>               <dbl>
1              105.                69.5

Note that the variable names in this table are not easy to read.

In order to display the variable names more legibly in the output, we can assign variable names to numerical summaries (e.g. mean_lead_time).

summarize(hotel_bookings,
          mean_lead_time = mean(lead_time),
          med_lead_time = median(lead_time))

# A tibble: 1 × 2
  mean_lead_time med_lead_time
           <dbl>         <dbl>
1           105.          69.5
