Simple Linear Regression, Sampling, and Study Design

Dr. Mine Dogucu

Data: babies in the openintro package

glimpse(babies)
Rows: 1,236
Columns: 8
$ case      <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
$ bwt       <int> 120, 113, 128, 123, 108, 136, 138, 132, 120, 143, 140, 144, …
$ gestation <int> 284, 282, 279, NA, 282, 286, 244, 245, 289, 299, 351, 282, 2…
$ parity    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ age       <int> 27, 33, 28, 36, 23, 25, 33, 23, 25, 30, 27, 32, 23, 36, 30, …
$ height    <int> 62, 64, 64, 69, 67, 62, 62, 65, 62, 66, 68, 64, 63, 61, 63, …
$ weight    <int> 100, 135, 115, 190, 125, 93, 178, 140, 125, 136, 120, 124, 1…
$ smoke     <int> 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, …

Baby Weights

ggplot(babies,
       aes(x = gestation, y = bwt)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

lm stands for linear model
se stands for standard error

y   Response      Birth weight   Numeric
x   Explanatory   Gestation      Numeric

Linear Equations Review

Recall from your previous math classes

\(y = mx + b\)

where \(m\) is the slope and \(b\) is the y-intercept

e.g. \(y = 2x -1\)

Notice anything different between the baby weights plot and this one?

Math class

\(y = b + mx\)

\(b\) is y-intercept
\(m\) is slope

Stats class

\(y_i = \beta_0 +\beta_1x_i + \epsilon_i\)

\(\beta_0\) is y-intercept
\(\beta_1\) is slope
\(\epsilon_i\) is the error (its estimate from the fitted line is the residual \(e_i\))
\(i = 1, 2, \ldots, n\) is the identifier for each point

model_g <- lm(bwt ~ gestation, data = babies)

lm stands for linear model. We are fitting a linear regression model. Note that the variables are entered in y ~ x order.

tidy(model_g)
# A tibble: 2 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)  -10.1      8.32       -1.21 2.27e- 1
2 gestation      0.464    0.0297     15.6  3.22e-50

\(\hat {y}_i = b_0 + b_1 x_i\)

\(\hat {\text{bwt}_i} = b_0 + b_1 \text{ gestation}_i\)

\(\hat {\text{bwt}_i} = -10.1 + 0.464\text{ gestation}_i\)

Expected bwt for a baby with 300 days of gestation

\(\hat {\text{bwt}_i} = -10.1 + 0.464\text{ gestation}_i\)

\(\hat {\text{bwt}} = -10.1 + 0.464 \times 300\)

\(\hat {\text{bwt}} =\) 129.1

For a baby with 300 days of gestation the expected birth weight is 129.1 ounces.

Interpretation of estimates

\(b_1 = 0.464\) which means that for a one-unit (one-day) increase in gestation period, the expected increase in birth weight is 0.464 ounces.

\(b_0 = -10.1\) which means that for a gestation period of 0 days the expected birth weight is -10.1 ounces (this does NOT make sense!)


  • There is no such thing as 0 days of gestation.
  • Birth weight cannot possibly be -10.1 ounces.
  • Extrapolation happens when we use a model outside the range of the observed x-values. After all, we cannot really know how the relationship behaves (e.g. it may be non-linear) outside the scope of what we have observed.

Baby number 148

babies %>% 
  filter(case == 148) %>% 
  select(bwt, gestation)
# A tibble: 1 × 2
    bwt gestation
  <int>     <int>
1   160       300

Baby #148


\(\hat y_{148} = b_0 +b_1x_{148}\)

\(\hat y_{148} = -10.1 + 0.464\times300\)

\(\hat y_{148}\) = 129.1


\(y_{148} =\) 160

Residual for i = 148

\(y_{148} = 160\)

\(\hat y_{148}\) = 129.1

\(e_{148} = y_{148} - \hat y_{148}\)

\(e_{148} =\) 30.9
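
The fitted value and residual for baby 148 can be reproduced with a few lines of R. This is a minimal sketch that plugs in the rounded estimates from the regression output; the names b0, b1, and predict_bwt are illustrative and do not come from the original code.

```r
# Rounded estimates copied from the regression output
b0 <- -10.1   # intercept
b1 <- 0.464   # slope for gestation

# Hypothetical helper: expected birth weight (ounces) for a gestation in days
predict_bwt <- function(gestation) b0 + b1 * gestation

y_hat_148 <- predict_bwt(300)   # fitted value for baby 148
e_148 <- 160 - y_hat_148        # residual = observed - fitted

y_hat_148
e_148
```

With the fitted model object available, predict(model_g, newdata = data.frame(gestation = 300)) would use the unrounded coefficients, so its result may differ slightly from the hand calculation.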

Least Squares Regression

The goal is to minimize

\[e_1^2 + e_2^2 + ... + e_n^2\]

which can be rewritten as

\[\sum_{i = 1}^n e_i^2\]
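
To see what "minimize" means here, the sketch below compares the sum of squared residuals of the least squares line with that of another candidate line. The data are a small made-up example (not from babies), and the helper name sse is an assumption for illustration.

```r
# Hypothetical toy data (not from the babies dataset)
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Sum of squared residuals for a line with intercept b0 and slope b1
sse <- function(b0, b1) sum((y - (b0 + b1 * x))^2)

fit <- lm(y ~ x)
b <- unname(coef(fit))

sse_ls <- sse(b[1], b[2])   # least squares line found by lm()
sse_other <- sse(0, 2)      # an arbitrary competing line, y = 2x

sse_ls
sse_other
sse_ls <= sse_other
```

Any other choice of intercept and slope passed to sse() should give a value at least as large as sse_ls.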

Conditions for Least Squares Regression

  • Linearity

  • Normality of Residuals

  • Constant Variance

  • Independence



[Figure: residual distributions, one nearly normal and one not normal]

[Figure: residuals with constant variance and with non-constant variance]


Independence is harder to check because we need to know how the data were collected.

In the description of the dataset it says [a study] considered all pregnancies between 1960 and 1967 among women in the Kaiser Foundation Health Plan in the San Francisco East Bay area.

It is possible that babies born in the same hospital may have similar birth weight.

Correlated data examples: patients within hospitals, students within schools, people within neighborhoods, time-series data.

Inference: Confidence Interval (theoretical)

confint(model_g)
                  2.5 %    97.5 %
(Intercept) -26.3915884 6.2632199
gestation     0.4059083 0.5226169

Note that the 95% confidence interval for the slope does not contain zero and all the values in the interval are positive, indicating a significant positive relationship between gestation and birth weight.
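
The interval for the slope can be approximated by hand as estimate ± critical value × standard error. The sketch below uses the rounded estimate and standard error from the regression output and, as a simplifying assumption, a normal critical value instead of the t value that R uses, so the result is close to but not identical to the interval reported for gestation.

```r
b1 <- 0.464      # slope estimate from the regression output
se_b1 <- 0.0297  # its standard error

# Normal critical value for 95% confidence (an approximation to the t value)
z_star <- qnorm(0.975)

ci <- b1 + c(-1, 1) * z_star * se_b1
ci
```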

y   Response      Birth weight   Numeric
x   Explanatory   Smoke          Categorical


\(y_i = \beta_0 +\beta_1x_i + \epsilon_i\)

\(\beta_0\) is y-intercept
\(\beta_1\) is slope
\(\epsilon_i\) is the error (its estimate from the fitted line is the residual \(e_i\))
\(i = 1, 2, \ldots, n\) is the identifier for each point

model_s <- lm(bwt ~ smoke, data = babies)
tidy(model_s)
# A tibble: 2 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)   123.       0.649    190.   0       
2 smoke          -8.94     1.03      -8.65 1.55e-17

\(\hat {y}_i = b_0 + b_1 x_i\)

\(\hat {\text{bwt}_i} = b_0 + b_1 \text{ smoke}_i\)

\(\hat {\text{bwt}_i} = 123 + (-8.94\text{ smoke}_i)\)

Expected bwt for a baby with a non-smoker mother

\(\hat {\text{bwt}_i} = 123 + (-8.94\text{ smoke}_i)\)

\(\hat {\text{bwt}_i} = 123 + (-8.94\times 0)\)

\(\hat {\text{bwt}_i} = 123\)

\(E[\text{bwt}_i \mid \text{smoke}_i = 0] = b_0\)

Expected bwt for a baby with a smoker mother

\(\hat {\text{bwt}_i} = 123 + (-8.94\text{ smoke}_i)\)

\(\hat {\text{bwt}_i} = 123 + (-8.94\times 1)\)

\(\hat {\text{bwt}_i} = 114.06\)

\(E[\text{bwt}_i \mid \text{smoke}_i = 1] = b_0 + b_1\)

confint(model_s)
                2.5 %     97.5 %
(Intercept) 121.77391 124.320430
smoke       -10.96413  -6.911199

Note that the confidence interval for the “slope” does not contain 0 and all the values in the interval are negative, indicating that babies of mothers who smoke have a significantly lower expected birth weight.
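
With a 0/1 explanatory variable, the intercept is the mean of the group coded 0 and the "slope" is the difference between the two group means. The sketch below checks this on a tiny made-up data frame; toy and its values are hypothetical and are not taken from babies.

```r
# Hypothetical toy data: smoke is coded 0/1, bwt is in ounces
toy <- data.frame(
  smoke = c(0, 0, 0, 1, 1, 1),
  bwt   = c(120, 125, 130, 110, 115, 120)
)

fit <- lm(bwt ~ smoke, data = toy)
coef(fit)

# Group means for comparison with the coefficients
mean_0 <- mean(toy$bwt[toy$smoke == 0])
mean_1 <- mean(toy$bwt[toy$smoke == 1])
c(intercept = mean_0, slope = mean_1 - mean_0)
```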

Understanding Relationships

  • Just because we observe a significant relationship between \(x\) and \(y\), it does not mean that \(x\) causes \(y\).

  • Just because we observe a significant relationship in a sample, it does not mean that the findings will generalize to the population.

  • For these we need to understand sampling and study design.


Research question

Every research project aims to answer a research question (or multiple questions).


Do UCI students who exercise regularly have higher GPA?

We will use this research question throughout the examples in the lecture.


Each research question aims to examine a population.


The population for this research question is UCI students.

Data Collection

Data are collected to answer research questions. There are different methods to collect data. For instance, data can be collected

  • in-person or online (if collecting from human subjects)
  • on-site or off-site (e.g. rainfall measures vs. moon image tracking)
  • with different tools such as surveys, motion sensors (e.g. marathon finish lines)

Data Collection - Ethics

When collecting data from human and animal research subjects we need to consider ethics.

In universities, the rights of human and animal research subjects are protected by the Institutional Review Board (IRB) of each university. If interested (highly recommended), you can read about UCI’s Institutional Review Board.


Consider that we design a survey with the following questions to study the research question.

  • Do you exercise at least once every week?

  • What is your GPA?


A population is a collection of elements which the research question aims to study. However, it is often costly and sometimes impossible to study the whole population. Often a subset of the population is selected to be studied. A sample is the subset of the population that is studied. The goal is to have a sample that is representative of the population so that the findings of the study can generalize to the population.


Since it would be almost impossible to give the survey to ALL UCI students, we can give it to a sample of students.

There are different sampling methods to consider.

Convenience (Availability) Sampling

Convenience sampling occurs when a specific sample is selected because the sample is easy to access.


  • Stand in front of Langson Library
  • Give the survey to 100 UCI students

This could introduce (sampling) bias and the findings may not generalize to the population. It is possible that those in front of the library

  • may study more and thus may have higher GPA.
  • may be more active than those who study at home/dorm.

Additional Example

A scientist is interested in counting the number of different species of bacteria in San Diego Creek. She takes a bucket of water from San Diego Creek where she happens to be standing and counts the different species of bacteria. The bacteria in the bucket make up the sample and the bacteria in San Diego Creek make up the population. The scientist is using the convenience sampling method.

Simple Random Sample

When the simple random sampling technique is used, every element of the population has an equal chance of being selected into the sample.


The researcher can

  • reach out to the registrar to get student emails;
  • randomly select 100 students;
  • email them the survey.

Assume that the 100 selected students respond.

Population: All UCI students
Sample: 100 students who have responded

Simple Random Sampling in R

sample(1:100, 3, replace = FALSE)
[1] 56 71 82

This allows us to sample 3 numbers from 1 to 100 without replacement, meaning a number can only be selected once.

To generalize:

sample(x = 1:N, size = n, replace = FALSE)

This code will take a random sample of size \(n\) from the population consisting of the integers from 1 to \(N\).

Side note: This is not truly random but that’s beyond the scope of this class. Here is a fun (short) reading about it. Philosophers also discuss if true randomness exists or not.
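
Because the numbers are generated by an algorithm, setting a seed makes a sample reproducible. The sketch below draws a simple random sample of rows from a made-up roster; roster, its size of 500, and the seed value are assumptions for illustration only.

```r
set.seed(2024)  # arbitrary seed, chosen only so the draw can be repeated

# Hypothetical roster of N = 500 student IDs
roster <- data.frame(id = 1:500)

# Draw n = 100 row numbers without replacement, then keep those rows
picked <- sample(x = 1:nrow(roster), size = 100, replace = FALSE)
srs <- roster[picked, , drop = FALSE]

nrow(srs)               # number of students in the sample
anyDuplicated(srs$id)   # 0 means no student was selected twice
```

Running the code again with the same seed gives the same 100 students; changing or removing the seed gives a different sample.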

Non-response Bias

Even when simple random sampling is used, if participants are unwilling to participate in studies then the results can have nonresponse bias.


It is unlikely that 100 students will respond. Assume that 86 respond.

It is possible that those 14 who did not respond

  • may be busy exercising and did not have the time to respond.
  • may be busy studying and did not have the time to respond.

Additional Example

A social media company shows a survey to some of its users on the timeline. Many users ignore the survey and do not take it. There is a high non-response rate and thus the results cannot be generalized to the population.

Cluster Sampling

In cluster sampling the population is divided into groups (i.e., clusters). The sample consists of elements in randomly selected clusters.


The researchers may get a list of classes taught at UCI. They randomly select 10 classes. All the students in those 10 classes will be in the sample.
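
A minimal sketch of this cluster sample in R, assuming a made-up enrollment list in which every student belongs to exactly one class (a simplification, since real students take several classes). The names enrollment, class_id, and student_id are hypothetical.

```r
set.seed(1)  # arbitrary seed for reproducibility

# Hypothetical enrollment list: 50 classes with 20 students each
enrollment <- data.frame(
  class_id   = rep(1:50, each = 20),
  student_id = 1:1000
)

# Randomly select 10 classes (clusters), then keep every student in them
chosen_classes <- sample(unique(enrollment$class_id), size = 10)
cluster_sample <- enrollment[enrollment$class_id %in% chosen_classes, ]

length(unique(cluster_sample$class_id))   # number of clusters in the sample
nrow(cluster_sample)                      # number of students in the sample
```

Note that the randomness is applied to the classes, not to individual students.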

Stratified Sampling

In stratified sampling the population is first divided into groups (i.e., strata) and then the sample is selected randomly within each stratum.


The researchers suspect that exercising patterns might be different across different class years. Thus they want to make sure that the sample includes first-years, sophomores, juniors, and seniors. They get a list of students with class year information from the registrar. They then randomly select 25 students who are first years, 25 sophomores, 25 juniors, and 25 seniors.
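
A minimal sketch of this stratified sample in R, assuming a made-up registrar list with 100 students per class year. The names students, id, and year are hypothetical.

```r
set.seed(1)  # arbitrary seed for reproducibility

# Hypothetical registrar list: 400 students, 100 per class year
students <- data.frame(
  id   = 1:400,
  year = rep(c("first-year", "sophomore", "junior", "senior"), each = 100)
)

# Within each class year (stratum), randomly select 25 row numbers
rows_by_year <- split(seq_len(nrow(students)), students$year)
picked <- unlist(lapply(rows_by_year, function(rows) sample(rows, size = 25)))
stratified_sample <- students[picked, ]

table(stratified_sample$year)
```

Unlike cluster sampling, every stratum appears in the sample and the random selection happens inside each one.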

Study Design

Anecdotal Evidence

Anecdotal evidence is an observation that is haphazard rather than systematic.


We might meet a junior student who got 100 points on all UCI exams, homework assignments, and quizzes that they have taken and they say that they exercise regularly. Even though the data are factually correct (i.e., high GPA and regular exercise routine), this does not

Anecdotal evidence is not a scientific method to answer research questions. We need rigorously designed studies to make generalizations and/or to establish causal relationships.

Observational Study

In observational studies, researchers study the research question without exposing the cases (or subset of a sample) to any treatment or intervention. In observational studies causal relationships between variables cannot be established.


Based on the survey, even if we observe that UCI students who exercise regularly have higher GPA, we cannot conclude that exercising regularly increases GPA.

Relationship between two variables

If two variables are related to each other in some way we would call them associated.

If two variables are not related to each other in any way we would call them independent.

When we examine the relationship between two variables, we often want to know if the relationship between them is causal. In other words, does one variable cause the other? For instance, is exercising the reason for higher GPA? We don’t know!

When we suspect that two variables have a causal relationship we can say

The explanatory variable (e.g. exercising) might causally affect the response variable (e.g. GPA).

A relationship between two variables does not imply that one causes the other.

Explanatory variables are denoted by \(x\) and the response variable is denoted by \(y\). You can remember this from eXplanatory variable is \(x\). Exercising may eXplain high GPA.

A confounding variable (e.g. time management skills) is correlated with both the explanatory and the response variable.
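
A small simulation can show how a confounding variable creates an association without causation. In this hypothetical sketch, z plays the role of the confounder; x and y each depend on z, but y does not depend on x at all. All variable names and numbers are made up for illustration.

```r
set.seed(42)  # arbitrary seed for reproducibility
n <- 1000

# Hypothetical simulation: z is the confounder (e.g. time management skills)
z <- rnorm(n)
x <- z + rnorm(n)   # explanatory variable depends on z
y <- z + rnorm(n)   # response depends on z only, NOT on x

cor_xy <- cor(x, y)                    # x and y are still associated
slope_xy <- coef(lm(y ~ x))[["x"]]     # fitted slope of y on x

cor_xy
slope_xy
```

Both quantities come out clearly positive even though x has no effect on y in the simulation, which is why an observed relationship alone cannot establish causation.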

Experiment Design

In experiments, researchers assign cases to treatments/interventions.

In randomized experiments, researchers randomly assign cases to treatments/interventions. In order to establish a causal link between variables, we need randomized experiments.


Do UCI students who exercise regularly have higher GPA?

Does exercising regularly increase GPA for UCI students?

Image Copyright Derenik Haghverdian. Used with permission


Random sampling and random assignment (i.e., random allocation) serve different purposes.

Random sampling

  • method of choosing sample from the population

  • the goal is to establish generalizability

Random allocation

  • method of assigning the sample to different treatment groups

  • the goal is to establish causality.
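
Random allocation can be sketched in R by shuffling group labels. The example assumes a sample of 100 participants has already been selected; participants and the group labels "exercise" and "control" are hypothetical names.

```r
set.seed(7)  # arbitrary seed for reproducibility

# Hypothetical sample of 100 participants (already selected from the population)
participants <- data.frame(id = 1:100)

# Shuffle 50 "exercise" and 50 "control" labels and attach them to participants
participants$group <- sample(rep(c("exercise", "control"), each = 50))

table(participants$group)
```

Here sample() is used to assign treatments, whereas in random sampling it is used to choose who enters the study.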


A doctor has developed a drug called drug i.d.s. to treat some disease. She wants to know if patients who take drug i.d.s. are free of the disease for at least a year.

The doctor suspects that the drug may affect adults and kids differently.

If researchers suspect that an additional variable may influence the response variable, then they may use blocks.
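
With blocks, the random assignment is carried out separately within each block. The sketch below assumes made-up counts of 40 adults and 20 kids and two arms labeled "drug" and "control"; patients, block, and treatment are hypothetical names.

```r
set.seed(7)  # arbitrary seed for reproducibility

# Hypothetical patients: 40 adults and 20 kids
patients <- data.frame(
  id    = 1:60,
  block = rep(c("adult", "kid"), times = c(40, 20))
)

# Within each block, randomly assign half to the drug and half to control
patients$treatment <- NA_character_
for (b in unique(patients$block)) {
  rows <- which(patients$block == b)
  labels <- rep(c("drug", "control"), length.out = length(rows))
  patients$treatment[rows] <- sample(labels)
}

assignment <- table(patients$block, patients$treatment)
assignment
```

Each block ends up with equal numbers in both arms, so a difference between adults and kids cannot be mistaken for a treatment effect.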


Image Copyright Federica Ricci. Used with permission.

More Vocabulary about Experiments

A placebo is a fake treatment. If a patient shows an improvement by taking a placebo then this is called a placebo effect.

In blind studies, patients do not know what treatment they receive. In double-blind studies, neither the patients who receive the treatment nor the doctors who provide it know the type of treatment.