
STAT 501 Applied Regression
Text's Data Sets (ASCII files) (Also see the discussion in Week 1 on how to access
the data in a PC Lab on campus.)
Week 1, Week 2, Week 3, Week 4, Week 5, Week 6, Week 7, Week 8, Week 9, Week 10, Week 11,
Week 12, Week 13, Week 14, Week 15
Other Resources:
Regression Demo
Correlation Demo and more
Final Exam: Wed., May 6, 10:10-Noon, 117 THOMAS
Note: I will be in the office after 2:00
on Monday, May 4, if you have any questions about the material for the
final. You can bring a single sheet of notes to the final.
The final is cumulative. Logistic regression will not be on the final.
Instructor: Tom Hettmansperger
317 Thomas Bld
Phone: 865-2211
email: tph@stat.psu.edu
Office Hours: 3:30-4:30 M, W
Assistant: Mustafa Nadar
418 Thomas Bld
Phone: 865-3230
email: nadar@stat.psu.edu
Office Hours: 10:10-11:10 T, Th
Text: Applied Linear Statistical Models, 4th ed., by Neter,
Kutner, Nachtsheim, and Wasserman
Note: You can also use Applied Linear Regression Models, 3rd ed.,
by Neter, Kutner, Nachtsheim, and Wasserman. It consists of the first 15
chapters of Applied Linear Statistical Models, 4th ed. In fact, if
you are only planning to take Stat 501 and not Stat 502, then the Applied
Linear Regression Models is a lot easier to carry around. The
4th edition is on reserve in the Mathematics Library (in McAllister
Bld.) so you can cross check earlier editions if necessary.
We will cover most of the material in the first 11 chapters (essentially
parts 1 and 2 of the text), and selected material from later chapters if
time permits. The emphasis will be on the analysis of data and not
on theory. However, you will need to know some matrix manipulations
and I will discuss this topic in class. I will also take seriously
the 6 credits in statistics or Stat 451 prerequisite. See Appendix
A in the text for a review of the prerequisite material, especially sections
A.6 and A.7.
There will be 2 exams and a comprehensive final. The exams
will be worth 100 points each and the final will be worth 200 points.
Homework will be collected and graded but will only count if you are on
the borderline.
The first exam will be after chapter 3 or 4. The second exam will
be after chapter 7. This schedule is tentative and may be changed
depending on the pace of the class.
The computer will be an important part of the course. I
will discuss and use Minitab. You may use any good computer package
(such as SAS) that implements the methods discussed in the course.
If you have no experience
with a computer, see me at once.
If time permits, I will assign an optional project which will be due
sometime near the end of the semester.
I strongly suggest that you check the Stat 501 web page each
week. In addition, I will put comments on computing along with Minitab
programs on the page.
Week 1. Assignment:
Read sections 1.1-1.5. Optional exercises (p. 36): 1.6, 1.7,
1.11.
Flash: There is an error in the formula for b1 in
Wednesday's lecture. I will correct it on Friday.
Here is the data for the muscle mass exercise 1.27, p40.
Do this problem and hand it in on Monday, Jan. 19.
| age  |  64 |  43 |  67 |  56 |  73 |  68 |  56 |  76 |  65 |  45 |  58 |  45 |  53 |  49 |  78 |  71 |
| mass |  91 | 100 |  68 |  87 |  73 |  78 |  80 |  65 |  84 | 116 |  76 |  97 | 100 | 105 |  77 |  82 |
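If you want to check your hand calculation for exercise 1.27, here is a minimal sketch in Python (not Minitab, which the course uses). It applies the usual formulas b1 = Sxy/Sxx and b0 = ybar - b1*xbar to the age and mass data listed above. The function and variable names are my own and are only illustrative.

```python
# Muscle mass data for exercise 1.27, as listed on this page.
age = [64, 43, 67, 56, 73, 68, 56, 76, 65, 45, 58, 45, 53, 49, 78, 71]
mass = [91, 100, 68, 87, 73, 78, 80, 65, 84, 116, 76, 97, 100, 105, 77, 82]


def least_squares(x, y):
    """Return (b0, b1) for the simple linear regression of y on x."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    return b0, b1


b0, b1 = least_squares(age, mass)
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}")
```

The fitted line always passes through the point (xbar, ybar), which is a quick sanity check on any hand computation.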
Also, read sections 1.6-1.8 in the text.
Flash: For those of you who do not want to use the pc labs
for Minitab, I have included a link to all the data sets in the
text at the top of this page. You can copy and paste the data into
a Minitab worksheet and save it if you wish. This will be helpful
if you have Minitab at home or in a lab and do not have the disk that comes
with the book.
For those of you who will use Minitab in one of the pc labs:
1. Under Program Groups: click Spreadsheets and Statistics
2. Double click Minitab 11
3. In Minitab: File>open worksheet
4. Drives: select i:\\hammond\instruct
5. Under Stat directory: 462
6. Double click the data set you want and it should be loaded
into the Minitab worksheet.
Week 2
Assignment: By now you have hopefully read sections 1.6-1.8.
I won't talk about maximum likelihood. You should be aware that the
least squares methods are also maximum likelihood methods under the normal
model. Due Monday, Jan. 19: Problem 1.27. Do these
computations using a calculator. I will talk about using Minitab
on Monday.
After discussing Minitab, I will begin Chapter
2. If you want to read ahead, look at section 2.1.
We will look at the bootstrap for simple regression. In particular,
on Wednesday we carried out one cycle of the bootstrap by hand. You
should do this for yourself. Use the Mercedes data and resample the
residuals (using slips of paper) and compute b1*. It would be a good
idea to do this a couple of times to get a feeling for how the bootstrap
works. On Friday I will give you a general purpose Minitab bootstrap
program and illustrate it.
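One cycle of the residual bootstrap described above can be sketched in Python as follows. This is not the Minitab program from class. Because the Mercedes data is not reproduced on this page, the sketch uses the muscle mass data from Week 1 as a stand-in, and the seed and function names are arbitrary choices of mine.

```python
import random

# Stand-in data: the muscle mass data from Week 1 (the Mercedes data is not on this page).
age = [64, 43, 67, 56, 73, 68, 56, 76, 65, 45, 58, 45, 53, 49, 78, 71]
mass = [91, 100, 68, 87, 73, 78, 80, 65, 84, 116, 76, 97, 100, 105, 77, 82]


def least_squares(x, y):
    """Return (b0, b1) for the simple linear regression of y on x."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1


def bootstrap_cycle(x, y, rng):
    """One residual-bootstrap cycle: return the refitted slope b1*."""
    b0, b1 = least_squares(x, y)
    fitted = [b0 + b1 * xi for xi in x]
    resid = [yi - fi for yi, fi in zip(y, fitted)]
    # Draw n residuals with replacement (the "slips of paper").
    resampled = [rng.choice(resid) for _ in resid]
    # Attach the resampled residuals to the original fitted values.
    y_star = [fi + ei for fi, ei in zip(fitted, resampled)]
    _, b1_star = least_squares(x, y_star)
    return b1_star


rng = random.Random(501)  # arbitrary seed so the run can be repeated
slopes = [bootstrap_cycle(age, mass, rng) for _ in range(5)]
print("b1* from five cycles:", [round(s, 3) for s in slopes])
```

Each call corresponds to one pass with the slips of paper. Repeating it many times and taking the standard deviation of the b1* values gives the bootstrap estimate of the standard deviation of b1.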
Week 3. Here is
an exercise that will be due Wed. Jan. 28. Suppose we want
to investigate the relationship between driver test scores and the
number of beers that the driver has drunk. Let Y=test score and X=number
of beers. Data:
| X |  0 |  1 |  2 |  3 |  4 |
| Y | 80 | 84 | 76 | 70 | 72 |
The intercept in the model is beta0 = expected score with no beers.
This is a baseline score. Let M3 = beta0 + 3 beta1; this is the expected
score after 3 beers. Using the data:
a. Estimate M3
b. Estimate the standard deviation of M3
(Hint: Change the bootstrap program to compute M3 and run it
500 times.)
You might assign a score of M3/beta0. Then people with different
initial skills (no beer) would be more comparable.
c. Estimate R=M3/beta0
d. Estimate the standard deviation of R.
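The hint can be sketched in Python roughly as follows. This is an illustration under my own assumptions, not the Minitab bootstrap program from class: it resamples residuals as in Week 2, uses 500 repetitions, and the seed and function names are arbitrary.

```python
import random
from statistics import stdev

beers = [0, 1, 2, 3, 4]
score = [80, 84, 76, 70, 72]


def least_squares(x, y):
    """Return (b0, b1) for the simple linear regression of y on x."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1


def m3_and_r(x, y):
    """Return the estimates of M3 = beta0 + 3*beta1 and R = M3/beta0."""
    b0, b1 = least_squares(x, y)
    m3 = b0 + 3 * b1
    return m3, m3 / b0


def bootstrap_sd(x, y, reps=500, seed=501):
    """Residual bootstrap: standard deviations of the M3* and R* values."""
    rng = random.Random(seed)
    b0, b1 = least_squares(x, y)
    fitted = [b0 + b1 * xi for xi in x]
    resid = [yi - fi for yi, fi in zip(y, fitted)]
    m3_stars, r_stars = [], []
    for _ in range(reps):
        # New responses: fitted values plus residuals drawn with replacement.
        y_star = [fi + rng.choice(resid) for fi in fitted]
        m3_s, r_s = m3_and_r(x, y_star)
        m3_stars.append(m3_s)
        r_stars.append(r_s)
    return stdev(m3_stars), stdev(r_stars)


m3_hat, r_hat = m3_and_r(beers, score)
sd_m3, sd_r = bootstrap_sd(beers, score)
print(f"M3 estimate = {m3_hat:.2f}, bootstrap sd = {sd_m3:.2f}")
print(f"R  estimate = {r_hat:.4f}, bootstrap sd = {sd_r:.4f}")
```

The only change from a bootstrap of the slope is the statistic computed on each resample, so the same loop handles both M3 and the ratio R.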
Week 4. Assignment
Due Wed., Feb. 4: Exercises 2.27 and 2.28 p90 in the text.
You should do these with a calculator and then check them on the computer
using Minitab or whatever program you are using. In addition, Due
Fri., Feb. 6: In the muscle mass problem, let P = the percent
reduction in the expected muscle mass over a 5 year period beginning with
the average age in the study. Estimate P and estimate the standard
deviation of P. Suppose you, as the researcher, believe that P will
exceed 3%. Does your analysis support this or not?
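One reading of P, written as a short Python sketch: the expected mass at the average age is b0 + b1*xbar, the expected mass five years later is b0 + b1*(xbar + 5), and P is the drop expressed as a percent of the starting value. The function name is my own. The sketch gives only the point estimate, and the standard deviation still has to be estimated separately.

```python
# Muscle mass data from Week 1 of this page.
age = [64, 43, 67, 56, 73, 68, 56, 76, 65, 45, 58, 45, 53, 49, 78, 71]
mass = [91, 100, 68, 87, 73, 78, 80, 65, 84, 116, 76, 97, 100, 105, 77, 82]


def least_squares(x, y):
    """Return (b0, b1) for the simple linear regression of y on x."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1


def percent_reduction(x, y, years=5):
    """Percent drop in expected y over `years`, starting at the mean of x."""
    b0, b1 = least_squares(x, y)
    xbar = sum(x) / len(x)
    start = b0 + b1 * xbar           # expected mass at the average age
    end = b0 + b1 * (xbar + years)   # expected mass `years` later
    return 100 * (start - end) / start


p_hat = percent_reduction(age, mass)
print(f"Estimated P = {p_hat:.2f}%")
```

Since b0 + b1*xbar equals ybar, the estimate reduces to 100*(-5*b1)/ybar, which is easy to verify with a calculator.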
Read the rest of Chapter 2. I will discuss various topics this
week. After prediction intervals, we will look at analysis of variance,
the general linear test, and r-squared in detail.
Week 5. If you did
not hand in the problem concerning percent reduction on Friday, it is due
Monday. We will consider r-squared first and then move to Chapter
3, diagnostics and remedial measures. This is one of the most important
parts of the course. A lot of the difficulties with the sensitivity
of least squares can be demonstrated using the link at the top of this
page under Other Resources: Regression Demos. I will illustrate
it in class. Due Wed., Feb. 11: Exercise 2.29 in the book.
Week 6. Chapter
3, diagnostics and remedial measures this week. Note that leverage
is not taken up in the text until later, after the introduction of multiple
regression. We, however, will look at it closely for simple regression.
I will concentrate lectures on the following sections and material:
Sections 3.1, 3.2, 3.3 (plot of standardized resids vs x, boxplot of standardized
residuals, and boxplot of x values). Further, in section 3.3 look
at: Nonlinearity of Regression Function, Nonconstancy of Error Variance,
Presence of Outliers. Then section 3.7 on lack of fit. A little
of section 3.9 on transformations, skipping the Box-Cox transformations.
I may briefly discuss some of the material in section 3.10 on regression
smoothing.
I anticipate assigning the following problems as we cover the
relevant topics (p. 146): 3.7 (a, b, and c only), 3.9, 3.13, 3.16
(a, c, d, e, and f only). Due Friday: 3.7, 3.9 and Levene's
test on the muscle mass data.
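For the Levene exercise, the following Python sketch shows the modified Levene statistic as I understand it from the text: split the residuals into a low-X and a high-X group, take absolute deviations from each group's median residual, and form a pooled two-sample t statistic. Splitting at the median age is my assumption, and you may prefer a different cut point.

```python
from math import sqrt
from statistics import mean, median

age = [64, 43, 67, 56, 73, 68, 56, 76, 65, 45, 58, 45, 53, 49, 78, 71]
mass = [91, 100, 68, 87, 73, 78, 80, 65, 84, 116, 76, 97, 100, 105, 77, 82]


def least_squares(x, y):
    """Return (b0, b1) for the simple linear regression of y on x."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1


def modified_levene(x, y, cut):
    """Return (t statistic, n1, n2) for groups x <= cut and x > cut."""
    b0, b1 = least_squares(x, y)
    resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    g1 = [e for xi, e in zip(x, resid) if xi <= cut]
    g2 = [e for xi, e in zip(x, resid) if xi > cut]
    # Absolute deviations of the residuals from their group medians.
    d1 = [abs(e - median(g1)) for e in g1]
    d2 = [abs(e - median(g2)) for e in g2]
    n1, n2 = len(d1), len(d2)
    pooled_var = (sum((d - mean(d1)) ** 2 for d in d1)
                  + sum((d - mean(d2)) ** 2 for d in d2)) / (n1 + n2 - 2)
    t_stat = (mean(d1) - mean(d2)) / sqrt(pooled_var * (1 / n1 + 1 / n2))
    return t_stat, n1, n2


cut = median(age)  # assumed split point
t_stat, n1, n2 = modified_levene(age, mass, cut)
print(f"n1 = {n1}, n2 = {n2}, t = {t_stat:.3f}")
```

Compare |t| with the t table value for n - 2 = 14 degrees of freedom, which is 2.145 for a two-sided test at the 5% level. A large |t| suggests the error variance is not constant.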
Week 7. We'll finish
chapter 3 this week. Due Friday, Feb. 27: Exercises 3.13,
3.15, 3.16 (a, c, d, e, and f). On Friday if there is time, I will
talk about simultaneous inference. Read pp152-159. Next
week, after the exam, we will begin matrix notation and some manipulations
in preparation for multiple regression.
Week 8. Exam week.
We will cover briefly the multiplicity problem. You should read
Sections 4.1, 4.2, and 4.3 in Chapter 4. The other topics in Chapter
4 are of specialized interest and individuals may wish to read them.
If you do and want to discuss some aspect of the material please make an
appointment to see me. I strongly recommend that you read
Section 4.7. This material is a brief introduction to issues in designing
an experiment and choosing the X values judiciously.
After break we will begin multiple regression. First we must introduce
some material from matrix theory. Matrices will be mainly used for
formulating the multiple regression problem. I suggest reading
Sections 5.1-5.7 and doing exercises 5.1 and 5.2 over the break.
These exercises will familiarize you with the basic matrix manipulations.
Week 9. We begin
Chapter 6 this week. Read Section 6.1 carefully. I will illustrate
many of the ideas on the following grandfather clock data.
You may wish to work with the data yourself on some of the examples from
class. If you wish to copy and paste it into a Minitab worksheet,
the columns are price, age, bidders, age-squared, bidders-squared, and
age times bidders:
1055 108 14 11664 196 1512
729 108 6 11664 36 648
1175 111 15 12321 225 1665
785 111 7 12321 49 777
946 113 9 12769 81 1017
1080 115 12 13225 144 1380
744 115 7 13225 49 805
1024 117 11 13689 121 1287
1152 117 13 13689 169 1521
1336 126 10 15876 100 1260
1235 127 13 16129 169 1651
845 127 7 16129 49 889
1253 132 10 17424 100 1320
1297 137 9 18769 81 1233
1713 137 15 18769 225 2055
1147 137 8 18769 64 1096
854 143 6 20449 36 858
1522 150 9 22500 81 1350
1092 153 6 23409 36 918
1047 156 6 24336 36 936
1822 156 12 24336 144 1872
1483 159 9 25281 81 1431
1884 162 11 26244 121 1782
1262 168 7 28224 49 1176
2131 170 14 28900 196 2380
1545 175 8 30625 64 1400
1792 179 9 32041 81 1611
1979 182 11 33124 121 2002
1550 182 8 33124 64 1456
2041 184 10 33856 100 1840
1593 187 8 34969 64 1496
1356 194 5 37636 25 970
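For those working outside Minitab, here is a minimal Python sketch of a first-order fit of price on age and bidders using the normal equations (X'X)b = X'Y from the matrix formulation. The small Gaussian elimination routine and all names are my own, and the choice of this particular model is only for illustration.

```python
# Grandfather clock data from this page: (price, age, bidders).
clocks = [
    (1055, 108, 14), (729, 108, 6), (1175, 111, 15), (785, 111, 7),
    (946, 113, 9), (1080, 115, 12), (744, 115, 7), (1024, 117, 11),
    (1152, 117, 13), (1336, 126, 10), (1235, 127, 13), (845, 127, 7),
    (1253, 132, 10), (1297, 137, 9), (1713, 137, 15), (1147, 137, 8),
    (854, 143, 6), (1522, 150, 9), (1092, 153, 6), (1047, 156, 6),
    (1822, 156, 12), (1483, 159, 9), (1884, 162, 11), (1262, 168, 7),
    (2131, 170, 14), (1545, 175, 8), (1792, 179, 9), (1979, 182, 11),
    (1550, 182, 8), (2041, 184, 10), (1593, 187, 8), (1356, 194, 5),
]


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit(X, y):
    """Least squares coefficients from the normal equations (X'X) b = X'y."""
    p = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(p)]
    return solve(xtx, xty)


X = [[1.0, age, bidders] for _, age, bidders in clocks]  # column of 1s first
y = [price for price, _, _ in clocks]
b = fit(X, y)
fitted = [sum(bj * xj for bj, xj in zip(b, row)) for row in X]
resid = [yi - fi for yi, fi in zip(y, fitted)]
ybar = sum(y) / len(y)
r_squared = 1 - sum(e * e for e in resid) / sum((yi - ybar) ** 2 for yi in y)
print("b0, b1 (age), b2 (bidders):", [round(v, 3) for v in b])
print(f"R-squared = {r_squared:.3f}")
```

To include the squared or interaction terms, add the corresponding columns to each row of X. Nothing else in the sketch needs to change.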
Week 10. In addition
to the exercise given in class that is due Monday, March 23, you
should read Sections 6.1 through 6.5. On Monday I will assign
Exercise 6.15 parts a-f and it will be due Wednesday or Friday depending
on how far I get on Monday.
Read Sections 6.6 and 6.7 for a discussion of inference on the regression
coefficients, expected values, and predicted new values. Exercises
6.16 and 6.17 will be due sometime next week, probably Wednesday, April
1. You should also read 6.8 for a brief discussion of diagnostics.
Finally, 6.9 may be helpful since it is an extended worked out example.
Week 11. Exercises
6.16 and 6.17 are due Wednesday, April 1. In the meantime
we will begin with the extra sum of squares principle. Read Sections
7.1 and 7.2. Due Friday, April 3: Exercises 7.5 and
7.6 p317 (Patient Satisfaction data) on extra sums of squares methods.
Week 12. Exam this
week. Begin discussion of variance inflation factors. The discussion
of multicollinearity is in Sections 7.5, 7.6, and 9.5. Exercises due
Friday, April 17: 7.14, 7.18, and 9.17.
Week 13. We will
finish up the discussion of variance inflation factors and extend the diagnostic
to several predictors. We will apply this to polynomial models (especially
the clock data). Read sections 7.7 and 7.8 in the text. Much
of this will be familiar and I will not repeat all this material.
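As a small illustration of variance inflation in a polynomial model, the Python sketch below uses the ages from the clock data. With only two predictors, each VIF equals 1/(1 - r^2), where r is the correlation between them. The sketch compares age with age-squared before and after centering age. The names are my own.

```python
from math import sqrt

# Ages from the grandfather clock data on this page.
ages = [108, 108, 111, 111, 113, 115, 115, 117, 117, 126, 127, 127, 132,
        137, 137, 137, 143, 150, 153, 156, 156, 159, 162, 168, 170, 175,
        179, 182, 182, 184, 187, 194]


def corr(u, v):
    """Sample correlation coefficient between u and v."""
    n = len(u)
    ubar = sum(u) / n
    vbar = sum(v) / n
    suv = sum((a - ubar) * (b - vbar) for a, b in zip(u, v))
    suu = sum((a - ubar) ** 2 for a in u)
    svv = sum((b - vbar) ** 2 for b in v)
    return suv / sqrt(suu * svv)


def vif_two_predictors(u, v):
    """VIF shared by both predictors in a two-predictor model."""
    r = corr(u, v)
    return 1 / (1 - r * r)


age_sq = [a * a for a in ages]
mean_age = sum(ages) / len(ages)
centered = [a - mean_age for a in ages]
centered_sq = [c * c for c in centered]

vif_raw = vif_two_predictors(ages, age_sq)
vif_centered = vif_two_predictors(centered, centered_sq)
print(f"VIF, age and age^2:          {vif_raw:.1f}")
print(f"VIF, centered age and square: {vif_centered:.2f}")
```

Centering removes most of the correlation between the linear and quadratic terms, which is why it is recommended whenever you add a squared term.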
Week 14. End of
week 13: begin selection of variables. Read sections 8.1 and
8.2; they provide a good overview of the problem of selection of variables
and also provide an example. Then read section 8.3, especially pages
336-345. Then read pages 346 and 347 on Best Subsets. Be sure
you understand R-squared and adj R-squared along with Cp. Due
Wednesday, April 21: Look at exercises 8.9 and 8.11. In
these exercises use Minitab's best subsets regression to try to identify
a good model. You do not have to do all the other parts of the exercises.
You might put in some of the quadratic terms if the LOF test suggests that
you have curvature. Then the best subsets regression is applied
to the full set. Don't forget to center a variable if you put in
quadratic terms in that variable.
Week 15. Note:
For the exercise due Monday, April 27, in the second set of 10 rows, the
first 60 (Humidity) should be 40 in three places. We will continue
the discussion of weighted least squares. In preparation for robust
regression, read about studentized residuals and studentized deleted residuals,
pp372-375. Then we will need DFFITS discussed on pp379-380.
Finally we will be ready for Section 10.3 pp416-424.
Exercise for Friday, May 1: I will discuss this data and
answer questions on Friday but you do not have to hand it in. Data
from a study of shelf life of packaged food: y = moisture content
of cereal and x = days on the shelf. The idea is that the soggier
the cereal the more unappetizing it is.
Here is the data: row, y, x (cut and paste this into a Minitab
worksheet)
1 2.8 0
2 3.0 3
3 3.1 6
4 3.2 8
5 3.4 10
6 3.4 13
7 3.5 16
8 3.1 20
9 3.8 24
10 4.0 27
11 4.1 30
12 4.3 34
13 4.4 37
14 5.9 41
Plot y vs x. Get the regular regression equation. Find leverage,
deleted t residuals, and DFFITS. Construct the weights and get the
robust regression. Compare the regular regression equation to the
robust regression equation. Plot t resids vs fits.
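The diagnostic part of this exercise can be sketched in Python as below, using the standard simple regression formulas for leverage, studentized deleted residuals (Minitab's deleted t residuals), and DFFITS. The sketch stops before the robust fit because the weight function is not stated on this page. All names are my own.

```python
from math import sqrt

# Cereal shelf-life data from this page: y = moisture content, x = days on shelf.
days = [0, 3, 6, 8, 10, 13, 16, 20, 24, 27, 30, 34, 37, 41]
moisture = [2.8, 3.0, 3.1, 3.2, 3.4, 3.4, 3.5, 3.1, 3.8, 4.0, 4.1, 4.3, 4.4, 5.9]


def diagnostics(x, y):
    """Return b0, b1, leverages, deleted t residuals, and DFFITS."""
    n, p = len(x), 2  # p = number of regression parameters
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    sse = sum(e * e for e in resid)
    # Leverage h_i for simple linear regression.
    lev = [1 / n + (xi - xbar) ** 2 / sxx for xi in x]
    # Studentized deleted residuals, computed without refitting.
    t_del = [e * sqrt((n - p - 1) / (sse * (1 - h) - e * e))
             for e, h in zip(resid, lev)]
    dffits = [t * sqrt(h / (1 - h)) for t, h in zip(t_del, lev)]
    return b0, b1, lev, t_del, dffits


b0, b1, lev, t_del, dffits = diagnostics(days, moisture)
print(f"Regular regression: y = {b0:.3f} + {b1:.4f} x")
print("row  leverage  deleted t   DFFITS")
for i, (h, t, d) in enumerate(zip(lev, t_del, dffits), start=1):
    print(f"{i:3d}  {h:8.3f}  {t:9.3f}  {d:7.3f}")
```

The leverages sum to 2, the number of parameters, which is a handy check. Rows with large |DFFITS| are the ones the robust fit should downweight.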