THE CENTER FOR MULTIVARIATE ANALYSIS
The Center for Multivariate Analysis is an interdisciplinary
research unit within Penn State’s Department of Statistics. Established
in 1982 at the initiative of the Air Force Office of Scientific
Research, it is the first research center in the world with a primary
focus on multivariate analysis. C.R. Rao, Emeritus Professor of
statistics and Holder of the Eberly Family Chair, is the director
of the center.
One function of the Center for Multivariate
Analysis is to create opportunities for scholars from all over the
world to visit and conduct research. Each year, statistical researchers
from countries such as China, India, and Japan, as well as from Europe
and the United Kingdom, visit the center for periods ranging from two weeks
to three months. The researchers collaborate with the center’s staff
on projects of mutual interest. Most of the basic research conducted
in the center is in response to new problems – those that cannot
be solved with existing methodologies.
Another mission of the center is to offer research
opportunities to graduate students. Although the center sponsors
postgraduate scientists primarily, it welcomes graduate students
to participate in research and gain work experience.
Abstracts of talks by Leo Breiman
STATISTICAL MODELING – THE TWO CULTURES
Suppose that the data consist of dependent variables
y and a number of predictor variables x. The dominant
approach to modeling in statistics is data modeling – the assumption
is made that the data are generated from a known parametric family
containing a stochastic element. That is, the y’s are
generated as a specified function of the x’s, parameters
and noise variables. A small minority in statistics and many in
fields outside of statistics use algorithmic modeling (more loosely
called data mining). This approach makes no assumptions about how
the data are generated. Instead, algorithms (f(x)) are constructed
so as to predict the y’s as accurately as possible from
the x’s. In my talk, I will discuss the advantages and disadvantages
of these two approaches.
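The contrast between the two approaches can be illustrated with a small simulation. The sketch below is only an illustration and is not taken from the talk: the data-generating function, the straight-line data model, and the k-nearest-neighbour predictor are all assumptions chosen for brevity. The simulated relationship is deliberately nonlinear, so the comparison favors the algorithmic predictor; with truly linear data the straight-line model would be expected to do at least as well.

```python
import math
import random

def make_data(n, rng):
    """Simulate (x, y) pairs; the true relationship is nonlinear by construction."""
    xs = [rng.uniform(-3.0, 3.0) for _ in range(n)]
    ys = [math.sin(2.0 * x) + rng.gauss(0.0, 0.2) for x in xs]
    return xs, ys

def fit_line(xs, ys):
    """Data modeling: assume y = a + b*x + noise and estimate a, b by least squares."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = my - b * mx
    return lambda x: a + b * x

def fit_knn(xs, ys, k=5):
    """Algorithmic modeling: predict by averaging the k nearest training responses."""
    pairs = list(zip(xs, ys))
    def predict(x):
        nearest = sorted(pairs, key=lambda p: abs(p[0] - x))[:k]
        return sum(y for _, y in nearest) / k
    return predict

def mse(f, xs, ys):
    """Mean squared prediction error of f on the given cases."""
    return sum((f(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

rng = random.Random(0)  # fixed seed so the run is repeatable
x_train, y_train = make_data(200, rng)
x_test, y_test = make_data(200, rng)

line_mse = mse(fit_line(x_train, y_train), x_test, y_test)
knn_mse = mse(fit_knn(x_train, y_train), x_test, y_test)
print(f"straight-line data model, test MSE: {line_mse:.3f}")
print(f"k-NN algorithmic model,   test MSE: {knn_mse:.3f}")
```

Both predictors are judged the same way, by prediction error on cases not used for fitting, which is the yardstick the algorithmic approach emphasizes.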
RANDOM FORESTS – A MULTIPURPOSE TOOL
Random Forests is a prediction method developed
in Machine Learning that has state-of-the-art accuracy and can cope
with thousands of input variables. Although the structure of the
predictor is complex, consisting of many decision trees amalgamated
together, a wealth of information can be derived from a single run.
Examples include variable importance measures, intrinsic proximity distances
between cases for use in clustering, detection of outliers, and density
estimation. The talk discusses these and gives applications to a
number of data sets.
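The idea of combining many randomized trees and reading a variable importance measure off a single run can be sketched in a few lines. What follows is a heavily simplified stand-in, not the Random Forests algorithm itself: each "tree" is a one-split stump, each stump sees a bootstrap sample and a random subset of the variables, and importance is taken to be the mean decrease in Gini impurity. The simulated data, in which only the first of three variables carries signal, and all function names are assumptions made for illustration.

```python
import random

def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)

def majority(labels):
    """Majority class of a non-empty list of 0/1 labels (ties go to 1)."""
    return int(2 * sum(labels) >= len(labels))

def fit_stump(rows, labels, features):
    """Best single split among the candidate features, by Gini decrease."""
    n, parent, best = len(rows), gini(labels), None
    for j in features:
        for t in sorted({r[j] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[j] <= t]
            right = [y for r, y in zip(rows, labels) if r[j] > t]
            if not right:  # the largest value leaves nothing on the right
                continue
            gain = parent - (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or gain > best[0]:
                best = (gain, j, t, majority(left), majority(right))
    return best

def fit_forest(rows, labels, n_trees, n_try, rng):
    """Grow stumps on bootstrap samples, each with n_try randomly chosen features."""
    n, p = len(rows), len(rows[0])
    stumps, importance = [], [0.0] * p
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]   # bootstrap sample
        feats = rng.sample(range(p), n_try)          # random feature subset
        stump = fit_stump([rows[i] for i in idx], [labels[i] for i in idx], feats)
        if stump is None:
            continue
        gain, j, t, left_vote, right_vote = stump
        importance[j] += gain
        stumps.append((j, t, left_vote, right_vote))
    return stumps, [v / n_trees for v in importance]

def predict(stumps, row):
    """Majority vote over all stumps."""
    votes = sum(lv if row[j] <= t else rv for j, t, lv, rv in stumps)
    return int(2 * votes > len(stumps))

rng = random.Random(0)
rows = [[rng.random() for _ in range(3)] for _ in range(100)]
labels = [int(r[0] > 0.5) for r in rows]  # only variable 0 is informative
stumps, importance = fit_forest(rows, labels, n_trees=50, n_try=2, rng=rng)
accuracy = sum(predict(stumps, r) == y for r, y in zip(rows, labels)) / len(rows)
print("mean Gini decrease per variable:", [round(v, 3) for v in importance])
print("training accuracy:", accuracy)
```

In this setup the informative variable should receive by far the largest importance score. Proximities, outlier detection, and density estimation are not covered by this sketch.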
Key References:
Bagging Predictors. Machine Learning 26,
123-140 (1996).
Arcing classifiers (with discussion). Ann. Statist.
26, 801-849 (1998).
The BD-method for estimating multivariate functions
from noisy data. Technometrics 33, 125-160 (1991).
Random Forests – Random
Features, Tech. Report 567 (1999), University of California,
Berkeley. Also available by FTP (open www.stat.berkeley.edu).
Abstracts of talks by J. Sunil Rao
THE OUT-OF-BOOTSTRAP APPROACH FOR
MODEL AVERAGING AND SELECTION
We propose a bootstrap-based method for model
averaging and selection that focuses on training points that are
left out of individual bootstrap samples. This information can be
used to estimate optimal weighting factors for combining estimates
from different bootstrap samples, and also for finding the best
subsets in the linear model setting. These proposals provide alternatives
to Bayesian approaches to model averaging and selection, requiring
less computation and fewer subjective choices. The methodology is
illustrated in examples ranging from typical machine learning scenarios
to image processing for breast cancer detection. This is joint work
with Rob Tibshirani.
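The central ingredient, the training points left out of each bootstrap sample, can be sketched as follows. This is a generic out-of-bootstrap error estimate used to choose between two candidate models; it is not the authors' weighting or subset-selection procedure, and the simulated data, the two candidate models, and the function names are assumptions made for illustration.

```python
import random

def fit_mean(xs, ys):
    """Candidate 1: predict the sample mean of y, ignoring x."""
    m = sum(ys) / len(ys)
    return lambda x: m

def fit_line(xs, ys):
    """Candidate 2: least-squares straight line in x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0.0:  # degenerate resample: fall back to the mean
        return lambda x: my
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return lambda x: my + b * (x - mx)

def out_of_bootstrap_mse(fit, xs, ys, n_boot, rng):
    """Average squared error on points left out of each bootstrap sample."""
    n = len(xs)
    total, count = 0.0, 0
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]        # draw with replacement
        drawn = set(idx)
        left_out = [i for i in range(n) if i not in drawn]
        f = fit([xs[i] for i in idx], [ys[i] for i in idx])
        for i in left_out:                                # score only unseen points
            total += (f(xs[i]) - ys[i]) ** 2
            count += 1
    return total / count

rng = random.Random(1)
xs = [rng.uniform(0.0, 10.0) for _ in range(60)]
ys = [1.0 + 0.5 * x + rng.gauss(0.0, 1.0) for x in xs]

candidates = [("mean only", fit_mean), ("straight line", fit_line)]
scores = {name: out_of_bootstrap_mse(fit, xs, ys, 200, rng) for name, fit in candidates}
best = min(scores, key=scores.get)
for name, score in scores.items():
    print(f"{name}: out-of-bootstrap MSE {score:.3f}")
print("selected:", best)
```

Because each bootstrap sample omits roughly a third of the training points, every fitted model can be scored on cases it never saw, with no separate validation set.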
THE GIC FOR MODEL SELECTION:
A HYPOTHESIS TESTING APPROACH
We consider the model
(subset) selection problem for linear regression. Although hypothesis
testing and model selection are two different approaches, there
are similarities between them. In this article we combine these
two approaches and propose a particular choice of the penalty
parameter in the generalized information criterion (GIC), which
leads to a model selection procedure that inherits good properties
from both approaches, i.e., its overfitting and underfitting probabilities
converge to 0 as the sample size n → ∞
and, when n is fixed, its overfitting probability is held
approximately below a pre-assigned level of significance.
This is joint work with Jun Shao.
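For orientation, the GIC for a candidate submodel is commonly written in the following form. The abstract does not state the formula, so the notation here is an assumption: S_n(α) is the residual sum of squares of submodel α, p(α) is its number of predictors, σ̂² is an estimate of the error variance (typically from the full model), and λ is the penalty parameter whose choice the talk addresses.

```latex
\mathrm{GIC}_{\lambda}(\alpha)
  \;=\; \frac{S_n(\alpha)}{n}
  \;+\; \frac{\lambda\,\hat{\sigma}^{2}\,p(\alpha)}{n}
```

In this form, λ = 2 gives a criterion of the C_p/AIC type and λ = log n gives a BIC-type criterion; larger values of λ penalize overfitting more heavily.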
Key references:
The GIC for model selection: A hypothesis testing
approach. J. Statist. Planning and Inference 88, 251-281 (2000),
with J. Shao.
The out-of-bootstrap approach for model averaging.
(Tech. Rept.) Machine Learning (in press, 2001), with R.
Tibshirani.
Locally bagged classification and regression trees.
(Tech. Rept.) Biometrics (in press, 2001).
The out-of-bootstrap approach for model selection.
(Tech. Rept.) J. Computational and Graphical Statistics (in
press, 2001), with R. Tibshirani.
P.R. Krishnaiah (7/15/32-8/1/87)
C.G. Khatri (8/8/31-3/31/89)
P.R. Krishnaiah was the founder-director of the
Center for Multivariate Analysis. He received numerous honors and
international recognition for his outstanding contributions to the
theory and applications of Statistics.
C.G. Khatri was a frequent visitor to the Center
for Multivariate Analysis. He authored or co-authored several
books and about two hundred research publications in prestigious
journals.
In order to perpetuate the memory of P.R. Krishnaiah
and C.G. Khatri, a Visiting Scholars Program has been started at
Penn State. Under this program, outstanding scholars are invited
to visit the CMA to give lectures and/or participate in research
work.
Donors are kindly requested to send their contributions
by check drawn in favor of "Penn State Krishnaiah Memorial Fund"
and/or "Penn State Khatri Memorial Fund" to Ms. Elaine
Robinson, Development Assistant, The Pennsylvania State University,
430 Thomas Building, University Park, PA 16802-2111. Penn State
will be the custodian of these funds, and the donations
may be tax deductible.
This publication is available in alternative media
on request.
Penn State encourages persons with disabilities
to participate in its programs and activities. If you anticipate
needing any type of accommodation or have questions about the physical
access provided, please call 865-1348 in advance of your participation
or visit. Penn State is an affirmative action, equal opportunity
university.
U.Ed.SCI 01-151
Registration Form
Pattern Recognition and Prediction
Machine Learning and Data Mining
July 22, 2001
Name:
Address:
E-mail:
Telephone:
*Registration Fee: $15.00 for participants from PSU
$25.00 for others
Lunch with speakers (optional): $7.00
Checks should be drawn in favor of Penn State University,
Krishnaiah Memorial Fund.
Please send the registration form with a check for the
registration fee before July 15, 2001, to
Bonnie G. Cain
Statistics Department
326 Joab Thomas Building
The Pennsylvania State University
University Park, PA 16802-2111
Registration is required to attend the seminar.
The late registration fee is $30.00.
*Faculty and graduate students of the PSU Statistics
Department have a different registration form, obtainable from
Bonnie G. Cain.