Pattern Recognition and Prediction

Machine Learning and Data Mining

ONE DAY WORKSHOP

SUNDAY, JULY 22, 2001

Keynote Speaker

LEO BREIMAN

Emeritus Professor of Statistics

University of California at Berkeley

P.R. Krishnaiah Visiting Scholar: 2001

 

The Center for Multivariate Analysis

Department of Statistics


Pattern Recognition and Prediction
Machine Learning and Data Mining

One day Workshop: July 22, 2001
201 Joab Thomas Building

8:00-8:45 Registration (coffee and donuts)
8:45-9:00 Welcome by D.J. Larson
Dean, Eberly College of Science
9:00-10:00 Leo Breiman
Statistical Modeling – The Two Cultures
10:00-10:30 Discussion and coffee
10:30-11:30 J. Sunil Rao
The out-of-bootstrap approach for model averaging and selection

11:30-12:00 Discussion
12:00-1:30 Lunch
1:30-2:30 J. Sunil Rao
The GIC for model selection: A hypothesis testing approach
2:30-3:00 Discussion and coffee
3:00-4:00 Leo Breiman
Random Forests – a Multipurpose Tool
4:00-4:30 Discussion

Paruchuri R. Krishnaiah Visiting Scholar
Professor Leo Breiman

Leo Breiman received his Ph.D. in mathematics in 1954 from the University of California, Berkeley. He is a member of the National Academy of Sciences and a Fellow of the American Statistical Association and the Institute of Mathematical Statistics. He received the Youden Prize (Technometrics) in 1992. He is the author of the textbooks Probability and Stochastic Processes with a View Toward Applications, Statistics with a View Toward Applications, and Probability, and co-author of Classification and Regression Trees.

Breiman is one of the leading statisticians in the world today. He has made fundamental contributions to stochastic processes, information theory and mathematical statistics. A distinguished career as a statistical consultant in traffic flow analysis, air pollution analysis, and computerized speech recognition led him to develop, with Friedman, Olshen and Stone, CART, the widely used algorithm for classification and prediction. More recently he made another original discovery, "bagging", a way of improving "supervised learning" by combining noisy predictions obtained by resampling the original data. His current interests are in computationally intensive multivariate analysis, including the use of nonlinear methods for pattern recognition and prediction in high-dimensional spaces.

 


J. SUNIL RAO

Sunil Rao received his Master of Science degree in Biostatistics in 1991 from the University of Minnesota and his Ph.D. in 1994 from the University of Toronto. He is on the faculty of the Department of Epidemiology and Biostatistics at Case Western Reserve University. He is an active researcher in both the theory and applications of statistics.

His current interests are in model selection, the bootstrap, decision trees, survival analysis, and the analysis of gene expression data.


THE CENTER FOR MULTIVARIATE ANALYSIS

The Center for Multivariate Analysis is an interdisciplinary research unit within Penn State’s Department of Statistics. Established in 1982 at the initiative of the Air Force Office of Scientific Research, it is the first research center in the world with a primary focus on multivariate analysis. C.R. Rao, Emeritus Professor of Statistics and holder of the Eberly Family Chair, is the director of the center.

One function of the Center for Multivariate Analysis is to create opportunities for scholars from all over the world to visit and conduct research. Each year, statistical researchers from countries such as China, India and Japan, and from Europe and the United Kingdom, visit the center for periods ranging from two weeks to three months. The researchers collaborate with the center’s staff on projects of mutual interest. Most of the basic research conducted in the center is in response to new problems, those that cannot be solved with existing methodologies.

Another mission of the center is to offer research opportunities to graduate students. Although the center sponsors postgraduate scientists primarily, it welcomes graduate students to participate in research and gain work experience.

 

Abstracts of talks by Leo Breiman

STATISTICAL MODELING – THE TWO CULTURES

Suppose that the data consist of dependent variables y and a number of predictor variables x. The dominant approach to modeling in statistics is data modeling: the assumption is made that the data are generated from a known parametric family containing a stochastic element. That is, the y’s are generated as a specified function of the x’s, parameters and noise variables. A small minority in statistics and many in fields outside of statistics use algorithmic modeling (more loosely called data mining). This approach makes no assumptions about how the data are generated. Instead, algorithms f(x) are constructed so as to predict the y’s as accurately as possible from the x’s. In my talk, I will discuss the advantages and disadvantages of these two approaches.

RANDOM FORESTS – A MULTIPURPOSE TOOL

Random Forests is a prediction method developed in Machine Learning that has state-of-the-art accuracy and can cope with thousands of input variables. Although the structure of the predictor is complex, consisting of many decision trees amalgamated together, a wealth of information can be derived from a single run. Examples include variable importance measures, intrinsic proximity distances between cases for use in clustering, detection of outliers, and density estimation. The talk discusses these and gives applications to a number of data sets.

Key References:

Bagging Predictors. Machine Learning 24, 123-140 (1996).

Arcing classifiers (with discussion). Ann. Statist. 26, 801-849 (1998).

The Π method for estimating multivariate functions from noisy data. Technometrics 33, 125-160 (1991).

Random Forests – Random Features, Tech. Report 567 (1999), University of California, Berkeley. Also available by FTP (open www.stat.berkeley.edu)

Abstracts of talks by J. Sunil Rao

THE OUT-OF-BOOTSTRAP APPROACH FOR MODEL AVERAGING AND SELECTION

We propose a bootstrap-based method for model averaging and selection that focuses on training points that are left out of individual bootstrap samples. This information can be used to estimate optimal weighting factors for combining estimates from different bootstrap samples, and also for finding the best subsets in the linear model setting. These proposals provide alternatives to Bayesian approaches to model averaging and selection, requiring less computation and fewer subjective choices. The methodology is illustrated in examples ranging from typical machine learning scenarios to image processing for breast cancer detection. This is joint work with Rob Tibshirani.
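The central idea in the abstract above, scoring a model on the training points that a bootstrap sample leaves out, can be illustrated with a minimal sketch. This is not the Rao and Tibshirani procedure: the function names (`bootstrap_split`, `fit_mean`, `fit_line`, `oob_error`), the two candidate models, and the squared-error score are all assumptions chosen for illustration.

```python
import random


def bootstrap_split(n, rng):
    """Draw n indices with replacement; return (in_bag, out_of_bag) index lists."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    chosen = set(in_bag)
    out_of_bag = [i for i in range(n) if i not in chosen]
    return in_bag, out_of_bag


def fit_mean(xs, ys):
    """Candidate model 1 (illustrative): predict the mean of y, ignoring x."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_line(xs, ys):
    """Candidate model 2 (illustrative): simple least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:  # degenerate resample: every x identical
        return lambda x: mean_y
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def oob_error(fit, xs, ys, n_boot=200, seed=0):
    """Average squared error on the points left out of each bootstrap sample."""
    rng = random.Random(seed)
    n = len(xs)
    total, count = 0.0, 0
    for _ in range(n_boot):
        in_bag, out_of_bag = bootstrap_split(n, rng)
        predict = fit([xs[i] for i in in_bag], [ys[i] for i in in_bag])
        for i in out_of_bag:
            total += (ys[i] - predict(xs[i])) ** 2
            count += 1
    return total / count if count else float("nan")
```

Calling `oob_error(fit_mean, xs, ys)` and `oob_error(fit_line, xs, ys)` with the same seed scores both candidates on identical resamples, so the smaller value points to the preferred model under this simplified criterion.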

THE GIC FOR MODEL SELECTION: A HYPOTHESIS TESTING APPROACH

We consider the model (subset) selection problem for linear regression. Although hypothesis testing and model selection are two different approaches, there are similarities between them. In this article we combine the two approaches and propose a particular choice of the penalty parameter in the generalized information criterion (GIC). This choice leads to a model selection procedure that inherits good properties from both approaches: its overfitting and underfitting probabilities converge to 0 as the sample size n → ∞ and, when n is fixed, its overfitting probability is controlled to be approximately under a pre-assigned level of significance. This is joint work with Jun Shao.
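The abstract does not state the criterion itself. For orientation, one common form of the GIC for a candidate subset of predictors in linear regression is sketched below; the notation ($\alpha$, $p_\alpha$, $\lambda_n$, $\hat\sigma^2$) is assumed here and may differ from that used in the talk.

```latex
% Assumed notation: \alpha indexes a candidate subset of predictors,
% X_\alpha is the corresponding design matrix with p_\alpha columns,
% \hat\beta_\alpha is the least-squares estimate under model \alpha,
% \hat\sigma^2 is an estimate of the error variance,
% and \lambda_n is the penalty parameter.
\[
  \mathrm{GIC}_{\lambda_n}(\alpha)
    = \lVert y - X_\alpha \hat\beta_\alpha \rVert^2
      + \lambda_n \, \hat\sigma^2 \, p_\alpha ,
  \qquad
  \hat\alpha = \arg\min_{\alpha} \mathrm{GIC}_{\lambda_n}(\alpha).
\]
```

In this form, $\lambda_n = 2$ gives a $C_p$/AIC-type penalty and $\lambda_n = \log n$ a BIC-type penalty; the abstract concerns a particular choice of $\lambda_n$ tied to a significance level.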

Key References:

The GIC for model selection: A hypothesis testing approach. J. Statist. Planning and Inference 88, 251-281 (2000), with J. Shao.

The out-of-bootstrap approach for model averaging. (Tech. Rept.) Machine Learning (in press, 2001), with R. Tibshirani.

Locally bagged classification and regression trees. (Tech. Rept.) Biometrics (in press, 2001).

The out-of-bootstrap approach for model selection. (Tech. Rept.) J. Computational and Graphical Statistics (in press, 2001), with R. Tibshirani.

 

P.R. Krishnaiah (7/15/32-8/1/87)

C.G. Khatri (8/8/31-3/31/89)

P.R. Krishnaiah was the founder-director of the Center for Multivariate Analysis. He received numerous honors and international recognition for his outstanding contributions to the theory and applications of Statistics.

C.G. Khatri was a frequent visitor to the Center for Multivariate Analysis. He authored or co-authored several books and about two hundred research publications in prestigious journals.

In order to perpetuate the memory of P.R. Krishnaiah and C.G. Khatri, a Visiting Scholars Program has been started at Penn State. Under this program, outstanding scholars are invited to visit the CMA to give lectures and/or participate in research work.

Donors are kindly requested to send their contributions by check drawn in favor of "Penn State Krishnaiah Memorial Fund" and/or "Penn State Khatri Memorial Fund" to Ms. Elaine Robinson, Development Assistant, The Pennsylvania State University, 430 Thomas Building, University Park, PA 16802-2111. Penn State University will be the custodian of these funds, and donations may be tax deductible.

This publication is available in alternative media on request.

Penn State encourages persons with disabilities to participate in its programs and activities. If you anticipate needing any type of accommodation or have questions about the physical access provided, please call 865-1348 in advance of your participation or visit. Penn State is an affirmative action, equal opportunity university.

U.Ed.SCI 01-151

Registration Form

Pattern Recognition and Prediction

Machine Learning and Data Mining

July 22, 2001

Name:

Address:

E-mail:

Telephone:

*Registration Fee: $15.00 for participants from PSU

$25.00 for others

Lunch with speakers (optional): $7.00

Checks should be drawn in favor of Penn State University, Krishnaiah Memorial Fund.

Please send the registration form with check for registration before July 15, 2001 to

Bonnie G. Cain

Statistics Department

326 Joab Thomas Building

The Pennsylvania State University

University Park, PA 16802-2111

Registration is required to attend the seminar.

Late registration fee: $30.00.

*Faculty & Graduate Students of the Statistics Department, PSU have a different registration form obtainable from Bonnie G. Cain.