Welcome to the web-site for
Bioinformatics II, Spring 2008
Lectures: Tue-Thu 11.15am-12.30pm,
005
Instructor: Francesca Chiaromonte.
Coordinates: office 505 Wartik, ph 5-7075, email chiaro at stat.psu.edu
Office hours: 2.30-3.30 Wed afternoons in Wartik, or by appointment.
General description | Questions | Groups
This site will be constantly updated, and work as our main communication link outside of class: visit often!
ANNOUNCEMENTS
Class location has changed:
starting on Thu Jan 17 we are meeting in 005
Class will be cancelled on Tue Jan 22. Finish reading the notes in the "Review 1" file below on your own.
The group assignment file was updated on Jan 27 (check on your group!)
On Thu Feb 7 we will visit the Microarray
Facility of the
Lecture of interest (highly recommended for Bioinformatics!): Tue Feb 5. Speaker: James Taylor.
Time and place: 10.00am, 333
Information Sciences &
Title: Making Sense of Genome-scale Data.
Planning for final projects: on Wed Mar 26 the regular office hour (in Wartik 505) will be extended to start planning for final projects. Tentative schedule: 1.30pm Group 1; 1.50pm Group 2; 2.10pm Group 3; 2.20pm Group 4; 2.40pm Group 5.
Presentations for final projects have been scheduled for Mon May 5, 1.00-3.00pm. Location: 327 Thomas. PLEASE EMAIL YOUR PROJECT TITLE to Francesca ASAP.
LECTURES and ASSIGNMENTS
A. Introduction to R and a General Statistics Refresher
The R system is a freely available implementation of the S language, closely related to S-plus. It lets users easily perform most statistical computations and create graphics. A large community has made R its system of choice, and the collection of R functions made available on the web by individual users and groups grows by the day.
Review 1. Getting familiar with R and review of basic descriptive statistics concepts.
Assignment: install R on a machine that you will have access to throughout the course. A useful starting point for installing R is the Comprehensive R Archive Network: CRAN website. Once you have installed R, repeat the simple calculations we performed in class to start getting familiar with the system. Here are text files of the "toy" data sets: small_toy.txt, chicken_toy.txt.
Review 2. Review of inference concepts, standard errors, confidence intervals, the Bootstrap (more R).
Reading assignment: read the article Bootstrap methods for standard errors, confidence intervals and other measures of statistical accuracy, B. Efron and R. Tibshirani, Statistical Science 1(1) 54-75 (available through the JSTOR website). You do not need to understand all the details, but this gives you a good general introduction to the logic of the bootstrap, and introduces more sophisticated bootstrap-based CI's than the percentile CI's discussed in class.
Data analysis assignment: Using R and the data in chicken_toy.txt, compute bootstrap-based CI's (percentile CI's as in class, or more sophisticated if you want) for: (1) the 0.25 quantile of the log length ratio, and (2) the correlation between log length ratio and log insertion ratio. Finally, think about the following question: would bootstrap methods be effective for inference about extreme values? (e.g. the minimum of the log length ratio).
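The percentile bootstrap logic can be sketched as follows. This is only an illustration written in Python with the standard library (the course itself uses R), and it runs on synthetic data: the variables `x` and `pairs`, the helper names, and the choice of 2000 resamples are all assumptions made for the example, not the contents of chicken_toy.txt or the method used in class.

```python
import random

def quantile(values, q):
    """Empirical quantile, linear interpolation between order statistics."""
    xs = sorted(values)
    pos = q * (len(xs) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])

def pearson(pairs):
    """Pearson correlation for a list of (x, y) tuples."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    return sxy / (sxx * syy) ** 0.5

def percentile_ci(data, statistic, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap CI: resample observations with replacement,
    recompute the statistic, and take quantiles of the replicates."""
    rng = random.Random(seed)
    n = len(data)
    reps = [
        statistic([data[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_boot)
    ]
    alpha = 1 - level
    return quantile(reps, alpha / 2), quantile(reps, 1 - alpha / 2)

# Synthetic stand-in data (hypothetical, NOT the course data set).
rng = random.Random(1)
x = [rng.gauss(0, 1) for _ in range(200)]
pairs = [(xi, 0.6 * xi + rng.gauss(0, 0.8)) for xi in x]

q25_ci = percentile_ci(x, lambda s: quantile(s, 0.25))
cor_ci = percentile_ci(pairs, pearson)  # whole (x, y) pairs are resampled
print("0.25 quantile CI:", q25_ci)
print("correlation CI:", cor_ci)
```

Note that for the correlation the resampling unit is the pair, so the dependence between the two variables is preserved in every bootstrap sample.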
Review 3. More inference concepts, testing, random permutations.
Data analysis assignment: Using R and the data in chicken_toy.txt, compute bootstrap-based and permutation-based empirical p-values for testing: (1) equality of the median log length ratio between the two groups of windows defined by 1=micro+medium, 2=macro chicken chromosomes, and (2) that the correlation between log length ratio and log insertion ratio is 0. For alternatives in the two tests take, respectively: (1) the median for micro+medium is larger than that for macro, and (2) the correlation is positive. If you want to read more about bootstrap and permutation testing, a good reference is An Introduction to the Bootstrap (reference books below). This assignment will require searching for R functions on the internet and/or creating your own functions. Work in groups and prepare a brief write-up of your results. HAND IN: Tue Feb 12, in class. Solution.
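A permutation-based empirical p-value for a one-sided comparison of medians could look like the sketch below. It is written in Python with the standard library (the course uses R) and runs on synthetic groups; the function name, the number of permutations, and the data are assumptions for illustration, and this is not the posted solution.

```python
import random
import statistics

def perm_pvalue_median_diff(group1, group2, n_perm=2000, seed=0):
    """One-sided permutation p-value for H1: median(group1) > median(group2).

    Group labels are shuffled by permuting the pooled values; the p-value is
    the share of permuted differences at least as large as the observed one
    (with the usual +1 correction so that it is never exactly 0).
    """
    rng = random.Random(seed)
    observed = statistics.median(group1) - statistics.median(group2)
    pooled = list(group1) + list(group2)
    n1 = len(group1)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = statistics.median(pooled[:n1]) - statistics.median(pooled[n1:])
        if diff >= observed:
            count += 1
    return (count + 1) / (n_perm + 1)

# Synthetic stand-in groups (hypothetical, NOT the chicken windows).
rng = random.Random(2)
shifted = [rng.gauss(1.0, 1.0) for _ in range(60)]
baseline = [rng.gauss(0.0, 1.0) for _ in range(60)]

p_shifted = perm_pvalue_median_diff(shifted, baseline)
p_null = perm_pvalue_median_diff(baseline, baseline)
print("shifted vs baseline:", p_shifted)
print("baseline vs itself:", p_null)
```

The same skeleton applies to the correlation test: permute one variable relative to the other and count how often the permuted correlation is at least as large as the observed one.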
B. Microarray (MA) Data, Generalities and Pre-processing
Introduction. MA data and a statistical roadmap.
Assignment: In future lectures, we will describe a few methods and algorithms implemented in R and distributed through Bioconductor, an open source and open development software project for the analysis of genomic data. Functions available through Bioconductor may be very useful for your research. Go to the Bioconductor site, download the relevant material, and spend some time browsing documentation and available options.
Preprocessing MA data. Normalization -- notions on non-parametric regression and robust methods.
Reading assignment: The following are three reference papers for the material presented in class, and much more: Yang Y.H., Dudoit S., Luu P., Speed T.P. (2001) Normalization for cDNA microarray data, SPIE BiOS 2001, San Jose CA; Yang Y.H., Dudoit S., Luu P., Lin D.M., Peng V., Ngai J., Speed T.P. (2002), Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation, Nucleic Acids Research 30(4); Bolstad B.M., Irizarry R.A., Astrand M., Speed T.P. (2003) A comparison of normalization methods for high density oligonucleotide array data based on bias and variance, Bioinformatics 19(2): 185-193 (see package "affy" available through Bioconductor). Read the papers, and explore the relevant Bioconductor materials. If you work with Affymetrix data, you may also want to explore this RMAExpress site.
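To make the idea of quantile normalization (one of the methods compared in Bolstad et al.) concrete, here is a minimal sketch in Python with the standard library. It assumes complete data, one list per array, and breaks ties by position rather than averaging them; it is not the implementation in the "affy" package, and the function name and toy values are hypothetical.

```python
def quantile_normalize(columns):
    """Give every array (column) the same empirical distribution.

    Steps: sort each column, average across columns at each rank,
    then put the rank averages back in each column's original order.
    """
    n = len(columns[0])
    sorted_cols = [sorted(col) for col in columns]
    rank_means = [
        sum(sc[i] for sc in sorted_cols) / len(columns) for i in range(n)
    ]
    normalized = []
    for col in columns:
        order = sorted(range(n), key=lambda i: col[i])  # indices by rank
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = rank_means[rank]
        normalized.append(out)
    return normalized

# Toy intensities for three arrays and four probes (made-up numbers).
arrays = [
    [5.0, 2.0, 3.0, 4.0],
    [4.0, 1.0, 4.5, 2.0],
    [3.0, 4.0, 6.0, 8.0],
]
normalized = quantile_normalize(arrays)
for col in normalized:
    print(col)
```

After normalization each array contains exactly the same set of values, arranged according to that array's own ranking of the probes.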
Remarks:
More on preprocessing MA data. Missing value imputation, other preliminary transformations, filtering.
Reading assignment: a paper that gives a good instance of how people think about missing value imputation for microarray data is Troyanskaya, O., Cantor, M., Sherlock, G., Brown, P., Hastie, T., Tibshirani, R., Botstein, D., and Altman, R. B. (2001), Missing value estimation methods for DNA microarrays, Bioinformatics 17(6):520-525. Consider the procedure proposed to evaluate various imputation methods in this paper, and refer to the concepts of missing completely at random, at random, not at random (as introduced in the lecture). Do you see any problems with the proposed evaluation procedure? Another interesting paper is R. Jornsten et al. (2005), DNA microarray imputation and significance analysis of differential expression, Bioinformatics 21(22):4155-4161. This proposes a technique to combine imputation methods, adapting the weights given to local vs global imputation. The paper also contains an analysis of the impact of various imputation approaches (including SVD and KNN) on the detection of differentially expressed genes.
C. MA data, Differential Expression for Thousands of Genes
Differential expression.
Identifying differentially expressed genes -- notions on multiple testing and p-value adjustments.
Reading assignment: the reference paper for the material presented in class, and more, is: Dudoit, S., Yang, Y.H., Speed, T.P., and Callow, M.J. (2002), Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments, Statistica Sinica 12(1):111-139. More detail on multiple testing and p-value adjustments in MA data analysis can be found in: Dudoit S., Shaffer J.P., Boldrick J.C. (2003). Multiple hypothesis testing in microarray experiments, Statistical Science, 18(1): 71-103. Another interesting paper on the identification of differentially expressed genes is Efron B., Tibshirani, R., Storey J.D., and Tusher V. (2001), Empirical Bayes analysis of a microarray experiment, Journal of the American Statistical Association 96:1151-1160.
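As a small illustration of p-value adjustments, the sketch below computes Bonferroni, Holm (step-down) and Benjamini-Hochberg (step-up, FDR) adjusted p-values in Python with the standard library. The function names and the four example p-values are made up; the resampling-based adjustments discussed in the papers above are not covered here.

```python
def bonferroni(pvals):
    """Single-step adjustment: multiply every p-value by the number of tests."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]

def holm(pvals):
    """Step-down adjustment: the k-th smallest p-value is multiplied by
    (m - k + 1), then a running maximum enforces monotonicity."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        running_max = max(running_max, min(1.0, (m - rank) * pvals[idx]))
        adjusted[idx] = running_max
    return adjusted

def benjamini_hochberg(pvals):
    """Step-up FDR adjustment: the k-th smallest p-value is multiplied by
    m / k, then a running minimum from the largest down enforces monotonicity."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m - 1, -1, -1):
        idx = order[rank]
        running_min = min(running_min, min(1.0, pvals[idx] * m / (rank + 1)))
        adjusted[idx] = running_min
    return adjusted

raw = [0.01, 0.04, 0.03, 0.005]  # made-up raw p-values for four genes
adj_bonf = bonferroni(raw)
adj_holm = holm(raw)
adj_bh = benjamini_hochberg(raw)
print("Bonferroni:", adj_bonf)
print("Holm:      ", adj_holm)
print("BH:        ", adj_bh)
```

For any set of p-values the BH-adjusted values are no larger than the Holm-adjusted ones, which in turn are no larger than the Bonferroni-adjusted ones, reflecting the progressively stricter error rates being controlled.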
D. Multivariate Statistics for MA Data
Basic patterns in data and (unsupervised) dimension reduction. Principal Components Analysis.
More details on principal components can be found in Multivariate Analysis text books (see list below).
Reading assignment: two papers introducing these techniques to the analysis of MA data are: N.S. Holter, M. Mitra, A. Maritan, M. Cieplak, J.R. Banavar, and N.V. Fedoroff (2000), Fundamental patterns underlying gene expression profiles: Simplicity from complexity, PNAS 97: 8409-8414; O. Alter, P.O. Brown, and D. Botstein (2000), Singular value decomposition for genome-wide expression data processing and modeling, PNAS 97: 10101-10106.
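For intuition about what the first principal component captures, here is a minimal sketch in Python with the standard library. It centers the data, forms the sample covariance matrix, and finds the leading eigenvector by power iteration, a simple stand-in chosen for this example rather than the SVD-based computation used in the papers above. The function name, iteration count and synthetic two-variable data are assumptions.

```python
import random

def leading_pc(rows, n_iter=500):
    """Return (unit direction of the first PC, share of variance it explains)."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    cov = [
        [sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(p)]
        for a in range(p)
    ]
    # Power iteration: repeatedly apply the covariance matrix and renormalize.
    # The all-ones start is an arbitrary choice and would fail only if it
    # were exactly orthogonal to the leading eigenvector.
    v = [1.0] * p
    for _ in range(n_iter):
        w = [sum(cov[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    eigenvalue = sum(
        v[a] * sum(cov[a][b] * v[b] for b in range(p)) for a in range(p)
    )
    total_variance = sum(cov[a][a] for a in range(p))
    return v, eigenvalue / total_variance

# Synthetic data lying close to the line y = 2x (made-up example).
rng = random.Random(3)
rows = []
for _ in range(100):
    t = rng.gauss(0, 1)
    rows.append([t, 2 * t + rng.gauss(0, 0.1)])

direction, explained = leading_pc(rows)
print("first PC direction:", direction)
print("share of variance explained:", explained)
```

The sign of the direction is arbitrary: a principal component and its negative describe the same axis.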
Data analysis assignment: Instructions. Data set yeast_cycle.txt. Work in groups and prepare a brief write-up of your results. HAND IN: Thu Apr 3.
Characteristic patterns in data and (unsupervised) classification: Cluster Analysis.
Clustering and dimension reduction: An example.
Evaluating cluster solutions: How many clusters?
More details on cluster analysis can be found in Multivariate Analysis text books (see list below).
Data analysis assignment: Instructions. Data set yeast_shock.txt. (yeast_shock.xls also contains short descriptions for the genes). Work in groups and prepare a brief write-up of your results. HAND IN: Tue Apr 29.
More on gene clustering: Jackknifing, seeded clustering and replicates.
E. Supervised Analyses for MA and other Genomics Data
Working with a response, and supervised dimension reduction: Linear Discriminant Analysis or Sliced Inverse Regression applied after Principal Components Analysis in large-p-small-n problems.
Reading assignment: more details on the two-stage approach for supervised dimension reduction applied to microarray data can be found in Chiaromonte F. and Martinelli J.A. (2002) Dimension reduction strategies for analyzing global gene expression data with a response, Mathematical Biosciences 176 (1), 123-144. To learn more about how variable (e.g. gene) selection can be performed based on supervised dimension reduction, see Li L., Cook R.D. and Nachtsheim C.J. (2005) Model free variable selection, J.R. Statist. Soc. B 67(2), 285-299.
More on supervised dimension reduction in large-p-small-n problems: "Directing" the inversion of a variance-covariance matrix with an iterative approach.
Yet more on the analysis of under-sampled data: The Augmented Bootstrap approach – Guest lecture by S. Tyekucheva.
Reading assignment: The Augmented Bootstrap, and some applications to genomic data for supervised dimension reduction, network reconstruction and CART trees are described in Tyekucheva S. and Chiaromonte F. (2008) Augmenting the bootstrap to analyze high dimensional genomic data, Test 17, 1-18, and the Rejoinder, Test 17, 47-55. More details on the X-chromosome inactivation study can be found in Carrel L., Park C., Tyekucheva S., Dunn J., Chiaromonte F. and Makova K.D. (2006) Genomic Environment Predicts Expression Patterns on the Human Inactive X Chromosome, PLoS Genetics.
A data set to practice with supervised classification/dimension reduction: Data_description. Data set leukemia_886.xls.
FINAL PROJECTS
Each group should prepare a presentation lasting approximately 25 minutes. All group members should be involved in describing the work (i.e. take turns in speaking) and be ready to answer questions. A hard copy of the presentation (or, if you prefer, an extended description of what you did) should be handed in to Francesca Chiaromonte right before your presentation. If you want a PDF file of your presentation to be posted on the class web-site, email it to Francesca the evening before your presentation date.
Presentation Schedule: Monday May 5, 327 Thomas (times are approximate)
· 1.00pm, Group 1: Dinatale Brett C., Kumar Swathi A., Chen Kuan-Bei. G1E microarray analysis: comparison of methods.
· 1.25pm, Group 3: Moktali Venkatesh P., Samorodnitsky Eric, Kim Eun Kyoung.
· 1.50pm, Group 4: Jiao Yuannian, Paranich Gary, Zhou Xiaofan, Pussey Barbara. Arabidopsis Cold Stress Network.
· 2.15pm, Group 5: Ma Zhaorong, Polato Nicholas R., Chang Ti-Cheng, Gowda Aghalaya Shyama Sundar Deepika. Differential gene expression in Melitaea cinxia in relation to population age and metabolic rate.
· 2.40pm, Group 2: Tian Donglan, Hariharan Charanya, Zhang Zhenhai, Zhang Yao. Yeast responses to water deficit and rehydration.
An (evolving) list of useful links (in addition to the ones given for specific lectures)
T. Speed's group | G. Churchill's group | Stanford Stat's group | W. Li's bibliographic reference list | Info on multiple imputation methods
An (evolving) list of Reference Books
Statistical Analysis of Microarray Data:
Statistical Analysis of Gene Expression Microarray Data. Speed (ed.). Chapman & Hall.
The Analysis of Gene Expression Data: Methods and Software. Parmigiani, Garrett, Irizarry and Zeger (eds). Springer NY.
Analyzing Microarray Gene Expression Data. McLachlan, Do and Ambroise. Wiley NY.
Statistics for Microarrays. Wit and McClure. Wiley NY.
General Statistics:
Probability and Statistical Inference (5th ed). Hogg and
R and S-plus:
Data Analysis and Graphics Using R. Maindonald and Braun.
Introductory Statistics with R. Dalgaard. Springer-Verlag.
Programming with Data, a Guide to the S Language. Chambers. Springer-Verlag.
S programming. Venables and Ripley. Springer-Verlag.
Modern Applied Statistics with S (4th ed). Venables and Ripley. Springer-Verlag.
Computational Statistics:
An Introduction to the Bootstrap. Efron and Tibshirani. Chapman & Hall CRC.
Permutation Tests (2nd ed). Good. Springer-Verlag.
Regression Methods (and related topics):
Applied Regression Including Computing and Graphics. Cook and Weisberg. Wiley NY.
Applied Regression Analysis. Draper and Smith. Wiley NY.
Multivariate Analysis:
Methods for Statistical Data Analysis of Multivariate Observations (2nd ed). Gnanadesikan. Wiley NY.
Multivariate Observations. Seber. Wiley NY.
Clustering Algorithms. Hartigan. Wiley NY.
Self Organizing Maps (2nd ed). Kohonen. Springer-Verlag.
Finding Groups in Data: An Introduction to Cluster Analysis. Kaufman and Rousseeuw. Wiley NY.