Welcome to the web-site for

 

Bioinformatics II, Spring 2008

 

Lectures: Tue-Thu 11.15am-12.30pm, 005 Ferguson.

 

Instructor: Francesca Chiaromonte.

Coordinates: office 505 Wartik, ph 5-7075, email chiaro at stat.psu.edu

Office hours: 2.30-3.30 Wed afternoons in Wartik, or by appointment.

 

General description | Questions | Groups

 

This site will be updated regularly and will serve as our main communication link outside of class: visit often!

 


 

ANNOUNCEMENTS

 

Class location has changed: starting on Thu Jan 17 we are meeting in 005 Ferguson.

 

Class will be cancelled on Tue Jan 22. Finish reading the notes in the "Review 1" file below on your own.

 

The group assignment file was updated on Jan 27 (check your group!).

 

On Thu Feb 7 we will visit the Microarray Facility of the Pennsylvania State University during the class period (11.15am-12.30pm). The facility is located on the basement level of Thomas Building (you need to take the elevator to the basement). We will visit in two shifts, one starting at 11.15am and one starting at ~11.50am. Your shift for the visit ("1" or "2") is indicated along with your working group in the "Groups" file above.

 

Lecture of interest (highly recommended for Bioinformatics!): Tue Feb 5. Speaker: James Taylor.

Time and place: 10.00am, 333 Information Sciences & Technology Building

Title: Making Sense of Genome-scale Data.

 

Planning for final projects: on Wed Mar 26 the regular office hour (in Wartik 505) will be extended to start planning for final projects. Tentative schedule: 1.30pm Group 1; 1.50pm Group 2; 2.10pm Group 3; 2.20pm Group 4; 2.40pm Group 5.

 

Presentations for final projects have been scheduled for Mon May 5, 1.00-3.00pm. Location: 327 Thomas. PLEASE EMAIL YOUR PROJECT TITLE to Francesca ASAP.

 


 

LECTURES and ASSIGNMENTS

 

A. Introduction to R and a General Statistics Refresher

 

The R system is a freely available implementation of the S language, and can be thought of as a “dialect” of S-plus. It allows users to perform most statistical computations and create graphics easily. A large community of users has made R their system of choice, and the collection of R functions made available on the web by individual users or groups grows by the day.

 

Review 1. Getting familiar with R and review of basic descriptive statistics concepts.

 

Assignment: install R on a machine that you will have access to throughout the course. A useful starting point for installing R is the Comprehensive R Archive Network (CRAN) website. Once you have installed R, repeat the simple calculations we performed in class to start getting familiar with the system. Here are text files of the "toy" data sets: small_toy.txt, chicken_toy.txt.

 

Review 2. Review of inference concepts, standard errors, confidence intervals, the Bootstrap (more R).

 

Reading assignment: read the article Bootstrap methods for standard errors, confidence intervals and other measures of statistical accuracy, B. Efron and R. Tibshirani (1986), Statistical Science 1(1) 54-75 (available through the JSTOR website). You do not need to understand all the details, but the article gives a good general introduction to the logic of the bootstrap, and introduces more sophisticated bootstrap-based CIs than the percentile CIs discussed in class.

 

Data analysis assignment: Using R and the data in chicken_toy.txt, compute bootstrap-based CIs (percentile CIs as in class, or more sophisticated ones if you want) for: (1) the 0.25 quantile of the log length ratio, and (2) the correlation between log length ratio and log insertion ratio. Finally, think about the following question: would bootstrap methods be effective for inference about extreme values (e.g. the minimum of the log length ratio)?
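To make the logic of the percentile CI concrete, here is a minimal sketch. It is written in Python only for illustration (the assignment itself calls for R), it runs on synthetic stand-in data rather than on chicken_toy.txt, and all function and variable names (quantile, pearson, percentile_ci, log_length, log_insertion) are hypothetical choices made for this example.

```python
import random


def quantile(values, q):
    """Empirical q-quantile, interpolating linearly between order statistics."""
    s = sorted(values)
    pos = q * (len(s) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (pos - lo) * (s[hi] - s[lo])


def pearson(pairs):
    """Pearson correlation of a list of (x, y) pairs."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    return sxy / (sxx * syy) ** 0.5


def percentile_ci(data, stat, n_boot=2000, alpha=0.05, seed=1):
    """Percentile bootstrap CI: resample the observations with replacement,
    recompute the statistic each time, and take the alpha/2 and 1-alpha/2
    quantiles of the bootstrap replicates."""
    rng = random.Random(seed)
    n = len(data)
    replicates = [
        stat([data[rng.randrange(n)] for _ in range(n)]) for _ in range(n_boot)
    ]
    return quantile(replicates, alpha / 2), quantile(replicates, 1 - alpha / 2)


# Synthetic stand-in data (NOT the course data set).
rng = random.Random(0)
log_length = [rng.gauss(0.0, 1.0) for _ in range(60)]
log_insertion = [0.5 * v + rng.gauss(0.0, 1.0) for v in log_length]

# (1) CI for the 0.25 quantile; (2) CI for the correlation, resampling pairs.
ci_q25 = percentile_ci(log_length, lambda d: quantile(d, 0.25))
ci_corr = percentile_ci(list(zip(log_length, log_insertion)), pearson)
print("95% percentile CI, 0.25 quantile:", ci_q25)
print("95% percentile CI, correlation:  ", ci_corr)
```

Note that for the correlation the (x, y) pairs are resampled together, so that each bootstrap sample preserves the dependence between the two variables.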

 

Review 3. More inference concepts, testing, random permutations.

 

Data analysis assignment: Using R and the data in chicken_toy.txt, compute bootstrap-based and permutation-based empirical p-values for testing: (1) equality of the median log length ratio between the two groups of windows defined by 1=micro+medium, 2=macro chicken chromosomes, and (2) that the correlation between log length ratio and log insertion ratio is 0. For alternatives in the two tests take, respectively: (1) the median for micro+medium is larger than that for macro, and (2) the correlation is positive. If you want to read more about bootstrap and permutation testing, a good reference is An Introduction to the Bootstrap (reference books below). This assignment will require searching for R functions on the internet and/or creating your own functions. Work in groups and prepare a brief write-up of your results. HAND IN: Tue Feb 12, in class. Solution.
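As an illustration of the permutation logic only, the sketch below computes a one-sided empirical p-value for a difference in medians between two groups. It is written in Python rather than R, uses synthetic data, and the names (perm_pvalue_median_diff, group_1, group_2) are hypothetical. The "+1" in the numerator and denominator is one common convention for empirical p-values and is an assumption here, not necessarily the convention used in class.

```python
import random
import statistics


def perm_pvalue_median_diff(x, y, n_perm=5000, seed=0):
    """One-sided permutation p-value for H1: median(x) > median(y).

    Group labels are shuffled n_perm times; the p-value is the fraction of
    shuffles whose median difference is at least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = statistics.median(x) - statistics.median(y)
    pooled = list(x) + list(y)
    n_x = len(x)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = statistics.median(pooled[:n_x]) - statistics.median(pooled[n_x:])
        if diff >= observed:
            count += 1
    # Count the observed labeling as one of the permutations.
    return (count + 1) / (n_perm + 1)


# Synthetic stand-in data (NOT the course data set): group 1 is shifted upward.
rng = random.Random(42)
group_1 = [rng.gauss(2.0, 1.0) for _ in range(30)]
group_2 = [rng.gauss(0.0, 1.0) for _ in range(30)]
group_3 = [rng.gauss(0.0, 1.0) for _ in range(30)]

p_shifted = perm_pvalue_median_diff(group_1, group_2)
p_null = perm_pvalue_median_diff(group_3, group_2)
print("p-value, shifted groups:", p_shifted)
print("p-value, same distribution:", p_null)
```

The same scheme carries over to the correlation test: permute one of the two variables relative to the other, recompute the correlation, and compare with the observed value.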

 

 

B. Microarray (MA) Data, Generalities and Pre-processing

 

Introduction. MA data and a statistical roadmap.

 

Assignment: In future lectures, we will describe a few methods and algorithms implemented in R and distributed through Bioconductor, an open source and open development software project for the analysis of genomic data. Functions available through Bioconductor may be very useful for your research. Go to the Bioconductor site, download the relevant material, and spend some time browsing documentation and available options.

 

Preprocessing MA data. Normalization -- notions on non-parametric regression and robust methods.

 

Reading assignment: The following are three reference papers for the material presented in class, and much more: Yang Y.H., Dudoit S., Luu P., Speed T.P. (2001) Normalization for cDNA microarray data, SPIE BiOS 2001, San Jose CA; Yang Y.H., Dudoit S., Luu P., Lin D.M., Peng V., Ngai J., Speed T.P. (2002), Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation, Nucleic Acids Research 30(4); Bolstad B.M., Irizarry R.A., Astrand M., Speed T.P. (2003) A comparison of normalization methods for high density oligonucleotide array data based on bias and variance, Bioinformatics 19(2): 185-193 (see package "affy" available through Bioconductor). Read the papers, and explore the relevant Bioconductor materials. If you work with Affymetrix data, you may also want to explore this RMAExpress site.
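One of the methods compared in Bolstad et al. (2003) is quantile normalization, which forces all arrays to share the same empirical distribution of intensities. The sketch below illustrates the basic idea in Python on a tiny made-up matrix; the function name quantile_normalize is hypothetical, ties are broken by position rather than averaged, and this is not the implementation found in the "affy" package.

```python
def quantile_normalize(arrays):
    """Quantile-normalize a list of arrays (one list of intensities per array).

    Each value is replaced by the mean, across arrays, of the values that
    share its rank. All arrays must have the same length.
    """
    n = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    # Mean of the k-th smallest values across arrays, for each rank k.
    rank_means = [
        sum(s[k] for s in sorted_arrays) / len(arrays) for k in range(n)
    ]
    normalized = []
    for a in arrays:
        # Indices of a in increasing order of intensity (ties by position).
        order = sorted(range(n), key=lambda i: a[i])
        new = [0.0] * n
        for rank, i in enumerate(order):
            new[i] = rank_means[rank]
        normalized.append(new)
    return normalized


# Made-up intensities for three arrays and four probes.
raw = [
    [5.0, 2.0, 3.0, 4.0],
    [4.0, 1.0, 4.0, 2.0],
    [3.0, 4.0, 6.0, 8.0],
]
norm = quantile_normalize(raw)
for array in norm:
    print([round(v, 3) for v in array])
```

After normalization every array contains the same set of values, only in a different order, so between-array differences in overall intensity distribution are removed.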

 

Remarks:

  • In addition to 2-color "spotted" arrays and Affymetrix chips, other microarray platforms have become broadly used -- and many articles have been written comparing the performance of different platforms (e.g. Reynies et al. (2006), Comparison of the latest commercial short and long oligonucleotide microarray technologies, BMC Genomics, 7:51). One such platform is Agilent. Normalization for data produced by Agilent microarrays is discussed in Zahurak et al. (2007), Pre-processing Agilent microarray data, BMC Bioinformatics 8:142. The references in this article also provide very good pointers to recent literature on microarray normalization.
  • Also, as technologies improve and allow more and more “spots” or “cells” to be placed on the arrays, other applications have become possible, e.g. exon arrays, which make it possible to trace transcriptional abundance for splice variants, or tiling arrays, which represent an entire genomic sequence and make it possible to detect potential transcription across it (we may discuss these further at the end of the course).
  • To explore issues related to the design of MA experiments, two good starting points are: Churchill (2002), Fundamentals of experimental design for cDNA microarrays, Nature Genetics 32, 490-495, and Yang and Speed (2002), Design issues for cDNA microarray experiments, Nature Reviews Genetics 3, 579-588.

 

More on preprocessing MA data. Missing value imputation, other preliminary transformations, filtering.

 

Reading assignment: a paper that gives a good instance of how people think about missing value imputation for microarray data is Troyanskaya, O., Cantor, M., Sherlock, G., Brown, P., Hastie, T., Tibshirani, R., Botstein, D., and Altman, R. B. (2001), Missing value estimation methods for DNA microarrays, Bioinformatics 17(6):520-525. Consider the procedure proposed in this paper to evaluate various imputation methods, and refer to the concepts of missing completely at random, at random, and not at random (as introduced in the lecture). Do you see any problems with the proposed evaluation procedure? Another interesting paper is R. Jornsten et al. (2005), DNA microarray imputation and significance analysis of differential expression, Bioinformatics 21(22):4155-4161. This paper proposes a technique to combine imputation methods, adapting the weights given to local vs global imputation. It also contains an analysis of the impact of various imputation approaches (including SVD and KNN) on the detection of differentially expressed genes.

 

 

C. MA data, Differential Expression for Thousands of Genes

 

Differential expression. Identifying differentially expressed genes -- notions on multiple testing and p-value adjustments.
 

Reading assignment: the reference paper for the material presented in class, and more, is: Dudoit, S., Yang, Y.H., Speed, T.P., and Callow, M.J. (2002), Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments, Statistica Sinica 12(1):111-139. More detail on multiple testing and p-value adjustments in MA data analysis can be found in: Dudoit S., Shaffer J.P., Boldrick J.C. (2003), Multiple hypothesis testing in microarray experiments, Statistical Science 18(1): 71-103. Another interesting paper on the identification of differentially expressed genes is Efron B., Tibshirani, R., Storey J.D., and Tusher V. (2001), Empirical Bayes analysis of a microarray experiment, Journal of the American Statistical Association 96:1151-1160.
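To illustrate what a p-value adjustment does, the sketch below implements two standard procedures: Holm's step-down adjustment (which controls the family-wise error rate) and the Benjamini-Hochberg step-up adjustment (which controls the false discovery rate). It is a Python illustration on made-up p-values; the function names are hypothetical, and these two procedures are chosen as common examples, not necessarily the ones emphasized in class or in the papers above.

```python
def holm_adjust(pvalues):
    """Holm step-down adjusted p-values (family-wise error rate control)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The k-th smallest p-value (k = rank + 1) is multiplied by m - k + 1;
        # the running maximum keeps the adjusted values monotone.
        running_max = max(running_max, min(1.0, (m - rank) * pvalues[i]))
        adjusted[i] = running_max
    return adjusted


def bh_adjust(pvalues):
    """Benjamini-Hochberg step-up adjusted p-values (FDR control)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down; the k-th smallest is scaled by m / k.
    for rank in range(m - 1, -1, -1):
        i = order[rank]
        running_min = min(running_min, pvalues[i] * m / (rank + 1))
        adjusted[i] = running_min
    return adjusted


# Made-up raw p-values for four genes.
raw_p = [0.01, 0.04, 0.03, 0.005]
holm_p = holm_adjust(raw_p)
bh_p = bh_adjust(raw_p)
print("Holm:", [round(p, 4) for p in holm_p])
print("BH:  ", [round(p, 4) for p in bh_p])
```

A gene is then declared differentially expressed when its adjusted p-value falls below the chosen level. With thousands of genes, the FDR-based adjustment is typically much less conservative than the family-wise one.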

 

Reading assignment: SAM (Significance Analysis of Microarrays) is a widely used methodology, which involves the concept of FDR (False Discovery Rate). Read about it in Tusher, V.G., Tibshirani, R., and Chu, G. (2001) Significance analysis of microarrays applied to the ionizing radiation response, PNAS 98:5116-5121, and at the SAM site. For more information on FDR, refer to Storey J.D. (2002), A direct approach to false discovery rates, JRSS-B 64(3):479-498. Working in groups, prepare a short report (2-3 pages) on SAM. HAND IN: Thur Mar 20.

 

 

D. Multivariate Statistics for MA Data

 

Basic patterns in data and (unsupervised) dimension reduction. Principal Components Analysis.

More details on principal components can be found in Multivariate Analysis text books (see list below).

 

Reading assignment: two papers introducing these techniques to the analysis of MA data are: N.S. Holter, M. Mitra, A. Maritan, M. Cieplak, J.R. Banavar, and N.V. Fedoroff (2000), Fundamental patterns underlying gene expression profiles: Simplicity from complexity, PNAS 97: 8409-8414; O. Alter, P.O. Brown, and D. Botstein (2000), Singular value decomposition for genome-wide expression data processing and modeling, PNAS 97: 10101-10106.

 

Data analysis assignment: Instructions. Data set yeast_cycle.txt. Work in groups and prepare a brief write-up of your results. HAND IN: Thur Apr 3.

 

Characteristic patterns in data and (unsupervised) classification: Cluster Analysis.

Clustering and dimension reduction: An example.

Evaluating cluster solutions: How many clusters?

More details on cluster analysis can be found in Multivariate Analysis text books (see list below).

 

Reading assignment: Several papers have addressed the issue of selecting the number of clusters in MA data analysis. Some of the concepts presented in class can be found in: Ben-Hur A, Elisseeff A, Guyon I (2002) A stability based method for discovering structure in clustered data, Proceedings of PSB 2002; Tibshirani R, Walther G, Hastie T (2001) Estimating the Number of Clusters in a Dataset via the Gap Statistic, J. Royal Stat. Soc. B 63, 411-423 (as tech report); Dudoit S, Fridlyand J (2002) A prediction based resampling method for estimating the number of clusters in a data set, Genome Biology 3:research0036.1-0036.21.
 

Data analysis assignment: Instructions. Data set yeast_shock.txt. (yeast_shock.xls also contains short descriptions for the genes). Work in groups and prepare a brief write-up of your results. HAND IN: Tue Apr 29.

  

More on gene clustering: Jackknifing, seeded clustering and replicates.

 

Reading assignment: the following two papers describe more issues and techniques for gene clustering with MA data. They will be useful in working on the final projects. Bryan (2003) Problems in gene clustering based on gene expression data, Journal of Multivariate Analysis 90, 44-66 (see also references therein; related papers by the same author). Heyer, Kruglyak, and Yooseph (1999) Exploring Expression Data: Identification and Analysis of Coexpressed Genes, Genome Research 9(11) 1106-1115.

  

 

E. Supervised Analyses for MA and other Genomics Data

 

Working with a response, and supervised dimension reduction: Linear Discriminant Analysis or Sliced Inverse Regression applied after Principal Components Analysis in large-p-small-n problems.

 

Reading assignment: more details on the two-stage approach for supervised dimension reduction applied to microarray data can be found in Chiaromonte F. and Martinelli J.A. (2002) Dimension reduction strategies for analyzing global gene expression data with a response, Mathematical Biosciences 176 (1), 123-144. To learn more about how variable (e.g. gene) selection can be performed based on supervised dimension reduction, see Li L., Cook R.D. and Nachtsheim C.J. (2005) Model free variable selection, J.R. Statist. Soc. B 67(2), 285-299.

 

More on supervised dimension reduction in large-p-small-n problems: "Directing" the inversion of a variance-covariance matrix with an iterative approach.

 

Reading assignment: Here we applied supervised dimension reduction to data derived from multi-species genomic alignment. More details on this iterative methodology can be found in Cook R.D., Li B. and Chiaromonte F. (2007) Dimension reduction in regression without matrix inversion, Biometrika 94(3), 569-584. More details on Regulatory Potential scores (7-ways) and the algorithms used in their calculation can be found in Taylor J., Tyekucheva S., King D.C., Hardison R.C., Miller W., and Chiaromonte F. (2006) ESPERR: Learning strong and weak signals in genomic sequence alignments to identify functional elements, Genome Research 16, 1596-1604.

 

Yet more on the analysis of under-sampled data: The Augmented Bootstrap approach – Guest lecture by S. Tyekucheva.

 

Reading assignment: The Augmented Bootstrap, and some applications to genomic data for supervised dimension reduction, network reconstruction and CART trees are described in Tyekucheva S. and Chiaromonte F. (2008) Augmenting the bootstrap to analyze high dimensional genomic data, Test 17, 1-18, and the Rejoinder, Test 17, 47-55. More details on the X-chromosome inactivation study can be found in Carrel L., Park C., Tyekucheva S., Dunn J., Chiaromonte F. and Makova K.D. (2006) Genomic Environment Predicts Expression Patterns on the Human Inactive X Chromosome, PLoS Genetics.

 

A data set to practice with supervised classification/dimension reduction: Data_description. Data set leukemia_886.xls.

 


 

FINAL PROJECTS

 

Each group should prepare a presentation lasting approximately 25 minutes. All group members should be involved in describing the work (i.e. take turns in speaking) and be ready to answer questions. A hard copy of the presentation (or, if you prefer, an extended description of what you did) should be handed in to Francesca Chiaromonte right before your presentation. If you want a pdf file of your presentation to be posted on the class web-site, email it to Francesca the evening before your presentation date.

 

Presentation Schedule: Monday May 5, 327 Thomas (times are approximate)

  • 1.00pm, Group 1: Dinatale Brett C., Kumar Swathi A., Chen Kuan-Bei. G1E microarray analysis: comparison of methods.
  • 1.25pm, Group 3: Moktali Venkatesh P., Samorodnitsky Eric, Kim Eun Kyoung.
  • 1.50pm, Group 4: Jiao Yuannian, Paranich Gary, Zhou Xiaofan, Pussey Barbara. Arabidopsis Cold Stress Network.
  • 2.15pm, Group 5: Ma Zhaorong, Polato Nicholas R., Chang Ti-Cheng, Gowda Aghalaya Shyama Sundar Deepika. Differential gene expression in Melitaea cinxia in relation to population age and metabolic rate.
  • 2.40pm, Group 2: Tian Donglan, Hariharan Charanya, Zhang Zhenhai, Zhang Yao. Yeast responses to water deficit and rehydration.


 

An (evolving) list of useful links (in addition to the ones given for specific lectures)

T. Speed's group | G. Churchill's group | Stanford Stat's group | W. Li's bibliographic reference list | Info on multiple imputation methods

 

An (evolving) list of Reference Books

Statistical Analysis of Microarray Data:

Statistical Analysis of Gene Expression Microarray Data. Speed (ed.). Chapman & Hall.

The Analysis of Gene Expression Data: Methods and Software. Parmigiani, Garrett, Irizarry and Zeger (eds). Springer NY.

Analyzing Microarray Gene Expression Data. McLachlan, Do and Ambroise. Wiley NY.

Statistics for Microarrays. Wit and McClure. Wiley NY.

General Statistics:

Probability and Statistical Inference (5th ed). Hogg and Tanis. Prentice Hall.

R and S-plus:

Data Analysis and Graphics Using R. Maindonald and Braun. Cambridge Univ. Press.

Introductory Statistics with R. Dalgaard. Springer-Verlag.

Programming with Data, a Guide to the S Language. Chambers. Springer-Verlag.

S programming. Venables and Ripley. Springer-Verlag.

Modern Applied Statistics with S (4th ed). Venables and Ripley. Springer-Verlag.

Computational Statistics:

An Introduction to the Bootstrap. Efron and Tibshirani. Chapman & Hall CRC.

Permutation Tests (2nd ed). Good. Springer-Verlag.

Regression Methods (and related topics):

Applied Regression Including Computing and Graphics. Cook and Weisberg. Wiley NY.

Applied Regression Analysis. Draper and Smith. Wiley NY.

Multivariate Analysis:

Methods for Statistical Data Analysis of Multivariate Observations (2nd ed). Gnanadesikan. Wiley NY.

Multivariate Observations. Seber. Wiley NY.

Clustering Algorithms. Hartigan. Wiley NY.

Self Organizing Maps (2nd ed). Kohonen. Springer-Verlag.

Finding Groups in Data: An Introduction to Cluster Analysis. Kaufman and Rousseeuw. Wiley NY.