Statistical methods for human microbiome data analysis [electronic resource].

Chen, Jun.
118 p.
Local subjects:
Biology, Biostatistics.
Biology, Microbiology.
Biology, Bioinformatics.
Penn dissertations -- Genomics and computational biology.
Genomics and computational biology -- Penn dissertations.
System Details:
Mode of access: World Wide Web.
The human microbiome is the totality of microbes, their genetic elements, and the interactions they have with surrounding environments throughout the human body. Studies have implicated the human microbiome in health and disease. Two central themes of human microbiome studies are to identify potential factors influencing the microbiome composition, and to define the relationship between microbiome features and biological or clinical outcomes. With the development of next-generation sequencing technologies, the human microbiome composition can be interrogated using high-throughput DNA sequencing. One strategy sequences the bacterial 16S ribosomal RNA gene for species identification. These 16S sequences are usually clustered into Operational Taxonomic Units (OTUs). Analysis of such OTU data raises several important statistical challenges, including taking into account the phylogenetic relationship among OTUs and modeling high-dimensional overdispersed count data. This dissertation presents three statistical methods developed specifically for 16S data analysis, centered on these two themes. To test the association between overall microbiome composition and a covariate or an outcome, a testing procedure based on a generalized UniFrac distance was developed. The generalized UniFrac distance corrects the undue weight that classic UniFrac distances place on either highly abundant or rare lineages, and was shown to be more powerful than the classic UniFrac distances. Under the framework of canonical correlation analysis (CCA), a structure-constrained sparse CCA was proposed to select the OTUs and their correlated covariates. A phylogenetic structure-constrained penalty function was imposed to induce a degree of smoothness on the linear coefficients according to the OTU phylogenetic relationship. Structure-constrained sparse CCA performed much better than sparse CCA in selecting relevant OTUs.
Finally, a sparse Dirichlet-multinomial regression (SDMR) model was developed to link the microbiome composition to environmental covariates and to select the most important covariates and their affected OTUs. SDMR accounts for the overdispersion of OTU counts and uses a sparse group ℓ1 penalty function to facilitate selection of covariates and OTUs simultaneously. These methods were illustrated using simulations as well as a real human gut microbiome data set from a study of dietary effects on gut microbiome composition.
Thesis (Ph.D. in Genomics and Computational Biology) -- University of Pennsylvania, 2012.
Source: Dissertation Abstracts International, Volume: 74-03(E), Section: B.
Adviser: Hongzhe Li.
Local notes:
School code: 0175.
Zhang, Nancy, committee member
Wang, Li-San, committee member
Bushman, Frederic, committee member
Li, Mingyao, committee member
Li, Hongzhe, advisor
University of Pennsylvania. Genomics and Computational Biology.
Contained In:
Dissertation Abstracts International 74-03B(E).
Access Restriction:
Restricted for use by site license.