Fall 2017
Fall 2017 colloquia will take place Mondays, 3pm-4pm, in ENR2 S395.
Statistics GIDP Colloquium: Monday, September 11, 2017
Speaker: Yue (Selena) Niu, University of Arizona
Title: Reduced Ranked Linear Discriminant Analysis
Abstract: Many high dimensional classification techniques have been
developed recently. However, most works focus on only the binary
classification problem. Available classification tools for the multiclass
cases are either based on oversimplified covariance structure or
computationally complicated. In this talk, following the idea of reduced
ranked linear discriminant analysis, we introduce a new dimension
reduction tool with the flavor of supervised principal component analysis.
The proposed method is computationally efficient and can incorporate the
correlation structure among the features. We illustrate our methods by
simulated and real data examples.
Statistics GIDP Colloquium: Monday, October 2, 2017
Speaker: Yehua Li, Iowa State University
Title: Nested Hierarchical Functional Data Modeling and Inference for the Analysis of Functional Plant Phenotypes
Abstract: In a plant science Root Image Study, the process of seedling roots bending in response to gravity is recorded using digital cameras, and the bending rates are modeled as functional plant phenotype data. The functional phenotypes are collected from seeds representing a large variety of genotypes and have a three-level nested hierarchical structure, with seeds nested in groups nested in genotypes. The seeds are imaged on different days of the lunar cycle, and an important scientific question is whether there are lunar effects on root bending. We allow the mean function of the bending rate to depend on the lunar day and model the phenotypic variation between genotypes, groups of seeds imaged together, and individual seeds by hierarchical functional random effects. We estimate the covariance functions of the functional random effects by a fast penalized tensor product spline approach, perform multi-level functional principal component analysis (FPCA) using the best linear unbiased predictor of the principal component scores, and improve the efficiency of mean estimation by iterative decorrelation. We choose the number of principal components using a conditional Akaike Information Criterion and test the lunar day effect using generalized likelihood ratio test statistics based on the marginal and conditional likelihoods. We also propose a permutation procedure to evaluate the null distribution of the test statistics. Our simulation studies show that our model selection criterion selects the correct number of principal components with remarkably high frequency, and the likelihood-based tests based on FPCA have higher power than a test based on working independence.
Statistics GIDP Colloquium: Monday, November 6, 2017
Speaker: Wenxuan Zhong, Georgia State University
Title: MetaGen: Reference-Free Learning with Multiple Metagenomic Samples
Abstract: A major goal of metagenomics is to identify and study the entire collection of microbial species in a set of targeted samples. In this talk, I will present a novel statistical metagenomic algorithm that simultaneously identifies microbial species and estimates their abundances without using reference genomes. Compared to reference-free methods based primarily on k-mer distributions or coverage information, the proposed approach achieves a higher species binning accuracy and is particularly powerful when sequencing coverage is low. I will demonstrate the performance of this new method through both simulation and real metagenomic studies. The MetaGen software is available at https://github.com/BioAlgs/MetaGen.
Statistics GIDP Colloquium: Monday, December 4, 2017
Speaker: Edward Bedrick, University of Arizona
Spring 2017
Statistics GIDP Colloquium: Monday, May 1, 2017.
Speaker: Ming Hu, Lerner Research Institute, Cleveland Clinic; http://www.lerner.ccf.org/qhs/hum/
Title: Statistical Methods, Computational Tools and Visualization of Hi-C Data
Abstract: Harnessing the power of high-throughput chromatin conformation capture (3C) based technologies, we have recently generated a compendium of datasets to characterize chromatin organization across human cell lines and primary tissues. Knowledge revealed from these data facilitates deeper understanding of long-range chromatin interactions (i.e., peaks) and their functional implications on transcription regulation and genetic mechanisms underlying complex human diseases and traits. However, various layers of uncertainties and complex dependency structure complicate the analysis and interpretation of these data. We have proposed hidden Markov random field (HMRF) based statistical methods, which properly address the complicated dependency issue in Hi-C data, and further leverage such dependency by borrowing information from neighboring pairs of loci, for more powerful and more reproducible peak detection. Through extensive simulations and real data analysis, we demonstrate the power of our methods over existing peak callers. We have applied our methods to the compendium of Hi-C data from 21 human cell lines and tissues, and further developed an online visualization tool to facilitate identification of potential target gene(s) for the vast majority of non-coding variants identified from the recent waves of genome-wide association studies.
3:00 pm - 4:00 pm, Mathematics Building, room 501.
***
Statistics GIDP Colloquium: Friday, April 7, 2017.
Speaker: Sunder Sethuraman, Department of Mathematics, University of Arizona; http://math.arizona.edu/~sethuram/
Title: Consistency of modularity clustering and Kelvin's tiling problem
Abstract: Given a graph, the popular `modularity' clustering method specifies a partition of the vertex set as the solution of a certain optimization problem. In this talk, we will discuss consistency properties, or scaling limits, of this method with respect to random geometric graphs constructed from n i.i.d. points, V_n = \{X_1, X_2, \ldots, X_n\}, distributed according to a probability measure supported on a bounded domain in R^d. A main result is the following: Suppose the number of clusters, or partitioning sets of V_n, is bounded above, then we show that the discrete optimal modularity clusterings converge in a specific sense to a continuum partition of the underlying domain, characterized as the solution of a `soap bubble', or `Kelvin'-type shape optimization problem.
3:00 pm - 4:00 pm, Mathematics Building, room 501.
***
Statistics GIDP Colloquium: Friday, March 3, 2017.
Speaker: Ming-Hung (Jason) Kao, Associate Professor, School of Mathematical and Statistical Sciences, Arizona State University; https://math.la.asu.edu/~mhkao/
Title: Experimental Designs for Functional Brain Imaging with fMRI
Abstract. Functional magnetic resonance imaging (fMRI) experiments are widely conducted in many fields for studying functions of the brain. One of the important first steps of such experiments is to select a good experimental design to allow for a valid and precise statistical inference. However, the identification and construction of high-quality fMRI designs can be quite challenging. In this talk, we introduce some methods for constructing good fMRI designs, and discuss the statistical optimality of these designs.
3:00 pm - 4:00 pm, Mathematics Building, room 501.
***
Statistics GIDP Colloquium: Friday, February 3, 2017.
Speaker: Ge Yong, Management Information Systems, University of Arizona; https://mis.eller.arizona.edu/people/yongge
Title: Point-of-Interest Recommendations in Location-based Social Networks
Abstract. With the rapid development of Location-based Social Network (LBSN) services, a large number of Points-of-Interest (POIs) have become available, which consequently raises a great demand for building personalized POI recommender systems. A personalized POI recommender system can significantly assist users in finding their preferred POIs and help POI owners attract more customers. However, it is very challenging to develop a personalized POI recommender system because a user's check-in decision-making process is very complex and could be influenced by many factors such as social network, geographical position, and the dynamics of user preferences. In the first part of this talk, we propose to divide the whole recommendation space into two parts: social friend space and user interest space, and we develop models for each space for generating recommendations. In the second part of this talk, we introduce a new rank-based method for implicit feedback-based recommendation. To evaluate the proposed methods, we conduct extensive experiments with many state-of-the-art baseline methods and evaluation metrics on real-world data sets.
Bio. Dr. Yong Ge is an assistant professor at MIS Dept. of UoA. He received his Ph.D. in Information Technology from Rutgers, The State University of New Jersey in 2013, the M.S. degree in Signal and Information Processing from the University of Science and Technology of China (USTC) in 2008, and the B.E. degree in Information Engineering from Xi'an Jiao Tong University in 2005. He received the ICDM-2011 Best Research Paper Award, Excellence in Academic Research (one per school) at Rutgers Business School in 2013, and the Dissertation Fellowship at Rutgers University in 2012. He has published prolifically in refereed journals and conference proceedings, such as IEEE TKDE, ACM TOIS, ACM TKDD, ACM TIST, ACM SIGKDD, and IEEE ICDM. His work has been supported by UoA, NSF and NIH.
3:00 pm - 4:00 pm, Mathematics Building, room 501.
Fall 2016
Statistics GIDP Colloquium: Wednesday, September 7, 2016.
 Speaker: Matti Morzfeld, Department of Mathematics, University of Arizona; http://math.arizona.edu/~mmo/Home.html
 Title: U2 can UQ - Projects and Life in Uncertainty
 Abstract:
I will give an overview of mathematical and computational problems I face when I combine numerical models and data. I will first review basic tools such as Bayes' rule and importance sampling, then explain what difficulties arise when using these tools, and then present two specific applications. The first application uses low-dimensional models to describe and predict reversals of the geomagnetic dipole; the second uses adaptive importance sampling to solve a parameter estimation problem in combustion modeling, leveraging parallelism of DOE's supercomputers.
 3:00 pm - 4:00 pm, Mathematics Building, room 402.
 Statistics GIDP Colloquium: Wednesday, October 5, 2016.
 Speaker: Han Xiao, Dept of Statistics and Biostatistics, Rutgers University; http://stat.rutgers.edu/home/hxiao/
 Title: On the maximum cross correlations under high dimension
 Abstract: Multiple time series often exhibit cross lead-lag relationships among their component series. It is very challenging to identify this type of relationship when the number of series is large. We study the lead-lag relationship in the high dimensional context, using the maximum cross correlations and some other variants. Asymptotic distributions are obtained. We also use the moving blocks bootstrap to improve the finite sample performance.
 3:00 pm - 4:00 pm, Mathematics Building, room 501.
 This talk will be preceded by a graduate student lunch; contact Kristina Souders (ksouders@email.arizona.edu) for information.
 Statistics GIDP Colloquium: Wednesday, November 2, 2016.
 Speaker: Haiquan Li, Assistant Professor, Director for Translational Bioinformatics, Department of Medicine, University of Arizona; http://u.arizona.edu/~haiquan/
 Title: Scattered disease-linked variants and convergent functions: discovery from big data integration
 Abstract: Genome-wide association studies (GWAS) have identified thousands of disease-linked single nucleotide polymorphisms (SNPs) in the human genome. Most of them have a small effect size (OR<1.4) and are located independently across multiple chromosomes. It remains unclear how they collectively cause the diseases due to the issue of missing heritability. Classic tests of genetic interactions suffer from insufficient power. Here, we will present an integrative approach that leverages several omics datasets to obtain additional information beyond genotypes, thus reducing the number of hypotheses. We combine traditional semantic similarity for genes’ functions and very deep network permutations (100K times) to quantify the empirical significance of downstream function similarity of any pair of SNPs. This approach enabled us to discover a fundamental biological mechanism for complex diseases: SNPs associated with the same disease are more likely to associate with the same downstream genes or functionally similar genes than unrelated diseases (OR>12). We also found that 40-50% of prioritized SNP pairs have significant genetic interactions in three independent GWAS datasets. These results provide a new biological interpretation of genetic interactions and a “roadmap” of disease mechanisms emerging from GWAS SNPs, especially those outside coding regions.
 3:00 pm - 4:00 pm, Mathematics Building, room 501.
 Statistics GIDP Colloquium: Wednesday, December 7, 2016.
 Speaker: Timothy Hanson, Professor, Department of Statistics, University of South Carolina; http://people.stat.sc.edu/hansont/
 Title: A unified framework for fitting Bayesian semiparametric models to arbitrarily censored spatial survival data
 Abstract: A comprehensive, unified approach to modeling arbitrarily censored spatial survival data is presented for the three most commonly used semiparametric models: proportional hazards, proportional odds, and accelerated failure time. Unlike many other approaches, all manner of censored survival times are simultaneously accommodated including uncensored, interval censored, current-status, left and right censored, and mixtures of these. Left truncated data are also accommodated leading to models for time-dependent covariates. Both georeferenced (location observed exactly) and areally observed (location known up to a geographic unit such as a county) spatial locations are handled. Variable selection is also incorporated. Model fit is assessed with conditional Cox-Snell residuals, and model choice carried out via LPML and DIC. Baseline survival is modeled with a novel transformed Bernstein polynomial prior. All models are fit via new functions which call efficient compiled C++ in the R package spBayesSurv. The methodology is broadly illustrated with simulations and real data applications. An important finding is that proportional odds and accelerated failure time models often fit significantly better than the commonly used proportional hazards model.
 3:00 pm - 4:00 pm, Mathematics Building, room 501.
Spring 2016
 Statistics GIDP Colloquium: Wednesday, February 3, 2016.
 Speaker: Walt Piegorsch, PhD, University of Arizona, GIDP; http://math.arizona.edu/~piegorsch/
 Title: Model uncertainty in environmental risk assessment
 Abstract: Estimation of low-dose ‘benchmark’ points in environmental risk analysis is discussed. Focus is on the increasing recognition that model uncertainty and misspecification can drastically affect point estimators and confidence limits built from limited dose-response data, which in turn can lead to imprecise risk assessments with uncertain, even dangerous, policy implications. Some possible remedies are mentioned, including use of parametric (frequentist) model averaging over a suite of potential dose-response models, and nonparametric dose-response analysis via isotonic regression. An example on formaldehyde toxicity illustrates the calculations.
 12:00 pm - 1:00 pm, Physics and Atmospheric Sciences Building, room 314.
 Statistics GIDP Colloquium: Wednesday, March 2, 2016.
 Speaker: Zhaoxia Yu, PhD, University of California Irvine, Dept of Statistics; http://www.ics.uci.edu/~zhaoxia/
 Title: Strategies on Identifying Gene-Gene Interactions
 Abstract: Characterizing gene-gene interactions is of fundamental importance in unraveling the etiology of complex human diseases. However, due to the ultra-high-dimensional nature of the problem, the degree to which genes jointly affect disease risk is largely unknown. Two major obstacles toward this goal are the enormous computational effort and heavy burden of multiple testing in testing gene-gene interactions. In this talk I will discuss several strategies using three examples. In the first example, we derived closed-form and consistent estimates of interaction parameters for case-control data. The derived Wald tests gave results very similar to the gold standard but were ten times faster. In a study of multiple sclerosis, we identified interactions within the major histocompatibility complex region. In the second example, we used information that is independent of interaction testing to prioritize gene-gene pairs for the case-parents design. The application of this strategy provided suggestive evidence for interactions between two genomic regions: the major histocompatibility complex region on chr 6 and the killer-cell immunoglobulin-like receptor region on chr 19. In the last example, we borrowed information across distinct but similar diseases. We found that genes interacting in multiple sclerosis also interacted with each other in type 1 diabetes.
 12:00 pm - 1:00 pm, Physics and Atmospheric Sciences Building, room 314.
 Statistics GIDP Colloquium: Wednesday, April 6, 2016.
 Speaker: Jie Chen, PhD, Georgia Regents University, Dept of Biostatistics & Epidemiology; http://biostat.gru.edu/Faculty&Staff/JChen
 Title: Change point models in the Bayesian Perspective and their applications in CNV study
 Abstract: Biomedical researchers now use advanced technologies, such as the comparative genomic hybridization (CGH), the array-based comparative genomic hybridization (aCGH), and the high-throughput next generation sequencing (NGS), to conduct DNA copy number experiments for detecting DNA copy number variations (CNVs) as cancer development, genetic disorders, and many other diseases are usually relevant to CNVs on the genome. Identifying boundaries of CNV regions on a chromosome or a genome can be viewed as a change point problem of detecting signal/intensity changes presented in the genomic data. The analysis of high-throughput genomic data for possible changes has become one of the most recent viable applications of statistical change point analysis. In this talk, I present several change point models suitable to formulate different data types resulting from the aCGH and the NGS technologies and provide Bayesian solutions to these models. Applications of these methods to tumor cell line data will also be given.
 12:00 pm - 1:00 pm, Physics and Atmospheric Sciences Building, room 314.
 Statistics GIDP Colloquium: Monday, May 2, 2016.
 Speaker: Bikas Sinha, PhD, Retired Faculty, Indian Statistical Institute, Kolkata, India
 Title: Mixture Experiments: Theory and Applications
 Abstract: This is a review talk dealing briefly with mixture models, standard mixture designs and optimal mixture experiments. Some application areas will be highlighted.
 12:00 pm - 1:00 pm, Physics and Atmospheric Sciences Building, room 314.
Fall 2015
 Statistics GIDP Colloquium: Wednesday, September 2, 2015. Speaker: Neng Fan, Assistant Professor, Systems and Industrial Engineering Department, University of Arizona. Title: Learning from Data with Uncertainties via Data-Driven Optimization
 12:00 pm - 1:00 pm, Saguaro Hall 114.
 Abstract: In the last several decades, many advanced technologies have been developed to collect and store data continuously, and data and decisions are more strongly linked together than ever before. In most cases, the data includes a lot of uncertainties, such as missing or incomplete information, measurement errors, noise, etc. Traditional machine learning methods for decisions deal with the exact information of data. Only to some extent has the data uncertainty, modeled by some support sets, mean or moment values, been considered for robust decisions. In this talk, we discuss statistical models for data uncertainties and data-driven optimization approaches for decision-making under uncertainty, especially in the case of big data. Some robust and chance-constrained optimization models and algorithms for support vector machines will be introduced, and numerical experiments will be performed to validate the proposed approaches.
 Statistics GIDP Colloquium: Wednesday, September 30, 2015. Speaker: Clayton Morrison, Associate Professor, School of Information, University of Arizona.
 Title: Finding Structure in Time: Inferring Structured Latent Sequences and Activity Descriptions
 12:00 pm - 1:00 pm, Modern Languages Building 410.

Abstract: Humans excel at understanding complex dynamic histories, recognizing relevant context and using that context to interpret events that are sometimes hierarchically and recursively structured. Our research group has found the tools of Bayesian nonparametric modeling and inference well suited for approaching several aspects of these problems. In this talk I present ongoing work on two applications that require methods for inferring structurally rich representations of time series: identifying context relevant to interpreting biochemical reactions described in cancer biology research papers, and constructing descriptions of coordinated activities from observations in video.
 Statistics GIDP Colloquium: Wednesday, November 4, 2015. Speaker: Professor Avelino Arellano, Jr., Dept of Atmospheric Science, University of Arizona.
 Title: Towards Seamless Prediction of Chemical Weather
 12:00 pm - 1:00 pm, Modern Languages Building 410.
 Abstract
 Statistics GIDP Colloquium: Wednesday, December 2, 2015. Speaker: Professor Gen Li, Department of Biostatistics, Mailman School of Public Health, Columbia University.

Title: Supervised Principal Component Analysis and Extensions

Abstract: It is increasingly common to have heterogeneous data sets measured on the same set of samples. Integrative analysis of multi-source data promises to reveal a more comprehensive picture of the underlying truth than individual analysis. In this talk, I will introduce a novel integrative dimension reduction framework called the Supervised Principal Component Analysis (SupPCA). The research is motivated by applications where people are interested in the low rank structure of some primary data while auxiliary variables are also available on the same set of samples. The proposed method can make use of the extra information in the auxiliary data to accurately extract underlying structures that are more interpretable. The model is formulated in a hierarchical fashion using latent variables, and subsumes many existing models as special cases. The asymptotic properties of parameter estimation are derived. We also extend the framework to accommodate special features, such as high-dimensional data, functional data, and multi-modal data. Applications to bioinformatics and business analytics problems demonstrate the advantage of the proposed methodology.

12:00 pm - 1:00 pm, Modern Languages Building 410.