Spring 2020
Spring 2020 Colloquia will take place Mondays, 3:30pm in Math 501, unless otherwise noted.
 February 3, 2020  Speaker  Ian McKeague, Department of Biostatistics, Columbia University
Title: Functional data analysis for activity profiles from wearable devices
Abstract: This talk introduces a nonparametric framework for analyzing physiological sensor data collected from wearable devices. The idea is to apply the stochastic process notion of occupation times to construct activity profiles that can be treated as monotonically decreasing functional data.
Whereas raw sensor data typically need to be pre-aligned before standard functional data methods are applicable, activity profiles are automatically aligned because they are indexed by activity level rather than by follow-up time. We introduce a nonparametric likelihood ratio approach that makes efficient use of the activity profiles to provide a simultaneous confidence band for their mean (as a function of activity level), along with an ANOVA-type test. These procedures are calibrated using bootstrap resampling. Unlike many nonparametric functional data methods, smoothing techniques are not needed. Accelerometer data from subjects in a U.S. National Health and Nutrition Examination Survey (NHANES) are used to illustrate the approach. The talk is based on joint work with Hsin-wen Chang (Academia Sinica, Taipei).
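To make the occupation-time construction concrete, here is a minimal sketch (not the speaker's code; the simulated accelerometer counts and the grid of activity levels are invented for illustration):

```python
import numpy as np

def activity_profile(x, levels, dt=1.0):
    """Occupation-time profile: for each activity level a, the total
    time the recorded activity exceeds a. The result is monotonically
    non-increasing in a, so profiles from different subjects are
    automatically aligned on the activity-level axis."""
    x = np.asarray(x, dtype=float)
    return np.array([dt * np.sum(x > a) for a in levels])

# Hypothetical minute-level accelerometer counts for one day
rng = np.random.default_rng(0)
counts = rng.gamma(shape=2.0, scale=50.0, size=1440)
levels = np.linspace(0.0, 500.0, 51)
profile = activity_profile(counts, levels)
```

Because every subject's profile is a function of the same activity-level grid, no curve registration step is needed before applying functional data methods.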
 March 2, 2020  Speaker  Yichuan Zhao, Professor, Department of Mathematics and Statistics, Georgia State University
Title: Penalized Empirical Likelihood for the Sparse Cox Model
Abstract: The current penalized regression methods for selecting predictor variables and estimating the associated regression coefficients in the Cox model are mainly based on partial likelihood. In this paper, an empirical likelihood method is proposed for the Cox model, in conjunction with appropriate penalty functions, when the dimensionality of the data is high. Large-sample theoretical properties of the resulting estimator are established. Simulation studies suggest that empirical likelihood works better than partial likelihood in terms of selecting correct predictors without introducing additional model error. The well-known primary biliary cirrhosis data set is used to illustrate the proposed empirical likelihood method.
This is joint work with Dongliang Wang and Tong Tong Wu.
Fall 2019
Fall 2019 colloquia will take place Mondays, 3:30 to 4:30pm in MATH 501, unless otherwise noted.
 September 9, 2019  Speaker  Hui Zou, Department of Statistics, University of Minnesota
Title: A nearly condition-free fast algorithm for Gaussian graphical model recovery
Abstract: Many methods have been proposed for estimating Gaussian graphical models. The most popular are the graphical lasso and neighborhood selection, because both are computationally very efficient and have some theoretical guarantees. However, their theory for graph recovery requires a very stringent structural assumption (a.k.a. the irrepresentable condition). We argue that replacing the lasso penalty in these two methods with a nonconvex penalty does not fundamentally remove this theoretical limitation, because another structural condition is required. As an alternative, we propose a new algorithm for graph recovery that is very fast, easy to implement, and enjoys strong theoretical properties under basic sparsity assumptions.
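For readers unfamiliar with the baselines, neighborhood selection (Meinshausen and Buhlmann) is easy to sketch: lasso-regress each variable on the rest and join the estimated neighborhoods. The NumPy-only toy below is a stand-in, not the speaker's new algorithm; the penalty level and the simulated chain graph are invented.

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    """Plain coordinate-descent lasso (columns of X assumed standardized)."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ beta + X[:, j] * beta[j]       # partial residual
            rho = X[:, j] @ r / n
            z = X[:, j] @ X[:, j] / n
            beta[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / z
    return beta

def neighborhood_selection(X, lam=0.2):
    """Lasso-regress each variable on the others; an edge (j, k) is kept
    when either regression selects the other variable (the 'OR' rule)."""
    n, p = X.shape
    adj = np.zeros((p, p), dtype=bool)
    for j in range(p):
        others = [k for k in range(p) if k != j]
        beta = lasso_cd(X[:, others], X[:, j], lam)
        adj[j, others] = np.abs(beta) > 1e-8
    return adj | adj.T

# Toy chain graph 0 - 1 - 2: variable 0 depends on 1, and 1 on 2
rng = np.random.default_rng(1)
z = rng.normal(size=(2000, 3))
X = np.empty_like(z)
X[:, 2] = z[:, 2]
X[:, 1] = 0.8 * X[:, 2] + z[:, 1]
X[:, 0] = 0.8 * X[:, 1] + z[:, 0]
X = (X - X.mean(0)) / X.std(0)
A = neighborhood_selection(X, lam=0.2)
```

On this easy example the chain structure is recovered; the abstract's point is that guarantees for such recovery in general require the stringent irrepresentable-type conditions.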
 October 14, 2019  Speaker  Jian Liu, Associate Professor, Systems and Industrial Engineering, University of Arizona
Title: Functional Data Analytics for Detecting Bursts in Water Distribution Systems
Abstract: Bursts in water distribution systems (WDSs) are a special type of short-term, high-flow water loss that can be a significant component of a system's water balance. Since WDSs are usually deployed underground, bursts are difficult to detect before their catastrophic results are observed on the ground surface. Continuous hydraulic data streams collected from automatic meter reading and advanced metering infrastructure systems make it possible to detect bursts in WDSs through data analytics. Existing methods based on conventional statistical process control charts may not be effective, as the temporal correlations embedded in the data streams are not explicitly considered. In this seminar, new control charts for burst detection based on functional data analysis will be presented. Both Phase-I and Phase-II monitoring schemes are investigated. The temporal correlations are modeled from empirical data streams continuously collected from the same WDS. Their statistical properties are studied to reflect the system's inherent uncertainties induced by customers' daily use without bursts. Bursts are detected by comparing a new hydraulic data stream to these inherent uncertainties through statistical control charting. The new method significantly reduces the rates of false alarms and missed detections. The effectiveness of the proposed method is demonstrated with a case study based on numerical simulation of a real-world WDS.
 November 12, 2019  Speaker  Dennis Lin, Department of Statistics, Penn State University **NOTE** Tuesday at 1:00pm in Math 501
Title: Interval Data: Modeling and Visualization
Abstract: Interval-valued data are a special kind of symbolic data composed of the lower and upper bounds of intervals. They can be generated from climate change, fluctuations of stock prices, daily blood pressures, aggregation of large datasets, and many other situations. Such data contain rich information useful for decision making. The prediction of interval-valued data is a challenging task, as the predicted lower bounds of intervals should not cross over the corresponding upper bounds. In this project, a regularized artificial neural network (RANN) is proposed to address this difficult problem. It provides a flexible trade-off between prediction accuracy and interval crossing. An empirical study indicates the usefulness and accuracy of the proposed method. The second portion of this project provides some new insights for the visualization of interval data. Two plots are proposed: the segment plot and the dandelion plot. The new approach complements existing visualization methods and provides much more information. Theorems have been established for reading these new plots. Examples are given for illustration.
 December 2, 2019  Speaker  Wenguang Sun, Associate Professor, Department of Data Sciences and Operations, USC
Title: Large-Scale Estimation and Testing Under Heteroscedasticity
Abstract: The simultaneous inference of many parameters, based on a corresponding set of observations, is a key research problem that has received much attention in the high-dimensional setting. Many practical situations involve heterogeneous data, where the most common setting involves unknown effect sizes observed with heteroscedastic errors. Effectively pooling information across samples while correctly accounting for heterogeneity presents a significant challenge in large-scale inference. The first part of my talk addresses the selection bias issue in large-scale estimation problems by introducing the "Nonparametric Empirical Bayes Smoothing Tweedie" (NEST) estimator, which efficiently estimates the unknown effect sizes and properly adjusts for heterogeneity via a generalized version of Tweedie's formula. The second part of my talk focuses on a parallel issue in multiple testing. We show that there can be a significant loss of information from basing hypothesis tests on standardized statistics rather than the full data. We develop a new class of heteroscedasticity-adjusted ranking and thresholding (HART) rules that aim to improve existing methods by simultaneously exploiting commonalities and adjusting for heterogeneities among the study units. The common message in both NEST and HART is that the variance structure, which is subsumed under standardized statistics, is highly informative and can be exploited to achieve higher power in both shrinkage estimation and multiple testing problems.
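Tweedie's formula, which NEST generalizes, can be sketched in a few lines: plug a kernel density estimate of the marginal density f into E[mu | x] = x + sigma^2 (log f)'(x). The check below against the exact linear-shrinkage Bayes rule uses the Gaussian conjugate case with invented parameter values; it is an illustration, not the NEST estimator itself.

```python
import numpy as np

def tweedie(x, sigma):
    """Empirical-Bayes Tweedie estimate E[mu | x] = x + sigma^2 * f'(x)/f(x),
    with the marginal density f estimated by a Gaussian kernel."""
    x = np.asarray(x, dtype=float)
    n = x.size
    h = 1.06 * x.std() * n ** (-1 / 5)          # Silverman's rule bandwidth
    diff = x[:, None] - x[None, :]
    K = np.exp(-0.5 * (diff / h) ** 2)
    f = K.sum(axis=1)                            # unnormalized f(x_i)
    fprime = (-diff / h ** 2 * K).sum(axis=1)    # unnormalized f'(x_i)
    return x + sigma ** 2 * fprime / f

# Gaussian conjugate check: mu ~ N(0, tau^2), x | mu ~ N(mu, sigma^2),
# so the exact Bayes rule is linear shrinkage x * tau^2 / (tau^2 + sigma^2).
rng = np.random.default_rng(2)
tau, sigma = 2.0, 1.0
mu = rng.normal(0.0, tau, size=2000)
obs = mu + rng.normal(0.0, sigma, size=2000)
est = tweedie(obs, sigma)
oracle = obs * tau ** 2 / (tau ** 2 + sigma ** 2)
```

The nonparametric estimate tracks the oracle closely here; the appeal of Tweedie-type formulas is that they need only the marginal density of the observations, not the prior.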
 December 9, 2019  Speaker  Vladimir Minin, Department of Statistics, UC Irvine
Title: Statistical challenges and opportunities in stochastic epidemic modeling
Abstract: Stochastic epidemic models describe how infectious diseases spread through populations. These models are constructed by first assigning individuals to compartments (e.g., susceptible, infectious, and removed) and then defining a stochastic process that governs the evolution of the sizes of these compartments through time. Stochastic epidemic models and their deterministic counterparts are useful for evaluating strategies for controlling infectious disease spread and for predicting the future course of a given epidemic. However, fitting these models to data turns out to be a challenging task, because even the most vigilant infectious disease surveillance programs offer only noisy snapshots of the number of infected individuals in the population. Such indirect observations of infectious disease spread result in high dimensional missing data (e.g., the numbers and times of infections) that need to be accounted for during statistical inference. I will demonstrate that combining stochastic process approximation techniques with high dimensional Markov chain Monte Carlo algorithms makes Bayesian data augmentation for stochastic epidemic models computationally tractable. I will present examples of fitting stochastic epidemic models to incidence time series data collected during outbreaks of influenza and Ebola.
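A minimal Gillespie-style simulation of the stochastic SIR model described above (the parameter values are arbitrary, and this forward simulator is only the easy half of the problem; the talk's subject is the much harder inverse problem of inference from noisy incidence data):

```python
import numpy as np

def gillespie_sir(beta, gamma, S0, I0, t_max, rng):
    """Exact stochastic simulation of an SIR epidemic:
       infection  S + I -> 2I  at rate beta * S * I / N
       removal        I -> R   at rate gamma * I"""
    N = S0 + I0
    t, S, I = 0.0, S0, I0
    path = [(t, S, I)]
    while I > 0 and t < t_max:
        rate_inf = beta * S * I / N
        rate_rem = gamma * I
        total = rate_inf + rate_rem
        t += rng.exponential(1.0 / total)        # time to next event
        if rng.random() < rate_inf / total:      # which event occurred?
            S -= 1; I += 1
        else:
            I -= 1
        path.append((t, S, I))
    return np.array(path)

rng = np.random.default_rng(3)
path = gillespie_sir(beta=0.5, gamma=0.25, S0=990, I0=10, t_max=200.0, rng=rng)
```

In practice one observes only noisy, aggregated counts from such a path, which is why the data augmentation machinery in the abstract is needed.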
Spring 2019
Spring 2019 colloquia will take place Mondays, 3:20 - 4:30pm, in ENR2 S395.

April 1, 2019  Speaker  Yunpeng Zhao, School of Mathematical and Natural Sciences, ASU
Title: Network Structure Inference from Grouped Data
Abstract: Statistical network analysis typically deals with inference concerning various parameters of an observed network. In several applications, especially those from the social sciences, behavioral information concerning groups of subjects is observed. In such data sets, even though a network structure is present, it is not typically observed. These are referred to as implicit networks. In this presentation, we describe a model-based framework to uncover the implicit network structure and address related inferential questions. We also describe extensions of the methodology to time series of grouped observations.

April 15, 2019  Speaker  Jianqiang Cheng, Assistant Professor, Systems and Industrial Engineering, University of Arizona
Title: Computationally Efficient Approximations for Distributionally Robust Optimization
Abstract: Distributionally robust optimization (DRO) has gained increasing popularity because it offers a way to overcome the conservativeness of robust optimization without requiring the specificity of stochastic optimization. On the computational side, many practical DRO instances can be equivalently (or approximately) formulated as semidefinite programming (SDP) problems via conic duality of the moment problem. However, despite being theoretically solvable in polynomial time, SDP problems in practice are computationally challenging and quickly become intractable with increasing problem size. In this talk, we adopt the principal component analysis (PCA) approach to solve DRO problems with different types of ambiguity sets. We show that the PCA approximation yields a relaxation of the original problem and derive theoretical bounds on the gap between the original problem and its PCA approximation. Furthermore, an extensive numerical study shows the strength of the proposed approximation method in terms of solution quality and runtime.

April 29, 2019  Speaker  Yves Lussier, Department of Medicine, University of Arizona

March 18, 2019  Speaker  Anthony Howell, Associate Professor, School of Economics, Peking University
Professor Howell is a candidate for a faculty position (Assistant Professor in Spatial Statistics and Quantitative Methods).
Fall 2018
Fall 2018 colloquia will take place Mondays, 3:30 - 4:30pm, in ENR2 S395.

December 3, 2018  Speaker  Hidehiko Ichimura, Professor of Economics, University of Arizona
Title: Locally Robust Semiparametric Estimation
Abstract: We give a general construction of debiased/locally robust/orthogonal (LR) moment functions for GMM, where the derivative with respect to first step nonparametric estimation is zero and, equivalently, first step estimation has no effect on the influence function. This construction consists of adding an estimator of the influence function adjustment term for first step nonparametric estimation to the identifying or original moment conditions. We also give numerical methods for estimating LR moment functions that do not require an explicit formula for the adjustment term.
LR moment conditions have reduced bias and so are important when the first step is machine learning. We derive LR moment conditions for dynamic discrete choice based on first step machine learning estimators of conditional choice probabilities.
We provide simple and general asymptotic theory for LR estimators based on sample splitting. This theory uses the additive decomposition of LR moment conditions into an identifying condition and a first step influence adjustment. Our conditions require only mean square consistency and a few (generally either one or two) readily interpretable rate conditions.
LR moment functions have the advantage of being less sensitive to first step estimation. Some LR moment functions are also doubly robust meaning they hold if one first step is incorrect. We give novel classes of doubly robust moment functions and characterize double robustness. For doubly robust estimators our asymptotic theory only requires one rate condition.

November 5, 2018  Speaker  Huiyan Sang, Associate Professor, Department of Statistics, Texas A&M University
Title: Spatial Homogeneity Pursuit of Regression Coefficients for Large Datasets
Abstract: Spatial regression models have been widely used to describe the relationship between a response variable and explanatory variables over a region of interest, under the assumption that the responses are spatially correlated. Nearly all existing work assumes the regression coefficients to be constant or smoothly varying over the region. In this article, we propose a spatially clustered coefficient regression model to capture spatially varying patterns, especially clustering patterns, in the effects of explanatory variables. In many applications, especially those with large spatial datasets, it is of great interest to practitioners to identify such clusters, which allow straightforward interpretations of local associations among variables. The method incorporates spatial neighboring information through a carefully constructed regularization to automatically detect change points in space and to achieve computational scalability. Numerical studies show that it works very effectively in capturing not only clustered coefficients but also smoothly varying coefficients, because of its strong local adaptivity. This flexibility allows researchers to explore various spatial structures in regression coefficients. Theoretical properties of the new estimator are also established. The method is applied to explore the relationship between temperature and salinity of sea water in the Atlantic basin, which provides insightful information on the evolution of individual water masses and on the pathway and strength of the meridional overturning circulation in oceanography.
Bio: Dr. Huiyan Sang received her Ph.D in Statistics from Duke University in 2008, and B.S. in Mathematics and Applied Mathematics from Peking University, China, in 2004. She joined the faculty at Texas A&M University in 2008, where she is currently an Associate Professor in the Department of Statistics. Her research focuses on spatial statistics, extreme values and computational methods for large datasets.

October 1, 2018  Speaker  Ning Hao, Assistant Professor, Department of Mathematics, University of Arizona
Title: A super scalable algorithm for short segment detection
Abstract: In many applications, such as copy number variant detection, the goal is to identify short segments on which the observations have different means or medians from the background. Those segments are usually short and hidden in a long sequence, and hence are very challenging to find. In this talk, we will introduce a super scalable short segment detection algorithm. This nonparametric method clusters the locations where the observations exceed a threshold for segment detection. It is computationally efficient and does not rely on a Gaussian noise assumption. Moreover, we propose a framework to assign significance levels to detected segments. We demonstrate the advantages of the proposed method through theoretical, simulation, and real data studies. This talk is based on joint work with Yue Niu, Feifei Xiao and Heping Zhang.
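The clustering-of-exceedances idea can be illustrated with a simplified stand-in (the threshold, gap, and minimum-length parameters below are invented, and no significance levels are assigned; the speaker's algorithm and its theory are considerably more refined):

```python
import numpy as np

def detect_segments(y, threshold, gap=2, min_len=3):
    """Cluster positions where |y| exceeds a threshold: exceedances within
    `gap` of each other are merged into one candidate segment, and clusters
    spanning fewer than `min_len` positions are discarded."""
    idx = np.flatnonzero(np.abs(y) > threshold)
    if idx.size == 0:
        return []
    segments, start, prev = [], idx[0], idx[0]
    for i in idx[1:]:
        if i - prev > gap:                     # too far: close the cluster
            if prev - start + 1 >= min_len:
                segments.append((start, prev))
            start = i
        prev = i
    if prev - start + 1 >= min_len:
        segments.append((start, prev))
    return segments

# Toy sequence: background noise with one short elevated segment
rng = np.random.default_rng(4)
y = rng.normal(0.0, 1.0, 1000)
y[400:410] += 5.0
found = detect_segments(y, threshold=3.0)
```

Because the method only thresholds and clusters, it runs in a single pass over the exceedance locations, which is what makes this family of procedures so scalable.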
Bio: Ning Hao is an assistant professor in the Department of Mathematics and the Statistics GIDP. He received his B.S. in Mathematics from Peking University and his Ph.D. from the Department of Mathematics at Stony Brook University. He spent one year as a research associate at the statistics lab at Princeton University before joining the University of Arizona. His research interests include high dimensional statistical learning, change-point detection and bioinformatics.

September 10, 2018  Speaker  Xiaoxiao Sun, Assistant Professor in the Department of Epidemiology and Biostatistics, University of Arizona.
Title: Optimal Penalized Function-on-Function Regression
Abstract: Many scientific studies collect data where the response and predictor variables are both functions of time, location, or some other covariate. Understanding the relationship between these functional variables is a common goal in these studies. Motivated by two real-life examples, we present a function-on-function regression model that can be used to analyze this kind of functional data. Our estimator of the 2D coefficient function is the optimizer of a form of penalized least squares in which the penalty enforces a certain level of smoothness on the estimator. Our first result is a representer theorem, which states that the exact optimizer of the penalized least squares resides in a data-adaptive finite-dimensional subspace, even though the optimization problem is defined on a function space of infinite dimensions. This theorem then allows easy incorporation of Gaussian quadrature into the optimization of the penalized least squares, which can be carried out through standard numerical procedures. We also show that our estimator achieves the minimax convergence rate in mean prediction under the function-on-function regression framework. Extensive simulation studies demonstrate the numerical advantages of our method over existing ones, and a sparse functional data extension is also introduced. The proposed method is then applied to our motivating examples: the benchmark Canadian weather data and a histone regulation study.
BIO: Xiaoxiao Sun, Ph.D., is an Assistant Professor in the Department of Epidemiology and Biostatistics at the Mel and Enid Zuckerman College of Public Health. His research focuses on developing theoretically justifiable and computationally efficient methods for complex and big data arising in data-rich areas, such as genomics, social media, and neuroscience. In his previous research, he developed several statistical methods for time course omics data. He is currently focusing on building novel computational frameworks to analyze super-large datasets efficiently. He earned his Ph.D. in Statistics from the University of Georgia in 2018. Prior to his Ph.D., he obtained B.S. and M.S. degrees in Statistics from the Central University of Finance and Economics in Beijing.
Spring 2018
Spring 2018 colloquia will take place Mondays, 3pm - 4pm, in ENR2 S395.
The dates for the 2018 colloquia are:
 February 5  Yuhong Yang, Faculty Member in the School of Statistics, University of Minnesota
 March 12  Jacob Bien, Assistant Professor of Data Sciences and Operations, USC
 April 2  Xiaoming Huo, Professor at the Stewart School of Industrial & Systems Engineering at Georgia Institute of Technology
 April 30  Qiang Zhou, Assistant Professor, Systems & Industrial Engineering, University of Arizona
Statistics GIDP Colloquium: Monday, April 30, 2018
Speaker: Qiang Zhou, University of Arizona
Dr. Qiang Zhou is an Assistant Professor in the Department of Systems and Industrial Engineering, University of Arizona, and a faculty member of the UA Statistics GIDP. Before moving to UA, he was an Assistant Professor in the Department of Systems Engineering and Engineering Management, City University of Hong Kong, from 2012 to 2016. Dr. Zhou received his B.S. degree in Automotive Engineering and M.S. degree in Mechanical Engineering from Tsinghua University, Beijing, China, in 2005 and 2007, and his M.S. degree in Statistics and Ph.D. degree in Industrial Engineering from the University of Wisconsin-Madison, in 2010 and 2011. His research focuses on advanced industrial data analytics for engineering decision making and system performance improvement. Application areas of his research include energy storage systems, semiconductor manufacturing, nanomaterial fabrication, and telecommunications.
Title: Pairwise Metamodeling of Multivariate Output Computer Models
Abstract: The Gaussian process (GP) is a popular method for emulating deterministic computer simulation models. Its natural extension to computer models with multivariate outputs employs a multivariate Gaussian process (MGP) framework. Nevertheless, with significant increases in the number of design points and the number of model parameters, building an MGP model is a very challenging task. Under a general MGP model framework with non-separable covariance functions, we propose an efficient metamodeling approach featuring a pairwise model building scheme. The proposed method has excellent scalability even for a large number of output levels. Some properties of the proposed method have been investigated, and its performance has been demonstrated through several numerical examples. Experimental designs for building such metamodels will also be briefly discussed.
Statistics GIDP Colloquium: Monday, April 2, 2018
Speaker: Xiaoming Huo, Georgia Tech
Xiaoming Huo is a professor at the Stewart School of Industrial & Systems Engineering at Georgia Tech. Huo's research interests include statistics, computing, and data science. He has made numerous contributions on topics such as sparse representation, wavelets, and statistical problems in detectability. He has been a senior member of IEEE since May 2004. He won the Georgia Tech Sigma Xi Young Faculty Award in 2005. His work led to an interview by Emerging Research Fronts in June 2006 in the field of Mathematics, in which one paper is selected every two months. Huo is a fellow of the ASA and an Associate Editor for Technometrics. He is the executive director of TRIAD: Transdisciplinary Research Institute for Advancing Data Science (http://triad.gatech.edu), and an Associate Director of the Master of Science in Analytics (https://analytics.gatech.edu/).
Title: Statistical and Numerical Efficiency
Abstract: I will describe two recent projects; both integrate statistics and computing. In the first project, we study how to generate a statistical inference procedure that is both computationally efficient and has theoretical guarantees on its statistical performance. Tests of independence play a fundamental role in many statistical techniques. Among the nonparametric approaches, distance-based methods (such as distance correlation based hypothesis testing for independence) have numerous advantages compared with many other alternatives. A known limitation of distance-based methods is that their computational complexity can be high. In general, when the sample size is n, the computational complexity of a distance-based method, which typically requires computing all pairwise distances, is O(n^2). Recent advances have discovered that in the univariate case, a fast method with O(n log n) computational complexity and O(n) memory requirement exists. In this talk, I introduce a test of independence based on random projection and distance correlation, which achieves nearly the same power as the state-of-the-art distance-based approach, works in the multivariate case, and enjoys O(n K log n) computational complexity and O(max{n, K}) memory requirement, where K is the number of random projections. Note that savings are achieved when K < n / log n. We name our method Randomly Projected Distance Covariance (RPDC). The statistical theoretical analysis takes advantage of techniques on random projection rooted in contemporary machine learning. Numerical experiments demonstrate the efficiency of the proposed method relative to several competitors.
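As a rough illustration of the RPDC idea, the sketch below projects multivariate samples onto random directions and averages univariate distance covariances. It uses the naive O(n^2) statistic rather than the fast O(n log n) algorithm that is the talk's contribution, and the sample sizes and K are invented.

```python
import numpy as np

def dcov_stat(x, y):
    """Naive O(n^2) sample distance covariance of two univariate samples
    (double-centered distance matrices, V-statistic version)."""
    a = np.abs(x[:, None] - x[None, :])
    b = np.abs(y[:, None] - y[None, :])
    A = a - a.mean(axis=0) - a.mean(axis=1)[:, None] + a.mean()
    B = b - b.mean(axis=0) - b.mean(axis=1)[:, None] + b.mean()
    return (A * B).mean()

def rpdc(X, Y, K=20, rng=None):
    """Randomly Projected Distance Covariance: average the univariate
    distance covariances of X and Y projected onto K random directions."""
    if rng is None:
        rng = np.random.default_rng()
    stats = []
    for _ in range(K):
        u = rng.normal(size=X.shape[1]); u /= np.linalg.norm(u)
        v = rng.normal(size=Y.shape[1]); v /= np.linalg.norm(v)
        stats.append(dcov_stat(X @ u, Y @ v))
    return float(np.mean(stats))

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 3))
Y_dep = X + 0.5 * rng.normal(size=(200, 3))   # dependent on X
Y_ind = rng.normal(size=(200, 3))             # independent of X
s_dep = rpdc(X, Y_dep, K=10, rng=np.random.default_rng(6))
s_ind = rpdc(X, Y_ind, K=10, rng=np.random.default_rng(6))
```

In a real test the statistic would be calibrated (e.g., by permutation), and the univariate dcov would be computed with the fast algorithm to obtain the O(n K log n) complexity quoted above.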
In the second project, we study how nonconvex penalization can be unified under the framework of difference-of-convex (DC) programming, a subfield of optimization. We then argue that many existing statistical procedures can be treated as special cases of DC. Theory on both numerical efficiency and statistical optimality can be derived.
Statistics GIDP Colloquium: Monday, March 12, 2018
Speaker: Jacob Bien, USC
Title: High-Dimensional Variable Selection When Features are Sparse
Abstract: It is common in modern prediction problems for many predictor variables to be counts of rarely occurring events. This leads to design matrices in which a large number of columns are highly sparse. The challenge posed by such “rare features” has received little attention despite its prevalence in diverse areas, ranging from biology (e.g., rare species) to natural language processing (e.g., rare words). We show, both theoretically and empirically, that not explicitly accounting for the rareness of features can greatly reduce the effectiveness of an analysis. We next propose a framework for aggregating rare features into denser features in a flexible manner that creates better predictors of the response. An application to online hotel reviews demonstrates the gain in accuracy achievable by proper treatment of rare words. This is joint work with Xiaohan Yan.
Statistics GIDP Colloquium: Monday, February 5, 2018
Speaker: Yuhong Yang, University of Minnesota
Title: Treatment Allocations Based on MultiArmed Bandit Strategies
Abstract: In the practice of medicine, multiple treatments are often available to treat individual patients. The task of identifying the best treatment for a specific patient is very challenging due to patient inhomogeneity. Multi-armed bandits with covariates provide a framework for designing effective treatment allocation rules in a way that integrates learning from experimentation with maximizing the benefits to the patients along the process.
In this talk, we present new strategies to achieve asymptotically efficient or minimax optimal treatment allocations. Since many nonparametric and parametric methods in supervised learning may be applied to estimating the mean treatment outcome functions (in terms of the covariates) but guidance on how to choose among them is generally unavailable, we propose a model combining allocation strategy for adaptive performance and show its strong consistency. When the mean treatment outcome functions are smooth, rates of convergence can be studied to quantify the effectiveness of a treatment allocation rule in terms of the overall benefits the patients have received. A multistage randomized allocation with arm elimination algorithm is proposed to combine the flexibility in treatment outcome function modeling and a theoretical guarantee of the overall treatment benefits. Numerical results are given to demonstrate the performance of the new strategies.
The talk is based on joint work with Wei Qian.
Fall 2017
Fall 2017 colloquia will take place Mondays, 3pm - 4pm, in ENR2 S395.
Statistics GIDP Colloquium: Monday, September 11, 2017
Speaker: Yue (Selena) Niu, University of Arizona
Title: Reduced-Rank Linear Discriminant Analysis
Abstract: Many high dimensional classification techniques have been developed recently. However, most work focuses only on the binary classification problem. Available classification tools for multiclass cases are either based on oversimplified covariance structures or computationally complicated. In this talk, following the idea of reduced-rank linear discriminant analysis, we introduce a new dimension reduction tool with the flavor of supervised principal component analysis. The proposed method is computationally efficient and can incorporate the correlation structure among the features. We illustrate our method with simulated and real data examples.
Statistics GIDP Colloquium: Monday, October 2, 2017
Speaker: Yehua Li, Iowa State University
Title: Nested Hierarchical Functional Data Modeling and Inference for the Analysis of Functional Plant Phenotypes
Abstract: In a plant science Root Image Study, the process of seedling roots bending in response to gravity is recorded using digital cameras, and the bending rates are modeled as functional plant phenotype data. The functional phenotypes are collected from seeds representing a large variety of genotypes and have a threelevel nested hierarchical structure, with seeds nested in groups nested in genotypes. The seeds are imaged on different days of the lunar cycle, and an important scientific question is whether there are lunar effects on root bending. We allow the mean function of the bending rate to depend on the lunar day and model the phenotypic variation between genotypes, groups of seeds imaged together, and individual seeds by hierarchical functional random effects. We estimate the covariance functions of the functional random effects by a fast penalized tensor product spline approach, perform multilevel functional principal component analysis (FPCA) using the best linear unbiased predictor of the principal component scores, and improve the efficiency of mean estimation by iterative decorrelation. We choose the number of principal components using a conditional Akaike Information Criterion and test the lunar day effect using generalized likelihood ratio test statistics based on the marginal and conditional likelihoods. We also propose a permutation procedure to evaluate the null distribution of the test statistics. Our simulation studies show that our model selection criterion selects the correct number of principal components with remarkably high frequency, and the likelihoodbased tests based on FPCA have higher power than a test based on working independence.
Statistics GIDP Colloquium: Monday, November 6, 2017 CANCELLED
Statistics GIDP Colloquium: Monday, November 20, 2017
Speaker: Xiaotong Shen, University of Minnesota
Title: Personalized Prediction and Recommender Systems
Abstract: Personalized prediction predicts a user's preference for a large number of items through user-specific as well as content-specific information, based on a very small amount of observed preference scores. In a sense, predictive accuracy depends on how to pool information from similar users and items. Two major approaches are collaborative filtering and content-based filtering. Whereas the former utilizes information on users that think alike for a specific item, the latter acts on characteristics of the items that a user prefers; recommender systems such as Grooveshark and Pandora are built on these two approaches, respectively. In this talk, I will review some recent advances in latent factor modeling and discuss various issues as well as scalable strategies based on a "divide-and-conquer" algorithm.
Statistics GIDP Colloquium: Monday, December 4, 2017
Speaker: Edward Bedrick, University of Arizona
Title: Data Reduction Prior to Inference: Is it Sensible to Use Principal Component Scores to Make Group Comparisons in a Student's t-test or ANOVA?
Abstract: There has been significant recent development of statistical methods for inference with high-dimensional data. Despite these developments, which include research by faculty at the UofA, biomedical researchers and computational scientists often use a simple two-step process to analyze high-dimensional data. First, the dimensionality is reduced using a standard technique such as principal component analysis; then a group comparison is made using a t-test or analysis of variance. In this talk, I will try to untangle a number of issues associated with this approach, starting with the simplest but most vexing question (since it is usually left unstated): what hypothesis is being tested? I will use a combination of approaches, including asymptotics, analytical construction of worst-case scenarios, and simulation based on actual data, to address whether this approach is sensible. Although the asymptotics will consider a non-sparse setting, some discussion of implications for sparse problems will be given.
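The two-step procedure under discussion can be sketched in a few lines (simulated data, not the speaker's analysis). Note that the resulting test concerns the group means of the projected PC1 scores, not the original mean vectors, which is exactly the interpretational subtlety the abstract raises:

```python
import numpy as np

rng = np.random.default_rng(2)

# Two groups of high-dimensional observations differing in mean along one axis.
n, p = 50, 100
shift = np.zeros(p)
shift[0] = 5.0
x1 = rng.normal(size=(n, p))
x2 = rng.normal(size=(n, p)) + shift

# Step 1: pool the groups and project onto the first principal component.
X = np.vstack([x1, x2])
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt[0]            # PC1 scores for all 2n observations

# Step 2: two-sample t statistic on the PC1 scores.
s1, s2 = scores[:n], scores[n:]
t_stat = (s1.mean() - s2.mean()) / np.sqrt(s1.var(ddof=1) / n
                                           + s2.var(ddof=1) / n)
```

Here the mean shift is large, so PC1 aligns with it and the test is powerful; when the shift is small relative to the noise eigenvalues, PC1 can be dominated by directions unrelated to the group difference, and the two-step test can silently test the wrong thing.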
Spring 2017
Statistics GIDP Colloquium: Monday, May 1, 2017.
Speaker: Ming Hu, Lerner Research Institute, Cleveland Clinic; http://www.lerner.ccf.org/qhs/hum/
Title: Statistical Methods, Computational Tools and Visualization of Hi-C Data
Abstract: Harnessing the power of high-throughput chromatin conformation capture (3C) based technologies, we have recently generated a compendium of datasets to characterize chromatin organization across human cell lines and primary tissues. Knowledge revealed from these data facilitates deeper understanding of long-range chromatin interactions (i.e., peaks) and their functional implications for transcription regulation and the genetic mechanisms underlying complex human diseases and traits. However, various layers of uncertainty and a complex dependency structure complicate the analysis and interpretation of these data. We have proposed hidden Markov random field (HMRF) based statistical methods, which properly address the complicated dependency in Hi-C data, and further leverage such dependency by borrowing information from neighboring pairs of loci, for more powerful and more reproducible peak detection. Through extensive simulations and real data analysis, we demonstrate the power of our methods over existing peak callers. We have applied our methods to the compendium of Hi-C data from 21 human cell lines and tissues, and have further developed an online visualization tool to facilitate identification of potential target gene(s) for the vast majority of non-coding variants identified by the recent waves of genome-wide association studies.
3:00 pm - 4:00 pm, Mathematics Building, room 501.
***
Statistics GIDP Colloquium: Friday, April 7, 2017.
Speaker: Sunder Sethuraman, Department of Mathematics, University of Arizona; http://math.arizona.edu/~sethuram/
Title: Consistency of modularity clustering and Kelvin's tiling problem
Abstract: Given a graph, the popular `modularity' clustering method specifies a partition of the vertex set as the solution of a certain optimization problem. In this talk, we will discuss consistency properties, or scaling limits, of this method with respect to random geometric graphs constructed from n i.i.d. points, V_n = \{X_1, X_2, . . . ,X_n\}, distributed according to a probability measure supported on a bounded domain in R^d. A main result is the following: Suppose the number of clusters, or partitioning sets of V_n, is bounded above. Then we show that the discrete optimal modularity clusterings converge in a specific sense to a continuum partition of the underlying domain, characterized as the solution of a `soap bubble', or `Kelvin'-type, shape optimization problem.
3:00 pm - 4:00 pm, Mathematics Building, room 501.
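The modularity objective behind this talk can be stated concretely. A small numpy sketch (illustrative only, on a toy graph; the talk analyzes the optimizer of this objective over random geometric graphs) computing Newman's modularity Q for a given partition:

```python
import numpy as np

def modularity(A, labels):
    """Newman modularity Q of a partition of an undirected graph.

    A: symmetric 0/1 adjacency matrix; labels: cluster label per vertex.
    Q = (1/2m) * sum_ij (A_ij - k_i k_j / 2m) * 1[c_i == c_j]
    """
    k = A.sum(axis=1)                       # vertex degrees
    two_m = k.sum()                         # 2 * number of edges
    same = labels[:, None] == labels[None, :]
    return ((A - np.outer(k, k) / two_m) * same).sum() / two_m

# Two 4-cliques joined by a single bridge edge: the natural 2-cluster
# split scores higher than lumping every vertex into one cluster.
A = np.zeros((8, 8))
A[:4, :4] = 1
A[4:, 4:] = 1
np.fill_diagonal(A, 0)
A[3, 4] = A[4, 3] = 1

split = np.array([0, 0, 0, 0, 1, 1, 1, 1])
lump = np.zeros(8, dtype=int)
q_split, q_lump = modularity(A, split), modularity(A, lump)
```

Modularity clustering searches over partitions for the one maximizing Q; the consistency result describes what that maximizer looks like as the number of sampled points grows.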
***
Statistics GIDP Colloquium: Friday, March 3, 2017.
Speaker: MingHung (Jason) Kao, Associate Professor, School of Mathematical and Statistical Sciences, Arizona State University; https://math.la.asu.edu/~mhkao/
Title: Experimental Designs for Functional Brain Imaging with fMRI
Abstract: Functional magnetic resonance imaging (fMRI) experiments are widely conducted in many fields for studying functions of the brain. One of the important first steps of such experiments is to select a good experimental design to allow for a valid and precise statistical inference. However, the identification and construction of high-quality fMRI designs can be quite challenging. In this talk, we introduce some methods for constructing good fMRI designs, and discuss the statistical optimality of these designs.
3:00 pm - 4:00 pm, Mathematics Building, room 501.
***
Statistics GIDP Colloquium: Friday, February 3, 2017.
Speaker: Yong Ge, Management Information Systems, University of Arizona; https://mis.eller.arizona.edu/people/yongge
Title: Point-of-Interest Recommendations in Location-based Social Networks
Abstract: With the rapid development of Location-based Social Network (LBSN) services, a large number of Points-Of-Interest (POIs) have become available, which consequently raises a great demand for building personalized POI recommender systems. A personalized POI recommender system can significantly assist users in finding their preferred POIs and help POI owners attract more customers. However, it is very challenging to develop a personalized POI recommender system because a user's check-in decision-making process is very complex and can be influenced by many factors, such as social network, geographical position, and the dynamics of user preferences. In the first part of this talk, we propose to divide the whole recommendation space into two parts, social friend space and user interest space, and we develop models for each space for generating recommendations. In the second part of this talk, we introduce a new ranking-based method for implicit-feedback-based recommendation. To evaluate the proposed methods, we conduct extensive experiments against many state-of-the-art baseline methods and evaluation metrics on real-world data sets.
Bio: Dr. Yong Ge is an assistant professor in the MIS Department at the UofA. He received his Ph.D. in Information Technology from Rutgers, The State University of New Jersey, in 2013; the M.S. degree in Signal and Information Processing from the University of Science and Technology of China (USTC) in 2008; and the B.E. degree in Information Engineering from Xi'an Jiaotong University in 2005. He received the ICDM 2011 Best Research Paper Award, the Excellence in Academic Research award (one per school) at Rutgers Business School in 2013, and the Dissertation Fellowship at Rutgers University in 2012. He has published prolifically in refereed journals and conference proceedings, such as IEEE TKDE, ACM TOIS, ACM TKDD, ACM TIST, ACM SIGKDD, and IEEE ICDM. His work has been supported by the UofA, NSF, and NIH.
3:00 pm - 4:00 pm, Mathematics Building, room 501.
Fall 2016
Statistics GIDP Colloquium: Wednesday, September 7, 2016.
 Speaker: Matti Morzfeld, Department of Mathematics, University of Arizona; http://math.arizona.edu/~mmo/Home.html
 Title: U2 can UQ - Projects and Life in Uncertainty
 Abstract: I will give an overview of the mathematical and computational problems I face when combining numerical models and data. I will first review basic tools such as Bayes' rule and importance sampling, then explain what difficulties arise when using these tools, and then present two specific applications. The first application uses low-dimensional models to describe and predict reversals of the geomagnetic dipole; the second uses adaptive importance sampling to solve a parameter estimation problem in combustion modeling, leveraging the parallelism of DOE's supercomputers.
 3:00 pm - 4:00 pm, Mathematics Building, room 402.
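As background for the importance sampling ideas reviewed in this talk (a generic textbook sketch, unrelated to the geomagnetic or combustion applications), here is the basic trick of estimating a rare-event probability under a standard normal by sampling from a shifted proposal and reweighting:

```python
import numpy as np

rng = np.random.default_rng(3)

# Importance sampling: estimate P(X > 3) for X ~ N(0,1), a rare event,
# by drawing from the shifted proposal N(3,1) and reweighting.
n = 100_000
target_logpdf = lambda x: -0.5 * x**2 - 0.5 * np.log(2 * np.pi)
proposal_logpdf = lambda x: -0.5 * (x - 3.0)**2 - 0.5 * np.log(2 * np.pi)

x = rng.normal(loc=3.0, size=n)
w = np.exp(target_logpdf(x) - proposal_logpdf(x))   # importance weights p/q
estimate = np.mean((x > 3.0) * w)

# Exact standard normal tail probability at 3 is about 1.3499e-3.
```

A plain Monte Carlo estimate would see only ~135 successes in 100,000 draws; the shifted proposal places half its mass past the threshold, which is why the reweighted estimator has far lower variance. The difficulties the talk mentions arise when no good proposal is available, e.g. in high dimensions.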
 Statistics GIDP Colloquium: Wednesday, October 5, 2016.
 Speaker: Han Xiao, Dept of Statistics and Biostatistics, Rutgers University; http://stat.rutgers.edu/home/hxiao/
 Title: On the maximum cross correlations under high dimension
 Abstract: Multiple time series often exhibit cross lead-lag relationships among their component series. It is very challenging to identify this type of relationship when the number of series is large. We study the lead-lag relationship in the high-dimensional context, using the maximum cross correlations and some other variants. Asymptotic distributions are obtained. We also use the moving blocks bootstrap to improve the finite-sample performance.
 3:00 pm - 4:00 pm, Mathematics Building, room 501.
 This talk will be preceded by a graduate student lunch; contact Kristina Souders (ksouders@email.arizona.edu) for information.
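The basic quantity in this talk, the maximum cross correlation over lags for a pair of series, is easy to sketch (illustrative only; the talk's contribution is the asymptotics when this is scanned over a huge number of pairs, with moving blocks bootstrap calibration):

```python
import numpy as np

rng = np.random.default_rng(4)

def max_cross_corr(x, y, max_lag):
    """Maximum absolute cross correlation of x and y over lags
    -max_lag..max_lag; returns (max |corr|, maximizing lag)."""
    best, best_lag = 0.0, 0
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            a, b = x[: len(x) - lag], y[lag:]   # pair x[t] with y[t + lag]
        else:
            a, b = x[-lag:], y[: len(y) + lag]
        c = abs(np.corrcoef(a, b)[0, 1])
        if c > best:
            best, best_lag = c, lag
    return best, best_lag

# Build a pair where x leads y by 2 steps: y_t ~ x_{t-2} + noise.
n = 500
base = rng.normal(size=n + 2)
x = base[2:]
y = base[:-2] + 0.1 * rng.normal(size=n)

best, best_lag = max_cross_corr(x, y, max_lag=5)
```

With this construction the maximizing lag recovers the true lead-lag offset of 2; the high-dimensional problem is that with thousands of series, spurious maxima over all pairs and lags must be controlled.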
 Statistics GIDP Colloquium: Wednesday, November 2, 2016.
 Speaker: Haiquan Li, Assistant Professor, Director for Translational Bioinformatics, Department of Medicine, University of Arizona; http://u.arizona.edu/~haiquan/
 Title: Scattered disease-linked variants and convergent functions: discovery from big data integration
 Abstract: Genome-wide association studies (GWAS) have identified thousands of disease-linked single nucleotide polymorphisms (SNPs) in the human genome. Most of them have a small effect size (OR<1.4) and are located independently across multiple chromosomes. It remains unclear how they collectively cause disease, owing to the issue of missing heritability. Classic tests of genetic interactions suffer from insufficient power. Here, we will present an integrative approach that leverages several omics datasets to obtain additional information beyond genotypes and thus reduce the number of hypotheses. We combine traditional semantic similarity for gene functions with very deep network permutations (100K times) to quantify the empirical significance of downstream functional similarity for any pair of SNPs. This approach enabled us to discover a fundamental biological mechanism for complex diseases: SNPs associated with the same disease are more likely to associate with the same downstream genes, or with functionally similar genes, than SNPs of unrelated diseases (OR>12). We also found that 40-50% of prioritized SNP pairs have significant genetic interactions in three independent GWAS datasets. These results provide new biological interpretation of genetic interactions and a “roadmap” of disease mechanisms emerging from GWAS SNPs, especially those outside coding regions.
 3:00 pm - 4:00 pm, Mathematics Building, room 501.
 Statistics GIDP Colloquium: Wednesday, December 7, 2016.
 Speaker: Timothy Hanson, Professor, Department of Statistics, University of South Carolina; http://people.stat.sc.edu/hansont/
 Title: A unified framework for fitting Bayesian semiparametric models to arbitrarily censored spatial survival data
 Abstract: A comprehensive, unified approach to modeling arbitrarily censored spatial survival data is presented for the three most commonly used semiparametric models: proportional hazards, proportional odds, and accelerated failure time. Unlike many other approaches, all manner of censored survival times are simultaneously accommodated, including uncensored, interval-censored, current-status, left- and right-censored, and mixtures of these. Left-truncated data are also accommodated, leading to models for time-dependent covariates. Both georeferenced (location observed exactly) and areally observed (location known up to a geographic unit such as a county) spatial locations are handled. Variable selection is also incorporated. Model fit is assessed with conditional Cox-Snell residuals, and model choice is carried out via LPML and DIC. Baseline survival is modeled with a novel transformed Bernstein polynomial prior. All models are fit via new functions that call efficient compiled C++ in the R package spBayesSurv. The methodology is broadly illustrated with simulations and real data applications. An important finding is that proportional odds and accelerated failure time models often fit significantly better than the commonly used proportional hazards model.
 3:00 pm - 4:00 pm, Mathematics Building, room 501.
Spring 2016
 Statistics GIDP Colloquium: Wednesday, February 3, 2016.
 Speaker: Walt Piegorsch, PhD, University of Arizona, GIDP; http://math.arizona.edu/~piegorsch/
 Title: Model uncertainty in environmental risk assessment
 Abstract: Estimation of low-dose ‘benchmark’ points in environmental risk analysis is discussed. The focus is on the increasing recognition that model uncertainty and misspecification can drastically affect point estimators and confidence limits built from limited dose-response data, which in turn can lead to imprecise risk assessments with uncertain, even dangerous, policy implications. Some possible remedies are mentioned, including the use of parametric (frequentist) model averaging over a suite of potential dose-response models, and nonparametric dose-response analysis via isotonic regression. An example on formaldehyde toxicity illustrates the calculations.
 12:00 pm - 1:00 pm, Physics and Atmospheric Sciences Building, room 314.
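One common form of the frequentist model averaging mentioned in the abstract can be sketched with Akaike weights. The AIC values and benchmark-dose (BMD) estimates below are hypothetical placeholders, not results from the talk or the formaldehyde example:

```python
import numpy as np

# Suppose four candidate dose-response models have been fit, each
# yielding an AIC and a benchmark-dose estimate (hypothetical values).
aic = np.array([102.3, 104.1, 108.7, 103.5])   # one per candidate model
bmd = np.array([0.42, 0.55, 0.31, 0.47])       # BMD estimate per model

# Akaike weights: exp(-delta_i / 2), normalized to sum to 1.
delta = aic - aic.min()
weights = np.exp(-0.5 * delta)
weights /= weights.sum()

bmd_avg = float(weights @ bmd)                 # model-averaged BMD
```

Averaging across the suite of models, rather than committing to a single dose-response form, is one way to hedge against the model misspecification the talk warns about; the talk also discusses confidence limits for such averaged quantities.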
 Statistics GIDP Colloquium: Wednesday, March 2, 2016.
 Speaker: Zhaoxia Yu, PhD, University of California Irvine, Dept of Statistics; http://www.ics.uci.edu/~zhaoxia/
 Title: Strategies for Identifying Gene-Gene Interactions
 Abstract: Characterizing gene-gene interactions is of fundamental importance in unraveling the etiology of complex human diseases. However, due to the ultra-high-dimensional nature of the problem, the degree to which genes jointly affect disease risk is largely unknown. Two major obstacles to this goal are the enormous computational effort and the heavy burden of multiple testing in testing gene-gene interactions. In this talk I will discuss several strategies using three examples. In the first example, we derived closed-form and consistent estimates of interaction parameters for case-control data. The derived Wald tests gave very similar results to the gold standard but were ten times faster. In a study of multiple sclerosis, we identified interactions within the major histocompatibility complex region. In the second example, we used information that is independent of interaction testing to prioritize gene-gene pairs for the case-parents design. The application of this strategy provided suggestive evidence for interactions between two genomic regions: the major histocompatibility complex region on chromosome 6 and the killer-cell immunoglobulin-like receptor region on chromosome 19. In the last example, we borrowed information across distinct but similar diseases. We found that genes interacting in multiple sclerosis also interacted with each other in type 1 diabetes.
 12:00 pm - 1:00 pm, Physics and Atmospheric Sciences Building, room 314.
 Statistics GIDP Colloquium: Wednesday, April 6, 2016.
 Speaker: Jie Chen, PhD, Georgia Regents University, Dept of Biostatistics & Epidemiology; http://biostat.gru.edu/Faculty&Staff/JChen
 Title: Change point models in the Bayesian Perspective and their applications in CNV study
 Abstract: Biomedical researchers now use advanced technologies, such as comparative genomic hybridization (CGH), array-based comparative genomic hybridization (aCGH), and high-throughput next-generation sequencing (NGS), to conduct DNA copy number experiments for detecting DNA copy number variations (CNVs), as cancer development, genetic disorders, and many other diseases are usually related to CNVs on the genome. Identifying the boundaries of CNV regions on a chromosome or a genome can be viewed as a change point problem of detecting signal/intensity changes present in the genomic data. The analysis of high-throughput genomic data for possible changes has become one of the most viable recent applications of statistical change point analysis. In this talk, I present several change point models suited to the different data types resulting from the aCGH and NGS technologies, and provide Bayesian solutions to these models. Applications of these methods to tumor cell line data will also be given.
 12:00 pm - 1:00 pm, Physics and Atmospheric Sciences Building, room 314.
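A minimal illustration of the change point idea in the abstract (a classical frequentist CUSUM estimator on simulated data, not the speaker's Bayesian models): a single mean shift can be located by maximizing the standardized CUSUM statistic over candidate split points:

```python
import numpy as np

rng = np.random.default_rng(5)

# Simulated intensity sequence with one mean shift at index tau = 120
# (a cartoon of a copy-number gain along a chromosome).
n, tau = 300, 120
signal = np.concatenate([np.zeros(tau), 1.5 * np.ones(n - tau)])
y = signal + rng.normal(size=n)

def cusum_changepoint(y):
    """Index k (1 <= k < n) maximizing the standardized CUSUM statistic."""
    n = len(y)
    k = np.arange(1, n)
    s = np.cumsum(y)[:-1]          # partial sums S_1 .. S_{n-1}
    total = y.sum()
    # |difference of segment means|, scaled by sqrt(k(n-k)/n)
    stat = np.sqrt(k * (n - k) / n) * np.abs(s / k - (total - s) / (n - k))
    return int(k[np.argmax(stat)])

tau_hat = cusum_changepoint(y)
```

The Bayesian formulations in the talk instead place priors on the change point locations and segment parameters, which yields posterior uncertainty for the CNV boundaries rather than a single point estimate.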
 Statistics GIDP Colloquium: Monday, May 2, 2016.
 Speaker: Bikas Sinha, PhD, Retired Faculty, Indian Statistical Institute, Kolkata, India
 Title: Mixture Experiments: Theory and Applications
 Abstract: This is a review talk dealing briefly with mixture models, standard mixture designs and optimal mixture experiments. Some application areas will be highlighted.
 12:00 pm - 1:00 pm, Physics and Atmospheric Sciences Building, room 314.
Fall 2015
 Statistics GIDP Colloquium: Wednesday, September 2, 2015. Speaker: Neng Fan, Assistant Professor, Systems and Industrial Engineering Department, University of Arizona. Title: Learning from Data with Uncertainties via Data-Driven Optimization
 12:00 pm - 1:00 pm, Saguaro Hall 114.
 Abstract: In the last several decades, many advanced technologies have been developed to collect and store data continuously, and data and decisions are more strongly linked together than ever before. In most cases, the data include many uncertainties, such as missing or incomplete information, measurement errors, and noise. Traditional machine learning methods for decision-making deal with exact data. Only to some extent has data uncertainty, modeled by support sets or mean/moment values, been considered for robust decisions. In this talk, we discuss statistical models for data uncertainties and data-driven optimization approaches for decision-making under uncertainty, especially in the case of big data. Some robust and chance-constrained optimization models and algorithms for support vector machines will be introduced, and numerical experiments will be performed to validate the proposed approaches.
 Statistics GIDP Colloquium: Wednesday, September 30, 2015. Speaker: Clayton Morrison, Associate Professor, School of Information, University of Arizona.
 Title: Finding Structure in Time: Inferring Structured Latent Sequences and Activity Descriptions
 12:00 pm - 1:00 pm, Modern Languages Building 410.

Abstract: Humans excel at understanding complex dynamic histories, recognizing relevant context and using that context to interpret events that are sometimes hierarchically and recursively structured. Our research group has found the tools of Bayesian nonparametric modeling and inference well suited for approaching several aspects of these problems. In this talk I present ongoing work on two applications that require methods for inferring structurally rich representations of time series: identifying context relevant to interpreting biochemical reactions described in cancer biology research papers, and constructing descriptions of coordinated activities from observations in video.
 Statistics GIDP Colloquium: Wednesday, November 4, 2015. Speaker: Professor Avelino Arellano, Jr., Dept of Atmospheric Science, University of Arizona.
 Title: Towards Seamless Prediction of Chemical Weather
 12:00 pm - 1:00 pm, Modern Languages Building 410.
 Abstract
 Statistics GIDP Colloquium: Wednesday, December 2, 2015. Speaker: Professor Gen Li, Department of Biostatistics, Mailman School of Public Health, Columbia University.

Title: Supervised Principal Component Analysis and Extensions

Abstract: It is increasingly common to have heterogeneous data sets measured on the same set of samples. Integrative analysis of multi-source data promises to reveal a more comprehensive picture of the underlying truth than individual analyses. In this talk, I will introduce a novel integrative dimension-reduction framework called Supervised Principal Component Analysis (SupPCA). The research is motivated by applications where people are interested in the low-rank structure of some primary data while auxiliary variables are also available on the same set of samples. The proposed method can make use of the extra information in the auxiliary data to accurately extract underlying structures that are more interpretable. The model is formulated in a hierarchical fashion using latent variables, and subsumes many existing models as special cases. The asymptotic properties of parameter estimation are derived. We also extend the framework to accommodate special features, such as high-dimensional data, functional data, and multi-modal data. Applications to bioinformatics and business analytics problems demonstrate the advantage of the proposed methodology.

12:00 pm - 1:00 pm, Modern Languages Building 410.