Data Science Seminar: Academic Talks
Date: June 29, 2024
Venue: Lecture Hall 302, Northeast Building, School of Science
Extreme-based causal effect learning (EXCEL) with unmeasured light-tailed confounding
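Speaker: Miao Wang (苗旺), Peking University
Time: 08:30-08:55
Speaker bio: Miao Wang is an assistant professor in the Department of Probability and Statistics and the Center for Statistical Science at Peking University. After undergraduate and doctoral studies at Peking University's School of Mathematical Sciences (2008-2017) and postdoctoral research in the Department of Biostatistics at Harvard University (2017-2018), Miao joined the Guanghua School of Management at Peking University in 2018 and moved to the School of Mathematical Sciences in 2020. Research interests include causal inference, missing data, and semiparametric statistics and its applications. With collaborators, Miao proposed the proxy inference theory for confounding analysis and developed identification and doubly robust estimation theory for data missing not at random, as well as semiparametric theory for data fusion. This work has been supported by the National Key R&D Program Young Scientist Project, a General Program grant from the National Natural Science Foundation of China, and the Young Scholar Program of the Changjiang Scholars Award Program. Homepage: https://www.math.pku.edu.cn/teachers/mwfy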
Abstract: Unmeasured confounding poses a significant challenge in identifying and estimating causal effects across various research domains. Existing methods to address confounding often rely on either parametric models or auxiliary variables, which strongly rest on domain knowledge and could be fairly restrictive in practice. In this paper, we propose a novel strategy for identifying causal effects in the presence of confounding under an additive structural equation with light-tailed confounding. This strategy uncovers the causal effect by exploring the relationship between the exposure and outcome at the extreme, which can bypass the need for parametric assumptions and auxiliary variables. The resulting identification is versatile, accommodating a multi-dimensional exposure, and applicable in scenarios involving unmeasured confounders, selection bias, or measurement errors. Building on this identification approach, we develop an Extreme-based Causal Effect Learning (EXCEL) method and further establish its consistency and non-asymptotic error bound. The asymptotic normality of the proposed estimator is established under the linear model. The EXCEL method is applied to causal inference problems with invalid instruments to construct a valid confidence set for the causal effect. Simulations and a real data analysis are used to illustrate the potential application of our method in causal inference.
LMANStat: A multi-layer academic network dataset derived from statistical publications
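Speaker: Pan Rui (潘蕊), Central University of Finance and Economics
Time: 08:55-09:20
Speaker bio: Pan Rui is a professor and doctoral supervisor at the School of Statistics and Mathematics, Central University of Finance and Economics (CUFE), and a Longma Young Scholar at CUFE. Main research areas include statistical modeling of network-structured data and statistical analysis of spatio-temporal data. Pan has published more than 30 papers in journals including Annals of Statistics, Journal of the American Statistical Association, and Journal of Business & Economic Statistics, and is the author of the Chinese-language monographs 《数据思维实践》 (Data Thinking in Practice) and 《网络结构数据分析与应用》 (Analysis and Applications of Network-Structured Data). Pan is the principal investigator of National Natural Science Foundation of China projects. With extensive experience in case writing and teaching, Pan won second prize in CUFE's Young Teachers' Basic Teaching Skills Competition and first prize in the inaugural "Same Course, Different Designs" (同课异构) ideological and political education teaching competition of the China University Finance and Economics MOOC Alliance.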
Abstract: The utilization of multi-layer network structures now enables the explanation of complex systems in nature from multiple perspectives. Multi-layer academic networks capture diverse relationships among academic entities, facilitating the study of academic development and the prediction of future directions. However, there are currently few academic network datasets that simultaneously consider multi-layer academic networks; often, they only include a single layer. In this study, we provide a large-scale multi-layer academic network dataset, namely, LMANStat, which includes collaboration, co-institution, citation, co-citation, journal citation, author citation, author-paper and keyword co-occurrence networks. Furthermore, each layer of the multi-layer academic network is dynamic. Additionally, we expand the attributes of nodes, such as authors' research interests, productivity, region and institution. Supported by this dataset, it is possible to study the development and evolution of statistical disciplines from multiple perspectives. This dataset also provides fertile ground for studying complex systems with multi-layer structures.
Block Randomized Experiments, Block 2^K Factorial Experiments, and Covariate Adjustment
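Speaker: Yang Yuehan (杨玥含), Central University of Finance and Economics
Time: 09:20-09:45
Speaker bio: Yang Yuehan is a professor at the School of Statistics and Mathematics, Central University of Finance and Economics (CUFE), and holds a PhD from Peking University. Yang is a CUFE Young Talent and a Longma Young Scholar. Research focuses on complex data modeling, causal inference, and transfer learning. Yang has led several National Natural Science Foundation of China projects and has received multiple outstanding paper awards and practical teaching awards. As sole author, first author, or corresponding author, Yang has published more than 40 papers in Chinese and international journals, including Journal of the American Statistical Association, Biometrika, Pattern Recognition, Expert Systems with Applications, and 《中国科学:数学》 (Scientia Sinica Mathematica).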
Abstract: Randomized block factorial experiments are widely used in industrial engineering, clinical trials, and social science. Researchers often use a linear model and analysis of covariance to analyze experimental results; however, few studies have addressed the validity and robustness of the resulting inferences, because the assumptions of a linear model might not be justified by randomization in randomized block factorial experiments. In this article, we establish a new finite population joint central limit theorem for the usual (unadjusted) factorial effect estimators in randomized block 2^K factorial experiments. Our theorem is obtained under a randomization-based inference framework, making use of an extension of the vector form of the Wald–Wolfowitz–Hoeffding theorem for a linear rank statistic. It is robust to model misspecification, numbers of blocks, block sizes, and propensity scores across blocks. To improve the estimation and inference efficiency, we propose four covariate adjustment methods. We show that under mild conditions, the resulting covariate-adjusted factorial effect estimators are consistent, jointly asymptotically normal, and generally more efficient than the unadjusted estimator. In addition, we propose Neyman-type conservative estimators for the asymptotic covariances to facilitate valid inferences. Simulation studies and a clinical trial data analysis demonstrate the benefits of the covariate adjustment methods.
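As a rough illustration of what an unadjusted factorial effect estimator looks like in a blocked 2^K design, the sketch below weights within-block cell means by block size and then forms the usual high-versus-low contrasts. It is a minimal sketch under stated assumptions, not the speaker's implementation: the function name `factorial_effects`, the record layout, and the toy data are all hypothetical, and every block is assumed to contain every treatment combination.

```python
from collections import defaultdict
from itertools import product


def factorial_effects(records, K):
    """Unadjusted main-effect estimates in a blocked 2^K design.

    records: iterable of (block, z, y), where z is a tuple of K levels in {0, 1}.
    Returns a list with one main-effect estimate per factor.
    """
    # Group outcomes by block and by treatment combination.
    by_block = defaultdict(lambda: defaultdict(list))
    for block, z, y in records:
        by_block[block][tuple(z)].append(y)

    n_total = sum(len(ys) for cells in by_block.values() for ys in cells.values())
    combos = list(product((0, 1), repeat=K))

    # Block-size-weighted mean outcome for every treatment combination.
    mean_y = {}
    for z in combos:
        total = 0.0
        for cells in by_block.values():
            n_block = sum(len(ys) for ys in cells.values())
            ys = cells[z]  # assumption: every block contains every combination
            total += (n_block / n_total) * (sum(ys) / len(ys))
        mean_y[z] = total

    # Main effect of factor k: high-vs-low contrast averaged over the other factors.
    scale = 2 ** (K - 1)
    return [
        sum((1 if z[k] == 1 else -1) * mean_y[z] for z in combos) / scale
        for k in range(K)
    ]


# Toy data (hypothetical): additive effects of 2 and 3 plus a block-level shift.
records = []
for block, base, reps in (("A", 10.0, 1), ("B", 20.0, 2)):
    for z in product((0, 1), repeat=2):
        records += [(block, z, base + 2.0 * z[0] + 3.0 * z[1])] * reps
effects = factorial_effects(records, K=2)
```

Because the toy outcomes are purely additive, `effects` recovers the two main effects (2 and 3) up to floating-point rounding, regardless of the block-level shifts.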
“Statistical Reinforcement Learning” VS “Reinforcement Statistical Learning”
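Speaker: Yan Xiaodong (严晓东), Shandong University
Time: 09:45-10:10
Speaker bio: Yan Xiaodong is a Future Scholar, associate researcher, and doctoral supervisor at Shandong University. Yan previously served as a Research Fellow at The Hong Kong Polytechnic University and as a postdoctoral researcher at the University of Alberta, Canada, and holds a doctorate from a joint training program between Yunnan University and The Hong Kong Polytechnic University. Yan leads an Outstanding Young Innovation Team of Shandong Provincial Universities and currently serves as a council member of the National Association for Industrial Statistics Teaching and Research (全国工业统计学教学研究会) and secretary-general of the Zhongguancun Software Alliance Intelligent Algorithm Committee (中关村软联智能算法委员会), among other roles. Yan has published more than 30 papers in leading statistics journals (JRSSB, AOS, JASA), the leading econometrics journal JOE, and the top artificial intelligence conference AAAI, and has received the "Outstanding Young Scholar" title from the Shandong Big Data Research Association and the Yunnan Province 2020 Outstanding Doctoral Dissertation award. As principal investigator, Yan has received General Program and Young Scientists Fund grants from the National Natural Science Foundation of China, along with funding from the Ministry of Science and Technology's Key R&D Program (as a core project member) and the National Bureau of Statistics.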
Abstract: Reinforcement learning is widely studied across various fields. "Statistical Reinforcement Learning" aims to develop reinforcement learning processes with enhanced interpretability using statistical models and methods. "Reinforcement Statistical Learning", which we developed, is a new research direction that seeks to develop more robust and effective statistical learning methods using limit theory for reinforcement learning processes. Each direction draws on the strengths of the other field to develop new machine learning methods. Recently, our team has pioneered the "Strategic Limit Theory" based on the simplest model of reinforcement learning, the multi-armed bandit model. This represents a significant breakthrough at the intersection of nonlinear probability theory and reinforcement learning, expanding the research paradigms of traditional statistical methods. This report details the methods of "Reinforcement Statistical Learning" from the perspective of the Strategic Limit Theory and introduces subsequent research based on it, including studies on statistical theories and methods such as two-sample tests, sequential sampling, experimental design, online learning, and transfer learning.
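For readers unfamiliar with the multi-armed bandit model mentioned in the abstract, the following minimal sketch runs the standard UCB1 index policy on simulated arms. It only illustrates the bandit setting; it is not the Strategic Limit Theory or any method from the talk. The function name `ucb1`, the arm probabilities, and the horizon are assumptions chosen for illustration.

```python
import math
import random


def ucb1(pull, n_arms, horizon):
    """Run the UCB1 index policy; return pull counts and mean rewards per arm."""
    counts = [0] * n_arms
    means = [0.0] * n_arms
    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1  # play every arm once to initialise its estimate
        else:
            # Empirical mean plus an optimism bonus that shrinks with more pulls.
            arm = max(
                range(n_arms),
                key=lambda a: means[a] + math.sqrt(2.0 * math.log(t) / counts[a]),
            )
        reward = pull(arm)
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]  # running mean update
    return counts, means


# Hypothetical Bernoulli arms; the success probabilities are illustrative only.
rng = random.Random(0)
probs = [0.2, 0.5, 0.8]
counts, means = ucb1(lambda a: 1.0 if rng.random() < probs[a] else 0.0, 3, 2000)
```

The `pull` callback keeps the policy separate from the reward mechanism, so the same function can be driven by simulated rewards, as here, or by logged data.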
Break: 10:10-10:20
Decorrelated forward regression for high-dimensional data analysis
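Speaker: Jiang Xuejun (蒋学军), Southern University of Science and Technology
Time: 10:20-10:45
Speaker bio: Jiang Xuejun is an associate professor, researcher, and doctoral supervisor in the Department of Statistics and Data Science at Southern University of Science and Technology (SUSTech). Jiang received a PhD from the Department of Statistics at The Chinese University of Hong Kong (CUHK) in 2009, did postdoctoral research at CUHK in 2009-2010, served as an associate professor at Zhongnan University of Economics and Law in 2010-2013, and joined SUSTech in July 2013. Honors include selection for Shenzhen's Peacock Plan for overseas high-level talent (2016), the SUSTech Outstanding Teaching Award (2018), and the Shenzhen Outstanding Teacher title (2018). Jiang has led and completed more than 10 projects, including grants from the National Natural Science Foundation of China and the Guangdong Natural Science Foundation and Shenzhen basic research general projects. Main research directions include quantile regression, variable selection, hypothesis testing, high-dimensional statistical inference, and financial statistics and econometrics. Jiang has published more than 50 SCI- and SSCI-indexed papers in leading international statistics and econometrics journals, including Biometrika, Bernoulli, Statistics and Computing, Statistica Sinica, and Econometrics Journal, and has published one English-language textbook. Roles in Chinese academic societies include vice-chair of the Educational Statistics and Management Branch of the Chinese Association for Applied Statistics (中国现场统计研究会) and secretary-general of its Multivariate Analysis Applications Committee, among others.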
Abstract: Forward regression (FR) is a crucial methodology for automatically identifying important predictors from a large pool of potential covariates. While forward selection techniques achieve screening consistency in contexts with moderate predictor correlation, this property gradually becomes invalid when dealing with substantially correlated variables—especially in high-dimensional datasets where strong correlations exist among predictors. This challenge is not unique to forward selection methods and is encountered by other model selection approaches as well. To address these challenges, we introduce a novel decorrelated forward (DF) selection framework for generalized mean regression models, including prevalent models, such as linear, logistic, Poisson, and quasi likelihood. The DF selection framework stands out because of its ability to convert generalized mean regression models into linear ones, thus providing a clear interpretation of the forward selection process. It also offers a closed-form expression for forward iteration, to improve practical applicability and efficiency. Theoretically, we establish the screening consistency of DF selection and determine the upper bound of the selected submodel's size. To reduce computational burden, we develop a thresholding DF algorithm that provides a stopping rule for the forward-searching process. Simulations and real data applications show the outstanding performance of our method compared with that of some existing model selection methods.
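To make the baseline concrete, here is a minimal sketch of classical forward regression for a linear model: at each step it adds the predictor that most reduces the residual sum of squares, using Gram-Schmidt orthogonalization to keep the bookkeeping simple. This is plain FR, not the decorrelated forward (DF) selection proposed in the talk. The function names, the zero-gain stopping rule, and the toy design are assumptions, and all columns are assumed to be centred.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def forward_regression(columns, y, n_steps):
    """Classical forward regression for a linear model without intercept.

    columns: list of predictor columns (lists of floats, assumed centred).
    y: response values (assumed centred).
    Returns predictor indices in the order they were selected.
    """
    residual = list(y)
    # Working copies are kept orthogonal to every predictor selected so far.
    work = [list(c) for c in columns]
    selected = []
    for _ in range(n_steps):
        best, best_gain = None, 0.0
        for j, x in enumerate(work):
            norm2 = dot(x, x)
            if j in selected or norm2 < 1e-12:
                continue
            gain = dot(residual, x) ** 2 / norm2  # drop in residual sum of squares
            if gain > best_gain:
                best, best_gain = j, gain
        if best is None:  # no remaining predictor reduces the residual
            break
        selected.append(best)
        q = work[best]
        q_norm2 = dot(q, q)
        # Remove the chosen direction from the residual and remaining predictors.
        coef = dot(residual, q) / q_norm2
        residual = [r - coef * qi for r, qi in zip(residual, q)]
        for j, x in enumerate(work):
            if j not in selected:
                c = dot(x, q) / q_norm2
                work[j] = [xi - c * qi for xi, qi in zip(x, q)]
    return selected


# Toy orthogonal design (hypothetical): y depends on predictors 0 and 2 only.
cols = [[1.0, -1.0, 1.0, -1.0], [1.0, 1.0, -1.0, -1.0], [1.0, -1.0, -1.0, 1.0]]
y = [3.0 * a + c for a, c in zip(cols[0], cols[2])]
order = forward_regression(cols, y, n_steps=3)
```

On this toy design the procedure picks predictor 0 first, then predictor 2, and stops because the remaining predictor explains nothing. With strongly correlated columns the early picks can be misleading, which is the failure mode the abstract describes.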
Communication-Efficient Pilot Estimation for Non-Randomly Distributed Data in Diverging Dimensions
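Speaker: Xia Xiaochao (夏小超), Chongqing University
Time: 10:45-11:10
Speaker bio: Xia Xiaochao is an associate professor and master's supervisor at the College of Mathematics and Statistics, Chongqing University, and previously did postdoctoral research at the National University of Singapore. Main research interests are massive data analysis, high-dimensional feature screening, and model averaging. Xia has published a number of papers and has led and completed one national-level funded project.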
Abstract: Distributed learning has been an indispensable tool in dealing with massive or distributed datasets. As an important and popular distributed learning method, the communication-efficient surrogate likelihood (CSL; Jordan et al., 2019, JASA) framework was proposed and has received much attention from the distributed machine learning community. In most of the works based on the CSL framework, there are two common treatments: (i) choosing the first machine as the central machine to solve an optimization problem using the data on the first machine; and (ii) assuming that the dimension is fixed when deriving some statistical properties. However, treatment (i) may not be appropriate when the data are stored in a non-random manner or heterogeneously distributed across different machines, which might be common in practice; and treatment (ii) largely limits the applications of CSL to diverging- or high-dimensional datasets, especially when the purpose is to infer some parameters of interest. To address the challenges posed by (i) and (ii), we develop a communication-efficient pilot (CEP) estimation strategy. Specifically, we first implement pilot sampling on each machine to obtain a pilot sample dataset, and then use a new pilot sample-based surrogate loss function to approximate the global one; its minimizer is called the CEP estimator. Second, we rigorously investigate theoretical properties of the CEP estimator, including its convergence rate, which can reach the global rate √(𝑃_𝑛/𝑁), and its asymptotic normality when the dimension 𝑃_𝑛 diverges with the pilot sample size 𝑟 and 𝑃_𝑛<𝑛. Furthermore, we extend the CEP method to the high-dimensional case, i.e., 𝑃_𝑛>𝑛, and propose a regularized version of CEP (CERP). We establish non-asymptotic error bounds for an 𝑙_1-regularized CERP estimator (CERP-Lasso) and provide the convergence rate and asymptotic normality for a weighted 𝑙_1-regularized CERP estimator (CERP-aLasso) under generalized linear models. Finally, extensive synthetic and real datasets are employed to illustrate the superiority of the proposed approaches.
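For orientation, the surrogate loss of the CSL framework that the abstract builds on can be written as below. The notation is introduced here for illustration and is an assumption: $L_k$ is the empirical loss on machine $k$ of $m$ machines with equal sample sizes, $L_N$ is the global empirical loss, and $\bar\theta$ is an initial estimator.

```latex
% CSL surrogate loss, with the first machine acting as the central machine
\tilde{L}(\theta)
  = L_1(\theta)
  - \bigl\langle \nabla L_1(\bar\theta) - \nabla L_N(\bar\theta),\, \theta \bigr\rangle,
\qquad
L_N(\theta) = \frac{1}{m} \sum_{k=1}^{m} L_k(\theta).

% Special case: for a quadratic loss with local Hessian H_1, the minimizer is a one-step update
\hat\theta = \bar\theta - H_1^{-1} \nabla L_N(\bar\theta).
```

Only gradients at $\bar\theta$ are communicated, which is what makes the scheme communication-efficient. Per the abstract, CEP replaces the first-machine loss $L_1$ with a loss built from pilot samples drawn on every machine; the exact form of that pilot-based surrogate is not given in this program.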
Robust Mendelian randomization method accounting for idiosyncratic and correlated pleiotropy with applications to stroke outcomes
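Speaker: Cheng Qing (成青), Southwestern University of Finance and Economics
Time: 11:10-11:35
Speaker bio: Cheng Qing is an associate professor at the School of Statistics, Southwestern University of Finance and Economics. Main research areas include causal analysis based on Mendelian randomization, functional data analysis, feature screening for interaction effects, and conditional independence testing. Cheng has published papers in journals including Nature Communications, Journal of the American Statistical Association, Bioinformatics, and NAR Genomics and Bioinformatics, and is the principal investigator of a National Natural Science Foundation of China project.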
Abstract: Mendelian randomization (MR) serves as a valuable tool for investigating causal relationships between exposures and disease outcomes in observational studies. However, MR methods operating under classical assumptions may yield biased estimates and an inflated number of false-positive causal relationships when faced with realistic and complex correlated horizontal pleiotropy (CHP). While numerous MR methods have emerged to address CHP effects, few can effectively handle relatively large direct effects, commonly known as idiosyncratic pleiotropy. In response to this gap, we propose an efficient and Robust Mendelian Randomization method to account for Idiosyncratic and Correlated Pleiotropy, named RMR-ICP. Furthermore, our method employs parallel Gibbs sampling to incorporate linkage disequilibrium structure, thereby enhancing statistical power. We demonstrate the robustness and efficiency of our method through extensive simulation studies and applications. In particular, we apply RMR-ICP to study the effects of plasma proteins on stroke. Several notable associations are identified. For example, SELE has a positive causal effect on any stroke. An elevated BNP is associated with an increased risk of cardioembolic stroke, but not with other stroke subtypes. This offers a fresh perspective on the identification of plasma proteins associated with stroke.
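As background for the classical assumptions mentioned in the abstract, the sketch below computes the standard fixed-effect inverse-variance weighted (IVW) MR estimate from summary statistics. It is a simple baseline that assumes no pleiotropy and independent variants, not RMR-ICP. The function name, argument names, and toy numbers are hypothetical.

```python
import math


def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect IVW causal estimate and its standard error.

    beta_exposure: per-variant effects on the exposure.
    beta_outcome: per-variant effects on the outcome.
    se_outcome: standard errors of the outcome effects.
    """
    # Each variant's ratio estimate is weighted by its inverse variance.
    weights = [bx * bx / (se * se) for bx, se in zip(beta_exposure, se_outcome)]
    ratios = [by / bx for bx, by in zip(beta_exposure, beta_outcome)]
    total = sum(weights)
    estimate = sum(w * r for w, r in zip(weights, ratios)) / total
    return estimate, math.sqrt(1.0 / total)


# Toy summary statistics (hypothetical): every variant implies a causal effect of 0.5.
est, se = ivw_estimate([0.1, 0.2, 0.4], [0.05, 0.1, 0.2], [0.01, 0.02, 0.01])
```

When some variants affect the outcome through pathways other than the exposure, their ratio estimates no longer agree and this weighted average becomes biased, which is the problem the robust methods in the abstract target.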
Probabilistic cell/domain-type assignment of spatial transcriptomics data with SpatialAnno
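Speaker: Yang Yi (杨屹), Southeast University
Time: 11:35-12:00
Speaker bio: Yang Yi is an associate researcher at the School of Life Science and Technology, Southeast University. Yang received a PhD from Shanghai University of Finance and Economics and was a postdoctoral fellow at Duke-NUS Medical School, National University of Singapore. Yang works in biostatistics, analyzing biological data mainly through mathematical and statistical modeling. One research direction is GWAS and TWAS: modeling the relationship between SNPs or genes and traits to reveal associations between complex traits and SNPs or genes. Another direction is modeling single-cell or spatial transcriptomics data to better cluster and annotate cells.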
Abstract: In the analysis of both single-cell RNA sequencing (scRNA-seq) and spatially resolved transcriptomics (SRT) data, classifying cells/spots into cell/domain types is an essential analytic step for many secondary analyses. Most of the existing annotation methods have been developed for scRNA-seq datasets without any consideration of spatial information. Here, we present SpatialAnno, an efficient and accurate annotation method for spatial transcriptomics datasets, with the capability to effectively leverage a large number of non-marker genes as well as ‘qualitative’ information about marker genes without using a reference dataset. Uniquely, SpatialAnno estimates low-dimensional embeddings for a large number of non-marker genes via a factor model while promoting spatial smoothness among neighboring spots via a Potts model. Using both simulated and four real spatial transcriptomics datasets from the 10x Visium, ST, Slide-seqV1/2, and seqFISH platforms, we showcase the method’s improved spatial annotation accuracy, including its robustness to the inclusion of marker genes for irrelevant cell/domain types and to various degrees of marker gene misspecification. SpatialAnno is computationally scalable and applicable to SRT datasets from different platforms. Furthermore, the estimated embeddings for cellular biological effects facilitate many downstream analyses.