Elements of Causal Learning: Basic Concepts, Theory, Methods, Algorithms and Applications

Jan 2017 to Dec 2020

Harvard University, Prime Sponsor: Department of the Navy

This proposal studies new methods for causal inference, with a focus on methods that incorporate recent advances from machine learning and statistical learning theory. A first set of questions concerns learning about treatment effect heterogeneity or, more generally, heterogeneity in the estimated parameters of a structural or causal model. A closely related question concerns the estimation and evaluation of optimal policies. We propose to look for efficient methods to construct policies from data, as well as to evaluate (with correct p-values) the return to using the personalized policies.

Another related question concerns online experimentation: how can experiments be designed to assign individuals to treatments as they arrive, using data from earlier individuals to estimate a policy, balancing the need for exploration against the desire for “exploitation,” that is, the desire to avoid giving individuals suboptimal treatments? This problem has been well studied, but much less is known about the setting where individuals have observed attributes, so that the goal is to construct and evaluate personalized treatment assignments. We propose to incorporate insights from the causal inference literature to improve upon existing algorithms and design new ones.

Another class of questions concerns the problem of estimating average causal effects under unconfoundedness in settings with a modest number of covariates, where the asymptotic properties are derived holding the number of covariates fixed. See Imbens and Rubin (2015) for a textbook discussion. More recently, researchers have focused on settings with a large number of covariates relative to the number of observations. Many of the proposed approaches take methods developed for a modest number of covariates and adapt them by estimating unknown functions using machine learning methods. We plan to work on adapting doubly robust methods, previously proposed for settings with few covariates, to large data settings, with the aim of balancing the focus between covariate balancing and regression adjustment.

A final set of questions concerns the use of surrogates in high-dimensional settings. There is a biostatistics literature on the use of surrogate markers in settings where the outcome of interest is difficult to measure. We study the setting where we have two sets of data. The first is from a randomized experiment in which we observe a large number of potential surrogates. The second is an observational dataset with data on the surrogates and the final outcome of interest.
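
As a point of reference for the doubly robust methods mentioned above, the following is a minimal sketch of the standard augmented inverse propensity weighting (AIPW) estimator of an average treatment effect in the few-covariate setting. It is not the method this proposal develops. The function names (`fit_linear`, `fit_logistic`, `aipw_ate`), the linear and logistic working models, the clipping threshold, and the simulated data are all assumptions chosen for illustration.

```python
import numpy as np


def add_intercept(X):
    """Prepend a column of ones to the covariate matrix."""
    return np.column_stack([np.ones(len(X)), X])


def fit_linear(X, y):
    """Least-squares fit of y on X with an intercept (illustrative outcome model)."""
    coef, *_ = np.linalg.lstsq(add_intercept(X), y, rcond=None)
    return coef


def fit_logistic(X, w, n_iter=50):
    """Logistic regression of w on X via Newton's method (illustrative propensity model)."""
    Xb = add_intercept(X)
    beta = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Xb @ beta))
        grad = Xb.T @ (w - p)
        hess = (Xb * (p * (1.0 - p))[:, None]).T @ Xb
        # Small ridge term keeps the linear solve numerically stable.
        beta = beta + np.linalg.solve(hess + 1e-8 * np.eye(len(beta)), grad)
    return beta


def aipw_ate(X, w, y, clip=0.01):
    """AIPW estimate of the average treatment effect and its standard error."""
    Xb = add_intercept(X)
    # Regression adjustment: separate outcome models for treated and control units.
    mu1 = Xb @ fit_linear(X[w == 1], y[w == 1])
    mu0 = Xb @ fit_linear(X[w == 0], y[w == 0])
    # Propensity score, clipped away from 0 and 1 (the threshold is an assumption).
    e = 1.0 / (1.0 + np.exp(-Xb @ fit_logistic(X, w)))
    e = np.clip(e, clip, 1.0 - clip)
    # Doubly robust score: model prediction plus inverse-weighted residual correction.
    scores = mu1 - mu0 + w * (y - mu1) / e - (1 - w) * (y - mu0) / (1.0 - e)
    return scores.mean(), scores.std(ddof=1) / np.sqrt(len(y))


# Simulated data under unconfoundedness; the true average effect is 2.0.
rng = np.random.default_rng(0)
n = 5000
X = rng.normal(size=(n, 3))
propensity = 1.0 / (1.0 + np.exp(-0.5 * X[:, 0]))
w = rng.binomial(1, propensity)
y = 2.0 * w + X @ np.array([1.0, -1.0, 0.5]) + rng.normal(size=n)

ate_hat, se_hat = aipw_ate(X, w, y)
print(f"AIPW estimate: {ate_hat:.3f} (standard error {se_hat:.3f})")
```

The score combines the two ingredients named in the text: the difference in fitted outcomes is the regression adjustment, and the inverse-propensity-weighted residuals correct for imbalance in covariates between treated and control units. In settings with many covariates, the linear and logistic fits here would be replaced by machine learning estimators, which is where the modifications discussed above come in.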