LEVERAGING EXTERNAL DATA IN CLINICAL TRIAL DESIGN: SYNTHETIC CONTROL ARM CONSTRUCTION USING A CAUSAL INFERENCE INTEGRATED MACHINE LEARNING APPROACH

Author/Creator

Author/Creator ORCID

Department

Mathematics and Statistics

Program

Statistics

Citation of Original Publication

Rights

This item may be protected under Title 17 of the U.S. Copyright Law. It is made available by UMBC for non-commercial research and education. For permission to publish or reproduce, please see http://aok.lib.umbc.edu/specoll/repro.php or contact Special Collections at speccoll(at)umbc.edu
Distribution Rights granted to UMBC by the author.

Abstract

The integration of external data to construct synthetic controls represents a shift from conventional reliance on concurrent randomized controlled trials (RCTs) in evidence-based medicine. The use of real-world data (RWD) has gained momentum, driven by the rising cost and feasibility challenges of RCTs, the availability of high-quality RWD, and advances in causal inference and Bayesian modeling for estimating average treatment effects (ATE). Synthetic control methods construct control subjects from external data to approximate traditional control arms, enabling ATE estimation in RCTs with limited concurrent controls. However, challenges such as covariate heterogeneity and unmeasured confounding make valid integration of external data complex. Traditional propensity score (PS) methods, commonly used to balance covariates in observational studies, face limitations when used to integrate external data. In that setting, the treatment indicator is redefined as a data source indicator; because assignment to data sources may not depend on covariates, standard PS modeling becomes less reliable, and optimal model selection is not well defined. To address these issues, Chapter 2 proposes a machine learning approach, the One-Class Support Vector Machine (OCSVM), which identifies external units compatible with the current study using only current data. OCSVM avoids the need to model assignment mechanisms and better handles non-linearities and heterogeneity across datasets. PS methods, the proposed OCSVM, Bayesian approaches, and two-stage approaches were compared in a simulation study. OCSVM consistently achieved better covariate balance and improved performance relative to PS methods.
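The OCSVM selection idea described above can be illustrated with a minimal sketch. It is not the thesis implementation: it assumes scikit-learn's OneClassSVM with an RBF kernel, and the function name select_compatible_external, the nu and gamma settings, and the simulated covariates are all hypothetical choices made for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM


def select_compatible_external(X_current, X_external, nu=0.1, gamma="scale"):
    """Flag external units that fall inside a boundary learned from current data.

    Hypothetical helper: the model is fit on the current study only, so no
    data source assignment mechanism is modeled.
    """
    # Standardize with statistics from the current data only.
    scaler = StandardScaler().fit(X_current)
    model = OneClassSVM(kernel="rbf", nu=nu, gamma=gamma)
    model.fit(scaler.transform(X_current))
    # predict returns +1 for points inside the boundary and -1 otherwise.
    labels = model.predict(scaler.transform(X_external))
    return labels == 1


# Simulated covariates (assumption): half of the external units resemble the
# current study, half come from a clearly shifted population.
rng = np.random.default_rng(0)
X_current = rng.normal(0.0, 1.0, size=(200, 3))
X_ext_near = rng.normal(0.0, 1.0, size=(100, 3))
X_ext_far = rng.normal(6.0, 1.0, size=(100, 3))
X_external = np.vstack([X_ext_near, X_ext_far])

keep = select_compatible_external(X_current, X_external)
print("kept among similar units:", keep[:100].mean())
print("kept among shifted units:", keep[100:].mean())
```

In this sketch, nu bounds the fraction of current units allowed to fall outside the boundary, so it also controls how strictly external units are screened.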
Building on this foundation, Chapter 3 introduces three improvements to OCSVM: (1) a tuning procedure for the γ parameter in the radial basis function kernel, (2) a weighted OCSVM method that incorporates position- and density-based weights to reduce sensitivity to outliers, and (3) a custom kernel function designed to accommodate mixed-type variables. These innovations enhance the robustness, flexibility, and generalizability of OCSVM. Chapter 4 addresses a limitation of OCSVM: treating all borrowed external data within the decision boundary equally can introduce bias when covariate distributions differ between sources. To mitigate this, a hybrid method, OCSVM-EB, is proposed. It first applies OCSVM to trim incompatible external units and then uses entropy balancing (EB) to reweight the remaining data and align covariate distributions. EB imposes constraints to match covariate moments across current and external data. Simulation studies confirm that OCSVM-EB achieves superior covariate balance and improved estimation accuracy compared to PS-based methods.
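The two-step OCSVM-EB idea (trim, then reweight) can be sketched as follows. This is an illustration under stated assumptions, not the thesis code: the trimming step uses scikit-learn's OneClassSVM, the entropy balancing step matches first moments only by minimizing the standard convex dual with SciPy, and the function names ocsvm_trim and entropy_balance and the simulated data are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM


def ocsvm_trim(X_current, X_external, nu=0.1):
    """Step 1 (hypothetical helper): keep external units inside the boundary."""
    scaler = StandardScaler().fit(X_current)
    model = OneClassSVM(kernel="rbf", nu=nu, gamma="scale")
    model.fit(scaler.transform(X_current))
    return model.predict(scaler.transform(X_external)) == 1


def entropy_balance(X_external, target_means):
    """Step 2 (hypothetical helper): weights that match covariate means.

    Minimizes the dual log(sum_i exp(lam . (x_i - m))). At the optimum the
    softmax weights w_i satisfy sum_i w_i * x_i = m and sum_i w_i = 1.
    """
    Z = X_external - target_means

    def dual(lam):
        return logsumexp(Z @ lam)

    def grad(lam):
        s = Z @ lam
        w = np.exp(s - logsumexp(s))
        return Z.T @ w  # weighted mean imbalance

    res = minimize(dual, np.zeros(Z.shape[1]), jac=grad, method="BFGS")
    s = Z @ res.x
    return np.exp(s - logsumexp(s))


# Simulated covariates (assumption): external data are a mixture of a mildly
# shifted population and a clearly incompatible one.
rng = np.random.default_rng(1)
X_current = rng.normal(0.0, 1.0, size=(200, 3))
X_external = np.vstack([
    rng.normal(0.5, 1.0, size=(150, 3)),
    rng.normal(6.0, 1.0, size=(50, 3)),
])

keep = ocsvm_trim(X_current, X_external)
X_kept = X_external[keep]
weights = entropy_balance(X_kept, X_current.mean(axis=0))

print("current means:        ", X_current.mean(axis=0))
print("kept, unweighted mean:", X_kept.mean(axis=0))
print("kept, weighted mean:  ", weights @ X_kept)
```

Higher moments could be balanced in the same way by appending squared or interaction terms to the covariate matrix before calling entropy_balance; which moments the thesis constrains is not stated in the abstract.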