I am interested in developing and applying statistical methods and statistical learning algorithms to advance knowledge in fields including medicine, public health, and insurance. I have been working on a broad spectrum of areas, including recursive partitioning learning, causal inference, dependence modelling, Bayesian methods, and two-phase design.
The creation of many powerful machine learning algorithms was motivated by problems in finance, marketing, retail, and similar fields. These algorithms are often incompatible with, or give misleading results when applied to, data arising from medical studies and epidemiological research, which are typically characterized by censoring, truncation, repeated measurements, and measurement error. I studied Classification and Regression Trees (CART), a simple and popular machine learning algorithm, for data featuring right censoring, interval censoring, current status observation, misclassification, or repeated measurements. I am working on making more machine learning algorithms available for the data types arising in medicine and public health.
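As a hedged illustration of the general idea (a generic sketch, not any specific published method): a survival tree can accommodate right-censored data by scoring each candidate cutpoint with the two-sample log-rank statistic rather than an impurity measure. The helper names below are hypothetical.

```python
def logrank_stat(times, events, group):
    """Two-sample log-rank statistic for right-censored data.
    times: observed times; events: 1 = event, 0 = censored;
    group: 1 if the subject falls in the left child node, else 0."""
    event_times = sorted({t for t, d in zip(times, events) if d == 1})
    o_minus_e, var = 0.0, 0.0
    for t in event_times:
        at_risk = [i for i, ti in enumerate(times) if ti >= t]
        n = len(at_risk)                                   # total at risk
        n1 = sum(group[i] for i in at_risk)                # at risk in group 1
        d = sum(events[i] for i in at_risk if times[i] == t)
        d1 = sum(events[i] * group[i] for i in at_risk if times[i] == t)
        o_minus_e += d1 - d * n1 / n                       # observed - expected
        if n > 1:
            var += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    return o_minus_e ** 2 / var if var > 0 else 0.0

def best_split(x, times, events):
    """Exhaustively search cutpoints on covariate x, keeping the split
    with the largest log-rank statistic (larger = better separation)."""
    best = (None, -1.0)
    for c in sorted(set(x))[:-1]:
        g = [1 if xi <= c else 0 for xi in x]
        s = logrank_stat(times, events, g)
        if s > best[1]:
            best = (c, s)
    return best
```

A cutpoint that cleanly separates short from long survival yields a much larger statistic than an uninformative one, which is exactly what the split search exploits.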
Two-phase sampling is useful when resource constraints prohibit collecting an expensive variable of interest on a large number of subjects, but some correlated variables are easy or inexpensive to collect. The inexpensive correlated variables are collected on a large sample in Phase I, and the expensive variable of interest is obtained on a small subsample in Phase II. I am interested in the design aspects of two-phase studies, with a focus on developing efficient Phase II sampling schemes guided by the Phase I information. I proposed a new Adaptive Two-Phase Design scheme and applied it under various frameworks of statistical inference.
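The two-phase mechanism can be sketched in a few lines. This is a generic illustration with made-up sampling probabilities, not the proposed Adaptive Two-Phase Design: a cheap surrogate W is recorded on everyone in Phase I, the expensive Y is measured only on a W-stratified Phase II subsample, and the population mean of Y is recovered with Horvitz-Thompson (inverse-probability) weights.

```python
import random
random.seed(1)

# Phase I: a cheap binary surrogate W is observed on everyone.
# (For illustration, W is derived directly from Y; in practice it
# would be a separately measured correlated variable.)
N = 20000
pop = []
for _ in range(N):
    y = random.gauss(2.0, 1.0)     # expensive variable, unknown at Phase I
    w = 1 if y > 2.0 else 0        # cheap surrogate correlated with Y
    pop.append((y, w))

# Phase II: subsample within strata defined by W, oversampling the
# stratum judged more informative (probabilities are illustrative).
probs = {0: 0.05, 1: 0.20}
sample = [(y, w) for (y, w) in pop if random.random() < probs[w]]

# Horvitz-Thompson estimate of the population mean of Y: each Phase II
# subject is weighted by the inverse of its sampling probability.
ht = sum(y / probs[w] for y, w in sample) / N
true_mean = sum(y for y, _ in pop) / N
```

Although Y is measured on only a fraction of the cohort, the weighted estimate remains close to the full-cohort mean, which is what makes the design attractive under budget constraints.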
A comprehensive understanding and quantification of human mortality rates is a fundamental task for actuaries, because a reliable mortality forecast is crucial for the pricing and valuation of various life insurance and living benefit products. I am interested in developing learning algorithms that facilitate information borrowing across populations and/or across time, ultimately enhancing the accuracy of mortality forecasting. I introduced a data-driven DSA (deletion-substitution-addition) algorithm that directly learns the best group of populations from a given candidate pool. I also investigated a model-averaging approach over time-shifted models to borrow information across time.
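A minimal sketch of a deletion-substitution-addition style search is below. The data, loss, and pooling rule (a plain average of the selected populations' series) are illustrative assumptions, not the published algorithm.

```python
import random
random.seed(0)

# Hypothetical candidate pool: five populations' series; the target
# population is driven by populations "A" and "B" plus small noise.
T = 30
series = {name: [random.gauss(0, 1) for _ in range(T)] for name in "ABCDE"}
target = [0.5 * (series["A"][t] + series["B"][t]) + random.gauss(0, 0.05)
          for t in range(T)]

def loss(group):
    """Squared error when the target is modelled by the pooled
    average of the selected populations' series."""
    if not group:
        return float("inf")
    fit = [sum(series[g][t] for g in group) / len(group) for t in range(T)]
    return sum((target[t] - fit[t]) ** 2 for t in range(T))

def dsa(candidates, start):
    """Greedy search over groups via deletion, substitution, addition:
    at each step, take the best single move; stop when none improves."""
    current = set(start)
    best = loss(current)
    improved = True
    while improved:
        improved = False
        moves = [current - {g} for g in current]                 # deletion
        moves += [current - {g} | {h}                            # substitution
                  for g in current for h in candidates - current]
        moves += [current | {h} for h in candidates - current]   # addition
        cand = min(moves, key=loss)
        if loss(cand) < best:
            current, best, improved = cand, loss(cand), True
    return current, best
```

Starting from a poor initial group, the three move types let the search escape it and recover the populations that actually drive the target.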
A Polya tree is a 'distribution over distributions' and is often used as a conjugate prior in nonparametric Bayesian statistics. The Dirichlet process is a special case of the Polya tree: the former can only generate discrete distributions, whereas the latter can also generate continuous distributions. I constructed Polya tree distributions and used them in a novel way to develop a sampling algorithm for multimodal distributions. I also introduced a fully nonparametric local regression based on meticulously constructed Polya tree distributions.
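The construction can be sketched directly: a Polya tree on [0, 1) recursively bisects the interval and assigns each left branch a Beta-distributed probability; with the common choice of Beta parameters growing like c·m² at depth m, the random distribution is continuous with probability one. The truncation depth and constants below are illustrative choices for a sketch, not any particular published construction.

```python
import random
random.seed(42)

c = 1.0       # concentration; Beta parameter at level m is c * m**2
depth = 12    # truncation depth of the dyadic tree
branch = {}   # memoised Beta draws: one random split probability per node,
              # shared by all samples from the same random distribution

def pt_sample():
    """Draw one observation from a single random distribution on [0, 1)
    generated by a (truncated) Polya tree prior."""
    lo, hi, path = 0.0, 1.0, ""
    for m in range(1, depth + 1):
        if path not in branch:
            a = c * m * m
            branch[path] = random.betavariate(a, a)   # random left-branch prob
        mid = (lo + hi) / 2
        if random.random() < branch[path]:
            hi, path = mid, path + "0"                # descend left
        else:
            lo, path = mid, path + "1"                # descend right
    return random.uniform(lo, hi)   # spread uniformly within the leaf cell
```

Because the Beta draws are cached per node, repeated calls sample from one fixed realisation of the random distribution rather than a fresh one each time.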
There are scenarios in which data arise from an observational study and interest lies in estimating the causal effect of a treatment (or lack thereof) according to the value of a subgroup variable. A challenge arises when the subgroup variable is incompletely observed. I carried out a series of works addressing the incomplete-data problem in causal inference via doubly weighted estimating equations, a propensity-score-weighted multiple imputation method, a doubly weighted EM algorithm, and “nested doubly robust” estimating equations.
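A toy illustration of the doubly weighted idea (with known, made-up propensity and missingness probabilities, not the estimators in those papers): each complete case is weighted by the inverse of both the treatment propensity and the probability that the incompletely observed variable is seen, so the complete cases stand in for the full sample.

```python
import random
random.seed(7)

N = 50000
effect = 2.0                                 # true average treatment effect
data = []
for _ in range(N):
    s = 1 if random.random() < 0.4 else 0    # confounding subgroup variable
    e = 0.7 if s == 1 else 0.3               # treatment propensity e(S)
    a = 1 if random.random() < e else 0      # treatment received
    y = effect * a + 1.5 * s + random.gauss(0, 1)
    pi = 0.8 if a == 1 else 0.5              # P(S observed | A): MAR missingness
    r = 1 if random.random() < pi else 0     # R = 1 means S is observed
    data.append((s, a, y, e, pi, r))

# Doubly weighted estimate of the average treatment effect: complete
# cases only, each weighted by 1 / (propensity * observation probability).
num1 = sum(a * y / (e * pi) for s, a, y, e, pi, r in data if r)
num0 = sum((1 - a) * y / ((1 - e) * pi) for s, a, y, e, pi, r in data if r)
ate = (num1 - num0) / N
```

An unweighted complete-case contrast would be biased both by confounding through S and by the outcome-dependent missingness; the double weighting corrects both at once.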
Data with complex dependence structures arise commonly in modern scientific research. Copulas and vine copulas are powerful and flexible tools for modelling multivariate distributions, as they allow separate models for the marginal distributions and the dependence structure. I adopted copula-based models for dependence modelling when the data arise from multifaceted event history processes, longitudinal studies, and hierarchical data collection.
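The separation of marginals and dependence is easy to see in a small sketch: a Gaussian copula supplies the dependence on [0, 1]², and any marginals are then imposed through their quantile functions. The Exp(1) and Uniform(0, 10) marginals below are arbitrary choices for illustration.

```python
import math
import random
random.seed(3)

def sample_gaussian_copula(rho, n):
    """Draw n dependent pairs: Gaussian copula with correlation rho
    on the uniform scale, then arbitrary marginals via quantiles."""
    Phi = lambda z: 0.5 * (1 + math.erf(z / math.sqrt(2)))  # standard normal CDF
    out = []
    for _ in range(n):
        z1 = random.gauss(0, 1)
        z2 = rho * z1 + math.sqrt(1 - rho * rho) * random.gauss(0, 1)
        u, v = Phi(z1), Phi(z2)          # dependent uniforms (the copula part)
        x = -math.log(1 - u)             # Exp(1) quantile function
        y = 10 * v                       # Uniform(0, 10) quantile function
        out.append((x, y))
    return out
```

Changing the marginals (the last two lines) leaves the dependence structure untouched, and changing rho leaves the marginals untouched; that modularity is what copula models exploit.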
Published, In Press, or Accepted Manuscripts