Research Interests

I am interested in developing and applying statistical methods and statistical learning algorithms to advance knowledge in fields including medicine, public health, and insurance. I have been working on a broad spectrum of areas including recursive partitioning learning, causal inference, dependence modelling, Bayesian methods, and two-phase design.

Recursive Partitioning

Many powerful machine learning algorithms were motivated by problems in finance, marketing, retail, and similar fields. They are often either incompatible with, or give misleading results when applied to, data arising from medical and epidemiological studies, which are usually characterized by censoring, truncation, repeated measurements, measurement error, and related complications. I have studied Classification and Regression Trees, a simple and popular machine learning algorithm, for data that feature right censoring, interval censoring, current status observation, misclassification, or repeated measurements. I am working on making more machine learning algorithms available for the data types that arise in medicine and public health.

Two-phase Sampling

Two-phase sampling is useful when resource constraints prohibit collecting data on an expensive variable of interest from a large number of subjects, but it is easy or inexpensive to collect data on correlated variables. The inexpensive correlated variables are collected on a large sample in Phase I, and the expensive variable of interest is obtained on a small subsample in Phase II. I am interested in the design aspects of two-phase studies, with a focus on developing efficient Phase II sampling schemes guided by the Phase I information. I proposed a new Adaptive Two-Phase Design scheme and applied it under various frameworks of statistical inference.

Mortality Forecasting

A comprehensive understanding and quantification of human mortality rates is a fundamental task for actuaries, because a reliable mortality forecast is crucial for the pricing and valuation of life insurance and living benefits products. I am interested in developing learning algorithms that facilitate information borrowing across populations and/or across time and thereby enhance the accuracy of mortality forecasting. I introduced a data-driven DSA (deletion-substitution-addition) algorithm that directly “learns” the best group of populations from a given candidate pool. I also investigated model averaging over time-shifted models to borrow information across time.

Polya Tree

A Polya tree is a 'distribution over distributions' and is often used as a conjugate prior in nonparametric Bayesian statistics. The Dirichlet process is a special case of the Polya tree; the former can only generate discrete distributions, whereas the latter can also generate continuous distributions. I constructed a Polya tree distribution and used it in a novel way to develop an algorithm for sampling from multimodal distributions. I also introduced a fully nonparametric local regression method based on carefully constructed Polya tree distributions.

Causal Inference

In some settings, data arise from an observational study and interest lies in estimating the causal effect of a treatment (or lack thereof) according to the value of a subgroup variable. A challenge arises when the subgroup variable is incompletely observed. I have addressed this incomplete data problem in causal inference in a series of papers using doubly weighted estimating equations, a propensity score weighted multiple imputation method, a doubly weighted EM algorithm, and “nested doubly robust” estimating equations.

Dependence Modelling

Data with complex dependence structures arise commonly in modern scientific research. Copulas and vine copulas are powerful and flexible tools for modelling multivariate distributions because they allow separate models for the marginal distributions and the dependence structure. I have adopted copula-based models for dependence modelling with data arising from multifaceted event history processes, longitudinal studies, and hierarchical structures.


Under Revision

Yang, C., Cook, R.J., Diao, L., 2023+. Secondary Analysis and Sequential Design of Two-Phase Studies. Under Minor Revision for Statistical Methods in Medical Research.         

Yang, C., Li, X., Diao, L., Cook, R.J., 2023+. Regression Trees for Interval-censored Failure Time Data Based on Censoring Unbiased Transformations and Pseudo-Observations. Revision Submitted to Canadian Journal of Statistics.  

Published, In Press, or Accepted Manuscripts

Diao, L., Yi, Y., 2023. Classification Trees for Misclassified Responses. Accepted by Journal of Classification.

Diao, L., Meng, Y., Weng, C., Wirjanto, T., 2023. Enhancing Mortality Forecasting Through Bivariate Model Based Ensemble. Accepted by North American Actuarial Journal. 

Cuerden, M., Diao, L., Cotton, C., Cook, R.J., 2022. Doubly Weighted Mean Score Estimating Functions with a Partially Observed Effect Modifier. Accepted by Communications in Statistics - Theory and Methods.  

Zhuang, H., Diao, L., Yi, Y., 2023. Polya-Tree Monte Carlo Method. Computational Statistics & Data Analysis 180. Published Online at

Battista, K., Diao, L., Dubin, J.A., Patte, K.A., Leatherdale, S.T., 2023. Examining the Use of Decision Trees in Population Health Surveillance Research: An Application to Youth Mental Health Survey Data in the COMPASS Study. Health Promotion and Chronic Disease Prevention in Canada 43:2.

Yang, C., Diao, L., Cook, R.J., 2022. Adaptive Two-Phase Designs: Some Results on Robustness and Efficiency. Statistics in Medicine. 41(22): 4403-4425.

Zhuang, H., Diao, L., Yi, Y., 2022. Polya Tree Based Nearest Neighbour Regression. Statistics and Computing. 32:59

Cuerden, M., Diao, L., Cotton, C., Cook, R.J., 2022. Doubly weighted estimating equations and weighted multiple imputation for causal inference with an incomplete subgroup variable. Biostatistics and Epidemiology. Published Online at DOI: 10.1080/24709360.2022.2069457 

Battista, K., Patte, K.A., Diao, L., Dubin, J.A., Leatherdale, S.T., 2022. Using Decision Trees to Examine Environmental and Behavioural Factors Associated with Youth Anxiety, Depression, and Flourishing. International Journal of Environmental Research and Public Health 19(17):10873. 

Diao, L., Cook, R.J., 2021. Nested Doubly Robust Estimating Equations for Causal Analysis with an Incomplete Effect Modifier. Canadian Journal of Statistics 50(3) 776-794.

Zhuang, H., Diao, L., Yi, Y., 2022. A Bayesian Nonparametric Mixture Model for Grouping Dependence Structures and Selecting Copula Functions. Econometrics and Statistics. 22 172-189.

Yang, C., Diao, L., Cook, R.J., 2021. Survival Trees for Current Status Data. Proceedings of AAAI Spring Symposium on Survival Prediction - Algorithms, Challenges, and Applications, Proceedings of Machine Learning Research 146, 83-94.

Zhuang, H., Diao, L., Yi, Y., 2021. A Vine Copula Model for Climate Trend Analysis using Canadian Temperature Data. Journal of Data Science. 19(1) 37–55.

Diao, L., Meng, Y., Weng, C., 2021. A DSA Algorithm for Mortality Forecasting. North American Actuarial Journal. 25(3) 438-458.

Zhuang, H., Diao, L., Yi, Y., 2020. A Bayesian Hierarchical Copula Model. Electronic Journal of Statistics. 14(2), 4457-4488.

Steingrimsson, J.A.*, Diao, L.*, Strawderman, R.L., 2019. Censoring Unbiased Regression Trees and Ensembles. Journal of the American Statistical Association 114(525), 370-383.

Diao, L. and Weng, C., 2019. Regression Tree Credibility Model. North American Actuarial Journal 23(2), 169-196.

Steingrimsson, J.A., Diao, L., Molinaro, A.M., Strawderman, R.L., 2016. Double Robust Survival Trees. Statistics in Medicine 35(20), 3595-3612.

Diao, L., Cook, R.J., Lee, K.-A., 2014. Statistical Analysis of Recurrent Adverse Events in Statistical Methods for Evaluating Safety in Medical Product Development Edited by A. L. Gould, John Wiley & Sons, Ltd, Chichester, UK, 180-192.

Diao, L. and Cook, R.J., 2014. Composite Likelihood for Joint Analysis of Multiple Multistate Processes via Copulas. Biostatistics 15(4), 690-705.

Diao, L., Cook, R.J. and Lee, K.-A., 2013. A Copula Model for Marked Point Processes. Lifetime Data Analysis 19(4), 463-489.

Diao, L., 2013. Copula Models for Multi-type Life History Processes. Doctoral Dissertation, The University of Waterloo.

Note: “ * ” indicates the joint first authorship.