Doctor of Public Health (Dr.PH.)
Degree Granting Department
Epidemiology and Biostatistics
Yangxin Huang, Ph.D.
Henian Chen, M.D., Ph.D.
Yiliang Zhu, Ph.D.
Getachew Dagne, Ph.D.
Julie Baldwin, Ph.D.
Substance use data, such as alcohol consumption data, often contain a high proportion of zeros. In studies
examining alcohol consumption among college students, for instance, many students may not drink
during the study period, resulting in a large number of zeros. Zero-inflated continuous data, also called
semi-continuous data, typically consist of a mixture of a degenerate distribution at the origin (zero) and
a right-skewed, continuous distribution for the positive values. Ignoring the extreme non-normality
in semi-continuous data may lead to substantially biased estimates and inference. Longitudinal or
repeated measures of semi-continuous data present special challenges in statistical inference because
of the correlation among the repeated measures on the same subject.
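The mixture structure described above can be made concrete with a small simulation. This is only an illustrative sketch: the function name `simulate_semicontinuous`, the zero probability of 0.4, and the log-normal choice for the positive part are assumptions made for the example, not settings taken from the dissertation.

```python
import random
import statistics


def simulate_semicontinuous(n, p_zero=0.4, mu=0.5, sigma=0.8, seed=0):
    """Draw n values: exactly zero with probability p_zero,
    otherwise a right-skewed log-normal(mu, sigma) value."""
    rng = random.Random(seed)
    values = []
    for _ in range(n):
        if rng.random() < p_zero:
            values.append(0.0)  # point mass at the origin
        else:
            values.append(rng.lognormvariate(mu, sigma))  # positive part
    return values


def sample_skewness(xs):
    """Moment-based sample skewness (positive means a long right tail)."""
    m = statistics.fmean(xs)
    s = statistics.pstdev(xs)
    return sum(((x - m) / s) ** 3 for x in xs) / len(xs)


y = simulate_semicontinuous(5000)
prop_zero = sum(v == 0.0 for v in y) / len(y)
positives = [v for v in y if v > 0.0]

print(f"proportion of zeros: {prop_zero:.3f}")
print(f"skewness of positive values: {sample_skewness(positives):.2f}")
```

A histogram of `y` would show the spike at zero next to a long right tail, which is the shape that a single normal distribution cannot capture.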
Linear mixed-effects models (LMMs) with a normality assumption, which are routinely used to analyze
correlated continuous outcomes, are inapplicable to semi-continuous outcomes. Data
transformation, such as log transformation, is typically used to correct non-normality in data.
However, log-transformed data, after the addition of a small constant to handle zeros, may not
successfully approximate the normal distribution because of the spike caused by the zeros in the original
observations. In addition, data transformation should be avoided for several reasons: (i)
transformation usually provides reduced information on the underlying data generation mechanism;
(ii) results on the transformed scale are difficult to interpret; and
(iii) it may cause re-transformation bias. Two-part mixed-effects models, with one component
modeling the probability of being zero and one modeling the intensity of nonzero values, have
been developed over the last ten years to analyze longitudinal semi-continuous data. However,
log transformation is still needed for the right-skewed nonzero continuous values in the two-part
models.
In this research, we developed Bayesian hierarchical models in which the extreme non-normality
in longitudinal semi-continuous data, caused by the spike at zero and right skewness, was
accommodated using skew-elliptical (SE) distributions, and all inferences were carried out through
a Bayesian approach via Markov chain Monte Carlo (MCMC). The substance abuse/dependence data,
including alcohol abuse/dependence symptoms (AADS) data and marijuana abuse/dependence
symptoms (MADS) data from a longitudinal observational study, were used to illustrate the
proposed models and methods. This dissertation explored three topics:
First, we presented a one-part LMM with a skew-normal (SN) distribution under a Bayesian
framework and applied it to the AADS data. The association of AADS with the serotonin transporter
gene polymorphism (5-HTTLPR) and baseline covariates was analyzed. The results from the proposed
model were compared with those from LMMs with normal, Gamma, and log-normal (LN) distributional
assumptions. Simulation studies were conducted to evaluate the performance of the proposed models.
We concluded that the LMM with the SN distribution not only provides the best model fit based on
the Deviance Information Criterion (DIC), but also offers a more intuitive and convenient interpretation
of results, because it models the response variable on its original scale.
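To give a feel for how a skew-normal distribution departs from the normal, the sketch below draws SN values through a standard stochastic representation (a weighted sum of a half-normal and an independent normal variable). The parameterization by `location`, `scale`, and `shape` is an assumption made for this illustration and may differ from the one used in the dissertation; the function names are hypothetical.

```python
import math
import random
import statistics


def rskewnorm(n, location=0.0, scale=1.0, shape=3.0, seed=1):
    """Draw n skew-normal values using the representation
    delta * |Z0| + sqrt(1 - delta^2) * Z1, with Z0, Z1 independent N(0, 1)."""
    rng = random.Random(seed)
    delta = shape / math.sqrt(1.0 + shape ** 2)
    draws = []
    for _ in range(n):
        z0 = abs(rng.gauss(0.0, 1.0))  # half-normal part that induces skewness
        z1 = rng.gauss(0.0, 1.0)       # symmetric normal part
        z = delta * z0 + math.sqrt(1.0 - delta ** 2) * z1
        draws.append(location + scale * z)
    return draws


def skewnorm_mean(location, scale, shape):
    """Theoretical mean under this parameterization."""
    delta = shape / math.sqrt(1.0 + shape ** 2)
    return location + scale * delta * math.sqrt(2.0 / math.pi)


x = rskewnorm(20000, location=1.0, scale=2.0, shape=3.0)
print(f"sample mean:      {statistics.fmean(x):.3f}")
print(f"theoretical mean: {skewnorm_mean(1.0, 2.0, 3.0):.3f}")
```

Setting `shape=0` recovers the ordinary normal distribution, so under this parameterization the normal LMM is a special case of the SN model, and the response can stay on its original scale without a log transformation.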
Second, we proposed a flexible two-part mixed-effects model with skew distributions, including
skew-t (ST) and SN distributions, for the right-skewed nonzero values in Part II of the model under a
Bayesian framework. The proposed model was illustrated with the longitudinal AADS data, and the
results from models with ST, SN, and normal distributions were compared under different
random-effects structures. Simulation studies were conducted to evaluate the performance of the proposed
models.
Third, multivariate (bivariate) correlated semi-continuous data are also commonly encountered
in clinical research. For instance, alcohol use and marijuana use may be observed in the
same subject, and underlying common factors might induce dependence between the two.
There is very limited literature on multivariate analysis of semi-continuous
data. We proposed a Bayesian approach to analyze bivariate semi-continuous outcomes by jointly
modeling a logistic mixed-effects model on zero-inflation in either response and a bivariate linear
mixed-effects model (BLMM) on the positive values through a correlated random-effects structure.
Multivariate skew distributions, including ST and SN distributions, were used to relax the normality
assumption in the BLMM. The proposed models were illustrated with an application to the longitudinal
AADS and MADS data. A simulation study was conducted to evaluate the performance of the
proposed models.
Scholar Commons Citation
Xing, Dongyuan, "Bayesian Inference on Longitudinal Semi-continuous Substance" (2015). Graduate Theses and Dissertations.