Graduation Year


Document Type




Degree Name

Doctor of Public Health (Dr.PH.)

Degree Granting Department

Epidemiology and Biostatistics

Major Professor

Yangxin Huang, Ph.D.

Committee Member

Henian Chen, M.D., Ph.D.

Committee Member

Yiliang Zhu, Ph.D.

Committee Member

Getachew Dagne, Ph.D.

Committee Member

Julie Baldwin, Ph.D.


Substance use data, such as alcohol consumption data, often contain a high proportion of zeros. In studies

examining alcohol consumption in college students, for instance, many students may not drink

in the studied period, resulting in a number of zeros. Zero-inflated continuous data, also called

semi-continuous data, typically consist of a mixture of a degenerate distribution at the origin (zero) and

a right-skewed, continuous distribution for the positive values. Ignoring the extreme non-normality

in semi-continuous data may lead to substantially biased estimates and inference. Longitudinal or

repeated measures of semi-continuous data present special challenges in statistical inference because

of the correlation among the repeated measures on the same subject.
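
The mixture structure described above can be made concrete with a small simulation. The following is a minimal sketch, not the dissertation's data or method: it assumes a log-normal distribution for the positive part, and the function names and parameter values (`p_zero`, `mu`, `sigma`) are hypothetical choices for illustration only.

```python
import random
import statistics


def simulate_semicontinuous(n, p_zero, mu, sigma, seed=0):
    """Draw n values from a point mass at zero mixed with a log-normal.

    p_zero is the assumed probability of an exact zero; mu and sigma are the
    assumed log-scale mean and standard deviation of the positive part.
    """
    rng = random.Random(seed)
    values = []
    for _ in range(n):
        if rng.random() < p_zero:
            values.append(0.0)  # degenerate component at the origin
        else:
            values.append(rng.lognormvariate(mu, sigma))  # right-skewed part
    return values


def sample_skewness(xs):
    """Moment-based sample skewness (population form)."""
    m = statistics.fmean(xs)
    s = statistics.pstdev(xs)
    return sum(((x - m) / s) ** 3 for x in xs) / len(xs)


y = simulate_semicontinuous(n=5000, p_zero=0.4, mu=0.5, sigma=0.8)
zero_share = sum(1 for v in y if v == 0.0) / len(y)
positives = [v for v in y if v > 0]

print(f"share of zeros: {zero_share:.3f}")
print(f"skewness of positive values: {sample_skewness(positives):.2f}")
```

A histogram of `y` would show the spike at zero next to a long right tail, which is the shape that a single normal distribution cannot describe.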

Linear mixed-effects models (LMMs) with a normality assumption, which are routinely used to analyze

correlated continuous outcomes, are inapplicable to semi-continuous outcomes. Data

transformation such as log transformation is typically used to correct the non-normality in data.

However, log-transformed data, after the addition of a small constant to handle zeros, may not

successfully approximate the normal distribution due to the spike caused by the zeros in the original

observations. In addition, the reasons that data transformation should be avoided include: (i)

transforming usually provides reduced information on an underlying data generation mechanism;

(ii) data transformation causes difficulty in regard to interpretation of the transformed scale; and

(iii) it may cause re-transformation bias. Two-part mixed-effects models with one component

modeling the probability of being zero and one modeling the intensity of nonzero values have

been developed over the last ten years to analyze the longitudinal semi-continuous data. However,

log transformation is still needed for the right-skewed nonzero continuous values in the two-part models.
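
To illustrate the two-part idea, the sketch below writes out the log-likelihood contribution of a single observation: one part for the probability of a zero, one part for the density of a positive value. It is an illustrative sketch under stated assumptions, not the models fitted in this dissertation: the positive part is taken to be log-normal (the conventional choice that requires the log transformation), random effects are omitted, and all function and parameter names are hypothetical.

```python
import math


def inv_logit(eta):
    """Map a linear predictor to a probability, as in a logistic model."""
    return 1.0 / (1.0 + math.exp(-eta))


def lognormal_logpdf(y, mu, sigma):
    """Log density of a log-normal variable at y > 0."""
    z = (math.log(y) - mu) / sigma
    return (-math.log(y) - math.log(sigma)
            - 0.5 * math.log(2.0 * math.pi) - 0.5 * z * z)


def two_part_loglik(y, p_zero, mu, sigma):
    """Log-likelihood of one semi-continuous observation.

    Part I:  P(y = 0) = p_zero.
    Part II: y | y > 0 is log-normal(mu, sigma), weighted by 1 - p_zero.
    """
    if y == 0:
        return math.log(p_zero)
    return math.log1p(-p_zero) + lognormal_logpdf(y, mu, sigma)


# Hypothetical repeated measures for one subject and assumed parameter values.
observations = [0.0, 2.5, 0.0, 4.1]
p_zero = inv_logit(-0.2)
total = sum(two_part_loglik(y, p_zero, mu=1.0, sigma=0.6) for y in observations)
print(f"log-likelihood for this subject: {total:.3f}")
```

In a mixed-effects version, `p_zero` and `mu` would each depend on covariates and on subject-level random effects, which is where the correlation among repeated measures enters.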


In this research, we developed Bayesian hierarchical models in which the extreme non-normality

in the longitudinal semi-continuous data caused by the spike at zero and right skewness was

accommodated using skew-elliptical (SE) distributions, and all of the inferences were carried out through

a Bayesian approach via Markov chain Monte Carlo (MCMC). The substance abuse/dependence data,

including alcohol abuse/dependence symptoms (AADS) data and marijuana abuse/dependence symptoms (MADS) data from a longitudinal observational study, were used to illustrate the

proposed models and methods. This dissertation explored three topics:

First, we presented a one-part LMM with a skew-normal (SN) distribution under a Bayesian

framework and applied it to the AADS data. The association of AADS with the serotonin transporter gene

polymorphism (5-HTTLPR) and baseline covariates was analyzed. The results from the proposed

model were compared with those from LMMs with normal, Gamma, and log-normal (LN) distributional

assumptions. Simulation studies were conducted to evaluate the performance of the proposed models.

We concluded that the LMM with the SN distribution not only provides the best model fit based on the

Deviance Information Criterion (DIC), but also offers a more intuitive and convenient interpretation

of results, because it models the response variable on its original scale.
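
For readers unfamiliar with the SN distribution, the sketch below evaluates a skew-normal density and its mean. It assumes Azzalini's location-scale-shape parameterization, which may differ from the skew-elliptical parameterization used in the dissertation; the function names and the symbols `xi`, `omega`, and `alpha` are introduced here for illustration only.

```python
import math


def normal_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    """Standard normal distribution function via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def skew_normal_pdf(x, xi, omega, alpha):
    """Skew-normal density with location xi, scale omega, shape alpha.

    alpha > 0 gives right skewness; alpha = 0 recovers the normal density.
    """
    z = (x - xi) / omega
    return 2.0 / omega * normal_pdf(z) * normal_cdf(alpha * z)


def skew_normal_mean(xi, omega, alpha):
    """Mean of the skew-normal: xi + omega * delta * sqrt(2 / pi)."""
    delta = alpha / math.sqrt(1.0 + alpha * alpha)
    return xi + omega * delta * math.sqrt(2.0 / math.pi)


# With alpha = 3 the density is right-skewed and the mean lies above xi.
print(f"density at x = 1: {skew_normal_pdf(1.0, xi=0.0, omega=1.0, alpha=3.0):.4f}")
print(f"mean: {skew_normal_mean(xi=0.0, omega=1.0, alpha=3.0):.4f}")
```

Because the skewness is carried by the shape parameter rather than by a transformation of the response, the location and scale remain on the original measurement scale.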

Second, we proposed a flexible two-part mixed-effects model with skew distributions, including

skew-t (ST) and SN distributions, for the right-skewed nonzero values in Part II of the model under a

Bayesian framework. The proposed model was illustrated with the longitudinal AADS data, and the

results from models with ST, SN, and normal distributions were compared under different

random-effects structures. Simulation studies were conducted to evaluate the performance of the proposed

models.


Third, multivariate (bivariate) correlated semi-continuous data are also commonly encountered

in clinical research. For instance, alcohol use and marijuana use may be observed in the

same subject, and underlying common factors may induce dependence between alcohol

and marijuana use. There is very limited literature on multivariate analysis of semi-continuous

data. We proposed a Bayesian approach to analyze bivariate semi-continuous outcomes by jointly

modeling a logistic mixed-effects model on zero-inflation in either response and a bivariate linear

mixed-effects model (BLMM) on the positive values through a correlated random-effects structure.

Multivariate skew distributions including ST and SN distributions were used to relax the normality

assumption in BLMM. The proposed models were illustrated with an application to the longitudinal

AADS and MADS data. A simulation study was conducted to evaluate the performance of the

proposed models.

Included in

Biostatistics Commons