dc.description.abstracteng | Genotyping arrays have greatly facilitated genetic epidemiological studies of genetic risk factors for numerous complex diseases, such as psychiatric disorders, and the use of genome-wide association analysis (GWAS) is firmly established. More recently, DNA methylation arrays have enabled genome-wide profiling of the methylome, adding a further layer to contemporary genetic epidemiological study designs. One example is the Genetics of Lipid Lowering Drugs and Diet Network (GOLDN) Lipidomics Study, which identified methylation markers (CpG markers) and single nucleotide polymorphisms (SNPs) associated with the change in triglyceride levels after drug intervention. Genotyping and methylation arrays assay several hundred thousand markers; however, single-marker association analysis suffers greatly from the burden of multiple testing. Set-based (SNP-set or CpG-set) association approaches offer greater flexibility because they allow a set of variants to be tested jointly. For instance, a polygenic risk score (PRS) is a set-based approach that, in addition to the strongly associated SNPs identified by large-scale GWAS, includes SNPs with moderate to weak effects. The genotypes of the SNP set in the PRS are taken from an independent sample (target sample) and weighted by the individual SNP effects derived from a relevant GWAS performed on a separate sample (discovery sample), yielding a cumulative score for each individual in the target sample. The target phenotype is then regressed on the resulting score, and the regression model is evaluated by the proportion of variance in the target phenotype explained (R2) by the PRS. Another strategy of set-based association analysis is kernel machine regression (KMR), a semi-parametric regression approach in which the effects of the markers within a set (CpG set or SNP set) are modelled via a kernel function and evaluated by a single variance-component test. 
A kernel function computes the pairwise genomic similarity between individuals, that is, the inner product over the set of variants under analysis, which may comprise a gene or a biological pathway. For my first article, I performed a simulation study to evaluate the performance of the PRS when the discovery and target traits are correlated, considering target sample sizes of n=200, 500, and 1000. A PRS for correlated traits arises, for example, when a schizophrenia PRS is calculated for psychosocial endophenotypes such as the Global Assessment of Functioning (GAF) score or the Positive and Negative Syndrome Scale (PANSS) score. With this situation in mind, I simulated four target traits with varying degrees of correlation (r2) with the discovery trait, i.e., r2=1.00, 0.8, 0.6, and 0.4. The results demonstrated that the average R2 estimated by the PRS decreased roughly with the square of the correlation between the target and discovery traits. In addition, the range of the estimated R2 was widest for the smallest target sample size, n=200. Thus, the simulation findings alert researchers conducting clinical studies with endophenotypes to two important factors: first, the size of the target sample and, second, the genetic correlation shared between the target and discovery traits. In my second article, I implemented a KMR approach for set-based association testing of a CpG set; KMR has already been successfully employed on SNP sets. In preparing the second article, I used real and simulated datasets (the latter based on the real dataset) from the GOLDN study, provided by the Genetic Analysis Workshop 20 (GAW20). GOLDN is a longitudinal study with individuals recruited from pedigrees. In my analysis, I used only independent individuals, which restricted the sample size in the real and simulated datasets to n<200. In the real dataset, CpG sets were devised using the evidence of association reported by the GOLDN study. 
For the simulated datasets, the true causal CpGs were provided by GAW20. Thus, I defined candidate genomic regions of varying lengths while keeping the associated CpG(s) inside the region. The results replicated the evidence of association reported by GOLDN in the real data and, albeit only nominally, in the simulated datasets. Moreover, in the simulated data, causal SNPs exert their full effect on the phenotypes when the causal CpG loci have no methylation (β-value=0). Thus, I also modelled an interaction term along with the main effects, and this model yielded a significant association. As part of the discussion, simulation results on the performance of the linear kernel for a CpG set with original (β-values) and logit-transformed (M-values) methylation values indicated that the logit transformation results in a loss of power. There, I also analysed an additive kernel that combines the genotype kernel and the methylation kernel and then tests for association with the phenotype. The initial simulations suggest that an additive kernel with a CpG set that simultaneously includes hypo-, semi-, and hypermethylated sites might not improve the model over including only a SNP set. However, it appears fruitful to investigate further the situation in which only one type of methylation state is present in a CpG set. | de |
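
The kernel quantities named in the abstract can be illustrated with a minimal Python sketch. It is not the analysis code used in the thesis, and it rests on several assumptions that the abstract does not state: methylation values are proportions between 0 and 1, the M-value is taken as the base-2 logit of that proportion, the linear kernel is the unscaled inner product between individuals, and the additive kernel is the unweighted sum of the genotype and methylation kernels. The function names, the clipping constant `eps`, and the toy data are hypothetical.

```python
import math


def beta_to_m(beta, eps=1e-6):
    """Base-2 logit transform of a methylation proportion into an M-value.

    eps is a hypothetical clipping constant: a proportion of exactly
    0 or 1 would otherwise map to minus or plus infinity.
    """
    b = min(max(beta, eps), 1.0 - eps)
    return math.log2(b / (1.0 - b))


def linear_kernel(rows):
    """Pairwise inner products between individuals.

    rows: one list of marker values per individual (same markers, same order).
    Returns an n x n similarity matrix as a list of lists.
    """
    return [[sum(a * b for a, b in zip(ri, rj)) for rj in rows] for ri in rows]


def additive_kernel(k1, k2):
    """Element-wise sum of two kernel matrices of equal shape (assumed unweighted)."""
    return [[x + y for x, y in zip(r1, r2)] for r1, r2 in zip(k1, k2)]


# Toy data (hypothetical): 3 individuals, 2 SNPs coded 0/1/2, 2 CpG sites.
genotypes = [[0, 1], [1, 2], [2, 0]]
betas = [[0.1, 0.5], [0.5, 0.9], [0.2, 0.8]]

k_snp = linear_kernel(genotypes)                      # genotype kernel
k_cpg = linear_kernel(betas)                          # methylation kernel, original scale
k_cpg_m = linear_kernel([[beta_to_m(b) for b in row]  # methylation kernel, M-value scale
                         for row in betas])
k_add = additive_kernel(k_snp, k_cpg)                 # combined kernel
```

In a kernel machine regression, a matrix such as `k_snp`, `k_cpg`, or `k_add` would then enter a variance-component test against the phenotype; that testing step is not sketched here.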