Sparse Modeling of SNPs


During the summer of 2019 I worked as a computational biologist at Lawrence Berkeley National Laboratory in Kristofer Bouchard's lab, analyzing sparse models for identifying SNPs related to a phenotypic trait. The question of which genotypes are responsible for a given phenotype has existed since Mendel's time and is a recurring theme in genetics. Many genotypic-phenotypic meta-analyses report that their methods accurately explain the phenotypic variance within their dataset, yet the set of responsible single nucleotide polymorphisms (SNPs) they find is too large to be interpreted. Conversely, analyses that produce a sparse subset of responsible SNPs often cannot explain the phenotypic variance seen in the dataset. This project aimed to find a robust method that avoids this trade-off, producing a sparse subset of SNPs while still accurately explaining the phenotypic variance within the data. To that end, Linear Regression and the Union of Intersections (UoI) method were applied to continuous phenotypic traits in an inbred mouse dataset of 350 mice with measured weight and speed.


Because the project studies regression methods, it focuses on continuous phenotypic traits, here weight and speed. The regression methods for speed and weight are evaluated on a dataset of 350 mice. To keep track of the genotypic differences between mice, 120k SNPs are recorded for each mouse in the dataset. Finally, to interpret the contributions made by the SNPs, Mouse Genome Informatics (MGI), the international database for mouse genomics, was used to map SNPs to specific genes and identify their gene types and features.


To find the best-performing method, three methods were evaluated for each of the two phenotypes, speed and weight: one UoI model and two linear regression models provided by scikit-learn, one with an L1 penalty (Lasso Regression) and one with no penalty (see figure below). For both phenotypes, the three methods were trained on the mouse dataset. Before fitting UoI, Lasso Regression, and Linear Regression, the data were partitioned into training and test sets with an 80/20 split stratified by strain. Each method was fitted on the training set and produced a set of coefficients and an intercept. Each coefficient maps to one column of a one-hot encoding of a SNP, and its value represents the importance of a specific genotype in explaining the variance of a given phenotype. A non-zero coefficient indicates that the SNP it maps to contributes to explaining the phenotypic trait; a zero coefficient can be ignored.
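The workflow above can be sketched with scikit-learn as follows. This is a minimal illustration on synthetic data, not the project's actual code: the 40 SNPs, the 0/1/2 genotype coding, the 35 strains of 10 mice each, the simulated phenotype, and the Lasso penalty strength alpha=0.05 are all assumptions made for the example. The UoI model is left out so that the sketch depends only on scikit-learn and NumPy.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)

# Synthetic stand-in for the real data (all sizes below are assumptions):
# 350 mice, 40 SNPs instead of 120k, genotypes coded 0/1/2,
# and 35 hypothetical strains with 10 mice each.
n_mice, n_snps, n_strains = 350, 40, 35
genotypes = rng.integers(0, 3, size=(n_mice, n_snps))
strain = np.repeat(np.arange(n_strains), n_mice // n_strains)

# One-hot encode every SNP: one column per (SNP, genotype) pair.
X = OneHotEncoder().fit_transform(genotypes).toarray()

# Simulated continuous phenotype driven by two genotypes plus noise.
y = (
    1.5 * (genotypes[:, 0] == 2)
    - 1.0 * (genotypes[:, 5] == 0)
    + rng.normal(scale=0.5, size=n_mice)
)

# 80/20 split, stratified so every strain appears in both sets.
X_train, X_test, y_train, y_test, strain_train, strain_test = train_test_split(
    X, y, strain, test_size=0.2, stratify=strain, random_state=0
)

models = {
    "Lasso Regression": Lasso(alpha=0.05),    # L1 penalty; alpha chosen arbitrarily
    "Linear Regression": LinearRegression(),  # no penalty
}

n_selected = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # Non-zero coefficients mark the one-hot encoded SNPs the model selected.
    n_selected[name] = int(np.count_nonzero(model.coef_))
    print(
        f"{name}: {n_selected[name]} non-zero of {model.coef_.size} coefficients, "
        f"test R2 = {model.score(X_test, y_test):.3f}"
    )
```

On data like this the L1 penalty drives most coefficients to exactly zero, while the unpenalized fit assigns a non-zero weight to nearly every column, which is the contrast the comparison in this study relies on.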

Analyzing Sparsity

The six fitted models (three methods for each of the two phenotypes) were analyzed for accuracy and sparsity by evaluating the coefficients each produced. The coefficients directly indicate the sparsity of the results: a non-zero coefficient means that the SNP it maps to has a role in explaining the variance of a phenotype, so comparing the number of non-zero coefficients with the total number of coefficients shows how sparse the selected set of one-hot encoded SNPs is. Sparsity alone is not enough, however; it is crucial to also evaluate how well the selected SNPs describe the variance observed in the phenotype. Computing the R2 score of each method on the held-out test set quantifies this. Finally, the Bayesian information criterion (BIC) was computed for each method to measure how well it explains the phenotype relative to the number of features it uses.
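The three quantities described above can be sketched in plain Python. The function names are hypothetical. The BIC shown is the common Gaussian-residual form, n * ln(RSS / n) + k * ln(n), which is an assumption: the text does not state which BIC formulation the project used.

```python
import math


def count_selected(coefs, tol=0.0):
    """Return (number of non-zero coefficients, fraction of all coefficients)."""
    n_selected = sum(1 for c in coefs if abs(c) > tol)
    return n_selected, n_selected / len(coefs)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def gaussian_bic(y_true, y_pred, n_params):
    """BIC under Gaussian residuals (assumed form): n*ln(RSS/n) + k*ln(n)."""
    n = len(y_true)
    rss = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return n * math.log(rss / n) + n_params * math.log(n)


# Toy held-out values and predictions, purely illustrative.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.2, 3.8]
coefs = [0.0, 0.5, 0.0, -1.2]

n_nonzero, fraction = count_selected(coefs)
print(n_nonzero, fraction)
print(r2_score(y_true, y_pred))
print(gaussian_bic(y_true, y_pred, n_params=n_nonzero))
```

With predictions held fixed, a model that needs more non-zero coefficients receives a higher BIC, which is why a dense fit with a good R2 can still score poorly on this criterion.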

(a.1&2) One-hot encoded SNP coefficients estimated by UoI Regression and Lasso Regression for phenotypes speed and weight. Of the 120K total SNPs, the SNPs identified by each model are visualized as vertical lines indicating the coefficient weight produced by the model. (b.1&2) Scatter plots of true phenotypic values versus values predicted by UoI Regression and Lasso Regression for phenotypes speed and weight. (c.1&2) Method performance for UoI Lasso, Lasso Regression, and Linear Regression, by phenotype (speed and weight), showing the sparsity of selected SNPs, R2 score, and BIC for each method. For both phenotypes UoI selects fewer SNPs and attains a lower BIC, as desired, while maintaining a competitive R2 score. The high BIC and large number of selected SNPs demonstrate that Linear Regression without a penalty is impractical.

Further Interpretation of Coefficients

The non-zero coefficients produced by each method, together with each method's R2 score and BIC, provided insight into the selected one-hot encoded SNPs and the performance of each method. While Linear Regression produced impressive R2 scores, the number of SNPs it identified was too large to be interpreted, which is reflected in its high BIC. As predicted, UoI and Lasso Regression found a small number of responsible SNPs while maintaining high R2 and low BIC scores for each phenotype. Comparing the sets of responsible SNPs found by UoI and Lasso Regression demonstrates the general succinctness of UoI: UoI identified 153 and 132 SNPs for speed and weight, respectively, while Lasso Regression identified 161 and 408. Furthermore, UoI's R2 scores are comparable to those of Lasso Regression, even though UoI uses a smaller feature set. This interpretation is supported by the consistently low BIC of the UoI results, which suggests that Lasso Regression may be overfitting. The accuracy achieved by UoI, in tandem with the sparsity of the identified SNPs, demonstrates UoI's success in creating sparse yet robust results. It provides a new level of interpretability in this genotypic-phenotypic study and could potentially do the same in a GWAS meta-analysis.
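To make the comparison concrete, the ratio of SNPs selected by UoI to SNPs selected by Lasso Regression can be worked out from the counts reported above. The snippet only restates that arithmetic; the dictionary layout is an illustrative choice.

```python
# SNP counts reported in the text for each phenotype and method.
selected = {
    "speed": {"UoI": 153, "Lasso": 161},
    "weight": {"UoI": 132, "Lasso": 408},
}

ratios = {}
for phenotype, counts in selected.items():
    # Fraction of Lasso's feature-set size that UoI needs.
    ratios[phenotype] = counts["UoI"] / counts["Lasso"]
    print(f"{phenotype}: UoI selects {ratios[phenotype]:.0%} as many SNPs as Lasso")
```

For speed the two methods select sets of similar size (about 95%), whereas for weight UoI uses roughly a third (about 32%) of the SNPs that Lasso Regression does.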

Mapping Coefficients to Genes

Once the coefficients had been produced for each phenotype and the R2 score had been computed, the selected one-hot encoded SNPs were mapped to gene annotations retrieved from the MGI database. Subsequent analysis used the gene feature types in these annotations to provide a richer explanation of the SNPs found. The resulting gene features were then combined into a final set for interpretation and further analysis.
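The mapping step can be sketched as a lookup followed by a grouping. Everything in the example is hypothetical: the SNP identifiers, the "<snp id>:<genotype>" column naming, the placeholder MGI identifiers and gene symbols, and the in-memory annotation table stand in for whatever export of MGI annotations the project actually used.

```python
from collections import defaultdict

# Hypothetical annotation table: SNP id -> gene record (placeholder values).
annotations = {
    "snp_a": {"mgi_id": "MGI:PLACEHOLDER1", "symbol": "GeneA", "feature_type": "lncRNA gene"},
    "snp_b": {"mgi_id": "MGI:PLACEHOLDER1", "symbol": "GeneA", "feature_type": "lncRNA gene"},
    "snp_c": {"mgi_id": "MGI:PLACEHOLDER2", "symbol": "GeneB", "feature_type": "protein coding gene"},
    "snp_e": {"mgi_id": "MGI:PLACEHOLDER3", "symbol": "GeneC", "feature_type": "protein coding gene"},
}

# Hypothetical fitted coefficients keyed by one-hot column name "<snp id>:<genotype>".
coefficients = {
    "snp_a:AA": 0.8,
    "snp_a:AG": 0.0,
    "snp_b:TT": -0.3,
    "snp_c:CC": 0.0,   # zero weight, so snp_c is not selected
    "snp_d:GG": 0.4,   # selected, but has no annotation in the table
    "snp_e:CT": 0.2,
}


def genes_by_feature_type(coefficients, annotations):
    """Group the unique genes hit by selected SNPs under their feature type."""
    # A SNP counts as selected if any of its one-hot columns is non-zero.
    selected_snps = {col.split(":")[0] for col, w in coefficients.items() if w != 0}
    grouped = defaultdict(set)
    for snp in selected_snps:
        record = annotations.get(snp)
        if record is None:
            continue  # skip SNPs that do not map to an annotated gene
        grouped[record["feature_type"]].add(record["symbol"])
    return {ftype: sorted(symbols) for ftype, symbols in grouped.items()}


result = genes_by_feature_type(coefficients, annotations)
print(result)
```

Using sets keeps each gene once even when several selected SNPs fall in the same gene, which matches the idea of counting unique genes per phenotype.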

The sparsity of the one-hot encoded SNPs selected by UoI allowed for further analysis of what the identified SNPs mean. To interpret the biological significance of the SNPs selected by UoI, the SNPs were mapped to MGI identifiers, which were then used to identify the genes and feature types. The identified genes were organized by feature type to make the results easier to read. Most of the feature types identified were related to long non-coding RNA (lncRNA) rather than to protein-coding genes. lncRNAs are known to have a complex yet profound influence on genes and their expression, and the results from this study further support the complex role that lncRNA plays in genetics and in phenotypic traits. Additionally, once the one-hot encoded SNPs were mapped to their corresponding genes, 49 and 37 unique genes were found for the phenotypic traits speed and weight, respectively. Once the genes were identified, the National Institutes of Health's (NIH) Gene Browser was used to find the biological functions related to the genes. Of the 49 genes found for speed and the 37 genes found for weight, 4 and 3 genes, respectively, were found to be directly related to the speed and weight phenotypes.