Simulating RNA-Seq Data


For my Master's Thesis I needed simulated RNA-Seq data. While there exist many packages in R for simulating RNA-Seq data, I needed a specific package where I could control the influence of covariate/batch effects on the data as well as the number of differentially expressed genes (DEGs). After attempting to use many packages, I decided to make my own package, in order to fulfill my needs.

For the purpose of my research my Simulated dataset consisted of 2000 genes, 1000 samples, and 500 true DEGs; although the functions I created are parameterized for the number of genes, samples, and DEGs. The formula for generating the raw counts can be seen on the right.

Generating the Baseline Counts:

Prior to adding in the covariates, experimental factors, and noise the baseline counts needed to be simulated in order to show the true genetic signal. Initially I used the negative binomial distribution to generate the counts, but later I moved to upsampling a public airway RNA-seq dataset containing 6 samples of 38694 genes to 1000 samples with 2000 genes, where genes were eliminated randomly.

Of the 2000 genes, 500 differentially expressed (DE) genes were randomly selected from a pool of genes which have a standard deviation > 2 and their log-fold change was generated using two random normal distributions (mean = -0.75, standard deviation = .5) and (mean = 0.75, standard deviation = .5). Dispersions were created using biases from the covariates, plates, along with random noise.

As a result, the counts for these genes follow the negative binomial distribution as both expected and required. (image to the right shows the raw counts after upsampling and introducing DE genes, see how they follow the negative binomial distribution)

Adding Dispersions:

Dispersions are essential in generating RNA-Seq data but also crucial for the requirements of my package. For the needs of my research I needed to carefully measure the effect that covariates have on the downstream error. Ensuring that I created a reliable method for adding dispersions to my RNA-Seq data was of the utmost importance.

Covariates & Plates:

After initialization, the covariate and plate matrix are scaled to ensure covariates and plates have a similar effect, each covariate and plate values were scaled to be within the range [0,1] by dividing each value by the maximum value for that covariate or for the maximum plate. Two covariates a and d were randomly selected to interact with the diagnosis the plate vector was also selected to interact with the diagnosis. Interactions were simulated by adding the diagnosis divided by 50 to the covariate and plate vectors selected. All of the covariates and the plate values are multiplied by a negative random binomial distribution, the distribution was further randomized to ensure that covariates/plates affected different genes. To further randomize the negative binomial distributions, a function was used to randomly select the number of successful trials to be either 1 or 2 and for the probability of success to be within the range (1e-5, 1e-2).

Random Error:

For additional random error, an error matrix was added to the RNA-Seq count data and is generated using the absolute value of a random normal distribution with a mean of 0 and a standard deviation equal to the maximum value of the raw RNA-Seq data divided by the number of genes present. This was done to ensure that the random error was sufficiently large for the number of genes.

Validating Impact of Covariates and Plates:

In order to measure the effects of the covariate effects, the plate effects, and the strength of the DEGs I created four RNA-Seq datasets based on the same original covariate/plate values: Unbalanced Library and Unbalanced Covariates (ULUC), Unbalanced Library and Balanced Covariates (ULBC), Balanced Library and Unbalanced Covariates (BLUC), and Balanced Library and Balanced Covariates (BLBC). If the library is Balanced that indicates that the plate matrix contains all 0s and the orignal counts are effectively not affected by the plate values; similarly when the covariates are balanced the original counts are not affected by the six covariates a,b,c,d,e, and f. The formula for generating each of these datasets can be seen below

These four datasets were then compared using sample distances, gene set enrichment analysis (GSEA) and Principal Variance Component Analysis (PVCA).

Sample Distances:

Heat Maps visualizing sample distances for the four balancing scenarios: Unbalanced Library and Unbalanced Covariates (ULUC), Unbalanced Library and Balanced Covariates (ULBC), Balanced Library and Unbalanced Covariates (BLUC), and Balanced Library and Balanced Covariates (BLBC). BLUC and BLBC use the 750 samples that are known to be balanced for all the covariates. Sample distances are computed using the euclidean distances between samples using their VST normalized RNA-Seq counts.

Gene Set Enrichment Analysis:

GSEA was performed using 5 million randomly generated gene sets. Over 10,000 iterations, 100 gene sets were generated for each respective size gene set 100, 200, 300, 400, and 500; totalling for 500 total gene sets per iteration. Using the true DE genes from the simulated RNA-Seq count data the 500 total gene sets were labeled as enriched or unenriched, the predicted DE genes from each method were then used to predict whether each gene set is enriched or unenriched. The results of the true and predicted gene sets were used to compute Sensitivity, Specificity, and Accuracy metrics for covariate and library balancing. GSEA served as a way of comparing how different balancing (library and covariate) across the four datasets (ULUC, ULBC, BLUC, and BLBC) affected the enrichment of certain gene sets.

Principal Variance Component Analysis:

PVCA, is used to determine which factors are responsible for variance within RNA-Seq counts. Within the scope of this project, PVCA is used to illustrate that the amount of variance attributed to the factor of interest (the diagnosis), increases as the noise in the dataset decreases. The figure to the right demonstrates that as we balance the covariates and library that the amount of variance in the data attributed to the diagnosis increases.


This side project for my lab served as an essential tool for validating my work on statistical methods for removing batch effects from data as it enabled me to measure the downstream effects of experimental factors. The code as of right now is not publically available but is used only for private lab use. The code will probably be made public sometime fall 2021.