Initial Data Exploration and Importing

First, we must import, organize, and ensure our read count data looks as expected. After importing, the head of our counts file appears correct.

   

Now, the total number of reads per sample should be quantified. This provides an idea of sequencing depth per sample. Ideally, all samples should have similar sequencing depth. Otherwise, the dataset will be normalized prior to further analysis.

##      DMSO_S1      DMSO_S2      DMSO_S3 Estradiol_S1 Estradiol_S2 Estradiol_S3 
##     46234960     38355243     44275449     44986194     44738896     43422599

   

All samples exhibit quite comparable sequencing depth, lending to the data being high-quality. However, DMSO_S2 does contain a relatively disparate number of reads, so normalization will be implemented within the DESeq2 object.

DESeq2 Object Creation

To construct the DESeq2 object, the experimental design first needs to be described. In this case, there are 3 samples treated with DMSO (acting as controls) and 3 samples treated with Estradiol (with the intention of inducing the overexpression of the MUTE gene).

   

After successful DESeq2 object creation, it is good practice to ensure the data still remains intact. For instance, the raw read counts should remain unchanged.

   

As expected, the raw counts contained within the DESeq2 object remain unchanged. Now, it is possible to quickly and easily normalize these count values within the object.

   

Further analysis will proceed with these normalized counts, as they do not drastically vary from their original raw values, and it is generally considered best practice for a dataset exhibiting this much variance.

Gene Annotation File Accession

Before further analysis can proceed, a gene annotation file linking A. Thaliana Gene IDs to Gene Names and Types will be required to make any meaningful conclusions about the dataset. This file was generated by accessing EnsemblPlants BioMart and selecting the plant gene database then the TAIR 10 A. Thaliana gene dataset. The only chosen gene attributes were “Gene stable ID,” “Gene name,” and “Gene type.” Results were then gathered, selecting unique results only and exporting the file as a csv. The file was then imported and checked to be within the epected format in R.

   

Now that the gene annotation file is properly imported, its dataframe can be merged with the read count data to provide a single dataframe with all relevant gene and sample information. This annotated dataframe may also be convenient to have on hand for future analyses, so it will be written to a csv file.

   

Sample Variance Analysis and Data Exploration

Before carrying out Differential Expression Analysis, it is good practice to check for and visualize variation between samples and batch effects. This should also make obvious any outliers within the dataset and aid in understanding of the general trends within the data.

   

First, a Principal Component Analysis plot was generated to visualize the dataset’s general variance. As expected, the DMSO and Estradiol samples clustered together, and these clusters are far flung from each other. This variance in treatment condition accounted for the vast majority of variance within the entire dataset, which both makes logical sense and could indicate the extreme effect of the Estradiol treatment on gene expression versus the DMSO control. The remaining variance among samples is relatively low, indicating low sample-to-sample variance.          

Next, a variable genes heatmap was generated to gain a better understanding of which highly variable genes are associated with specific samples or conditions.

   

   

From the heatmap above, it appears that the MUTE gene is the most extremely variably expressed of all genes in the dataset. This is expected, as the experimental design sought to induce MUTE overexpression via Estradiol treatment. More generally, the distinct clustering of quadrants is also interesting and expected, as it indicates that these groups/classes of genes are being inversely expressed per treatment condition.

Differential Expression Analysis

Now that the dataset’s quality has been verified and is better understood, we can more confidently seek to determine which genes are upregulated and downregulated due to induced MUTE overexpression via Estradiol treatment. Using the results() command on the DESeq2 object to compare the Estradiol vs DMSO treatments, relevant statistics were calculated per gene - most notably baseMean, log2FoldChange, and padj. baseMean correlates to the overall expression of a given gene, log2FoldChange represents how much more a gene is expressed in Estradiol treatment compared to DMSO treatment (thus negative values represent greater expression in DMSO compared to Estradiol), and padj corresponds to the likelihood that a given gene will be significantly different between the 2 treatment conditions. With this in mind, these values were used to generate visualizations of differentially expressed genes.      

     

Another heatmap was generated, this time utilizing the normalized gene values and the aforementioned padj and log2FoldChange values. This provides insight about sample-to-sample variation and shows the top differentially expressed genes across the dataset.

   

From the Differentially Expressed Genes Heatmap, it appears that the most consistently upregulated genes due to induced MUTE overexpression may be ERL1, TMM, AT1G12845, AT4G31805, and AT5G67020. Additionally, it seems that induced MUTE overexpression may not consistently affect gene regulation of all Estradiol treatment samples, as seen in the relatively high amount of sample-to-sample variation with some genes such as VIM4, WRRKY49, FAF2, and AT1G08833 being highly expressed in one Estradiol sample and barely expressed in another. Conversely, no genes appear to be significantly downregulated due to MUTE overexpression given the results of the above heatmap.

   

A volcano plot was generated using the normalized data to visualize genes exhibiting extreme regulation states or genes with significant p-values. This additional plot also serves to validate previous conclusions concerning which genes are significantly up- or downregulated in the MUTE overexpressed samples.

       

From the above volcano plot, it seems that the TMM, AT4G18970, AT4G31805, AT4G29020, AT2G27385, LCR68, and ERL1 genes may be significantly upregulated in conjunction with MUTE overexpression. In contrast, the EXPA1, GSTU20, and CYP79F1 genes may downregulate due to MUTE overexpression. Notably, more genes in total are upregulated due to MUTE overexpression than downregulated. This could potentially indicate that MUTE overexpression amplifies the transcription of some kind of response pathway.

Conclusion

Given the congruence of the differential expression plots, the ERL1, TMM, and AT4G31805 genes are very likely consistently upregulated due to induced MUTE overexpression. Both ERL1 and TMM are implicated in stomatal and protodermal cell development. AT4G31805 encodes for the POLAR scaffold protein associated with meristemoids, with ERL1 also encoding for a kinase which prevents terminal differentiation of meristemoid. In contrast, 2 of the 3 identified significantly downregulated genes are involved in environmental response to predators - EXPA1 is implicated in the formation of root syncytia induced by nematodes, and CYP79F1 is involved in the response to insect oral secretion/attack.