First, we must import, organize, and ensure our read count data looks as expected. After importing, the head of our counts file appears correct.
Now, the total number of reads per sample should be quantified. This provides an idea of sequencing depth per sample. Ideally, all samples should have similar sequencing depth. Otherwise, the dataset will be normalized prior to further analysis.
## DMSO_S1 DMSO_S2 DMSO_S3 Estradiol_S1 Estradiol_S2 Estradiol_S3
## 46234960 38355243 44275449 44986194 44738896 43422599
All samples exhibit quite comparable sequencing depth, lending to the data being high-quality. However, DMSO_S2 does contain a relatively disparate number of reads, so normalization will be implemented within the DESeq2 object.
To construct the DESeq2 object, the experimental design first needs to be described. In this case, there are 3 samples treated with DMSO (acting as controls) and 3 samples treated with Estradiol (with the intention of inducing the overexpression of the MUTE gene).
After successful DESeq2 object creation, it is good practice to ensure the data still remains intact. For instance, the raw read counts should remain unchanged.
As expected, the raw counts contained within the DESeq2 object remain unchanged. Now, it is possible to quickly and easily normalize these count values within the object.
Further analysis will proceed with these normalized counts, as they do not drastically vary from their original raw values, and it is generally considered best practice for a dataset exhibiting this much variance.
Before further analysis can proceed, a gene annotation file linking
A. Thaliana Gene IDs to Gene Names and Types will be required
to make any meaningful conclusions about the dataset. This file was
generated by accessing EnsemblPlants
BioMart and selecting the plant gene database then the TAIR 10
A. Thaliana gene dataset. The only chosen gene attributes were
“Gene stable ID,” “Gene name,” and “Gene type.” Results were then
gathered, selecting unique results only and exporting the file as a
csv. The file was then imported and checked to be within
the epected format in R.
Now that the gene annotation file is properly imported, its dataframe
can be merged with the read count data to provide a single dataframe
with all relevant gene and sample information. This annotated dataframe
may also be convenient to have on hand for future analyses, so it will
be written to a csv file.
Before carrying out Differential Expression Analysis, it is good practice to check for and visualize variation between samples and batch effects. This should also make obvious any outliers within the dataset and aid in understanding of the general trends within the data.
First, a Principal Component Analysis plot was generated to visualize
the dataset’s general variance. As expected, the DMSO and Estradiol
samples clustered together, and these clusters are far flung from each
other. This variance in treatment condition accounted for the vast
majority of variance within the entire dataset, which both makes logical
sense and could indicate the extreme effect of the Estradiol treatment
on gene expression versus the DMSO control. The remaining variance among
samples is relatively low, indicating low sample-to-sample variance.
Next, a variable genes heatmap was generated to gain a better understanding of which highly variable genes are associated with specific samples or conditions.
From the heatmap above, it appears that the MUTE gene is the most extremely variably expressed of all genes in the dataset. This is expected, as the experimental design sought to induce MUTE overexpression via Estradiol treatment. More generally, the distinct clustering of quadrants is also interesting and expected, as it indicates that these groups/classes of genes are being inversely expressed per treatment condition.
Now that the dataset’s quality has been verified and is better
understood, we can more confidently seek to determine which genes are
upregulated and downregulated due to induced MUTE overexpression via
Estradiol treatment. Using the results() command on the
DESeq2 object to compare the Estradiol vs DMSO treatments, relevant
statistics were calculated per gene - most notably
baseMean, log2FoldChange, and
padj. baseMean correlates to the overall
expression of a given gene, log2FoldChange represents how
much more a gene is expressed in Estradiol treatment compared to DMSO
treatment (thus negative values represent greater expression in DMSO
compared to Estradiol), and padj corresponds to the
likelihood that a given gene will be significantly different between the
2 treatment conditions. With this in mind, these values were used to
generate visualizations of differentially expressed genes.
Another heatmap was generated, this time utilizing the normalized
gene values and the aforementioned padj and
log2FoldChange values. This provides insight about
sample-to-sample variation and shows the top differentially expressed
genes across the dataset.
From the Differentially Expressed Genes Heatmap, it appears that the most consistently upregulated genes due to induced MUTE overexpression may be ERL1, TMM, AT1G12845, AT4G31805, and AT5G67020. Additionally, it seems that induced MUTE overexpression may not consistently affect gene regulation of all Estradiol treatment samples, as seen in the relatively high amount of sample-to-sample variation with some genes such as VIM4, WRRKY49, FAF2, and AT1G08833 being highly expressed in one Estradiol sample and barely expressed in another. Conversely, no genes appear to be significantly downregulated due to MUTE overexpression given the results of the above heatmap.
A volcano plot was generated using the normalized data to visualize genes exhibiting extreme regulation states or genes with significant p-values. This additional plot also serves to validate previous conclusions concerning which genes are significantly up- or downregulated in the MUTE overexpressed samples.
From the above volcano plot, it seems that the TMM, AT4G18970, AT4G31805, AT4G29020, AT2G27385, LCR68, and ERL1 genes may be significantly upregulated in conjunction with MUTE overexpression. In contrast, the EXPA1, GSTU20, and CYP79F1 genes may downregulate due to MUTE overexpression. Notably, more genes in total are upregulated due to MUTE overexpression than downregulated. This could potentially indicate that MUTE overexpression amplifies the transcription of some kind of response pathway.
Given the congruence of the differential expression plots, the ERL1, TMM, and AT4G31805 genes are very likely consistently upregulated due to induced MUTE overexpression. Both ERL1 and TMM are implicated in stomatal and protodermal cell development. AT4G31805 encodes for the POLAR scaffold protein associated with meristemoids, with ERL1 also encoding for a kinase which prevents terminal differentiation of meristemoid. In contrast, 2 of the 3 identified significantly downregulated genes are involved in environmental response to predators - EXPA1 is implicated in the formation of root syncytia induced by nematodes, and CYP79F1 is involved in the response to insect oral secretion/attack.