Since there are many robust and capable servers for functional enrichment
analysis, we feel it is better to use a previously tool/web server than to
try to reproduce what is already available.
Examples of these services include the following:
For this example, we will look at how you can extract differentially expressed gene data for IRIS-EDA and how you can use DAVID for functional enrichment.
Once you have run DGE Analysis in the DGE Analysis
tab, simply
download the differentially expressed genes using the
Download Filtered Data
button at the bottom of the page on the
prior tab (Plots
tab).
This will download a set of significantly differentially expressed genes or IDs that can be opened up in R, Excel, LibreOffice Calc, or any other spreadsheet viewing software.
Once the spreadsheet is open, simply copy the IDs from the first column and open up DAVID Bioinformatics Resources (https://david.ncifcrf.gov/).
Click on the Gene Functional Classification
button on the left-hand
side of the page. This will be under the Shortcut to DAVID Tools.
On this page, click on the Upload
button on the left-hand side of the
page. In Step 1: Enter Gene List, paste your gene IDs into this text
box. If you have pasted this list into a text file, you can upload
that instead.
For Step 2: Select Identifier, choose the ID source for streamlined
analysis. For example, the small example data set found in the tutorial is
derived from the FLYBASE_GENE_ID
database, therefore, I would select
that from the list. Your data, of course, may differ. If you don't know
the origin of the gene IDs,there is a Not Sure
option at the bottom of
the list.
For Step 3: List Type, choose Gene List
and finally, submit the list
using the Submit List
button in Step 4: Submit List.
Once submitted, you can view which genes are enriched, clusters, etc. For more information about downstream analyses, please click here.
GEO (Gene Expression Omnibus) is a public data repository that accepts array- and high throughput sequence-based data. To submit data to GEO, you will need three components:
After submitting your data to the DGE Analysis
tab, you can fill out this
questionnaire which will populate the metadata template file required for GEO
submission.
Table 1: A comparative overview of citation counts for DGE tools and DGE servers. DGE analytical tools (Tool) are compared based on the following criteria: Current number of citations (Citations), percentage of total citations from the analytical tools presented (Citation %), year the analytical tool was published (Year), approximate citations per year based on data accrued through 2017 (Citations/Year), and if the analytical tool has an R-based application (R-based).
Tool | Citations | Citation % | Year | Citations/Year (through 2017) | R-based? |
---|---|---|---|---|---|
edgeR (Robinson et al. 2010) | 7093 | 32.39700375 | 2010 | 1013.285714 | Yes |
Cuffdiff (Trapnell et al. 2012) | 4533 | 20.70430255 | 2012 | 906.6 | No |
Cuffdiff2 (Trapnell et al. 2013) | 1508 | 6.887731799 | 2013 | 377 | No |
DESeq2 (Love et al. 2014) | 4249 | 19.40714351 | 2014 | 1416.333333 | Yes |
limma (Ritchie et al. 2015) | 2399 | 10.95733991 | 2015 | 1199.5 | Yes |
DEGseq (Wang et al. 2009) | 1237 | 5.649949758 | 2009 | 154.625 | Yes |
baySeq (Hardcastle et al. 2010) | 562 | 2.56691331 | 2010 | 80.28571429 | Yes |
SAMseq (Li et al. 2013) | 275 | 1.256051886 | 2013 | 68.75 | Yes |
NOIseq (Tarazona et al. 2012) | 38 | 0.173563533 | 2012 | 7.6 | Yes |
IRIS-EDA can be freely accessed directly through this link or through R using the following code in the proceeding sections.
IRIS-EDA requires several packages to operate. To get these pacakages, the following commands can be entereed into an R terminal to check if you already have the necessary packages. If not, the following code will install any missing packages:
# CRAN
packages <- c(
"crosstalk", "dplyr", "DT", "gtools", "plotly", "shiny", "plyr",
"shinyBS", "shinycssloaders", "shinythemes", "tibble", "tidyr",
"Rcpp", "Hmisc", "ggplot2", "locfit", "GGally", "pheatmap",
"reshape2", "backports", "digest", "fields", "psych", "stringr",
"tools", "openxlsx", "Rtsne", "WGCNA", "flashClust", "parallel",
"MCL", "kmed", "ape"
)
np <- packages[!(packages %in% installed.packages()[, "Package"])]
if(length(np)) install.packages(np)
You will also need several Bioconductor packages. Similar to the prior section, the following code will check and install any missing Bioconductor packages into your R library:
# Bioconductor
bioc_packages <- c(
"DESeq2", "edgeR", "limma", "QUBIC", "geneplotter", "GO.db", "impute",
"preprocessCore", "AnnotationDbi"
)
np <- bioc_packages[!(bioc_packages %in% installed.packages()[,"Package"])]
if (!require("BiocManager")) install.packages("BiocManager")
BiocManager::install(np)
To run BRIC analysis, you also need to download the source code for this clustering algorithm. Run this code to get the GitHub package:
# GitHub
if (!require("devtools")) install.packages("devtools")
devtools::install_github("OSU-BMBL/BRIC",force = T)
Once you have installed all of the necessary packages, you can now run the application locally by entering the following code into an R terminal:
shiny::runGitHub("iris", "OSU-BMBL")
Typically, the link will provide an easier route to using IRIS-EDA. In circumstances where internet connections will be limited (such as during travel), loading IRIS-EDA through R while internet is still available will allow users to utilize IRIS-EDA without an internet connection later on.
Installing the local application via GitHub will also let the user have access to more “developmental” versions of IRIS-EDA. Be warned though; using developmental versions of this application may be “cutting edge”, however, this could potentially break various features in the application. Therefore, using the server (https://bmbls.bmi.osumc.edu/IRIS/) will provide a more “stable”, peer-reviewed release of the application.
Upon sample submission to IRIS-EDA, you have two options: (1) Perform analyses on a normal RNA-seq experiment or (2) run single-cell RNA-seq analysis.
If you choose the latter option, you will need to provide a file containing ID lengths to help reduce variability. This is crucial since filtration of low TPM (transcripts per million) reads can lead to better results. In order to calculate TPM, ID length approximations are needed in conjunction with raw count data. To submit this information to IRIS-EDA, you will need to provide ID data that follows the proceeding criteria:
id id_len_kbp
gene001 1.4
gene002 5.6
gene003 0.8
... ...
Note: This file must be in a comma seperated value (CSV) format, where column 1 is the ID (e.g. gene, transcript, etc.) and column 2 is the approximation of length in kilobase pairs.
There are many ways to obtain these values. A common procedure would be to parse the respective general fearture format version 3 (GFF3) file and determine the length through calculating the difference between the start and end locations.
If you need help parsing this information, a primitive R
function has
been made, which you can find
here.
To run this function, you will need to install and load 3 packages into
your R library:
ape
stringr
dplyr
All of these packages can be found on the CRAN
repository, or by using the
install.packages()
function provided in the base package of R
.
This function has four parameters:
gff
: the location to the GFF3 file on your computer
cts
: your raw count matrix that you will provide to IRIS-EDA.
Note: this object must be a matrix
type and have the same
format that is found in the first section of this walkthrough (A
note about input data).
type
: the GFF3 type column identifier (e.g. gene, transcript,
exon, CDS, etc.)
attribute
: the GFF3 attribute (e.g. ID
, gene_id
,
transcript_id
, gene_name
, or transcript_name
)
Depending on the size of this file, this may take some time. This
function will return a tibble
data frame, so make sure you assign it
to an object before you use write.csv()
. After you perform this task,
the file can successfully be submitted to IRIS-EDA.
Once you have submitted the data, you will notice that the
Filter cutoff
changes from count data row sums
to TPM
:
The default is set to a value of 1
, however, this can be changed at
the user's discretion.
Note (1): this value corresponds to which rowsums (i.e. ID sums) will be filtered out if they have a value that is less than the user parameter.
Note (2): “For users with 10X scRNA-Seq data or other data type with
expected low expression across all genes, we have a submission option
that changes default parameterizations to account for this. If
submitting 10X scRNA data, please use the scRNA - 10X Genomics
option.
IRIS-EDA requires two pieces of information for analysis. The first is an expression estimation matrix, also referred to as a count matrix, displaying the gene expression estimates for each sample. The format requires a CSV file with the row names to list the gen IDs and column names to list the sample IDs. The second required input is a condition matrix, wherein the factor levels for each sample are provided. This file requires a CSV format and row names to be the sample IDs matching the sample IDs from the expression estimation matrix and the column names to be the condition factors.
The data used for this tutorial are derived from 28 Vitis vinifera
(grape) samples with three distinct factors (Rootstock, row, and block).
This data can viewed as the “big” example data set found under the
Submit and QC
tab under 1. Submission Parameters
.
Typically, an expression matrix, also known as count data or a count
matrix refers to data where every i-th row and j-th column refer to
how many reads are assigned to gene (ID) i in sample j. For example,
if we have simplified count data for 4 samples and three genes, the R
output will look something like this:
sample1 sample2 sample3 sample4
gene001 23 3 45 2
gene002 6 7 7 8
gene003 0 34 3 42
Note1: When loading count data into IRIS-EDA, make sure
that the first column is your gene IDs and that sample names are short,
concise, and avoid the use of mathematical operators (+
, -
, /
,
*
, ^
, etc.) and spaces between words. If a space is necessary for
legibility, please consider using an underscore (_
)
Note2: For sample names, avoid starting any entries
with numerical elements (e.g. 1sample
, 2_human
, 42_amf
, etc.).
This may potentially cause unexpected errors in downstream analyses!
Condition matrices, also known as metadata, details the design of your
experiment. In this type of matrix, every i-th row and j-th column
refer to factor levels assigned to sample i and factor j. For
example, if were to look at the samples given in the count
data section, the metadata R
output will look something
like this:
condition time
sample1 treated 0h
sample2 untreated 0h
sample3 treated 24h
sample4 untreated 24h
Note1: When loading metadata into IRIS-EDA, make sure
that the first column is your sample names and that column names and
treatment levels are short, concise, and avoid the use of mathematical
operators (+
, -
, /
, *
, ^
, etc.) and spaces between words. If a
space is necessary for legibility, please consider using an underscore
(_
).
Note2: For this data, avoid starting any entries with numerical elements (e.g. 1, 2, 42, etc.). This may potentially cause unexpected errors in downstream analyses!
Note3: Metadata can be expanded to fit the nature of your experiment (i.e. multiple factors can be added). The only thing that must remain consistent between these two matrices, is the sample information. Column names in count data must be the same as row names in the metadata.
Submit and QC
tab near the top-left corner of the
application:1. Submission Paramters
, select either
Start with a small example data set
,
Start with a big example data set
, or Load my own data
to upload
user data. Note: User data requires one count matrix and one
condition matrix:
2. Data Processing
, select a filter cutoff to simplify and
expedite computations by eliminating any rows that have below specified
expression totals. The default argument for our application is 10
:
3. Right below the filter cutoff parameter, select a transformation
method for the count data for displaying the expression estimate.
Note: For more information about any of these topics, take a look at
the FAQ section at the bottom of this document:
4. Click Submit
to load the data under 3. Launch Overview
:
5. After you click Submit
, the main page will now be populated with
several pieces of information in two sub-tabs, File Summary
and
Count Summary
.
File Summary
provides a glimpse into the submitted data. This subtab
includes three main components.
count data
section will detail the first and last five rows of
the count data and also include the total number of IDs and samples:
2. The second portion includes an overview of the condition data this is
simply an R
console-based output of this submitted CSV file:
3. Finally, pre- and post-filtered gene ID counts gives a
“before-and-after” count of the number of IDs that were filtered using
the filter cutoff parameter under 2. Data Processing
:
## S5.2: Count Summary Count summary
provides three interactive,
downloadable visualizations based on the file for each sample.
Submit and QC
tab, you may
proceed to the Discovery-Driven Analyses
tab:
2. The Discovery-Driven Analyses
tab will be populated with several
subtabs, similar to the Submit and QC
tab. The first subtab you will
see is the Correlation
tab. This tab provides correlation analysis of
the expression matrix by sample:
Correlation
subtab you will see several visualizations.
Interactive Correlation Analysis
displays a heatmap of the correlation
between samples with interactivity showing the actual correlation
between the two intersecting samples. The example data shows most
sample-sample correlations of 0.95 or larger, indicating relatively high
correlation. The darker cells here signify less similar samples, which
may yield more interesting differential expression results. This graph
can indicate comparisons of interest in future analyses. In most cases,
the high number of gene IDs with similar or identical expression
estimates will cause correlations to be large, even for dissimilar
genetic expression profiles:
4. Clicking on a cell will provided a scatterplot of the two
intersecting samples with each data point representing one gene ID. A
scatterplot with points falling more closely along the diagonal
indicates samples with more similar genetic expression profiles. This
scatterplot shows a clear trend of data points (gene IDs) falling along
or close to the diagonal. That means these two samples have very similar
genetic expression profiles. Data points that fall far from the diagonal
line represent genes with dissimilar expression levels between the
select samples:
6. The next visualization will be under the PCA
subtab. This subtab
provides Principal Component Analysis (PCA) for the expression
estimation matrix:
8. The next visualization will be under the MDS
subtab. This subtab
provides multi dimensional scaling for the expression estimation matrix.
This is similar to PCA except is develops components through non-linear
transformations of the gene expressions:
11. The next visualization will be under the Heatmap
subtab. This
subtab provides an interactive heatmap with rows representing the gene
IDs and columns representing sample IDs:
Biclustering
performs a biclustering
analysis using one of the selected biclustering tools (3) with a maximum
bicluster size of the indicated cutoff value (2):
Clustering
tab allows for users to select one of three
clustering methods respectively representing hierarchical,
representative, and graph-based clustering: Weighted Gene Co-expression
Network Analysis (WGCNA), k-medoids, and the Markov Clustering Algorithm
(MCL). While all three methods have demonstrated high performance
related to module detection, the default is WGCNA due to it being the
best of the three. Here, users should be careful with large datasets, as
these methods can take large amounts of time to run. Setting the
variable cutoff option lower will reduce the time by selecting fewer of
the most differentially expressed genes to use in clustering.