Overview: RNA-Seq data analysis
RNA-Seq has revolutionized the way we explore gene expression. Simple gene-level analysis, or more advanced transcript-level analysis with the potential to detect alternative splicing events, is now at your fingertips. Are you interested in long non-coding RNA? Fusion transcripts? Is your organism of interest none of the usual suspects? Do you want to make use of your collection of formalin-fixed, paraffin-embedded (FFPE) samples? Everything is possible, and the quality of the results is striking. However, with new possibilities come new obstacles, and many decisions have to be made in order to obtain the best possible result. And even the best result might not be the right one. We think that data analysis always needs to be adjusted to the initial aim of the researcher. We understand complicated experimental designs and will adapt our data analysis workflow to your aim. No standard pipelines. Promised.
No result counts if it is not presented in the best way. We aim for high-quality figures. We provide high-resolution images and, additionally, PDF versions of your graphics, which enable you to change colors, text and many other properties. Please see an example video here.
If you want to contract it’s biology to analyze your RNA-Seq project, we will divide the whole process into four steps, and you choose which level of analysis you need:
Consulting in the experimental design and technical procedure of the experiment. Sometimes one phone call can help tremendously.
Low-level data analysis
Data quality assessment, read manipulation (trimming, filtering), alignment, quantification and normalization.
Statistics and Visualizations
Statistics and visualization of differentially expressed genes, transcripts or isoforms.
Interpretation and Integration
Further analysis, such as Gene Ontology enrichment, pathway involvement or integration with results from other assays.
We analyze RNA-Seq data from all major next-generation sequencers from Illumina or Ion Torrent. We can start from files in FASTA, FASTQ, unaligned BAM or SRA format.
Please scroll down for more information about the individual steps of our RNA-Seq workflow. Please contact us here if you have any questions about our service.
Quality control and read preparation
It is no secret that quality control of RNA-Seq raw data is essential for obtaining a good result. This is no different from other biological assay data, but in RNA-Seq we have the possibility to shape the raw data according to quality parameters. One example of such shaping is trimming read ends based on the quality scores of the bases. We use standard quality control tools like FastQC, but we add further quality assessments whenever needed. For example, by default we also check each sample for possible RNA degradation. We have seen that this step is particularly important when working with samples from formalin-fixed, paraffin-embedded (FFPE) tissue. We also check for contaminating ribosomal RNA in your samples and exclude those reads from the analysis so that they do not interfere with later normalization procedures. For RNA-Seq samples coming from cell culture we also include a screen against prominent contaminants like yeast, bacteria (intra- and extracellular), viruses and cross-contamination with other species (during sample preparation). And to be on the safe side, we actually always perform this contamination check, regardless of the sample origin.
You will receive a PDF report containing all crucial quality plots for both raw and quality-enhanced data. We will discuss all manipulation steps with you, and of course we document every step we perform.
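To make the idea of quality-based trimming of read ends concrete, here is a minimal sketch in Python. It is an illustration only and not the procedure used in our workflow: the function name trim_3prime, the threshold of 20 and the Phred+33 encoding are assumptions, and dedicated trimming tools use more refined algorithms than this simple cut.

```python
def trim_3prime(seq: str, qual: str, min_q: int = 20, offset: int = 33) -> tuple[str, str]:
    """Cut bases from the 3' end of a read until a base with quality >= min_q is found.

    seq    : base sequence of the read
    qual   : FASTQ quality string of the same length (assumed Phred+33 encoded)
    min_q  : hypothetical quality threshold
    offset : ASCII offset of the quality encoding
    """
    if len(seq) != len(qual):
        raise ValueError("sequence and quality string must have the same length")
    end = len(seq)
    # Walk backwards from the 3' end while the base quality is below the threshold.
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


# 'I' encodes quality 40 and '#' encodes quality 2 in Phred+33.
trimmed_seq, trimmed_qual = trim_3prime("ACGTACGT", "IIIII###")
print(trimmed_seq, trimmed_qual)
```

A read consisting only of low-quality bases is trimmed to an empty string, so a real pipeline would follow this step with a minimum-length filter.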
Alignment of RNA-Seq reads to the genome is the most computationally demanding process in the whole workflow. It is also a tricky one, since quite a few decisions have to be made. Which aligner? What should be used as the reference: the whole genome or just the transcriptome? What are the optimal parameters for the chosen alignment algorithm? We have answers to those questions and are able to give you first results for a medium-sized RNA-Seq experiment (16 samples, 200 GB raw data) in just one working day.
Here is what we do. After preprocessing the reads, we align them to the whole genome using our favorite RNA-Seq aligner, STAR, which shows high specificity and sensitivity. And it’s fast. In parallel we also align subsamples of your data using a set of other splice-aware aligners (for example GSNAP and TopHat2). If the STAR alignment does not meet our criteria for a successful alignment, we compare it with the other alignment sets, based on selected parameters and also by visually inspecting the aligned reads, in order to decide on the best alignment strategy.
The results of this step are sorted and indexed BAM files, which can be used for the next step, quantification, or visualized in any genome browser.
Quantification and Normalization
Now that we know the position of the reads on the genome, the next step is to assign and quantify the reads for known genes and transcripts. Again, there is a plethora of tools available, some better than others. We have had very good experience with Salmon, a successor of Sailfish, which is currently our favorite tool for obtaining stable gene/transcript expression values. Another decision to make here is which database of known genes/transcripts to use. RefSeq, being conservative, contains far fewer isoforms than, for example, Ensembl. Both have advantages, which we will discuss with you in order to fit the results to your aim.
Once we have our table of expression values, we need to normalize the data. There is still no clear agreement on which normalization strategy works best with next-generation sequencing data, although some have been shown to have clear drawbacks (e.g. RPKM normalization). We use “the normalizer” to apply a selection of valid normalization procedures and check which one works best with the data set at hand. One main criterion here is how the normalization algorithm handles local and general deviation from the mean. Additionally, we apply statistics to all normalized sets and assess the output in terms of sensitivity, false-positive rate and biological meaning.
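As a small illustration of why the choice of normalization matters, the following Python sketch compares RPKM with TPM (transcripts per million) on a toy example. It is not the implementation behind “the normalizer”; the function names, counts and lengths are made up for illustration, and the sketch assumes non-zero lengths and at least one non-zero count per sample.

```python
def rpkm(counts, lengths_bp):
    """Reads per kilobase of transcript per million mapped reads."""
    total_reads = sum(counts)
    return [c * 1e9 / (total_reads * length) for c, length in zip(counts, lengths_bp)]


def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalize first, then scale each sample to 1e6."""
    rates = [c / length for c, length in zip(counts, lengths_bp)]
    rate_sum = sum(rates)
    return [r * 1e6 / rate_sum for r in rates]


# Hypothetical toy data: three transcripts, two samples.
lengths = [1000, 2000, 1000]
sample_a = [100, 200, 300]
sample_b = [100, 200, 900]

for name, counts in (("A", sample_a), ("B", sample_b)):
    print(name, "RPKM sum:", round(sum(rpkm(counts, lengths))),
          "TPM sum:", round(sum(tpm(counts, lengths))))
```

The TPM values of every sample add up to one million, whereas the sum of the RPKM values differs from sample to sample, which is one of the reasons RPKM values are hard to compare across samples.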
Statistics and more
Just as we test various normalization procedures on each data set, our experience strongly suggests carefully selecting a suitable statistical test for your data set. Each data set behaves differently, with the number of samples per condition being one crucial parameter. We apply the major statistical tests available for RNA-Seq data and carefully check the output of each test. To be more precise, we cluster all genes/transcripts that are specific to one test or another, or that fall into any overlap between the tests. We have posted an example for one data set in our blog. By carefully investigating the output of each statistical test, including its biological meaning, we are confident that we can select the right test for your data. This might be one specific statistical test or a meta-set based on two or more tests.
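The comparison of test outputs described above can be pictured with a few lines of Python. This is only a sketch of the bookkeeping: the test names and gene identifiers are hypothetical placeholders, and the function group_by_calling_tests is introduced here for illustration and is not part of any library.

```python
from collections import defaultdict


def group_by_calling_tests(significant):
    """Group genes by the exact combination of tests that call them significant.

    significant : dict mapping a test name to the set of genes it reports
    returns     : dict mapping a frozenset of test names to the genes called
                  by exactly that combination of tests
    """
    calls = defaultdict(set)
    for test_name, genes in significant.items():
        for gene in genes:
            calls[gene].add(test_name)

    groups = defaultdict(set)
    for gene, tests in calls.items():
        groups[frozenset(tests)].add(gene)
    return dict(groups)


# Hypothetical results of three statistical tests.
results = {
    "test_A": {"gene1", "gene2", "gene3"},
    "test_B": {"gene2", "gene3", "gene4"},
    "test_C": {"gene3", "gene5"},
}

for tests, genes in group_by_calling_tests(results).items():
    print(sorted(tests), "->", sorted(genes))
```

Each resulting group (genes specific to one test, or shared by several) can then be inspected separately, for example with a cluster plot.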
Besides hierarchical clustering, we offer a large selection of exploratory visualizations that help us interpret the results. Examples are classical x-y (scatter) plots of the means of your conditions, SOM clustering, k-means clustering, volcano plots, principal component analysis (PCA), box plots and histograms. We apply not only Gene Ontology (GO) enrichment and pathway enrichment to the sets of significant genes/transcripts; we also make use of custom gene sets, transcription factor target sites and miRNA target sequences. This can uncover transcription factors or miRNAs acting as regulators in the experiment. Additionally, we perform a statistical test on GO classes and pathway sets for all expressed genes/transcripts, regardless of significance. This allows us to reveal pathways that are generally targeted by the experimental procedure, but perhaps only to a level that would not have been detected by simple enrichment analysis.
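For readers who want to see what a simple enrichment test looks like, here is a minimal Python sketch of a one-sided hypergeometric test, a common way to ask whether a gene set is over-represented among the significant genes. It is an assumption that this particular test is appropriate for a given analysis, and the function name and the numbers in the example are hypothetical.

```python
from math import comb


def enrichment_p(n_universe, n_in_set, n_selected, n_overlap):
    """One-sided hypergeometric p-value P(X >= n_overlap).

    n_universe : number of expressed genes considered (the background)
    n_in_set   : number of background genes annotated to the gene set
    n_selected : number of significant genes
    n_overlap  : number of significant genes annotated to the gene set
    """
    upper = min(n_in_set, n_selected)
    total = comb(n_universe, n_selected)
    # Sum the probabilities of observing n_overlap or more annotated genes by chance.
    tail = sum(
        comb(n_in_set, k) * comb(n_universe - n_in_set, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Hypothetical numbers: 12000 expressed genes, 150 in the pathway,
# 400 significant genes, 15 of which belong to the pathway.
print(enrichment_p(12000, 150, 400, 15))
```

When many gene sets are tested, the resulting p-values still need to be adjusted for multiple testing.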
At the single-gene level, we offer bar plots, dot plots, interaction plots and line graphs. These are always produced at the single-sample level as well as for the means of the conditions. We carefully select recent annotation from various selected resources. For example, a single-gene PDF report always includes the comprehensive RefSeq summary for the gene. All of this is meant to help you gain ideas and biological insight into your experiment with ease.
We can also perform integration analysis with your RNA-Seq results. You might have, for example, miRNA-Seq or ChIP-Seq data that you want to compare side by side with your expression data. We bring the sets together. Have you already done the experiment using microarrays? We are experts in comparing microarray and RNA-Seq data sets.
You don't have RNA-Seq data yet, but somebody else has published a data set you are interested in? You want to know whether the results are valid? You have already analyzed your data, but you want a second opinion? These are all questions you can contact us about. We are confident that we can provide you with answers.
The discovery of true alternative splicing events is tricky, since the analysis is prone to reporting false-positive events. It is crucial that your data set has sufficient read depth in order to get reasonable results. In principle there are two different approaches to detecting alternative splicing patterns: based on transcript expression or at the exon level. We prefer transcript-level detection of alternative splicing, since we think that the output is easier for the investigator to interpret. After isoform quantification we apply ANOVA statistical modelling in order to obtain a p-value for alternative splicing for each gene. After adjusting the p-values for multiple testing, we manually curate all significant genes to make sure they represent true positive alternative splicing events. You will get a PDF report including all graphs necessary for interpreting the detected splicing pattern, at both the transcript and the exon level, and both at the single-sample level (to visualize the deviation) and at the level of the means of your conditions. We also include gene-focused genomic plots showing the transcript model and the raw reads for visual examination. Example plots can be seen in the slider window above.
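The adjustment of p-values for multiple testing mentioned above can be illustrated with the Benjamini-Hochberg procedure. Choosing this procedure for the sketch is an assumption, since the text does not name a specific correction method. The Python function below is a plain re-implementation for illustration, and the p-values in the example are made up.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Go from the largest p-value to the smallest and enforce monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values for four genes.
raw = [0.01, 0.04, 0.03, 0.005]
print(benjamini_hochberg(raw))
```

Genes whose adjusted p-value falls below the chosen false discovery rate (for example 0.05) would then go on to manual curation.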