- Design principles
- Creating and processing MISO output with
- Creating custom GFF annotations for MISO
- Frequently Asked Questions (FAQ)
rnaseqlib is a simple, lightweight pipeline for RNA-Seq analysis.
- Pipeline for analyzing RNA sequencing data:
- Maps reads to genome and splice junctions
- Computes basic quality control statistics
- Outputs RPKM values for genes
- Supports several sequencing experiments:
- mRNA sequencing (mRNA-Seq)
- Ribosome profiling data (Ribo-Seq)
- CLIP for RNA-binding proteins (CLIP-Seq)
- SELEX-Seq (Bind-n-Seq)
- Contains utilities for processing MISO output
rnaseqlib follows three simple design principles. It is intended to be:
- Simple: provides minimalistic support for RNA-Seq. Performs only simple computations that are applicable to nearly all experiments – complexities that are specific to certain experiments/libraries are left as post-processing steps for the user.
- Lightweight: minimal dependencies. Relies mostly on Python and commonly used genomic packages (such as Bedtools), to avoid software bloat and complex installation.
- Compact: produces and consumes compressed files, so that it can be used in projects with hundreds of samples.
rnaseqlib from the GitHub repository (http://github.com/yarden/rnaseqlib). To install with distribute‘s
For local installation with distribute, use:
easy_install --user -U rnaseqlib
If you don’t have
easy_install, install it as an ordinary Python package:
python setup.py install
Or for local installation:
python setup.py install --prefix=/your/local/dir
Requires Bedtools and Samtools, and several commonly used Python modules (like pandas, all of which are installed automatically using the Python package manager.) The initialization step requires Jim Kent’s genePredToGtf utility for processing UCSC genome browser tables.
There are two steps to running
rnaseqlib: First, creating a set of initialization
files for your genome (called an RNA Base) – this is done once per genome.
Second, writing a configuration file that describes your samples and library parameters,
which can then be used to run the pipeline.
Initializing an RNA base for your genome¶
The first step is to create a set of files, called an RNA Base,
required to run
rnaseqlib for a particular genome. For example,
rna_pipeline.py --init mm9 --output-dir pipeline_init
This will create a directory called
pipeline_init containing the
necessary files for running the pipeline on the mm9 genome.
The initialization procedure will, among other things, do the following:
- Download the genome sequence files from UCSC (in FASTA format)
- Download gene tables from UCSC for Ensembl genes, UCSC knownGenes, and RefSeq genes.
- Compute the coordinates for exons, introns, constitutive exons, constitutive exons in coding regions, and other useful features of gene tables
- Index the genome files using
For the mouse and human genomes, sequences of ribosomal RNA (rRNA) are automatically
downloaded from NCBI and built into the Bowtie index as
chrRibo. The pipeline
chrRibo in later steps to filter out rRNA reads and measure the level
of rRNA contamination in libraries.
Configuration: specifying your data and mapping parameters¶
The settings of the RNA-Seq pipeline are specified through a single settings file that contains four main sections:
pipeline: what data type is used (e.g. RNA-Seq, Ribo-Seq, etc.)
pipeline-files: where the pipeline initialization files are stored
mapping: parameters related to mapping of the data (e.g. what mapper to use, where the genome index is.)
data: where the input sequence files are and where the results should be outputted
The following is an example settings file for a set of mRNA-Seq samples:
## ## Pipeline settings for RNA-Seq ## [pipeline] # Data type to use (e.g. 'rnaseq' or 'riboseq') data_type = rnaseq [pipeline-files] init_dir = ./mm9 [mapping] mapper = tophat bowtie_path = bowtie tophat_path = tophat tophat_index = mm9_index cluster_type = bsub genome = mm9 tophat_options = --bowtie1 --min-anchor-length 4 tophat_gtf = ./mm9/ucsc/knownGene.gtf readlen = 40 overhanglen = 4 paired = True paired_end_frag = 300 stranded = fr-first [data] indir = ./input_dir outdir = ./results sequence_files = [ ["sample1_p1.fastq.gz", "sample1_p1"], ["sample1_p2.fastq.gz", "sample1_p2"], ["sample2_p1.fastq.gz", "sample2_p1"], ["sample2_p2.fastq.gz", "sample2_p2"]] sample_groups = [["sample1", ["sample1_p1", "sample1_p2"]], ["sample2", ["sample2_p1", "sample2_p2"]]]
Running the pipeline¶
To run the pipeline, use the
rna_pipeline.py --run --settings ./settings.txt --output-dir ./my_results
settings.txt is the pipeline settings file and
my_results is a
directory where the pipeline output should go.
An example settings file and small FASTQ files for an mRNA-Seq dataset are available
examples/rnaseq directory of the pipeline. The paths in the settings
examples/rnaseq/rnaseq_settings.txt need to be edited to reflect the paths of
various files on your own filesystem (e.g.
init_dir needs to be set to the
location of your RNA base.) The rawdata files for the examples are two paired-end
mRNA-Seq samples are available in the
run_pipeline.py is the main driver script of the pipeline.
It takes the following main arguments:
--run Run pipeline. --settings Settings filename. --init Initialize the pipeline. Takes as input a genome, e.g. mm9 or hg18. --output-dir Output directory.
Settings file features and options¶
Below are the main parameters to be used in the
rnaseqlib settings file.
Parameters that have default values are considered optional and can be omitted
from the file.
mapper: The mapper to use (currently, only
bowtie_path: Path to
bowtieprogram (optional). Default is
bowtiemust be on your path.
tophat_path: Path to
tophatprogram (optional). Default is
tophatmust be on your path.
tophat_index: Path to genome index that should be used by
tophatmapping. This index is typically built by
bowtie-buildand is automatically made as part of the RNA Base using the
cluster_typesettings under the
[mapping]section of the settings file specifies how distinct samples should be processed. It can be set to one of three values:
bsub: Run distinct samples as separate jobs on a cluster using
qsub: Same as
bsub, but using the
none: Run the pipeline using multiple cores on the local computer (does not make use of a cluster.)
genome: Genome to be used, e.g.
tophat_options: String of command-line options to be passed to
tophatwhen it is invoked for mapping. E.g.
tophat_options = --bowtie1 --min-anchor-length 4, to signal to Tophat to use
bowtie1for mapping and use a minimum of 4 base overhang for junction reads. This string is appended to the Tophat call and so must contain valid Tophat arguments for the call to succeed.
tophat_gtf: GTF file of known gene models to be used with Tophat (optional). Used to tell Tophat about known junctions.
readlen: Read length of input BAM files.
overhanglen: Overhang restriction on junctions (optional). Default is 1.
paired: Whether the data is paired-end or not. If paired-end, specify
True, if single-end, specify
paired_end_frag: Average fragment length (optional). Only used for paired-end runs. Used internally as an argument to Tophat to specify the expected fragment length.
stranded: If data set is strand-specific, specify the strand convention (optional). Uses the same strand conventions as Tophat (e.g.
Creating and processing MISO output with
Section under construction
misowrap is a utility for running MISO and processing its output for a set of samples.
It takes a configuration file as described above that lists a set of BAM files and their sample
labels, and automatically runs MISO on these samples and generates pairwise comparisons between them.
It can also create a set of filtered events, that are selected to meet read coverage criteria. Filtered events are outputted to a table containing various gene features from the UCSC tables generated by the pipeline’s RNA base.
Example MISO pipeline¶
The following pipeline steps start with a settings file describing a set of samples and run MISO on the samples, generating summary files and comparisons for relevant samples, and combining the results in unfiltered and filtered (based on read counts) form.
- Run MISO on samples:
misowrap.py run misowrap_settings.txt miso_output/
The raw MISO output will be placed in the directory defined in
misowrap_settings.txt and the logs of the run will be placed in
- Once run is completed, summarize samples:
misowrap.py summarize misowrap_settings.txt miso_output/
This summarizes samples and places summary files for each sample in relevant place as inferred by sample organization in
misowrap_settings.txt. Logs are placed in
3. Compare samples to each other. Uses the
sample_groups parameter in settings file to determine which
pairwise comparisons to run.
misowrap.py compare misowrap_settings.txt miso_output/
This compare samples to each other and places the output of comparisons in a
comparisons subdirectory of the MISO output directory given in
misowrap_settings.txt. Logs are placed in the
- Filter comparisons based on reads counts, using the count filters defined for each event type in
misowrap.py filter-comparisons misowrap_settings.txt miso_output/
This creates filtered versions of the comparison
.miso_bf files and places them in the directory
filtered_events in the MISO comparisons directory (which is the
comparisons directory in the MISO output directory given in
misowrap_settings.txt). For example:
- MISO output directory (defined by settings file) - comparisons/ - filtered_events/ - SE/ - SE samples here... - ... - RI/ - RI samples here... - ...
- Combine comparisons into single files and place them in the directory where our comparisons are. Takes the
comparisonsdirectory (which contains MISO comparisons) and the settings file
misowrap.py combine-comparisons miso_output/comparisons/ misowrap_settings.txt –output-dir miso_output/comparisons/
This outputs single files that contain information about events pooled from all the samples described in the settings file. It also creates a filtered version of these combined events where the read filters (defined for each event) in
misowrap_settings.txt are applied to the samples. The resulting output structure is:
- MISO output directory (defined by settings file) - comparisons/ - combined_comparisons/ # combined comparisons (unfiltered) - filtered_events/ # filtered events - combined_comparisons/ # combined comparisons (filtered) - SE.miso_bf # combined, filtered comparisons for all SE events - RI.miso_bf # combined, filtered comparisons for all RI events - ...
Creating custom GFF annotations for MISO¶
rnaseqlib has a set of scripts in the
gff module that can generate a GFF annotation of exon-centric alternative events which can be used by MISO for quantitation using RNA-Seq. Given a set of transcripts in the UCSC genePred format, these scripts will build a “splice graph” representation of how each exon in the transcript can be spliced, and then produce a set of possible events (categorized into alternative event types) by traversing this graph. For example, suppose we have a genePred table called
$ head -n 3 ensGene.txt 585 ENST00000619216.1 chr1 - 17368 17436 17368 17368 1 17368, 17436, 0 MIR6859-2 none none -1, 585 ENST00000473358.1 chr1 + 29553 31097 29553 29553 3 29553,30563,30975, 30039,30667,31097, 0 MIR1302-11 none none -1,-1,-1, 585 ENST00000469289.1 chr1 + 30266 31109 30266 30266 2 30266,30975, 30667,31109, 0 MIR1302-11 none none -1,-1,
We can produce a set of exon-centric GFF events from this table using the
gff_make_annotation script from
$ gff_make_annotation ./ ./gff --flanking-rule commonshortest --genome-label hg38 Making GFF alternative events annotation... - UCSC tables read from: ./ - Output dir: ./gff Loaded 1 UCSC tables. Loading tables... Populating graph... Adding splice edges from table ensGene.txt Populating graph took 37.50 seconds Reading table ./ensGene.txt Generating skipped exons (SE) Generating mutually exclusive exons (MXE) Generating alternative 3' splice sites (A3SS) Generating alternative 5' splice sites (A5SS) Outputting retained introns... Defining retained introns (RI) Took 1.60 minutes to make the annotation.
The call to
gff_make_annotation says: load all genePred tables in the current directory (
./) and output the set of GFF events into the directory
--flanking-shortest parameter specifies how to pick the flanking exons of an alternative event when there are multiple options (e.g. which flanking exons to use for an alternatively skipped exon.) In the above call, we chose to the take the common shortest region as flanking exons. The
--genome-label option specifies what the name of the GFF annotation should be. Note that
rnaseqlib will only load genePred tables if they are named
knownGene.txt. Any genePred files with these names that are in the input directory will be parsed and used to populate the splice graph, so that information from multiple annotation genePred tables can be used to make the GFF events.
This call should generate an output directory with a GFF file for each of the event types:
$ ls ./gff/commonshortest/ A3SS.hg38.gff3 A5SS.hg38.gff3 MXE.hg38.gff3 RI.hg38.gff3 SE.hg38.gff3
Frequently Asked Questions (FAQ)¶
Section under construction
- Noah Spies (MIT)
- Eric Wang (MIT)