RNAseq Workshop

CSI Hands On Session
Daniel Gautheret
Using mapping-free software for efficient splice variant
1. Background

Recent mapping-free software such as Salmon (Patro, 2015) and Kallisto
(Bray, 2016) enable a very fast assesment of the expression level of
genes and transcripts isoforms. We will run a Unix pipeline based on
Kallisto to identify differentially expressed genes and splice
variants between two sets of human RNA-seq libraries. (requires basic
knowledge of Unix commands)
2. Required programs
R libraries:
wasabi (actually won’t be used but is called in the Sleuth.R script)
3. RNA-seq data

The RNA-seq data comes from a study of a epthithelium-mesenchymal
transition (EMT) model by Yang et al. (2016). EMT was induced by
ectopic expression of Zeb1 in a non small cell lung cancer cell line
(H358). The authors obtained RNA-seq data over a time course of 7 days
starting from uninduced cells and every day up to 7 days of induction.

Sequence libraries are polyA+, pair-end 2x100nt, each in biological
triplicate. Sequencing is performed on a Illumina HiSeq 2500.

Here we will compare “Day 0” (ie uninduced cells) with “Day 7”.

paper: http://www.ncbi.nlm.nih.gov/pubmed/?term=27044866
Yang Y, Park JW, Bebee TW, Warzecha CC, Guo Y, Shang X, Xing Y,
Carstens RP. Determination of a Comprehensive Alternative Splicing
Regulatory Network and Combinatorial Regulation by Key Factors during
the Epithelial-to-Mesenchymal Transition. Mol Cell
Biol. 2016. 36:1704-19

fastq files were obtained here: http://www.ncbi.nlm.nih.gov/sra?term=SRP066794
4. Read sampling

To make file manipulation easier, we sampled about 0.5% of the total
RNA-seq reads. All retained reads are from a single chromosome
(chr18), in order to retain statistical power for differential
expression tests. We mapped all reads against the HG19 human genome
using the STAR program, filtered reads mapping to chromosome 18 using
the grep command, and filtered out 50% of the remaining reads using

Typical read numbers (for library Day_0_1):

Initial fastq file: 72Mx2 reads
Reads mapping Chr18 (STAR mapping + grep on SAM file) : 685,000 x2 reads
Sampled by a factor of 0.5 (Samtools) : 343,000 x2 reads

This represents 0.5% of total reads, thus actual runtimes and space
requirement would be up to 200 times higher than in our exercices.
5. Preparing Transcript data

Kallisto needs a Fasta file containing all transcript sequences.
Transcript annotation can be retrieved as GTF, GFF or FASTA files

Gencode: http://www.gencodegenes.org/releases/current.html
UCSC: http://hgdownload.cse.ucsc.edu/downloads.html#human
Directly from the Kallisto web site: http://bio.math.berkeley.edu/kallisto/transcriptomes/

Conversion of GFF/GTF to FASTA format can be done in several ways:

1. Using the gffread program from the Cufflinks suite:
gffread -w transcripts.fa -g /path/to/genome.fa transcripts.gtf_or.gff

2. Using Bedtools:
bedtools getfasta -fi /path/to/genome.fa -bed /gff_or_gtf_or_bed -fo transcripts.fa

(gffread makes a better job of preserving the actual gene names into
the fasta file, however Gene to transcript relations are lost)

Note that you can add your own transcripts to the Fasta file and these
will be counted too! (you may for instance add specific viral RNAs or
non-coding RNAs not included in Refseq/Gencode annotations)
6. Running the Kallisto + Sleuth pipeline

The complete pipeline is named kallisto-sleuth.sh and should run
directly if programs are installed and directories for Fastq files and
Fasta files are correctly specified. The pipeline creates a “Output”
directory with results and error logs.

Run time is dependent on the size of the transcript andFastq
files. Kallisto converts transcript sequences into a k-mer index. The
size of the index and time to generate it are proportional to the
initial file size. For a full Gencode transcriptome (220k
transcripts), expect about 10min for generating a ~3Gb index. After
the index is created, quantification is performed in about 30min for a
complete ~50M reads RNA-seq library, and just a few seconds using our
sampled libraries.

Library orientation: option –rf-stranded (default) or –fr-stranded
determines alignment direction on the RNA strand. A wrong option will
cause alignment failure. This should be checked for each RNA-seq run
as different sequencing facilities use different library protocols.

The kallisto_*.err files in the Output/Logs directory provide useful
information on the kallisto pseudomapping. For instance:

incorrect alignment result:
[quant] processed 298,261 reads, 5,959 reads pseudoaligned

correct alignment result:
[quant] processed 298,261 reads, 256,885 reads pseudoaligned

The final Kallisto results are saved in Output/name_of_library/*
They contain files:
– abundance.tsv : a text file with counts for each transcript
– abundance.h5 : a binary file with counts, readable by Sleuth
– run_info.json : a summary of run stats

Field names in Kallisto tsv table:
– target_id – transcript name
– pval – pvalue
– qval – FDR adjusted pvalue using benjamini-hochberg
– b – the ‘beta’ value (analogous to fold change, though
technically a bias estimator which has to do with the
– se_b – the standard error of beta
– mean_obs – the mean of the observations. this is used for the
– var_obs – the variance of the observations
– tech_var – the technical variance from the bootstraps
– sigma_sq – the raw estimator of the variance once the tech_var
has been removed
– smooth_sigma_sq – the smooth regression fit for the shrinkage
– final_sigma_sq – max(sigma_sq, smooth_sigma_sq). this is the one
used for covariance estimation of beta (in addition to tech_var)
7. Analyzing Sleuth results

– Loading Sleuth results into R (simple text only version)

# read Sleuth result table
# as.is enables reading transcript names as strings,
# not as Factor variable (eg. male/female)
x=read.table(file=”Output/Sleuth/Sleuth_Results”, header=T, as.is=T)

# basic checks
x [1:10,]
options(“width”=200) # for wider window
x [1:10,]

# sorting the table

# Sort by ‘beta’ value (analogous to fold change)
x = x[order(x$b),]
x [1:10,]

# Sort by qval (Benjamini-Hochberg adjusted P-value) =default sort
x = x[order(x$qval),]

– Loading Sleuth results into R (Shiny version)

Retrieve the “Sleuth_Shyny.rds” file produced by the Sleuth session
and save it in your current directory.
Launch R, then:


# to list all sleuth functions:
sleuth_live(so) # launches web page with all Sleuth results

Sleuth displays to be checked:
– Diagnostics/ mean-variance plot
– Summaries/ design matrix
– Summaries/ fragment length distribution
– Summaries/ kallisto table
– Maps/ PCA
– Maps/ Sample heatmap
– Analysis/ Volcano Plot (change opacity, select dots)
– Analysis/ MA Plot (change opacity, select dots)

See also:
– Sleuth “getting started” guide:
– A good Sleuth video:

8. Verifying Sleuth Results with IGV

(requires installation of IGV on workstation)

Load one index BAM file for each condition each (1 DAY_0 and 1 DAY_7 file).
(We could load all 6 BAMs however the display would be crowded).

– Find gene CDH2 on chr18. This is one of the hallmark genes of EMT. It
should be overexpressed by a factor ~2.

– Find gene MBD1. 21 splice forms in the Sleuth output table.


### Static variables
# Fasta transcript file (file can be gzipped)
# Fastq file directory (files can be gzipped)
# Output directory
# Kallisto Index directory and file name (index will be created)
# log directory
# Basename of the samples (fastq files) for each condition
SAMPLIST1=(Day_0_1_chr18.sampled Day_0_2_chr18.sampled Day_0_3_chr18.sampled)
SAMPLIST2=(Day_7_1_chr18.sampled Day_7_2_chr18.sampled Day_7_3_chr18.sampled)
# Thread number
# Number of bootstrap for Kallisto
### Actual Jobs
# checking & building output dirs
if [[ -d ${OUTPUT_DIR} ]]; then
rm -r ${OUTPUT_DIR:?}/*
mkdir ${OUTPUT_DIR} 
mkdir ${LOG_DIR} ${IDX_DIR}
# Building Kallisto Index
# --make-unique is necessary when several transcripts have the same name
echo "Building Kallisto Index"
touch ${IDX_PATH}
kallisto index -i ${IDX_PATH} ${FASTA_PATH} --make-unique \
> "${LOG_DIR}/index.out" \
2> "${LOG_DIR}/index.err"
# Quantifying
# Option --rf-stranded (default) vs. --fr-stranded 
# determines ability to map on the correct RNA strand
for ITER in ${SAMPLIST[*]}; do
echo "Quantifying ${ITER}"
# Current output file
mkdir ${CURR_OUTPUT}
kallisto quant -i ${IDX_PATH} \
--bias \
--rf-stranded \
-t ${THREAD_NB} \
${READS_DIR}/${ITER}.R1.fastq.gz \
${READS_DIR}/${ITER}.R2.fastq.gz \
> "${LOG_DIR}/kallisto_${ITER}.out" \
2> "${LOG_DIR}/kallisto_${ITER}.err"
# SLEUTH Differential expression
echo "Performing Differential Expression"
# Name/path of Sleuth output directory
mkdir ${SLEUTH_PATH}
# Basenames of the samples for Sleuth 
# 1 list per condition
TEMP=( "${SAMPLIST1[@]/#/${OUTPUT_DIR}/}" )
TEMP=( "${TEMP[@]/%//abundance.h5}" )
TEMP=( "${SAMPLIST2[@]/#/${OUTPUT_DIR}/}" )
TEMP=( "${TEMP[@]/%//abundance.h5}" )
./Sleuth.R -w ${SLEUTH_PATH} \
-b \
-c "$(echo ${BASE_PATH1} | tr ' ' ',')" \
-d "$(echo ${BASE_PATH2} | tr ' ' ',')" \
> "${LOG_DIR}/Sleuth.out" \
2> "${LOG_DIR}/Sleuth.err"