RNA-Seq Data Analysis: An abstract Guide

shuaibKhassin 8 views 28 slides Feb 28, 2025

About This Presentation

An overview of RNA-Seq data analysis, covering the process of analyzing raw sequencing data to identify gene expression patterns and biological insights. It includes key steps, tools, and methods used in transcriptomics research, offering an abstract guide for researchers and students.


Slide Content

RNA-Seq Data Analysis

RNA-Seq Replaced Microarrays

RNA-Seq Analysis
●RNA-seq is a high-throughput sequencing technology used to measure gene expression levels and identify differentially expressed genes.
●It provides a comprehensive view of the transcriptome, allowing researchers to study gene expression, alternative splicing, and transcript isoforms.
●Workflow:
○Sample Preparation
○Library construction
○Sequencing
○Data Analysis

Prerequisites
System
●Linux or another Unix-like operating system: macOS has a built-in Unix terminal; for Windows users, the simplest way is to install the Ubuntu LTS app.
It provides a command-line interface (CLI) for executing complex data analysis tasks and automating workflows.
●Anaconda: software package manager; installs Python 3 and helps install the other tools.
●RStudio: a user-friendly interface with various tools and features to facilitate data analysis, visualization, and statistical modeling.

Tools used in sequence processing, step by step
1.Quality control - FastQC, MultiQC
2.Filtering - Trimmomatic, Cutadapt, NGS Toolkit
3.Mapping - Bowtie, HISAT2, TopHat2
4.Visualization of alignment - SAMtools, BAM files
5.Quantification - htseq-count, featureCounts, StringTie, RSEM
6.Differential expression - DESeq2, edgeR, limma
7.Functional analysis - Enrichr, GSEA, GOstats

FASTQ File Format
Each read is represented by 4 lines:
1.@ followed by the read ID
2.The sequence
3.A + sign, optionally followed by the same read ID
4.The Phred quality scores for each base call in the sequence line (same length as the sequence)
A typical sequencing run with 400,000,000 reads will generate a file containing 1.6 billion lines of data.
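
The four-line record layout can be illustrated with a minimal Python sketch. The function name `read_fastq` and the two example reads are made up for illustration; real pipelines would normally rely on an established parser rather than this hand-written one.

```python
import io

def read_fastq(handle):
    """Yield (read_id, sequence, quality) from a 4-line-per-read FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        # Line 1 must start with '@', line 3 with '+'.
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        # Line 4 must be exactly as long as line 2.
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield header[1:], sequence, quality

# Two hypothetical reads, used only for illustration.
example = io.StringIO(
    "@read1\nGATTACA\n+\nIIIIIII\n"
    "@read2\nACGT\n+read2\n!!5I\n"
)
records = list(read_fastq(example))
for read_id, seq, qual in records:
    print(read_id, seq, qual)

# Four lines per read: 400,000,000 reads give 1.6 billion lines.
lines_for_400m_reads = 4 * 400_000_000
print(lines_for_400m_reads)
```

Because the function is a generator, it processes one record at a time, which matters for files with hundreds of millions of reads.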

Phred Quality scores
●Phred quality scores measure base call accuracy in sequencing.
●They typically range from 0 to 40, with higher scores indicating higher quality and lower error probability.
●Scores are logarithmic and often represented using ASCII characters for storage and visualization.
●The Phred score (Q) is related to the probability (P) of an incorrect base call as follows: $Q = -10 \log_{10} P$, so Q = 20 corresponds to a 1 in 100 chance of error and Q = 30 to 1 in 1,000.
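
As a small worked example, the standard relation Q = -10 log10(P) can be inverted to P = 10^(-Q/10). The Python sketch below decodes an ASCII quality string and converts each score to an error probability. It assumes the Phred+33 encoding (ASCII code minus 33); the function names are illustrative.

```python
def decode_quality(quality_string, offset=33):
    """Convert an ASCII quality string to Phred scores (assumes Phred+33)."""
    return [ord(ch) - offset for ch in quality_string]

def phred_to_error_prob(q):
    """Invert Q = -10 * log10(P) to get the error probability P."""
    return 10 ** (-q / 10)

scores = decode_quality("!5I")  # '!' is ASCII 33, '5' is 53, 'I' is 73
probs = [phred_to_error_prob(q) for q in scores]
for q, p in zip(scores, probs):
    print(f"Q={q:2d}  P(error)={p:g}")
```

A score of 40 therefore means roughly one expected error in 10,000 base calls.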

FastQC
Sequence Quality Analysis:
Evaluates base quality scores to identify potential sequencing errors and low-quality regions.
Assesses sequence content and GC distribution to detect biases or contamination.

Adapter and Contaminant Detection:
Identifies adapter sequences that may affect downstream analysis.
Detects overrepresented sequences and potential contaminants present in the data.

Sequence Length Distribution:
Analyzes the length distribution of sequences to ensure consistency and identify any anomalies
or outliers.

Per-Base and Per-Sequence Quality Scores:
Generates plots and statistics to assess the quality scores of each base and the overall quality
across sequences.
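
To make the per-base quality idea concrete, here is a minimal Python sketch that averages Phred scores at each read position. It is not FastQC's implementation, only an illustration of the kind of statistic behind such a plot; it assumes Phred+33 quality strings, and the function name is hypothetical.

```python
def per_base_mean_quality(quality_strings, offset=33):
    """Mean Phred score at each read position (assumes Phred+33 encoding)."""
    totals, counts = [], []
    for qual in quality_strings:
        for pos, ch in enumerate(qual):
            if pos == len(totals):  # first read that reaches this position
                totals.append(0)
                counts.append(0)
            totals[pos] += ord(ch) - offset
            counts[pos] += 1
    return [t / c for t, c in zip(totals, counts)]

# Two hypothetical reads of different length: 'I' is Q40, '5' is Q20.
means = per_base_mean_quality(["II5", "I5"])
print(means)
```

Reads of different lengths are handled by counting, at each position, only the reads that are long enough to cover it.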


Trimmomatic
trimmomatic SE -phred33 SRR00001.fastq SRR00001_trimmed.fastq \
  ILLUMINACLIP:/mnt/d/Tools/Trimmomatic/Trimmomatic-0.39/adapters/TruSeq3-SE.fa:2:30:10 \
  LEADING:3 TRAILING:3 SLIDINGWINDOW:4:20 MINLEN:36
●Based on the results of FastQC, one might want to filter and trim the sequences.

●Trimming helps remove low-quality reads, adapter sequences, and other artifacts, improving data quality for downstream analysis.

●To remove adapter contamination, we pass the adapter file provided by Illumina (here TruSeq3-SE.fa) to the ILLUMINACLIP step.
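
The quality-trimming steps in the command above (LEADING:3, TRAILING:3, SLIDINGWINDOW:4:20, MINLEN:36) can be sketched in Python as follows. This is a simplified illustration, not Trimmomatic's exact algorithm: it cuts the read at the start of the first window whose mean quality falls below the threshold, and it omits adapter clipping. The function name and the example read are made up.

```python
def trim_read(seq, quals, leading=3, trailing=3,
              window=4, min_window_q=20, min_len=36):
    """Simplified quality trimming; returns (seq, quals) or None if too short."""
    # LEADING / TRAILING: strip bases below the threshold from each end.
    start = 0
    while start < len(quals) and quals[start] < leading:
        start += 1
    end = len(quals)
    while end > start and quals[end - 1] < trailing:
        end -= 1
    seq, quals = seq[start:end], quals[start:end]

    # SLIDINGWINDOW: scan from the 5' end and cut where the window mean drops.
    cut = len(quals)
    for i in range(max(len(quals) - window + 1, 0)):
        if sum(quals[i:i + window]) / window < min_window_q:
            cut = i
            break
    seq, quals = seq[:cut], quals[:cut]

    # MINLEN: discard reads that ended up too short.
    if len(seq) < min_len:
        return None
    return seq, quals

# Hypothetical 10-base read; min_len is lowered so the short example survives.
example_quals = [2, 30, 30, 30, 30, 30, 10, 10, 10, 2]
trimmed = trim_read("ACGTACGTAC", example_quals, min_len=4)
dropped = trim_read("ACGTACGTAC", example_quals)  # default MINLEN of 36
print(trimmed, dropped)
```

In the example, the first and last bases are removed by LEADING and TRAILING, and the low-quality tail is removed by the sliding window.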

Alignment

●Reference-based Alignment:
The most common approach is aligning RNA-seq reads to a reference genome using alignment algorithms such as STAR, HISAT2, or Bowtie.
Reference-based alignment allows for precise mapping of reads to known genomic regions, including exons, introns, and splice junctions.

●Transcriptome-based Alignment:
In transcriptome-based alignment, RNA-seq reads are aligned (or pseudo-aligned) to a reference transcriptome, such as one from RefSeq or Ensembl, using tools like Salmon or Kallisto.
Transcriptome alignment enables fast quantification of transcript-level expression, but it is limited to annotated transcripts; identifying novel transcripts requires alignment to the genome or transcript assembly.

Bowtie and BWA use the Burrows-Wheeler transform (BWT), which stores the reference genome in a highly compressed form. Through a special indexing scheme called the Ferragina-Manzini (FM) index, the compressed genome can be searched efficiently without being decompressed.
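
A toy Python sketch of the BWT and an FM-index backward search is given below. It is meant only to illustrate the idea; real aligners build these structures with far more memory-efficient methods, and the function names and example sequence are made up for illustration.

```python
from collections import Counter

def bwt(text):
    """Burrows-Wheeler transform via sorted rotations (fine for tiny inputs)."""
    text += "$"  # end marker, sorts before A, C, G, T
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rot[-1] for rot in rotations)

def build_fm_index(bwt_str):
    """Return (first_pos, occ) tables used by backward search."""
    counts = Counter(bwt_str)
    # first_pos[c]: number of characters in the text that sort before c
    first_pos, total = {}, 0
    for c in sorted(counts):
        first_pos[c] = total
        total += counts[c]
    # occ[c][i]: number of occurrences of c in bwt_str[:i]
    occ = {c: [0] * (len(bwt_str) + 1) for c in counts}
    for i, ch in enumerate(bwt_str):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (ch == c)
    return first_pos, occ

def count_matches(pattern, bwt_str, first_pos, occ):
    """Count exact occurrences of pattern using FM-index backward search."""
    lo, hi = 0, len(bwt_str)  # half-open range of matching sorted rotations
    for c in reversed(pattern):
        if c not in first_pos:
            return 0
        lo = first_pos[c] + occ[c][lo]
        hi = first_pos[c] + occ[c][hi]
        if lo >= hi:
            return 0
    return hi - lo

genome = "ACGTACGTGACG"  # hypothetical toy reference
transformed = bwt(genome)
first_pos, occ = build_fm_index(transformed)
print(count_matches("ACG", transformed, first_pos, occ))
```

The search processes the pattern from its last character to its first, narrowing the range of matching rotations at each step, so the cost depends on the read length rather than on the genome length.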

HISAT (Hierarchical Indexing for Spliced Alignment of Transcripts) employs two types of indexes for alignment:
●A global, whole-genome index and tens of thousands of small local indexes, both built with the same BWT/FM index algorithm as Bowtie
●Quality-Aware Alignment
●TopHat

HTSeq

Differential gene expression analysis from assembled gene expression
1. Launch RStudio and load the necessary library.

> library("DESeq2")

2. Create the necessary data objects.

> # sample names sort to MT1, MT2, MT3, WT1, WT2, WT3
> sample.names <- sort(paste(c("MT", "WT"), rep(1:3, each=2), sep=""))
> # each count file is expected at ../<sample>/<sample>.count.txt
> file.names <- paste("../", sample.names, "/", sample.names, ".count.txt", sep="")
> conditions <- factor(c(rep("MT", 3), rep("WT", 3)))
> sampleTable <- data.frame(sampleName=sample.names,
                            fileName=file.names,
                            condition=conditions)
> # read in the HTSeq count data
> ddsHTSeq <- DESeqDataSetFromHTSeqCount(sampleTable=sampleTable,
                                         directory=".",
                                         design=~ condition)

Yalamanchili et al., 2017
Data analysis pipeline for
RNA-seq experiments

3. Run the differential gene expression analysis.
> # keep only genes with more than 10 reads in total across all samples
> ddsHTSeq <- ddsHTSeq[rowSums(counts(ddsHTSeq)) > 10, ]
> dds <- DESeq(ddsHTSeq)

4. Quality checks on the samples.

> library("gplots")  # provides heatmap.2()
> rld <- rlogTransformation(dds, blind=FALSE)
> # Plot PCA plot
> plotPCA(rld, intgroup="condition", ntop=nrow(counts(ddsHTSeq)))

> # Plot correlation heatmap
> cU <- cor(as.matrix(assay(rld)))
> # one side-bar color per sample, chosen by condition
> cols <- c("dodgerblue3", "firebrick3")[rld$condition]
> heatmap.2(cU, symm=TRUE, col=colorRampPalette(c("darkblue", "white"))(100),
            labCol=colnames(cU), labRow=colnames(cU),
            distfun=function(c) as.dist(1 - c), trace="none", Colv=TRUE,
            cexRow=0.9, cexCol=0.9, key=FALSE, font=2,
            RowSideColors=cols, ColSideColors=cols)

PCA plot (a) and heatmap of correlation coefficients between samples (b), based on gene expression profiles of the six samples.

5. Output differential gene analysis results
> res <- results(dds, contrast=c("condition", "MT", "WT"))
> # mean normalized count per gene within each condition
> grp.mean <- sapply(levels(dds$condition), function(lvl)
      rowMeans(counts(dds, normalized=TRUE)[, dds$condition == lvl]))

> norm.counts <- counts(dds, normalized=TRUE)
> all <- data.frame(res, grp.mean, norm.counts)
> write.table(all, file="DESeq2_all_rm.txt", sep="\t")

6. Generate figures on significantly differentially expressed genes (DEGs)

Results of differential gene expression analysis.

MA plot (a) and expression heatmap of the DEGs (adjusted P < 0.01) (b).
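
The "adjusted P" threshold refers to p-values corrected for multiple testing. DESeq2 uses the Benjamini-Hochberg procedure by default, and a minimal Python sketch of that procedure is shown below. The function name and the four p-values are made up for illustration; in practice the adjusted values come directly from the DESeq2 results table.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])  # indices by ascending p
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonic adjusted values.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical raw p-values for four genes.
raw = [0.01, 0.04, 0.03, 0.005]
padj = benjamini_hochberg(raw)
significant = [i for i, p in enumerate(padj) if p < 0.03]
print(padj, significant)
```

Each raw p-value is scaled by the number of tests divided by its rank, which controls the expected fraction of false positives among the genes called significant.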

Thank You