RNA-Seq Data Analysis: An abstract Guide

shuaibKhassin 8 views 28 slides Feb 28, 2025


About This Presentation

An overview of RNA-Seq data analysis, covering the process of analyzing raw sequencing data to identify gene expression patterns and biological insights. It includes the key steps, tools, and methods used in transcriptomics research, offering an abstract guide for researchers and students.


Slide Content

RNA-Seq Data Analysis

RNA-Seq Replaced Microarrays

RNA-Seq Analysis
●RNA-Seq is a high-throughput sequencing technology used to measure gene
expression levels and identify differentially expressed genes.
●It provides a comprehensive view of the transcriptome, allowing researchers to
study gene expression, alternative splicing, and transcript isoforms.
●Workflow:
○Sample preparation
○Library construction
○Sequencing
○Data analysis

Prerequisites
System
●Unix-like command line: built into macOS (Terminal) and Linux; for Windows users, the simplest
way is to install the Ubuntu LTS app.
It provides a command-line interface (CLI) for executing complex data
analysis tasks and automating workflows.
●Anaconda: software package manager. It installs Python 3 and manages the installation of
other tools.
●RStudio: user-friendly interface with various tools and features to
facilitate data analysis, visualization, and statistical modeling.

Tools used at each step
of sequence processing
1.Quality control - FastQC, MultiQC
2.Filtering - Trimmomatic, Cutadapt, NGS Toolkit
3.Mapping - Bowtie, HISAT2, TopHat2
4.Visualization of alignment - SAMtools (SAM/BAM
files)
5.Quantification - htseq-count, featureCounts,
StringTie, RSEM
6.Differential expression - DESeq2, edgeR, limma
7.Functional analysis - Enrichr, GSEA, GOstats

FASTQ File Format
Each read is represented by 4 lines:
1.@ followed by the read ID
2.The sequence
3.A line beginning with a + sign, optionally followed by the read ID
4.The Phred quality scores for each base call in the
sequence line (same length as the sequence)
A sequencing run with
400,000,000 reads will therefore generate a
file containing 1.6 billion lines of
data (4 lines per read).
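
The four-line layout described above can be sketched as a small parser. This is a minimal Python illustration, not part of the original slides; the names `FastqRecord` and `read_fastq` are hypothetical, and the two example reads are made up for demonstration.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    read_id: str
    sequence: str
    quality: str


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield one record for every group of four lines."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file (or a trailing blank line)
            return
        sequence = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        # Line 1 must start with '@', line 3 with '+'
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record near {header!r}")
        # Line 4 must have one quality character per base
        if len(sequence) != len(quality):
            raise ValueError(f"length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


# Two made-up reads; the second repeats its ID after the '+'
example = io.StringIO(
    "@read1\nGATTACA\n+\nIIIIHHG\n"
    "@read2\nACGT\n+read2\n!!5I\n"
)
records = list(read_fastq(example))

# Four lines per read: 400 million reads give 1.6 billion lines
lines_for_run = 4 * 400_000_000
```

In practice an established parser (for example from Biopython) would normally be preferred; the sketch only shows why the line count is four times the read count.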

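Phred Quality Scores
●Phred quality scores measure base call accuracy in sequencing.
●On Illumina platforms they typically range from 0 to about 40, with higher scores indicating higher quality and lower error
probability.
●Scores are logarithmic and are stored as ASCII characters (with -phred33, the character code minus 33 gives the score).
●The Phred score (Q) is related to the probability (P) of an incorrect base call as follows:
Q = -10 × log10(P), equivalently P = 10^(-Q/10). For example, Q = 20 means a 1 in 100 chance of error and Q = 30 a 1 in 1,000 chance.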
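
The relationship between Q and P, and the ASCII encoding of scores, can be illustrated with a few lines of Python. This is a sketch for orientation only; the function names are hypothetical, and the offset of 33 assumes the Phred+33 encoding (the `-phred33` option used later in the slides).

```python
import math


def phred_to_error_prob(q: float) -> float:
    """P = 10^(-Q/10)"""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Q = -10 * log10(P)"""
    return -10 * math.log10(p)


def decode_quality(quality: str, offset: int = 33) -> list[int]:
    """Convert a FASTQ quality string to a list of Phred scores."""
    return [ord(ch) - offset for ch in quality]


# '!' is ASCII 33, '5' is ASCII 53, 'I' is ASCII 73
scores = decode_quality("!5I")
```

With these definitions, `scores` holds the Phred values 0, 20 and 40, and `phred_to_error_prob(20)` is approximately 0.01.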

FastQC
Sequence Quality Analysis:
Evaluates base quality scores to identify potential sequencing errors and low-quality regions.
Assesses sequence content and GC distribution to detect biases or contamination.

Adapter and Contaminant Detection:
Identifies adapter sequences that may affect downstream analysis.
Detects overrepresented sequences and potential contaminants present in the data.

Sequence Length Distribution:
Analyzes the length distribution of sequences to ensure consistency and identify any anomalies
or outliers.

Per-Base and Per-Sequence Quality Scores:
Generates plots and statistics to assess the quality scores of each base and the overall quality
across sequences.
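
Two of the summaries described above, per-base quality and GC content, can be sketched in Python as follows. This is a simplified illustration of the idea, not FastQC's actual implementation; the function names are hypothetical and Phred+33 encoding is assumed.

```python
def per_base_mean_quality(qualities: list[str], offset: int = 33) -> list[float]:
    """Mean Phred score at each read position (reads may differ in length)."""
    totals: list[int] = []
    counts: list[int] = []
    for qual in qualities:
        for i, ch in enumerate(qual):
            if i == len(totals):  # first read to reach this position
                totals.append(0)
                counts.append(0)
            totals[i] += ord(ch) - offset
            counts[i] += 1
    return [t / c for t, c in zip(totals, counts)]


def gc_fraction(sequence: str) -> float:
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    seq = sequence.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


# Two made-up quality strings: 'I' encodes Q40 and '5' encodes Q20
mean_quality = per_base_mean_quality(["II5", "I5"])
```

A drop in the per-position mean towards the 3' end of reads is the kind of pattern that motivates the trimming step.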


Trimmomatic
trimmomatic SE -phred33 SRR00001.fastq SRR00001_trimmed.fastq \
  ILLUMINACLIP:/mnt/d/Tools/Trimmomatic/Trimmomatic-0.39/adapters/TruSeq3-SE.fa:2:30:10 \
  LEADING:3 TRAILING:3 SLIDINGWINDOW:4:20 MINLEN:36
●Based on the results of FastQC, one might want to filter and trim the sequences.

●Trimming removes low-quality reads, adapter sequences, and other artifacts, improving
data quality for downstream analysis.

●To remove adapter contamination, we pass the adapter file supplied by Illumina (here TruSeq3-SE.fa) to ILLUMINACLIP.
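
The quality-related steps in the command above (LEADING, TRAILING, SLIDINGWINDOW, MINLEN) can be sketched in Python to show what each threshold does. This is a simplified illustration under stated assumptions, not Trimmomatic's actual code: adapter clipping is omitted, the function name `trim_read` is hypothetical, and the exact cut position chosen by Trimmomatic may differ from this sketch.

```python
from typing import Optional


def trim_read(seq: str, quals: list[int], leading: int = 3, trailing: int = 3,
              window: int = 4, min_avg: float = 20.0,
              min_len: int = 36) -> Optional[tuple[str, list[int]]]:
    """Return the trimmed read, or None if it ends up shorter than min_len."""
    # LEADING: drop bases from the start while their quality is below the threshold
    start = 0
    while start < len(quals) and quals[start] < leading:
        start += 1
    # TRAILING: drop bases from the end while their quality is below the threshold
    end = len(quals)
    while end > start and quals[end - 1] < trailing:
        end -= 1
    seq, quals = seq[start:end], quals[start:end]
    # SLIDINGWINDOW: cut at the first window whose mean quality is below min_avg
    keep = len(quals)
    for i in range(max(len(quals) - window + 1, 0)):
        if sum(quals[i:i + window]) / window < min_avg:
            keep = i
            break
    seq, quals = seq[:keep], quals[:keep]
    # MINLEN: discard reads that became too short
    if len(seq) < min_len:
        return None
    return seq, quals


# Made-up 10-base read with a poor first base and a low-quality tail
example_quals = [2, 30, 30, 30, 30, 30, 30, 10, 10, 10]
trimmed = trim_read("ACGTACGTAC", example_quals, min_len=3)
dropped = trim_read("ACGTACGTAC", example_quals)  # default MINLEN of 36
```

The lowered `min_len=3` in the first call is only there so that the short demonstration read survives; with the slide's MINLEN:36 the same read is discarded.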

Alignment

●Reference-based Alignment:
The most common approach is aligning RNA-seq reads to a reference genome using
alignment algorithms such as STAR, HISAT2, or Bowtie.
Reference-based alignment allows for precise mapping of reads to known genomic
regions, including exons, introns, and splice junctions.

●Transcriptome-based Alignment:
In transcriptome-based alignment, RNA-seq reads are mapped to a reference
transcriptome, such as RefSeq or Ensembl, using tools like Salmon or Kallisto.
Transcriptome alignment enables the quantification of transcript-level expression.
Because reads are matched only against annotated transcripts, it cannot by itself
identify novel transcripts; that requires alignment to the genome.

Bowtie and BWA use a data structure based on the Burrows-Wheeler transform (BWT),
which stores the reference genome in a highly compressed form. Through the use of a
special indexing scheme called the Ferragina-Manzini (FM) index, reads can be searched
against the compressed genome directly, without decompressing it.
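
The idea of searching a BWT through an FM index can be sketched on a toy string. This Python illustration is a teaching sketch only: it builds the suffix array naively and stores full occurrence tables, whereas real aligners such as Bowtie and BWA use far more compact, sampled structures. All function names are hypothetical.

```python
def bwt(text: str) -> tuple[str, list[int]]:
    """Return the BWT of text + '$' and the suffix array (naive construction)."""
    text += "$"  # sentinel that sorts before A, C, G, T
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # The BWT character is the one preceding each sorted suffix
    return "".join(text[i - 1] for i in sa), sa


def fm_index(bw: str) -> tuple[dict[str, int], dict[str, list[int]]]:
    """Build the C table and occurrence table of the FM index."""
    alphabet = sorted(set(bw))
    # c_table[c]: number of characters in bw that are smaller than c
    c_table: dict[str, int] = {}
    total = 0
    for c in alphabet:
        c_table[c] = total
        total += bw.count(c)
    # occ[c][i]: number of occurrences of c in bw[:i]
    occ = {c: [0] * (len(bw) + 1) for c in alphabet}
    for i, ch in enumerate(bw):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (1 if ch == c else 0)
    return c_table, occ


def find(pattern: str, bw: str, sa: list[int],
         c_table: dict[str, int], occ: dict[str, list[int]]) -> list[int]:
    """Backward search: return the sorted start positions of pattern."""
    lo, hi = 0, len(bw)  # current suffix-array interval [lo, hi)
    for ch in reversed(pattern):
        if ch not in c_table:
            return []
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])


genome = "ACGTACGT"  # toy reference
bw, sa = bwt(genome)
c_table, occ = fm_index(bw)
hits = find("CGT", bw, sa, c_table, occ)
```

Each step of the search narrows an interval of the suffix array using only the C table and occurrence counts, which is why the original text never has to be scanned.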

HISAT (Hierarchical Indexing for Spliced Alignment of Transcripts) employs
two types of indexes for alignment:
●A global, whole-genome index and tens of thousands of small local
indexes, both built with the same BWT/FM index algorithm as Bowtie
●Quality-aware alignment
●TopHat

HTSeq

Differential gene expression analysis from assembled gene
expression
1. Launch RStudio and load the necessary library

> library("DESeq2")

2. Create the necessary data objects.

> sample.names <- sort(paste(c("MT", "WT"), rep(1:3, each=2), sep=""))
> file.names <- paste("../", sample.names, "/", sample.names, ".count.txt",
sep="")
> conditions <- factor(c(rep("MT", 3), rep("WT", 3)))
> sampleTable <- data.frame(sampleName=sample.names,
fileName=file.names,
condition=conditions)
> # read in the HTSeq count data
> ddsHTSeq <- DESeqDataSetFromHTSeqCount(sampleTable=sampleTable,
directory=".",
design=~ condition)

Yalamanchili et al., 2017
Data analysis pipeline for
RNA-seq experiments

3. Run differential gene analysis.
> # keep only genes with more than 10 reads in total across all samples
> ddsHTSeq <- ddsHTSeq[rowSums(counts(ddsHTSeq)) > 10, ]
> dds <- DESeq(ddsHTSeq)

4. Quality checks on the samples.

> rld <- rlog(dds, blind=FALSE)
> # Plot PCA plot
> plotPCA(rld, intgroup="condition", ntop=nrow(counts(ddsHTSeq)))

> # Plot correlation heatmap
> library("gplots")  # provides heatmap.2
> cU <- cor(as.matrix(assay(rld)))
> cols <- c("dodgerblue3", "firebrick3")[conditions]
> heatmap.2(cU, symm=TRUE, col=colorRampPalette(c("darkblue", "white"))(100),
labCol=colnames(cU), labRow=colnames(cU),
distfun=function(c) as.dist(1 - c), trace="none", Colv=TRUE,
cexRow=0.9, cexCol=0.9, key=FALSE, font=2,
RowSideColors=cols, ColSideColors=cols)

PCA plot (a) and heatmap of the correlation coefficients between samples (b), based on
the gene expression profiles of the six samples.

5. Output differential gene analysis results
> res <- results(dds, contrast=c("condition", "MT", "WT"))
> # mean normalized count per condition
> grp.mean <- sapply(levels(dds$condition), function(lvl)
rowMeans(counts(dds, normalized=TRUE)[, dds$condition == lvl]))

> norm.counts <- counts(dds, normalized=TRUE)
> all <- data.frame(as.data.frame(res), grp.mean, norm.counts)
> write.table(all, file="DESeq2_all_rm.txt", sep="\t")
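
The normalized counts used above come from DESeq2's size factors, which are estimated by default with a median-of-ratios approach. The Python sketch below illustrates that idea on a made-up count matrix; it is a simplified illustration rather than DESeq2's code, and the function names are hypothetical.

```python
import math
from statistics import median


def size_factors(counts: list[list[int]]) -> list[float]:
    """counts[g][s] is the raw count of gene g in sample s."""
    n_samples = len(counts[0])
    log_ratios: list[list[float]] = [[] for _ in range(n_samples)]
    for row in counts:
        if min(row) <= 0:
            continue  # genes with a zero count in any sample are skipped
        # log of the geometric mean of this gene across samples
        log_geo_mean = sum(math.log(c) for c in row) / len(row)
        for s, c in enumerate(row):
            log_ratios[s].append(math.log(c) - log_geo_mean)
    # size factor = median ratio of a sample's counts to the geometric means
    return [math.exp(median(r)) for r in log_ratios]


def normalize(counts: list[list[int]], factors: list[float]) -> list[list[float]]:
    """Divide every count by the size factor of its sample."""
    return [[c / f for c, f in zip(row, factors)] for row in counts]


# Made-up data: sample 2 was sequenced twice as deeply as sample 1
raw = [[10, 20], [100, 200], [30, 60], [0, 5]]
factors = size_factors(raw)
normalized = normalize(raw, factors)
```

After normalization the first three genes have equal values in both samples, which is the intended effect of removing the sequencing-depth difference.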

6. Generate figures on significantly differentially expressed genes (DEGs)

Results of differential gene
expression analysis.

MA plot (a) and expression
heatmap on the DEGs (adjusted
P < 0.01) (b).
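
The adjusted P values used for the DEG threshold above are multiple-testing corrected; DESeq2 reports them in the padj column and uses the Benjamini-Hochberg procedure by default. The Python sketch below shows that procedure on made-up P values. It is an illustration only: the function name is hypothetical, and DESeq2's additional independent filtering step is not reproduced.

```python
def benjamini_hochberg(pvalues: list[float]) -> list[float]:
    """Return Benjamini-Hochberg adjusted P values in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])  # indices by ascending P
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up raw P values for four genes
padj = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
significant = [p < 0.01 for p in padj]
```

In this toy example none of the four genes passes the adjusted P < 0.01 cutoff, even though one raw P value is below 0.01.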
