Topic Modeling - NLP

RupakRoy4 896 views 11 slides Jan 14, 2022

About This Presentation

Explore topic modeling in detail via LDA (Latent Dirichlet Allocation) and the steps involved.

Thanks for your time. If you enjoyed this short video, there are tons of topics in advanced analytics, data science, and machine learning available in my Medium repo. https://medium.com/@bobrupakroy


Slide Content

Topic Modeling
By
LDA
Latent Dirichlet Allocation

Topic Modeling
Topic modeling is a technique to uncover the underlying topics in a
document. In simple words, it helps to identify what the document is
talking about: the important topics in the article.

Types of Topic Models
1) Latent Semantic Indexing (LSI)
2) Latent Dirichlet Allocation (LDA)
3) Probabilistic Latent Semantic Indexing (PLSI)

Document → Topic → Words
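
This Document → Topic → Words pipeline is LDA's generative story: every word in a document is produced by first drawing a topic from the document's topic mixture, then drawing a word from that topic's word distribution. A minimal sketch in Python, using made-up illustrative probabilities (not values from the slides):

```python
import random

# Hypothetical distributions for one document (illustrative numbers only)
doc_topics = {"Technology": 0.3, "Healthcare": 0.6, "Business": 0.1}
topic_words = {
    "Technology": {"Google": 0.4, "Dell": 0.3, "Apple": 0.3},
    "Healthcare": {"Radiology": 0.5, "Diagnose": 0.5},
    "Business": {"Transactions": 0.4, "Bank": 0.3, "Cost": 0.3},
}

def generate_word(rng):
    # Document -> Topic: pick a topic from the document's topic mixture
    topic = rng.choices(list(doc_topics), weights=list(doc_topics.values()))[0]
    # Topic -> Word: pick a word from that topic's word distribution
    words = topic_words[topic]
    word = rng.choices(list(words), weights=list(words.values()))[0]
    return topic, word

rng = random.Random(42)
doc = [generate_word(rng) for _ in range(5)]  # five (topic, word) draws
```

LDA inference runs this story in reverse: given only the words, it recovers the hidden topic mixtures.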


Rupak Roy

Topic Modeling - LDA

Topics                     Technology                      Healthcare           Business
% topics in the document   30%                             60%                  17%
Bag of words               Google, Dell, Apple, Microsoft  Radiology, Diagnose  Transactions, Bank, Cost

Behind LDA
Topic 1: Technology: Google, Dell, Apple, Microsoft
Topic 2: Healthcare: Radiology, Diagnose, CT Scan
Topic 3: Business: Transactions, Bank, Cost

Topic Modeling
How often does “Diagnose” appear in the topic Healthcare?
If the word ‘Diagnose’ often occurs in the topic Healthcare, then this
instance of ‘Diagnose’ might belong to the topic Healthcare.

Now, how common is the topic Healthcare in the rest of the document?
This is actually similar to Bayes’ theorem.

To find the probability of a possible topic T:
multiply the frequency of the word W in topic T by the number of other
words in document D that already belong to T.

Therefore the output is the probability that this word came from topic T:
P(T | W, D) ∝ (frequency of word W in topic T) × (number of words in document D that belong to topic T)
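
As a worked example of the rule above, the two factors can be multiplied directly. The counts below are invented toy numbers, not values from the slides:

```python
from collections import Counter

# count_wt[t][w]: how many times word w is currently assigned to topic t
# doc_topic[t]: how many words in this document already belong to topic t
count_wt = {
    "Healthcare": Counter({"Diagnose": 8, "Radiology": 5}),
    "Business":   Counter({"Diagnose": 1, "Bank": 6}),
}
doc_topic = {"Healthcare": 12, "Business": 3}

def topic_score(word, topic):
    # P(T | W, D) ∝ (frequency of W in T) x (words in D that belong to T)
    return count_wt[topic][word] * doc_topic[topic]

scores = {t: topic_score("Diagnose", t) for t in count_wt}
best = max(scores, key=scores.get)  # → "Healthcare" (score 96 vs 3)
```

Since ‘Diagnose’ is both frequent in Healthcare and Healthcare already dominates the document, this instance of ‘Diagnose’ gets reassigned to Healthcare.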

Topic Modeling - LDA
library(RTextTools)
library(topicmodels)

tweets<-read.csv(file.choose())
View(tweets)

names(tweets)

#keep only the airline and text columns (columns 6 and 11)
tweets1<-tweets[,c(6,11)]

names(tweets1)
dim(tweets1)
names(tweets1)[2]<-"tweets"
View(tweets1)


Topic Modeling - LDA
#Create a Document Term Matrix
matrix= create_matrix(cbind(as.vector(tweets1$airline),as.vector(tweets1$tweets)),
language="english",removeNumbers=TRUE, removePunctuation=TRUE,
removeSparseTerms=0,
removeStopwords=TRUE, stripWhitespace=TRUE, toLower=TRUE)

#inspect the first few rows of the document-term matrix
inspect(matrix[1:5,1:5])


#Choose the number of topics
k<- 15

#Split the Data into training and testing
#We will take a small subset of data
train <- matrix[1:500,]
test <- matrix[501:750,]

#train <- matrix[1:10248,]
#test <- matrix[10249:1460,]

Topic Modeling - LDA
#Build the model on train data
train.lda <- LDA(train,k)

topics<-get_topics(train.lda,5)
View(topics)
#gives the 5 most likely topics for each document

terms<-get_terms(train.lda,5)
View(terms)
#gives the 5 most probable words for each topic


#Get the top topics
train.topics <- topics(train.lda)

#Test the model
test.topics <- posterior(train.lda,test)
test.topics$topics[1:10,1:15]
#[rows, topic probabilities for all 15 topics, i.e. the value of K = 15]
test.topics <- apply(test.topics$topics, 1, which.max)
#keeps the topic with the highest probability for each document

Topic Modeling - LDA
#Join the predicted Topic number to the original test Data
test1<-tweets[501:750,]
final<-data.frame(Title=test1$airline,Subject=test1$text,
Pred.topic=test.topics)
View(final)

table(final$Pred.topic)
#View each topic
View(final[final$Pred.topic==10,])

Topic Modeling - LDA
#---------------Another method to get the optimal number of topics ---------#

library(topicmodels)
best.model <- lapply(seq(2,20, by=1), function(k){LDA(matrix,k)})
#seq(2,20) refers range of K values

best_model<- as.data.frame(as.matrix(lapply(best.model, logLik)))
#one of the ways to measure performance is the log-likelihood: it indicates
#whether a model is good, average, or bad based on the parameters the
#model uses.

final_best_model <- data.frame(topics=c(seq(2,20, by=1)),
log_likelihood=as.numeric(as.matrix(best_model)))
#The higher the log-likelihood, the better the model.
#helps find the ideal number of topics for the corpus

head(final_best_model)

library(ggplot2)
with(final_best_model,qplot(topics,log_likelihood,color="red"))
#the higher the log-likelihood value in the graph, the better that number of topics is.

Topic Modeling - LDA
#Get the best value from the graph
k=final_best_model[which.max(final_best_model$log_likelihood),1]

cat("Best topic number k=",k)
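
The which.max selection above amounts to picking the k with the largest log-likelihood. The same idea in Python, with hypothetical (k, log-likelihood) pairs standing in for the lapply(..., logLik) output:

```python
# Hypothetical (k, log_likelihood) pairs (assumed values for illustration)
results = [(2, -5400.2), (5, -5100.8), (10, -4980.5), (15, -5010.1), (20, -5075.3)]

# Equivalent of final_best_model[which.max(final_best_model$log_likelihood), 1]:
# keep the k whose model scored the highest log-likelihood
best_k = max(results, key=lambda kv: kv[1])[0]  # → 10
```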

Steps in Topic Modeling
1) Data
2) Create the document-term matrix (DTM)
3) Choose the number of topics (K)
4) Divide the data into train & test
5) Build the model on the train data
6) Get the topics
7) Test the model
8) Join the predicted topic number to the original dataset
9) Analyze
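
The modeling core of these steps can be sketched end-to-end in pure Python with a tiny collapsed Gibbs sampler — an illustrative stand-in for topicmodels::LDA with assumed smoothing hyperparameters alpha and beta, not the package's actual implementation:

```python
import random
from collections import defaultdict

def lda_gibbs(docs, k, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Minimal collapsed Gibbs sampler for LDA (illustrative, not optimized)."""
    rng = random.Random(seed)
    V = len({w for doc in docs for w in doc})       # vocabulary size
    n_dt = [[0] * k for _ in docs]                  # words in doc d assigned to topic t
    n_tw = [defaultdict(int) for _ in range(k)]     # times word w assigned to topic t
    n_t = [0] * k                                   # total words assigned to topic t
    z = []                                          # current topic of every token
    for d, doc in enumerate(docs):                  # random initial assignments
        zd = []
        for w in doc:
            t = rng.randrange(k)
            zd.append(t)
            n_dt[d][t] += 1; n_tw[t][w] += 1; n_t[t] += 1
        z.append(zd)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]                         # remove the current assignment
                n_dt[d][t] -= 1; n_tw[t][w] -= 1; n_t[t] -= 1
                # P(t | w, d) ∝ (freq of w in t, smoothed) x (words in d in t, smoothed)
                weights = [(n_tw[ti][w] + beta) / (n_t[ti] + V * beta)
                           * (n_dt[d][ti] + alpha) for ti in range(k)]
                t = rng.choices(range(k), weights=weights)[0]
                z[d][i] = t                         # resample and re-add
                n_dt[d][t] += 1; n_tw[t][w] += 1; n_t[t] += 1
    return z, n_dt, n_tw

docs = [["google", "dell", "apple"], ["radiology", "diagnose", "scan"],
        ["google", "apple", "microsoft"], ["diagnose", "radiology", "ct"]]
z, n_dt, n_tw = lda_gibbs(docs, k=2)
```

After sampling, n_dt plays the role of the per-document topic counts (step 6) and n_tw the per-topic word counts, from which the most probable terms per topic can be read off.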
Rupak Roy