Topic modelling through LDA and BERTopic models


About This Presentation

Topic modelling is a technique for extracting topics from large volumes of text data.


Slide Content

NATIONAL INSTITUTE OF TECHNOLOGY KARNATAKA, SURATHKAL. Department of Mathematical and Computational Sciences. Topic Modelling. Under the guidance of Dr. Chandhini G. Submitted by: Anil Pushpad (222CD003)

Introduction. What is topic modelling: topic modelling is a technique that has revolutionized the way we understand and organize text data, making it an indispensable tool for various applications. The purpose and benefits of topic modelling: we are inundated with vast amounts of text data, from news articles and social media posts to research papers and customer reviews, and making sense of this text is a challenging but essential task. This project focuses on training an LDA model and identifying topics from large collections of text data.

What is topic modelling?

Problem statement. To train a model that extracts information or topics from text data using Latent Dirichlet Allocation (LDA).

Methodology: data preprocessing; creating bigrams and trigrams; lemmatization using spaCy with POS tagging; data encoding; building the LDA model; finding the optimal number of topics.

Data preprocessing. Remove email addresses, special characters, and numeric values using regular expressions; convert all text to lower case; remove stopwords; then perform exploratory data analysis.
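A minimal sketch of these cleaning steps in Python. The slides do not name a stopword source, so NLTK's English stopword list is assumed here; the regex patterns and the sample document are illustrative.

```python
import re

from nltk.corpus import stopwords  # requires a one-time nltk.download("stopwords")

STOPWORDS = set(stopwords.words("english"))

def preprocess(doc):
    """Strip emails, special characters, and digits; lower-case; drop stopwords."""
    doc = re.sub(r"\S*@\S*\s?", "", doc)    # remove email addresses
    doc = re.sub(r"[^a-zA-Z\s]", " ", doc)  # remove special characters and numbers
    tokens = doc.lower().split()
    return [tok for tok in tokens if tok not in STOPWORDS]

docs = ["Contact me at jdoe@example.com about the 2 oil leaks!"]
texts = [preprocess(d) for d in docs]
print(texts)  # [['contact', 'oil', 'leaks']]
```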

Bigrams and trigrams. Bigrams are two words that frequently occur together in a document; trigrams are three such words. Examples: 'front_bumper', 'oil_leak', 'maryland_college_park'. The Phrases model detects phrases (multi-word expressions) in a corpus of text; Phraser transforms a Phrases model into a more memory-efficient form, making it faster for repeated use, as in the sketch below.
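A short sketch of the Gensim Phrases/Phraser workflow described above; the min_count and threshold values are illustrative assumptions, not taken from the slides.

```python
from gensim.models.phrases import Phrases, Phraser

# texts: list of token lists from the preprocessing step
bigram = Phrases(texts, min_count=5, threshold=100)  # detect frequent word pairs
trigram = Phrases(bigram[texts], threshold=100)      # pairs plus a third word
bigram_mod = Phraser(bigram)    # memory-efficient, faster wrapper for reuse
trigram_mod = Phraser(trigram)

texts = [trigram_mod[bigram_mod[doc]] for doc in texts]
# e.g. ["maryland", "college", "park"] -> ["maryland_college_park"]
```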

Lemmatization and POS tagging. Examples: university => NOUN, maryland_college_park => PROPN, lines => NOUN, wondering => VERB, anyone => PRON, could => AUX. The bag of words is then filtered by removing words with certain POS tags, such as AUX, DET, and INTJ.
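A sketch of this step with spaCy, assuming the en_core_web_sm pipeline; the filtered tag set follows the slide (AUX, DET, INTJ).

```python
import spacy

# Small English pipeline; the parser and NER are not needed for lemmas/POS
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

REMOVE_TAGS = {"AUX", "DET", "INTJ"}  # POS tags filtered out, per the slide

def lemmatize(tokens, remove_tags=REMOVE_TAGS):
    """Return lemmas of the tokens whose POS tag is not in remove_tags."""
    doc = nlp(" ".join(tokens))
    return [tok.lemma_ for tok in doc if tok.pos_ not in remove_tags]

texts = [lemmatize(doc) for doc in texts]
```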

Example

Data encoding. Encode the whole corpus into numbers: using the Gensim library, create a dictionary (id2word) that maps each token to an integer id, then convert each document into a bag-of-words representation (bag_of_words).
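A minimal sketch of the encoding step with Gensim's Dictionary and doc2bow, continuing from the lemmatized texts above; the id2word name follows the slide.

```python
from gensim import corpora

# texts: list of lemmatized token lists
id2word = corpora.Dictionary(texts)                 # token -> integer id
corpus = [id2word.doc2bow(text) for text in texts]  # bag-of-words per document

print(corpus[0])   # e.g. [(0, 1), (1, 2), ...] as (word_id, count) pairs
print(id2word[0])  # the token behind id 0
```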

LDA model. The LDA model has two prior distributions: the probability distribution of topics over documents (the document-topic distribution) and the probability distribution of words over topics (the topic-word distribution). The three main parameters of the LDA model are: the number of topics, the number of words per topic, and the number of topics per document.
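A sketch of training the LDA model with Gensim; num_topics, chunksize, and passes are illustrative values (chunk size and passes are the hyperparameters revisited under Future Work), not the settings used in the project.

```python
from gensim.models import LdaModel

lda_model = LdaModel(
    corpus=corpus,        # bag-of-words encoding from the previous step
    id2word=id2word,      # dictionary mapping ids back to words
    num_topics=10,        # k, the number of topics (illustrative value)
    random_state=42,
    chunksize=100,        # documents per training batch
    passes=10,            # full sweeps over the corpus
    alpha="auto",         # learn the document-topic prior from data
    per_word_topics=True,
)
print(lda_model.print_topics(num_words=5))  # top words per topic
```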

Example

Visualisation of topics
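The slides do not name the visualization tool; pyLDAvis is a common companion to Gensim LDA, so the following is an assumed sketch rather than the deck's own code.

```python
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis  # pyLDAvis >= 3.x

vis = gensimvis.prepare(lda_model, corpus, id2word)
pyLDAvis.save_html(vis, "lda_topics.html")  # open in a browser to explore topics
```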

Model perplexity & coherence score. Model perplexity is a measure used to evaluate the performance of probabilistic models; lower perplexity indicates a better-performing model (Gensim reports it as a negative log-scale bound, hence the negative value shown later). The coherence score is a metric used to evaluate the quality of the topics generated by a topic model: it measures the relationships between words within each topic, and a high coherence score indicates that the words within a topic are related and form a coherent theme.
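A sketch of both evaluations with Gensim, reusing lda_model, corpus, texts, and id2word from the earlier sketches; the c_v coherence measure is an assumption, since the slide does not say which variant was used.

```python
from gensim.models import CoherenceModel

# Perplexity: Gensim returns a per-word log-likelihood bound (negative number)
print("Perplexity:", lda_model.log_perplexity(corpus))

# Coherence (c_v): higher is better
coherence_model = CoherenceModel(
    model=lda_model, texts=texts, dictionary=id2word, coherence="c_v"
)
print("Coherence:", coherence_model.get_coherence())
```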

Optimal number of topics. Picking a higher value of k can sometimes provide more granular sub-topics, but if you see the same keywords repeated across multiple topics, it is probably a sign that k (the number of topics) is too large. The compute_coherence_values function (see the sketch below) trains multiple LDA models and returns the models with their corresponding coherence scores. Perplexity: -18.78438774493046
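The compute_coherence_values code itself is not shown on the slide; a typical sketch of such a function, with an illustrative range of candidate k values, might look like this.

```python
from gensim.models import CoherenceModel, LdaModel

def compute_coherence_values(dictionary, corpus, texts, start=2, limit=20, step=2):
    """Train one LDA model per candidate k and record its c_v coherence."""
    models, scores = [], []
    for num_topics in range(start, limit, step):
        model = LdaModel(corpus=corpus, id2word=dictionary,
                         num_topics=num_topics, random_state=42, passes=10)
        cm = CoherenceModel(model=model, texts=texts,
                            dictionary=dictionary, coherence="c_v")
        models.append(model)
        scores.append(cm.get_coherence())
    return models, scores

models, scores = compute_coherence_values(id2word, corpus, texts)
best_k = range(2, 20, 2)[scores.index(max(scores))]  # k at the coherence peak
```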

Final result

Future work. Continued research in topic modelling algorithms, including LDA variants and deep learning approaches, can lead to more accurate and efficient methods for discovering latent topics within text data. The quality of the topics is crucial: evaluate the topics generated by different models; are they meaningful and coherent? Experiment with different hyperparameters, such as the number of topics (k), chunk size, and number of passes; hyperparameter tuning can significantly impact model performance. Other methods could also improve the project's performance, since some research papers report higher performance on NLP tasks for approaches such as LSA (latent semantic analysis) and PLSA (probabilistic latent semantic analysis).

References
David M. Blei (Computer Science Division, University of California, Berkeley), Andrew Y. Ng (Computer Science Department, Stanford University), and Michael I. Jordan (Computer Science Division and Department of Statistics, University of California, Berkeley) developed Latent Dirichlet Allocation (LDA).
Blei DM. Probabilistic topic models. Commun ACM. 2012;55(4):77-84. https://doi.org/10.1145/2133806.2133826
Alghamdi R, Alfalqi K. A survey of topic modeling in text mining. Int J Adv Comput Sci Appl. 2015;6(1):7. https://doi.org/10.14569/IJACSA.2015.060121
Topic model. Wikipedia. https://en.wikipedia.org/wiki/Topic_model

Thank You