Topic modelling through LDA and BERTopic models


About This Presentation

Topic modelling is a technique for extracting topics from large volumes of text data.


Slide Content

NATIONAL INSTITUTE OF TECHNOLOGY KARNATAKA, SURATHKAL. Department of Mathematical and Computational Sciences. Topic Modelling. Under the guidance of Dr. Chandhini G. Submitted by: Anil Pushpad (222CD003)

Introduction. What is topic modelling: topic modelling is a technique that has revolutionized the way we understand and organize text data, making it an indispensable tool for various applications. The purpose and benefits of topic modelling: we are inundated with vast amounts of text data, from news articles and social media posts to research papers and customer reviews, and making sense of this text is a challenging but essential task. This project focuses on training an LDA model and identifying topics from large collections of text data.

What is topic modelling?

Problem statement. To train a model that extracts information or topics from text data using Latent Dirichlet Allocation (LDA).

Methodology: data preprocessing; creating bigrams and trigrams; lemmatization using spaCy with POS tagging; data encoding; building the LDA model; finding the optimal number of topics.

Data preprocessing. Remove email addresses, special characters, and numeric values using regular expressions; convert all text to lower case; remove stopwords; then perform exploratory data analysis.
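A minimal sketch of these cleaning steps in Python. The slides do not name a stopword source, so NLTK's English stopword list is assumed here; the regex patterns and the sample document are illustrative.

```python
import re

from nltk.corpus import stopwords  # requires a one-time nltk.download("stopwords")

STOPWORDS = set(stopwords.words("english"))

def preprocess(doc):
    """Strip emails, special characters, and digits; lower-case; drop stopwords."""
    doc = re.sub(r"\S*@\S*\s?", "", doc)    # remove email addresses
    doc = re.sub(r"[^a-zA-Z\s]", " ", doc)  # remove special characters and numbers
    tokens = doc.lower().split()
    return [tok for tok in tokens if tok not in STOPWORDS]

docs = ["Contact me at jdoe@example.com about the 2 oil leaks!"]
texts = [preprocess(d) for d in docs]
print(texts)  # [['contact', 'oil', 'leaks']]
```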

Bigrams and trigrams. Bigrams are two words that frequently occur together in a document; trigrams are three such words. Examples: 'front_bumper', 'oil_leak', 'maryland_college_park'. The Phrases model detects phrases (multi-word expressions) in a corpus of text; Phraser transforms a Phrases model into a more memory-efficient form, making it faster for repeated use, as in the sketch below.
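A short sketch of the Gensim Phrases/Phraser workflow described above; the min_count and threshold values are illustrative assumptions, not taken from the slides.

```python
from gensim.models.phrases import Phrases, Phraser

# texts: list of token lists from the preprocessing step
bigram = Phrases(texts, min_count=5, threshold=100)  # detect frequent word pairs
trigram = Phrases(bigram[texts], threshold=100)      # pairs plus a third word
bigram_mod = Phraser(bigram)    # memory-efficient, faster wrapper for reuse
trigram_mod = Phraser(trigram)

texts = [trigram_mod[bigram_mod[doc]] for doc in texts]
# e.g. ["maryland", "college", "park"] -> ["maryland_college_park"]
```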

Lemmatization and POS tagging. Examples: university => NOUN, maryland_college_park => PROPN, lines => NOUN, wondering => VERB, anyone => PRON, could => AUX. The bag of words is then filtered by removing words with certain POS tags, such as AUX, DET, and INTJ.
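A sketch of this step with spaCy, assuming the en_core_web_sm pipeline; the filtered tag set follows the slide (AUX, DET, INTJ).

```python
import spacy

# Small English pipeline; the parser and NER are not needed for lemmas/POS
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

REMOVE_TAGS = {"AUX", "DET", "INTJ"}  # POS tags filtered out, per the slide

def lemmatize(tokens, remove_tags=REMOVE_TAGS):
    """Return lemmas of the tokens whose POS tag is not in remove_tags."""
    doc = nlp(" ".join(tokens))
    return [tok.lemma_ for tok in doc if tok.pos_ not in remove_tags]

texts = [lemmatize(doc) for doc in texts]
```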

Example

Data encoding. Encode the whole corpus into numbers: using the Gensim library, create a dictionary (id2word) that maps each token to an integer id, then convert each document into a bag-of-words representation (bag_of_words).
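A minimal sketch of the encoding step with Gensim's Dictionary and doc2bow, continuing from the lemmatized texts above; the id2word name follows the slide.

```python
from gensim import corpora

# texts: list of lemmatized token lists
id2word = corpora.Dictionary(texts)                 # token -> integer id
corpus = [id2word.doc2bow(text) for text in texts]  # bag-of-words per document

print(corpus[0])   # e.g. [(0, 1), (1, 2), ...] as (word_id, count) pairs
print(id2word[0])  # the token behind id 0
```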

LDA model. The LDA model has two prior distributions: the probability distribution of topics over documents (the document-topic distribution) and the probability distribution of words over topics (the topic-word distribution). The three main parameters of the LDA model are: the number of topics, the number of words per topic, and the number of topics per document.
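A sketch of training the LDA model with Gensim; num_topics, chunksize, and passes are illustrative values (chunk size and passes are the hyperparameters revisited under Future Work), not the settings used in the project.

```python
from gensim.models import LdaModel

lda_model = LdaModel(
    corpus=corpus,        # bag-of-words encoding from the previous step
    id2word=id2word,      # dictionary mapping ids back to words
    num_topics=10,        # k, the number of topics (illustrative value)
    random_state=42,
    chunksize=100,        # documents per training batch
    passes=10,            # full sweeps over the corpus
    alpha="auto",         # learn the document-topic prior from data
    per_word_topics=True,
)
print(lda_model.print_topics(num_words=5))  # top words per topic
```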

Example

Visualisation of topics
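The slides do not name the visualization tool; pyLDAvis is a common companion to Gensim LDA, so the following is an assumed sketch rather than the deck's own code.

```python
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis  # pyLDAvis >= 3.x

vis = gensimvis.prepare(lda_model, corpus, id2word)
pyLDAvis.save_html(vis, "lda_topics.html")  # open in a browser to explore topics
```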

Model perplexity & coherence score. Model perplexity is a measure used to evaluate the performance of probabilistic models; lower perplexity indicates a better-performing model (Gensim reports it as a negative log-scale bound, hence the negative value shown later). The coherence score is a metric used to evaluate the quality of the topics generated by a topic model: it measures the relationships between words within each topic, and a high coherence score indicates that the words within a topic are related and form a coherent theme.
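A sketch of both evaluations with Gensim, reusing lda_model, corpus, texts, and id2word from the earlier sketches; the c_v coherence measure is an assumption, since the slide does not say which variant was used.

```python
from gensim.models import CoherenceModel

# Perplexity: Gensim returns a per-word log-likelihood bound (negative number)
print("Perplexity:", lda_model.log_perplexity(corpus))

# Coherence (c_v): higher is better
coherence_model = CoherenceModel(
    model=lda_model, texts=texts, dictionary=id2word, coherence="c_v"
)
print("Coherence:", coherence_model.get_coherence())
```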

Optimal number of topics. Picking a higher value of k can sometimes provide more granular sub-topics, but if you see the same keywords repeated across multiple topics, it is probably a sign that k (the number of topics) is too large. The compute_coherence_values function (see the sketch below) trains multiple LDA models and returns the models with their corresponding coherence scores. Perplexity: -18.78438774493046
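The compute_coherence_values code itself is not shown on the slide; a typical sketch of such a function, with an illustrative range of candidate k values, might look like this.

```python
from gensim.models import CoherenceModel, LdaModel

def compute_coherence_values(dictionary, corpus, texts, start=2, limit=20, step=2):
    """Train one LDA model per candidate k and record its c_v coherence."""
    models, scores = [], []
    for num_topics in range(start, limit, step):
        model = LdaModel(corpus=corpus, id2word=dictionary,
                         num_topics=num_topics, random_state=42, passes=10)
        cm = CoherenceModel(model=model, texts=texts,
                            dictionary=dictionary, coherence="c_v")
        models.append(model)
        scores.append(cm.get_coherence())
    return models, scores

models, scores = compute_coherence_values(id2word, corpus, texts)
best_k = range(2, 20, 2)[scores.index(max(scores))]  # k at the coherence peak
```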

Final result

Future work. Continued research in topic modelling algorithms, including LDA variants and deep learning approaches, can lead to more accurate and efficient methods for discovering latent topics within text data. The quality of the topics is crucial: evaluate the topics generated by different models; are they meaningful and coherent? Experiment with different hyperparameters, such as the number of topics (k), chunk size, and number of passes; hyperparameter tuning can significantly impact model performance. Other methods could also improve the project's performance, since some research papers report higher performance on NLP tasks for approaches such as LSA (latent semantic analysis) and PLSA (probabilistic latent semantic analysis).

References
David M. Blei (Computer Science Division, University of California, Berkeley), Andrew Y. Ng (Computer Science Department, Stanford University), and Michael I. Jordan (Computer Science Division and Department of Statistics, University of California, Berkeley) developed Latent Dirichlet Allocation (LDA).
Blei DM. Probabilistic topic models. Commun ACM. 2012;55(4):77-84. https://doi.org/10.1145/2133806.2133826
Alghamdi R, Alfalqi K. A survey of topic modeling in text mining. Int J Adv Comput Sci Appl. 2015;6(1):7. https://doi.org/10.14569/IJACSA.2015.060121
Topic model. Wikipedia. https://en.wikipedia.org/wiki/Topic_model

Thank You