Using Generative AI to better understand B2B audiences: From Topic Modelling to Text Classification
LourensWalters1
33 slides
Sep 12, 2024
Slide Content
Using Generative AI to
better understand B2B
audiences
From Topic Modelling to Text Classification
Luca Valer
Lourens Walters
2024/04/15
The Informa Group: Connecting people with knowledge
Portfolio Growth Investments | B2B Live & On-Demand Events | B2B Digital Services | Academic Markets
300+ Brands, 20+ specialist markets (Pharma, Health & Nutrition, Aviation, Beauty, Infrastructure & Construction, Luxury)
IIRIS (Proprietary First Party B2B Data Platform)
Revenues: $0.5bn+ / £0.4bn+
Group Revenues: c.$4.5bn / c.£3.65bn
Revenues: c.$2.2bn / c.£1.75bn
Revenues: $0.75bn+ / £0.6bn+
Revenues: c.$1.1bn / c.£0.9bn
400+ Brands, 6 growth markets: Biotech & Life Sciences, Finance, Foodservice, Anti-Aging & Aesthetics, Lifestyle, Technology
220+ Specialist B2B Brands, c.50m permissioned First Party B2B audience data, Demand Gen & Buyer Intent platforms
6 publishing imprints, 2700+ peer review journals (300+ Open titles), 170k reference titles across 75+ specialist subjects
Brand | Category | Equity
Norstella | Pharma Intell. | 6.7%
Lloyd’s List | Maritime Intell | 20.0%
Founder’s Forum | B2B Events | 22.3%
ITN | Production | 20.0%
PA Media | Specialist Media | 18.2%
Bologna Fiere | B2B Events | 13.5%
Bridge Events | Events Tech | 14.9%
Transaction-led Live & On-Demand B2B Events
Content-led Live & On-Demand B2B Events
B2B Data & Market Access Platform
Specialist Academic Research, Advanced Learning & Open Research
* Figures relate to 2024, including annualised figures for New TechTarget, assuming the proposed combination between Informa Tech’s digital businesses and TechTarget completes as planned
A Leader in Live & On-Demand B2B Events
Transaction-led Live & On-Demand Events: c.$2.2bn Revenue, 300+ Brands, 20+ Specialist Markets, 6.2m+ Attendees
•Healthcare & Pharmaceuticals, Health & Nutrition, Infrastructure & Construction, Beauty, Luxury, Aviation & Aerospace
Content-led Live & On-Demand Events: c.$1.1bn Revenue, 400+ Brands, 6 Growth Markets, 670k+ Attendees
•Tech, Finance, BioTech & Life Sciences, Food Services, Anti-Aging & Aesthetics, Lifestyle
IIRIS (Proprietary First Party B2B Data Platform)
2.5B audience interactions - Media web, Emails, Smart events
Why & how do we learn more about our audiences?
Objectives
•Marketing team: identify the right audience & the right message for them to sell, retain, and/or engage with Informa products
•Our clients: offering details about the audience who engage with them
Challenges
•Standardise values at scale: while we capture individual data such as geographical location and job title via forms, we face the problem of standardising them across sectors and taxonomies to understand who the decision makers are
•Understanding audience interests: how do we capture the interests of our audiences, i.e. what they read, click on, attend, and talk about?
Solutions
•Finetuned LLMs to map taxonomy
•Extract topics from web content
Use Case 1: Reference data mapping
Business Problem
•Business domains have different reference data standards
•Event Registration Sites
•Mapping data is time consuming and labour intensive
Solution
•Automated reference data mapping
Finetuned LLMs to map taxonomy
[Figure: mapping diagram; labels: Visitor, Attendee, Early Access, Visitor Group, Mapping]
The mapping dataset
29 unique values | 70k unique values | 1k unique values
Dataset mapped by reference data team: 96k mapped (labelled) records
Previous Approaches
•State-of-the-art enterprise reference data/unification software to perform mapping using built-in ML functionality: result < 60% accuracy
•Microsoft Azure Auto-ML to fine-tune proprietary LLMs within VPC: result < 50% accuracy
•Prompt engineering with proprietary LLM using few-shot learning: result < 30% accuracy
•Goal > 80% accuracy
Our approach: Finetune open-source Hugging Face LLMs
Candidate Models
Training set size: 72k
Test set size: 24k
•BERT is SOTA for text classification problems – bidirectional encoder LLM
•Baseline model
•Context length is small – not great for large text problems
•Lodestone is a BERT-based model with an enlarged context length – fine-tuned for topic classification
•Llama is not great for text classification out of the box – unidirectional decoder LLM (same as GPT)
•Context length is large
•Generalizes better – more parameters and trained on large datasets
•Multilingual support (future release)
•Recent (2023) paper removes Llama causal mask resulting in SOTA results by enabling bidirectional context finetuning*
* Label Supervised LLaMA Finetuning - arXiv:2310.01208
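The difference between a decoder's causal mask and the bidirectional context mentioned above can be illustrated with a toy sketch. This is not the cited paper's implementation; the function names and the three-token input are made up for illustration.

```python
def causal_mask(n):
    """mask[i][j] is True when token i may attend to token j (only j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Every token may attend to every other token, as in an encoder like BERT."""
    return [[True] * n for _ in range(n)]


def visible_tokens(mask, i):
    """Indices of the tokens that position i can see under the given mask."""
    return [j for j, allowed in enumerate(mask[i]) if allowed]


tokens = ["Early", "Access", "Visitor"]  # hypothetical input
causal = causal_mask(len(tokens))
full = bidirectional_mask(len(tokens))
print(visible_tokens(causal, 0))  # first token under the causal mask
print(visible_tokens(full, 0))    # first token with bidirectional context
```

Under the causal mask the first token cannot use anything that follows it, which is one reason a plain decoder is a weaker starting point for classifying short strings.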
Finetuning LLMs using LORA and QLORA
[Figure: fine-tuning pipeline. The fine-tuning reference data corpus is tokenized and passed to the original LLM (BERT/Llama); the transferred network weights stay frozen while the low-rank update △W is unfrozen and trained (Low-Rank Adaptation, LoRA); a classification head (Linear + Softmax) outputs the target reference data mapping; training and test stages are shown. Source: Practical Tips for Finetuning LLMs Using LoRA – Sebastian Raschka]
Methodology
•Train on Nvidia A10 with 25GB GPU memory
•Use Hugging Face Trainer API
•BERT – Full finetuning
•Llama 2 & 3: Parameter Efficient Finetuning (PEFT)
* QLoRA: Efficient Finetuning of Quantized LLMs - arXiv:2305.14314
Finetune Llama 2 & 3
Start with simple solution:
•Small batch size – 2
•Low LORA R and Alpha – 2 and 4
•LORA target modules - q_proj and v_proj
•Reduced data size (context length)
•Increase parameters until memory fails
•Trainable params: 10M
•All params: 7.6B
•Trainable %: 0.1
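As a rough sketch of why LoRA keeps the trainable share so small: a rank-r adapter on a d_out x d_in weight trains r * (d_in + d_out) parameters instead of d_in * d_out. The 4096 x 4096 layer size below is an assumption for illustration; only R, Alpha and the parameter totals come from the slides.

```python
def lora_param_count(d_in, d_out, r):
    """LoRA replaces a full d_out x d_in update with B (d_out x r) @ A (r x d_in)."""
    return r * (d_in + d_out)


def trainable_percent(trainable, total):
    """Share of parameters that receive gradient updates, in percent."""
    return 100.0 * trainable / total


# Hypothetical 4096 x 4096 projection with the low rank r = 2 from the slide.
full_update = 4096 * 4096
low_rank_update = lora_param_count(4096, 4096, r=2)
print(full_update, low_rank_update)

# LoRA scales the low-rank update by alpha / r (4 / 2 with the slide's values).
scaling = 4 / 2

# Parameter totals reported on the slides for the two Llama configurations.
print(round(trainable_percent(10e6, 7.6e9), 1))
print(round(trainable_percent(90e6, 7.6e9), 1))
```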
Finetune Llama 2 & 3
Add 4 Bit quantisation
•Slower
•Accuracy results remain the same
•Increase parameters until memory fails
•Trainable params: 90M
•All params: 7.6B
•Trainable %: 1.2
•Large number of epochs not necessary – 10 enough
•Special separator token significant
•Using all LORA target modules significant
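The idea behind 4-bit quantisation can be sketched as follows. QLoRA itself uses the NF4 data type; the uniform absmax scheme below is a deliberate simplification chosen for readability, and the sample weights are made up.

```python
def quantize_4bit(weights):
    """Map floats to signed integers in [-7, 7] using one absmax scale per block."""
    scale = max(abs(w) for w in weights) / 7 or 1.0  # fall back to 1.0 for an all-zero block
    codes = [round(w / scale) for w in weights]
    return codes, scale


def dequantize_4bit(codes, scale):
    """Recover approximate float weights from the integer codes."""
    return [c * scale for c in codes]


block = [0.12, -0.50, 0.33, 0.07, -0.21, 0.70]  # made-up weights
codes, scale = quantize_4bit(block)
restored = dequantize_4bit(codes, scale)
max_error = max(abs(a - b) for a, b in zip(block, restored))
print(codes, round(max_error, 3))
```

The extra quantise/dequantise work on every forward pass is consistent with the "Slower" observation on the slide, while the freed memory is what allows more trainable parameters.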
Finetune BERT
•345M parameters
Results
Multi-class classification: Predict 1k unique target values!
Learning curves Llama 3 – finetuning
Goal of 80% accuracy achieved!
Deployment – work in progress
•Deploy on an EC2 container using an Nvidia A10 (requires 25GB GPU memory) – not very costly
•Simple case of loading model and inferring
•Code showing model inference
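The inference code from the slide is not part of this transcript. As a minimal, model-agnostic sketch of the final step only, the snippet below turns classifier logits into a predicted label via softmax and argmax. The label names and logit values are hypothetical; a real model for this use case would emit roughly 1k logits.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_label(logits, id2label):
    """Return the most probable label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return id2label[best], probs[best]


# Hypothetical label set and logits.
id2label = {0: "Visitor", 1: "Exhibitor", 2: "Speaker"}
label, confidence = predict_label([2.1, 0.3, -1.0], id2label)
print(label, round(confidence, 2))
```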
Use Case 2: Topic Modelling
Informa web products
•800 websites for different business domains
Business Problem
•Not all websites have tagged content; tagging is very often not appropriate for the use case
Solution
•Topic modelling: automated topic tagging using traditional NLP & GenAI
Extract topics from web content
Topic Modelling challenges
•Domain specificity (LLMs are general)
•Consistency over time (proprietary LLM APIs change faster than topics)
•Model validation (Unsupervised learning)
•Data pre-processing complex (Traditional NLP)
•Hyperparameter sensitivity (Traditional & LLM)
•Scalability due to complexity (Algorithm complexity)
•Model updates with new topics (Taxonomy management using unsupervised learning)
Can history help us?
“a word is characterized by the company it keeps”
1940s
•N-gram model
•Predicting next item in a sequence based on previous n-1 items
1950s
•Distributional hypothesis
•Bag of Words (BOW)
•Interest in translation
•Syntactic structures
1960s
1970s
•AI Winter
•Ontologies
•TF/IDF
•Case Grammars
1980s
•Symbolic Models
•Latent Semantic Analysis (LSA)
1990s
•Machine Learning such as Decision Tree and MLP
•RNNs
2000s
•Word embeddings
•Google translate
•Latent Dirichlet Allocation (LDA)
2010s
•Word2Vec
•LSTMs and CNNs
•Encoder-Decoder
•Attention and Transformers
•Pretrained models and Transfer Learning
2020s
•ChatGPT
•LLM revolution
Which technology to use? Is traditional NLP relevant?
•Matrix decomposition:
•Latent Semantic Analysis (LSA)
•Non-Negative Matrix Factorization (NMF)
•Probabilistic:
•Probabilistic Latent Semantic Analysis (PLSA)
•Latent Dirichlet Allocation (LDA)
•Bayesian Models:
•Hierarchical Dirichlet Process (HDP)
•Neural Networks:
•Autoencoder models
•Transformer-based models (LLMs)
Solution using both LLMs and traditional NLP
[Figure: pipeline from Web Data to Topics: LDA Model (domain-specific general topics) → LDA + BERT or Llama 3 embeddings (domain-specific embeddings) → UMAP (dimensionality reduction) → HDBSCAN (clustering) → C-TF-IDF (topic word derivation) → ChatGPT or Llama 3 (topic description) → Combine BERT/Llama 3 and LDA (final topics) → Fine-tune Llama 3 classifier (train classifier)]
Framework adapted from: BERTopic (Grootendorst)
Train a domain specific LDA model for each business vertical
Example domain:
•Aviation web – 250k webpages
•On average around 350 words per document
LDA modelling – Results
Validation:
•Train/Test split - 80/20
•Internal:
•Coherence, perplexity, silhouette, visual
•External validation
•Subject Matter Expert (SME) validation
Results:
•SME validation of 7/10 (detractor score)
•Coherence scores for BOW very good
•We use BOW in our model
[Figure: LDA results for BOW and TF/IDF representations; scores shown: coherence -13.58, perplexity -24.21; coherence -2.25, perplexity -7.28]
Note: LDA requires extensive pre-processing:
•Lowercase, punctuation removal, stop word removal, lemmatize, remove slang
Only run once per domain
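The pre-processing steps listed for LDA can be sketched with the standard library alone. The stop-word list and the plural-stripping "lemmatizer" below are crude stand-ins for proper NLP resources, and slang removal is omitted; the example sentence is made up.

```python
import re
import string

# Tiny illustrative stop-word list; a real pipeline would use a full one.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "are", "for", "on"}


def naive_lemma(token):
    """Crude stand-in for a lemmatizer: strip a trailing plural 's'."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def preprocess(text):
    """Lowercase, drop punctuation and stop words, then lemmatize naively."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = re.findall(r"[a-z]+", text)
    return [naive_lemma(t) for t in tokens if t not in STOP_WORDS]


print(preprocess("The airlines are adding new routes to Asia, and schedules!"))
```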
Addressing domain specificity problem with LLMs
•Build LDA on domain specific data – done
•Create contextualized embeddings for Semantic Text Similarity
(STS) and Topic Clustering purposes
•Extract last hidden state from BERT for each
embedded document
•Combine BERT representation of document with LDA
topic representation for document
•Can be applied to both BERT and Llama 3 embeddings
Combined LDA topics and BERT embeddings = contextual embeddings*
*Peinelt, N., Nguyen, D., Liakata, M.: tBERT: Topic models and BERT joining forces for semantic similarity detection. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 7047–7055. Association for Computational Linguistics, Online (2020). https://doi.org/10.18653/v1/2020.acl-main.630. https://aclanthology.org/2020.acl-main.630
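One simple way to combine the two representations is to concatenate them. The sketch below assumes L2 normalisation of each part and an optional topic weight; both are illustrative choices, not details confirmed by the slides, and the vectors are made-up stand-ins for a real hidden state and LDA output.

```python
import math


def l2_normalise(vec):
    """Scale a vector to unit length (all-zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def combine(doc_embedding, topic_distribution, topic_weight=1.0):
    """Concatenate a normalised LLM embedding with a weighted LDA topic vector."""
    emb = l2_normalise(doc_embedding)
    topics = [topic_weight * p for p in l2_normalise(topic_distribution)]
    return emb + topics


bert_vec = [0.2, -0.4, 0.1, 0.9]  # made-up 4-d stand-in for a last hidden state
lda_vec = [0.7, 0.2, 0.1]         # made-up distribution over 3 LDA topics
combined = combine(bert_vec, lda_vec)
print(len(combined))
```

Normalising both parts first keeps either representation from dominating distance computations purely because of its scale.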
Does it work?
Contextual embeddings improve the topic model substantially
•Clusters are better separated and more compact
•SME validation in favour of contextual embeddings
Silhouette score before embeddings: 0.67
Silhouette score after embeddings: 0.75
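For reference, the silhouette score averages (b - a) / max(a, b) over all points, where a is the mean distance to points in the same cluster and b is the mean distance to the nearest other cluster. The sketch below uses Euclidean distance and made-up 2-d points; it needs at least two clusters, and the distance metric used in the talk is not stated.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient; values near 1 mean compact, well separated clusters."""
    scores = []
    for i, p in enumerate(points):
        same = [math.dist(p, q) for j, q in enumerate(points)
                if labels[j] == labels[i] and j != i]
        if not same:  # a singleton cluster scores 0 by convention
            scores.append(0.0)
            continue
        a = sum(same) / len(same)
        b = min(
            sum(math.dist(p, q) for j, q in enumerate(points) if labels[j] == other)
            / labels.count(other)
            for other in set(labels) if other != labels[i]
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)


points = [(0, 0), (0, 1), (10, 0), (10, 1)]       # made-up 2-d embeddings
tight = silhouette_score(points, [0, 0, 1, 1])    # clusters match the geometry
mixed = silhouette_score(points, [0, 1, 0, 1])    # clusters cut across it
print(round(tight, 2), round(mixed, 2))
```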
Next: Topic generation by clustering contextualised embeddings
[Figure: the Web Data → Topics pipeline diagram shown again, this time for the UMAP (dimensionality reduction) and HDBSCAN (clustering) steps]
Use:
•Dimensionality reduction
•Clustering
Dimensionality reduction and clustering results
•Can use any dimensionality reduction algorithm e.g. PCA
or t-SNE
•We used UMAP to reduce dimensionality
•Experimented with various dimensions (2 - 500)
•Settled on 15 dimensions
•Allows clustering of high dimensional space
•Retains relationships and local structure in data
•Use HDBSCAN for clustering on top of UMAP:
•Automatically determines No. of clusters (150)
•Deals with outliers
•Can cluster complex shapes
•Results are promising, as can be seen from the plots, especially for domain-contextualized embeddings
Note: dimensionality in plots reduced to 2 for visualisation purposes
How to move from clusters to topics?
[Figure: the Web Data → Topics pipeline diagram shown again, this time for the C-TF-IDF (topic word derivation) and ChatGPT or Llama 3 (topic description) steps]
Class based TF-IDF (C-TF-IDF)
•Class based TF-IDF:
•Same as TF-IDF, but at a cluster rather than
document level
•Finds most important words in each cluster
within corpus as opposed to documents
•Example:
•Terms related to Airline Routes and Schedules
can be observed
Formula: *
* Grootendorst, M. (2022). BERTopic: Neural topic modeling with a class-based TF-IDF procedure. arXiv:2203.05794 [cs]. https://doi.org/10.48550/arXiv.2203.05794
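The formula image is missing from this transcript. In the cited BERTopic paper, the weight of word x in class c is W(x, c) = tf(x, c) * log(1 + A / f(x)), where tf(x, c) is the frequency of x in class c, f(x) is its frequency across all classes, and A is the average number of words per class. A minimal sketch using raw counts follows; the two toy clusters are made up.

```python
import math
from collections import Counter


def c_tf_idf(clusters):
    """clusters maps a cluster id to the tokens of all its documents joined together."""
    tf = {c: Counter(tokens) for c, tokens in clusters.items()}
    total = Counter()
    for counts in tf.values():
        total.update(counts)
    avg_words = sum(total.values()) / len(clusters)  # A: average words per cluster
    return {
        c: {w: n * math.log(1 + avg_words / total[w]) for w, n in counts.items()}
        for c, counts in tf.items()
    }


def top_words(weights, k=3):
    """Highest-weighted words of one cluster (ties broken alphabetically)."""
    return [w for w, _ in sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))[:k]]


clusters = {  # toy clusters, already tokenised
    "routes": "airline route schedule route airport schedule route".split(),
    "engines": "engine maintenance engine airline overhaul engine".split(),
}
scores = c_tf_idf(clusters)
print(top_words(scores["routes"]))
```

A word such as "airline" that occurs in both clusters is weighted down, so cluster-specific terms rise to the top.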
Use Llama3 to obtain description for cluster terms
Use locally fine-tuned Llama3 model to generate topic descriptions
ChatGPT or another proprietary API could also be used
•There are two prompt tags of interest, namely [DOCUMENTS] and [KEYWORDS]:
•[DOCUMENTS] contains the top 5 most relevant documents to the topic
•[KEYWORDS] contains the top 10 most relevant keywords to the topic as generated through c-TF-IDF
•This template is filled in accordingly for each topic.
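Filling such a template can be sketched as plain string substitution. The template wording below is hypothetical, as are the sample documents and keywords; only the two tags and the top-5/top-10 limits come from the slide.

```python
# Hypothetical template wording; only the [DOCUMENTS] and [KEYWORDS] tags are from the slide.
PROMPT_TEMPLATE = (
    "I have a topic that contains the following documents:\n[DOCUMENTS]\n"
    "The topic is described by the following keywords: [KEYWORDS]\n"
    "Based on the information above, give a short label for this topic."
)


def fill_prompt(template, documents, keywords, max_docs=5, max_keywords=10):
    """Insert the top documents and keywords of one topic into the template."""
    docs = "\n".join(f"- {d}" for d in documents[:max_docs])
    words = ", ".join(keywords[:max_keywords])
    return template.replace("[DOCUMENTS]", docs).replace("[KEYWORDS]", words)


prompt = fill_prompt(
    PROMPT_TEMPLATE,
    documents=["Carrier adds new summer routes", "Winter schedule published"],
    keywords=["route", "schedule", "airport"],
)
print(prompt)
```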
Final topics generated
Topic descriptions are
coherent and relevant
Finally: Consolidate topics and train classifier
[Figure: the Web Data → Topics pipeline diagram shown again, this time for the final steps: combine BERT/Llama 3 and LDA (final topics) and fine-tune Llama 3 classifier (train classifier)]
Use hierarchical clustering to merge topics
LDA and BERT topics can be merged, as well as any other topics generated by internal taxonomies
•Compare c-TF-IDF vectors (topic embeddings)
•Linkage function – Ward
•Dissimilarity function – Cosine distance (1 - cosine similarity)
•Merge the most similar ones
•Re-calculate the c-TF-IDF vectors to update the representation of our topics
•Next step is to build a classifier to predict topics from text
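A single merge step can be sketched as below. This is a greedy simplification rather than full Ward-linkage hierarchical clustering, and summing the two vectors only stands in for recalculating c-TF-IDF on the merged cluster. Topic names and vectors are made up.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two topic vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar_pair(topic_vectors):
    """Pair of topic names whose vectors have the highest cosine similarity."""
    names = sorted(topic_vectors)
    pairs = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]]
    return max(pairs, key=lambda p: cosine_similarity(topic_vectors[p[0]], topic_vectors[p[1]]))


def merge_once(topic_vectors):
    """Merge the closest pair of topics into one combined topic."""
    a, b = most_similar_pair(topic_vectors)
    merged = {k: v for k, v in topic_vectors.items() if k not in (a, b)}
    merged[f"{a}+{b}"] = [x + y for x, y in zip(topic_vectors[a], topic_vectors[b])]
    return merged


topics = {  # made-up c-TF-IDF vectors over a 4-word vocabulary
    "airline routes": [3.0, 2.5, 0.1, 0.0],
    "flight schedules": [2.8, 2.9, 0.0, 0.2],
    "engine maintenance": [0.1, 0.0, 3.2, 2.7],
}
print(sorted(merge_once(topics)))
```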
Why train classifier and not use cluster allocations for topic
generation?
Several advantages
•Easier to deploy – one model (as opposed to LDA and BERTopic/ Llama 3)
•Can combine LDA and BERT topics (or any other existing taxonomy)
•No need for extensive text pre-processing – as per LDA
•Can classify text of varying lengths e.g. short emails as well as long web pages
•Can generalize model to different languages – via LLM (if LLM supports different languages)
Specific requirements
•Need to ensure accuracy – domain contextualized LLM embeddings as features required in our case
•Embeddings need to have long context length i.e. at least 1K tokens for Use Cases with longer text e.g. web
Results: Multi label classification
Costs
Results Llama 3:
•Training set size: 8k
•Test set size: 2k
•150 topics
•Avg 3 topics per document
ROC-AUC:
•C-statistic: 72%
Still busy optimising
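The C-statistic (ROC-AUC) is the probability that a randomly chosen positive example is scored above a randomly chosen negative one. The sketch below computes it per topic and averages across topics. Macro averaging is an assumption, since the slide does not say how the 72% was aggregated, and the labels and scores are made up.

```python
def roc_auc(y_true, y_score):
    """Share of positive/negative pairs ranked correctly (ties count half)."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def macro_auc(label_matrix, score_matrix):
    """Average the per-topic AUC over the columns of a multi-label problem."""
    n_labels = len(label_matrix[0])
    aucs = [
        roc_auc([row[k] for row in label_matrix], [row[k] for row in score_matrix])
        for k in range(n_labels)
    ]
    return sum(aucs) / n_labels


# Four documents, two topics; each document may carry several topics.
y_true = [[1, 0], [0, 1], [1, 1], [0, 0]]
y_score = [[0.9, 0.2], [0.4, 0.7], [0.3, 0.8], [0.1, 0.6]]
print(macro_auc(y_true, y_score))
```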
Other Approaches – Topic modelling
•Neural Topic Models:
•Variational Auto-Encoder (VAEs)
•Taxonomy based approach:
•Existing Taxonomy e.g. OpenAlex,
•Graph or other representation to maintain
•Transform to embedding – maintains structure
•Use embedding to: identify new topics
(clustering), merge topics (entailment), predict
(classify)
•Update original graph/ taxonomy as required
•Very powerful as structure of taxonomy
maintained and stable over time
•Can model hierarchical taxonomies easily
Conclusion
1. LLMs can be combined with traditional NLP to create value:
•Accurately classify short text into a large number of classes (thousands)
•Automate reference data mapping
•Use contextualized LLM embeddings to identify latent topics in a corpus
•Create meaningful topic descriptions
•Manage topic taxonomies through Semantic Textual Similarity (STS) merging
2. Potential further uses:
•Use STS to find audiences with similar interests to the topics found
•Gain a deeper understanding of audience interests