RAGing Against the Literature: LLM-Powered Dataset Mention Extraction - presented at JCDL 2024


About This Presentation

Dataset Mention Extraction (DME) is a critical task in scientific information extraction, aiming to identify references to datasets within research papers. In this paper, we explore two advanced methods for DME, utilizing the capabilities of Large Language Models (LLMs).


Slide Content

RAGing Against the Literature: LLM-Powered Dataset Mention Extraction
Priyangshu Datta, Suchana Datta, Dwaipayan Roy
March 2, 2025
D. Roy (IISER-K) March 2, 2025


Motivation
Dataset Mention Extraction
The process of identifying and extracting references to datasets mentioned in academic papers.
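To make the task concrete, here is a toy, rule-based illustration of what DME input and output look like. This is not the paper's method; the regular expression and the example sentence are hypothetical simplifications.

```python
import re

# Toy heuristic: a capitalized name followed by "dataset"/"corpus"/"benchmark"
# is treated as a candidate dataset mention. Real DME is far harder: many
# mentions are informal and carry no such cue words.
DATASET_PATTERN = re.compile(
    r"\b([A-Z][\w-]*(?:\s[A-Z][\w-]*)*\s(?:dataset|corpus|benchmark))"
)

def extract_mentions(sentence: str) -> list[str]:
    """Return candidate dataset mentions found in one sentence."""
    return [m.group(1) for m in DATASET_PATTERN.finditer(sentence)]

print(extract_mentions(
    "We evaluate on the SQuAD dataset and the Penn Treebank corpus."
))
```

A matcher like this misses mentions without cue words (e.g. bare "ImageNet"), which is exactly the gap that learned and LLM-based extractors target.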

Motivation
Why Focus on Datasets in Research?
Datasets → essential for scientific validation, reproducibility, and benchmarking.
Cited datasets often indicate foundational resources or key contributions to the research domain.
Challenges in Finding Dataset Mentions
Dataset mentions → hard to locate, owing to the lack of structured metadata in academic papers.

Our Contributions
A novel approach that systematically combines retrieval together with LLMs for dataset mention extraction.
An evaluation of the proposed approach on diverse datasets.
We release exData, a publicly accessible web-based tool demonstrating the practical application of our proposed model.

Schematic diagram
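The schematic itself did not survive extraction; as a stand-in, here is a minimal sketch of a RAG-style extraction pipeline (chunk the paper, retrieve relevant passages, build the LLM prompt). All function names, the overlap-based scoring rule, and the prompt template are hypothetical simplifications, not the paper's implementation.

```python
def chunk(paper: str, size: int = 40) -> list[str]:
    """Split a paper into overlapping word windows (toy chunker)."""
    words = paper.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size // 2)]

def retrieve(chunks: list[str], query: str, k: int = 2) -> list[str]:
    """Rank chunks by word overlap with the query (stand-in for embeddings)."""
    q = set(query.lower().split())
    return sorted(chunks, key=lambda c: -len(q & set(c.lower().split())))[:k]

def build_prompt(context: list[str]) -> str:
    """Assemble the prompt that would be sent to the LLM."""
    joined = "\n".join(context)
    return f"Extract all dataset mentions from the text below.\n\n{joined}"

paper = ("We train on ImageNet . Results are reported on the CIFAR-10 dataset . "
         "Related work uses other corpora .")
prompt = build_prompt(retrieve(chunk(paper, size=8), "dataset mentions evaluation"))
print(prompt)
```

The retrieval step is what lets the LLM see only the passages most likely to contain mentions, rather than the whole paper.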

Dataset Mention Extraction in Benchmarks
Experimental datasets:

Dataset                           # papers   Content       # dataset mentions   # queries
DMDD [Pan et al., ACL 2023]       31,219     whole paper   over 449,000         450
CoCon [Saier et al., JCDL 2023]   340,965    abstract      7,592                500

Baselines

Model              Type       Embedding   NER   Input
SciBERT_sentence   Baseline   ✓           ✓     Sentence
SciBERT_sample     Baseline   ✓           ✓     Sentence
SciBERT_full       Baseline   ✓           ✓     Paper
BiLSTM-CRF         Baseline   ✓           ✓     Paper
LLM-DME            Proposed   ✗                 Paper
RAG-DME            Proposed   ✓                 Paper

Results

                   DMDD Dataset                    CoCon Dataset
Methods            Precision  Recall   F-score     Precision  Recall   F-score
Baselines
SciBERT_sentence   0.6583     0.4987   0.5675      0.8046     0.6573   0.7235
SciBERT_sample     0.7537     0.5389   0.6285      0.8176     0.6513   0.7250
SciBERT_full       0.7011     0.5099   0.5904      0.7944     0.6431   0.7108
BiLSTM-CRF         0.7801     0.5425   0.6400      0.8322     0.6481   0.7287
Ours
LLM-DME            0.7701     0.6828   0.7238      0.8539     0.8250   0.8392
                   (-1%)      (26%)    (13%)       (2%)       (26%)    (15%)
RAG-DME            0.8922     0.7408   0.8094      0.8562     0.8315   0.8436
                   (14%)      (37%)    (26%)       (2%)       (27%)    (16%)
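The F-scores in the table are the harmonic mean of precision and recall, and the parenthesized percentages appear to be relative improvements over the best baseline. A quick arithmetic check against the LLM-DME row on DMDD:

```python
def f_score(precision: float, recall: float) -> float:
    """F-score as the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# LLM-DME on DMDD: the F-score column follows from its P and R columns.
print(round(f_score(0.7701, 0.6828), 4))        # 0.7238, as reported

# Relative recall gain over the best baseline recall on DMDD
# (BiLSTM-CRF, R = 0.5425).
print(round((0.6828 - 0.5425) / 0.5425 * 100))  # 26, matching "(26%)"
```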

Execution time

Dataset   Methods            # queries   Total (min.)   Average (sec.)
DMDD      SciBERT_sentence   450         296.17         39.49
          SciBERT_sample                 321.10         42.81
          SciBERT_full                   389.70         51.96
          BiLSTM-CRF                     411.08         54.81
          LLM-DME                        99.36          13.52
          RAG-DME                        151.05         20.14
CoCon     SciBERT_sentence   500         135.82         16.30
          SciBERT_sample                 158.31         18.99
          SciBERT_full                   182.08         21.85
          BiLSTM-CRF                     201.30         24.16
          LLM-DME                        17.57          2.11
          RAG-DME                        29.19          3.50
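The Average column follows from the Total and # queries columns (total minutes converted to seconds, divided by the query count); a quick check on the RAG-DME rows:

```python
def avg_seconds(total_minutes: float, n_queries: int) -> float:
    """Per-query turnaround in seconds from a total runtime in minutes."""
    return total_minutes * 60 / n_queries

print(round(avg_seconds(151.05, 450), 2))  # 20.14 (RAG-DME, DMDD)
print(round(avg_seconds(29.19, 500), 2))   # 3.5  (RAG-DME, CoCon: 3.50)
```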

Execution time
LLM-DME and RAG-DME exhibit the least turnaround time on both benchmarks.
LLM-DME exhibits a lower turnaround than RAG-DME.
Although the LLM in RAG-DME processes less text per query than the traditional prompt-based approach (LLM-DME), an increase in per-query execution time is reported on both benchmarks.
The increase in turnaround is approximately 7 seconds per query for DMDD and less than 2 seconds for CoCon.

Varying Temperature

exData

Conclusions
THANK YOU!