RAGing Against the Literature: LLM-Powered Dataset Mention Extraction - presented in JCDL 2024
About This Presentation
Dataset Mention Extraction (DME) is a critical task in the field of scientific information extraction, aiming to identify references to datasets within research papers. In this paper, we explore two advanced methods for DME from research papers, utilizing the capabilities of Large Language Models (LLMs). The first method employs a language model with a prompt-based framework to extract dataset names from text chunks, utilizing patterns of dataset mentions as guidance. The second method integrates the Retrieval-Augmented Generation (RAG) framework, which enhances dataset extraction through a combination of keyword-based filtering, semantic retrieval, and iterative refinement.
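The first, prompt-based method can be sketched roughly as follows. The chunk size, the prompt wording, and the `llm` callable are illustrative assumptions, not the authors' exact setup:

```python
# Rough sketch of the prompt-based method (LLM-DME style): chunk the paper,
# prompt an LLM per chunk, and collect the dataset names it returns.
# The prompt wording and chunking policy are illustrative assumptions.

def chunk_text(text: str, max_words: int = 300) -> list[str]:
    """Split a paper into word-bounded chunks that fit in one prompt."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

PROMPT = (
    "Below is an excerpt from a research paper. List the names of any "
    "datasets it mentions, one per line. If none, reply NONE.\n\n{chunk}"
)

def extract_dataset_mentions(paper_text: str, llm) -> set[str]:
    """`llm` is any callable prompt -> reply (e.g. a chat-completion API)."""
    mentions: set[str] = set()
    for chunk in chunk_text(paper_text):
        reply = llm(PROMPT.format(chunk=chunk))
        for line in reply.splitlines():
            name = line.strip("-• ").strip()
            if name and name.upper() != "NONE":
                mentions.add(name)
    return mentions
```

Collecting mentions into a set deduplicates names that recur across chunks of the same paper.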
Slide Content
RAGing Against the Literature: LLM-Powered Dataset Mention Extraction
Priyangshu Datta, Suchana Datta, Dwaipayan Roy
March 2, 2025
D. Roy (IISER-K) March 2, 2025
Motivation
Dataset Mention Extraction
The process of identifying and extracting references to datasets mentioned in academic papers.
Why Focus on Datasets in Research?
Datasets→essential for scientific validation, reproducibility, and benchmarking.
Cited datasets often indicate foundational resources or key contributions to the research domain.
Challenges in Finding Dataset Mentions
Dataset mentions→not explicitly marked or consistently formatted.
Academic papers lack structured metadata for datasets.
Our Contributions
A novel approach that systematically combines retrieval-augmented generation together with LLMs for dataset mention extraction.
An extensive evaluation on diverse datasets.
We release exData, a publicly accessible web-based tool demonstrating the practical application of our proposed model.
Schematic diagram
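The three RAG-DME stages named in the abstract (keyword-based filtering, semantic retrieval, iterative refinement) can be sketched as below. The cue-word list, the bag-of-words similarity, and the two-round refinement loop are toy stand-ins for the real retriever, encoder, and LLM:

```python
# Toy sketch of the three RAG-DME stages from the abstract: keyword-based
# filtering, semantic retrieval, and iterative refinement. The cue words and
# bag-of-words similarity are stand-ins for a real retriever and encoder.

import math
import re
from collections import Counter

DATASET_CUES = {"dataset", "corpus", "benchmark", "collection"}  # assumed cue words

def tokens(text: str) -> list[str]:
    return re.findall(r"\w+", text.lower())

def keyword_filter(sentences: list[str]) -> list[str]:
    """Stage 1: keep only sentences containing a dataset cue word."""
    return [s for s in sentences if DATASET_CUES & set(tokens(s))]

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def semantic_retrieve(query: str, candidates: list[str], k: int = 3) -> list[str]:
    """Stage 2: rank the filtered sentences by similarity to the query."""
    q = Counter(tokens(query))
    return sorted(candidates, key=lambda s: cosine(q, Counter(tokens(s))), reverse=True)[:k]

def refine(llm, context: list[str], rounds: int = 2) -> set[str]:
    """Stage 3: repeatedly ask the LLM to extract/verify names given context."""
    mentions: set[str] = set()
    for _ in range(rounds):
        reply = llm("\n".join(context), sorted(mentions))
        mentions |= {m.strip() for m in reply.splitlines() if m.strip()}
    return mentions
```

The filtering stage cheaply shrinks the candidate pool so the LLM in the refinement stage sees far less text per query than a whole-paper prompt would require.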
Dataset Mention Extraction in Benchmarks
Experimental datasets:

Dataset                          # papers   Content       # dataset mentions   # queries
DMDD [Pan et al., ACL 2023]      31,219     whole paper   over 449,000         450
CoCon [Saier et al., JCDL 2023]  340,965    abstract      7,592                500
Baselines
Model             Type      Embedding   NER   Input
SciBERT_sentence  Baseline  ✓           ✓     Sentence
SciBERT_sample    Baseline  ✓           ✓     Sentence
SciBERT_full      Baseline  ✓           ✓     Paper
BiLSTM-CRF        Baseline  ✓           ✓     Paper
LLM-DME           Proposed  ✗                 Paper
RAG-DME           Proposed  ✓                 Paper
Execution time
Dataset   Method            # queries   Total (min.)   Average (sec.)
DMDD      SciBERT_sentence  450         296.17         39.49
DMDD      SciBERT_sample    450         321.10         42.81
DMDD      SciBERT_full      450         389.70         51.96
DMDD      BiLSTM-CRF        450         411.08         54.81
DMDD      LLM-DME           450         99.36          13.52
DMDD      RAG-DME           450         151.05         20.14
CoCon     SciBERT_sentence  500         135.82         16.30
CoCon     SciBERT_sample    500         158.31         18.99
CoCon     SciBERT_full      500         182.08         21.85
CoCon     BiLSTM-CRF        500         201.30         24.16
CoCon     LLM-DME           500         17.57          2.11
CoCon     RAG-DME           500         29.19          3.50
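As a quick sanity check on the table, the per-query average in seconds should equal total minutes × 60 divided by the query count; a few rows, copied verbatim, verify this:

```python
# Sanity check on the execution-time table: Average (sec.) should equal
# Total (min.) * 60 / # queries. A few rows copied verbatim from the table.

rows = [
    # (dataset, method, queries, total_min, avg_sec)
    ("DMDD", "SciBERT_sentence", 450, 296.17, 39.49),
    ("DMDD", "RAG-DME", 450, 151.05, 20.14),
    ("CoCon", "LLM-DME", 500, 17.57, 2.11),
    ("CoCon", "RAG-DME", 500, 29.19, 3.50),
]

for ds, method, n, total_min, avg_sec in rows:
    assert abs(total_min * 60 / n - avg_sec) < 0.01, (ds, method)
```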
Execution time
LLM-DME and RAG-DME exhibit the least turnaround time on both benchmarks.
LLM-DME exhibits lower turnaround than RAG-DME, even though the LLM in RAG-DME processes less text per query than the traditional prompt-based LLM-DME.
The per-query increase in turnaround for RAG-DME over LLM-DME is approximately 7 seconds on DMDD and less than 2 seconds on CoCon.
Varying Temperature
[Figure: results with varying temperature, panels (a) and (b)]
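As background for the temperature sweep: during decoding, logits are divided by the temperature T before the softmax, so T < 1 sharpens the output distribution (more deterministic extraction) and T > 1 flattens it (more diverse output). A minimal illustration:

```python
# Minimal illustration of sampling temperature: logits are divided by T
# before the softmax, so T < 1 sharpens the distribution and T > 1 flattens it.

import math

def softmax_with_temperature(logits: list[float], T: float) -> list[float]:
    scaled = [x / T for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]
```

For example, with logits [2, 1, 0], T = 0.5 puts about 87% of the probability mass on the top token, while T = 2.0 drops it to about 51%.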
exData
Conclusions
THANK YOU!