Thesaurus-Based Index Term Extraction

CIARD_AIMS, 11 slides, Mar 04, 2011

Thesaurus-Based
Index Term Extraction
Olena Medelyan
Digital Library
Laboratory

Index Terms vs. Keyphrases
•Describe the topics in a document
•Index terms: controlled vocabulary
(e.g. predatory birds, damage, aquaculture)
•Keyphrases: freely chosen
(e.g. techniques, bird predation, aquaculture)
•Purposes:
–Organize a library’s holdings
–Provide thematic access to documents
–Represent documents as brief summaries
–Aid navigation in search results
•Manual assignment: expensive, time-consuming
Example document: “Overview of Techniques for Reducing Bird Predation at Aquaculture Facilities”

Extraction vs. Assignment
Extraction:
•Select significant n-grams or NPs according to their characteristics
- Restriction to syntax
- Bad quality phrases
- No consistency
+ Easy and fast implementation
+ Not much training required
Assignment:
•Classify documents according to their content words into classes (labels = keyphrases)
- Need large corpora
- Long computational time
- Not practical
+ Word co-occurrence
+ High accuracy

KEA++
•Combines extraction with controlled
vocabulary
•Considers semantic relations
•Controlled vocabulary = thesaurus
•Experiment:
–agricultural documents (www.fao.org/docrep)
–Agrovoc thesaurus (www.fao.org/agrovoc)

How does it Work?
1.Extract n-grams, transform them to pseudo-phrases, map to
pseudo-phrases of the thesaurus descriptors
bird predation → predat bird
2.Each document = set of candidate phrases
3.Training (documents + manually assigned phrases)
a.Compute the features
b.Compute the model
4.Testing (new documents, no phrases)
a.Compute the features
b.Compute probabilities according to the model
5.Classification model: Naïve Bayes
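
Step 1 can be sketched roughly as follows. This is a minimal illustration, not the KEA++ implementation: the stopword list, the toy suffix stripper and the function names are all hypothetical, and sorting the stems alphabetically is an assumption made here so that word order is ignored (the slide writes the result as "predat bird").

```python
import re

# Hypothetical, deliberately tiny resources for illustration only.
STOPWORDS = {"of", "the", "at", "for", "and", "in"}
SUFFIXES = ("ion", "ory", "s")  # toy stemmer, not a real stemming algorithm


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def pseudo_phrase(phrase):
    """Lowercase, drop stopwords, stem, and sort so that word order is ignored."""
    tokens = [t for t in re.findall(r"[a-z]+", phrase.lower()) if t not in STOPWORDS]
    return " ".join(sorted(toy_stem(t) for t in tokens))


def candidate_descriptors(text, descriptors, max_n=3):
    """Return the descriptors whose pseudo-phrase matches some n-gram of the text."""
    index = {pseudo_phrase(d): d for d in descriptors}
    tokens = re.findall(r"[a-z]+", text.lower())
    found = set()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            key = pseudo_phrase(" ".join(tokens[i : i + n]))
            if key in index:
                found.add(index[key])
    return found
```

With this sketch, "bird predation" and the descriptor "predatory birds" reduce to the same pseudo-phrase, so the free-text phrase is mapped onto the controlled term.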

Features
•TF×IDF – phrases that are specific for a given
document are significant
•First Position – phrases that appear at the beginning (or
the end) of the document are significant
•Phrase Length – phrases with a certain number of
words are significant (2!)
•Node Degree – phrases that are related to the most
other phrases in the document are significant
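
The slides do not give the exact formulas, so the following is only a sketch of how three of these features are commonly computed. The function names, the log base and the `related` mapping (descriptor to the set of descriptors it is linked to in the thesaurus) are assumptions for illustration.

```python
import math


def tf_idf(phrase_count, doc_length, docs_with_phrase, num_docs):
    """Frequency of the phrase in the document times its inverse document frequency."""
    tf = phrase_count / doc_length
    idf = -math.log2(docs_with_phrase / num_docs)
    return tf * idf


def first_position(tokens, phrase_tokens):
    """Relative position of the first occurrence (0.0 = start), or None if absent."""
    n = len(phrase_tokens)
    for i in range(len(tokens) - n + 1):
        if tokens[i : i + n] == phrase_tokens:
            return i / len(tokens)
    return None


def node_degree(phrase, candidates, related):
    """Number of other candidate phrases linked to this phrase in the thesaurus."""
    links = related.get(phrase, set())
    return sum(1 for other in candidates if other != phrase and other in links)
```

Phrase Length is simply the number of words in the phrase, so it needs no helper.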

Example
(Diagram: terms assigned to the example document by six indexers and by KEA++, linked by Agrovoc relations.)
Legend: Indexers 1 2 3 4 5 6; Agrovoc relation; KEA++
Terms shown: fisheries, fish culture, aquaculture, fish ponds, aquaculture techniques, bird control, predatory birds, noxious birds, scares, pest control, control methods, monitoring, methods, equipment, protective structures, electrical installation, fencing, damage, noise, North America, techniques, fishery production, predation, predators, birds, ropes, fishing operations
Evaluation I
•Standard Evaluation:
–Number of exact matches in the test set
–Precision, Recall, F-measure
•Problem:
–Semantic similarity is not considered
–Comparison only to one indexer, although
indexing is subjective
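
A minimal sketch of the standard exact-match evaluation, assuming both phrase sets are plain collections of strings. The function name and the toy sets in the usage note are illustrative, not data from the experiment.

```python
def precision_recall_f1(extracted, gold):
    """Exact-match precision, recall and F-measure between two phrase sets."""
    extracted, gold = set(extracted), set(gold)
    matches = len(extracted & gold)
    precision = matches / len(extracted) if extracted else 0.0
    recall = matches / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

For example, comparing the extracted set {aquaculture, damage, birds, ropes} with the manual set {aquaculture, damage, predatory birds} counts "birds" as a plain miss even though it is close to "predatory birds", which is the problem named above.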

Evaluation II
•Inter-indexer consistency, e.g. Rolling’s measure:
Rolling’s IIC = 2C / (A + B)
C – number of phrases in common
A – number of phrases in the first set
B – number of phrases in the second set

Consistency of each indexer (%):
Indexer   vs. other indexers   vs. KEA   vs. KEA++
1         42                   7         29
2         39                   8         28
3         37                   9         26
4         37                   6         31
5         37                   6         25
6         36                   4         20
avg       38                   7         27
KEA++ vs. other indexers on average: -11% (27 vs. 38)
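
Rolling’s measure is straightforward to compute from two phrase sets. The sketch below assumes set inputs and returns a value between 0 and 1; the function name and the handling of two empty sets are choices made here, not part of the slides.

```python
def rolling_consistency(first, second):
    """Rolling's inter-indexer consistency: 2C / (A + B)."""
    first, second = set(first), set(second)
    if not first and not second:
        return 0.0  # assumption: two empty sets are treated as no agreement
    common = len(first & second)  # C
    return 2 * common / (len(first) + len(second))  # A + B in the denominator
```

Two sets of sizes 3 and 5 that share 2 phrases give 2·2 / (3 + 5) = 0.5, that is, 50% consistency.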

Results
“Overview of Techniques for Reducing Bird Predation at Aquaculture Facilities”

Match      Indexer            KEA++
Exact      aquaculture        aquaculture
           damage             damage
           fencing            fencing
           scares             scares
           noise*             noise*
Similar    bird control       birds
           predatory birds    predators
           fish culture       fishing operations
                              fishery production
No match   noxious birds
           control methods
                              ropes
*Selected by only one indexer

Problems & Future Work
•Trivial problems (e.g. stemming errors)
•Document chunking
–What are important and disturbing parts of the
document?
•Topic coverage
–Exploring the thesaurus’ structure
–Lexical chains
•Term occurrence
–Including other NLP resources (e.g. WordNet)
•Multilinguality, other domains