782893827-7-NER-in-Details.ppt

rajubandam694 12 views 49 slides Mar 04, 2025

Named Entity Recognition
NER

Named Entity Recognition
•The task of identifying proper names of people, organizations, locations, or
other entities is a subtask of information extraction from natural language
documents.
•Whether or not a word is a name, and which entity type a name has, are
determined mostly by the context of the word and by the entity types of its
neighbours.

Why do NER?
•Key part of an Information Extraction system
•Robust handling of proper names is essential for many applications
such as summarization, IR, anaphora resolution etc.
•Pre-processing for different classification levels
•Information filtering
•Information linking

What is NER ?
•NER involves identification of proper names in texts, and
classification into a set of predefined categories of interest.
•Three universally accepted categories:
•Person, location and organisation
•Other common tasks: recognition of date/time expressions,
measures (percent, money, weight etc), email addresses etc.
•Other domain-specific entities: names of Drugs, Genes, medical
conditions, names of ships, bibliographic references etc.

NER Definition
•Named entity recognition (NER), also known as entity identification (EI) and
entity extraction, is the task of locating and classifying atomic elements in
text into predefined categories such as the names of persons, organizations,
locations, expressions of times, quantities, monetary values, percentages, etc.
Rohan sold 5 companies in 2002.
<ENAMEX TYPE="PERSON">Rohan</ENAMEX> sold <NUMEX TYPE="QUANTITY">5</NUMEX> companies in <TIMEX TYPE="DATE">2002</TIMEX>.
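The inline markup in the example can be produced mechanically once the entity spans are known. The following is a minimal sketch, assuming the spans are given as character offsets and do not overlap; the function name `to_inline_markup` and the span layout are illustrative choices, not part of any tagging standard.

```python
from typing import List, Tuple

# (start, end, element, type): character offsets into the text, end exclusive.
Span = Tuple[int, int, str, str]


def to_inline_markup(text: str, spans: List[Span]) -> str:
    """Wrap each non-overlapping span in a tag such as <ENAMEX TYPE="PERSON">."""
    pieces = []
    cursor = 0
    for start, end, element, etype in sorted(spans):
        pieces.append(text[cursor:start])  # untagged text before the span
        pieces.append(f'<{element} TYPE="{etype}">{text[start:end]}</{element}>')
        cursor = end
    pieces.append(text[cursor:])  # untagged text after the last span
    return "".join(pieces)


sentence = "Rohan sold 5 companies in 2002."
spans = [
    (0, 5, "ENAMEX", "PERSON"),
    (11, 12, "NUMEX", "QUANTITY"),
    (26, 30, "TIMEX", "DATE"),
]
print(to_inline_markup(sentence, spans))
```

Finding the spans is the actual NER problem; this sketch only covers the output format.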

What is not NER?
•NER is not event recognition.
•NER does not create templates,
•NER does not perform co-reference or entity linking,
•though these processes are often implemented alongside
NER as part of a larger IE system.
•NER is not just matching text strings with pre-defined
lists of names.
It recognises entities which are being used as entities
in a given context.

Descriptivist Theory of Names
Proper names either are synonymous with descriptions, or have
their reference determined by virtue of the name's being
associated with a description or cluster of descriptions that an
object uniquely satisfies.
Causal Theory of Reference
Proper names refer to an object by virtue of a causal connection
with the object as mediated through communities of speakers.
That is, proper names, in contrast to descriptions, are rigid
designators.
Rigid designators: A proper name refers to the named object in
every possible world in which the object exists.
Descriptions: A description may designate different objects in
different possible worlds.

Proper Names and Definite Descriptions
•In a sentence involving a proper name, the name could be
substituted by a contextually appropriate description
without changing the meaning.
e.g. Otto von Bismarck can be known or described as the
first Chancellor of the German Empire.
Kripke argues that definite descriptions cannot be rigid
designators, because a definite description need not pick out
the same object in all possible worlds.

Examples: not a named entity vs. named entity
•Hotel & Taj Hotel
•Flower & Rose Flower
•Beach & Juhu Beach
•Airport & Indira Gandhi International airport
•The School & Good Shepherd School
•Prime Minister & Mr. Narendra Modi

Some problems in identifying NEs
•Variation of NEs
•Manmohan Singh, Manmohan, Dr. Manmohan Singh
•Ambiguity of NE types:
•1945 (date vs. time)
•Washington (location vs. person)
•May (person vs. month)
•Tata (person vs. organization)

Ambiguity Examples
•Person vs Location
•Sir C. P. Ramaswamy was the Divan of Travancore (Per)
•Sir C. P. Ramaswamy Road is in Noida (Loc)
•Person vs Organization
•Anil Ambani opened Reliance Fresh (Per)
•Reliance Fresh is under Anil Ambani Group Ltd (Org)

More complex problems in NER
Issues of style, structure, domain, genre etc.
•Punctuation, spelling, spacing, formatting ... all have an impact
Dept. of Computing and Information Science
Manipal University
Manipal
United Kingdom

Tagset for Named Entity
•The ACE (Automatic Content Extraction) tagset is hierarchical
•The CLIA tagset is also hierarchical, similar to ACE
•It was developed for two domains: tourism and health

TAGSET
•ENAMEX
•Person
•Individual
•Family name
•Title
•Group
•Organization
•Government
•Public/private company
•Religious
•Non-government
•Political Party
•Para military
•Charitable
•Association
•GPE (Geo-political Social Entity)
•Media
•Location
•Place
•District
•City
•State
•Nation
•Continent
•Address
•Water-bodies
•Landscapes
•Celestial Bodies
•Manmade
•Religious Places
•Roads/Highways
•Museum
•Theme parks/Parks/Gardens
•Monuments
•Facilities
•Hospitals
•Institutes
•Library
•Hotel/Restaurants/Lodges
•Plant/Factories
•Police Station/Fire Services
•Public Comfort Stations
•Airports
•Ports
•Bus-Stations
•Locomotives
•Artifacts
•Implements
•Ammunition
•Paintings
•Sculptures
•Clothes
•Gems & Stones
•Entertainment
•Dance
•Music
•Drama/Cinema
•Sports
•Events/Exhibitions/Conferences
•Cuisines
•Animals
•Plants

Tagset Continued
•NUMEX
•Distance
•Money
•Quantity
•Count
•TIMEX
•Time
•Date
•Day
•Period

Tagset Counts
First level tags: 3
Second level tags: 43
Third level tags: 40
Total: 86

How to Annotate
•1. ENAMEX
•1.1 Person
•1.1.1 Individual
•These refer to the names of individual persons, including the names of
fictional characters found in stories, novels etc.
Tag Structure:
<ENAMEX TYPE="PERSON" SUBTYPE_1="INDIVIDUAL"> abc </ENAMEX>
Examples:
English:
<ENAMEX TYPE="PERSON" SUBTYPE_1="INDIVIDUAL">Abdul Kalam</ENAMEX>

Annotation continued
Family name
A person name often includes a family name. Whenever an individual name
occurs together with a family name, the part of the name that refers to the
family name must be tagged specifically with the subtag "FAMILYNAME", as
shown below.
Tag Structure:
<ENAMEX TYPE="PERSON" SUBTYPE_1="INDIVIDUAL" SUBTYPE_2="FAMILYNAME"> abc </ENAMEX>
Examples:
English:
<ENAMEX TYPE="PERSON" SUBTYPE_1="INDIVIDUAL">Narendra <ENAMEX TYPE="PERSON" SUBTYPE_1="INDIVIDUAL" SUBTYPE_2="FAMILYNAME">Modi</ENAMEX></ENAMEX>
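One way to keep such nested annotations well-formed is to build them with an XML library rather than by string concatenation. The sketch below uses Python's standard `xml.etree.ElementTree`; the helper name `annotate_person` is hypothetical, and the nesting (family name wrapped inside the individual tag) is one reading of the guideline above, not a confirmed specification.

```python
import xml.etree.ElementTree as ET


def annotate_person(given: str, family: str) -> str:
    """Build an INDIVIDUAL annotation with the family name tagged inside it."""
    outer = ET.Element("ENAMEX", {"TYPE": "PERSON", "SUBTYPE_1": "INDIVIDUAL"})
    outer.text = given + " "
    inner = ET.SubElement(
        outer,
        "ENAMEX",
        {"TYPE": "PERSON", "SUBTYPE_1": "INDIVIDUAL", "SUBTYPE_2": "FAMILYNAME"},
    )
    inner.text = family
    return ET.tostring(outer, encoding="unicode")


tagged = annotate_person("Narendra", "Modi")
print(tagged)

# Parsing the result back confirms that every opening tag has its closing tag
# and that the plain text of the name is preserved.
root = ET.fromstring(tagged)
print("".join(root.itertext()))
```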

NE Types
The named entity hierarchy is divided into three major classes: entity
names (ENAMEX), numerical expressions (NUMEX) and time expressions (TIMEX).

Entity Name Types

Persons are entities limited to humans. A person may be a single
individual or a group. Individual refers to the name of an individual person.
Group refers to a set of individuals.

Location entities are limited to geographical entities such as countries,
cities, continents and landmasses, bodies of water, and geological formations.

Organization entities are limited to corporations, agencies, and other
groups of people defined by an established organizational structure.

Examples for Entity Name Types
En: [Sita]PERSON is working at [HCL]ORGANIZATION, which is in [Noida]LOCATION
Hi: [Seetha]PERSON [HCL]ORGANIZATION main kaam kar rahi hai, jo ki [Noida]LOCATION main hain.
Word-by-word gloss (En): Sita HCL work is which Noida in

Entity Name Types
Facilities are limited to buildings and other permanent man-made structures
and real estate improvements such as hospitals, airports, colleges, libraries etc.
En: [Appolo Hospital]FACILITY is in [Noida]LOCATION
Ta: [Appallo maruthuvamanAi]FACILITY [Noidayil]LOCATION irukkirathu
Ml: [Appolo Asupathri]FACILITY [Noidayil]LOCATION aaN
Hi: [Appolo aspathaal]FACILITY [Noida]LOCATION mein haim.

Entity Name Types
A locomotive entity is a physical device primarily designed to move an object
from one location to another, by carrying, pulling, or pushing the transported
object.
En: [Ananthapuri Express]LOCOMOTIVE departs from [Noida]LOCATION at [7.30pm]TIME.
Hi: [Ananthapuri express]LOCOMOTIVE [Noida]LOCATION se [rath 7.30]TIME ko ravana hoga
Ml: [Ananthapuri eksprass]LOCOMOTIVE [Noidayilninn]LOCATION [raathri 7.30 maNikk]TIME puRappetum.
Ta: [Ananthapuri viraivu rayil]LOCOMOTIVE [Noidayilirunthu]LOCATION [iRavu 7.30 maNikku]TIME puRappatukirathu

Entity Name Types
Artifact entities are objects or things produced or shaped by human craft,
such as tools, weapons/ammunition, art paintings, clothes, ornaments, medicines.
En: [Vinayaga Statue]ARTIFACT is looking beautiful.
Hi: [Vinayaka moorthi]ARTIFACT achi lagh rahi haim.
Other languages:
Ta: [Vinayakarin Silai]ARTIFACT pArpatharkku alakAkAkairukkirathu
Ml: [ganapathi vigraham]ARTIFACT baMgiyaayi irikkunnu.

Entity Name Types
Entertainment entities denote activities that are diverting and hold human
attention or interest, giving pleasure, happiness or amusement, especially
performances of some kind such as dance, music, sports, events.
En: [Flower Exhibition]ENTERTAINMENT is held at [Hyderabad]LOCATION
Hi: [phool pradarshnii]ENTERTAINMENT [hyderabad]LOCATION meN Ayojith kiyaa jAthA hai
Other languages:
Ta: [Malar kankAtchi]ENTERTAINMENT [hyderabaadil]LOCATION Nadaiperukirathu
Ml: [pushpa pradarshanam]ENTERTAINMENT [hyderabaadil]LOCATION natakkunnu

Entity Name Types
Materials refer to the names of food items, cuisines, chemicals and
cosmetics.

En: [Honey]MATERIALS is good for face
Hi: [Shahad]MATERIALS chehare ke liye achcha hai.

Entity Name Types
ORGANISMS: These are the names of different animal species including
birds, reptiles, viruses, bacteria and names of herbs, medicinal plants, shrubs,
trees, fruits, flowers etc.
En: [Peacock]ORGANISM is the national bird of [India]LOCATION
Hi: [Mor]ORGANISM [bhaarat]LOCATION ka raashtriya pakshi hai.

Entity Name Types
Disease: Names of diseases, symptoms, diagnoses and treatments come
under this type.
En: Smoking causes [Cancer]DISEASE
Hi: dhumrapan [kaansar]DISEASE ka karan banata hai.

Numerical Expressions
NUMEX: DISTANCE, QUANTITY, COUNT, MONEY

Distance refers to distance measures such as kilometers, centimeters,
meters, acres, feet etc.
Example: 10 cm., twenty feet, 15 hectares

Money specifies different currency values such as rupee, euro, dinar,
dollar etc.
Example: Rs. 1000, 250 Euro, $160

Count denotes the number (or count) of items/articles/things etc.
Example: 5 subjects, 12 students, 20 books

Quantity covers measurements such as liters, tons, grams, volts etc.
Example: 20 litres, 22 kg, 50g, 100 volts
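The four NUMEX subtypes can be approximated by looking at the unit that accompanies the number. The sketch below is a rough illustration using regular expressions: the unit lists are taken from the examples on this slide, the function name `classify_numex` is hypothetical, and spelled-out numbers such as "twenty feet" are deliberately not handled.

```python
import re
from typing import Optional

# Checked in order; COUNT is the fallback for "<number> <noun>".
PATTERNS = [
    ("MONEY", re.compile(
        r"^(?:Rs\.?\s*\d+|\$\s*\d+|\d+\s*(?:euro|rupees?|dollars?|dinar))$", re.I)),
    ("DISTANCE", re.compile(
        r"^\d+\s*(?:km|cm|m|kilometers?|centimeters?|meters?|feet|acres?|hectares?)\.?$",
        re.I)),
    ("QUANTITY", re.compile(
        r"^\d+\s*(?:litres?|liters?|kg|g|grams?|tons?|volts?)$", re.I)),
    ("COUNT", re.compile(r"^\d+\s+[A-Za-z]+$")),
]


def classify_numex(expression: str) -> Optional[str]:
    """Return the NUMEX subtype of a numeric expression, or None if no rule fits."""
    text = expression.strip()
    for subtype, pattern in PATTERNS:
        if pattern.match(text):
            return subtype
    return None


for example in ["10 cm.", "Rs. 1000", "250 Euro", "$160", "12 students", "22 kg", "50g"]:
    print(example, "->", classify_numex(example))
```

Because COUNT matches any number followed by a word, the order of the rules matters: a unit that is missing from the DISTANCE or QUANTITY lists would silently fall through to COUNT.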

Time Expressions
TIMEX: TIME, DATE, MONTH, YEAR, DAY, SPECIAL DAY, PERIOD

Temporal Expressions
Temporal expressions are entities that refer to time, date, year, month and day.

Time: These refer to expressions of time in their different forms, including
hours, minutes and seconds.
Example
5 o'clock in the morning
9.30 a.m.
Evening 6.30 p.m.

Date: This refers to expressions of date in different forms, such as
13/12/2001. It also includes month, date and year.
Example
August 15 1947
1956
September 11

Temporal Expressions
Day: These are expressions that convey days in a year. They can also include
days occurring weekly/fortnightly/monthly/quarterly/biennially etc.
Example
Sunday
Tomorrow
Today
Yesterday
Special Day: refers to special days in a year
Example
Gandhi Jayanthi
Rama Navami

Temporal Expressions
Period: refers to expressions that express a duration of time, a
time period or a time interval.
Example
17th century
10 minutes
10 a.m. to 12 p.m.
One year
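The TIMEX subtypes described on the last three slides can be sketched as a small rule set. This is only an illustration built from the slide examples: the subtype labels, the tiny special-day list and the function name `classify_timex` are assumptions, and free-form expressions such as "5 o'clock in the morning" are not covered.

```python
import re
from typing import Optional

MONTHS = ("January|February|March|April|May|June|July|August|"
          "September|October|November|December")
CLOCK = r"\d{1,2}(?:[.:]\d{2})?\s*(?:a\.m\.|p\.m\.)"

# Small lookup lists; a real system would need much larger ones.
SPECIAL_DAYS = {"gandhi jayanthi", "rama navami"}
DAYS = {"monday", "tuesday", "wednesday", "thursday", "friday", "saturday",
        "sunday", "today", "tomorrow", "yesterday"}

# PERIOD rules come first so that "10 a.m. to 12 p.m." is not read as a TIME.
RULES = [
    ("PERIOD", re.compile(rf"^{CLOCK}\s+to\s+{CLOCK}$", re.I)),
    ("PERIOD", re.compile(r"^\d+\s*(?:st|nd|rd|th)\s+century$", re.I)),
    ("PERIOD", re.compile(
        r"^(?:\d+|one)\s+(?:seconds?|minutes?|hours?|days?|weeks?|months?|years?)$",
        re.I)),
    ("TIME", re.compile(rf"^(?:(?:morning|evening)\s+)?{CLOCK}$", re.I)),
    ("DATE", re.compile(r"^\d{1,2}/\d{1,2}/\d{4}$")),
    ("DATE", re.compile(rf"^(?:{MONTHS})\s+\d{{1,2}}(?:,?\s+\d{{4}})?$", re.I)),
    ("DATE", re.compile(r"^\d{4}$")),
]


def classify_timex(expression: str) -> Optional[str]:
    """Return the TIMEX subtype of an expression, or None if no rule fits."""
    text = expression.strip()
    lowered = text.lower()
    if lowered in SPECIAL_DAYS:
        return "SPECIAL_DAY"
    if lowered in DAYS:
        return "DAY"
    for subtype, pattern in RULES:
        if pattern.match(text):
            return subtype
    return None


for example in ["9.30 a.m.", "August 15 1947", "Sunday", "17 th century", "Rama Navami"]:
    print(example, "->", classify_timex(example))
```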

Methodologies
Methods:
1) Rule Based
2) Machine Learning
•Hidden Markov Model (HMM)
•Naïve Bayes Classifier
•Maximum Entropy Markov Model (MEMM)
•Conditional Random Fields (CRF)
3) Hybrid Approach

Challenges of NER in Indian Languages
The following are the major challenges encountered in Indian
languages.
Ambiguity
Between proper and common nouns
Between named entities
Lack of capitalization

Challenges of NER in Indian Languages
Ambiguity
Indian languages suffer comparatively more from the ambiguity that
exists between common and proper nouns and among named entities themselves.
In some cases the same word can refer to different named entity types. Those
instances can be recognized from contextual information.
Examples:
Hi: Akash - Person name and Sky
Hi: Sooraj - Person name and Sun
Hi: Chaand - Moon and Silver
Hi: Aam - Mango and Common
Ml: Roopa - Person name and Rupee
Ml: Madhu - Person name and Honey
Ml: Mala - Person name and Garland

Challenges of NER in Indian Languages
Ta: Thinkal - Day and Month
Ta: Malar - Person name and Flower
Ta: Chevvai - Day and Planet
Ta: Shakti - Person name and Power

Challenges of NER in Indian Languages
Spelling Variation: Due to different writing styles, the same entity is
represented in various word forms. In Tamil, Sanskrit letters
such as "ja", "sha", "sri", "Ha" are replaced by "sa", "ciri", "ka".
Example:
Roja can be written as Rosa
Srimathi - cirimathi
Raja - rasa
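A common way to cope with such variants is to map every spelling to one canonical key before comparing names. The sketch below is illustrative only: it uses just the two substitutions that the three examples above confirm ("sri" to "ciri" and "ja" to "sa"), works on romanized text, and the function name `normalize` is hypothetical.

```python
# Ordered substitutions, applied to the lower-cased romanized name.
# Only the replacements confirmed by the slide examples are included.
REPLACEMENTS = [("sri", "ciri"), ("ja", "sa")]


def normalize(name: str) -> str:
    """Map spelling variants of a romanized name to one canonical key."""
    key = name.lower()
    for source, target in REPLACEMENTS:
        key = key.replace(source, target)
    return key


pairs = [("Roja", "Rosa"), ("Srimathi", "cirimathi"), ("Raja", "rasa")]
for left, right in pairs:
    print(left, right, normalize(left) == normalize(right))
```

Blind substitution like this can also merge names that are genuinely different, so in practice it would be used to generate match candidates rather than to decide identity.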

Challenges of NER in Indian Languages
Lack of Capitalization
In English and some other European languages, capitalization is considered
an important feature for identifying proper nouns.
It plays a major role in NE identification.
Unlike English, Indian languages have no concept of capitalization.

Nested Entities
Nested entities are named entities that occur within another
named entity. They are also called embedded entities.
Hi: [[Rajeev]PERSON Marg]ROAD
En: Rajeev Road
Ta: [[Mathurai]LOCATION [MeenAtchi Amman]PERSON Koyil]RELPLACE
En: Mathurai Meenatchi Amman Temple
Ml: [[Nittoor]PERSON Srinivasa rao]PERSON
En: Nitoor Srinivasa rao
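Nested annotations can be read back with a stack: every opening bracket starts a new entity, and every closing bracket finishes the innermost open one. The sketch below assumes each example is written on one line with the label in capitals directly after the closing bracket, as in `[[Rajeev]PERSON Marg]ROAD`; the function name `parse_nested` is hypothetical.

```python
import re
from typing import List, Tuple

# A token is an opening bracket, a closing bracket with its label, or plain text.
TOKEN = re.compile(r"\[|\]([A-Z]+)|[^\[\]]+")


def parse_nested(annotated: str) -> List[Tuple[str, str]]:
    """Return (entity text, label) pairs, inner entities before outer ones."""
    entities = []
    stack: List[List[str]] = []  # text pieces collected for each open bracket
    for match in TOKEN.finditer(annotated):
        token = match.group(0)
        if token == "[":
            stack.append([])
        elif token.startswith("]"):
            if not stack:
                raise ValueError("closing bracket without an opening bracket")
            text = " ".join("".join(stack.pop()).split())
            entities.append((text, match.group(1)))
            if stack:
                # The inner entity's text is also part of the enclosing entity.
                stack[-1].append(text)
        elif stack:
            stack[-1].append(token)
    return entities


print(parse_nested("[[Rajeev]PERSON Marg]ROAD"))
print(parse_nested("[[Mathurai]LOCATION [MeenAtchi Amman]PERSON Koyil]RELPLACE"))
```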

Approaches in Named Entity Resolution
•Dictionary Look-up
•Rule based ( Using lexical, contextual and morphological information)
•Maximum entropy theory based
•Hidden Markov Model
•Conditional Random Fields
•Hybrid methods (Statistical+ Linguistics)

Dictionary (Gazetteers) Look-up Approach
•Uses dictionaries for identifying NEs
•Gazetteer contains NEs from all domains
•Advantage
•Very simple approach
•Gives very high precision

Disadvantages of Dictionary Approach
•Preparation of exhaustive dictionary is a tedious and expensive process.
•The dictionary should cover the different spellings of the same place.
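The look-up approach and its limits can be seen in a few lines of code. This is a minimal sketch, assuming whitespace-tokenized input and a longest-match-first strategy; the gazetteer entries and labels are illustrative and the function name `gazetteer_tag` is hypothetical.

```python
from typing import Dict, List, Tuple

# Tiny illustrative gazetteer: lower-cased token tuples mapped to an entity type.
GAZETTEER: Dict[Tuple[str, ...], str] = {
    ("taj", "hotel"): "FACILITY",
    ("juhu", "beach"): "LOCATION",
    ("noida",): "LOCATION",
    ("washington",): "LOCATION",
}
MAX_LEN = max(len(key) for key in GAZETTEER)


def gazetteer_tag(tokens: List[str]) -> List[Tuple[str, str]]:
    """Tag the longest gazetteer match starting at each position."""
    found = []
    i = 0
    while i < len(tokens):
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            key = tuple(token.lower() for token in tokens[i:i + n])
            if key in GAZETTEER:
                found.append((" ".join(tokens[i:i + n]), GAZETTEER[key]))
                i += n
                break
        else:
            i += 1  # no entry starts here; move on to the next token
    return found


print(gazetteer_tag("We stayed at Taj Hotel near Juhu Beach".split()))
# The person reading of "Washington" is lost because context is ignored.
print(gazetteer_tag("Washington was the first president".split()))
# A spelling variant that is not in the dictionary is simply missed.
print(gazetteer_tag("She lives in Noyda".split()))
```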

Rule Based Approach
•Rule Based System
•Needs more rules to tag all kinds of NE
•Advantages:
•Rich and expressive rules
•Good results
•Disadvantages:
•Requires huge experience and grammatical knowledge
•Experts to craft rules are expensive
•Highly domain specific (not portable to a new domain)

General difficulties
“Italy's business world was rocked by the announcement last
Thursday that Mr. Verdi would leave his job as vice-president of
Music Masters of Milan, Inc. to become operations director of Arthur
Andersen".
• Capitalization useless for first word
• The possessive 's is not part of the name "Italy"
• Date is "last Thursday" not "Thursday"
• Milan is location, not organization
• Arthur Andersen is organization, not person

Rules success and failure
Title Capitalized_Word → Title Person_Name
Correct: Mr. Jones
Incorrect: Mrs. Field's Cookies (corporation)
Month_name number_less_than_32 → Date
Correct: February 28
Incorrect: Long March 3 (a Chinese Rocket)
From Date to Date → Date
Correct: from August 3 to August 9
Incorrect: I moved my trip from April to June (two separate dates)
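The three rules can be written as regular expressions, which makes both the hits and the false positives easy to reproduce. The patterns below are one possible encoding chosen for illustration (the title list and the names `TITLE_RULE`, `DATE_RULE`, `RANGE_RULE` are assumptions), not the rule syntax of any particular system.

```python
import re

MONTHS = ("January|February|March|April|May|June|July|August|"
          "September|October|November|December")

# Title Capitalized_Word -> person name
TITLE_RULE = re.compile(r"\b(?:Mr|Mrs|Ms|Dr)\.\s+[A-Z][a-z]+")
# Month_name number_less_than_32 -> date
DATE_RULE = re.compile(rf"\b(?:{MONTHS})\s+(?:3[01]|[12]?\d)\b")
# from Date to Date -> one date range
RANGE_RULE = re.compile(
    rf"\bfrom\s+(?:{MONTHS})(?:\s+\d{{1,2}})?\s+to\s+(?:{MONTHS})(?:\s+\d{{1,2}})?")


def first_match(rule: "re.Pattern[str]", text: str) -> str:
    """Return the first substring matched by the rule, or '' if none."""
    match = rule.search(text)
    return match.group(0) if match else ""


# Intended matches
print(first_match(TITLE_RULE, "Mr. Jones arrived."))
print(first_match(DATE_RULE, "The meeting is on February 28."))
print(first_match(RANGE_RULE, "Closed from August 3 to August 9."))

# False positives: the rules fire although the reading is wrong
print(first_match(TITLE_RULE, "Mrs. Field's Cookies opened a new store."))
print(first_match(DATE_RULE, "Long March 3 was launched."))
print(first_match(RANGE_RULE, "I moved my trip from April to June."))
```

Each false positive could be patched with a further rule or exception list, which is how rule sets grow large and become tied to one domain.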

Statistical based approach
•Need to identify features
•Feature selection has to be correct for all types of NE
•Development of Tagged Corpus
•The Corpus should contain all types of tags in appropriate number
•Domain based corpus has to be generated.
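To make "identifying features" concrete, the sketch below computes a small feature dictionary for one token. The particular features (word shape, capitalization, suffix, neighbouring words) are common choices assumed here for illustration; the slides do not prescribe a feature set, and the function names are hypothetical.

```python
from typing import Dict, List


def word_shape(token: str) -> str:
    """Map letters and digits to a coarse shape, e.g. 'Noida' -> 'Xxxxx'."""
    shape = []
    for ch in token:
        if ch.isupper():
            shape.append("X")
        elif ch.islower():
            shape.append("x")
        elif ch.isdigit():
            shape.append("d")
        else:
            shape.append(ch)  # keep punctuation as it is
    return "".join(shape)


def token_features(tokens: List[str], i: int) -> Dict[str, object]:
    """Features for tokens[i], including its immediate left and right context."""
    token = tokens[i]
    return {
        "lower": token.lower(),
        "shape": word_shape(token),
        "is_capitalized": token[:1].isupper(),
        "is_first": i == 0,
        "has_digit": any(ch.isdigit() for ch in token),
        "suffix3": token[-3:].lower(),
        "prev": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }


sentence = ["Sita", "works", "at", "HCL"]
print(token_features(sentence, 3))
```

For languages without capitalization, the shape and capitalization features carry no signal, so the context and suffix features do most of the work.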

Automated approaches
Address drawbacks of hand-coded system
Automated training
• Human-annotated (with desired output
standards) training data
• Annotation requires less effort and expertise
than hand-coding rules
• Annotation accuracy
• Two annotators for checking, third annotator to resolve disputes
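The two-annotators-plus-adjudicator procedure can be sketched as follows. This is a simplified illustration that compares one label per token; the `resolve` callback stands in for the third annotator, the "O" label for non-entities is an assumed convention, and both function names are hypothetical.

```python
from typing import Callable, List


def agreement_rate(first: List[str], second: List[str]) -> float:
    """Fraction of tokens on which the two annotators chose the same label."""
    if len(first) != len(second) or not first:
        raise ValueError("label sequences must be non-empty and of equal length")
    return sum(a == b for a, b in zip(first, second)) / len(first)


def adjudicate(first: List[str], second: List[str],
               resolve: Callable[[int, str, str], str]) -> List[str]:
    """Keep labels both annotators agree on; ask the third annotator otherwise."""
    if len(first) != len(second):
        raise ValueError("label sequences must have equal length")
    return [a if a == b else resolve(i, a, b)
            for i, (a, b) in enumerate(zip(first, second))]


# Tokens: "Washington visited Tata"
annotator_1 = ["PERSON", "O", "ORGANIZATION"]
annotator_2 = ["LOCATION", "O", "ORGANIZATION"]

print(agreement_rate(annotator_1, annotator_2))
# Here the third annotator is simulated by always choosing PERSON for disputes.
print(adjudicate(annotator_1, annotator_2, lambda i, a, b: "PERSON"))
```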
Tags