Introduction to Natural
Language Processing
Lecture 1 – NLP (Elective)
Tanveer J Siddiqui
J. K. Institute of Applied Physics
University of Allahabad
Objective of NLP?
To build computational models of NL for its
analysis and generation.
Motivations:
Technological
Cognitive and Linguistic
tjs
Natural language processing originated from
machine translation research.
NLP vs. NLU
NLU involves only the interpretation of language, whereas
natural language processing includes both
understanding (interpretation) and generation
(production).
Tools
Grammar formalism
Algorithms and data structures
Formalisms for representing world knowledge
Results inherited from AI, CS, linguistics, logic
and philosophy
Theoretical linguists are interested in identifying
rules that capture linguistic generalizations.
Psycholinguists are interested in producing
theories that explain how humans produce and
comprehend natural language.
Computational linguistics (CL) studies language from a computational point of
view.
-It deals with the application of linguistic theories and
computational techniques to natural language
processing.
Computational models: ‘knowledge-driven’ and ‘data-driven’
Goal of a Language?
Serves a communicative function
NLP focuses on the study of language as a
means of communication
…
Communication requires a common language
& shared knowledge about the domain in
question.
Information Transfer
1. The speaker wants to convey some
information.
2. Decide what?
3. Decide how to code it in language?
-The utterance is the only thing actually received by the
hearer, and from it she gets the information
The hearer must extract it
How? By decoding
We can analyze various phenomena in
language from the viewpoint of how they
code information
word order, case endings, ...
conflict -
Information-based approach
provides a natural connection between
-syntax
-semantics &
-pragmatics
Provides theories of communication at different
levels and integrates them to give a general theory
of communication
+ knowledge representation (KR) & its use
Problems in coding
The gender of the speaker is not coded in the
pronoun ‘I’ or the verb
Yet the hearer is able to decode it
There are several sources of knowledge that are
used in decoding the information from an
utterance.
Sources of Knowledge
Language Knowledge
Grammar
Lexicon
Pragmatics & Discourse
Background Knowledge
General World Knowledge (Common Sense)
Domain Specific
Context
Culture
…
Listener model…
Other factors?
Language does try to maintain regularity
across constructions for
ease of acquisition
ease of coding (or decoding)
Where Does the Grammar Fit In?
System of rules that relates information to its
coding in language
(There is a computational requirement)
Syntax?
When the system of rules relates information to
coding devices at the language level and not at
the world knowledge level, it is called syntax.
However, world knowledge has a strong
influence on coding
How Does World Knowledge Influence
Coding?
1. It influences fundamental coding conventions
2. It also affects the coding being used
This blurs the boundary between syntax and semantics
The separation is made for ease of processing &
grammar writing.
Syntax uses language coding devices
Semantics –
Anaphora…
Syntax will not be studied to identify an innate
autonomous level, but to relate it to semantics
& world knowledge to accomplish the overall
task of communication of information.
Natural language processing concerns the
development of computational models of aspects
of human language processing such as
-Reading and interpreting a textbook
-Writing a letter
-Translating a document
-Searching for useful information
NLP is Interdisciplinary
Theoretical Linguistics
Computational Linguistics
Artificial Intelligence
This list is not exhaustive.
Theoretical Linguistics
Typical Questions
What is language?
What is knowledge of language?
How can it be finitely characterized?
What linguistic forms are there?
How do linguistic forms constrain meaning?
How is knowledge of language acquired given
limited exposure?
Formal language theory
(Syntactic Structures, Chomsky, 1957)
Theoretical Linguistics: Tools
and Methods
Empirical studies (study of frequencies,
conditional probabilities, etc.)
Formal Language Theory – provides usable
definitions of grammatical knowledge
Transformational Grammar – for handling
identity of meaning between non-identical
sentences
Transformational Grammar
Chomsky’s Problem: Linguists wanted to explain how
the sentences
Pooja plays veena.
Veena is played by Pooja.
have the same meaning, despite having different surface
structures (the roles of subject and object are inverted).
Chomsky’s Answer –
Both sentences are generated from the
same “deep structure”, in which the “deep subject”
is Pooja and the “deep object” is veena for both
sentences.
These considerations led to a model of NL grammar
that employs two levels of syntactic representation.
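The two-level idea above can be illustrated with a small Python sketch. It is a toy under simplifying assumptions, not Chomsky's formalism: the DeepStructure class, the to_active and to_passive functions, and the one-entry VERB_FORMS table are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeepStructure:
    """Toy deep structure: who does what to whom (hypothetical representation)."""
    subject: str  # deep subject (the doer)
    verb: str     # verb in base form
    obj: str      # deep object (the thing acted upon)


# Tiny hand-written table of verb forms; an assumption for this example only.
VERB_FORMS = {
    "play": {"present_3sg": "plays", "past_participle": "played"},
}


def to_active(ds: DeepStructure) -> str:
    """Surface form that keeps the deep subject in subject position."""
    verb = VERB_FORMS[ds.verb]["present_3sg"]
    return f"{ds.subject} {verb} {ds.obj}."


def to_passive(ds: DeepStructure) -> str:
    """Surface form that moves the deep object into subject position."""
    participle = VERB_FORMS[ds.verb]["past_participle"]
    surface_subject = ds.obj[0].upper() + ds.obj[1:]
    return f"{surface_subject} is {participle} by {ds.subject}."


ds = DeepStructure(subject="Pooja", verb="play", obj="veena")
print(to_active(ds))   # Pooja plays veena.
print(to_passive(ds))  # Veena is played by Pooja.
```

Both printed sentences come from the same DeepStructure value, which is the sense in which two surface structures share one deep structure.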
Deep and Surface Structure
[Figure: surface structures of “Pooja plays veena” and “Veena is played by Pooja”]
[Figure: deep structure of “Pooja plays veena”]
Computational Linguistics
How can linguistic theory be made concrete
enough to test?
How can we represent grammatical and
lexical knowledge efficiently ?
Given a grammar and a lexicon, how is the
structure of the sentence actually identified?
What are the properties of particular grammar
formalisms?
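One possible answer to the question of how sentence structure is identified from a grammar and a lexicon is bottom-up chart parsing. The Python sketch below uses a CYK-style parser. The three-word LEXICON and two-rule GRAMMAR are hypothetical and deliberately tiny, and each word is assumed to have exactly one category.

```python
from typing import Dict, List, Tuple

# Hypothetical toy grammar in Chomsky normal form:
# (left child, right child) -> parent category.
GRAMMAR: Dict[Tuple[str, str], str] = {
    ("NP", "VP"): "S",
    ("V", "NP"): "VP",
}

# Hypothetical toy lexicon: lowercase word -> category (one category per word).
LEXICON: Dict[str, str] = {
    "pooja": "NP",
    "veena": "NP",
    "plays": "V",
}


def parse(words: List[str]):
    """CYK-style bottom-up parse; returns a nested-tuple tree for S, or None."""
    n = len(words)
    # chart[i][j] maps a category to one tree covering words[i:j].
    chart = [[{} for _ in range(n + 1)] for _ in range(n + 1)]
    for i, word in enumerate(words):
        category = LEXICON.get(word.lower())
        if category is None:
            return None  # word is not in the lexicon
        chart[i][i + 1][category] = (category, word)
    for span in range(2, n + 1):
        for start in range(0, n - span + 1):
            end = start + span
            for mid in range(start + 1, end):
                for left_cat, left_tree in chart[start][mid].items():
                    for right_cat, right_tree in chart[mid][end].items():
                        parent = GRAMMAR.get((left_cat, right_cat))
                        if parent is not None:
                            chart[start][end][parent] = (parent, left_tree, right_tree)
    return chart[0][n].get("S")


print(parse("Pooja plays veena".split()))
print(parse("plays Pooja veena".split()))  # None: no S covers the whole input
```

The chart stores every constituent found over every span, so the parser never re-analyzes the same substring; this is the main reason chart methods are preferred over naive backtracking.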
Tools and Methods
Analysis and Generation algorithms
Grammar Formalism
Artificial Intelligence
What is the role of language in an intelligent agent?
How does language function as a communicative
activity for shared problem solving?
Tools and Methods
Knowledge Representation
Formal Reasoning
Agent Technology
Examples of NLP Applications
Text-based applications
NL understanding
Dialogue systems
Multimodal applications
Text-based Applications
Processing of written texts, such as books,
newspapers, reports, etc.
-Finding appropriate documents on certain topics
from a text database
-Extracting information from messages, articles,
web pages, etc.
-Translating documents from one language to
another
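As an illustration of the first task, finding documents on a topic, here is a minimal Python sketch of keyword retrieval over an inverted index. The function names build_index and search and the three sample documents are hypothetical; real systems add stemming, ranking and much more.

```python
from collections import defaultdict
from typing import Dict, List, Set


def build_index(documents: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map each lowercase word to the set of document ids that contain it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(index: Dict[str, Set[str]], query: str) -> List[str]:
    """Return ids of documents containing every query word, sorted by id."""
    words = query.lower().split()
    if not words:
        return []
    matches = set(index.get(words[0], set()))
    for word in words[1:]:
        matches &= index.get(word, set())
    return sorted(matches)


# Hypothetical sample collection.
documents = {
    "d1": "machine translation of news reports",
    "d2": "weather reports for farmers",
    "d3": "statistical machine translation",
}
index = build_index(documents)
print(search(index, "machine translation"))  # ['d1', 'd3']
print(search(index, "reports"))              # ['d1', 'd2']
```

This matches words only as literal strings, which shows why deeper language analysis is needed once a query and a document express the same topic in different words.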
NL Understanding
Some applications require a deeper level
of analysis
The system must extract enough
information to determine whether the
articles meet the criterion defined by the
query.
How much understanding is necessary to
achieve the purpose?
Dialogue-based Applications
Dialogue-based applications involve man-machine
communication
-NL database query system
-Automated customer service
Difference from text-based applications
-Language used is less formal
-Use of acknowledgments, clarification
sub-dialogues
Multimodal Applications
Involve two or more modalities of
communication
-Text
-Speech
-Gesture
-Image
Text-to-speech, speech-to-text, multimodal
document generation, spoken translation
systems and spoken dialogue systems.
Successful NLP depends on…
Explicit criteria of success
Putting limits on the need for outside knowledge
and experience
Availability of cheap computing power – no longer
a problem
Encoding knowledge of how language works –
still a long way … to go
Some NLP systems
Generating weather reports
Babel Fish translation system – AV
Eurotra (MT for European languages), MU for
Japanese and English, Anusarak among
Indian languages
ELIZA (Weizenbaum, 1966) – ELIZA is one of
the earliest natural language understanding
programs. It uses syntactic patterns to mimic
human-like conversation with the user. Here is a
sample conversation:
Eliza: Hello. I am ELIZA. How can I help you?
You: I am feeling a little bit sleepy.
Eliza: How long have you been feeling a little bit
sleepy?
You: for almost half an hour
Eliza: Please go on.
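The pattern-based behaviour in this conversation can be approximated with a few regular expressions. The Python sketch below is not Weizenbaum's original script: the RULES list, the DEFAULT_REPLY string and the respond function are hypothetical and cover only this exchange.

```python
import re

# Hypothetical pattern/response pairs; the real ELIZA script was much larger.
RULES = [
    (re.compile(r"^i am feeling (.+?)[.!]?$", re.IGNORECASE),
     "How long have you been feeling {0}?"),
    (re.compile(r"^i am (.+?)[.!]?$", re.IGNORECASE),
     "Why do you say you are {0}?"),
]

# Fallback used when no pattern matches the input.
DEFAULT_REPLY = "Please go on."


def respond(utterance: str) -> str:
    """Return the reply of the first matching rule, echoing the captured text."""
    text = utterance.strip()
    for pattern, template in RULES:
        match = pattern.match(text)
        if match:
            return template.format(match.group(1))
    return DEFAULT_REPLY


print(respond("I am feeling a little bit sleepy."))
print(respond("for almost half an hour"))
```

The program never represents what "sleepy" means; it only copies the matched words into a canned template, which is why such systems break down quickly outside their patterns.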
History (J & M)
1940s and 1950s
Automata and probabilistic (information-theoretic)
models
Chomsky first defined a finite-state
language as a language generated by
a finite-state grammar. These early models
led to the field of formal language theory.
The second foundational insight was the
development of probabilistic algorithms
(inspired by Shannon’s work)
History : 1957-1970
Two camps
-Symbolic
Two lines of research:
1. Inspired by the work of Chomsky & others
on formal language theory & the work of
linguists & computer scientists on parsing
2. From AI (focus on reasoning and logic)
-Stochastic: statistics
Bayesian system for text recognition
First on-line corpora: the Brown Corpus
1970-1983
Four Paradigms:
Stochastic Paradigm
Logic-based paradigm
Natural Language Understanding
(SHRDLU-NLU, LUNAR-Q/A)
Discourse modeling
1983-1993
Return of
-finite-state models
-and empiricism,
which lost popularity in the late 1950s
and early 1960s
Merging of fields
-Probabilistic and data-driven
models had become standard
How Does a Child Learn Language?
All children are born with the ability to learn language (Noam
Chomsky). He believed that all babies possess a "language
acquisition device." Children are born with the ability to
produce speech simply by hearing words and sentences
spoken by adults around them (Vander Zanden).
If children merely repeated what they heard, they would not be able to create original,
unique sentences of their own. Instead, children listen to
adults speak and then form a rule system that they then apply
in other situations.
Innateness Hypothesis
The innateness hypothesis holds that, to a large
extent, the organization of human language
(i.e. the "grammar") is innate, that is, inborn.