Amity NLP Notes

About This Presentation

Natural Language Processing Notes aggregated by Ashutosh Agrahari.
These are module-wise notes corresponding to the course conducted at Amity University, India.



Introduction

Natural Language Processing (NLP) refers to the AI method of communicating with an intelligent system
using a natural language such as English.
Processing of natural language is required when you want an intelligent system like a robot to perform
as per your instructions, when you want to hear a decision from a dialogue-based clinical expert system,
and so on. The field of NLP involves making computers perform useful tasks with the natural languages
humans use. The input and output of an NLP system can be −
• Speech
• Written Text
Components of NLP
There are two components of NLP −
Natural Language Understanding (NLU)
Understanding involves the following tasks −
• Mapping the given input in natural language into useful representations.
• Analyzing different aspects of the language.
Natural Language Generation (NLG)
It is the process of producing meaningful phrases and sentences in the form of natural language from
some internal representation.
It involves −
• Text planning − It includes retrieving the relevant content from the knowledge base.
• Sentence planning − It includes choosing the required words, forming meaningful phrases, and setting
the tone of the sentence.
• Text Realization − It is mapping the sentence plan into sentence structure.
NLU is harder than NLG.
Difficulties in NLU
NL has an extremely rich form and structure.

It is very ambiguous. There can be different levels of ambiguity −
• Lexical ambiguity − It occurs at a very primitive level, such as the word level.
• For example, should the word “board” be treated as a noun or a verb?
• Syntax-level ambiguity − A sentence can be parsed in different ways.
• For example, “He lifted the beetle with red cap.” − Did he use a cap to lift the beetle, or did he
lift a beetle that had a red cap?
• Referential ambiguity − Referring to something using pronouns. For example, Rima went to
Gauri. She said, “I am tired.” − Exactly who is tired?
• One input can have different meanings.
• Many inputs can mean the same thing.
NLP Terminology
• Phonology − It is the study of organizing sounds systematically.
• Morphology − It is the study of the construction of words from primitive meaningful units.
• Morpheme − It is the primitive unit of meaning in a language.
• Syntax − It refers to arranging words to make a sentence. It also involves determining the
structural role of words in the sentence and in phrases.
• Semantics − It is concerned with the meaning of words and how to combine words into
meaningful phrases and sentences.
• Pragmatics − It deals with using and understanding sentences in different situations and how
the interpretation of the sentence is affected.
• Discourse − It deals with how the immediately preceding sentence can affect the interpretation
of the next sentence.
• World Knowledge − It includes the general knowledge about the world.
Steps in NLP
There are five general steps −
• Lexical Analysis − It involves identifying and analyzing the structure of words. The lexicon of a
language means the collection of words and phrases in that language. Lexical analysis divides
the whole chunk of text into paragraphs, sentences, and words.
• Syntactic Analysis (Parsing) − It involves analysis of the words in the sentence for grammar and
arranging the words in a manner that shows the relationships among them. A sentence such
as “The school goes to boy” is rejected by an English syntactic analyzer.

• Semantic Analysis − It draws the exact meaning, or the dictionary meaning, from the text. The
text is checked for meaningfulness. This is done by mapping syntactic structures onto objects in the
task domain. The semantic analyzer disregards phrases such as “hot ice-cream”.
• Discourse Integration − The meaning of any sentence depends upon the meaning of the
sentence just before it. In addition, it also influences the meaning of the immediately succeeding
sentence.
• Pragmatic Analysis − During this step, what was said is re-interpreted to determine what it actually
meant. It involves deriving those aspects of language which require real-world knowledge.
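
As a concrete illustration of the first two steps, here is a minimal sketch using the NLTK library
(NLTK is an assumed tool here, not part of the original notes; it requires pip install nltk plus the
data downloads shown):

# Lexical and syntactic analysis with NLTK (illustrative sketch).
import nltk
nltk.download("punkt")                        # tokenizer models
nltk.download("averaged_perceptron_tagger")   # POS tagger model

text = "The boy goes to school. He likes it."

# Lexical analysis: divide the chunk of text into sentences and words.
sentences = nltk.sent_tokenize(text)
words = [nltk.word_tokenize(s) for s in sentences]

# First layer of syntactic analysis: tag each word with its part of speech.
tagged = [nltk.pos_tag(ws) for ws in words]
print(tagged[0])   # [('The', 'DT'), ('boy', 'NN'), ('goes', 'VBZ'), ...]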

Module 1: Sound

Place of Articulation
The place of articulation refers to “the point in the vocal tract where the speech organs restrict the
passage of air in some way so producing distinctive speech sounds” (Finch, 1999). As with manner of
articulation, places of articulation are more frequently used to describe consonants than vowels. The
following are the principal terms used in linguistics to describe these:




Bilabial. “Sounds formed by both lips coming
together” (Finch, 1999). Examples include /b/,
/p/ and /m/.



Labio-dental. “Sounds formed by the bottom
lip touching the upper teeth” (Finch,
1999). Examples include /v/ and /f/.


Dental. “Sounds formed by the tongue
touching the upper teeth” (Finch, 1999). These
are not common in English, although they can
sound like /t/ or /d/. If you imagine saying
‘Barcelona’ with a heavy Spanish accent, you
might hear it.


Alveolar. “Sounds formed by the tongue
coming into contact with the hard, or alveolar,
ridge immediately behind the upper teeth”
(Finch, 1999). Alveolar sounds are common in
English plosives such as /t/ and /d/, in the
nasal /n/, and in fricatives such as /z/.

Post-alveolar. “Sounds formed by the tongue
curled behind the alveolar ridge” (Finch,
1999). Examples include the /ʃ/ and /ʒ/, or
the ‘sh’ sounds in words like ‘ship’, or the ‘s’
sound in words like ‘vision’.


Palato-alveolar. “Sounds formed by the
tongue in contact with both the roof of the
mouth, or hard palate, and the alveolar ridge”
(Finch, 1999). Examples include the /tʃ/ and
/dʒ/ sounds in ‘church’ and ‘judge’.


Palatal. “Sounds formed by the middle of the
tongue up against the hard palate” (Finch,
1999). The /j/ sound is the only consistent
example of a palatal sound in English. This
sound forms the ‘y’ in words like ‘yes’ and
‘yellow’.


Velar. “Sounds formed by the back of the
tongue against the soft palate, or velum”
(Finch, 1999). Think of the /k/ in ‘kick’, or the
/g/ in ‘go’. The ‘ng’ sound / ŋ/ in words like
‘sing’ and ‘tongue’ is also a velar sound.


Interdental. Produced by the tip of the tongue
protruding between the upper and lower
teeth. Interdental sounds include the ‘th’
sound /θ/ in words like ‘thing’ and ‘author’, or
the /ð/ in words like ‘this’ and ‘other’.

Uvular. Sounds formed by the root of the
tongue being raised against the velum. The ‘r’
in French (try saying the word ‘Paris’ with a
broad French accent), or the Arabic /q/ or /G/
are uvular sounds. English doesn’t have a
uvular sound.


Retroflex. There are other places of
articulation which are not really used in
English, and the retroflex is one of them. Here,
the tongue is curled back on itself to create a
rolling /r/ sound against the alveolar ridge.

Glottal. “Sounds formed from the space
between the vocal folds, or glottis” (Finch,
1999). There is no picture here because it is
rather difficult to illustrate. The glottal sound
/ʔ/ can be heard in the affirmative expression
‘uh-huh’, and in certain estuary or cockney
accents it is used to replace the /t/ sound in
words like ‘better’.


Manner of Articulation
So far we have seen that sound can be shaped as it passes through the vocal cords, and as the air from
the lungs passes the pharyngeal cavity, the nasal cavity, or the oral and labial cavities. The sound
variations created by this vocal apparatus are known as the manner of articulation. In other words,
the manner of articulation refers to the ways in which sound is altered by manipulating the flow of
the airstream from the lungs. There are five principal types of manner for consonant sounds, which are
here adapted from Finch's Linguistic Terms and Concepts (1999):

Plosives
“Sounds in whose articulation the airstream is stopped by a brief closure of two speech organs and then
released in a quick burst” (Finch, 1999). Examples of plosives in English are /p/, /b/, /t/, /d/, /k/, /g/. You
can see a useful diagram of the plosive sound formation here:

http://www.ic.arizona.edu/~lsp/Phonetics/ConsonantsII/Phonetics3b.html.

Fricatives
“Sounds in whose articulation two speech organs narrow the airstream, causing friction to occur as it
passes through” (Finch, 1999). If you think of the sound /f/ or /s/, you might be able to hear how the
narrowing of the airstream by the lips being closed towards the upper teeth in the case of /f/, or by the
tongue being raised against the alveolar ridge in the case of /s/, creates a 'hissing' tone. This 'hissing' is
caused by the 'friction' of the air – hence 'fricatives'. You can see a useful diagram of the fricative sound
formation here:

http://www.ic.arizona.edu/~lsp/Phonetics/ConsonantsII/Phonetics3c.html

Affricates
“Sounds in whose articulation the airstream is stopped as for a plosive and then released slowly and
partially with friction” (Finch, 1999). There are two affricate phonemes in English: /tʃ/ and /dʒ/. If you
think of the word 'church', notice that you begin the sound with the plosive /t/, but that this is
immediately followed by a fricative /sh/ sound. In the case of /dʒ/, think of the word 'judge'. Say the
letter 'd' and the letter 'j' alternately, one after the other. Do you notice that they both begin with the
same formation of the tongue? The difference is that the /dʒ/ sound in the 'j' is extended with a
fricative sound to sound out the '-dge' in the word 'judge'.

Nasals
“Sounds in whose articulation the airstream is diverted through the nasal cavity as a consequence of the
passage through the oral cavity being blocked by the lowering of the soft palate, or velum” (Finch,
1999). Try saying the following out loud to yourself: 'tell me a story'. Can you notice how dramatically
the sound changes when you come to the /m/ of 'me'? The airflow, which is passing through the oral
cavity for the rest of this phrase, is at this point diverted by the lowering of the velum into the nasal
cavity. You can see a useful diagram of the nasal sound formation here:


http://www.ic.arizona.edu/~lsp/Phonetics/ConsonantsII/Phonetics3e.html

Approximants
“Sounds in whose articulation two speech organs approach each other and air flows continuously
between them without friction” (Finch, 1999). If you think of the /l/ sound, for example, you can sense
how the tongue tip touches the alveolar ridge in order to allow the air to flow laterally around the

tongue, but without the 'hissing' sound of fricatives (this is sometimes called a lateral or liquid
approximant). For consonants like /w/, the lips approach each other at the beginning of the sound and
then 'glide' away from each other towards the end (these sounds are sometimes referred to as glides).
All of this is, again, achieved without the 'hissing' sound of a fricative.

Vowel Sounds
In the case of vowel sounds, manner of articulation is “less precise than for consonant sounds” (Finch,
1999), largely because consonants restrict the airstream a good deal more than vowels do. There are
two main ways in which manner of articulation in vowels is shaped:

Tongue height. “This distinguishes sounds in relation to the height of that part of the tongue
which is closest to the palate. When the tongue is high in the mouth, vowels are described as
close, and when low, as open. Other reference points are half-close and half-open” (Finch, 1999).
For example, consider the vowel sound /iː/ (as in 'fleece', 'sea' or 'machine'). Notice with this
vowel sound that the body of the tongue is raised against the hard palate. With the vowel sound
of /ɒ/ though (as in 'lot', 'odd' or 'wash'), the tongue is low in the oral cavity.

Lip posture. “Vowels are produced with the lips in a rounded or spread posture. There are
degrees of rounding but it is conventional to classify vowels as being either rounded or spread”
(Finch, 1999). Let us again consider the vowel sounds /iː/ and /ɒ/. Notice that when you say the
word 'fleece', the lips are spread wide when pronouncing the vowel sound, whereas in a word
like 'lot' the lips are rounded for /ɒ/.

Word Boundary Detection
One of the major issues for a continuous speech recognition system is to indicate the word boundaries
in continuous speech. Continuous speech is a sequence of sounds and words spoken continuously with
very few pauses, which makes it difficult to identify lexical items such as words. The problem of word
boundary detection arises in the context of human communication with machines. The problem can be
described as: “Given continuous speech, word boundaries are to be placed in the utterance”. This
problem is relevant in the context of speech input to a system where segments of an utterance are
required for further processing. Normally during conversation, the speaker does not indicate word
boundaries in speech. To convert speech into the target output (text/speech), one needs to identify
the word boundaries.
Issues in WBD
• Almost all of the word boundary hypothesisation techniques developed use some
specific language features. Hence, for Hindi speech also, one may conduct a study
addressing the problem of word boundary hypothesisation. The alternative is to
develop a language-independent word boundary detection technique, and this
problem has been discussed later in the thesis.
• For a better word boundary detection technique, a speech database of sufficient volume
is required. The database should contain a multilingual speech corpus, so that a
language-independent WBD system can be developed. This issue has been discussed in the
thesis. The proposed WBD model has been tested on English and Hindi speech
corpora.
• Several studies have established that pitch patterns can be used as a prosodic clue to
hypothesize word boundaries in a better way, so this parameter should be used
for Hindi as well. Experimental work for Hindi using pitch patterns has been reported
and evaluated in this research.
• For a better signature of word boundaries, the time interval between words can also play an
important role. The threshold value of the time interval should be chosen sufficiently
longer than the longest speech sound that contains silence. This issue has been
investigated and adopted in our methodology.

Rule Based Approach to WBD

A speech waveform contains three kinds of signal, namely quasi-periodic, quasi-random, and silence. A
pitch detection algorithm (PDA) is required for the quasi-periodic portion of the speech waveform
[Mandal et al. 2006]. Many PDAs based on short-term energy and zero-crossing rate have been
proposed, but these features do not work well in a noisy ambiance. Considering this, an algorithm has
been proposed that also works in noisy situations. For verifying the effectiveness of the proposed
methodology, utterances have been recorded in noisy conditions. The algorithm is based on only two
simple time-domain parameters: intensity and pitch.
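
The following is a minimal sketch of the intensity half of such a rule-based detector: frames whose
short-term energy stays below a threshold for longer than a minimum pause are treated as inter-word
silence. The threshold values are illustrative placeholders, not values from the cited work.

import numpy as np

def word_boundaries(signal, sr, frame_ms=25, energy_thresh=0.01, min_pause=0.12):
    """Return sample indices of hypothesised word boundaries (toy sketch)."""
    frame_len = int(sr * frame_ms / 1000)
    n_frames = len(signal) // frame_len
    frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)
    energy = (frames ** 2).mean(axis=1)    # short-term intensity per frame
    silent = energy < energy_thresh        # candidate silence frames

    boundaries, run_start = [], None
    for i, is_silent in enumerate(silent):
        if is_silent and run_start is None:
            run_start = i
        elif not is_silent and run_start is not None:
            # Keep only pauses longer than the longest silence inside a sound.
            if (i - run_start) * frame_ms / 1000 >= min_pause:
                boundaries.append(((run_start + i) // 2) * frame_len)
            run_start = None
    return boundaries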

Argmax-based Computations
◼ Problem definition: Given a sequence of speech signals, identify the words.
◼ 2 steps:
◼ Segmentation (word boundary detection)
◼ Identify the word
◼ Isolated Word Recognition: Identify W given SS (speech signal)

Ŵ = argmax_W P(W | SS)

Identifying the word W given the speech signal SS:

Ŵ = argmax_W P(W | SS) = argmax_W P(W) · P(SS | W)

◼ P(SS|W) = likelihood, called the “phonological model” → intuitively more tractable!
◼ P(W) = prior probability, called the “language model”

P(W) = (# times W appears in the corpus) / (# words in the corpus)


HMM and Speech Recognition
Before the Deep Learning (DL) era of speech recognition, HMMs and GMMs were two must-learn
technologies for speech recognition. Now there are hybrid systems that combine HMMs with deep
learning, and there are systems that are HMM-free.
The primary objective of speech recognition is to build a statistical model to infer the word
sequence W (say “cat sits on a mat”) from a sequence of acoustic feature vectors X.
One approach looks at all possible sequences of words (up to a limited maximum length) and finds the
one that matches the input acoustic features the best.
This model depends on building a language model P(W), a pronunciation lexicon model, and an acoustic
model P(X|W) (a generative model).
An HMM is composed of hidden variables and observables. In the diagram on the original slide, the top
nodes represent the phones and the bottom nodes represent the corresponding observables (the audio
features); the horizontal arrows show the transitions in the phone sequence for the true label
“she just …”.
Once an HMM has been learned, we can use the forward algorithm to calculate the likelihood of our
observations. The objective is to sum the probabilities of the observations over all possible hidden
state sequences Q:

P(X) = Σ_Q P(Q) · P(X | Q)
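
A minimal sketch of the forward algorithm with toy matrices (not a trained model): it computes the
likelihood of an observation sequence summed over all hidden state sequences.

import numpy as np

pi = np.array([0.6, 0.4])         # initial state (phone) distribution
A = np.array([[0.7, 0.3],         # state transition probabilities
              [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1],    # emission probabilities P(observation | state)
              [0.1, 0.3, 0.6]])
obs = [0, 1, 2]                   # indices of the observed acoustic symbols

alpha = pi * B[:, obs[0]]         # initialise with the first observation
for o in obs[1:]:
    alpha = (alpha @ A) * B[:, o] # propagate forward and weight by the emission

print(alpha.sum())                # P(X), the likelihood of the observations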

Module 2: Word and Words Forms

Morphology of Words
The study of morphology concerns the construction of words from more basic components corresponding
roughly to meaning units. There are two basic ways that new words are formed, traditionally classified as
inflectional forms and derivational forms.
Inflectional forms use the root form of a word and typically add a suffix so that the word appears in the
appropriate form for the sentence. Verbs are the best example of this in English. For example, the
word sigh will take suffixes such as -s, -ing, and -ed to create the verb forms sighs, sighing, and sighed
respectively.
Derivational morphology involves the derivation of a new word from other forms. The new word may
be in a completely different category from its subparts. For example, the noun friend is made into the
adjective friendly by adding the suffix -ly.
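
A small illustration of the inflectional side with NLTK's lemmatizer (an assumed tool, not from the
original notes), which strips inflectional suffixes back to the root form; derivational forms such as
friendly involve a category change and are not undone:

import nltk
nltk.download("wordnet")          # lexicon data needed by the lemmatizer
from nltk.stem import WordNetLemmatizer

wnl = WordNetLemmatizer()
for form in ["sighs", "sighing", "sighed"]:
    print(form, "->", wnl.lemmatize(form, pos="v"))   # all map back to 'sigh'

print(wnl.lemmatize("friendly", pos="a"))             # stays 'friendly'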

Morphology of Indian Languages
Linguistic diversity is the foundation of the cultural and political edifice of India. The 200 languages
enumerated in the Census are a linguistic abstraction of over 1,600 mother tongues reported by the
people, indicating their perception of their linguistic identity and linguistic difference. The linguistic
diversity in India is marked by the fluidity of linguistic boundaries between dialect and language, between
languages around State borders, and between speech forms differentiated on cultural and political
grounds.
Morphemes are of two kinds −
• Stem − e.g., tree, go, fat
• Affixes −
• Prefixes − e.g., post- (postpone)
• Suffixes − e.g., -ed (tossed)

The languages of India historically belong to four major language families, namely Indo-European,
Dravidian, Austro-Asiatic and Sino-Tibetan. Indo-European has the sub-families Indo-Aryan and
Dardic/Kashmiri, Austro-Asiatic has Munda and Mon-Khmer/Khasi, and Sino-Tibetan has Tibeto-Burman
and Thai/Kempti. The Indo-European family, commonly called 'Indo-Aryan', has the largest number of
speakers, followed by Dravidian, Austro-Asiatic (also called 'Munda') and Sino-Tibetan (commonly
called 'Tibeto-Burman').
Refer: Lecture_21.ppt

Problems with Morphological Analysis
• Productivity: The property of a morphological process to give rise to new formations on a systematic
basis. For example, the suffix -able is productive with transitive verbs (read → readable) but not
with nouns (game → gameable).
• False Analysis: Segmenting a word into morphemes it does not actually contain (e.g., analyzing
corner as corn + -er).
• Bound Base Morphemes: Occur only in a particular complex word and do not have an independent
existence.



Finite State Machine Based Morphology
Formal languages are sets of strings, where strings are composed of symbols drawn from a finite alphabet.
Finite-state automata define formal languages without having to enumerate all the strings in the
language.
Two views of FSAs:
• Acceptors that can tell you if a string is in the language
• Generators to produce all and only the strings in the language
The working finite-state snippet for ELIZA (an NLP chatbot of the 1960s) from the original slide is not
reproduced here.
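
In its place, here is a minimal acceptor for a toy language of verb stems with an optional -s suffix
(an illustrative assumption, not the ELIZA network); it answers whether a string of morphemes is in
the language:

# A tiny finite-state acceptor (illustrative sketch).
TRANSITIONS = {
    ("start", "walk"): "stem",
    ("start", "talk"): "stem",
    ("stem", "s"): "inflected",     # 3rd-person / plural suffix
}
ACCEPTING = {"stem", "inflected"}

def accepts(symbols):
    state = "start"
    for sym in symbols:
        state = TRANSITIONS.get((state, sym))
        if state is None:           # no transition defined: reject
            return False
    return state in ACCEPTING

print(accepts(["walk"]))        # True  ('walk')
print(accepts(["walk", "s"]))   # True  ('walks')
print(accepts(["s", "walk"]))   # False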

Automatic Morphology Learning
• Identification of borders between morphemes
• Zellig Harris
• {prefix, suffix} conditional entropy
• bigrams and trigrams with a high probability of forming a morpheme
• Learning of patterns or rules of mapping between pairs of words
• Global approach (top-down)
• Goldsmith, Brent, de Marcken

Goldsmith’s system based on MDL (Minimum Description Length)
• Initial Partition: word -> stem + suffix
• split-all-words
• A good candidate for {stem, suffix} splitting in a word has to be a good candidate in many
other words
• MI (mutual information) strategy
• Faster convergence
• Learning Signatures
• {signatures, stem, suffixes}
• MDL

Shallow Parsing
Shallow parsing (also chunking, "light parsing") is an analysis of a sentence which first identifies
constituent parts of sentences (nouns, verbs, adjectives, etc.) and then links them to higher order units
that have discrete grammatical meanings (noun groups or phrases, verb groups, etc.).
Typically we have a generative grammar that tells us how a sentence is generated from a set of
rules. Parsing is the process of finding a parse tree that is consistent with the grammar rules – in other
words, we want to find the set of grammar rules and their sequence that generated the sentence. A
parse tree not only gives us the POS tags, but also which set of words are related to form phrases and also
the relationship between these phrases.
Shallow parsing is the process of recovering part of this information (the parse tree). POS tagging is like
reading the last layer of the parse tree – only the part-of-speech tags such as verb/noun/adjective
associated with individual words. Chunking, another common technique, gets the POS tags plus which
words group together to form phrases (this is like reading the last two layers of the parse tree).
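
A short sketch of chunking with NLTK's RegexpParser: POS-tag the sentence, then group
determiner/adjective/noun runs into NP chunks (the chunk rule here is a toy assumption):

import nltk   # assumes the punkt and tagger downloads from the earlier sketch

sentence = "The little boy plays a new game"
tags = nltk.pos_tag(nltk.word_tokenize(sentence))

grammar = "NP: {<DT>?<JJ>*<NN.*>+}"   # noun-phrase chunk rule
chunker = nltk.RegexpParser(grammar)
tree = chunker.parse(tags)
tree.pprint()    # prints NP chunks such as (NP The/DT little/JJ boy/NN)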

Named Entities
In any text document, there are particular terms that represent specific entities that are more informative
and have a unique context. These entities are known as named entities, which more specifically refer to
terms that represent real-world objects like people, places, organizations, and so on, which are often
denoted by proper names. A naive approach could be to find these by looking at the noun phrases in text
documents.
Named entity recognition (NER), also known as entity chunking/extraction, is a popular technique used
in information extraction to identify and segment the named entities and classify or categorize them
under various predefined classes.

NER can be performed using four methods −
1. Regular expressions: Regular expressions (RegEx) are a form of finite-state automaton. They are
very helpful in identifying patterns that follow a certain structure. For example, email IDs, phone
numbers, etc. can be identified well using RegEx. However, the downside of this approach is that
one needs to know all the possible exact words and patterns in advance. This is not a learning
approach, but rather a brute-force one.
2. Hidden Markov Model (HMM): This is a sequence modelling algorithm that identifies and learns
the pattern. Although HMM considers the future observations around the entities for learning a
pattern, it assumes that the features are independent of each other. This approach is better than
regular expressions, as we do not need to model the exact set of word(s), but in terms of
performance it is not known to be the best method for entity recognition.
3. MaxEnt Markov Model (MEMM): This is also a sequence modelling algorithm. It does not
assume that features are independent of each other, but it also does not consider future
observations for learning the pattern. In terms of performance, it is not known to be the best
method for identifying entity relationships either.
4. Conditional Random Fields (CRF): This is also a sequence modelling algorithm. It not only
assumes that features are dependent on each other, but also considers future observations
while learning a pattern. This combines the best of both HMM and MEMM. In terms of
performance, it is considered to be the best method for the entity recognition problem.
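
In practice, NER is often run through a pretrained statistical pipeline. A minimal sketch with spaCy
(an assumed tool; requires pip install spacy and python -m spacy download en_core_web_sm; the
sentence is a made-up example):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Ashutosh studied NLP at Amity University in India in 2020.")

for ent in doc.ents:
    # prints each recognised entity with its predicted class
    print(ent.text, ent.label_)   # e.g. 'Amity University' ORG, 'India' GPE, '2020' DATE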

Maximum Entropy Model
Similar to logistic regression, the maximum entropy (MaxEnt) model is a type of log-linear model. The
MaxEnt model is more general than logistic regression: it handles multinomial distributions, whereas
logistic regression is for binary classification.
The maximum entropy principle says to model a given set of data by finding the highest-entropy
distribution that satisfies the constraints of our prior knowledge.
The feature functions of a MaxEnt model are multi-class. For example, given (x, y), a feature
function may return 0, 1, or 2.
The maximum entropy model is a conditional probability model p(y|x) that allows us to predict class labels
given a set of features for a given data point. Inference takes the trained weights, forms linear
combinations of the features, and picks the tag with the highest score, i.e., the highest probability.
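
Since a MaxEnt classifier is equivalent to multinomial logistic regression, here is a quick sketch with
scikit-learn (an assumed tool; the texts and sense labels are toy data):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

texts = ["board the plane", "the board met", "board of directors", "board a bus"]
labels = ["verb", "noun", "noun", "verb"]     # toy sense labels for 'board'

vec = CountVectorizer()
X = vec.fit_transform(texts)                  # bag-of-words feature functions
clf = LogisticRegression().fit(X, labels)     # log-linear model p(y|x)

print(clf.predict(vec.transform(["board the train"])))   # predicted sense label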

Random Fields
The bag-of-words (BoW) approach works well for many text classification problems. This approach
assumes that the presence or absence of word(s) matters more than the sequence of the words. However,
there are problems such as entity recognition and part-of-speech identification where word sequences
matter as much, if not more. Conditional Random Fields (CRFs) come to the rescue here, as they use
word sequences as opposed to just words.
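
For reference, the standard linear-chain CRF formula is

p(y | x) = (1 / Z(x)) · exp( Σ_t Σ_k w_k · f_k(y_{t-1}, y_t, x, t) )

where the f_k are the feature functions, the w_k their weights, and the normalization constant Z(x)
sums the same exponential over all possible label sequences so that the probabilities add up to 1.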

There are 2 components to the CRF formula:
1. Normalization: You may have observed that there are no probabilities on the right side of the
formula, where we have the weights and features. However, the output is expected to be a
probability, and hence there is a need for normalization. The normalization constant Z(x) sums
over all possible state sequences so that the total becomes 1.
2. Weights and Features: This component can be thought of as the logistic regression formula with
weights and the corresponding features. The weights are estimated by maximum likelihood
estimation, and the features are defined by us.

Module 3: Structures

Parsing Algorithms
Parsing in basic terms can be described as breaking a sentence down into its constituent words in
order to find out the grammatical type of each word, or, alternatively, decomposing an input into more
easily processed components – that is, analyzing a sentence for structure, content, and meaning. For
example, consider the sentence “John is playing a game”. After parsing, it will be stated in terms of
its constituents: “John”, “is”, “playing”, “a”, “game”. The basic connection between a sentence and the
grammar it derives from is the parse tree, which describes how the grammar was used to produce the
sentence. For the reconstruction of this connection we need a parsing technique.
Refer to: NLP_Parsing_Algos.pdf

• Top-down Parsing: Using the top-down technique, the parser searches for a parse tree by trying to
build it from the root node S down to the leaves. The algorithm starts by assuming that the input can be
derived from the designated start symbol S. The next step is to find the tops of all the trees which can
start with S: by looking at the grammar rules with S on the left-hand side, all the possible trees are
generated. Top-down parsing is a goal-directed search. Predictive parsing is the solution to the
backtracking problem faced by the top-down strategy; it is characterized by its ability to use at most
the next k tokens (the lookahead) to select which production to apply, making the right decision
without backtracking.
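
A hedged sketch of top-down parsing with NLTK's RecursiveDescentParser and a toy grammar (an
illustrative assumption, not from the lecture):

import nltk

grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> 'John' | Det N
VP -> V NP
Det -> 'a'
N -> 'game'
V -> 'plays'
""")

# The parser expands from the start symbol S down towards the words.
parser = nltk.RecursiveDescentParser(grammar)
for tree in parser.parse("John plays a game".split()):
    print(tree)   # (S (NP John) (VP (V plays) (NP (Det a) (N game))))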

• Bottom-up Parsing: Bottom-up parsing starts with the words of the input and tries to build trees from
the words up, again by applying rules from the grammar one at a time. The parse is successful if the
parser succeeds in building a tree rooted in the start symbol S that covers all of the input. Bottom-up
parsing is a data-directed search [7]. It tries to roll back the production process and reduce the
sentence back to the start symbol S.

• Top-down Parsing with Bottom-up Filtering: The primary control strategy of top-down parsing is
adopted to generate trees, and then constraints from bottom-up parsing are grafted on to filter out the
inconsistent parses. The parsing algorithm starts with a top-down, depth-first, left-to-right strategy
and maintains an agenda of search states, consisting of partial trees along with a pointer to the next
input word in the sentence. The next step is to add the bottom-up filter using the left-corner rule,
which states that the parser should not consider any grammar rule if the current input cannot serve as
the first word along the left edge of the derivation from that rule.

• Statistical Parsing: Statistical parsing is probabilistic parsing which resolves structural ambiguity,
i.e., multiple parse trees for a sentence, by choosing the parse tree with the highest probability value.
The statistical parsing model defines the conditional probability P(T|S) for each candidate parse tree
T of a sentence S. The parser itself is an algorithm which searches for the tree T that maximizes P(T|S).
The statistical parser uses probabilistic context-free grammars (PCFGs), context-free grammars in
which every rule is assigned a probability, to figure out how to (1) find the possible parses, (2) assign
probabilities to them, and (3) pull out the most probable one.
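
A matching statistical sketch: the same toy grammar with rule probabilities attached, parsed with
NLTK's ViterbiParser, which returns the tree T maximising P(T|S):

import nltk

pcfg = nltk.PCFG.fromstring("""
S -> NP VP [1.0]
NP -> 'John' [0.5] | Det N [0.5]
VP -> V NP [1.0]
Det -> 'a' [1.0]
N -> 'game' [1.0]
V -> 'plays' [1.0]
""")

parser = nltk.ViterbiParser(pcfg)
for tree in parser.parse("John plays a game".split()):
    print(tree)   # the most probable parse, with its probability attached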

• Dependency Parsing: The fundamental notion of dependency is based on the idea that the syntactic
structure of a sentence consists of binary asymmetrical relations between the words of the sentence;
dependency parsing provides a syntactic representation that encodes functional relationships between
words. The dependency relation holds between a head and its dependent. Dependency parsing uses the
dependency structure representing head-dependent relations (directed arcs), functional categories
(arc labels) and possibly some structural categories (parts of speech).
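
A small dependency parsing sketch with spaCy (the same assumed en_core_web_sm model as in the NER
sketch): each token is linked to its head by a labelled, directed arc:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("John is playing a game")

for token in doc:
    # dependent <-label- head
    print(f"{token.text:8} <-{token.dep_:8}- {token.head.text}")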

For more refer to Lecture 38, Lecture 39 and Lecture 40.






Module 4: Meaning

Read from: nlp-winter-school-lexical-networks--6jan08.pdf

Coreference Resolution
Coreference resolution is the task of finding all expressions that refer to the same entity in a text. It is an
important step for many higher-level NLP tasks that involve natural language understanding, such as
document summarization, question answering, and information extraction.

Module 5: Applications

Sentiment Analysis
Sentiment analysis (also known as opinion mining or emotion AI) refers to the use of natural language
processing, text analysis, computational linguistics, and biometrics to systematically identify, extract,
quantify, and study affective states and subjective information. Sentiment analysis is widely applied to
voice of the customer materials such as reviews and survey responses, online and social media, and
healthcare materials for applications that range from marketing to customer service to clinical medicine.
Methods
Existing approaches to sentiment analysis can be grouped into three main categories: knowledge-based
techniques, statistical methods, and hybrid approaches.
Knowledge-based techniques classify text by affect categories based on the presence of unambiguous
affect words such as happy, sad, afraid, and bored. Some knowledge bases not only list obvious affect
words, but also assign arbitrary words a probable "affinity" to particular emotions.
Statistical methods leverage elements from machine learning such as latent semantic analysis, support
vector machines, "bag of words", "Pointwise Mutual Information" for Semantic Orientation, and deep
learning. More sophisticated methods try to detect the holder of a sentiment (i.e., the person who
maintains that affective state) and the target (i.e., the entity about which the affect is felt). To mine the
opinion in context and get the feature about which the speaker has opined, the grammatical relationships
of words are used. Grammatical dependency relations are obtained by deep parsing of the text.
Hybrid approaches leverage both machine learning and elements from knowledge representation such as
ontologies and semantic networks in order to detect semantics that are expressed in a subtle manner,
e.g., through the analysis of concepts that do not explicitly convey relevant information, but which are
implicitly linked to other concepts that do so.
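
A small knowledge-based (lexicon-driven) example using NLTK's VADER analyzer, one readily available
affect lexicon (an assumed tool; requires the vader_lexicon download shown):

import nltk
nltk.download("vader_lexicon")
from nltk.sentiment import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()
print(sia.polarity_scores("The movie was happy and wonderful!"))
print(sia.polarity_scores("The service made me sad and bored."))
# Each call returns neg/neu/pos proportions plus a compound score in [-1, 1].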

Textual Entailment
Textual entailment (TE) in natural language processing is a directional relation between text fragments.
The relation holds whenever the truth of one text fragment follows from another text. In the TE
framework, the entailing and entailed texts are termed text (t) and hypothesis (h), respectively. Textual
entailment is not the same as pure logical entailment — it has a more relaxed definition: "t entails h" (t ⇒
h) if, typically, a human reading t would infer that h is most likely true. (Alternatively: t ⇒ h if and only if,
typically, a human reading t would be justified in inferring the proposition expressed by h from the
proposition expressed by t.) The relation is directional because even if "t entails h", the reverse "h entails
t" is much less certain. Many approaches and refinements of approaches have been considered, such as
word embedding, logical models, graphical models, rule systems, contextual focusing, and machine
learning.

Machine Translation
Machine translation, sometimes referred to by the abbreviation MT (not to be confused with computer-
aided translation, machine-aided human translation (MAHT) or interactive translation), is a sub-field of
computational linguistics that investigates the use of software to translate text or speech from one
language to another.
On a basic level, MT performs mechanical substitution of words in one language for words in another, but
that alone rarely produces a good translation because recognition of whole phrases and their closest
counterparts in the target language is needed. Not all words in one language have equivalent words in
another language, and many words have more than one meaning. In addition, two given languages may
have completely different structures.
Approaches
• Rule Based: The rule-based machine translation paradigm includes transfer-based machine
translation, interlingual machine translation and dictionary-based machine translation paradigms.
This type of translation is used mostly in the creation of dictionaries and grammar programs.
Unlike other methods, RBMT involves more information about the linguistics of the source and
target languages, using the morphological and syntactic rules and semantic analysis of both
languages. The basic approach involves linking the structure of the input sentence with the
structure of the output sentence using a parser and an analyzer for the source language, a
generator for the target language, and a transfer lexicon for the actual translation. RBMT's biggest
downfall is that everything must be made explicit.
• Statistical: Statistical machine translation tries to generate translations using statistical methods
based on bilingual text corpora, such as the Canadian Hansard corpus, the English-French record
of the Canadian parliament and EUROPARL, the record of the European Parliament. Where such
corpora are available, good results can be achieved translating similar texts, but such corpora
are still rare for many language pairs. Newer approaches to statistical machine translation,
such as METIS-II and PRESEMT, use a minimal corpus size and instead focus on deriving
syntactic structure through pattern recognition. SMT's biggest downfalls include its being
dependent upon huge amounts of parallel texts, its problems with morphology-rich languages
(especially with translating into such languages), and its inability to correct singleton errors.
• Example Based: Example-based machine translation (EBMT) approach was proposed by Makoto
Nagao in 1984. Example-based machine translation is based on the idea of analogy. In this
approach, the corpus that is used is one that contains texts that have already been translated.
Given a sentence that is to be translated, sentences from this corpus are selected that contain
similar sub-sentential components. The similar sentences are then used to translate the sub-
sentential components of the original sentence into the target language, and these phrases are
put together to form a complete translation.
• Hybrid MT: Hybrid machine translation (HMT) leverages the strengths of statistical and rule-based
translation methodologies. The approaches differ in a number of ways –
o Rules post-processed by statistics: Translations are performed using a rules based engine.
Statistics are then used in an attempt to adjust/correct the output from the rules engine.
o Statistics guided by rules: Rules are used to pre-process data in an attempt to better guide
the statistical engine. Rules are also used to post-process the statistical output to perform

functions such as normalization. This approach has a lot more power, flexibility and
control when translating. It also provides extensive control over the way in which the
content is processed during both pre-translation (e.g. markup of content and non-
translatable terms) and post-translation (e.g. post translation corrections and
adjustments).
• Neural MT: A deep learning based approach to MT, neural machine translation has made rapid
progress in recent years, and Google has announced that its translation services now use this
technology in preference to its previous statistical methods. A Microsoft team reported human
parity on the WMT-2017 Chinese-English news task in 2018, a historic milestone. Neural MT typically
uses architectures such as LSTMs, GRUs and other RNNs, along with attention mechanisms.

Question Answering
Question answering (QA) is a computer science discipline within the fields of information retrieval and
natural language processing (NLP), which is concerned with building systems that automatically answer
questions posed by humans in a natural language.
A question answering implementation, usually a computer program, may construct its answers by
querying a structured database of knowledge or information, usually a knowledge base. More commonly,
question answering systems can pull answers from an unstructured collection of natural language.
Question answering research attempts to deal with a wide range of question types including: fact, list,
definition, How, Why, hypothetical, semantically constrained, and cross-lingual questions.
• Closed-domain question answering deals with questions under a specific domain (for example,
medicine or automotive maintenance), and can exploit domain-specific knowledge frequently
formalized in ontologies. Alternatively, closed-domain might refer to a situation where only a
limited type of questions are accepted, such as questions asking for descriptive rather than
procedural information. Question answering systems in the context of machine reading
applications have also been constructed in the medical domain.
• Open-domain question answering deals with questions about nearly anything, and can only rely
on general ontologies and world knowledge. On the other hand, these systems usually have much
more data available from which to extract the answer.

Cross-language Information Retrieval (CLIR)
Cross-language information retrieval (CLIR) is a subfield of information retrieval dealing with retrieving
information written in a language different from the language of the user's query.

CLIR requires the ability to represent and match information in the same representation space even if the
query and the document collection are in different languages. The fundamental problem in CLIR is to
match terms in different languages that describe the same or a similar meaning. The strategy of mapping
between different language representations is usually machine translation. In CLIR, this translation
process can be done in several ways.
• Document translation is to map the document representation into the query representation
space.
• Query translation is to map the query representation into the document representation space.
• Pivot language or Interlingua is to map both document and query representations to a third space.

Recent CLIR Models
DUET: A document ranking model consisting of two separate deep neural network sub-models. The first
sub-model matches the query and the document using a local representation of text, while the second
learns distributed representations for queries and documents before matching them.
MUSE: This work studies cross-lingual word embeddings, where the word embeddings for two languages
are aligned in the same representation space. State-of-the-art methods for cross-lingual word embeddings
rely on bilingual supervision such as dictionaries or parallel corpora.
Unsupervised CLIR: Leverages shared cross-lingual word embedding spaces induced solely from
monolingual corpora in two languages through an iterative process based on adversarial neural networks.
Information retrieval is performed by calculating semantic similarity directly in the cross-lingual
embedding space. This does not require any bilingual supervision or relevance labels for documents.
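
A toy sketch of this last setting: the query and the documents, though in different languages, are
embedded into one shared space, and retrieval ranks documents by cosine similarity. The vectors below
are placeholders; a real system would use aligned embeddings (e.g., from MUSE).

import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

query_vec = np.array([0.9, 0.1, 0.2])        # English query, embedded
documents = {
    "doc_fr_1": np.array([0.8, 0.2, 0.1]),   # French documents, same space
    "doc_fr_2": np.array([0.1, 0.9, 0.3]),
}

ranked = sorted(documents, key=lambda d: cosine(query_vec, documents[d]), reverse=True)
print(ranked)   # documents ordered by semantic similarity to the query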