NLTK: Natural Language Processing made easy

outsider2 5,877 views 47 slides Mar 08, 2009

About This Presentation

This presentation introduces the Natural Language Toolkit (NLTK), an open source library which simplifies the implementation of Natural Language Processing (NLP) in Python. It is useful for getting started with NLP and also for research and teaching.


Slide Content

http://barcampbangalore.org
NLTK
Natural Language Processing made easy
Elvis Joel D’Souza
Gopikrishnan Nambiar
Ashutosh Pandey

WHAT: Session Objective
To introduce the Natural Language Toolkit (NLTK),
an open source library which simplifies the
implementation of Natural Language
Processing (NLP) in Python.

HOW: Session Layout
This session is divided into 3 parts:
•Python – The programming language
•Natural Language Processing (NLP) – The concept
•Natural Language Toolkit (NLTK) – The tool for NLP
implementation in Python

Why Python?

Data Structures
Python has 4 built-in data structures:
1.List
2.Tuple
3.Dictionary
4.Set

List
•A list in Python is an ordered group of items
(or elements).
•It is a very general structure, and list elements
don't have to be of the same type.
listOfWords = ['this', 'is', 'a', 'list', 'of', 'words']
listOfRandomStuff = [1, 'pen', 'costs', 'Rs.', 6.50]

Tuple
•A tuple in Python is much like a list except
that it is immutable (unchangeable) once
created.
•They are generally used for data which should
not be edited.
Example: (100, 10, 0.01, 'hundred')
Number: 100
Square root: 10
Reciprocal: 0.01
Number in words: 'hundred'

Return a tuple
def func(x, y):
    # code to compute a and b
    return (a, b)
A very useful application of tuples is returning multiple values from a
function. In many other languages, returning multiple values
requires creating an object or container of some type.
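
As a minimal sketch of returning several values at once, the function below packs a number, its square root, and its reciprocal into one tuple. The name `describe_number` is made up for this illustration and does not come from the slides.

```python
import math

def describe_number(n):
    """Return (number, square root, reciprocal) as one tuple."""
    return (n, math.sqrt(n), 1 / n)

# Tuple unpacking assigns each returned value to its own name.
number, root, reciprocal = describe_number(100)
print(number, root, reciprocal)
```

The caller can also keep the result as a single tuple and index into it, which is handy when the values travel together.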

Dictionary
•A dictionary in Python is a collection of
values which are accessed by key rather than by position.
•Example:
{1: 'one', 2: 'two', 3: 'three'}
•Here, the key is a number and the value is
its name in words
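
A short sketch of key-based access, assuming a hypothetical dictionary (not taken from the slides) that maps each lowercase letter to its position in the alphabet:

```python
import string

# Hypothetical example: map each lowercase letter to its 1-based position.
position = {
    letter: index
    for index, letter in enumerate(string.ascii_lowercase, start=1)
}

print(position['a'])     # look up a value by its key
print('z' in position)   # test whether a key is present
print(len(position))     # number of key-value pairs
```

Looking up a key that is absent raises `KeyError`; `position.get('?')` returns `None` instead.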

Sets
•Python also has an implementation of the mathematical set.
•Unlike sequence objects such as lists and tuples, in which
each element is indexed, a set is an unordered collection of
objects.
•Sets also cannot have duplicate members - a given object
appears in a set 0 or 1 times.
SetOfBrowsers = set(['IE', 'Firefox', 'Opera', 'Chrome'])
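
The sketch below illustrates the two properties described on this slide: duplicates collapse to a single member, and sets support mathematical operations such as intersection. The `installed` set is an assumption added for the example.

```python
# 'Firefox' is listed twice but is stored only once.
browsers = set(['IE', 'Firefox', 'Opera', 'Chrome', 'Firefox'])
print(len(browsers))

# Membership tests do not depend on any position or index.
print('Opera' in browsers)

# Hypothetical second set, used to show set intersection.
installed = set(['Firefox', 'Safari'])
common = browsers & installed
print(common)
```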

Control Statements

Decision Control - If
num = 3

Loop Control - While
number = 10

Loop Control - For

Functions - Syntax
def functionname(arg1, arg2, ...):
    statement1
    statement2
    return variable

Functions - Example
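
A hypothetical function in the general `def ... return` form, assuming a simple task (averaging a list of numbers) chosen only for illustration:

```python
def average(numbers):
    """Return the arithmetic mean of a non-empty list of numbers."""
    total = sum(numbers)
    count = len(numbers)
    return total / count

result = average([2, 4, 6])
print(result)
```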

Modules
•A module is a file containing Python
definitions and statements.
•The file name is the module name with the
suffix .py appended.
•A module can be imported by another
program to make use of its functionality.

Import
import math
The import keyword tells Python that
we need the 'math' module.
This statement makes all the functions in this
module accessible in the program through the math. prefix.

Using Modules – An Example
print(math.sqrt(100))
sqrt is a function
math is a module
math.sqrt(100) returns 10.0
The result is printed to the standard output

Natural Language Processing
(NLP)

Natural Language Processing
The term natural language processing
encompasses a broad set of techniques for
automated generation, manipulation, and
analysis of natural or human languages

Why NLP
•Applications for processing large amounts of
text require NLP expertise
•Index and search large texts
•Speech understanding
•Information extraction
•Automatic summarization

Stemming
•Stemming is the process of reducing inflected
(or sometimes derived) words to their stem, base
or root form – generally a written word form.
•The stem need not be identical to the
morphological root of the word; it is usually
sufficient that related words map to the same
stem, even if this stem is not in itself a valid root.
•When you apply stemming on 'cats', the result is
'cat'
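
A minimal sketch of stemming with NLTK. The slide does not name a stemming algorithm, so the choice of `PorterStemmer` here is an assumption; the word list is also made up for the example.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# The slide's example: the plural 'cats' reduces to the stem 'cat'.
print(stemmer.stem('cats'))

# Related word forms map to one shared stem.
related = ['connect', 'connected', 'connecting', 'connection']
stems = [stemmer.stem(word) for word in related]
print(stems)
```

The Porter stemmer needs no downloaded data, which makes it a convenient first experiment.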

Part-of-speech tagging (POS Tagging)
•Part-of-speech (POS) tag: A word can be
classified into one or more lexical or
part-of-speech categories, such as nouns, verbs,
adjectives, and articles, to name a few.
•A POS tag is a symbol
representing such a lexical category, e.g., NN
(noun), VB (verb), JJ (adjective), AT (article).

POS tagging - continued
•Given a sentence and a set of POS tags, a
common language processing task is to
automatically assign POS tags to each word in
the sentence.
•State-of-the-art POS taggers can achieve
accuracy as high as 96%.

POS Tagging – An Example
The      ball  is    red
ARTICLE  NOUN  VERB  ADJECTIVE

Parsing
Parsing a sentence involves the use of
linguistic knowledge of a language to discover
the way in which a sentence is structured

Parsing – An Example
The      boy   went  home
ARTICLE  NOUN  VERB  NOUN
NP: The boy
VP: went home

Challenges
•We will often imply additional information in
spoken language by the way we place stress
on words.
•The sentence "I never said she stole my
money" demonstrates the importance stress
can play in a sentence, and thus the inherent
difficulty a natural language processor can
have in parsing it.

Depending on which word the speaker
stresses, a sentence can have several
distinct meanings.
Here is an example:

•"I never said she stole my money" (stress on "I")
Someone else said it, but I didn't.
•"I never said she stole my money" (stress on "never")
I simply didn't ever say it.
•"I never said she stole my money" (stress on "said")
I might have implied it in some way, but I
never explicitly said it.
•"I never said she stole my money" (stress on "she")
I said someone took it; I didn't say it was she.

•"I never said she stole my money" (stress on "stole")
I just said she probably borrowed it.
•"I never said she stole my money" (stress on "my")
I said she stole someone else's money.
•"I never said she stole my money" (stress on "money")
I said she stole something, but not my money.

NLTK
Natural Language Toolkit

Design Goals

Exploring Corpora
A corpus is a large collection of text which is
used either to train an NLP program or
as input to an NLP program.
In NLTK, a corpus can be loaded using the
PlaintextCorpusReader class.

Loading your own corpus
>>> from nltk.corpus import PlaintextCorpusReader
>>> corpus_root = r'C:\text'
>>> wordlists = PlaintextCorpusReader(corpus_root, '.*')
>>> wordlists.fileids()
['README', 'connectives', 'propernames', 'web2', 'web2a', 'words']
>>> wordlists.words('connectives')
['the', 'of', 'and', 'to', 'a', 'in', 'that', 'is', ...]
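
A self-contained sketch of the same idea that does not depend on a particular directory existing. It assumes a temporary folder and a made-up file name, `sample.txt`, created only for the demonstration.

```python
import os
import tempfile

from nltk.corpus import PlaintextCorpusReader

# Create a throwaway corpus directory holding one small text file.
corpus_root = tempfile.mkdtemp()
sample_path = os.path.join(corpus_root, 'sample.txt')
with open(sample_path, 'w', encoding='utf-8') as handle:
    handle.write('NLTK makes natural language processing easy.')

# The second argument is a regular expression selecting files to include.
wordlists = PlaintextCorpusReader(corpus_root, r'.*\.txt')

file_ids = wordlists.fileids()
words = list(wordlists.words('sample.txt'))
print(file_ids)
print(words)
```

Only `fileids()` and `words()` are used here because they work without any downloaded tokenizer models.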

NLTK Corpora
•Gutenberg corpus
•Brown corpus
•WordNet
•Stopwords
•Shakespeare corpus
•Treebank
•And many more…

Computing with Language: Simple Statistics
Frequency Distributions
>>> fdist1 = FreqDist(text1)
>>> fdist1
<FreqDist with 260819 outcomes>
>>> vocabulary1 = fdist1.keys()
>>> vocabulary1[:50]
[',', 'the', '.', 'of', 'and', 'a', 'to', ';', 'in', 'that', "'", '-',
'his', 'it', 'I', 's', 'is', 'he', 'with', 'was', 'as', '"', 'all', 'for',
'this', '!', 'at', 'by', 'but', 'not', '--', 'him', 'from', 'be', 'on',
'so', 'whale', 'one', 'you', 'had', 'have', 'there', 'But', 'or', 'were',
'now', 'which', '?', 'me', 'like']
>>> fdist1['whale']
906
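
A small sketch of a frequency distribution that runs without the Moby Dick text. It assumes a recent NLTK 3 release, where `FreqDist` behaves like `collections.Counter` and `most_common` lists samples by decreasing frequency; the token list is made up.

```python
from nltk import FreqDist

# Tiny hand-made token list standing in for a real text.
tokens = 'the whale and the sea and the ship'.split()

fdist = FreqDist(tokens)

print(fdist.N())             # total number of tokens counted
print(fdist['the'])          # count for a single word
print(fdist.most_common(2))  # the two most frequent words with counts
```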

Cumulative Frequency Plot for the 50 Most Frequent Words in Moby Dick

POS tagging
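
A minimal sketch of POS tagging in NLTK, assuming a hand-written lookup table instead of a trained tagger so that no model download is needed. The table and the simplified tags (AT, NN, VB, JJ) are illustrative assumptions.

```python
from nltk.tag import DefaultTagger, UnigramTagger

# Hand-written lookup table (an assumption for illustration, not trained data).
likely_tags = {'The': 'AT', 'ball': 'NN', 'is': 'VB', 'red': 'JJ'}

# Words missing from the table fall back to the default tag 'NN'.
tagger = UnigramTagger(model=likely_tags, backoff=DefaultTagger('NN'))

tagged = tagger.tag('The ball is red'.split())
print(tagged)
```

A tagger trained on a tagged corpus would replace the hand-written table in practice.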

WordNet Lemmatizer

Parsing
>>> from nltk.parse import ShiftReduceParser
>>> sr = ShiftReduceParser(grammar)
>>> sentence1 = 'the cat chased the dog'.split()
>>> sentence2 = 'the cat chased the dog on the rug'.split()
>>> for t in sr.parse(sentence1):
...     print(t)
(S (NP (DT the) (N cat)) (VP (V chased) (NP (DT the) (N dog))))
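
The transcript does not include the definition of `grammar`. The sketch below assumes a small context-free grammar written for this one sentence and uses the NLTK 3 API (`CFG.fromstring` and `parse`); it is one possible grammar, not necessarily the one used in the talk.

```python
from nltk import CFG
from nltk.parse import ShiftReduceParser

# Assumed toy grammar covering only the words of the example sentence.
grammar = CFG.fromstring("""
S -> NP VP
NP -> DT N
VP -> V NP
DT -> 'the'
N -> 'cat' | 'dog'
V -> 'chased'
""")

parser = ShiftReduceParser(grammar)
tokens = 'the cat chased the dog'.split()

# parse() yields every tree the parser finds; this parser finds at most one.
trees = list(parser.parse(tokens))
for tree in trees:
    print(tree)
```

A shift-reduce parser does not backtrack, so with a larger grammar it can miss a valid parse depending on the order of the productions.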

Authorship Attribution
An Example

Find nltk @
<python-installation>\Lib\site-packages\nltk

The Road Ahead
Python:
•http://www.python.org
•A Byte of Python, Swaroop CH
http://www.swaroopch.com/notes/python
Natural Language Processing:
•Speech and Language Processing, Jurafsky and Martin
•Foundations of Statistical Natural Language Processing,
Manning and Schütze
Natural Language Toolkit:
•http://www.nltk.org (for NLTK Book, Documentation)
•Upcoming book by O'Reilly Publishers