Python NLTK

albertspumpurs 1,492 views 28 slides Feb 05, 2015

About This Presentation

A quick tutorial on the Python NLTK library, covering text analysis use cases and their potential business value.


Slide Content

NLTK Alberts Pumpurs

90% of the world's data was generated over the last two years

What a common Internet user creates
Visual: Instagram, Flickr, Vscocam, Facebook, Tumblr
Textual: Blogger, Twitter, Facebook, emails, customer reviews

Detecting hidden signals

The world is full of unstructured, text-rich data: everything from emails to customer tweets. The information buried in all that text holds the potential to deliver valuable business insights.

Text analytics is the practice of using technology to gather, store, and mine textual information for hidden signals that can inform smarter business decisions.

An explosion of unstructured data

Many types of organizations are experiencing explosive growth in their unstructured enterprise data. At the same time, they have access to external sources of data such as social media, blogs, and mobile data.

Until now, much of this information passed through the organization virtually unanalyzed. Today, new tools for handling large amounts of complex data make it easier to squeeze value from such unlikely sources.

Text Processing use cases

sentiment analysis
spam filtering
text categorization
topic detection
keyword frequency
plagiarism detection
document similarity
phrase extraction
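As a small illustration of one of these use cases, keyword frequency can be sketched with the Python standard library alone. The sample text and the tiny stopword set are made up for this example and are not from the slides.

```python
import re
from collections import Counter

# Hypothetical sample text and stopword list, for illustration only.
text = "Data is everywhere. Text data hides signals, and mining text data pays off."
stopwords = {"is", "and", "the", "off"}

# Lowercase the text, keep alphabetic runs, and drop stopwords.
words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in stopwords]
keyword_counts = Counter(words)

# The two most frequent keywords with their counts.
print(keyword_counts.most_common(2))
```

NLTK's `FreqDist` is a subclass of `Counter`, so the same `most_common` call works there too.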

Natural Language Toolkit (NLTK): a leading platform for building Python programs to work with human language data

NLTK Features

sentence and word tokenization
text classification
corpora
parsing
clustering
part-of-speech tagging
text stemming
and much more
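Stemming has no slide of its own, so here is a minimal sketch of it, assuming NLTK is installed. `PorterStemmer` is rule-based and needs no extra data download. The word list is made up for the example.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Hypothetical input words; each is reduced to its Porter stem.
words = ["running", "cats", "connected", "connections"]
stems = [stemmer.stem(w) for w in words]

for word, stem in zip(words, stems):
    print(word, "->", stem)
```

Stems are not always dictionary words. They only need to be the same for related word forms.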

Sentence tokenization
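A minimal sketch of sentence splitting, assuming NLTK is installed. The usual entry point is `nltk.sent_tokenize`, which relies on a pretrained Punkt model that must be downloaded first. To stay free of downloads, this example uses an untrained `PunktSentenceTokenizer`, which has no learned abbreviation list and so can split wrongly after abbreviations such as "Dr." in real text.

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer

# Made-up sample text for illustration.
text = "NLTK is a toolkit. It works with text. Try it today."

# Untrained Punkt tokenizer: no model download, but no learned abbreviations.
sentences = PunktSentenceTokenizer().tokenize(text)

for sentence in sentences:
    print(sentence)
```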

Word tokenization
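A minimal sketch of word tokenization, assuming NLTK is installed. It uses `TreebankWordTokenizer`, a regex-based tokenizer that needs no data download. The more common `nltk.word_tokenize` first splits sentences and therefore needs the Punkt model. The sample sentence is made up.

```python
from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()

# Contractions and punctuation are split into separate tokens.
tokens = tokenizer.tokenize("Don't panic, it's only text.")
print(tokens)
```

Splitting "Don't" into "Do" and "n't" follows the Penn Treebank convention, which the part-of-speech tags on the next slides also assume.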

Part of speech tagging

Part of speech tagging explanation
CC   Coordinating conjunction
CD   Cardinal number
DT   Determiner
EX   Existential "there"
FW   Foreign word
IN   Preposition or subordinating conjunction
JJ   Adjective
JJR  Adjective, comparative
JJS  Adjective, superlative
LS   List item marker
MD   Modal
NN   Noun, singular or mass
NNS  Noun, plural
NNP  Proper noun, singular
nltk.help.upenn_tagset()  # lists all tags
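To show what tagger output looks like, here is a toy sketch, assuming NLTK is installed. In practice `nltk.pos_tag` with its pretrained model (a separate download) is the usual choice. The `RegexpTagger` rules below are hand-written for this example and are far too crude for real text.

```python
from nltk.tag import RegexpTagger

# Hypothetical, hand-written rules; the first matching pattern wins.
patterns = [
    (r"^(the|a|an)$", "DT"),  # determiners
    (r"^\d+$", "CD"),         # cardinal numbers
    (r".*ing$", "VBG"),       # gerunds
    (r".*ed$", "VBD"),        # past-tense verbs
    (r".*s$", "NNS"),         # plural nouns
    (r".*", "NN"),            # default: singular noun
]
tagger = RegexpTagger(patterns)

# The tagger returns a list of (word, tag) pairs.
tagged = tagger.tag("the dog chased 3 cats".split())
print(tagged)
```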

Chunking and NER
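A minimal sketch of chunking, assuming NLTK is installed. `RegexpParser` groups tagged tokens into noun phrases using a pattern over part-of-speech tags. The sentence is tagged by hand here and the grammar is an illustrative assumption. Named entity recognition with `nltk.ne_chunk` needs downloaded models and is not shown.

```python
from nltk.chunk import RegexpParser

# Hand-tagged sample sentence, for illustration only.
tagged = [
    ("the", "DT"), ("little", "JJ"), ("dog", "NN"),
    ("barked", "VBD"), ("at", "IN"),
    ("the", "DT"), ("cat", "NN"),
]

# NP = optional determiner, any number of adjectives, then a noun.
parser = RegexpParser("NP: {<DT>?<JJ>*<NN>}")
tree = parser.parse(tagged)

# Collect the words of every NP subtree.
noun_phrases = [
    " ".join(word for word, tag in subtree.leaves())
    for subtree in tree.subtrees()
    if subtree.label() == "NP"
]
print(noun_phrases)
```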

Text classification
Algorithms in NLTK: Naive Bayes, Maximum Entropy, Decision Tree

Text classification
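A minimal sketch of text classification with NLTK's Naive Bayes classifier, assuming NLTK is installed. The four training sentences, the `pos`/`neg` labels, and the bag-of-words `features` helper are made up for this example. A real classifier needs far more labelled data.

```python
from nltk.classify import NaiveBayesClassifier


def features(text):
    """Hypothetical feature extractor: mark each word as present."""
    return {word: True for word in text.lower().split()}


# Tiny hand-made training set of (features, label) pairs.
train = [
    (features("great product love it"), "pos"),
    (features("excellent service very happy"), "pos"),
    (features("terrible product hate it"), "neg"),
    (features("awful service very disappointed"), "neg"),
]
classifier = NaiveBayesClassifier.train(train)

positive_guess = classifier.classify(features("love this great service"))
negative_guess = classifier.classify(features("hate this awful product"))
print(positive_guess, negative_guess)
```

The same pattern applies to sentiment analysis: only the labels and the training data change.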

Sentiment analysis https://github.com/pumpurs/SentimentWordsLV/

Document similarity detection
Tf-idf stands for term frequency-inverse document frequency. The tf-idf weight is often used in information retrieval and text mining: it is a statistical measure of how important a word is to a document in a collection or corpus.
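The weight described above can be sketched in plain Python. This example assumes one common variant, tf = count / document length and idf = log(N / document frequency), and compares documents by cosine similarity. Other weightings exist, and the three sample documents are made up.

```python
import math
from collections import Counter

# Hypothetical mini-corpus, for illustration only.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "stock markets fell sharply today",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: the number of documents each term appears in.
df = Counter(term for tokens in tokenized for term in set(tokens))


def tfidf(tokens):
    """Map each term of one document to its tf-idf weight."""
    counts = Counter(tokens)
    return {t: (c / len(tokens)) * math.log(n_docs / df[t]) for t, c in counts.items()}


def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


vectors = [tfidf(t) for t in tokenized]
sim_cat_dog = cosine(vectors[0], vectors[1])
sim_cat_stock = cosine(vectors[0], vectors[2])
print(sim_cat_dog, sim_cat_stock)
```

The first two documents share several terms and get a positive score. The third shares none with the first, so its similarity is zero.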

Similarity and concordance
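A minimal sketch of a concordance lookup, assuming NLTK is installed. `ConcordanceIndex` records where a word occurs, and `Text.concordance` prints each occurrence with its surrounding context. The token list is made up. The related `Text.similar` call, which lists words used in similar contexts, needs a much larger text to be useful and is not shown.

```python
from nltk.text import ConcordanceIndex, Text

# Hypothetical pre-tokenized text, for illustration only.
tokens = (
    "Text data is everywhere . Unstructured data hides signals . "
    "Mining data pays off ."
).split()

# Case-insensitive index of token positions.
index = ConcordanceIndex(tokens, key=lambda s: s.lower())
offsets = index.offsets("data")
print(offsets)

# Print every occurrence of "data" with a little context on both sides.
Text(tokens).concordance("data", width=40)
```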

Dispersion Plot
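A dispersion plot shows where in a text each chosen word occurs. NLTK draws one with `Text.dispersion_plot`, which needs matplotlib. The sketch below only illustrates the underlying idea in plain Python, printing one strip per word with a mark at each position. The tokens and target words are made up.

```python
# Hypothetical pre-tokenized text and target words, for illustration only.
tokens = (
    "Text data is everywhere . Unstructured data hides signals . "
    "Mining data pays off ."
).split()
targets = ["data", "signals"]

# Token offsets at which each target word occurs (case-insensitive).
positions = {
    t: [i for i, tok in enumerate(tokens) if tok.lower() == t]
    for t in targets
}

# One text strip per word: "|" marks an occurrence, "." any other token.
for t in targets:
    strip = "".join("|" if i in positions[t] else "." for i in range(len(tokens)))
    print(f"{t:>8} {strip}")
```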

But where is the

"Market and product research"
"Social CMS"
1.97 billion social network users
"Customer profiling / analytics"
70% of marketers used Facebook to gain
6.7 million people blog on blogging sites

Big Data, Startups, Text Analysis, Internet of Things, Web Development