Text Processing with KNIME

KNIMESlides, 28 slides, Dec 23, 2015

About This Presentation

This tutorial was presented at a Boston KNIME User Meetup in 2014 and offers a crash course in KNIME, text processing, text mining, and topic classification.


Slide Content

Copyright © 2014 KNIME.com AG
Boston KNIME Users
Text Processing Applications
Kilian Thiel
KNIME

Agenda
•KNIME Crash Course
•Text Mining with KNIME: Mining Tripadvisor Data
•Text Mining with KNIME: Mining Amazon Reviews
(Anil Tarachandani)
•Networking Apero

Text Mining with KNIME: Mining Tripadvisor Data
Agenda
•The KNIME Textprocessing Extension
–Preliminaries
–Philosophy & Usage
•Classification of Tripadvisor Reviews
–Tripadvisor data
–Classification of reviews

Resources
http://tech.knime.org/knime-text-processing
•Documentation
•Examples
•Forum
•White Papers

Installation
[Figure: installation steps 1.) and 2.)]

Requirements
Requirements to import and run demo workflows
•KNIME 2.10
•Textprocessing (labs)
•Distance Matrix (KNIME)
•Palladian (Community)

Tips
•Settings (knime.ini)
–Set maximum memory for KNIME
–-Xmx3G

Demo
Prepare KNIME
•Go to KNIME directory
•Change knime.ini file (optional)
–-Xmx3G
•Start KNIME
•Install Textprocessing Extension
–(or better have it already installed)

Philosophy
[Figure: example text "… perhaps your name is Rumpelstiltskin[Person]? …" with a tagged term, a binary matrix of document representations, and the applications Visualization, Clustering, and Classification]

Additional Data Types
•Document Cell
–Encapsulates a document
•Title, sentences, terms, words
•Authors, category, source
•Generic meta data (key, value pairs)
•Term Cell
–Encapsulates a term
•Words, tags
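The two data types can be pictured as plain record types. The following is a minimal Python sketch of that idea, not the extension's actual Java classes: the class names, field names, and the sample review are all assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class Term:
    """A term: one or more words plus the tags assigned to it."""
    words: tuple[str, ...]
    tags: tuple[str, ...] = ()


@dataclass
class Document:
    """A document with its text structure and meta data."""
    title: str
    sentences: list[list[Term]]  # each sentence is a list of terms
    authors: list[str] = field(default_factory=list)
    category: str = ""
    source: str = ""
    meta: dict[str, str] = field(default_factory=dict)  # generic key/value pairs

    def terms(self) -> list[Term]:
        """All terms of the document, in reading order."""
        return [term for sentence in self.sentences for term in sentence]

    def words(self) -> list[str]:
        """All words of the document, in reading order."""
        return [word for term in self.terms() for word in term.words]


# Hypothetical sample document (made-up content).
doc = Document(
    title="Great pasta",
    sentences=[[Term(("Great",), ("JJ",)), Term(("pasta",), ("NN",))]],
    authors=["anonymous"],
    category="italian",
    source="Tripadvisor",
    meta={"rating": "5"},
)
print(doc.words())
```

Keeping tags on the term rather than on the word is what allows a multi-word term to carry a single tag.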

Data Table Structures
•Document table
–List of documents
•Bag of words
–Tuples of documents and terms
•Document vectors
–Numerical representations of documents
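As a rough illustration of the three table structures, the sketch below builds each one from three made-up sentences in plain Python. The column names and the binary (term present / absent) encoding are assumptions; KNIME tables hold document and term cells rather than strings.

```python
# Made-up example texts.
documents = ["good pizza good wine", "spicy noodles", "good noodles"]

# Document table: one row per document.
document_table = [{"doc_id": i, "text": text} for i, text in enumerate(documents)]

# Bag of words: one row per (document, term) pair, each pair listed once.
bag_of_words = []
for row in document_table:
    for term in dict.fromkeys(row["text"].split()):  # keeps order, drops repeats
        bag_of_words.append((row["doc_id"], term))

# Document vectors: one row per document, one column per term (1 = term present).
vocabulary = sorted({term for _, term in bag_of_words})
pairs = set(bag_of_words)
document_vectors = [
    [1 if (row["doc_id"], term) in pairs else 0 for term in vocabulary]
    for row in document_table
]

print(vocabulary)
for vector in document_vectors:
    print(vector)
```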

Philosophy and Data Table Structures
[Figure: the steps Enrichment and Preprocessing, with the table structures Documents, BoW, and Vectors]

Tripadvisor Data
[Figure: review fields Title, Author, Rating, Fulltext]

Tripadvisor Data
Reviews about Italian and Chinese restaurants in Boston
•Chinese: 272
•Italian: 268

Tripadvisor Data
Goal:
•Build classifier to distinguish between Chinese and Italian restaurants, based on their reviews.
Review about Italian or Chinese restaurant?

1.) Reading
Read/Parse textual data

Demo
Reading
•Read Tripadvisor data (.table file)
•Filter rows with missing restaurant value
•Convert strings to documents
•Filter all but the document column
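Outside KNIME, the same reading steps might look like the Python sketch below. The rows are made up, the column names ("title", "fulltext", "restaurant") are assumptions loosely based on the fields shown earlier, and a real .table file would be read with KNIME's table reader, which is not modeled here.

```python
# Made-up rows standing in for the contents of the .table file.
rows = [
    {"title": "Loved it", "fulltext": "Great pasta.", "restaurant": "italian"},
    {"title": "Unsure", "fulltext": "No idea where this was.", "restaurant": None},
    {"title": "Tasty", "fulltext": "Good dumplings.", "restaurant": "chinese"},
]

# Filter rows with a missing restaurant value.
complete_rows = [row for row in rows if row["restaurant"] is not None]

# Convert strings to documents; the restaurant type is stored as the
# document's category so that it can be extracted again later.
documents = [
    {"title": row["title"], "text": row["fulltext"], "category": row["restaurant"]}
    for row in complete_rows
]

# Filter all but the document column.
table = [{"document": document} for document in documents]

for row in table:
    print(row)
```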

2.) Enrichment
Enrich documents with semantic information

Demo
Enrichment / Tagging
•Apply POS Tagger node
•Use Bag of Words node to inspect tagging result
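To show what a tagging result looks like when inspected as a bag of words, here is a toy Python sketch. It swaps the POS Tagger node for a tiny hand-written lookup table, so it is not a real part-of-speech tagger; the lexicon, the Penn-Treebank-style tag names, and the "UNKNOWN" fallback are all assumptions.

```python
# Toy lexicon standing in for a trained POS model (assumption).
TOY_LEXICON = {"the": "DT", "pasta": "NN", "was": "VBD", "delicious": "JJ"}


def tag(text):
    """Attach a tag to every whitespace-separated word."""
    return [(word, TOY_LEXICON.get(word.lower(), "UNKNOWN")) for word in text.split()]


# Made-up documents.
documents = {0: "The pasta was delicious", 1: "Delicious dumplings"}
tagged = {doc_id: tag(text) for doc_id, text in documents.items()}

# Bag-of-words view: one row per (document, word, tag), handy for inspection.
bag_of_words = [
    (doc_id, word, pos_tag)
    for doc_id, terms in tagged.items()
    for word, pos_tag in terms
]

for row in bag_of_words:
    print(row)
```

Words missing from the lexicon show up with the fallback tag, which is exactly the kind of gap the bag-of-words inspection makes visible.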

3.) Preprocessing
Preprocess documents and filter words

Demo
Preprocessing
•Filter
–Numbers
–Punctuation marks
–Stop Words
•Convert to lower case
•Stemming
•Keep only nouns, verbs, adjectives
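The preprocessing chain above can be approximated in a few lines of Python. This is a sketch under several assumptions: the stop word list is a tiny made-up one, the stemmer is a naive suffix stripper rather than a real stemming algorithm such as Porter's, and the tags follow the Penn Treebank convention (NN*, VB*, JJ*).

```python
import re
import string

STOP_WORDS = {"the", "a", "and", "was", "were", "we", "of"}  # tiny made-up list
KEPT_TAG_PREFIXES = ("NN", "VB", "JJ")  # nouns, verbs, adjectives


def toy_stem(word):
    """Strip a few common suffixes; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(tagged_terms):
    """Apply the filters in order and return the surviving (word, tag) pairs."""
    result = []
    for word, tag in tagged_terms:
        if re.fullmatch(r"\d+([.,]\d+)*", word):  # numbers
            continue
        if all(ch in string.punctuation for ch in word):  # punctuation marks
            continue
        if word.lower() in STOP_WORDS:  # stop words
            continue
        word = toy_stem(word.lower())  # lower case, then stemming
        if tag.startswith(KEPT_TAG_PREFIXES):  # keep nouns, verbs, adjectives
            result.append((word, tag))
    return result


# Made-up tagged sentence.
tagged = [
    ("We", "PRP"), ("ordered", "VBD"), ("2", "CD"), ("pizzas", "NNS"), (",", ","),
    ("the", "DT"), ("crust", "NN"), ("was", "VBD"), ("amazing", "JJ"), ("!", "."),
]
print(preprocess(tagged))
```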

4.) Transformation
Creation of numerical representation of documents

Demo
Transformation
•Transform to bag of words
•Compute TF value for terms
•Transform to document vectors
•Extract category (class) value
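A small Python sketch of these transformation steps follows. It assumes relative term frequency (term count divided by document length); the slides do not say whether relative or absolute TF was used, and the two documents are made up.

```python
from collections import Counter

# Made-up preprocessed documents with their category.
documents = [
    {"category": "italian", "terms": ["pizza", "wine", "pizza", "crust"]},
    {"category": "chinese", "terms": ["noodle", "dumpling"]},
]

# Bag of words with a TF value per (document, term) pair.
bag_of_words = []
for doc_id, document in enumerate(documents):
    counts = Counter(document["terms"])
    total = len(document["terms"])
    for term, count in counts.items():
        bag_of_words.append((doc_id, term, count / total))  # relative TF (assumption)

# Document vectors: one column per term, filled with the TF values.
vocabulary = sorted({term for _, term, _ in bag_of_words})
column = {term: i for i, term in enumerate(vocabulary)}
vectors = [[0.0] * len(vocabulary) for _ in documents]
for doc_id, term, tf in bag_of_words:
    vectors[doc_id][column[term]] = tf

# Extract the category (class) value for each document.
classes = [document["category"] for document in documents]

print(vocabulary)
for vector, label in zip(vectors, classes):
    print(vector, label)
```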

5.) Classification
Training of a model (decision tree) and scoring

Demo
Classification
•Append color based on class
•Partition data into training and test set
•Train decision tree model on training data
•Apply decision tree model on test data
•Score model, measure accuracy
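The partition / train / apply / score steps can be sketched without KNIME as follows. This is a deliberately small decision tree for binary features using Gini impurity, not the algorithm of KNIME's Decision Tree Learner; the data is made up, and the fixed every-fourth-row partition is an assumption chosen to keep the run reproducible (random sampling would be the more usual choice). The coloring step is purely visual and is left out.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def train_tree(rows, labels, depth=0, max_depth=3):
    """Grow a tree (nested dicts) on rows of 0/1 features."""
    majority = Counter(labels).most_common(1)[0][0]
    if depth == max_depth or len(set(labels)) == 1:
        return {"leaf": majority}
    best = None  # (weighted impurity, feature index)
    for feature in range(len(rows[0])):
        left = [lab for row, lab in zip(rows, labels) if row[feature] == 0]
        right = [lab for row, lab in zip(rows, labels) if row[feature] == 1]
        if not left or not right:
            continue  # this feature does not split the data
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if best is None or score < best[0]:
            best = (score, feature)
    if best is None:
        return {"leaf": majority}
    feature = best[1]
    split = {0: ([], []), 1: ([], [])}
    for row, lab in zip(rows, labels):
        split[row[feature]][0].append(row)
        split[row[feature]][1].append(lab)
    return {
        "feature": feature,
        0: train_tree(*split[0], depth + 1, max_depth),
        1: train_tree(*split[1], depth + 1, max_depth),
    }


def predict(tree, row):
    """Follow the splits until a leaf is reached."""
    while "leaf" not in tree:
        tree = tree[row[tree["feature"]]]
    return tree["leaf"]


# Made-up binary document vectors; columns: pizza, pasta, noodle, dumpling.
italian = [[1, 0, 0, 0], [0, 1, 0, 0], [1, 1, 0, 0]] * 4
chinese = [[0, 0, 1, 0], [0, 0, 0, 1], [0, 0, 1, 1]] * 4
rows = italian + chinese
labels = ["italian"] * len(italian) + ["chinese"] * len(chinese)

# Partition data: every fourth row goes to the test set, the rest to training.
test_idx = [i for i in range(len(rows)) if i % 4 == 3]
train_idx = [i for i in range(len(rows)) if i % 4 != 3]
train_rows = [rows[i] for i in train_idx]
train_labels = [labels[i] for i in train_idx]
test_rows = [rows[i] for i in test_idx]
test_labels = [labels[i] for i in test_idx]

# Train on the training data, apply to the test data, then score.
tree = train_tree(train_rows, train_labels)
predictions = [predict(tree, row) for row in test_rows]
accuracy = sum(p == t for p, t in zip(predictions, test_labels)) / len(test_labels)
print(f"accuracy: {accuracy:.2f}")
```

The toy data is perfectly separable, so the score says nothing about what to expect on real reviews.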

Additional Workflows
•Multi Word Tagging
–Detection of frequent n-grams
–Creation of dictionary from n-grams
–Applying Dictionary Tagger
•Classification with Multi Words
•Clustering of documents
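The multi-word tagging idea (find frequent n-grams, turn them into a dictionary, then tag with that dictionary) can be sketched in Python as below. The example restricts itself to bigrams, and the frequency threshold, the "MULTIWORD" tag name, and the sample texts are all assumptions; KNIME's Dictionary Tagger node is not modeled in detail.

```python
from collections import Counter

# Made-up documents.
documents = [
    "the hot pot was great",
    "hot pot and fried rice",
    "fried rice was fine",
]


def ngrams(tokens, n):
    """All runs of n consecutive tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Detection of frequent n-grams (bigrams here).
counts = Counter(gram for text in documents for gram in ngrams(text.split(), 2))

# Creation of a dictionary from the frequent n-grams.
MIN_FREQUENCY = 2  # assumed threshold
dictionary = {gram for gram, count in counts.items() if count >= MIN_FREQUENCY}


def tag_multi_words(tokens, dictionary, tag="MULTIWORD"):
    """Merge dictionary bigrams into single tagged terms, scanning left to right."""
    result, i = [], 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in dictionary:
            result.append((" ".join(pair), tag))
            i += 2
        else:
            result.append((tokens[i], None))
            i += 1
    return result


# Applying the dictionary to one document.
print(tag_multi_words(documents[1].split(), dictionary))
```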

Thank You
Questions
•http://tech.knime.org/forum
[email protected]
Follow us
•Twitter: @KNIME
•LinkedIn: https://www.linkedin.com/groups?gid=2212172
•KNIME Blog: http://www.knime.org/blog