Large Scale Processing of Unstructured Text

Hadoop Summit, 36 slides, Jun 26, 2017

About This Presentation

Natural Language Processing (NLP) practitioners often have to analyze large corpora of unstructured documents, which is a tedious process. Python tools like NLTK do not scale to large production data sets and cannot be plugged into a distributed, scalable framework like Apache Spa...


Slide Content

Large Scale Processing of Text
Suneel Marthi
DataWorks Summit 2017,
San Jose, California

@suneelmarthi

$WhoAmI
● Principal Software Engineer in the Office of Technology, Red Hat
● Member of the Apache Software Foundation
● Committer and PMC member on Apache Mahout, Apache OpenNLP, Apache Streams

What is a Natural Language?
Any language that has evolved naturally in humans through use and repetition, without conscious planning or premeditation.
(From Wikipedia)

What is NOT a Natural Language?

Characteristics of Natural Language
Unstructured
Ambiguous
Complex
Hidden semantics
Ironic
Informal
Unpredictable
Rich
Constantly updated
Noisy
Hard to search

and it holds most of human knowledge

As information overload grows
ever worse, computers may
become our only hope for
handling a growing deluge of
documents.

MIT Press - May 12, 2017

What is Natural Language Processing?
NLP is a field of computer science, artificial intelligence and
computational linguistics concerned with the interactions
between computers and human (natural) languages, and, in
particular, concerned with programming computers to fruitfully
process large natural language corpora.(From Wikipedia)

???

How?

By solving small problems each time
A pipeline in which one type of ambiguity is resolved at each step, incrementally.

Sentence Detector
Mr. Robert talk is today at room num. 7. Let's go?
(Splitting on every period over-segments the text ❌; a trained detector finds only the true sentence boundaries ✅.)

Tokenizer
Mr. Robert talk is today at room num. 7. Let's go?
(Splitting on every punctuation character breaks abbreviations like "Mr." and "num." apart ❌; a trained tokenizer keeps them intact ✅.)
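The failure mode above is easy to reproduce. As a toy illustration in plain Java (not OpenNLP code), a splitter that breaks on every period over-segments the sample text because of the abbreviations "Mr." and "num.":

```java
import java.util.Arrays;
import java.util.List;

class NaiveSentenceSplit {
    // Split after any period followed by whitespace -- too naive for real text.
    static List<String> split(String text) {
        return Arrays.asList(text.split("(?<=\\.)\\s+"));
    }

    public static void main(String[] args) {
        String text = "Mr. Robert talk is today at room num. 7. Let's go?";
        // Produces 4 fragments instead of the 2 real sentences.
        System.out.println(split(text));
    }
}
```

A statistical sentence detector, such as OpenNLP's SentenceDetectorME, instead learns from training data which periods actually end a sentence.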

By solving small problems each time
Each step of a pipeline solves one ambiguity problem.
Name Finder
<Person>Washington</Person> was the first president of the USA.
<Place>Washington</Place> is a state in the Pacific Northwest region
of the USA.
POS Tagger
Laura Keene brushed by him with the glass of water .
NNP   NNP   VBD     IN PRP IN   DT  NN    IN NN    .

By solving small problems each time
A pipeline can be long and resolve many ambiguities
Lemmatizer
He is better than many others
He be good   than many other
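To make this step concrete, here is a toy dictionary lemmatizer in plain Java; the word-to-lemma map is hand-built for this one sentence, whereas OpenNLP's LemmatizerME uses a trained model or a full dictionary:

```java
import java.util.Arrays;
import java.util.Map;
import java.util.stream.Collectors;

class ToyLemmatizer {
    // Hand-built lookup covering only the example sentence.
    static final Map<String, String> LEMMAS =
        Map.of("is", "be", "better", "good", "others", "other");

    static String lemmatize(String sentence) {
        return Arrays.stream(sentence.split(" "))
            .map(w -> LEMMAS.getOrDefault(w, w))
            .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        System.out.println(lemmatize("He is better than many others"));
        // He be good than many other
    }
}
```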

Apache OpenNLP
Mature project (> 10 years)
Actively developed
Machine learning
Java
Easy to train
Highly customizable
Fast

Language Detector (soon)
Sentence detector
Tokenizer
Part of Speech Tagger
Lemmatizer
Chunker
Parser
....

Training Models for English
Corpus - OntoNotes (https://catalog.ldc.upenn.edu/ldc2013t19)

bin/opennlp TokenNameFinderTrainer.ontonotes -lang eng -ontoNotesDir
~/opennlp-data-dir/ontonotes4/data/files/data/english/ -model en-ner-ontonotes.bin

bin/opennlp POSTaggerTrainer.ontonotes -lang eng -ontoNotesDir
~/opennlp-data-dir/ontonotes4/data/files/data/english/ -model en-pos-maxent.bin

Training Models for Portuguese
Corpus - Amazonia (http://www.linguateca.pt/floresta/corpus.html)

bin/opennlp TokenizerTrainer.ad -lang por -data amazonia.ad -model por-tokenizer.bin -detokenizer
lang/pt/tokenizer/pt-detokenizer.xml -encoding ISO-8859-1

bin/opennlp POSTaggerTrainer.ad -lang por -data amazonia.ad -model por-pos.bin -encoding
ISO-8859-1 -includeFeatures false

bin/opennlp ChunkerTrainerME.ad -lang por -data amazonia.ad -model por-chunk.bin -encoding
ISO-8859-1

bin/opennlp TokenNameFinderTrainer.ad -lang por -data amazonia.ad -model por-ner.bin -encoding
ISO-8859-1

Name Finder API - Detect Names
NameFinderME nameFinder = new NameFinderME(new TokenNameFinderModel(
    OpenNLPMain.class.getResource("/opennlp-models/por-ner.bin")));

for (String[][] document : documents) {
  for (String[] sentence : document) {
    Span[] nameSpans = nameFinder.find(sentence);
    // do something with the names
  }
  nameFinder.clearAdaptiveData();
}

Name Finder API - Train a model
ObjectStream<String> lineStream = new PlainTextByLineStream(
    new FileInputStream("en-ner-person.train"), StandardCharsets.UTF_8);

TokenNameFinderModel model;
try (ObjectStream<NameSample> sampleStream =
        new NameSampleDataStream(lineStream)) {
  model = NameFinderME.train("en", "person", sampleStream,
      TrainingParameters.defaultParams(), new TokenNameFinderFactory());
}

model.serialize(modelFile);

Name Finder API - Evaluate a model
TokenNameFinderEvaluator evaluator = new TokenNameFinderEvaluator(new
NameFinderME(model));

evaluator.evaluate(sampleStream);

FMeasure result = evaluator.getFMeasure();

System.out.println(result.toString());
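The F-measure printed above is the harmonic mean of precision and recall. As a minimal sketch of the arithmetic (the counts below are made up for illustration):

```java
class FMeasureSketch {
    // F1 = 2PR / (P + R), with P = tp / (tp + fp) and R = tp / (tp + fn).
    static double f1(int tp, int fp, int fn) {
        double p = (double) tp / (tp + fp);
        double r = (double) tp / (tp + fn);
        return 2 * p * r / (p + r);
    }

    public static void main(String[] args) {
        // 8 names found correctly, 2 spurious, 4 missed:
        // precision 0.8, recall ~0.667, F1 ~0.727.
        System.out.println(f1(8, 2, 4));
    }
}
```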

Name Finder API - Cross Evaluate a model
FileInputStream sampleDataIn = new FileInputStream("en-ner-person.train");
ObjectStream<NameSample> sampleStream = new NameSampleDataStream(
    new PlainTextByLineStream(sampleDataIn.getChannel(), StandardCharsets.UTF_8));

TokenNameFinderCrossValidator evaluator =
    new TokenNameFinderCrossValidator("en", 100, 5);

evaluator.evaluate(sampleStream, 10);

FMeasure result = evaluator.getFMeasure();
System.out.println(result.toString());

Per-language pipelines (Language 1 ... Language N):
Language Detector → Sentence Detector → Tokenizer → POS Tagger → Lemmatizer → Name Finder → Chunker → Index

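The routing step in the diagram can be sketched in plain Java. The detectLanguage method below is a hypothetical stand-in keyed on a few Portuguese function words; a real pipeline would call OpenNLP's LanguageDetectorME here:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class LanguageRouter {
    // Hypothetical detector: a real pipeline would use a trained model here.
    static String detectLanguage(String text) {
        return text.matches("(?i).*\\b(o|de|que)\\b.*") ? "por" : "eng";
    }

    // Group documents by detected language so each goes to its own pipeline.
    static Map<String, List<String>> route(List<String> docs) {
        Map<String, List<String>> byLang = new HashMap<>();
        for (String doc : docs) {
            byLang.computeIfAbsent(detectLanguage(doc), k -> new ArrayList<>()).add(doc);
        }
        return byLang;
    }

    public static void main(String[] args) {
        System.out.println(route(List.of(
            "Washington was the first president of the USA.",
            "O presidente falou hoje.")));
    }
}
```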
Apache Flink
Mature project - 320+ contributors, > 11K commits
Very active project on GitHub
Java/Scala
Streaming first
Fault-Tolerant
Scalable - to 1000s of nodes and more
High Throughput, Low Latency

Apache Flink - POS Tagger and NER
final StreamExecutionEnvironment env =
    StreamExecutionEnvironment.getExecutionEnvironment();

DataStream<String> portugeseText = env.readTextFile(
    OpenNLPMain.class.getResource("/input/por_newscrawl.txt").getFile());

DataStream<String> engText = env.readTextFile(
    OpenNLPMain.class.getResource("/input/eng_news.txt").getFile());

DataStream<String> mergedStream = engText.union(portugeseText);

SplitStream<Tuple2<String, String>> langStream =
    mergedStream.split(new LanguageSelector());

Apache Flink - POS Tagger and NER
DataStream<Tuple2<String, String>> porNewsArticles = langStream.select("por");

DataStream<Tuple2<String, String[]>> porNewsTokenized =
    porNewsArticles.map(new PorTokenizerMapFunction());

DataStream<POSSample> porNewsPOS =
    porNewsTokenized.map(new PorPOSTaggerMapFunction());

DataStream<NameSample> porNewsNamedEntities =
    porNewsTokenized.map(new PorNameFinderMapFunction());

Apache Flink - POS Tagger and NER
private static class LanguageSelector
    implements OutputSelector<Tuple2<String, String>> {
  public Iterable<String> select(Tuple2<String, String> s) {
    List<String> list = new ArrayList<>();
    list.add(languageDetectorME.predictLanguage(s.f1).getLang());
    return list;
  }
}

private static class PorTokenizerMapFunction
    implements MapFunction<Tuple2<String, String>, Tuple2<String, String[]>> {
  public Tuple2<String, String[]> map(Tuple2<String, String> s) {
    return new Tuple2<>(s.f0, porTokenizer.tokenize(s.f1));
  }
}

Apache Flink - POS Tagger and NER
private static class PorPOSTaggerMapFunction
    implements MapFunction<Tuple2<String, String[]>, POSSample> {
  public POSSample map(Tuple2<String, String[]> s) {
    String[] tags = porPosTagger.tag(s.f1);
    return new POSSample(s.f0, s.f1, tags);
  }
}

private static class PorNameFinderMapFunction
    implements MapFunction<Tuple2<String, String[]>, NameSample> {
  public NameSample map(Tuple2<String, String[]> s) {
    Span[] names = porNameFinder.find(s.f1);
    return new NameSample(s.f0, s.f1, names, null, true);
  }
}

What’s Coming ??
● DL4J: mature project: 114 contributors, ~8K commits
● Modular: tensor library, reinforcement learning, ETL, ...
● Focused on integrating with the JVM ecosystem while supporting state-of-the-art hardware like GPUs on large clusters
● Implements most neural nets you'd need for language
● Named Entity Recognition using DL4J with LSTMs
● Language Detection using DL4J with LSTMs
● Possible: translation using bidirectional LSTMs with embeddings
● Computation graph architecture for more advanced use cases

Credits
Joern Kottmann, PMC Chair, Apache OpenNLP

Tommaso Teofili, PMC, Apache Lucene and Apache OpenNLP

William Colen, Head of Technology, Stilingue - Inteligência Artificial, Sao Paulo, Brazil; PMC, Apache OpenNLP

Till Rohrmann, Engineering Lead, Data Artisans, Berlin, Germany; Committer and PMC, Apache Flink

Fabian Hueske, Data Artisans; Committer and PMC, Apache Flink

Questions ???