Machine learning (ML) and natural language processing (NLP)

nikolamilosevic86 874 views 23 slides Apr 23, 2019

About This Presentation

A short introduction to natural language processing (NLP) and machine learning (ML). It covers the sub-areas of artificial intelligence, then focuses mainly on machine learning and natural language processing, and explains the data mining process from a high-level perspective.


Slide Content

Natural language processing and machine learning Nikola Milosevic

What is AI? Intelligence exhibited by a machine. A flexible agent that interacts with its environment and performs actions to maximize its success towards a certain goal.

Popular AI

What is machine learning? A subfield of computer science that explores how machines can learn to perform a certain task without being explicitly programmed.

Data mining generally

Types of machine learning: supervised learning, semi-supervised learning, unsupervised learning, reinforcement learning.

Machine learning problems: classification, clustering, regression.

Testing the model: Iteratively improve the model. Test multiple algorithms – find the best one. No free lunch theorem: no single algorithm is best for every problem. Feedback loop for feature selection. Confusion matrix.
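The confusion matrix mentioned on this slide can be sketched in a few lines. This is a minimal illustration using only the Python standard library; the function names and the tiny label lists are hypothetical and not taken from the presentation.

```python
from collections import Counter


def confusion_matrix(y_true, y_pred, labels):
    """Return a nested dict: matrix[actual][predicted] = number of examples."""
    counts = Counter(zip(y_true, y_pred))
    return {a: {p: counts[(a, p)] for p in labels} for a in labels}


def precision_recall(matrix, label):
    """Compute precision and recall for one class from the confusion matrix."""
    true_positives = matrix[label][label]
    predicted_as_label = sum(matrix[actual][label] for actual in matrix)
    actually_label = sum(matrix[label].values())
    precision = true_positives / predicted_as_label if predicted_as_label else 0.0
    recall = true_positives / actually_label if actually_label else 0.0
    return precision, recall


# Hypothetical gold labels and model predictions, for illustration only.
y_true = ["pos", "pos", "pos", "neg", "neg", "neg"]
y_pred = ["pos", "pos", "neg", "neg", "neg", "pos"]

matrix = confusion_matrix(y_true, y_pred, ["pos", "neg"])
precision, recall = precision_recall(matrix, "pos")
```

Comparing these numbers across several algorithms on the same held-out data is one simple way to run the "test multiple algorithms" loop.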

Examples of ML frameworks and algorithms: scikit-learn is a Python library that implements the most useful algorithms (Naïve Bayes, SVM, random forests, decision trees…). Keras is a Python library that implements almost everything related to neural networks.

Text data: About 80% of the data in organizations is in text format. It is harder to analyse than structured data. The amount of textual documents is huge: in biomedicine alone, 2,200 scientific papers are published every day, and the volume is growing exponentially.

Main goals of text mining: make communication easier (e.g. translation); automate some processes (e.g. communication agents/chatbots); do data mining on textual and unstructured data.

Process overview

Challenges: "The man saw a woman with the telescope." Who has the telescope? Multiple senses, synonyms, homonyms, irony. Grammar and context can help. Acronyms.

Approaches. Rule-based: human-defined rules extract the information; this needs human experts who know how people express certain things, and it is quite laborious. Machine learning based: the machine tries to learn what to extract, guided by humans; this needs annotated corpora (usually fairly large), which are expensive and laborious to create.
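As a rough illustration of the rule-based approach, the sketch below uses one hand-written regular expression to pull dosage mentions out of text. The rule, the unit list and the example sentence are all hypothetical; a real system would need many such rules written by domain experts.

```python
import re

# A single human-defined rule: a number followed by a unit of measure.
# The set of units is an assumption made for this example.
DOSE_RULE = re.compile(r"\b(\d+(?:\.\d+)?)\s*(mg|g|ml)\b", re.IGNORECASE)


def extract_doses(text):
    """Return (amount, unit) pairs for every match of the dosage rule."""
    return [(float(m.group(1)), m.group(2).lower()) for m in DOSE_RULE.finditer(text)]


doses = extract_doses("Patient received 500 mg of paracetamol and 2.5 ml of syrup.")
# doses == [(500.0, "mg"), (2.5, "ml")]

# The rule misses anything its author did not anticipate, such as amounts
# written out in words, which is why this approach is laborious.
missed = extract_doses("Patient received half a gram of paracetamol.")
# missed == []
```

A machine learning system would instead learn such patterns from annotated examples, moving the effort from writing rules to annotating a corpus.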

Levels of analysis. Lexical: analysis of words. Syntactic: analysis of the organization of words (phrases, sentences). Semantic: analysis of meaning. Sometimes pragmatic: analysis of the pragmatics of using certain words and phrases (why did the author use them?).

Steps

Lexical processing: part-of-speech tagging; parsing (constituency, dependency); Stanford parser.

Semantic processing. Text classification: sentiment analysis (positive/negative), classification by topic (politics/sport/business), authorship detection (Tolkien, Rowling, Shakespeare). Named entity recognition. Topic modelling (unsupervised). Search.
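Sentiment analysis as text classification can be sketched with a small multinomial Naïve Bayes classifier. This is a minimal standard-library illustration with add-one smoothing; the class name and the four training sentences are hypothetical, and in practice a library implementation would normally be used.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesTextClassifier:
    """Tiny multinomial Naive Bayes over whitespace-separated, lowercased words."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)       # documents per class
        self.word_counts = defaultdict(Counter)   # word frequencies per class
        self.vocabulary = set()
        for text, label in zip(texts, labels):
            words = text.lower().split()
            self.word_counts[label].update(words)
            self.vocabulary.update(words)
        return self

    def predict(self, text):
        # Words never seen in training are ignored.
        words = [w for w in text.lower().split() if w in self.vocabulary]
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            score = math.log(doc_count / total_docs)  # log prior
            total_words = sum(self.word_counts[label].values())
            for w in words:
                # Add-one (Laplace) smoothing avoids zero probabilities.
                numerator = self.word_counts[label][w] + 1
                denominator = total_words + len(self.vocabulary)
                score += math.log(numerator / denominator)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical training data, far too small for real use.
train_texts = [
    "great film loved it",
    "wonderful acting great story",
    "terrible film hated it",
    "boring story awful acting",
]
train_labels = ["positive", "positive", "negative", "negative"]

clf = NaiveBayesTextClassifier().fit(train_texts, train_labels)
prediction = clf.predict("great story")  # "positive"
```

The same structure applies to topic classification or authorship detection; only the labels and the training texts change.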

Sequence modelling: a machine learning technique useful for named entity recognition. Typical models are conditional random fields (CRF) or recurrent neural networks (often LSTM).
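A sequence model for named entity recognition assigns one label to every token. The sketch below assumes the labels follow the common BIO scheme (B- begins an entity, I- continues it, O is outside); the slide does not name a tagging scheme, so this is an assumption, and the tags here are written by hand rather than predicted by a CRF or LSTM.

```python
def bio_to_entities(tokens, tags):
    """Group per-token BIO tags into (entity text, entity type) pairs."""
    entities = []
    current_tokens, current_type = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity starts; close the previous one if there is any.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_tokens and tag[2:] == current_type:
            # Continuation of the entity that is currently open.
            current_tokens.append(token)
        else:
            # "O" or an inconsistent "I-" tag: close any open entity.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:
        entities.append((" ".join(current_tokens), current_type))
    return entities


tokens = ["Marie", "Curie", "was", "born", "in", "Warsaw", "."]
tags = ["B-PER", "I-PER", "O", "O", "O", "B-LOC", "O"]  # hand-written labels

entities = bio_to_entities(tokens, tags)
# entities == [("Marie Curie", "PER"), ("Warsaw", "LOC")]
```

Because each tag depends on its neighbours (an I- tag only makes sense after a B- tag of the same type), models that look at the whole sequence tend to suit this task better than classifying each token in isolation.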

Feature engineering: selecting important features that help extract information. Features can be words, PoS tags, word shapes, vocabulary features, etc., and may depend on the task and methodology. It is an iterative process of selecting features and improving performance. Some features may confuse the algorithm.
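The kinds of features listed here can be sketched as a function that turns one token in context into a feature dictionary, the usual input format for CRF-style taggers. The feature names and the particular selection are hypothetical choices for illustration, not the set used in the presentation.

```python
import re


def word_shape(token):
    """Map uppercase letters to X, lowercase to x and digits to d."""
    shape = re.sub(r"[A-Z]", "X", token)
    shape = re.sub(r"[a-z]", "x", shape)
    # Digits are replaced last so the inserted "d" is not rewritten to "x".
    return re.sub(r"[0-9]", "d", shape)


def token_features(tokens, i):
    """Build a feature dictionary for the token at position i."""
    token = tokens[i]
    return {
        "lower": token.lower(),
        "shape": word_shape(token),
        "suffix3": token[-3:].lower(),
        "is_title": token.istitle(),
        "prev": tokens[i - 1].lower() if i > 0 else "<START>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<END>",
    }


features = token_features(["The", "BRCA1", "gene"], 1)
# features["shape"] == "XXXXd", features["prev"] == "the", features["next"] == "gene"
```

In the iterative process described above, features are added to or removed from such a function and the model is re-evaluated after each change.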

Search: finds the documents that are most relevant to a given user query. Usual techniques include TF-IDF weighting and cosine similarity. A search engine may additionally use links pointing to the text, positions of matched words and similar signals to rank the documents found. Tools: Apache Lucene and Solr (Java); there are also Python libraries.
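TF-IDF weighting combined with cosine similarity can be sketched in plain Python as follows. The smoothed IDF formula is one common variant chosen for this example, the whitespace tokenizer is deliberately naive, and the three documents are hypothetical; engines such as Lucene use more elaborate scoring.

```python
import math
from collections import Counter


def tokenize(text):
    """Naive tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def build_idf(documents):
    """Inverse document frequency with add-one smoothing (one common variant)."""
    n = len(documents)
    document_frequency = Counter()
    for doc in documents:
        document_frequency.update(set(tokenize(doc)))
    return {t: math.log((1 + n) / (1 + df)) + 1 for t, df in document_frequency.items()}


def tfidf_vector(text, idf):
    """Sparse TF-IDF vector as a dict; terms outside the collection are dropped."""
    term_frequency = Counter(tokenize(text))
    return {t: count * idf[t] for t, count in term_frequency.items() if t in idf}


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query, documents):
    """Return (score, document) pairs, most relevant first."""
    idf = build_idf(documents)
    query_vector = tfidf_vector(query, idf)
    scored = [(cosine(query_vector, tfidf_vector(d, idf)), d) for d in documents]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


docs = [
    "machine learning for text classification",
    "cooking pasta with tomato sauce",
    "text mining and natural language processing",
]
results = search("text classification", docs)
# The first document ranks highest; the cooking document scores 0.
```

Rare terms get a higher IDF weight, so matching "classification" (one document) counts for more than matching "text" (two documents).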

Language models: used as features for classification and other NLP tasks. They capture some basic characteristics of language. The most naïve (but also frequently used) is called Bag of Words. Neural networks use more advanced models: word2vec, GloVe, ELMo, BERT…
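The Bag of Words representation can be sketched as a vector of word counts over a fixed vocabulary. This is a minimal standard-library illustration; the function names and the two example sentences are hypothetical.

```python
from collections import Counter


def build_vocabulary(documents):
    """Map every distinct lowercased word to a fixed vector position."""
    words = sorted({w for doc in documents for w in doc.lower().split()})
    return {word: index for index, word in enumerate(words)}


def bag_of_words(text, vocabulary):
    """Count how often each vocabulary word occurs; word order is discarded."""
    vector = [0] * len(vocabulary)
    for word, count in Counter(text.lower().split()).items():
        if word in vocabulary:
            vector[vocabulary[word]] = count
    return vector


docs = ["the dog bit the man", "the man bit the dog"]
vocabulary = build_vocabulary(docs)  # positions: bit=0, dog=1, man=2, the=3

v1 = bag_of_words(docs[0], vocabulary)
v2 = bag_of_words(docs[1], vocabulary)
# v1 == v2 == [1, 1, 1, 2]
```

Both sentences get the same vector even though they mean different things, which shows why this model is called naïve and why embedding-based models that take context into account were developed.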

Useful tools and libraries: Apache OpenNLP – Java; Apache Lucene – Java, C#; Stanford CoreNLP – Java; NLTK – Python; GATE – GUI tool; SharpNLP; ... Weka – for machine learning (GUI).