Data Science: lesson01_intro-to-ds-and-ml.pdf

alhashediyemen 39 views 43 slides Jul 12, 2024

Lewis Tunstall | Data Scientist | [email protected]
Leandro von Werra | Data Scientist | [email protected]
BTW2401 - Data Science
Lesson 1.1 - The Big Picture

Overall Course Goals
By the end of this course you will:
•Know how to approach business problems from a data science perspective
•Understand the fundamental principles behind extracting useful knowledge from data
•Gain hands-on experience with mining data for insights

Skills Overview
In this course you are going to learn several skills:
•Python programming and core libraries for data analysis, visualisation, and modelling
•Working with data: collecting, cleaning, transforming
•Creating and interpreting descriptive statistics
•Creating and interpreting data visualisations
•Creating statistical models for inference
•Practical machine learning

Course Materials
This course is largely based on the excellent textbook:
•Data Science for Business, F. Provost & T. Fawcett (O'Reilly Media, Sebastopol, 2013).
Other useful references include:
•Hands-On Machine Learning with Scikit-Learn and TensorFlow, A. Géron (O'Reilly Media, 2017)
•Introduction to Machine Learning for Coders, fast.ai (http://course18.fast.ai/ml)

Timetable and Key Dates
Week No.  Date   Topic
8         20.02  Introduction to data science
9         27.02  Python for data analysis I
10        5.03   Python for data analysis II
11        12.03  Introduction to random forests
12        19.03  Random forest deep dive
13        26.03  Model interpretation
14        2.04   Classification
15        9.04   No class (Easter)
16        16.04  Midterm exam & define group projects
17        23.04  Cross-validation and model performance
18        30.04  Neighbours and clusters I
19        7.05   Neighbours and clusters II
20        14.05  Natural language processing I
21        21.05  No class (Ascension) & project submission
22        28.05  Natural language processing II
23        4.06   Project presentations
24        11.06  Deep learning
25        18.06  Exam Preparation
26        25.06  Exam Preparation
27        TBD    Final Exam

About Us: Backgrounds
Lewis
•PhD in theoretical physics from Adelaide, Australia
•Postdoctoral researcher in Switzerland (University of Bern)
•2+ years working in industry
•Expertise in machine learning & mathematics
Leandro
•MSc in computational physics from ETH
•2+ years working in industry
•Focus on application of machine learning to big data

About Us: What We Actually Do
Raw data, little value → Data exploration → Analysis, Model building → Reporting, Automation

This Lesson
We aim to answer the following 3 questions:
•What is data science?
•Why is it important?
•How is data science performed?
To answer these questions, we will focus on a series of trends that are driving the data science “revolution”:
•Big Data
•Machine Learning
We finish with
•A mini (ungraded!) quiz
•Onboarding of Python, Kaggle, and Paperspace

What is Data Science?
It’s a surprisingly hard definition to nail down:
•Superfluous?
•Buzzword?
You know it’s serious when your field makes it onto Gartner’s hype cycle¹
[Figure: hype cycle with Python, Deep/Machine Learning, and Predictive Analytics marked]
¹ https://en.wikipedia.org/wiki/Hype_cycle
What is Data Science?
Despite the hype, useful definitions exist¹:
Data science is about the extraction of useful information and knowledge from large volumes of data, in order to improve business decision-making
Data science is an interdisciplinary subject with 3 key areas:
•Statistics
•Computer science
•Domain expertise
¹ Provost & Fawcett, Chapter 1

Why is Data Science Important?
In the past, data analysis was typically slow: it needed teams of statisticians, analysts, etc. to explore data manually
Today: volume, velocity, and variety make manual analysis impossible …

Big Data: The Large Hadron Collider at CERN
•150 million sensors delivering data 40 million times per second.
•There are nearly 600 million collisions per second.
•Only 100 collisions of interest per second.
•Raw data production exceeds 500 exabytes per day (1 EB = 1 million TB).
•Due to filtering, only 200 petabytes are generated annually (1 PB = 1000 TB).
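
To get a feel for how aggressive the filtering is, the figures quoted on this slide can be combined in a few lines of Python. This is only back-of-the-envelope arithmetic: the 365-day year and the variable names are assumptions made for the sketch, and the inputs are the slide's numbers, not independent measurements.

```python
# Unit conversions exactly as stated on the slide (decimal units).
TB_PER_PB = 1_000
TB_PER_EB = 1_000_000

raw_eb_per_day = 500        # raw data production quoted on the slide
stored_pb_per_year = 200    # data kept after filtering, per the slide
days_per_year = 365         # assumption: the detector runs every day

raw_tb_per_year = raw_eb_per_day * TB_PER_EB * days_per_year
stored_tb_per_year = stored_pb_per_year * TB_PER_PB

# How many TB are produced for every TB that is kept.
reduction_factor = raw_tb_per_year / stored_tb_per_year

# Share of collisions that are "of interest" (100 out of 600 million per second).
interesting_fraction = 100 / 600_000_000

print(f"Raw data per year:      {raw_tb_per_year:.3e} TB")
print(f"Kept after filtering:   {stored_tb_per_year:.3e} TB")
print(f"Reduction factor:       about {reduction_factor:,.0f} to 1")
print(f"Collisions of interest: {interesting_fraction:.2e} of all collisions")
```

Under these assumptions, less than one part in 900,000 of the raw data survives the filters.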

Why is Data Science Important?
… but fast computers and good algorithms allow much deeper analyses than before
⇒ data-driven decision making
⇒ base decisions on analysis of data, not intuition
Cartoon from Provost & Fawcett, Chapter 1

The Data Science Process
How is data science performed?¹
Find a question → Collect the data → Prepare the data → Create a model → Evaluate the model → Deploy the model (a cycle centred on the data)
•Iterative process
•Non-sequential
•Early termination
•Established processes, e.g. CRISP-DM (https://bit.ly/1tX6508)
¹ From Data Science: The Big Picture, M. Renze, Pluralsight
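
One pass through the process can be sketched in plain Python. Everything in this sketch is hypothetical and chosen only for illustration: the toy data (temperature versus ice creams sold), the function names, the deliberately naive "predict the training average" model, and the acceptance threshold. The point is the shape of the loop, including the decision to go back to an earlier step when the evaluation is not good enough.

```python
from statistics import mean

# Question (hypothetical): how many ice creams will we sell tomorrow?

def collect_data():
    """Return raw (temperature in °C, units sold) records; None marks a missing value."""
    return [(18, 95), (21, 120), (None, 80), (25, 160),
            (30, 210), (28, None), (23, 138), (27, 181)]

def prepare_data(rows):
    """Drop records with missing values."""
    return [(x, y) for x, y in rows if x is not None and y is not None]

def create_model(train_rows):
    """Deliberately naive baseline: always predict the average of the training targets."""
    average = mean(y for _, y in train_rows)
    return lambda x: average

def evaluate_model(model, test_rows):
    """Mean absolute error on records the model has not seen."""
    return mean(abs(model(x) - y) for x, y in test_rows)

rows = prepare_data(collect_data())
train_rows, test_rows = rows[:4], rows[4:]

model = create_model(train_rows)
error = evaluate_model(model, test_rows)

max_acceptable_error = 15          # hypothetical business requirement
ready_to_deploy = error <= max_acceptable_error

if ready_to_deploy:
    print(f"Error {error:.1f} is acceptable: deploy the model")
else:
    print(f"Error {error:.1f} is too high: go back and improve the data or the model")
```

Here the baseline fails the check, which is the iterative, non-sequential behaviour listed above: the next step is an earlier stage of the process, not deployment.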

Example #1
Hurricane Frances was on its way, barreling across the Caribbean, threatening a direct hit on Florida's Atlantic coast … A week ahead of the storm's landfall, Linda M. Dillman, Wal-Mart's chief information officer, pressed her staff to come up with forecasts based on what had happened when Hurricane Charley struck several weeks earlier.¹
Why might data-driven predictions be useful in this scenario?
•7x increase in sales before hurricane
•Top-selling item was beer!
¹ What Wal-Mart Knows About Customer Habits, NYT (2004)

Example #2
“If we wanted to figure out if a customer is pregnant, even if she didn’t want us to know, can you do that?”¹
Why might Target want to know when you’re pregnant?
“My daughter got this in the mail! She’s still in high school … Are you trying to encourage her to get pregnant?”
3 days later …
“It turns out there’s been some activities in my house I haven’t been completely aware of. She’s due in August. I owe you an apology.”
¹ How Companies Learn Your Secrets, NYT (2012)

Lesson 1.2 - Machine Learning

What is machine learning?
Blue faces seem to be important…

What is machine learning really?
1950s: creation of first “intelligent” algorithms and programs
1980s: statistical models and algorithms that can learn from data
2010s: statistical models and algorithms inspired by neurones that can learn from data

Machine Learning Branches
3 Main Branches:
-Supervised Learning
-Unsupervised Learning
-Reinforcement Learning

Machine Learning Branches: Supervised Learning
In supervised learning the training data consists of input/output pairs and we train a function to map the inputs to the outputs.
Input: values, vectors, words, images, etc.
Output:
-Classification: categorical variable (A, B, C; dogs/cats)
-Regression: continuous variable (price/cost, weight, lifetime)

Supervised Learning: Classification
Classification: Assign categorical labels from a fixed set of labels to data samples.
Input Data: [images of bikes]
Output/Label: “Broken Bike”, “Normal Bike”
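
As a rough illustration of assigning labels from a fixed set, here is a minimal nearest-centroid classifier in plain Python. The slide does not say which algorithm or features are used, so the method, the two numeric features (wheel wobble and brake wear), and all data values are assumptions made for this sketch.

```python
from math import dist
from statistics import mean

# Hypothetical labelled training data: (wheel wobble in mm, brake wear in %) -> label
training = [
    ((0.5, 10), "Normal Bike"),
    ((0.8, 20), "Normal Bike"),
    ((1.0, 15), "Normal Bike"),
    ((4.0, 70), "Broken Bike"),
    ((5.5, 85), "Broken Bike"),
    ((6.0, 90), "Broken Bike"),
]

def fit_centroids(samples):
    """Training step: compute the mean feature vector of each label."""
    by_label = {}
    for features, label in samples:
        by_label.setdefault(label, []).append(features)
    return {label: tuple(mean(column) for column in zip(*rows))
            for label, rows in by_label.items()}

def classify(centroids, features):
    """Prediction step: pick the label whose centroid is closest (Euclidean distance)."""
    return min(centroids, key=lambda label: dist(centroids[label], features))

centroids = fit_centroids(training)
print(classify(centroids, (0.7, 12)))   # a bike with little wobble and wear
print(classify(centroids, (5.0, 80)))   # a bike with heavy wobble and wear
```

The output is always one of the labels seen during training, which is what distinguishes classification from regression.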

Supervised Learning: Regression
Regression: Find the relationship between one dependent variable and a series of other changing variables.
[Figure: concentration plotted against length of lecture]
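
The concentration-versus-lecture-length picture can be mimicked with a straight-line fit by ordinary least squares. The data points below are invented for illustration, and a straight line is only one possible choice of relationship; the slide does not specify a model.

```python
from statistics import mean

# Hypothetical observations: (length of lecture in minutes, concentration score 0-10)
observations = [(10, 9.0), (20, 8.1), (30, 7.2), (45, 5.5), (60, 4.1), (90, 1.8)]

def fit_line(points):
    """Ordinary least squares for y = intercept + slope * x."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope

intercept, slope = fit_line(observations)

def predict(minutes):
    """Predicted concentration for a lecture of the given length."""
    return intercept + slope * minutes

print(f"concentration ≈ {intercept:.2f} + ({slope:.3f}) * minutes")
print(f"predicted concentration after 40 minutes: {predict(40):.2f}")
```

Unlike the classifier, the prediction is a continuous number, and the fitted slope summarises how the dependent variable changes with the input.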

Machine Learning Branches: Unsupervised Learning
In unsupervised learning there are no labels available; insights are gained without* prior knowledge.
Input Data:
-Dimensionality Reduction
-Clustering
-Outlier Detection
-Generative Models
* Usually some model parameters need to be set ahead of training.

Unsupervised Learning: Anomaly/Outlier detection
Anomaly Detection: The task of finding
samples in a dataset that raise suspicion.
Problem: Usually, what exactly you are
looking for is unknown.
Solution: Use statistics and characteristics
of dataset to find outliers.
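
One simple way to "use statistics of the dataset" is a robust score based on the median and the median absolute deviation (MAD). This is a sketch of one common technique, not the method the slides go on to describe; the response-time values are invented, and the 0.6745 scaling constant and 3.5 cut-off are conventional choices that remain tunable parameters (compare the footnote about parameters that must be set ahead of training).

```python
from statistics import median

# Hypothetical response times in milliseconds; no labels are available.
response_times = [102, 98, 101, 97, 100, 99, 103, 250, 101, 100]

def find_outliers(values, threshold=3.5):
    """Flag values whose modified z-score (median/MAD based) exceeds the threshold."""
    centre = median(values)
    mad = median(abs(v - centre) for v in values)
    if mad == 0:
        return []  # no spread at all: this score cannot rank the points
    outliers = []
    for index, value in enumerate(values):
        score = 0.6745 * abs(value - centre) / mad
        if score > threshold:
            outliers.append((index, value))
    return outliers

outliers = find_outliers(response_times)
print(outliers)
```

The median and MAD are used instead of the mean and standard deviation because a single extreme value inflates the standard deviation enough to hide itself in a small sample.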

Unsupervised Learning: Anomaly/Outlier detection
[Poster]
MULDER: Unsupervised Anomaly Detection for Streaming Applications
A feasibility study in collaboration with Swiss Mobiliar
Leandro von Werra and Lewis Tunstall
SPOUD AG, Effingerstrasse 23, 3011 Bern, Switzerland

Motivation
At SPOUD we are building a Data Market that redefines the way enterprises access and integrate streaming business data within and outside company borders.
As part of our product development, we apply our expertise in event-processing technologies to solve real-world problems for our customers. One such customer is Swiss Mobiliar, whose employees rely on a very large network of applications to assist their customers with insurance policies, quotes, etc. These applications are monitored, where quantities such as workload and response time are recorded at frequent intervals.
Figure: Topology of Swiss Mobiliar’s application network, where each blue node corresponds to a component of an application. Communication between components is denoted by grey lines. The right panel shows an illustrative time series from a single node exhibiting positive and negative anomalies.
Key challenges:
- > 20,000 correlated time series
- > 2,000 events/s
- bottlenecks
- disruptions
These time series exhibit seasonality and trend, and yield positive and negative anomalies.
⇒ How to detect disruptions in monitored signals before they affect users or escalate?
To address these challenges we conducted a feasibility study with Swiss Mobiliar.

Anomaly Detection in Streaming Systems
Streaming data: unique challenges & constraints
Theoretical:
- No train/test split
- Full dataset not available
- Time-dependent definition of “normal”
Practical:
- High throughput
- Data labelling not viable
- No manual parameter tuning
⇒ Unsupervised & automated requirements of an ideal detector [1]:
- Detects all anomalies present in streaming data.
- Detects anomalies as soon as possible.
- Triggers no false alarms.
- Is fully automated across all streaming sources.
⇒ Requirements template for algorithm development

The Algorithm
The MULDER detector, developed at SPOUD, is based on the concept of surprise in a univariate time series [2]. Surprise is defined as the difference between the expected and the actual value of a given metric. Calculating the surprise at each time-step yields a secondary, “surprise” time series. Examples of stationary and periodic signals are shown below.
[Figure panels: Stationary Signal, Periodic Signal]
Figure: Calculation of surprise, where the expected value is obtained from the median of previous time steps at the same time of day.
Figure: Calculation of surprise, where the expected value is obtained by linear extrapolation of the past time-steps.

Anomaly detection via surprise percentiles
From the surprise time series, percentiles (10th and 90th) are tracked over time with a sliding window. With a 3σ test for outliers, the percentile time series are probed for anomalies. Tracking both the upper and lower percentiles enables the reliable detection of negative (dips), as well as positive (spikes) anomalies.
Figure: Surprise time series with corresponding percentile curves and anomaly threshold.

Validation Procedure
Numenta Anomaly Benchmark (NAB)
The NAB [1] consists of over 50 time series with annotated anomalies, along with several evaluation profiles to measure precision and recall as well as time to detect. The datasets range from AWS CPU utilisation to taxi usage in NYC; the timescales range from weeks to years. Performance on the NAB provides a powerful test for the generalisation capabilities of a given algorithm.
Figure: Sample NAB datasets. The left figure shows CPU utilisation, while the right shows temperature readings.
Synthetic Anomalies
In addition to the NAB validation, we collected production telemetry data from Swiss Mobiliar and injected synthetic anomalies. There are several types of anomalies available (gaussian, sawtooth, top-hat) and the amplitude and timespan are chosen at random. With this validation approach we measure the performance (precision/recall/F1-score) directly on the use case and calibrate the system to our customer’s needs.
Figure: Production data (teal) with synthetic anomaly injections (red).

Results
The MULDER detector performs well across both validation sets. On the NAB our algorithm outperforms those provided by Amazon and Twitter, while being computationally lightweight.
Algorithm                 Standard Profile   Reward Low FP   Reward Low FN
Numenta HTM               70.5               62.6            75.2
MULDER                    54.4               40.5            61.0
Random Cut Forest (AWS)   51.7               38.4            59.7
Twitter ADVec             47.1               33.6            53.5
Applying the MULDER detector on production data from Swiss Mobiliar showcases its ability to detect positive and negative anomalies.
Figure: Production data with time shifts of 1, 2 and 3 days. Detected anomalies shown in red.

Implementation
To deploy MULDER in production we are currently building a streaming system. Pre-aggregated events from a Kafka stream are consumed by a Flink pipeline that preprocesses them and prepares the state for the anomaly detection service. The results from the anomaly detector are then passed downstream, where they are used to source real-time dashboards and/or alerting services.

References
[1] A. Lavin and S. Ahmad, Evaluating Real-time Anomaly Detection Algorithms - the Numenta Anomaly Benchmark, in the 14th International Conference on Machine Learning and Applications (IEEE ICMLA’15), 2015.
[2] D. Goldberg and Y. Shan, The Importance of Features for Statistical Anomaly Detection, in the 7th USENIX Workshop on Hot Topics in Cloud Computing (HotCloud 15), 2015.

Acknowledgements
We thank Wilhelm Masero and the Need4Speed team at Swiss Mobiliar for a fruitful collaboration.

Anomaly Detection Use-Case: IT Infrastructure at Swiss Mobiliar
The goal is to detect disruptions in the network before they affect users or escalate.
This is an unsupervised task, since we don’t know what a disruption looks like.
-20’000 components communicating
-average latency and number of calls are logged
-ca. 2’000 events/second

Unsupervised Learning: Anomaly/Outlier detection
Anomaly Detection Use-Case: IT Infrastructure at Swiss Mobiliar
Introducing MULDER:
Anomaly detector for positive and negative anomalies
in time-series data.
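
The poster describes MULDER as computing a "surprise" (actual value minus expected value, with the expected value of a periodic signal taken from the median of earlier time steps at the same time of day) and then testing for outliers with a 3σ rule. The sketch below is not the MULDER implementation. It is a simplified illustration under stated assumptions: the 3σ test is applied directly to the surprise series instead of to tracked 10th/90th percentiles, and the function names, the window size, the `min_sigma` floor, and the synthetic signal are all hypothetical.

```python
from statistics import mean, median, pstdev

def surprise_series(values, period):
    """Actual minus expected, where expected is the median of earlier values
    at the same position within the period (for example the same time of day)."""
    surprises = []
    for t, actual in enumerate(values):
        history = values[t % period:t:period]   # same phase, earlier periods only
        if len(history) < 2:
            surprises.append(0.0)               # not enough history to form an expectation
        else:
            surprises.append(actual - median(history))
    return surprises

def flag_anomalies(surprises, window, n_sigma=3.0, min_sigma=1.0):
    """Flag time steps whose surprise lies more than n_sigma standard deviations
    from the mean of the preceding window. min_sigma is a hypothetical floor that
    keeps the test meaningful when the recent surprises are all identical."""
    flagged = []
    for t in range(window, len(surprises)):
        recent = surprises[t - window:t]
        mu = mean(recent)
        sigma = max(pstdev(recent), min_sigma)
        if abs(surprises[t] - mu) > n_sigma * sigma:
            flagged.append(t)
    return flagged

# Synthetic periodic signal (period of 4 steps) with one spike and one dip injected.
signal = [10, 20, 30, 20] * 12
signal[30] += 15   # positive anomaly (spike)
signal[41] -= 15   # negative anomaly (dip)

surprises = surprise_series(signal, period=4)
anomalies = flag_anomalies(surprises, window=8)
print(anomalies)
```

Because the sign of the surprise is kept, the same test picks up both spikes and dips, which mirrors the positive and negative anomalies mentioned on the slide.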

Part II: Deep Learning
Why now? In recent years two things became available:
1. A lot of data
2. Necessary compute

What is deep learning?
Rosenblatt - 1961

What is new in deep learning?
What is new (among other things) is a learning algorithm called backpropagation, which allows deep neural nets to be trained.
State-of-the-art networks can have over 200 layers!
GoogLeNet - 2014
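
Backpropagation is the chain rule applied layer by layer, from the loss back to each weight. The toy network below (one input, one tanh hidden unit, one linear output) is an assumption made purely for illustration; real deep nets have many units and layers, but the bookkeeping is the same. The sketch checks the hand-derived gradients against finite differences and then uses them for a few gradient-descent steps.

```python
from math import tanh

def forward(params, x):
    """Forward pass: hidden activation h and prediction y_hat."""
    w1, b1, w2, b2 = params
    h = tanh(w1 * x + b1)
    return h, w2 * h + b2

def loss(params, x, y):
    """Squared-error loss for a single training example."""
    _, y_hat = forward(params, x)
    return 0.5 * (y_hat - y) ** 2

def backward(params, x, y):
    """Backward pass: gradients of the loss w.r.t. [w1, b1, w2, b2] via the chain rule."""
    w1, b1, w2, b2 = params
    h, y_hat = forward(params, x)
    d_y_hat = y_hat - y            # dL/dy_hat
    d_w2 = d_y_hat * h             # output layer weight
    d_b2 = d_y_hat                 # output layer bias
    d_h = d_y_hat * w2             # error propagated back to the hidden unit
    d_z = d_h * (1 - h ** 2)       # through tanh: d tanh(z)/dz = 1 - tanh(z)^2
    return [d_z * x, d_z, d_w2, d_b2]

def numerical_gradient(params, x, y, eps=1e-6):
    """Central finite differences, used only to check the backward pass."""
    grads = []
    for i in range(len(params)):
        up, down = list(params), list(params)
        up[i] += eps
        down[i] -= eps
        grads.append((loss(up, x, y) - loss(down, x, y)) / (2 * eps))
    return grads

params = [0.5, -0.1, 0.8, 0.2]     # arbitrary starting weights
x, y = 1.5, 1.0                    # a single hypothetical training example

max_gap = max(abs(a - n) for a, n in
              zip(backward(params, x, y), numerical_gradient(params, x, y)))

initial_loss = loss(params, x, y)
for _ in range(50):                # plain gradient descent with learning rate 0.1
    grads = backward(params, x, y)
    params = [p - 0.1 * g for p, g in zip(params, grads)]
final_loss = loss(params, x, y)

print(f"largest gap between backprop and numerical gradients: {max_gap:.1e}")
print(f"loss before training: {initial_loss:.4f}, after: {final_loss:.6f}")
```

Each extra layer adds one more "propagate the error, multiply by the local derivative" step to the backward pass, which is why the same procedure scales to networks with many layers.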

Difference classical ML vs. Deep Learning
Classical ML methods don’t handle high dimensionality well, so they rely on dimensionality reduction & feature selection.
Deep neural nets learn compact representations of data even in a high dimensionality/sparse setting - no feature engineering required!
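
As a small example of the manual feature selection that classical pipelines often need, the sketch below drops near-constant columns using a variance threshold. The dataset, the threshold value, and the function name are hypothetical, and variance filtering is only one of many selection techniques.

```python
from statistics import pvariance

# Hypothetical dataset: each row is a sample, each column a feature.
samples = [
    [1.0, 0.0, 5.1, 200.0],
    [1.0, 0.0, 4.9, 180.0],
    [1.0, 0.1, 6.2, 220.0],
    [1.0, 0.0, 5.8, 150.0],
]

def select_by_variance(rows, threshold):
    """Keep only the columns whose variance exceeds the threshold."""
    columns = list(zip(*rows))
    keep = [i for i, column in enumerate(columns) if pvariance(column) > threshold]
    reduced = [[row[i] for i in keep] for row in rows]
    return keep, reduced

# Note: raw variance depends on each feature's scale, so features are usually
# rescaled first or given thresholds chosen per feature.
keep, reduced = select_by_variance(samples, threshold=0.01)
print("kept feature indices:", keep)
print("reduced first row:", reduced[0])
```

The first column is constant and the second barely changes, so neither can help a model tell samples apart; the remaining two columns are kept.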

Unsupervised Learning: Generative Adversarial Nets (GAN)
None of these images were taken in the real world!
NVIDIA

Unsupervised Learning: Generative Adversarial Nets (GAN)
Which of these images was generated?
DeepMind - 2018

Unsupervised Learning: Language Generation
OpenAI, February 14 2019: Better Language Models and Their Implications
www.talktotransformer.com

So why not use Deep Learning for everything?
There are reasons why we don’t only use DL:
-Necessary data not available
-Computational power not available
-Harder to interpret results
-Deep networks can be fooled: