Added: Jul 12, 2024
Slides: 43 pages
Lewis Tunstall | Data Scientist
Leandro von Werra | Data Scientist
BTW2401 - Data Science
Lesson 1.1 - The Big Picture
Overall Course Goals
By the end of this course you will:
• Know how to approach business problems from a data science perspective
• Understand the fundamental principles behind extracting useful knowledge from data
• Gain hands-on experience with mining data for insights
Skills Overview
In this course you are going to learn several skills:
• Python programming and core libraries for data analysis, visualisation, and modelling
• Working with data: collecting, cleaning, transforming
• Creating and interpreting descriptive statistics
• Creating and interpreting data visualisations
• Creating statistical models for inference
• Practical machine learning
Course Materials
This course is largely based on the excellent textbook:
• Data Science for Business, F. Provost & T. Fawcett (O'Reilly Media, Sebastopol, 2013).
Other useful references include:
• Hands-On Machine Learning with Scikit-Learn and TensorFlow, A. Géron (O'Reilly Media, 2017)
• Introduction to Machine Learning for Coders, fast.ai (http://course18.fast.ai/ml)
Timetable and Key Dates
Week No.  Date   Topic
8         20.02  Introduction to data science
9         27.02  Python for data analysis I
10        5.03   Python for data analysis II
11        12.03  Introduction to random forests
12        19.03  Random forest deep dive
13        26.03  Model interpretation
14        2.04   Classification
15        9.04   No class (Easter)
16        16.04  Midterm exam & define group projects
17        23.04  Cross-validation and model performance
18        30.04  Neighbours and clusters I
19        7.05   Neighbours and clusters II
20        14.05  Natural language processing I
21        21.05  No class (Ascension) & project submission
22        28.05  Natural language processing II
23        4.06   Project presentations
24        11.06  Deep learning
25        18.06  Exam Preparation
26        25.06  Exam Preparation
27        TBD    Final Exam
About Us: Backgrounds
Lewis
• PhD in theoretical physics from Adelaide, Australia
• Postdoctoral researcher in Switzerland (University of Bern)
• 2+ years working in industry
• Expertise in machine learning & mathematics
Leandro
• MSc in computational physics from ETH
• 2+ years working in industry
• Focus on application of machine learning to big data
About Us: What We Actually Do
Raw data, little value → Data exploration → Analysis, model building → Reporting, automation
This Lesson
We aim to answer the following 3 questions:
• What is data science?
• Why is it important?
• How is data science performed?
To answer these questions, we will focus on a series of trends that are driving the data science “revolution”:
• Big Data
• Machine Learning
We finish with:
• A mini (ungraded!) quiz
• Onboarding of Python, Kaggle, and Paperspace
What is Data Science?
It’s a surprisingly hard definition to nail down:
• Superfluous?
• Buzzword?
You know it’s serious when your field makes it onto Gartner’s hype cycle¹
[Figure: hype cycle, labelled with Python, Deep/Machine Learning, and Predictive Analytics]
¹ https://en.wikipedia.org/wiki/Hype_cycle
What is Data Science?
Despite the hype, useful definitions exist¹:
Data science is about the extraction of useful information and knowledge from large volumes of data, in order to improve business decision-making
It is an interdisciplinary subject with 3 key areas:
• Statistics
• Computer science
• Domain expertise
[Figure: diagram of the three areas, labelled Data Science]
¹ Provost & Fawcett, Chapter 1
Why is Data Science Important?
In the past, data analysis was typically slow: it needed teams of statisticians, analysts, etc. to explore data manually.
Today: volume, velocity, and variety make manual analysis impossible …
Big Data: The Large Hadron Collider at CERN
• 150 million sensors delivering data 40 million times per second.
• There are nearly 600 million collisions per second.
• Only 100 collisions of interest per second.
• Raw data production exceeds 500 exabytes per day (1 EB = 1 million TB).
• Due to filtering, only 200 petabytes are generated annually (1 PB = 1000 TB).
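To get a feel for how aggressive the filtering is, the quoted figures can be combined in a back-of-the-envelope calculation. This is only a rough sketch: it treats "exceeds 500 exabytes per day" as exactly 500 and assumes the detector produces data 365 days a year, neither of which is stated on the slide.

```python
# Back-of-the-envelope check of the LHC figures quoted above.
EB_PER_DAY_RAW = 500        # raw data production, exabytes per day (lower bound)
PB_PER_YEAR_KEPT = 200      # data kept after filtering, petabytes per year
PB_PER_EB = 1000            # 1 EB = 1000 PB (= 1 million TB)
DAYS_PER_YEAR = 365         # assumption: data is produced every day of the year

raw_pb_per_year = EB_PER_DAY_RAW * PB_PER_EB * DAYS_PER_YEAR
reduction_factor = raw_pb_per_year / PB_PER_YEAR_KEPT

collisions_per_second = 600_000_000
interesting_per_second = 100
collisions_per_interesting = collisions_per_second // interesting_per_second

print(f"Raw data per year: {raw_pb_per_year:,} PB")
print(f"Filtering keeps roughly 1 part in {reduction_factor:,.0f}")
print(f"Collisions of interest: about 1 in {collisions_per_interesting:,}")
```

Under these assumptions, less than a millionth of the raw data survives filtering, which is why manual analysis is out of the question.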
Why is Data Science Important?
Today's volume, velocity, and variety make manual analysis impossible, but fast computers and good algorithms allow much deeper analyses than before.
⇒ data-driven decision making
Cartoon from Provost & Fawcett, Chapter 1
⇒ base decisions on analysis of data, not intuition
The Data Science Process
How is data science performed?¹
1. Find a question
2. Collect the data
3. Prepare the data
4. Create a model
5. Evaluate the model
6. Deploy the model
• Iterative process
• Non-sequential
• Early termination
• Established processes, e.g. CRISP-DM (https://bit.ly/1tX6508)
¹ From Data Science: The Big Picture, M. Renze, Pluralsight
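The six steps of the process can be sketched end to end in a few lines of plain Python. Everything in this example is hypothetical: the question (do hours studied predict an exam score?), the synthetic data, and the function names are illustrations, not part of the course material. In practice the steps are revisited many times rather than run once in order.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# 1. Find a question (hypothetical): can hours studied predict an exam score?

# 2. Collect the data: synthetic records stand in for a real data source.
def collect(n=200):
    rows = []
    for _ in range(n):
        hours = random.uniform(0, 10)
        score = 40 + 5 * hours + random.gauss(0, 3)
        # Simulate messy data: roughly 10% of the scores are missing.
        rows.append((hours, score if random.random() > 0.1 else None))
    return rows

# 3. Prepare the data: drop incomplete rows, then split into train and test.
def prepare(rows, test_fraction=0.25):
    clean = [(h, s) for h, s in rows if s is not None]
    split = int(len(clean) * (1 - test_fraction))
    return clean[:split], clean[split:]

# 4. Create a model: least-squares line, score = intercept + slope * hours.
def fit(train):
    xs, ys = zip(*train)
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in train) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return slope, mean_y - slope * mean_x

# 5. Evaluate the model on data it has not seen (mean absolute error).
def evaluate(model, test):
    slope, intercept = model
    return statistics.fmean(abs(s - (intercept + slope * h)) for h, s in test)

# 6. Deploy the model: here "deployment" is just a function others can call.
def predict(model, hours):
    slope, intercept = model
    return intercept + slope * hours

rows = collect()
train, test = prepare(rows)
model = fit(train)
mae = evaluate(model, test)
print(f"slope={model[0]:.2f}, intercept={model[1]:.2f}, test MAE={mae:.2f}")
```

If the evaluation in step 5 is poor, the natural next move is to go back to an earlier step (better data, different model), which is what "iterative" and "non-sequential" mean here.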
Example #1
Hurricane Frances was on its way, barreling across the Caribbean, threatening a direct hit on Florida's Atlantic coast … A week ahead of the storm's landfall, Linda M. Dillman, Wal-Mart's chief information officer, pressed her staff to come up with forecasts based on what had happened when Hurricane Charley struck several weeks earlier.¹
Why might data-driven predictions be useful in this scenario?
• 7x increase in sales before hurricane
• Top-selling item was beer!
¹ What Wal-Mart Knows About Customer Habits, NYT (2004)
Example #2
“If we wanted to figure out if a customer is pregnant, even if she didn’t want us to know, can you do that?”¹
Why might Target want to know when you’re pregnant?
“My daughter got this in the mail! She’s still in high school … Are you trying to encourage her to get pregnant?”
3 days later …
“It turns out there’s been some activities in my house I haven’t been completely aware of. She’s due in August. I owe you an apology.”
¹ How Companies Learn Your Secrets, NYT (2012)
Lewis Tunstall | Data Scientist
Leandro von Werra | Data Scientist
BTW2401 - Data Science
Lesson 1.2 - Machine Learning
What is machine learning?
Blue faces seem to be important…
What is machine learning really?
1950s: creation of the first “intelligent” algorithms and programs
1980s: statistical models and algorithms that can learn from data
2010s: statistical models and algorithms inspired by neurones that can learn from data
Machine Learning Branches: Supervised Learning
In supervised learning the training data consists of input/output pairs and we train a function to map the inputs to the outputs.
Input: values, vectors, words, images, etc.
• Classification (categorical variable): A, B, C; dogs/cats
• Regression (continuous variable): price/cost, weight, lifetime
Supervised Learning: Classification
Classification: Assign categorical labels from a fixed set of labels to data samples.
Input Data: [images]
Output/Label: “Broken Bike” or “Normal Bike”
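As a minimal illustration of classification, the sketch below assigns one of the two bike labels using a 1-nearest-neighbour rule. The slide uses images as input; the two numeric features here (wheel wobble and brake response time) and all data values are hypothetical stand-ins chosen to keep the example short.

```python
import math

# Hypothetical training data: (wheel wobble in mm, brake response in s) -> label.
training_data = [
    ((0.5, 0.20), "Normal Bike"),
    ((0.8, 0.25), "Normal Bike"),
    ((0.6, 0.22), "Normal Bike"),
    ((4.0, 0.90), "Broken Bike"),
    ((5.5, 1.10), "Broken Bike"),
    ((4.8, 0.80), "Broken Bike"),
]

def classify(sample):
    """Return the label of the closest training sample (1-nearest neighbour)."""
    _, nearest_label = min(
        training_data, key=lambda pair: math.dist(pair[0], sample)
    )
    return nearest_label

print(classify((0.7, 0.21)))
print(classify((5.0, 1.00)))
```

The output is always one of the labels seen in training, which is what "a fixed set of labels" means in the definition above.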
Supervised Learning: Regression
Regression: Find the relationship between one dependent variable and a series of other changing variables.
[Figure: concentration plotted against length of lecture]
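A simple instance of regression is fitting a straight line by ordinary least squares. The sketch below does this for the concentration versus lecture length example; the observations are made up for illustration and the linear form is an assumption, not something the slide specifies.

```python
# Hypothetical observations: lecture length in minutes -> concentration score (0-100).
lengths = [10, 20, 30, 40, 50, 60]
concentration = [95, 90, 80, 72, 61, 52]

n = len(lengths)
mean_x = sum(lengths) / n
mean_y = sum(concentration) / n

# Ordinary least squares for a single input variable.
covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(lengths, concentration))
variance = sum((x - mean_x) ** 2 for x in lengths)
slope = covariance / variance
intercept = mean_y - slope * mean_x

def predict(length_minutes):
    """Predicted concentration for a lecture of the given length."""
    return intercept + slope * length_minutes

print(f"concentration = {intercept:.1f} + ({slope:.3f}) * length")
print(f"predicted concentration after 45 minutes: {predict(45):.1f}")
```

Unlike classification, the prediction is a continuous value, so the model can answer for lecture lengths that never appeared in the data.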
Machine Learning Branches
Machine Learning Branches: Unsupervised Learning
In unsupervised learning there are no labels available; insights are gained without* prior knowledge.
Input data:
• Dimensionality reduction
• Clustering
• Outlier detection
• Generative models
* Usually some model parameters need to be set ahead of training.
Unsupervised Learning: Anomaly/Outlier detection
Anomaly Detection: The task of finding
samples in a dataset that raise suspicion.
Problem: Usually, what exactly you are
looking for is unknown.
Solution: Use statistics and characteristics
of dataset to find outliers.
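One common way to "use statistics of the dataset" is the interquartile-range rule, sketched below. The response times are hypothetical, and the 1.5 × IQR fence is a widely used convention rather than anything prescribed by the slide.

```python
import statistics

# Hypothetical response times in milliseconds; most are close to 100 ms.
response_times = [98, 102, 101, 97, 99, 103, 100, 250, 96, 104, 101, 99, 12, 100, 102]

# Quartiles of the data; no labels are needed, only the data itself.
q1, _, q3 = statistics.quantiles(response_times, n=4)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [x for x in response_times if x < lower_fence or x > upper_fence]
print(f"fences: [{lower_fence}, {upper_fence}], outliers: {outliers}")
```

Note that the rule flags both unusually high and unusually low values without being told in advance what an anomaly looks like.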
Unsupervised Learning: Anomaly/Outlier detection
Unsupervised & automated requirements of an ideal detector [1]:
MULDER: Unsupervised Anomaly Detection for Streaming Applications
Leandro von Werra and Lewis Tunstall
SPOUD AG, Effingerstrasse 23, 3011 Bern, Switzerland
Results
The MULDER detector performs well across both validation sets. On the NAB our
algorithm outperforms those provided by Amazon and Twitter, while being
computationally lightweight.
Applying the MULDER detector on production data from Swiss Mobiliar showcases its
ability to detect positive and negative anomalies.
Implementation
To deploy MULDER in production we are currently building a streaming system. Pre-
aggregated events from a Kafka stream are consumed by a Flink pipeline that
preprocesses them and prepares the state for the anomaly detection service. The results
from the anomaly detector are then passed downstream, where they are used to source
real-time dashboards and/or alerting services.
Algorithm                 Standard Profile   Reward Low FP   Reward Low FN
Numenta HTM               70.5               62.6            75.2
MULDER                    54.4               40.5            61.0
Random Cut Forest (AWS)   51.7               38.4            59.7
Twitter ADVec             47.1               33.6            53.5
References
[1] A. Lavin and S. Ahmad, Evaluating Real-time Anomaly Detection Algorithms - the Numenta
Anomaly Benchmark, in the 14th International Conference on Machine Learning and
Applications (IEEE ICMLA’15), 2015.
[2] D. Goldberg and Y. Shan, The Importance of Features for Statistical Anomaly Detection, in
the 7th USENIX Workshop on Hot Topics in Cloud Computing (HotCloud 15), 2015.
Acknowledgements
We thank Wilhelm Masero and the Need4Speed team at Swiss Mobiliar for a fruitful
collaboration.
Figure: Topology of Swiss Mobiliar’s application network, where each blue node corresponds to a
component of an application. Communication between components is denoted by grey lines. The right
panel shows an illustrative time series from a single node exhibiting positive and negative anomalies.
Motivation
At SPOUD we are building a Data Market that redefines the way enterprises access and
integrate streaming business data within and outside company borders.
As part of our product development, we apply our expertise in event-processing
technologies to solve real-world problems for our customers. One such customer is
Swiss Mobiliar, whose employees rely on a very large network of applications to assist
their customers with insurance policies, quotes, etc. These applications are monitored,
where quantities such as workload and response time are recorded at frequent
intervals.
These time series exhibit seasonality and trend, and yield positive and negative anomalies.
Key challenges:
• bottlenecks
• disruptions
> 20,000 correlated time series
> 2,000 events/s
→ how to detect disruptions in monitored signals before they affect users or escalate?
To address these challenges we conducted a feasibility study with Swiss Mobiliar.
Anomaly Detection in Streaming Systems
Streaming data poses unique challenges & constraints:
Theoretical
• No train/test split
• Full dataset not available
• Time-dependent definition of “normal”
Practical
• High throughput
• Data labelling not viable
• No manual parameter tuning
An ideal detector [1]:
• Detects all anomalies present in streaming data.
• Detects anomalies as soon as possible.
• Triggers no false alarms.
• Is fully automated across all streaming sources.
⇒ Requirements template for algorithm development
The Algorithm
The MULDER detector, developed at SPOUD, is based on the concept of surprise in a
univariate time series [2]. Surprise is defined as the difference between the expected and
the actual value of a given metric. Calculating the surprise at each time-step yields a
secondary, “surprise” time series. Examples of stationary and periodic signals are shown below.
Validation Procedure
Numenta Anomaly Benchmark (NAB)
The NAB [1] consists of over 50 time series with annotated anomalies, along with several
evaluation profiles to measure precision and recall as well as time to detect. The
datasets range from AWS CPU utilisation to taxi usage in NYC; the timescales range from
weeks to years. Performance on the NAB provides a powerful test for the generalisation
capabilities of a given algorithm.
Stationary Signal Periodic Signal
Figure: Calculation of surprise, where the
expected value is obtained from the median of
previous time steps at the same time of day.
Figure: Calculation of surprise, where the expected
value is obtained by linear extrapolation of the past
time-steps.
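The two ways of computing the expected value described in the captions can be sketched as follows. This is not the MULDER implementation: the sign convention (actual minus expected), the use of only two past points for the linear extrapolation, and the function names are assumptions made for illustration.

```python
import statistics

def surprise_periodic(series, period):
    """Surprise = actual - median of earlier values at the same phase of the period."""
    surprise = []
    for t, actual in enumerate(series):
        history = series[t % period:t:period]  # same time of day on previous days
        expected = statistics.median(history) if history else actual
        surprise.append(actual - expected)
    return surprise

def surprise_extrapolated(series):
    """Surprise = actual - straight-line extrapolation of the two previous values."""
    surprise = [0.0] * min(2, len(series))  # no expectation for the first two steps
    for t in range(2, len(series)):
        expected = 2 * series[t - 1] - series[t - 2]
        surprise.append(series[t] - expected)
    return surprise

# Hypothetical periodic signal: three "days" of four steps, with a spike on day three.
daily_pattern = [10, 20, 30, 20]
periodic = daily_pattern * 3
periodic[9] += 15
periodic_surprise = surprise_periodic(periodic, period=4)

# Hypothetical trending signal with a spike at index 4.
trending = [1, 2, 3, 4, 10, 6, 7]
trend_surprise = surprise_extrapolated(trending)

print(periodic_surprise)
print(trend_surprise)
```

In both cases the regular structure of the signal (seasonality or trend) is removed, so the surprise series stays near zero until something unexpected happens.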
Anomaly detection via surprise percentiles
From the surprise time series, percentiles
(10th and 90th) are tracked over time with a
sliding window. With a 3σ test for outliers,
the percentile time series are probed for
anomalies. Tracking both the upper and
lower percentiles enables the reliable
detection of negative (dips), as well as
positive (spikes) anomalies.
Figure: Surprise time series with
corresponding percentile curves and
anomaly threshold.
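A rough sketch of the percentile-tracking idea is given below. The poster does not specify window lengths or exactly how the 3σ test is applied, so the window of 20 steps, the history of 50 percentile values, and the helper names are all assumptions; the surprise series is synthetic.

```python
import statistics

def rolling_percentiles(surprise, window):
    """10th and 90th percentile of the last `window` surprise values at each step."""
    lower, upper = [], []
    for end in range(window, len(surprise) + 1):
        deciles = statistics.quantiles(surprise[end - window:end], n=10)
        lower.append(deciles[0])    # 10th percentile
        upper.append(deciles[-1])   # 90th percentile
    return lower, upper

def three_sigma_flags(values, history):
    """Flag values more than 3 standard deviations from the mean of the preceding values."""
    flags = [False] * len(values)
    for i in range(history, len(values)):
        past = values[i - history:i]
        mean = statistics.fmean(past)
        std = statistics.pstdev(past)
        flags[i] = abs(values[i] - mean) > 3 * max(std, 1e-9)
    return flags

# Hypothetical surprise series: bounded deterministic noise, one spike and one dip.
surprise = [((7 * t) % 11 - 5) / 5 for t in range(300)]
for t in range(200, 210):
    surprise[t] += 8.0
for t in range(250, 260):
    surprise[t] -= 8.0

WINDOW, HISTORY = 20, 50
lower, upper = rolling_percentiles(surprise, WINDOW)
spike_flags = three_sigma_flags(upper, HISTORY)  # positive anomalies
dip_flags = three_sigma_flags(lower, HISTORY)    # negative anomalies

# Entry k of a percentile series describes the window ending at time step k + WINDOW - 1.
spike_times = [k + WINDOW - 1 for k, flag in enumerate(spike_flags) if flag]
dip_times = [k + WINDOW - 1 for k, flag in enumerate(dip_flags) if flag]
print("spike flagged at:", spike_times[:3], "dip flagged at:", dip_times[:3])
```

Because a percentile ignores a single extreme value in the window, this sketch reacts only once an anomaly lasts for at least two time steps, which trades a small detection delay for fewer false alarms.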
Synthetic Anomalies
In addition to the NAB validation, we collected production telemetry data from Swiss
Mobiliar and injected synthetic anomalies. There are several types of anomalies available
(Gaussian, sawtooth, top-hat) and the amplitude and timespan are chosen at random.
With this validation approach we measure the performance (precision/recall/F1-score)
directly on the use case and calibrate the system to our customer’s needs.
Figure: Production data with time shifts of 1, 2 and 3 days. Detected anomalies shown in red.
Figure: Sample NAB datasets. The left figure shows CPU utilisation, while the right shows temperature
readings.
Figure: Production data (teal) with synthetic anomaly injections (red).
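The synthetic-anomaly validation can be illustrated with a small sketch: inject a top-hat anomaly at a known position, run a detector, and score it against the known labels. The flat baseline, the fixed amplitude and timespan (the poster chooses them at random), and the simple threshold detector are placeholders, not the setup used in the study.

```python
def inject_top_hat(series, start, length, amplitude):
    """Return a copy of `series` with a rectangular anomaly added, plus the true labels."""
    injected = list(series)
    labels = [False] * len(series)
    for t in range(start, min(start + length, len(series))):
        injected[t] += amplitude
        labels[t] = True
    return injected, labels

def precision_recall_f1(labels, predictions):
    """Score predictions against known anomaly labels."""
    tp = sum(l and p for l, p in zip(labels, predictions))
    fp = sum((not l) and p for l, p in zip(labels, predictions))
    fn = sum(l and (not p) for l, p in zip(labels, predictions))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical telemetry: flat baseline with one natural bump that is not an anomaly.
series = [10.0] * 50
series[40] = 13.0

injected, labels = inject_top_hat(series, start=20, length=5, amplitude=6.0)

# Placeholder detector: a fixed threshold stands in for the real algorithm.
predictions = [value > 12.0 for value in injected]

precision, recall, f1 = precision_recall_f1(labels, predictions)
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```

Here the detector finds every injected point but also raises one false alarm on the natural bump, so recall is perfect while precision is not.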
A feasibility study in collaboration with Swiss Mobiliar
Anomaly Detection Use-Case: IT Infrastructure at Swiss Mobiliar
The goal is to detect disruptions in the network before they affect users or escalate.
This is an unsupervised task, since we don’t know what a disruption looks like.
- 20’000 components communicating
- average latency and number of calls are logged
- ca. 2’000 events/second
Unsupervised Learning: Anomaly/Outlier detection
Anomaly Detection Use-Case: IT Infrastructure at Swiss Mobiliar
Introducing MULDER:
Anomaly detector for positive and negative anomalies
in time-series data.
Machine Learning Branches
Part II: Deep Learning
Why now? In recent years two things became available:
1. A lot of data
2. Necessary compute
What is deep learning?
Rosenblatt - 1961
What is new in deep learning?
What is new (among other things) is a learning algorithm called backpropagation, which makes it possible to train deep neural nets.
State-of-the-art networks can have over 200 layers!
GoogLeNet - 2014
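To make backpropagation concrete, the sketch below applies the chain rule to a tiny network with one hidden layer of two tanh units, checks one gradient numerically, and takes a few gradient-descent steps. The network size, weights, learning rate, and training sample are arbitrary choices for illustration; real deep networks use the same principle with many more layers and parameters.

```python
import math

def forward(params, x):
    """Two-layer network: hidden = tanh(w1 * x + b1), output = w2 . hidden + b2."""
    w1, b1, w2, b2 = params
    hidden = [math.tanh(w * x + b) for w, b in zip(w1, b1)]
    output = sum(w * h for w, h in zip(w2, hidden)) + b2
    return hidden, output

def loss(params, x, target):
    _, output = forward(params, x)
    return 0.5 * (output - target) ** 2

def backprop(params, x, target):
    """Gradients of the loss, applying the chain rule from the output back to the input."""
    w1, b1, w2, b2 = params
    hidden, output = forward(params, x)
    d_output = output - target                                    # dL/d(output)
    grad_w2 = [d_output * h for h in hidden]
    grad_b2 = d_output
    d_hidden = [d_output * w for w in w2]                         # error sent back to hidden layer
    d_pre = [d * (1 - h ** 2) for d, h in zip(d_hidden, hidden)]  # through tanh
    grad_w1 = [d * x for d in d_pre]
    grad_b1 = d_pre
    return grad_w1, grad_b1, grad_w2, grad_b2

def gradient_step(params, grads, learning_rate):
    w1, b1, w2, b2 = params
    g_w1, g_b1, g_w2, g_b2 = grads
    return (
        [w - learning_rate * g for w, g in zip(w1, g_w1)],
        [b - learning_rate * g for b, g in zip(b1, g_b1)],
        [w - learning_rate * g for w, g in zip(w2, g_w2)],
        b2 - learning_rate * g_b2,
    )

# Arbitrary starting weights and a single training example.
params = ([0.5, -0.3], [0.1, 0.2], [0.7, -0.4], 0.05)
x, target = 1.5, 1.0

# Numerical check of one gradient entry using a central finite difference.
eps = 1e-6
bumped_up = ([0.5 + eps, -0.3], params[1], params[2], params[3])
bumped_down = ([0.5 - eps, -0.3], params[1], params[2], params[3])
numeric_grad = (loss(bumped_up, x, target) - loss(bumped_down, x, target)) / (2 * eps)
analytic_grad = backprop(params, x, target)[0][0]

initial_loss = loss(params, x, target)
trained = params
for _ in range(50):
    trained = gradient_step(trained, backprop(trained, x, target), learning_rate=0.1)
final_loss = loss(trained, x, target)
print(f"analytic={analytic_grad:.6f} numeric={numeric_grad:.6f}")
print(f"loss before={initial_loss:.4f} after={final_loss:.6f}")
```

The key point is that one backward pass yields the gradient for every weight at once, which is what keeps training feasible as networks grow deeper.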
Difference classical ML vs. Deep Learning
Classical ML methods don’t handle high dimensionality well.
→ dimensionality reduction & feature selection
Deep neural nets learn compact representations of data even in a high-dimensionality/sparse setting - no feature engineering required!
Unsupervised Learning: Generative Adversarial Nets (GAN)
None of these images were taken in the real world!
NVIDIA
Unsupervised Learning: Generative Adversarial Nets (GAN)
Which of these images was generated?
DeepMind - 2018
Unsupervised Learning: Language Generation
OpenAI, February 14 2019: Better Language Models and Their Implications
www.talktotransformer.com
So why not use Deep Learning for everything?
There are reasons why we don’t only use DL:
- Necessary data not available
- Computational power not available
- Harder to interpret results
- Deep networks can be fooled: