Realtime anomaly detection in surveillance data.pptx

KingrockPeter 15 views 18 slides Jul 16, 2024

About This Presentation

The complexity of both natural and technological systems has reduced the ability of humans to monitor, detect and fix anomalies before they occur and in real-time. In this talk, I examine types of anomalies and the different machine learning methods that can be applied to detect the anomalies in tim...


Slide Content

Real-time anomaly detection in disease surveillance data
Dr. Peter Eze [email protected] (with Ivo Mueller, Nic Geard and Iadine Chades)
Research Fellow, AI for Decision Support
School of Computing and Information Systems, Faculty of Engineering and Information Technology
University of Melbourne, Australia
23rd May, 2022

Background and Problems
Over time, endemic diseases get neglected despite the surveillance data collected, which, together with other factors, increases the time to disease elimination. Hence, endemic diseases require automated anomaly detection to trigger investigations and interventions.
[Figure: A unique interplay of the biological, environmental and social factors that allow malaria to flourish]
Disease   Year   Infections    Deaths
Malaria   2018   228 million   405,000

Questions
What patterns can we find from malaria surveillance data? In particular:
- How can anomalies in reported malaria case data be detected automatically?
- How can possible epidemiological interpretations be provided for detected anomalies?
- How can the interpretations be used to stratify risk and ensure dynamic spatio-temporal intervention targeting?

Anomalies (or outliers) are observations that deviate so much from current expectation as to arouse suspicion that they were generated by a different mechanism (Hagemann & Katsarou, 2020; P. Bhattacharjee, A. Garg & P. Mitra, 2021).

Data source and features for anomaly detection
We transformed the Brazilian Amazon malaria time series data to help detect anomalies in testing and incidence rate.
Features: number of tests, positive cases, negatives, proportion of positives.
Source of dataset: https://www.synapse.org/##!Synapse:syn21555933

Time series data stratified by health regions
We chose the Para state in Brazil and stratified the data into 13 health regions in the state.
Proportion of positives = positive cases / number of tests
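The stratification step can be sketched as below. This is a minimal illustration, assuming the proportion of positives is positive cases divided by number of tests; the record layout, the helper names and all the numbers are hypothetical and are not taken from the SIVEP-Malaria dataset.

```python
from collections import defaultdict

# Hypothetical records: (health_region, time_step, positive_cases, number_of_tests).
# The values are invented for illustration only.
records = [
    ("ARAGUAIA", 1, 12, 100),
    ("ARAGUAIA", 2, 30, 120),
    ("CARAJAS", 1, 5, 50),
    ("CARAJAS", 2, 0, 0),  # no tests reported in this period
]


def proportion_of_positives(positive_cases, number_of_tests):
    """Return positives / tests, or None when no tests were reported."""
    if number_of_tests == 0:
        return None
    return positive_cases / number_of_tests


def stratify_by_region(rows):
    """Build one time-ordered series of proportions per health region."""
    series = defaultdict(list)
    for region, time_step, positives, tests in sorted(rows, key=lambda r: (r[0], r[1])):
        series[region].append((time_step, proportion_of_positives(positives, tests)))
    return dict(series)


region_series = stratify_by_region(records)
print(region_series["ARAGUAIA"])
```

Returning None for periods without tests keeps missing reporting distinct from a true proportion of zero, which matters when the series is later screened for declines.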

Methods
Time series data are composed of trends, seasonality, holidays and error terms (irregularities).
Under the additive modeling approach, a time series y(t) is given as (S.J. Taylor and B. Letham, 2017):
y(t) = g(t) + s(t) + h(t) + e(t)
g(t) = trend (changes over a long period of time)
s(t) = seasonality (periodic or short-term changes)
h(t) = effects of holidays on the forecast
e(t) = error term (unconditional changes that are specific to a circumstance)
Most models represent different aspects of time series well.
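To make the additive form concrete, the sketch below builds a synthetic y(t) from the four components and shows that averaging over one full seasonal period cancels s(t), leaving the trend plus a constant holiday share. The component shapes, the 12-step period and every coefficient are assumptions chosen for illustration; this is not Prophet's fitting procedure.

```python
import math

PERIOD = 12  # assumed seasonal period, e.g. 12 monthly time steps


def trend(t):
    """g(t): a slow linear decline (hypothetical coefficients)."""
    return 0.30 - 0.001 * t


def seasonality(t):
    """s(t): one sine cycle per period."""
    return 0.05 * math.sin(2 * math.pi * t / PERIOD)


def holiday(t):
    """h(t): a bump in one hypothetical holiday step per period."""
    return 0.02 if t % PERIOD == 11 else 0.0


def additive_series(n, errors):
    """y(t) = g(t) + s(t) + h(t) + e(t) for t = 0 .. n-1."""
    return [trend(t) + seasonality(t) + holiday(t) + errors[t] for t in range(n)]


def period_mean(values, start):
    """Mean over one full period; the seasonal term sums to zero."""
    window = values[start:start + PERIOD]
    return sum(window) / PERIOD


errors = [0.0] * 36  # e(t) set to zero so the arithmetic is easy to follow
y = additive_series(36, errors)
first_period = period_mean(y, 0)
second_period = period_mean(y, PERIOD)
# The difference isolates the trend: 12 steps at -0.001 per step.
print(round(first_period - second_period, 6))
```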

Methods
Discovering patterns and anomalies using multiple machine learning algorithms: Facebook Prophet, LSTM.

Non-parametric models determine anomalies based on locally fitted models using weighted local data points
- Local linear/non-linear regressions.
- Locality is defined within a sliding window of length n.
- An upper and a lower tolerance limit bound the local fit. The limits are defined by either a confidence level or a number of standard deviations (n_sigma).
- Data points that lie outside the limits are detected as anomalies.
- Confidence level: 0-1
- n_sigma (σ): 1-6
The criteria for choosing exact values for these parameters require expert advice on the health capacity and risk tolerance of a health administrative region within the time period.
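The windowed tolerance limits can be sketched as follows. As a simplifying assumption, the local model here is just the mean of the previous n points rather than a local regression, and the function name and the sample values are hypothetical.

```python
import statistics


def detect_anomalies(values, n, n_sigma):
    """Flag points outside mean ± n_sigma * stdev of the previous n points."""
    flags = []
    for i, x in enumerate(values):
        if i < n:
            flags.append(False)  # not enough history to form a window yet
            continue
        window = values[i - n:i]
        centre = statistics.mean(window)
        spread = statistics.stdev(window)
        lower = centre - n_sigma * spread
        upper = centre + n_sigma * spread
        flags.append(x < lower or x > upper)
    return flags


# Hypothetical proportions of positives with one spike at index 6.
proportions = [0.10, 0.11, 0.09, 0.10, 0.11, 0.09, 0.30, 0.10]
flags = detect_anomalies(proportions, n=6, n_sigma=3)
print(flags)
```

Note that once the spike enters the window it inflates the spread, so the point right after it is judged against much wider limits. That is one reason the choice of n interacts with the choice of n_sigma.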

LOWESS (locally weighted scatterplot smoothing) is a non-parametric model that assigns higher weights to data points closer to the point being fitted in the model.
The weight w of a point x for fitting a local curve depends on d, the distance of a given data point from the point on the curve being fitted.
The locality of a curve is defined by the length of the sliding window, n.
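A minimal sketch of distance-based weighting follows. It assumes the tricube kernel, w = (1 - |d|^3)^3 for scaled distance |d| < 1, which is the usual choice in LOWESS; the slide's own weight formula did not survive extraction, so treat the kernel as an assumption. The helper names are hypothetical.

```python
def tricube(d):
    """Tricube weight for a distance d already scaled to [0, 1]."""
    d = abs(d)
    return (1 - d ** 3) ** 3 if d < 1 else 0.0


def local_fit(xs, ys, x0, n):
    """Weighted linear fit evaluated at x0, using the n nearest points."""
    nearest = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x0))[:n]
    # Scale distances by the farthest point in the window (1.0 guards against zero).
    d_max = max(abs(xs[i] - x0) for i in nearest) or 1.0
    w = [tricube(abs(xs[i] - x0) / d_max) for i in nearest]
    px = [xs[i] for i in nearest]
    py = [ys[i] for i in nearest]
    sw = sum(w)
    mx = sum(wi * x for wi, x in zip(w, px)) / sw
    my = sum(wi * y for wi, y in zip(w, py)) / sw
    sxx = sum(wi * (x - mx) ** 2 for wi, x in zip(w, px))
    sxy = sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, px, py))
    slope = sxy / sxx if sxx > 0 else 0.0
    return my + slope * (x0 - mx)


xs = list(range(10))
ys = [2 * x + 1 for x in xs]  # exactly linear, so the local fit should reproduce it
fitted = local_fit(xs, ys, x0=4, n=5)
print(tricube(0.5), fitted)
```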

Results
The smaller the value of n_sigma, the higher the number of anomalies per time window.
- n_sigma (σ) = 1 produces more outliers than n_sigma (σ) = 2 or 3.
- The setting of n_sigma (σ) will be determined by the health capacity of a region.
- This method assumes that health capacity closely follows the proportion of positive cases.
- Each health region would adjust capacity at the end of each time window.
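The relationship between n_sigma and the number of flagged points can be illustrated under one strong assumption: that deviations from the local fit are roughly normally distributed. Under that assumption the expected share of points outside the limits, and the n_sigma that corresponds to a two-sided confidence level, follow from the standard normal distribution. The function names are hypothetical.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def expected_outlier_fraction(n_sigma):
    """Share of normally distributed points expected outside ± n_sigma."""
    return 2 * (1 - STANDARD_NORMAL.cdf(n_sigma))


def n_sigma_for_confidence(confidence):
    """n_sigma whose two-sided interval covers the given confidence level (0 < c < 1)."""
    return STANDARD_NORMAL.inv_cdf(0.5 + confidence / 2)


for n_sigma in (1, 2, 3):
    print(n_sigma, round(expected_outlier_fraction(n_sigma), 4))
print(round(n_sigma_for_confidence(0.95), 2))
```

Real surveillance residuals need not be normal, so these shares are a rough guide to how strict a threshold is, not a guarantee.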

Results
Given the same n_sigma (tolerance) for all health regions, they will experience anomalies at different times.
ARAGUAIA experienced a Flareup at times 35 and 131, when BAIXO and CARAJAS were experiencing a Decline. Hence, at those times, ARAGUAIA would need to be targeted, but the success in BAIXO and CARAJAS would also need to be investigated to ascertain the cause.
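Finding the time steps at which one region flares up while others decline can be sketched as below. The per-region labels are assumed to come from an upstream detector (Flareup above the upper limit, Decline below the lower limit, Normal otherwise); the label sequences and the function name are hypothetical and do not reproduce the reported times 35 and 131.

```python
def diverging_times(labels_by_region, target):
    """Times where `target` flares up while at least one other region declines."""
    times = []
    for t, label in enumerate(labels_by_region[target]):
        if label != "Flareup":
            continue
        declining = sorted(
            region
            for region, labels in labels_by_region.items()
            if region != target and labels[t] == "Decline"
        )
        if declining:
            times.append((t, declining))
    return times


# Hypothetical labels for six time steps.
labels = {
    "ARAGUAIA": ["Normal", "Flareup", "Normal", "Normal", "Flareup", "Normal"],
    "BAIXO": ["Normal", "Decline", "Normal", "Decline", "Normal", "Normal"],
    "CARAJAS": ["Normal", "Decline", "Normal", "Normal", "Decline", "Normal"],
}
divergences = diverging_times(labels, "ARAGUAIA")
print(divergences)
```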

But a point anomaly may not be reliable enough to change policy or commission an investigation
- State-wide, there is a consistent case decline for 6 months. ARAGUAIA also follows the state trend.
- However, CARAJAS and TOCANTINS have consistently flared up over the same 6 months.
- If only state-wide progress is monitored, elimination may not happen.
- The incidence rate in TOCANTINS is up to 40%.
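One way to separate a sustained change from a point anomaly is to count how many consecutive steps the series has moved in the same direction. The sketch below is an assumption about how that could be done, not the authors' method; the series are invented, with only the 6-step length and the 40% end point echoing the slide.

```python
def trailing_run(values, direction):
    """Length of the run of consecutive rises (+1) or falls (-1) ending at the last point."""
    run = 0
    for i in range(len(values) - 1, 0, -1):
        if (values[i] - values[i - 1]) * direction > 0:
            run += 1
        else:
            break
    return run


def is_sustained(values, direction, min_steps=6):
    """True when the latest change has persisted for at least min_steps steps."""
    return trailing_run(values, direction) >= min_steps


# Hypothetical incidence rates over seven time steps.
state_wide = [0.20, 0.19, 0.18, 0.17, 0.16, 0.15, 0.14]
tocantins = [0.22, 0.25, 0.28, 0.31, 0.34, 0.37, 0.40]
single_spike = [0.10, 0.10, 0.10, 0.10, 0.10, 0.30, 0.10]

print(is_sustained(state_wide, -1), is_sustained(tocantins, +1), is_sustained(single_spike, +1))
```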

Limitation of traditional non-parametric LOWESS
Small increases in incidence rate per window may add up to large outbreaks that go undetected.
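The drift limitation can be reproduced with a small experiment. The sketch assumes a detector that compares each point with the mean and standard deviation of the previous n points, which is a simplification of a LOWESS-based detector, and feeds it a hypothetical steady ramp from 5% to 40%. Because the window moves along with the ramp, no single step ever looks unusual.

```python
import statistics


def max_z_score(values, n):
    """Largest |x - mean| / stdev of any point measured against its previous n points."""
    scores = []
    for i in range(n, len(values)):
        window = values[i - n:i]
        deviation = abs(values[i] - statistics.mean(window))
        scores.append(deviation / statistics.stdev(window))
    return max(scores)


# Hypothetical incidence rate rising by one percentage point per time step.
ramp = [0.05 + 0.01 * t for t in range(36)]
worst = max_z_score(ramp, n=12)
print(round(ramp[-1], 2), round(worst, 3))
```

For a perfectly linear ramp and n = 12 the score stays at about 1.8 whatever the slope, so thresholds of n_sigma = 2 or 3 never fire even though the rate has grown eightfold.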

Ongoing/Future Work
Solving the Drift Problem
- Look back n lags (time steps) to determine the true trend while incorporating uncertainty.
- Compute anomalies only within the sliding window.
- Train a model that learns baseline normal data and flags everything else as anomalous.
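One hedged reading of the baseline idea is to freeze the reference statistics on a period assumed to be normal, so that slow drift cannot pull the limits along with it. The sketch below illustrates only that reading; it is not the authors' planned model, and the function name, the baseline length and the data are hypothetical.

```python
import statistics


def flag_against_baseline(values, baseline_length, n_sigma):
    """Flag points that fall outside limits fixed from an initial baseline period."""
    baseline = values[:baseline_length]
    centre = statistics.mean(baseline)
    spread = statistics.stdev(baseline)
    return [abs(x - centre) > n_sigma * spread for x in values]


# Hypothetical series: 12 stable steps, then a slow rise of 0.005 per step.
stable = [0.05, 0.06] * 6
drifting = [0.06 + 0.005 * k for k in range(1, 13)]
series = stable + drifting

flags = flag_against_baseline(series, baseline_length=12, n_sigma=3)
first_flag = flags.index(True)
print(first_flag, round(series[first_flag], 3))
```

The trade-off is that a frozen baseline needs to be re-estimated deliberately, for example after an intervention changes what counts as normal, whereas a sliding window adapts on its own.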

Summary
With the rising threats of pandemics and climate change, global attention and funding for mitigating the inequitable burden of malaria are more necessary than ever.
Because data for endemic diseases such as malaria are not analysed by humans on a daily basis, automated methods can help to provide proactive decision support.
We have developed a tool to help identify appropriate anomaly thresholds for health regions: https://github.com/KingPeter2014/Anomaly_in_malaria_surveillance_data

References
T. Hagemann and K. Katsarou. A Systematic Review on Anomaly Detection for Cloud Computing Environments. 2020. doi: https://doi.org/10.1145/3442536.3442550
Understanding LSTMs. https://colah.github.io/posts/2015-08-Understanding-LSTMs/
B. Agrawal, T. Wiktorski and C. Rong. Adaptive Real-Time Anomaly Detection in Cloud Infrastructures. 2018 1st International Conference on Data Intelligence and Security (ICDIS).
J. Clark, Z. Liu and N. Japkowicz. Adaptive Threshold for Outlier Detection on Data Streams. 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), 2018, pp. 41-49. doi: 10.1109/DSAA.2018.00014
S.J. Taylor and B. Letham. Forecasting at Scale. https://peerj.com/preprints/3190.pdf, 2017.
SIVEP-Malaria database. IntegratedDataset.csv: Derived from Brazilian epidemiological surveillance system of malaria (2020). https://www.synapse.org/##!Synapse:syn21555933