This is my presentation at Monitorama PDX in Portland on May 5, 2014
Simple math to get some signal out of your noisy sea of data
You’ve instrumented your system and application to the hilt. You can now “measure all the things”. Your team has set up thousands of metrics collecting millions of data points a day. Now what?
Most IT ops teams only keep an eye on a small fraction of the metrics they collect because analyzing this mountain of data and extracting signal from the noise is not easy. The choice of what analytic method to use ranges from simple statistical analysis to sophisticated machine learning techniques. And one algorithm doesn’t fit all data.
PGroonga – Make PostgreSQL fast full text search platform for all languages! – Kouhei Sutou
PostgreSQL has a built-in full-text search feature, but it supports only a limited set of languages; for example, it doesn't support Japanese. pg_trgm, bundled with PostgreSQL, supports all languages including Japanese, but it has performance problems with large documents.
This talk describes PGroonga, which resolves these problems.
The first portion of the session will cover the critical reasons why PostgreSQL generates full page writes (FPW) and how to monitor the rate of generation. Next we will demonstrate the negative effect of full page writes on performance, scale, backups, and replication. Then we will cover various techniques to decrease the amount of full page writes and improve your database's performance, scale, and efficiency, including using new PostgreSQL versions, parameter changes, application changes, and the use of specific PostgreSQL features like partitioning. The final portion of the session will look at how future architectures can eliminate the need for full page writes.
About Flexible Indexing
Postgres’ rich variety of data structures and data-type-specific indexes can be confusing for newer and experienced Postgres users alike, who may be unsure when and how to use them. For example, GIN indexing specializes in the rapid lookup of keys with many duplicates — an area where traditional B-tree indexes perform poorly. This is particularly useful for JSON and full-text searching. GiST allows for efficient indexing of two-dimensional values and range types.
To listen to the recorded presentation with Bruce Momjian, visit Enterprisedb.com > Resources > Webcasts > Ondemand Webcasts.
For product information and subscriptions, please email sales@enterprisedb.com.
Presenter: Hwalsuk Lee (NAVER)
Date: November 2017
Deep learning research has recently been shifting its center of gravity rapidly from supervised to unsupervised learning. This course covers everything about the autoencoder, the most representative unsupervised learning method. From the dimensionality-reduction perspective, we will study the widely used Autoencoder (AE) and its variants, the Denoising AE and Contractive AE; from the data-generation perspective, we will study the recently popular Variational Autoencoder (VAE) and its variants, the Conditional VAE and Adversarial AE. We will also look at various practical applications of autoencoders to find points of contact with real-world work.
1. Revisit Deep Neural Networks
2. Manifold Learning
3. Autoencoders
4. Variational Autoencoders
5. Applications
Five Things I Learned While Building Anomaly Detection Tools – Toufic Boubez (tboubez)
This is my presentation from LISA 2014 in Seattle on November 14, 2014.
Most IT Ops teams only keep an eye on a small fraction of the metrics they collect because analyzing this haystack of data and extracting signal from the noise is not easy and generates too many false positives.
In this talk I will show some of the types of anomalies commonly found in dynamic data center environments and discuss the top 5 things I learned while building algorithms to find them. You will see how various Gaussian-based techniques work (and why they don't!), and we will go into some non-parametric methods that you can use to great advantage.
Beyond Pretty Charts: Analytics for the Rest of Us – Toufic Boubez, DevOps Days... (tboubez)
Current monitoring tools are clearly reaching the limit of their capabilities. That's because these tools are based on fundamental assumptions that are no longer true, such as assuming that the underlying system being monitored is relatively static, or that the behavioral limits of these systems can be defined by static rules and thresholds. Interest in applying analytics and machine learning to detect anomalies in dynamic web environments is gaining steam. However, understanding which algorithms should be used to identify and predict anomalies accurately within all that data we generate is not so easy.
This talk builds on an Open Space discussion that was started at DevOps Days Austin. We will begin with a brief definition of the types of anomalies commonly found in dynamic data center environments and then discuss some of the key elements to consider when thinking about anomaly detection such as:
Understanding your data and the two main approaches for analyzing operations data: parametric and non-parametric methods
The importance of context
Simple data transformations that can give you powerful results
Immutable infrastructure with Docker and containers (GlueCon 2015) – Jérôme Petazzoni
"Never upgrade a server again. Never update your code. Instead, create new servers, and throw away the old ones!"
That's the idea of immutable servers, or immutable infrastructure. This makes many things easier: rollbacks (you can always bring back the old servers), A/B testing (put old and new servers side by side), security (use the latest and safest base system at each deploy), and more.
However, throwing in a bunch of new servers at each one-line CSS change is going to be complicated, not to mention costly.
Containers to the rescue! Creating container "golden images" is easy, fast, dare I say painless. Replacing your old containers with new ones is also easy to do; much easier than virtual machines, let alone physical ones.
In this talk, we'll quickly recap the pros (and cons) of immutable servers; then explain how to implement that pattern with containers. We will use Docker as an example, but the technique can easily be adapted to Rocket or even plain LXC containers.
Data Centre Analytics – Toufic Boubez, Metafor, DevOps Days Vancouver, 2013-10-25 (tboubez)
Vancouver DevOps Days
25 October 2013
IT Ops collect a ton of data and produce reams of graphs to monitor systems and applications. Getting the right signal out of all that noise, however, is getting tougher and tougher. The traditional techniques for such metrics, whether threshold-based or the very simple statistical methods developed for stable, static manufacturing processes, are failing in today's dynamic environments. Interest in applying more advanced analytics and machine learning to detect anomalies is gaining steam, but understanding which algorithms should be used to identify and predict anomalies without producing more false positives is not so easy.
This talk will begin with a brief definition of the types of anomalies commonly found in dynamic data center environments and then discuss some of the key elements to consider when thinking about anomaly detection such as:
Understanding your data’s characteristics
The two main approaches for analyzing operations data: parametric and non-parametric methods
Overview of some current simple statistical methods and their weaknesses
Simple data transformations that can give you powerful results
By the end of this talk, attendees will understand the pros and cons of the key statistical analysis techniques and walk away with examples as well as practical rules of thumb and usage patterns.
Velocity Europe 2013: Beyond Pretty Charts: Analytics for the Cloud Infrastructure (tboubez)
My presentation from Velocity Europe 2013 in London: Beyond Pretty Charts…. Analytics for the cloud infrastructure.
IT Ops collect tons of data on the status of their data center or cloud environment. Much of that data ends up as graphs on big screens so ops folks can keep an eye on the behavior of their systems. But unless a threshold is crossed, behavioral issues will often fall through the cracks. Thresholds are reactive, and humans are, well, human. Applying analytics and machine learning to detect anomalies in dynamic infrastructure environments can catch these behavioral changes before they become critical.
Current tools used to monitor web environments rely on fundamental assumptions that are no longer true, such as assuming that the underlying system being monitored is relatively static, or that the behavioral limits of these systems can be defined by static rules and thresholds. Thus, interest in applying analytics and machine learning to predict and detect anomalies in these dynamic environments is gaining steam. However, understanding which algorithms should be used to identify and predict anomalies accurately within all that data we generate is not so easy.
This talk will begin with a brief definition of the types of anomalies commonly found in dynamic data center environments and then discuss some of the key elements to consider when thinking about anomaly detection such as:
Understanding your data’s characteristics
The two main approaches for analyzing operations data: parametric and non-parametric methods
Simple data transformations that can give you powerful results
By the end of this talk, attendees will understand the pros and cons of the key statistical analysis techniques and walk away with examples as well as practical rules of thumb and usage patterns.
This presentation gives a formal treatment of anomaly detection and outlier analysis, the types of anomalies and outliers, and different approaches to tackling anomaly detection problems.
R – what do the numbers mean? #RStats. This is the presentation for my demo at Orlando Live60 AILive. We go through statistics interpretation with examples.
In Part II of the Anomaly Detection Series, we discuss the challenges in analyzing Temporal datasets and discuss methods for outlier analysis. We focus on single time series and discuss point outlier and sub-sequence methods.
Data wrangling is the process of removing errors and combining complex data sets to make them more accessible and easier to analyze. Due to the rapid expansion of the amount of data and data sources available today, storing and organizing large quantities of data for analysis is becoming increasingly necessary.
Anomaly detection (or outlier analysis) is the identification of items, events, or observations which do not conform to an expected pattern or to other items in a dataset. It is used in applications such as intrusion detection, fraud detection, fault detection, and monitoring processes in various domains including energy, healthcare, and finance. In this talk, we will introduce anomaly detection and discuss the various analytical and machine learning techniques used in this field. Through a case study, we will discuss how anomaly detection techniques could be applied to energy data sets. We will also demonstrate, using R and Apache Spark, an application to help reinforce concepts in anomaly detection and best practices in analyzing and reviewing results.
Anomaly detection: Core Techniques and Advances in Big Data and Deep Learning – QuantUniversity
Anomaly detection (or outlier analysis) is the identification of items, events, or observations which do not conform to an expected pattern or to other items in a dataset. It is used in applications such as intrusion detection, fraud detection, fault detection, and monitoring processes in various domains including energy, healthcare, and finance.
A talk given by Eugene Dubossarsky on predictive analytics at the Big Data Analytics meetup in Sydney this month. The talk is available at http://www.youtube.com/watch?v=aG16YSFgtLY
2. Some Simple Math for Anomaly Detection
#Monitorama PDX
2014.05.05
Toufic Boubez, Ph.D.
Co-Founder, CTO
Metafor Software
toufic@metaforsoftware.com
@tboubez
3. 3
Preamble
• I lied!
– There are no “simple” tricks
– If it’s too good to be true, it probably is
• I usually beat up on parametric, Gaussian, supervised techniques
– This talk is to show some alternatives
– Only enough time to cover a couple of relatively simple but very useful techniques
– Oh, and I will still beat up on the usual suspects
• Adrian and James are right! Listen to them!
– What’s the point of collecting all that data if you can’t get useful information out of it!?
• Note: real data
• Note: no y-axis labels on charts – on purpose!!
• Note to self: remember to SLOW DOWN!
• Note to self: mention the cats!! Everybody loves cats!!
4. 4
• Co-Founder/CTO Metafor Software
• Co-Founder/CTO Layer 7 Technologies
– Acquired by Computer Associates in 2013
– I escaped
• Co-Founder/CTO Saffron Technology
• IBM Chief Architect for SOA
• Co-Author, Co-Editor: WS-Trust, WS-SecureConversation, WS-Federation, WS-Policy
• Building large-scale software systems for >20 years (I’m older than I look, I know!)
Toufic intro – who I am
6. 6
The WoC side-effects: alert fatigue
“Alert fatigue is the single biggest problem we have right now … We need to be more intelligent about our alerts or we’ll all go insane.”
– John Vincent (@lusis) (#monitoringsucks)
9. 9
TO THE RESCUE: Anomaly Detection!!
• Anomaly detection (also known as outlier detection) is the search for items or events which do not conform to an expected pattern. [Chandola, V.; Banerjee, A.; Kumar, V. (2009). “Anomaly detection: A survey”. ACM Computing Surveys 41 (3): 1]
• For devops: Need to know when one or more of our metrics is going wonky
11. 11
… are based on Gaussian distributions
• Make assumptions about probability distributions and process behaviour
– Data is normally distributed with a useful and usable mean and standard deviation
– Data is probabilistically “stationary”
12. 12
Three-Sigma Rule
• Three-sigma rule
– ~68% of the values lie within 1 std deviation of the mean
– ~95% of the values lie within 2 std deviations
– 99.73% of the values lie within 3 std deviations: anything else is an outlier
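Here is a minimal sketch of the three-sigma check in Python/NumPy (mine, not from the deck; it assumes the series really is Gaussian and stationary, which, as we'll see, ops data rarely is):

```python
import numpy as np

def three_sigma_outliers(x):
    """Flag values more than 3 standard deviations from the mean."""
    mu, sigma = x.mean(), x.std()
    return np.abs(x - mu) > 3 * sigma

rng = np.random.default_rng(42)
series = rng.normal(loc=100.0, scale=5.0, size=1000)  # well-behaved metric
series[500] = 150.0                                   # inject one big spike
print(np.flatnonzero(three_sigma_outliers(series)))
# index 500 is flagged; expect a couple of random points beyond 3 sigma too
```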
14. 14
Stationary Gaussian distributions are powerful
• Because far far in the future, in a galaxy far far away:
– I can make the same predictions because the statistical properties of the data haven’t changed
– I can easily compare different metrics since they have similar statistical properties
• Let’s do this!!
• BUT…
• Cue in DRAMATIC MUSIC
22. 22
Attempts #2, #3, etc: mo’ better thresholds
• Static thresholds ineffective on dynamic data
– Thresholds use the mean as predictor and alert if data falls more than 3 sigma outside the mean
• Need “moving” or “adaptive” thresholds:
– Value of mean changes with time to accommodate new data values/trends
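To make the failure concrete, here is a small illustration of my own (all numbers made up): fit the mean and sigma once on early data, then watch a gentle trend trip the alert constantly:

```python
import numpy as np

rng = np.random.default_rng(7)
trend = np.linspace(0.0, 30.0, 1000)                 # slow upward drift
series = trend + rng.normal(0.0, 1.0, size=1000)     # dynamic, trending metric

mu, sigma = series[:200].mean(), series[:200].std()  # static threshold, fit once
alerts = np.abs(series - mu) > 3 * sigma
print(alerts.sum())  # hundreds of "anomalies": that's the trend, not outliers
```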
23. 23
Moving Averages “big idea”
• At any point in time in a well-behaved time series, your next value should not significantly deviate from the general trend of your data
• Mean as a predictor is too static, relies on too much past data (ALL of the data!)
• Instead of overall mean use a finite window of past values, predict most likely next value
• Alert if actual value “significantly” (3 sigmas?) deviates from predicted value
24. 24
Moving Averages typical method
• Generate a “smoothed” version of the time series
– Average over a sliding (moving) window
• Compute the squared error between raw series and its smoothed version
• Compute a new effective standard deviation by smoothing the squared error
• Generate a moving threshold:
– Outliers are 3-sigma outside the new, smoothed data!
• Ta-da!
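That whole recipe fits in a few lines; a sketch of my own (the window size and the 3-sigma multiplier are illustrative, and the zero-padded edges are approximate):

```python
import numpy as np

def moving_threshold_outliers(x, window=20, k=3.0):
    """Moving-average anomaly detection per the recipe above."""
    kernel = np.ones(window) / window
    smoothed = np.convolve(x, kernel, mode="same")   # smoothed series
    sq_err = (x - smoothed) ** 2                     # raw vs. smoothed error
    eff_std = np.sqrt(np.convolve(sq_err, kernel, mode="same"))  # smoothed error
    return np.abs(x - smoothed) > k * eff_std        # moving threshold
```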
25. 25
Simple and Weighted Moving Averages
• Simple Moving Average
– Average of last N values in your time series
• S[t] <- sum(X[t-(N-1):t])/N
– Each value in the window contributes equally to the prediction
– …INCLUDING spikes and outliers
• Weighted Moving Average
– Similar to SMA but assigns linearly (arithmetically) decreasing weights to every value in the window
– Older values contribute less to the prediction
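The two averages differ only in their window weights; a sketch (mine), matching the S[t] formula above:

```python
import numpy as np

def sma(x, n):
    """Simple moving average: each of the last n values weighs 1/n."""
    return np.convolve(x, np.ones(n) / n, mode="valid")

def wma(x, n):
    """Weighted moving average: weights 1..n, newest value weighted most."""
    w = np.arange(1, n + 1, dtype=float)
    w /= w.sum()
    # np.convolve flips the kernel, so pass the weights reversed to put
    # the largest weight on the newest sample in each window.
    return np.convolve(x, w[::-1], mode="valid")

x = np.array([1.0, 2.0, 3.0, 10.0])  # spike at the end
print(sma(x, 3))  # -> [2.0, 5.0]      the spike contributes 1/3
print(wma(x, 3))  # -> [2.333, 6.333]  the spike contributes 3/6
```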
26. 26
Exponential Smoothing techniques
• Exponential Smoothing
– Similar to weighted averages, but the weights decay exponentially over the whole set of historic samples
• S[t]=αX[t-1] + (1-α)S[t-1]
– Does not deal with trends in data
• DES
– In addition to the data smoothing factor (α), introduces a trend smoothing factor (β)
– Better at dealing with trending
– Does not deal with seasonality in data
• TES, Holt-Winters
– Introduces additional seasonality factor
– … and so on
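A sketch of single and double (Holt) smoothing from the recurrence above; the α and β values here are illustrative, not recommendations:

```python
def single_es(x, alpha=0.3):
    """Single exponential smoothing: S[t] = alpha*X[t-1] + (1-alpha)*S[t-1]."""
    s = [x[0]]                       # seed the forecast with the first value
    for t in range(1, len(x)):
        s.append(alpha * x[t - 1] + (1 - alpha) * s[-1])
    return s

def double_es(x, alpha=0.3, beta=0.1):
    """Holt's double exponential smoothing: level plus a smoothed trend."""
    level, trend = x[0], x[1] - x[0]
    out = [level]
    for t in range(1, len(x)):
        prev = level
        level = alpha * x[t] + (1 - alpha) * (level + trend)
        trend = beta * (level - prev) + (1 - beta) * trend
        out.append(level)
    return out
```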
31. 31
Hmmmm, so are we doomed?
• No!
• ALL smoothing predictive methods work best with normally distributed data!
• But there are lots of other non-Gaussian-based techniques
– We can only scratch the surface in this talk
36. 36
Trick #2: Kolmogorov-Smirnov test
• Non-parametric test
– Compares two probability distributions
– Makes no assumptions (e.g. Gaussian) about the distributions of the samples
– Measures maximum distance between cumulative distributions
– Can be used to compare periodic/seasonal metric periods (e.g. day-to-day or week-to-week)
http://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test
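A sketch of the day-to-day comparison using SciPy's two-sample KS test (synthetic lognormal data standing in for a skewed, non-Gaussian metric; all numbers illustrative):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
yesterday = rng.lognormal(mean=0.0, sigma=1.0, size=1440)  # one sample/minute
today = rng.lognormal(mean=0.5, sigma=1.0, size=1440)      # shifted behaviour

stat, p_value = ks_2samp(yesterday, today)
# A large statistic / tiny p-value says the two days' distributions differ,
# with no Gaussian assumption anywhere.
print(f"KS statistic = {stat:.3f}, p = {p_value:.3g}")
```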
44. 44
Trick #3: Diffing/Derivatives
• Often, even when the data itself is not stationary, its derivatives tend to be!
• Most frequently, the first difference is sufficient: dS(t) <- S(t+1) – S(t)
• Can then perform some analytics on the first difference
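A quick illustration (mine): a trending series has a drifting mean, but its first difference is roughly stationary, so the earlier techniques apply to the diff:

```python
import numpy as np

rng = np.random.default_rng(1)
s = np.linspace(0.0, 100.0, 1000) + rng.normal(0.0, 1.0, 1000)  # ramp + noise

ds = np.diff(s)  # first difference: dS[t] = S[t+1] - S[t]
print(s[:500].mean(), s[500:].mean())    # ~25 vs ~75: the mean drifts
print(ds[:500].mean(), ds[500:].mean())  # both ~0.1: roughly stationary
```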
47. 47
We’re not doomed, but: Know your data!!
• You need to understand the statistical properties of your data, and where it comes from, in order to determine what kind of analytics to use.
– Your data is very important!
– You spend time collecting it, so spend time analyzing it!
• A large amount of data center data is non-Gaussian
– Gaussian statistics won’t work
– Use appropriate techniques
48. 48
More?
• Only scratched the surface
• I want to talk more about algorithms, analytics, current issues, etc., in more depth, but time’s up!!
– Come talk to me or email me if interested.
• Thank you!
toufic@metaforsoftware.com
@tboubez