Copyright Elasticsearch 2014. Copying, publishing and/or distributing without written permission is strictly prohibited
Real-time Analytics &
Anomaly detection
using Hadoop, Elasticsearch and Storm
Costin Leau
@costinl
Interesting != Common
Datasets tend to have hot / common entities
Monopolize the data set
Create too much noise
Cannot be easily avoided
Common = frequent
Interesting = frequently different
Finding the uncommon
Background vs foreground == things that stand out
Example:
Background: “flu”
“H5N1” appears in 5 / 10M docs
Foreground: “bird flu”
“H5N1” appears in 4 / 100 docs
[Figure: “H5N1” within “flu” (background) vs. “H5N1” within “bird flu” (foreground)]
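The arithmetic behind the example can be sketched in a few lines. The counts come from the slide; the variable names and the idea of reporting a single "lift" ratio are illustrative assumptions, not something the talk prescribes.

```python
# Document counts taken from the slide's example.
background_hits, background_total = 5, 10_000_000   # "H5N1" in the background set
foreground_hits, foreground_total = 4, 100          # "H5N1" in the "bird flu" results

background_rate = background_hits / background_total
foreground_rate = foreground_hits / foreground_total

# How many times more frequent the term is in the foreground than in the background.
lift = foreground_rate / background_rate

print(f"background rate: {background_rate:.7f}")
print(f"foreground rate: {foreground_rate:.2f}")
print(f"lift: {lift:,.0f}x")
```

A term that is rare overall but dense in the query results is exactly what "stands out" here.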
Finding the uncommon - Challenges
Deal with big data sets
• Hadoop
Perform the analysis
• Elasticsearch
Keep the data fresh
• Storm
Hadoop
De-facto platform for big data
HDFS - Used for storing and performing ETL at scale
Map/Reduce - Excellent for iterating, thorough analysis
YARN – Job scheduling and resource management
Elasticsearch
Open-source real-time search and analytics engine
• Fully-featured search
Relevance-ranked text search
Scalable search
High-performance geo, temporal, range and key lookup
Highlighting
Support for complex / nested document types *
Spelling suggestions
Powerful query DSL *
“Standing” queries *
Real-time results *
Extensible via plugins *
• Powerful faceting/analysis
Summarize large sets by any combination of time, geo, category and more. *
“Kibana” visualization tool *
* Features we see as differentiators
• Management
Simple and robust deployments *
REST APIs for handling all aspects of administration/monitoring *
“Marvel” console for monitoring and administering clusters *
Special features to manage the life cycle of content *
• Integration
Hadoop (Map/Reduce, Hive, Pig, Cascading…) *
Client libraries (Python, Java, Ruby, JavaScript…)
Data connectors (Twitter, JMS…)
Logstash ETL framework *
• Support
Development and Production support with tiered levels
Support staff are the core developers of the product *
Elasticsearch Hadoop
Use Elasticsearch natively in Hadoop
‣ Map/Reduce – Input/OutputFormat
‣ Apache Pig – Storage
‣ Apache Hive – External Table
‣ Cascading – Tap/Sink
‣ Storm (in development) – Spout / Bolt
All operations (reads/writes) are parallelized (Map/Reduce)
Storm
Distributed, fault-tolerant, real-time computation system
Perform on-the-fly queries
React to live data
Prevention
Routing
Discovering
the relevant
Inverted index
Inverting Shakespeare
‣ Take all the plays and break them down word by word
‣ For each word, store the ids of the documents that contain it
‣ Sort all tokens (words)
token      doc freq.  postings (doc ids)
Anthony    2          1, 2
Brutus     1          5
Caesar     2          2, 3
Calpurnia  2          4, 5
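The three steps above can be sketched as follows. This is a minimal illustration, assuming whitespace tokenization and lowercasing; the toy documents are hypothetical and are not the actual plays, and a real engine such as Lucene stores postings far more compactly.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each token to the sorted ids of the documents that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():   # break the text down word by word
            postings[token].add(doc_id)      # remember which document had the word
    # Sort the tokens, and sort each postings list.
    return {token: sorted(ids) for token, ids in sorted(postings.items())}

# Hypothetical toy documents.
docs = {
    1: "Brutus killed Caesar",
    2: "Caesar praised Antony",
    3: "Calpurnia warned Caesar",
}

index = build_inverted_index(docs)
for token, doc_ids in index.items():
    print(f"{token:<10} {len(doc_ids)}  {doc_ids}")
```

Looking up a word is then a dictionary access, and the length of its postings list is the document frequency.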
Relevancy
How well does a document match a query?
step                query         d1                                       d2
The text            brown fox     The quick brown fox likes brown nuts     The red fox
The terms           (brown, fox)  (brown, brown, fox, likes, nuts, quick)  (red, fox)
A frequency vector  (1, 1)        (2, 1)                                   (0, 1)
Relevancy           -             2?                                       1?
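The frequency vectors in the table can be reproduced with a short sketch. Reading the "2? 1?" relevancy guess as "number of query terms present in the document" is an assumption made here for illustration; the function names are hypothetical.

```python
from collections import Counter

def frequency_vector(text, query_terms):
    """Count how often each query term occurs in the text."""
    counts = Counter(text.lower().split())
    return tuple(counts[term] for term in query_terms)

def matched_terms(vector):
    """Naive relevancy: how many query terms appear at least once."""
    return sum(1 for count in vector if count > 0)

query_terms = ("brown", "fox")
d1 = "The quick brown fox likes brown nuts"
d2 = "The red fox"

v1 = frequency_vector(d1, query_terms)
v2 = frequency_vector(d2, query_terms)

print(v1, matched_terms(v1))
print(v2, matched_terms(v2))
```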
Relevancy - Vector Space Model
• How well does q match d1 and d2?
‣ The coordinates in the vector represent weights per term
‣ The simple (1, 0) vector we discussed defines these weights based on the frequency of each term
‣ But to generalize:
[Figure: vectors plotted on the axes “tf: brown” and “tf: fox”. q: (brown, fox), d1: (brown, brown, fox), d2: (fox)]
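One common way to compare vectors in the vector space model is the cosine of the angle between them. The sketch below assumes cosine similarity (the slide does not name the measure) and uses the raw term-frequency vectors for (brown, fox) from the previous slide.

```python
import math

def cosine(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Coordinates are (tf: brown, tf: fox).
q = (1, 1)
d1 = (2, 1)
d2 = (0, 1)

sim_d1 = cosine(q, d1)
sim_d2 = cosine(q, d2)
print(f"q vs d1: {sim_d1:.4f}")
print(f"q vs d2: {sim_d2:.4f}")
```

Under this measure d1 points in nearly the same direction as q, so it ranks above d2.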
Relevancy TF-IDF
Term frequency / Inverse Document Frequency
TF = the more a token appears in a doc, the more important it is
IDF = the more documents containing the term, the less important it is
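A minimal sketch of the two ideas, assuming a textbook weighting (raw term count times 1 + ln(N / df)). This is not Lucene's exact formula, and the two-document corpus is reused from the earlier relevancy example.

```python
import math
from collections import Counter

docs = {
    "d1": "the quick brown fox likes brown nuts",
    "d2": "the red fox",
}

def tf_idf(term, doc_id, docs):
    """Textbook TF-IDF weight of a term in one document (illustrative variant)."""
    tf = Counter(docs[doc_id].split())[term]                      # term frequency
    df = sum(1 for text in docs.values() if term in text.split()) # document frequency
    if tf == 0 or df == 0:
        return 0.0
    idf = 1.0 + math.log(len(docs) / df)   # rarer terms get a larger idf
    return tf * idf

print(f"brown in d1: {tf_idf('brown', 'd1', docs):.3f}")
print(f"fox   in d1: {tf_idf('fox', 'd1', docs):.3f}")
print(f"fox   in d2: {tf_idf('fox', 'd2', docs):.3f}")
```

"brown" occurs twice in d1 and in no other document, so it outweighs "fox", which occurs in every document.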
Ranking Formula
Called Lucene Similarity
Can be ignored (was an attempt to make query scores comparable across indices, it’s there for backward compatibility)
Core TF/IDF weight
Score of a document for a given query
Normalized doc length, shorter docs are more likely to be relevant than longer docs
Boost of query term t
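The formula these labels annotate did not survive in the text. The labels line up with the terms of Lucene's classic TF/IDF practical scoring function; the form below is reproduced from the Lucene TFIDFSimilarity documentation as an assumption about what the slide showed, not taken from the slide itself.

```latex
\mathrm{score}(q, d) =
  \mathrm{coord}(q, d) \cdot \mathrm{queryNorm}(q) \cdot
  \sum_{t \in q} \Big( \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)^{2}
  \cdot \mathrm{boost}(t) \cdot \mathrm{norm}(t, d) \Big)
```

Under that reading, score(q, d) is the score of a document for a given query, queryNorm(q) is the factor that can be ignored, tf · idf² is the core TF/IDF weight, boost(t) is the boost of query term t, and norm(t, d) carries the document-length normalization. The coord(q, d) factor is not labelled on the slide.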
Discovering
the interesting
Frequency differentiator
TF-IDF by itself is not enough
Need to compare the DF in foreground vs background
Precision vs Recall balance
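One way to turn the foreground-vs-background comparison into a single number is to multiply the absolute change in document frequency by the relative change. The sketch below is an illustrative heuristic in that spirit; the function name, the second term's counts, and the choice to score only terms that are more frequent in the foreground are assumptions.

```python
def significance(fg_hits, fg_total, bg_hits, bg_total):
    """Score a term by how much its document frequency rises in the foreground."""
    fg = fg_hits / fg_total
    bg = bg_hits / bg_total
    if bg == 0 or fg <= bg:
        return 0.0                      # not more frequent than usual: not interesting
    return (fg - bg) * (fg / bg)        # absolute change times relative change

# "H5N1" with the counts from the earlier slide: rare overall, dense in the results.
h5n1 = significance(4, 100, 5, 10_000_000)

# Hypothetical common term: frequent in the results, but just as frequent everywhere.
common = significance(90, 100, 9_500_000, 10_000_000)

print(f"H5N1:   {h5n1:.2f}")
print(f"common: {common:.2f}")
```

The common term scores zero despite appearing in 90% of the results, which is the "interesting != common" point made at the start.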
Single-set analysis
Dataset: A B C D E … X Y Z W
Query results: A C F H I K
Single-set analysis example
[Figure: “bicycle theft” within “crimes”, shown for British Police Force and for British Transport Police]
Multi-set analysis
Dataset: A B C D E … X Y Z W
Query results: A C F H I K M Q R …
Aggregate: A B C D .. J L M N O .. U
Background (geo-aggregation)
Foreground (geo-aggregation)
Hadoop
Off-line / slow learning
‣ In-depth analysis
‣ Break down data into hot spots
‣ Eliminate noise
‣ Build multiple models
Elasticsearch
Search features
‣ Scoring, TF-IDF
‣ Significant terms (multi-set analysis)
Aggregations
‣ Buckets & Metrics
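As a rough sketch of what these features look like in a request, the two bodies below are written as Python dictionaries that serialize to the Elasticsearch query DSL. The field names (`force`, `crime_type`, `value`) and aggregation names are hypothetical; only the `terms`, `significant_terms` and `avg` aggregation types are taken from Elasticsearch itself.

```python
import json

# Foreground = documents matching the query; the background is the rest of the index.
significant_terms_request = {
    "query": {"terms": {"force": ["British Transport Police"]}},
    "aggs": {
        "significant_crime_types": {
            "significant_terms": {"field": "crime_type"}
        }
    },
}

# Buckets & metrics: one bucket per crime type, with an average computed per bucket.
bucket_metric_request = {
    "size": 0,  # only the aggregation results are wanted, not the hits
    "aggs": {
        "by_crime_type": {
            "terms": {"field": "crime_type"},
            "aggs": {"average_value": {"avg": {"field": "value"}}},
        }
    },
}

print(json.dumps(significant_terms_request, indent=2))
print(json.dumps(bucket_metric_request, indent=2))
```

Either body would be sent to an index's `_search` endpoint; the first asks which crime types are unusually frequent for one force compared with all forces.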
Reacting
to data
Reacting to data
Prevent
execute queries as data flows in → build a model
Route
place suspicious data into a dedicated pipeline
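The routing idea can be sketched in plain Python. This is not the Storm API: the event shape, the `max_amount` threshold standing in for a real query against the model, and the two queues are all hypothetical.

```python
from collections import deque

def is_suspicious(event, model):
    """Stand-in for a query against the model: here, a simple amount threshold."""
    return event["amount"] > model["max_amount"]

def route(events, model):
    """Split a stream of events into a normal queue and a dedicated suspicious queue."""
    normal, suspicious = deque(), deque()
    for event in events:
        target = suspicious if is_suspicious(event, model) else normal
        target.append(event)
    return normal, suspicious

# Hypothetical model and incoming events.
model = {"max_amount": 1000}
events = [
    {"id": 1, "amount": 25},
    {"id": 2, "amount": 4200},
    {"id": 3, "amount": 990},
]

normal, suspicious = route(events, model)
print("normal:", [e["id"] for e in normal])
print("suspicious:", [e["id"] for e in suspicious])
```

In a Storm topology the same decision would sit inside a bolt, with the two queues replaced by separate downstream streams.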
Reacting to data
[Figure: Storm topology, a spout feeding a graph of bolts]
Live loop
Data keeps changing
‣ Adapt the set of rules
Improves reaction time
‣ Build a model for fast decision making
Keeps the prevention rate high
‣ Categorize data on the fly
Putting it all together
The Big Picture
HDFS
Slow, in-depth learning
Fast, real-time learning
ETL
Usages
Recommendation
‣ Find similar movies based on user feedback
‣ Use Storm to optimize the returned results
Card Fraud
‣ Use Storm to prevent suspicious transactions from executing
‣ Route possible frauds to a dedicated analysis queue
Q&A
Thank you!
@costinl

Editor's Notes

  • #4 Zipf distribution, power-law, long tail
  • #10 Full-text search Highlight Search-as-you-type Did-you-mean More-like-this Geolocation Events, logs,
  • #20 Precision = how many of the retrieved documents are relevant; Recall = how many of the relevant documents are retrieved
  • #24 Term aggregation
  • #25 Significant terms
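The precision and recall definitions in note #20 can be written out directly. The document ids below are hypothetical, and returning 0.0 for an empty set is a convention chosen here for the sketch.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved and a set of relevant ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)                 # relevant documents actually retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical ids: four documents retrieved, three documents truly relevant.
precision, recall = precision_recall(
    retrieved=["A", "C", "F", "H"],
    relevant=["A", "C", "K"],
)
print(f"precision: {precision:.2f}")
print(f"recall:    {recall:.2f}")
```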