Streaming Data Mining
4/18/2014 Streaming Data Mining 1
Once upon a time.
• Life was easy
– E.g., organizations had only transaction data, and analysts were happy analyzing it.
– Competition was lower.
– Customers had fewer options for reviewing products.
• Wait! Web 2.0
– Customers who consumed data started generating data: tweets, blogs,
Facebook comments, reviews…
– Another burst came with the mobile era.
• Apps recording customers' locations
• Actions on apps.
• Pattern of app use.
[Figure: one server and six DB nodes]
It's All About the Numbers!
58M/Day
500Tb/Day 2.1M GB/Hr 4B view/day
So, it's GOOD to have data,
right?
Digging Into the Data
• Analyze to understand customers
• Identify patterns
• Machine learning
• Statistical model building
• Natural language processing
• …
Usual Pipeline in Data Mining.
Data of entire population → Sample population → Cleaning and preprocessing → Training and testing models → Production server
Why?
Huge Training Data Set - Volume
• Organizations these days have huge datasets that can be used
to train their models.
• But main memory is limited.
– Machine learning algorithms
– Batch processing
• Why not sampling?
Streams - Velocity
• Ubiquitous Computing, Mobile Devices, Social Media.
• Potentially of infinite length
• Usual Strategy – Batch Mode.
Contextual Trends.
• Trending topic on social media.
• Weather
• Location
• Demographics
• Market Dynamics
• Jargon Alert : Concept Drift
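To make the jargon concrete: concept drift means the relationship between inputs and labels changes over time, so a model that used to work starts making more mistakes. The sketch below is only an illustration, not a named algorithm from the slides. It assumes a hypothetical `WindowDriftMonitor` helper that flags drift when the error rate over a recent window exceeds the long-run error rate by a chosen margin; the window size and margin are arbitrary.

```python
from collections import deque


class WindowDriftMonitor:
    """Illustrative drift check: compare recent error rate with the long-run rate.

    The window size and margin are assumptions chosen for the example.
    """

    def __init__(self, window=50, margin=0.2):
        self.recent = deque(maxlen=window)  # bounded storage: last `window` outcomes
        self.margin = margin
        self.total = 0
        self.errors = 0

    def update(self, was_error):
        """Record one prediction outcome; return True if drift is suspected."""
        self.total += 1
        self.errors += int(was_error)
        self.recent.append(int(was_error))
        if len(self.recent) < self.recent.maxlen:
            return False  # not enough recent evidence yet
        long_run = self.errors / self.total
        recent_rate = sum(self.recent) / len(self.recent)
        return recent_rate - long_run > self.margin


monitor = WindowDriftMonitor(window=50, margin=0.2)
# 500 stable outcomes with a 10% error rate, then the concept changes
# and the model starts failing 60% of the time (made-up data).
stable = [i % 10 == 0 for i in range(500)]
drifted = [i % 5 < 3 for i in range(100)]
flags = [monitor.update(e) for e in stable + drifted]
first_flag = flags.index(True) if True in flags else None
```

The monitor keeps only a fixed-size window and two counters, so its memory use does not grow with the stream.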
What do we want today?
Consume real-time data
and extract insights.
Wait! Can I say analyze
streams?
Streaming Data Mining!
Philosophy
• Continuous Data Record aka Data Streams
• Bounded Storage
• Single Pass
• Real Time
• Concept Drift
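As a small illustration of "bounded storage" and "single pass", the sketch below keeps a running mean and variance using Welford's method. The class name `RunningStats` and the sample values are assumptions made for the example; the point is that each element is seen once and only three numbers are stored, however long the stream runs.

```python
class RunningStats:
    """Single-pass mean and variance (Welford's method) in O(1) memory."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the current mean

    def update(self, x):
        """Fold one new observation into the summary, then forget it."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        # Sample variance; returns 0.0 when fewer than two values were seen.
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0


stats = RunningStats()
for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:  # stand-in for a stream
    stats.update(value)
```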
So What… We have Hadoop…
• The big Elephant doesn’t fit in here.
• Hadoop – Batch Processing
• We need Storm
– Storm is fast: a benchmark clocked it at
over a million tuples processed per
second per node.
Algorithms.
• Conventional machine learning algorithms were designed
for batch processing.
– The algorithm needs to load the entire dataset into memory.
– It then computes the necessary statistics, for example entropy or
information gain in decision trees.
• With streams?
– Streams are of infinite length.
– Storing everything, even if you can, strains the system's memory
and gets expensive ($$$$).
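To show why the batch statistics above assume the whole dataset is in memory, here is a minimal sketch of entropy and information gain for a categorical feature. The function names and the toy rows are made up for illustration; note that `information_gain` has to group every row before it can return anything.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(rows, labels, feature_index):
    """Entropy reduction from splitting on one categorical feature.

    This needs every row in memory at once, which is exactly the batch
    assumption that breaks down on an unbounded stream.
    """
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature_index], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Toy data (made up): feature 0 predicts the label perfectly,
# feature 1 carries no information.
rows = [("free", "am"), ("free", "pm"), ("hello", "am"), ("hello", "pm")]
labels = ["spam", "spam", "ham", "ham"]
gain_word = information_gain(rows, labels, 0)
gain_time = information_gain(rows, labels, 1)
```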
Streaming Machine Learning
• When?
– High Data volume
– Data arrives at a high rate.
– Unbounded: data keeps arriving, and we won't be able to fit it
in memory.
• Requirements to adhere to
– Each input element is processed at most once.
– Limited space
– Limited time
– Start predicting from t0
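One common way to meet these requirements is a test-then-train (prequential) loop: predict on each element first, then learn from it once and discard it. The sketch below is a minimal illustration, not the method of any particular library. It uses a tiny perceptron as the online learner, and `OnlinePerceptron`, `prequential`, and `toy_stream` are hypothetical names introduced for the example.

```python
class OnlinePerceptron:
    """Tiny online linear classifier; labels are +1 or -1."""

    def __init__(self, n_features):
        self.w = [0.0] * n_features
        self.b = 0.0

    def predict(self, x):
        score = sum(wi * xi for wi, xi in zip(self.w, x)) + self.b
        return 1 if score >= 0 else -1

    def learn(self, x, y):
        # Update only on mistakes; afterwards the example can be discarded.
        if self.predict(x) != y:
            self.w = [wi + y * xi for wi, xi in zip(self.w, x)]
            self.b += y


def prequential(stream, model):
    """Test-then-train: predict each element first, then learn from it once."""
    correct = 0
    seen = 0
    for x, y in stream:
        correct += int(model.predict(x) == y)  # predictions start at t0
        model.learn(x, y)                      # single pass, no stored history
        seen += 1
    return correct / seen if seen else 0.0


def toy_stream(n):
    """Hypothetical generator standing in for a live feed."""
    for i in range(n):
        x = (float(i % 7) - 3.0, float(i % 3) - 1.0)
        yield x, (1 if x[0] > 0 else -1)


accuracy = prequential(toy_stream(1000), OnlinePerceptron(n_features=2))
```

Because the model is scored before it sees each label, the accuracy reflects how it would have performed on live data from the first element onward.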
General Flow of Streaming Algorithms
Spam Detection
• Models trained in the past with a traditional data mining strategy
become obsolete as spammers find ways around them.
• Solution: VFDT, a Hoeffding tree for stream classification
• Train the model in a streaming setup.
• When a new spam pattern appears, people mark those messages as
spam.
• Use those labels to update the model in real time.
Concept Drift! Win!
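The idea behind a Hoeffding tree is that a node can commit to a split after seeing only a finite number of examples, once the gap between the best and second-best split is larger than the Hoeffding bound ε = sqrt(R² ln(1/δ) / (2n)). The sketch below shows only that decision rule, in simplified form (it omits VFDT's tie-breaking and other refinements). The function names, the δ value, and the gain numbers are illustrative assumptions.

```python
import math


def hoeffding_bound(value_range, delta, n):
    """Epsilon such that, with probability 1 - delta, the true mean is within
    epsilon of the mean observed over n independent samples."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(best_gain, second_gain, n, n_classes=2, delta=1e-7):
    """Split when the observed gain gap exceeds the Hoeffding bound."""
    value_range = math.log2(n_classes)  # range of information gain
    return (best_gain - second_gain) > hoeffding_bound(value_range, delta, n)


# Same observed gap, different amounts of evidence (numbers are illustrative).
early = should_split(best_gain=0.30, second_gain=0.20, n=100)
later = should_split(best_gain=0.30, second_gain=0.20, n=5000)
```

With these numbers the node waits at n=100 and splits at n=5000: the bound shrinks as more examples arrive, so the same gap eventually becomes trustworthy.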
Answering Today's Big Data Needs
• Streaming Data Mining
– Storm
– MOA
– SAMOA
– Kafka
– …
Thank You!
Ankit Solanki
Neil Shah
