NEW YORK STORM USERS GROUP

Using Storm with MapR M7 for Real-Time Predictive Modeling

January 28, 2014
AGENDA

• Introductions
• About Velos
• Our Use Cases
• Requirements
• Why Storm?
• Why MapR M7?
• How Did We Get Here?
• Architecture
• Quick Storm Introduction
• Our Topologies
• Performance & Learnings
• Road Map
• Q&A
INTRODUCTIONS

Gna Phetsarath
Director of Technology
@sourigna
http://www.linkedin.com/in/sourignaphetsarath/
ABOUT VELOS

• Velos provides Predictive Analytics lifecycle and scaling solutions for Enterprise companies
• Formerly Sociocast, a SaaS solution with use-case-specific ad tech and e-commerce models on our own hardware
• Velos provides an on-premise platform supporting any models on various production runtimes, such as Hadoop, Storm, Spark and others
• Customers can easily automate ETL, feature engineering, model evaluation, production deployment and monitoring, as well as relearning and adaptation
• Plug in existing Python, Java, and R models
OUR USE CASES

• Real-Time Predictive Modeling
• Real-Time Metrics
• Atomic Counters
• Unique Probabilistic Counting (HyperLogLogPlus)
• Group Membership (Bloom Filters)
• Page Parsing - NLP Feature Extraction
• Event/Entity Attribute Maintenance
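
To make the group-membership use case concrete, here is a minimal Bloom filter sketch in plain Python. It is not the implementation used in the talk: the class name, the bit-array size, the number of hashes, and the SHA-256 based double hashing are all assumptions chosen for illustration. The property it shows is the one that matters for membership tests: a key that was added is always reported as present, while a key that was never added may occasionally be reported as present too.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter: false positives are possible, false negatives are not."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key: str):
        # Derive num_hashes bit positions from two halves of one SHA-256
        # digest (double hashing). The hash choice is an assumption.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: str) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(key)
        )


# Hypothetical audience segment keyed by user id.
segment = BloomFilter()
for user_id in ("user-1", "user-2", "user-3"):
    segment.add(user_id)
```

Calling `segment.might_contain("user-1")` returns `True`; for an id that was never added the answer is usually `False`, with a false-positive rate that grows as the filter fills up.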

REQUIREMENTS

• < 50 ms response time
• Random access to large data set > 1B keys
• Near Real-time/streaming
• Distributed
• Scalable
• Fault Tolerant
• Reliable

WHY STORM?

• Simple API
• Scalable
• Fault tolerant
• Guarantees data processing
• Handles parallelization, partitioning, and
retrying on failures when necessary
• Easy to deploy and operate
• Free and open source
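
The "retrying on failures" point can be illustrated with a small plain-Python model. This is not Storm's API or its acker mechanism; `ReplayingSource`, `run`, and `flaky_handler` are hypothetical names. The sketch only shows the idea behind guaranteed processing: a message stays pending until it is acknowledged, and a failed message is queued again, so it may be processed more than once and out of order.

```python
from collections import deque


class ReplayingSource:
    """Keeps each emitted message pending until acked; failed ones are re-queued."""

    def __init__(self, messages):
        self.queue = deque(enumerate(messages))  # (message id, payload)
        self.pending = {}

    def next_message(self):
        if not self.queue:
            return None
        msg_id, payload = self.queue.popleft()
        self.pending[msg_id] = payload
        return msg_id, payload

    def ack(self, msg_id):
        self.pending.pop(msg_id)

    def fail(self, msg_id):
        # Put the payload back at the end of the queue for another attempt.
        self.queue.append((msg_id, self.pending.pop(msg_id)))


def run(source, handler):
    """Drain the source, acking successes and failing (replaying) errors."""
    processed = []
    while (item := source.next_message()) is not None:
        msg_id, payload = item
        try:
            processed.append(handler(payload))
            source.ack(msg_id)
        except RuntimeError:
            source.fail(msg_id)
    return processed


attempts = {}


def flaky_handler(payload):
    # Hypothetical handler that fails the first time it sees "b".
    attempts[payload] = attempts.get(payload, 0) + 1
    if payload == "b" and attempts[payload] == 1:
        raise RuntimeError("transient failure")
    return payload.upper()


source = ReplayingSource(["a", "b", "c"])
results = run(source, flaky_handler)
```

Here "b" fails once, is replayed after "c", and succeeds on the second attempt, which is why handlers in an at-least-once system are usually written to tolerate repeats.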

WHY MAPR M7?

• Configuration is simpler than with HBase
• No region servers
• No compactions, since it is a read-write file system
• Recovery from a cold start is easier. If HBase goes down and has to be restarted, it can take hours, whereas M7 takes minutes. We haven't had to experience a full cold start, but we did have a ZooKeeper failure and had to bounce each node; it was quick.
• NFS Gateway is very useful
• There are plenty of features we haven't taken advantage of yet
• MapR Admin UI is easy to use
HOW DID WE GET HERE?

• Amazon Elastic MapReduce
• Cloudera Hadoop on Amazon Web Services
• MapR M3 (Hadoop MapReduce) on Managed Hosting Service
• MapR M3, Riak, Storm, Kafka, Redis, Play on Managed Hosting Service
• MapR M3, MapR M7 (HBase), Storm, Kafka, Redis, Play on Managed Hosting Service
ARCHITECTURE - Q4 2013

[Diagram] Components: API (Play Framework), MapR M3, Kafka, Redis, Storm Topologies, MapR M7, Dashboard (Play Framework), PostgreSQL
STORM CONCEPTS

[Diagram] Topology, Spout, Bolt, Tuples (key, fields, ...)
QUICK STORM OVERVIEW

• Tuple - a named list of values
• Streams - unbounded sequences of tuples
• Spouts - sources of streams
• Bolts - process any number of input streams and produce any number of output streams
• Topologies - networks of spouts and bolts
• Reliability - Storm guarantees that every tuple will be fully processed
• Workers - processes that each execute a subset of a topology
• Tasks - spout/bolt instances executed by workers
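
The vocabulary above can be mapped onto a toy, single-process model in plain Python. This is deliberately not Storm's API, and it has no workers, tasks, parallelism, or acking; the `Event` fields, the function names, and the "bot-" prefix rule are all hypothetical. It only shows how a spout, two bolts, and the tuples flowing between them fit together as a topology.

```python
from collections import Counter, namedtuple

# A tuple is a named list of values; these field names are hypothetical.
Event = namedtuple("Event", ["entity_id", "action"])


def event_spout():
    """Spout: the source of a stream (here, a fixed in-memory sequence)."""
    yield Event("user-1", "view")
    yield Event("bot-7", "view")
    yield Event("user-1", "click")


def drop_bots_bolt(stream):
    """Bolt: consumes one stream and emits a filtered stream."""
    for event in stream:
        if not event.entity_id.startswith("bot-"):
            yield event


def count_bolt(stream):
    """Bolt: terminal step that aggregates tuples into per-entity counts."""
    counts = Counter()
    for event in stream:
        counts[event.entity_id] += 1
    return counts


# Topology: the wiring of the spout and bolts into one processing graph.
counts = count_bolt(drop_bots_bolt(event_spout()))
```

In a real cluster each of these components would run as many tasks spread across workers, and the streams between them would be partitioned by a grouping rather than passed as Python generators.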
OUR TOPOLOGIES

• Entity Observe
• Kafka Spout
• Bot Detection Bolt
• Entity Observe Bolt
• Real-time Counter Bolt
• Predictive Model Update Bolt
• NLP Feature Extraction of HTML Content
• Entity/Event Attribute Maintenance
ENTITY OBSERVE TOPOLOGY

REAL-TIME COUNTER BOLT

PERFORMANCE METRICS

Play / Kafka       ~ 3000 ops/node
Kafka / Storm      ~ 1650 ops/node
Storm / MapR M7    ~ 5000 ops/node
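
One way to read these figures, assuming the three hops are chained in series and were measured under comparable conditions (the slide does not say so), is that the slowest hop bounds what a node can sustain end to end. A short sketch of that arithmetic, with hypothetical variable names:

```python
# Per-node throughput figures from the slide (ops/node).
stage_throughput = {
    "Play / Kafka": 3000,
    "Kafka / Storm": 1650,
    "Storm / MapR M7": 5000,
}

# The slowest hop bounds the sustained end-to-end rate of a serial pipeline.
bottleneck = min(stage_throughput, key=stage_throughput.get)
ceiling = stage_throughput[bottleneck]

# Headroom of each hop relative to the bottleneck (1.0 means no headroom).
headroom = {name: rate / ceiling for name, rate in stage_throughput.items()}
```

Under that assumption the Kafka / Storm hop is the limiting one, and the Storm / MapR M7 hop has roughly three times its capacity.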

COMPARING M7 WITH CASSANDRA

Workload    Cassandra          MapR M7
1M Put      1,900 ops/node     15,000 ops/node
1M RW       2,000 ops/node     5,000 ops/node (closer to what we see in production)
1B Load     N/A                7,000 ops/node

YCSB benchmark on a 5-node cluster with 24 cores, 192 GB RAM, and 24 disks per node
Cassandra 2.0.x; MapR M7 Pre-Release 3.00

TAMING STORM

• Use monit to keep Nimbus & Supervisors running smoothly
• Local queues that periodically write operational stats (e.g. processing throughput) to Redis & alert the Ops team
• Shaded jars & deployment scripts to keep topologies up to date
• ScBaseRichBolt
  • Write your own base classes to trap framework exceptions and handle them properly
  • Reduce boilerplate code
• Use Murmur Hash to make jobs more efficient by distributing keys more evenly (true for MapReduce as well)
• Storm UI is not reliable (v0.8.2), so you need to roll your own stats; the Storm 0.9 UI should be more reliable
• DataDog used for Dashboards and Alerts
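
The Murmur Hash tip can be sketched as follows. The slide does not say which Murmur variant was used, so the 32-bit MurmurHash3 below is an assumption, and `partition_for`, the key format, and the partition count are hypothetical. The sketch hashes each key and takes the result modulo the number of partitions, which is the usual way a well-mixed hash spreads similar-looking keys evenly across tasks or reducers.

```python
def murmur3_32(data: bytes, seed: int = 0) -> int:
    """MurmurHash3 (x86, 32-bit) of data, returned as an unsigned int."""
    mask = 0xFFFFFFFF
    c1, c2 = 0xCC9E2D51, 0x1B873593

    def rotl(x: int, r: int) -> int:
        return ((x << r) | (x >> (32 - r))) & mask

    h = seed & mask
    full = len(data) - len(data) % 4
    # Body: mix one little-endian 4-byte block at a time.
    for i in range(0, full, 4):
        k = int.from_bytes(data[i:i + 4], "little")
        k = (rotl((k * c1) & mask, 15) * c2) & mask
        h ^= k
        h = (rotl(h, 13) * 5 + 0xE6546B64) & mask
    # Tail: mix the remaining 1-3 bytes, if any.
    tail = data[full:]
    if tail:
        k = int.from_bytes(tail, "little")
        k = (rotl((k * c1) & mask, 15) * c2) & mask
        h ^= k
    # Finalization: make every input bit affect every output bit.
    h ^= len(data)
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & mask
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & mask
    h ^= h >> 16
    return h


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition index in [0, num_partitions)."""
    return murmur3_32(key.encode("utf-8")) % num_partitions


# Spread 10,000 sequential, similar-looking keys over 8 partitions.
loads = [0] * 8
for i in range(10_000):
    loads[partition_for(f"user-{i}", 8)] += 1
```

Inspecting `loads` afterwards shows how many keys landed on each partition; with a well-mixed hash the counts stay close to the 1,250 average even though the keys differ only in their numeric suffix.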
FEATURES ROAD MAP

• Deep learning for feature detection
• Anomaly detection
• Automation of full data science lifecycle,
from exploration and modeling to
production and relearning
• R and Python custom algorithm support
• Automated model training and
optimization

TECHNOLOGY ROAD MAP

• Storm 0.9.0
• Kafka 0.8.0
• Apache Spark
• Play 2.2.x
• Cascading
• Spring XD - eXtreme Data
• Spring Reactor
• Spring Boot
Q&A
Thank you!
