NEW YORK STORM USERS GROUP

Using Storm with MapR M7 for Real-Time Predictive Modeling

January 28, 2014

New York Storm Users Group 2014-01-28 - Using Storm with MapR M7 for Real-Time Predictive Modeling

Velos provides predictive analytics lifecycle and scaling solutions for Enterprise companies. Velos was formerly Sociocast, a SaaS solution with use-case-specific ad tech and e-commerce models running on our own hardware. Velos provides an on-premise platform supporting any models on various production runtimes, such as Hadoop, Storm, Spark and others.

We will discuss the evolution from a Hadoop-only system to an architecture consisting of Storm, Play, Kafka, Redis, MapR M3, and MapR M7 (HBase) to meet our requirements. We will give an overview of the different types of topologies created by Sociocast, with an in-depth review of the topology used for real-time probabilistic and absolute counting. Performance metrics of the platform will be shared, as well as a development road map for the platform.

Published in: Technology

  1. NEW YORK STORM USERS GROUP
     Using Storm with MapR M7 for Real-Time Predictive Modeling
     January 28, 2014
  2. AGENDA
     • Introductions
     • About Velos
     • Our Use Cases
     • Requirements
     • Why Storm?
     • Why MapR M7?
     • How Did We Get Here?
     • Architecture
     • Quick Storm Introduction
     • Our Topologies
     • Performance & Learnings
     • Road Map
     • Q&A
  3. INTRODUCTIONS
     Gna Phetsarath, Director of Technology
     @sourigna
     http://www.linkedin.com/in/sourignaphetsarath/
  4. ABOUT VELOS
     • Velos provides Predictive Analytics lifecycle and scaling solutions for Enterprise companies
     • Formerly Sociocast, a SaaS solution with use-case-specific ad tech and e-commerce models on our own hardware
     • Velos provides an on-premise platform supporting any models on various production runtimes, such as Hadoop, Storm, Spark and others
     • Customers can easily automate ETL, feature engineering, model evaluation, production deployment and monitoring, as well as relearning and adaptation
     • Plug in existing Python, Java, and R models
  5. OUR USE CASES
     • Real-Time Predictive Modeling
     • Real-Time Metrics
     • Atomic Counters
     • Unique Probabilistic Counting (HyperLogLog Plus)
     • Group Membership (Bloom Filters)
     • Page Parsing - NLP Feature Extraction
     • Event/Entity Attribute Maintenance
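
The group-membership use case above names Bloom filters. A minimal sketch of the idea, assuming SHA-256 double hashing and illustrative sizes (`num_bits`, `num_hashes`, and the `user-N` keys are hypothetical, not taken from the deck):

```python
import hashlib
from typing import Iterator


class BloomFilter:
    """Fixed-size Bloom filter: never a false negative, occasionally a false positive."""

    def __init__(self, num_bits: int = 8192, num_hashes: int = 4) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key: str) -> Iterator[int]:
        # Derive all bit positions from two 64-bit halves of a single digest
        # (double hashing), instead of computing num_hashes separate hashes.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key: str) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )


segment = BloomFilter()
for user_id in ("user-1", "user-2", "user-3"):
    segment.add(user_id)

print("user-2" in segment)  # True: a key that was added is always reported present
```

The trade-off is memory for certainty: the filter answers "possibly a member" or "definitely not a member" from a fixed bit array, which is why it suits membership checks against very large key sets.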
  6. REQUIREMENTS
     • < 50 ms response time
     • Random access to large data set > 1B keys
     • Near real-time/streaming
     • Distributed
     • Scalable
     • Fault Tolerant
     • Reliable
  7. WHY STORM?
     • Simple API
     • Scalable
     • Fault tolerant
     • Guarantees data processing
     • Handles parallelization, partitioning, and retrying on failures when necessary
     • Easy to deploy and operate
     • Free and open source
  8. WHY MAPR M7?
     • Configuration is simpler than with HBase
     • No region servers
     • No compaction, since it is a read-write file system
     • Recovery from a cold start is easier. If HBase goes down and has to be restarted, it takes hours; M7 takes minutes. We haven't had to go through a cold start, but we did have a ZooKeeper failure and had to bounce each node, which was quick.
     • NFS Gateway is very useful
     • There are plenty of features we haven't taken advantage of yet
     • MapR Admin UI is easy to use
  9. HOW DID WE GET HERE?
     • Amazon Elastic MapReduce
     • Cloudera Hadoop on Amazon Web Services
     • MapR M3 (Hadoop MapReduce) on Managed Hosting Service
     • MapR M3, Riak, Storm, Kafka, Redis, Play on Managed Hosting Service
     • MapR M3, MapR M7 (HBase), Storm, Kafka, Redis, Play on Managed Hosting Service
  10. ARCHITECTURE - Q4 2013
     • API - Play Framework
     • MapR M3
     • Kafka
     • Redis
     • Storm Topologies
     • MapR M7
     • Dashboard - Play Framework
     • PostgreSQL
  11. STORM CONCEPTS
     • Bolt
     • Spout
     • Tuples (key, fields, ...)
     • Topology
  12. QUICK STORM OVERVIEW
     • Tuple - named list of values
     • Streams - streams of tuples
     • Spouts - a source of streams
     • Bolts - process any number of input streams and produce any number of output streams
     • Topologies - a network of spouts and bolts
     • Reliability - guarantees that every tuple will be fully processed
     • Workers - execute a subset of a topology
     • Tasks - executed by workers for bolts/spouts
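
The concepts above can be illustrated with a toy, single-process model. This is a sketch in plain Python, not Storm's API: `sentence_spout`, `split_bolt`, `CountBolt`, and `run_topology` are hypothetical names, the topology is a simple linear chain, and workers, tasks, and reliability tracking are left out.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, Iterator, List

StormTuple = Dict[str, object]  # a tuple is a named list of values
Bolt = Callable[[StormTuple], Iterable[StormTuple]]


def sentence_spout(sentences: Iterable[str]) -> Iterator[StormTuple]:
    """Spout: the source of a stream of tuples."""
    for sentence in sentences:
        yield {"sentence": sentence}


def split_bolt(tup: StormTuple) -> Iterator[StormTuple]:
    """Bolt: consumes one tuple and emits zero or more new tuples."""
    for word in str(tup["sentence"]).split():
        yield {"word": word.lower()}


class CountBolt:
    """Stateful bolt: keeps a running count per word."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()

    def __call__(self, tup: StormTuple) -> Iterator[StormTuple]:
        word = str(tup["word"])
        self.counts[word] += 1
        yield {"word": word, "count": self.counts[word]}


def run_topology(spout: Iterable[StormTuple], bolts: List[Bolt]) -> List[StormTuple]:
    """Topology: wire the spout to a linear chain of bolts and drain the stream."""
    emitted: List[StormTuple] = []
    for tup in spout:
        stream = [tup]
        for bolt in bolts:
            stream = [out for item in stream for out in bolt(item)]
        emitted.extend(stream)
    return emitted


word_counter = CountBolt()
results = run_topology(
    sentence_spout(["Storm processes streams", "storm scales"]),
    [split_bolt, word_counter],
)
print(word_counter.counts["storm"])  # 2
```

In a real cluster each spout and bolt runs as many parallel tasks spread across workers, and the framework decides which task receives each tuple; the chain here only shows how tuples flow from a source through processing steps.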
  13. OUR TOPOLOGIES
     • Entity Observe
     • Kafka Spout
     • Bot Detection Bolt
     • Entity Observe Bolt
     • Real-time Counter Bolt
     • Predictive Model Update Bolt
     • NLP Feature Extraction of HTML Content
     • Entity/Event Attribute Maintenance
  14. ENTITY OBSERVE TOPOLOGY
  15. REAL-TIME COUNTER BOLT
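
The talk abstract describes this topology as doing real-time probabilistic and absolute counting. A rough sketch of the two kinds of counter side by side follows. It is not the deck's implementation: it uses plain HyperLogLog rather than HyperLogLog Plus, keeps state in memory rather than in MapR M7, and `RealTimeCounter`, `observe`, and the key names are hypothetical.

```python
import hashlib
import math
from collections import Counter
from typing import Dict


class HyperLogLog:
    """Plain HyperLogLog estimator of distinct items (precision should be >= 7)."""

    def __init__(self, precision: int = 12) -> None:
        self.p = precision
        self.m = 1 << precision
        self.registers = bytearray(self.m)

    def add(self, item: str) -> None:
        x = int.from_bytes(hashlib.sha1(item.encode("utf-8")).digest()[:8], "big")
        idx = x >> (64 - self.p)                      # first p bits pick a register
        rest = x & ((1 << (64 - self.p)) - 1)         # remaining 64 - p bits
        rank = (64 - self.p) - rest.bit_length() + 1  # position of the leftmost 1-bit
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def estimate(self) -> float:
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw


class RealTimeCounter:
    """Absolute event counts plus approximate unique counts, per key."""

    def __init__(self) -> None:
        self.events: Counter = Counter()
        self.uniques: Dict[str, HyperLogLog] = {}

    def observe(self, key: str, visitor_id: str) -> None:
        self.events[key] += 1
        self.uniques.setdefault(key, HyperLogLog()).add(visitor_id)


page_counter = RealTimeCounter()
for i in range(30000):
    page_counter.observe("page:/home", f"visitor-{i % 10000}")

print(page_counter.events["page:/home"])                     # 30000 (exact)
print(round(page_counter.uniques["page:/home"].estimate()))  # near 10000 (approximate)
```

The absolute counter grows by one per event, while the unique counter uses a fixed 4,096 one-byte registers per key no matter how many visitors it sees, at the cost of a small estimation error.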
  16. PERFORMANCE METRICS
     • Play / Kafka: ~ 3000 ops/node
     • Kafka / Storm: ~ 1650 ops/node
     • Storm / MapR M7: ~ 5000 ops/node
  17. COMPARING M7 WITH CASSANDRA
     • 1M Put: 1,900 ops/n | 15,000 ops/n
     • 1M RW: 2,000 ops/n | 5,000 ops/n (closer to what we see in production)
     • 1B Load: N/A | 7,000 ops/n
     YCSB benchmark on a 5-node cluster with 24 cores, 192 GB RAM, and 24 disks per node
     Cassandra 2.0.x; MapR M7 Pre-Release 3.00
  18. TAMING STORM
     • Use monit to keep Nimbus & Supervisors running smoothly
     • Local queues that periodically write operational stats to Redis (e.g. processing throughput) & alert the Ops team
     • Shaded jars & deployment scripts to keep topologies up to date
     • ScBaseRichBolt
       • Write your own base classes to trap framework exceptions and handle them properly
       • Reduce boilerplate code
     • Use Murmur Hash to make jobs more efficient by distributing keys more evenly (true for MapReduce as well)
     • Storm UI is not reliable (v0.8.2), so you need to roll out your own stats; Storm 0.9 UI should be more reliable
     • DataDog used for Dashboards and Alerts
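
The base-class advice above can be sketched as a template method: the base class owns the try/except and the ack/fail bookkeeping, so each concrete bolt only writes its own logic. The deck does not show ScBaseRichBolt itself; everything below (`SafeBaseBolt`, `RecordingCollector`, `ParseAmountBolt`) is a hypothetical plain-Python stand-in, not Storm's API.

```python
import logging
from typing import Dict, List

log = logging.getLogger("topology")
StormTuple = Dict[str, object]


class RecordingCollector:
    """Hypothetical stand-in for an output collector; records ack/fail calls."""

    def __init__(self) -> None:
        self.acked: List[StormTuple] = []
        self.failed: List[StormTuple] = []

    def ack(self, tup: StormTuple) -> None:
        self.acked.append(tup)

    def fail(self, tup: StormTuple) -> None:
        self.failed.append(tup)


class SafeBaseBolt:
    """Base class: subclasses implement process(); execute() traps exceptions."""

    def __init__(self, collector: RecordingCollector) -> None:
        self.collector = collector

    def process(self, tup: StormTuple) -> None:
        raise NotImplementedError

    def execute(self, tup: StormTuple) -> None:
        try:
            self.process(tup)
        except Exception:
            # Log and fail the tuple instead of letting the error escape the bolt.
            log.exception("%s failed on %r", type(self).__name__, tup)
            self.collector.fail(tup)
        else:
            self.collector.ack(tup)


class ParseAmountBolt(SafeBaseBolt):
    """Example subclass: no error handling or ack/fail boilerplate needed here."""

    def __init__(self, collector: RecordingCollector) -> None:
        super().__init__(collector)
        self.total = 0.0

    def process(self, tup: StormTuple) -> None:
        self.total += float(str(tup["amount"]))


collector = RecordingCollector()
bolt = ParseAmountBolt(collector)
for raw in ("2.5", "oops", "4"):
    bolt.execute({"amount": raw})

print(bolt.total, len(collector.acked), len(collector.failed))  # 6.5 2 1
```

Centralizing the error path this way also gives one place to add the operational stats mentioned in the same slide, such as counting failures before they are reported.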
  19. FEATURES ROAD MAP
     • Deep learning for feature detection
     • Anomaly detection
     • Automation of full data science lifecycle, from exploration and modeling to production and relearning
     • R and Python custom algorithm support
     • Automated model training and optimization
  20. TECHNOLOGY ROAD MAP
     • Storm 0.9.0
     • Kafka 0.8.0
     • Apache Spark
     • Play 2.2.x
     • Cascading
     • Spring XD - eXtreme Data
     • Spring Reactor
     • Spring Boot
  21. Q&A
  22. Thank you!
