New York Storm Users Group 2014-01-28 - Using Storm with MapR M7 for Real-Time Predictive Modeling

NEW YORK STORM
USERS GROUP

Using Storm with MapR M7
for Real-Time Predictive
Modeling

January 28, 2014

AGENDA

• Introductions
• About Velos
• Our Use Cases
• Requirements
• Why Storm?
• Why MapR M7?
• How Did We Get Here?
• Architecture
• Quick Storm Introduction
• Our Topologies
• Performance & Learnings
• Road Map
• Q&A

INTRODUCTIONS

Gna Phetsarath
Director of Technology
@sourigna
http://www.linkedin.com/in/sourignaphetsarath/


ABOUT VELOS

• Velos provides Predictive Analytics lifecycle and scaling
  solutions for Enterprise companies
• Formerly Sociocast, a SaaS solution with use-case-specific
  ad tech and e-commerce models on our own hardware
• Velos provides an on-premise platform supporting any
  models on various production runtimes, such as Hadoop,
  Storm, Spark and others
• Customers can easily automate ETL, feature engineering,
  model evaluation, production deployment and
  monitoring, as well as relearning and adaptation
• Plug in existing Python, Java, and R models

OUR USE CASES

• Real-Time Predictive Modeling
• Real-Time Metrics
• Atomic Counters
• Unique Probabilistic Counting (HyperLogLog Plus)
• Group Membership (Bloom Filters)
• Page Parsing - NLP Feature Extraction
• Event/Entity Attribute Maintenance
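
To make the group-membership use case concrete, here is a minimal Bloom filter sketch. It is not the implementation from the talk: the class name, the sizing defaults, and the SHA-256 based double hashing are assumptions chosen so the example runs with only the Python standard library.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter: no false negatives, tunable false positives."""

    def __init__(self, num_bits: int = 1 << 16, num_hashes: int = 5) -> None:
        # Sizing defaults are illustrative assumptions, not values from the talk.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key: str):
        # Double hashing: derive k bit positions from two 64-bit halves
        # of a single SHA-256 digest (position_i = h1 + i * h2).
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        # Force an odd stride so the k positions differ when num_bits
        # is a power of two.
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key: str) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(key)
        )


# Hypothetical audience segment keyed by user id.
segment = BloomFilter()
for user_id in ("user-1", "user-2", "user-3"):
    segment.add(user_id)

print("user-2" in segment)  # True: members are never reported as absent
print("user-999" in segment)
```

The trade-off is a small, configurable false-positive rate in exchange for constant memory per segment, which suits membership checks over very large key sets.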


REQUIREMENTS

• < 50 ms response time
• Random access to large data set > 1B keys
• Near Real-time/streaming
• Distributed
• Scalable
• Fault Tolerant
• Reliable


WHY STORM?

• Simple API
• Scalable
• Fault tolerant
• Guarantees data processing
• Handles parallelization, partitioning, and
retrying on failures when necessary
• Easy to deploy and operate
• Free and open source


WHY MAPR M7?

• Configuration is simpler than with HBase
• No region servers
• No compactions, since it is a read-write file system
• Recovery from cold starts is easier. If HBase goes down
  and has to be restarted, recovery takes a long time (hours),
  whereas with M7 it takes minutes. We haven't had to
  experience that, but we did have a ZooKeeper failure and
  had to bounce each node; it was quick.
• NFS Gateway is very useful
• There are plenty of features we haven't taken advantage
  of yet
• MapR Admin UI is easy to use

HOW DID WE GET HERE?

• Amazon Elastic MapReduce
• Cloudera Hadoop on Amazon Web Services
• MapR M3 (Hadoop MapReduce) on
  Managed Hosting Service
• MapR M3, Riak, Storm, Kafka, Redis, Play
  on Managed Hosting Service
• MapR M3, MapR M7 (HBase), Storm, Kafka,
  Redis, Play on Managed Hosting Service


ARCHITECTURE - Q4 2013

[Architecture diagram; component labels: API (Play Framework), Kafka, Storm Topologies, Redis, MapR M3, MapR M7, Dashboard (Play Framework), PostgreSQL]


STORM CONCEPTS

[Diagram labels: Spout, Bolt, Tuples (key, fields, ...), Topology]

QUICK STORM OVERVIEW

• Tuple - a named list of values
• Streams - unbounded sequences of tuples
• Spouts - sources of streams
• Bolts - process any number of input streams
  and produce any number of output streams
• Topologies - a network of spouts and bolts
• Reliability - guarantees that every tuple will be
  fully processed
• Workers - processes that each execute a subset of a topology
• Tasks - executed by workers on behalf of bolts/spouts
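
The vocabulary above can be illustrated with a toy, single-process pipeline. This is a sketch of the concepts only: it does not use the Storm API, it has no parallelism, acking, or retries, and every function name and event in it is hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, Set, Tuple

# A tuple is modelled as a named list of values: field name -> value.
StormTuple = Dict[str, str]


def page_view_spout(events: Iterable[Tuple[str, str]]) -> Iterator[StormTuple]:
    """Spout: the source of the stream (here a finite list, not Kafka)."""
    for user, url in events:
        yield {"user": user, "url": url}


def bot_filter_bolt(stream: Iterable[StormTuple], bots: Set[str]) -> Iterator[StormTuple]:
    """Bolt: consumes one stream and emits a filtered stream."""
    for tup in stream:
        if tup["user"] not in bots:
            yield tup


def counter_bolt(stream: Iterable[StormTuple]) -> Counter:
    """Terminal bolt: keeps a running count per URL."""
    counts: Counter = Counter()
    for tup in stream:
        counts[tup["url"]] += 1
    return counts


def run_topology(events: Iterable[Tuple[str, str]], bots: Set[str]) -> Counter:
    """Topology: wires spout -> bolt -> bolt into one pipeline."""
    return counter_bolt(bot_filter_bolt(page_view_spout(events), bots))


events = [
    ("alice", "/home"),
    ("crawler", "/home"),
    ("bob", "/home"),
    ("alice", "/cart"),
]
counts = run_topology(events, bots={"crawler"})
print(dict(counts))  # {'/home': 2, '/cart': 1}
```

In a real topology each spout and bolt would run as many tasks spread across workers, and the framework would track each tuple until it is fully processed.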

OUR TOPOLOGIES

• Entity Observe
• Kafka Spout
• Bot Detection Bolt
• Entity Observe Bolt
• Real-time Counter Bolt
• Predictive Model Update Bolt
• NLP Feature Extraction of HTML Content
• Entity/Event Attribute Maintenance

REAL-TIME COUNTER BOLT


PERFORMANCE METRICS

Play / Kafka       ~3000 ops/node
Kafka / Storm      ~1650 ops/node
Storm / MapR M7    ~5000 ops/node


COMPARING M7 WITH CASSANDRA

Workload    Cassandra 2.0.x    MapR M7 (Pre-Release 3.00)
1M Put      1,900 ops/node     15,000 ops/node
1M RW       2,000 ops/node     5,000 ops/node (closer to what we see in production)
1B Load     N/A                7,000 ops/node

YCSB benchmark on a 5-node cluster with 24 cores, 192 GB RAM, and 24 disks per node
Cassandra 2.0.x; MapR M7 Pre-Release 3.00


TAMING STORM

• Use monit to keep Nimbus & Supervisors running smoothly
• Local queues that periodically write operational stats to Redis
  (e.g. processing throughput) & alert Ops team
• Shaded jars & deployment scripts to keep topologies up to date
• ScBaseRichBolt
  • Write your own base classes to trap framework exceptions and
    handle them properly
  • Reduce boilerplate code
• Use Murmur Hash to make jobs more efficient by distributing
  keys more evenly (true for MapReduce as well)
• Storm UI is not reliable (v0.8.2), so you need to roll out your own
  stats; the Storm 0.9 UI should be more reliable
• DataDog used for dashboards and alerts
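
As an illustration of the Murmur Hash point, the sketch below hashes keys with a pure-Python MurmurHash3 (x86, 32-bit) and assigns them to partitions. The slides do not say which Murmur variant or partitioning scheme was used, so the variant, the `partition_for` helper, and the `entity-N` key names are all assumptions.

```python
from collections import Counter

MASK32 = 0xFFFFFFFF


def _rotl32(x: int, r: int) -> int:
    """Rotate a 32-bit value left by r bits."""
    return ((x << r) | (x >> (32 - r))) & MASK32


def murmur3_32(data: bytes, seed: int = 0) -> int:
    """MurmurHash3 x86 32-bit, returned as an unsigned integer."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    h = seed & MASK32
    n_blocks = len(data) // 4

    # Body: mix each 4-byte little-endian block into the state.
    for i in range(n_blocks):
        k = int.from_bytes(data[4 * i:4 * i + 4], "little")
        k = (k * c1) & MASK32
        k = _rotl32(k, 15)
        k = (k * c2) & MASK32
        h ^= k
        h = _rotl32(h, 13)
        h = (h * 5 + 0xE6546B64) & MASK32

    # Tail: mix the remaining 0-3 bytes.
    tail = data[4 * n_blocks:]
    if tail:
        k = int.from_bytes(tail, "little")
        k = (k * c1) & MASK32
        k = _rotl32(k, 15)
        k = (k * c2) & MASK32
        h ^= k

    # Finalization: fold in the length and avalanche the bits.
    h ^= len(data)
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & MASK32
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & MASK32
    h ^= h >> 16
    return h


def partition_for(key: str, num_partitions: int) -> int:
    """Hypothetical helper: map a key to one of num_partitions buckets."""
    return murmur3_32(key.encode("utf-8")) % num_partitions


# Sequential keys are a common worst case for weak hash functions.
keys = [f"entity-{i}" for i in range(10_000)]
load = Counter(partition_for(k, 8) for k in keys)
print(sorted(load.items()))
```

With a well-mixed hash, each of the 8 partitions should receive close to 1,250 of the 10,000 keys, so no single task or reducer becomes a hot spot.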

FEATURES ROAD MAP

• Deep learning for feature detection
• Anomaly detection
• Automation of full data science lifecycle,
from exploration and modeling to
production and relearning
• R and Python custom algorithm support
• Automated model training and
optimization


TECHNOLOGY ROAD MAP

• Storm 0.9.0
• Kafka 0.8.0
• Apache Spark
• Play 2.2.x
• Cascading
• Spring XD - eXtreme Data
• Spring Reactor
• Spring Boot
