Multi-tenant Storm as a Service
(on and off Hadoop)
Hi I’m Bobby (bobby@apache.org)
 Low Latency Data Processing Architect at Yahoo.
› My team and I provide Apache Storm as a service to Yahoo.
› We also maintain Spark at Yahoo, but that is another talk.
Thursday June 5th @ 11:50 Spark-on-YARN: Empower Spark Applications on Hadoop Cluster
› And we get to play around with deep learning and online machine learning too.
 Committer and PMC/PPMC member for
› Apache Storm incubating
› Apache Hadoop
› Apache Spark
› Apache TEZ incubating
Agenda
 Storm and YARN Overview
 Why?
 Securing Standalone Storm
 Storm on YARN
 What’s Next?
Storm Concepts
1. Streams
› Unbounded sequence of tuples
2. Spout
› Source of Stream
› E.g. Read from Twitter streaming API
3. Bolts
› Processes input streams and produces new streams
› E.g. Functions, Filters, Aggregation, Joins
4. Topologies
› Network of spouts and bolts
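The four concepts above can be sketched in plain Python. This is a toy model only: it does not use the Storm API, and every name in it (`sentence_spout`, `split_bolt`, `filter_bolt`, `count_bolt`, `run_topology`) is hypothetical. Generators stand in for streams, and a finite list stands in for an unbounded source.

```python
from collections import Counter
from typing import Iterable, Iterator, List, Tuple


def sentence_spout(sentences: Iterable[str]) -> Iterator[Tuple[str]]:
    """Spout: the source of a stream. A real spout reads an unbounded
    source; this one replays a fixed list so the sketch terminates."""
    for sentence in sentences:
        yield (sentence,)


def split_bolt(stream: Iterator[Tuple[str]]) -> Iterator[Tuple[str]]:
    """Bolt (function): emits one tuple per word of each input sentence."""
    for (sentence,) in stream:
        for word in sentence.split():
            yield (word,)


def filter_bolt(stream: Iterator[Tuple[str]], min_len: int) -> Iterator[Tuple[str]]:
    """Bolt (filter): drops words shorter than min_len."""
    for (word,) in stream:
        if len(word) >= min_len:
            yield (word,)


def count_bolt(stream: Iterator[Tuple[str]]) -> Iterator[Tuple[str, int]]:
    """Bolt (aggregation): emits a running count for each word seen."""
    counts: Counter = Counter()
    for (word,) in stream:
        counts[word] += 1
        yield (word, counts[word])


def run_topology(sentences: Iterable[str]) -> List[Tuple[str, int]]:
    """Topology: a network wiring one spout into a chain of bolts."""
    words = split_bolt(sentence_spout(sentences))
    return list(count_bolt(filter_bolt(words, min_len=3)))


if __name__ == "__main__":
    for out in run_topology(["a storm is a storm"]):
        print(out)
```

In a real cluster each spout and bolt runs as many parallel tasks spread across worker processes; the single-threaded chain here only shows how the pieces connect.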
Storm Architecture
[Diagram: Nimbus (master node); three ZooKeeper nodes (cluster coordination); four Supervisors, each of which launches Worker processes.]
YARN
[Diagram: Clients submit jobs to the Resource Manager; Node Managers run Containers and App Masters. Labeled flows: MapReduce Status, Job Submission, Node Status, Resource Request.]
Why?
Short Term:
 SOX Compliance (Security)
 Reduced Operations Overhead
 Centralized Knowledge
 Managed Updates
 Some Elasticity
Longer Term:
 Elasticity
 Utilization
Securing Standalone Storm
Authenticating Each Connection
6/17/2014
Authentication By Type
 HTTP – Using HTTP Authentication or with a Custom Java Servlet Filter.
 Thrift – Kerberos (Possibly through a forwarded TGT)
 ZooKeeper
› Kerberos for system processes (Because there is a keytab available)
› A shared secret for worker processes, with MD5SUM in ZooKeeper.
 File System – OS user/group + FS permissions.
 Worker to Worker – Can use encryption with a shared secret, but we really need to add in SASL Auth.
 External Services (like HBase) – Sorry it is up to you (Sort of …)
Credentials Push
(Authenticating with External Services)
APIs to deliver credentials to a Topology.
 ICredentialsListener – informed of credentials updates.
 IAutoCredentials – automatically include credentials to push.
 ICredentialsRenewer – renew credentials.
 Push new Credentials
› storm upload_credentials
› StormSubmitter.pushCredentials
 AutoTGT – push forwardable TGT to topology.
› Also logs you into Hadoop/HBase if needed
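The roles of the three interfaces can be illustrated with a small plain-Python model. This is a conceptual sketch, not the Storm API: `TopologyCredentials`, `with_auto_credentials`, and `fake_auto_tgt` are hypothetical names, credentials are assumed to be a simple string-to-string map, and the "ticket" is a placeholder string rather than a real Kerberos TGT.

```python
from typing import Callable, Dict, Iterable, List

Credentials = Dict[str, str]


class TopologyCredentials:
    """Toy stand-in for the credentials held by a running topology."""

    def __init__(self, initial: Credentials) -> None:
        self._creds: Credentials = dict(initial)
        self._listeners: List[Callable[[Credentials], None]] = []

    def add_listener(self, listener: Callable[[Credentials], None]) -> None:
        # Plays the part of a credentials listener: called on every update.
        self._listeners.append(listener)

    def push(self, new_creds: Credentials) -> None:
        # Models a credentials push: replace the map, then notify listeners.
        self._creds = dict(new_creds)
        for listener in self._listeners:
            listener(dict(self._creds))

    def current(self) -> Credentials:
        return dict(self._creds)


def with_auto_credentials(
    base: Credentials,
    providers: Iterable[Callable[[Credentials], None]],
) -> Credentials:
    """Models auto-credentials: each provider adds its entries before a push."""
    creds = dict(base)
    for populate in providers:
        populate(creds)
    return creds


def fake_auto_tgt(creds: Credentials) -> None:
    # Placeholder for a provider that would attach a forwardable ticket.
    creds["tgt"] = "forwardable-ticket-placeholder"


if __name__ == "__main__":
    topo = TopologyCredentials({"token": "v1"})
    topo.add_listener(lambda c: print("updated:", sorted(c)))
    topo.push(with_auto_credentials({"token": "v2"}, [fake_auto_tgt]))
```

A renewer would sit outside the topology and periodically produce a fresh map to push; that part is omitted here.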
Authorization
The IAuthorizer plugin lets you decide what is and isn’t allowed.
SimpleACLAuthorizer for Nimbus.
 Different roles for users
› Administrators can do anything.
› Supervisors
› Users
 Topology can configure access to itself as well (rebalance).
DRPCSimpleACLAuthorizer for DRPC.
 Can configure client and topology users per function.
 Can default open or closed.
Topology can also whitelist users to view info through UI and Logviewer
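The role-based checks described above might look roughly like the following plain-Python sketch. It is not the SimpleACLAuthorizer or DRPCSimpleACLAuthorizer code: the class names, the operation sets, and the exact rules for each role are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set

# Hypothetical operation groupings; the real authorizers define their own.
USER_OPS = {"submitTopology"}
SUPERVISOR_OPS = {"fileDownload"}
TOPOLOGY_OPS = {"killTopology", "rebalance"}


@dataclass
class SimpleAcl:
    admins: Set[str] = field(default_factory=set)
    supervisors: Set[str] = field(default_factory=set)

    def permit(self, user: str, operation: str,
               topology_users: Optional[Set[str]] = None) -> bool:
        if user in self.admins:
            return True  # administrators can do anything
        if user in self.supervisors:
            return operation in SUPERVISOR_OPS
        if operation in USER_OPS:
            return True  # any authenticated user
        if operation in TOPOLOGY_OPS:
            # The topology itself lists who may act on it.
            return topology_users is not None and user in topology_users
        return False


@dataclass
class DrpcAcl:
    # function name -> users allowed to call it
    function_clients: Dict[str, Set[str]] = field(default_factory=dict)
    default_open: bool = False  # policy for functions with no ACL entry

    def permit_client(self, user: str, function: str) -> bool:
        allowed = self.function_clients.get(function)
        if allowed is None:
            return self.default_open
        return user in allowed


if __name__ == "__main__":
    acl = SimpleAcl(admins={"root"}, supervisors={"supervisor"})
    print(acl.permit("alice", "rebalance", topology_users={"alice"}))
    print(DrpcAcl({"exclaim": {"bob"}}).permit_client("carol", "exclaim"))
```

Defaulting `default_open` to `False` is a choice made for this sketch; the slide only says the default can go either way.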
Multi-tenancy
supervisor.run.worker.as.user: true
We modified code from Hadoop to let the Supervisor launch workers as the user that ran the topology.
Multi-tenant Scheduler
 Lets admins provision resource allotments per user instead of per topology
› Users decide how to divide up their resources per topology
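A minimal sketch of the per-user allotment idea, in plain Python. It is not the scheduler's implementation: the function name `remaining_allotment` is hypothetical, and the unit of resource is assumed to be whole nodes purely for illustration.

```python
from typing import Dict


def remaining_allotment(
    allotments: Dict[str, int],
    requests: Dict[str, Dict[str, int]],
) -> Dict[str, int]:
    """Check each user's topologies against that user's allotment.

    allotments: user -> nodes granted by the admin.
    requests:   user -> {topology name: nodes requested}.
    Returns user -> nodes still unused. Raises ValueError when a user's
    topologies together ask for more than the user was granted.
    """
    remaining: Dict[str, int] = {}
    for user in set(allotments) | set(requests):
        granted = allotments.get(user, 0)
        used = sum(requests.get(user, {}).values())
        if used > granted:
            raise ValueError(
                f"{user} requested {used} nodes but is allotted {granted}"
            )
        remaining[user] = granted - used
    return remaining


if __name__ == "__main__":
    admin_allotments = {"alice": 10, "bob": 4}
    # alice splits her 10 nodes across two topologies as she sees fit.
    user_requests = {"alice": {"clicks": 6, "search": 3}}
    print(remaining_allotment(admin_allotments, user_requests))
```

The admin only sets the per-user totals; how a user divides them among topologies is never the admin's concern, which is the point of the slide.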
Available Now
Code:
https://github.com/apache/incubator-storm/tree/security
Instructions:
https://github.com/apache/incubator-storm/blob/security/SECURITY.md
Pull Request:
https://github.com/apache/incubator-storm/pull/121
Storm on YARN (Launching a Cluster)
Currently
 A standalone Storm cluster running on YARN
 Has some hacks to avoid port conflicts
 No security
 No recovery if AM goes down
Available Now
https://github.com/yahoo/storm-yarn
And we plan to push this back into Apache Storm (incubating) once security is merged to master.
What’s Next?
(If you see anything you like we are hiring…)
 Nimbus HA/Recovery.
 Long lived secure processes in YARN.
 Ephemeral ports for Storm.
 Combine the AM and Nimbus.
 Do we need a Supervisor if we have a Node Manager?
 Possibly run as Unmanaged AMs and Proxy Users.
 Elasticity for Storm topologies.
 Resource aware scheduling/requests in Storm.
 Network aware scheduling in YARN and Storm.
 Automatic fetching of delegation tokens, like Oozie.
Questions?
We are hiring!
Stop by Kiosk P9 or reach out to us at bigdata@yahoo-inc.com.
Backup Slides
Why Not…
No need for a religious war; there are lots of good options out there and we picked one.
Apache Spark Streaming
 We started before Spark Streaming was a possibility.
 Storm is currently more advanced in many areas, but not in all.
› Fault Tolerance (I can turn it off in Storm)
S4
 The community for Storm was more active
 Fault Tolerance (I can turn it on in Storm)
[Diagram: a Worker process containing Tasks (Spout A-1, Spout A-5, Spout A-9, Bolt B-3, Bolt B-7, Acker) and a Disruptor Queue.]

Multi-Tenant Storm Service on Hadoop Grid