Multi-tenant Storm as a Service
(on and off Hadoop)
Hi, I’m Bobby (email@example.com)
Low Latency Data Processing Architect at Yahoo.
› My team and I provide Apache Storm as a service to Yahoo.
› We also maintain Spark at Yahoo, but that is another talk.
Thursday June 5th @ 11:50 Spark-on-YARN: Empower Spark Applications on Hadoop Cluster
› And we get to play around with deep learning and online machine learning too.
Committer and PMC/PPMC member for
› Apache Storm incubating
› Apache Hadoop
› Apache Spark
› Apache Tez incubating
Streams
› Unbounded sequence of tuples
Spouts
› Source of streams
› E.g. read from the Twitter streaming API
Bolts
› Process input streams and produce new streams
› E.g. functions, filters, aggregation
Topology
› Network of spouts and bolts
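To make the stream, spout, bolt, and topology vocabulary concrete, here is a minimal in-memory sketch. It is a toy model, not Storm's API: the function names (`sentence_spout`, `split_bolt`, `filter_bolt`, `count_bolt`, `run_topology`) are made up for illustration, and a fixed list stands in for a live source such as the Twitter streaming API.

```python
from collections import Counter
from typing import Callable, Iterable, Iterator, Tuple

# Toy model only; none of these names come from Storm.


def sentence_spout() -> Iterator[Tuple[str]]:
    """Spout: the source of a stream (a fixed list stands in for a live feed)."""
    for sentence in ["storm on yarn", "storm as a service"]:
        yield (sentence,)


def split_bolt(stream: Iterable[Tuple[str]]) -> Iterator[Tuple[str]]:
    """Bolt acting as a function: emits one tuple per word of each sentence."""
    for (sentence,) in stream:
        for word in sentence.split():
            yield (word,)


def filter_bolt(
    stream: Iterable[Tuple[str]], keep: Callable[[Tuple[str]], bool]
) -> Iterator[Tuple[str]]:
    """Bolt acting as a filter: passes through only the tuples `keep` accepts."""
    for tup in stream:
        if keep(tup):
            yield tup


def count_bolt(stream: Iterable[Tuple[str]]) -> Counter:
    """Bolt acting as an aggregation: counts occurrences of each word."""
    counts: Counter = Counter()
    for (word,) in stream:
        counts[word] += 1
    return counts


def run_topology() -> Counter:
    """Topology: the network that wires the spout into the chain of bolts."""
    words = split_bolt(sentence_spout())
    long_words = filter_bolt(words, lambda tup: len(tup[0]) > 2)
    return count_bolt(long_words)
```

Calling `run_topology()` drains the finite toy stream and returns the word counts. A real topology runs continuously over an unbounded stream and spreads spouts and bolts across worker processes.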
Authentication By Type
HTTP – Using HTTP Authentication or with a Custom Java Servlet
Thrift – Kerberos (Possibly through a forwarded TGT)
› Kerberos for system processes (Because there is a keytab available)
› A shared secret for worker processes with MD5SUM in ZooKeeper.
File System – OS user/group + FS permissions.
Worker to Worker – Can use encryption with shared secret, but we
really need to add in SASL Auth.
External Services (like HBase) – Sorry it is up to you (Sort of …)
(Authenticating with External Services)
APIs to deliver credentials to a Topology.
ICredentialsListener – informed of credentials updates.
IAutoCredentials – automatically include credentials to push.
ICredentialsRenewer – renew credentials.
Push new Credentials
› storm upload_credentials
AutoTGT – push forwardable TGT to topology.
› Also logs you into Hadoop/HBase if needed
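The division of labor between the credential APIs above can be sketched roughly as follows. This is a toy model under stated assumptions, not Storm's Java interfaces: `ToyTopology`, `auto_credentials`, and the `"TGT"` key are hypothetical names chosen for illustration. Plugins contribute credentials before a push (the IAutoCredentials role), and listeners inside the topology are told whenever new credentials arrive (the ICredentialsListener role).

```python
from typing import Callable, Dict, Iterable, List

Credentials = Dict[str, str]


class ToyTopology:
    """Holds the latest credentials and notifies registered listeners."""

    def __init__(self) -> None:
        self.credentials: Credentials = {}
        self._listeners: List[Callable[[Credentials], None]] = []

    def add_listener(self, listener: Callable[[Credentials], None]) -> None:
        """Register a callback that is informed of every credentials update."""
        self._listeners.append(listener)

    def push_credentials(self, new_credentials: Credentials) -> None:
        """Replace the stored credentials and inform every listener."""
        self.credentials = dict(new_credentials)
        for listener in self._listeners:
            listener(dict(self.credentials))  # hand out a copy


def auto_credentials(
    base: Credentials, plugins: Iterable[Callable[[], Credentials]]
) -> Credentials:
    """Merge the credentials contributed by each plugin into one map to push."""
    merged = dict(base)
    for plugin in plugins:
        merged.update(plugin())
    return merged
```

A renewer would sit outside the topology and periodically call something like `push_credentials` with a refreshed ticket, which is the effect a user gets by hand from `storm upload_credentials`.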
IAuthorizer plugin allows you to decide what is and isn’t allowed
SimpleACLAuthorizer for Nimbus.
Different roles for users
› Administrators can do anything.
Topology can configure access to itself as well (rebalance).
DRPCSimpleACLAuthorizer for DRPC.
Can configure client and topology users per function.
Can default open or closed.
Topology can also whitelist users to view info through UI and Logviewer
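The authorization rules above can be sketched in a few lines. This is an illustrative model, not the code of SimpleACLAuthorizer or DRPCSimpleACLAuthorizer: the names `nimbus_permit` and `ToyDrpcAcl`, the `"client"` and `"topology"` role strings, and the shape of the ACL dictionary are all assumptions.

```python
from typing import Dict, Set


def nimbus_permit(user: str, admins: Set[str], topology_users: Set[str]) -> bool:
    """Administrators can do anything; other users need to be listed by the topology."""
    if user in admins:
        return True
    return user in topology_users


class ToyDrpcAcl:
    """Per-function ACL with separate client and topology user sets."""

    def __init__(self, acl: Dict[str, Dict[str, Set[str]]], default_open: bool) -> None:
        self.acl = acl
        self.default_open = default_open

    def permit(self, user: str, function: str, role: str) -> bool:
        """`role` is "client" (invokes the function) or "topology" (serves it)."""
        entry = self.acl.get(function)
        if entry is None:
            # No ACL for this function: fall back to the open or closed default.
            return self.default_open
        return user in entry.get(role, set())
```

With `default_open=False`, a function with no ACL entry is refused for everyone, which is the safer choice on a shared cluster.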
Modified code from Hadoop to let Supervisor launch workers as the user
that ran the topology.
Admins set resource allotments per user instead of per topology
› Users decide how to divide up their resources per topology
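A rough sketch of the per-user accounting this implies: the admin fixes one allotment per user, and each user splits it across their own topologies. The function name `remaining_allotment` and the choice of whole nodes as the unit are assumptions for illustration, not the scheduler's actual implementation.

```python
from typing import Dict


def remaining_allotment(
    allotments: Dict[str, int],
    requests: Dict[str, Dict[str, int]],
) -> Dict[str, int]:
    """Return the nodes each user has left after their topologies are placed.

    `allotments` maps user -> nodes granted by the admin.
    `requests` maps user -> {topology name -> nodes requested}.
    Raises ValueError if a user's topologies ask for more than the allotment.
    """
    remaining: Dict[str, int] = {}
    for user, topologies in requests.items():
        allotted = allotments.get(user, 0)  # users without an allotment get nothing
        wanted = sum(topologies.values())
        if wanted > allotted:
            raise ValueError(
                f"{user} requested {wanted} nodes but is allotted {allotted}"
            )
        remaining[user] = allotted - wanted
    return remaining
```

For example, a user allotted 10 nodes who runs two topologies asking for 4 and 5 nodes has 1 node left, while a single 11-node request would be rejected.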
(If you see anything you like we are hiring…)
Long lived secure processes in YARN.
Ephemeral ports for Storm.
Combine the AM and Nimbus.
Do we need a Supervisor if we have a Node Manager?
Possibly run as Unmanaged AMs and Proxy Users.
Elasticity for Storm topologies.
Resource aware scheduling/requests in Storm.
Network aware scheduling in YARN and Storm.
Automatic fetching of delegation tokens like Oozie
We are hiring!
Stop by Kiosk P9
or reach out to us at
No need for a religious war; there are lots of good options out there and
we picked one.
Apache Spark Streaming
We started before Spark Streaming was a possibility.
Storm is currently more advanced in many areas, but not in all.
› Fault Tolerance (I can turn it off in Storm)
The community for Storm was more active
Fault Tolerance (I can turn it on in Storm)