Apache Metron is a streaming cybersecurity application built on Apache Storm and Hadoop. One of its core missions is to bring advanced analytics, through machine learning and data science, to its users. Because data science platform infrastructure that is integrated into Hadoop and oriented toward streaming analytics applications is still relatively immature, we have had to build the requisite platform components ourselves, using many pieces of the Hadoop ecosystem.
In this talk, we describe the Metron analytics architecture and how it uses a custom data science model deployment and autodiscovery service that is tightly integrated with Hadoop via Yarn and Zookeeper. We discuss how we interact with the models deployed there via a custom domain-specific language that can query models as data streams past. We also survey the full-stack data science tooling that has been created to enable data science at scale in an advanced analytics streaming application.
Speaker:
Casey Stella, Principal Software Engineer/Data Scientist, Hortonworks
MaaS (Model as a Service): Modern Streaming Data Science with Apache Metron
1. Model as a Service
Modern Streaming Data Science with Apache Metron
Casey Stella
@casey_stella
2017
Casey Stella@casey_stella (Hortonworks) Model as a Service 2017
8. Apache Metron
• Apache Metron is a streaming cybersecurity analytics solution. Roughly speaking:
◦ We ingest network data from many different sources.
◦ That data is then normalized and enriched
◦ Potential threats will be identified in the enriched data as it streams by and the likelihood of the threat will be ranked
◦ The enriched data will be indexed for further analysis in a realtime index and in a batch index
• Better enrichment ⇒ more context ⇒ better insights
13. Enrichment ⇒ Data Science
Threat identification turns out to be hard, and better threat identification is the key to the solution.
• Fixed rules are too coarse and multiply aggressively.
• Statistical baselining can adapt better than rules in some circumstances, but has scale challenges as well.
• Machine learning can adapt better, but can be hard to scale at high velocity.
The ideal solution is to enable the mixing of rules, statistical baselining and machine learning.
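As a rough illustration of what mixing the three approaches could look like, the sketch below scores one event with a fixed rule, a statistical baseline, and a model, then combines the scores. It is not Metron code: the `Event` fields, the function names, and the choice to take the maximum of the three scores are all assumptions made for the example.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import Callable, Sequence


@dataclass
class Event:
    # Hypothetical event shape, used only for this illustration.
    domain: str
    bytes_out: int


def rule_score(event: Event, blocklist: frozenset) -> float:
    """Fixed rule: exact match against a known-bad list."""
    return 1.0 if event.domain in blocklist else 0.0


def baseline_score(value: float, history: Sequence[float], cap: float = 4.0) -> float:
    """Statistical baseline: z-score against recent history, scaled to [0, 1]."""
    if len(history) < 2:
        return 0.0
    sigma = pstdev(history)
    if sigma == 0:
        return 0.0
    z = abs(value - mean(history)) / sigma
    return min(z, cap) / cap


def triage(event: Event,
           history: Sequence[float],
           blocklist: frozenset,
           model: Callable[[Event], float]) -> float:
    """Combine the three signals. Taking the max is one simple, assumed policy."""
    return max(rule_score(event, blocklist),
               baseline_score(event.bytes_out, history),
               model(event))
```

Here `model` stands in for any callable that returns a score in [0, 1], such as a wrapper around a call to a deployed model endpoint.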
24. Characteristics of a Solution
• There’s a lot of data and it comes at you fast
◦ Build atop scalable infrastructure: Storm + Hadoop
• Data Scientists rarely standardize on one technology or library for all problems
◦ Bring your own Language and Library
◦ Model interaction can be done via exposing models as REST endpoints
• Data Science can be computationally expensive
◦ If we assume models are state-free, they look a lot like traditional microservices and can be scaled by replication
• Integration of new models and retirement of old models must be seamless
◦ Registry and Autodiscovery through Zookeeper
• Many instances of many models will be running at once
◦ Share resources fairly by using Yarn for deployment and model lifecycle management
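The registry and replication ideas above can be sketched in a few lines. The example below uses an in-memory dictionary as a stand-in for the Zookeeper-backed registry, so that it runs without a cluster. The class name, method names, and the round-robin selection policy are assumptions for illustration, not Metron's actual implementation.

```python
from typing import Dict, List, Optional


class ModelRegistry:
    """In-memory stand-in for a Zookeeper-backed model registry (hypothetical API)."""

    def __init__(self) -> None:
        self._endpoints: Dict[str, List[str]] = {}  # model name -> replica URLs
        self._next: Dict[str, int] = {}             # model name -> round-robin counter

    def register(self, name: str, url: str) -> None:
        """Called when a new model instance comes up."""
        self._endpoints.setdefault(name, []).append(url)

    def unregister(self, name: str, url: str) -> None:
        """Called when a model instance is retired."""
        self._endpoints[name].remove(url)

    def discover(self, name: str) -> Optional[str]:
        """Return one live replica for the model, or None if none is registered."""
        urls = self._endpoints.get(name, [])
        if not urls:
            return None
        i = self._next.get(name, 0)
        self._next[name] = i + 1
        return urls[i % len(urls)]
```

Because the models are stateless, any replica returned by `discover` can serve any request, which is what makes scaling by replication and seamless retirement possible.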
29. Requirements of a Model Creator
We want to ask as little of data scientists and model creators as possible:
• Handle training your model wherever and however you like
• Make your model stateless
• Provide a REST interface serving up your model
From there, MaaS will handle deployment, management and model auto-discovery
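To make the second and third requirements concrete, here is a minimal sketch of a stateless model behind a REST interface, using only the Python standard library. The `/apply` path, the `host` query parameter, the `legit` label, and the digit-counting placeholder model are assumptions for illustration; a real model would load trained parameters once at startup and keep `score` a pure function of its input.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse


def score(host: str) -> dict:
    """Placeholder model: a pure function of its input, so it holds no state."""
    digits = sum(ch.isdigit() for ch in host)
    return {"is_malicious": "dga" if digits > 3 else "legit"}


class ModelHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        parsed = urlparse(self.path)
        if parsed.path != "/apply":
            self.send_error(404)
            return
        # Read the single input field from the query string.
        host = parse_qs(parsed.query).get("host", [""])[0]
        body = json.dumps(score(host)).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def make_server(port: int = 0) -> HTTPServer:
    """Bind to localhost; port 0 lets the OS pick a free port."""
    return HTTPServer(("127.0.0.1", port), ModelHandler)
```

Calling `make_server(8080).serve_forever()` would serve the model locally, and a request to `/apply?host=example.com` would return a small JSON map.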
30. Model as a Service: Stellar
Now that we have a deployment and autodiscovery solution, we need to interact with the models. We think of Stellar as something like Excel functions that we can run on streaming data:
• Compose a rich set of built-in functions with user-defined functions
• Provide simple primitives around the functions: boolean operations, conditionals, numerical computation.
Model interaction is enabled as a set of Stellar functions that discover and interact with model endpoints.
31. Model as a Service: Architecture
32. Interlude: Domain Generating Algorithms
Domain Generating Algorithms are used by botnets to communicate with compromised
computers in an evasive way. Traffic to a fixed command and control host would be
noticed and firewall rules could be modified to cut communication channels cleanly.
Instead, the botnet command and control hosts must move around to evade detection.
Typically this involves periodically generating a synthetic domain in a repeatable way
and having the compromised computers attempt to connect to a few candidate synthetic
domains daily with some moderate hope of guessing the right one. This traffic is small
enough to be lost in the shuffle of a large organization, making the evasion effective.
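The "repeatable" generation step can be illustrated with a toy example. The sketch below derives a few candidate domains from a shared seed and the current date, so that two parties holding the same seed compute the same list independently. It is not taken from any real botnet family: the seed format, the hash choice, and the label length are arbitrary assumptions.

```python
import hashlib
from datetime import date
from typing import List


def candidate_domains(seed: str, day: date, count: int = 3, tld: str = ".com") -> List[str]:
    """Derive `count` pseudo-random domains from a shared seed and the date."""
    domains = []
    for i in range(count):
        material = f"{seed}:{day.isoformat()}:{i}".encode("utf-8")
        digest = hashlib.sha256(material).hexdigest()
        # Map each hex digit (0-15) onto a letter a-p so the label looks like a hostname.
        label = "".join(chr(ord("a") + int(ch, 16)) for ch in digest[:12])
        domains.append(label + tld)
    return domains
```

The resulting labels are random-looking strings rather than words, which is the property that a synthetic-domain classifier tries to pick up on.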
33. Demo
• One solution for detecting synthetic domains is to construct a classifier that must be
run against every domain going across the network.
• At the BSides DFW Conference in 2013, ClickSecurity presented a model written in
Python for detecting synthetic domains.
• We can expose this model as a REST endpoint, deploy via MaaS and interact with
the model via Stellar
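For a sense of what such a classifier might look at, the sketch below computes a few simple per-domain features, including the Shannon entropy of the characters in the label. This is an illustrative feature extractor only, not the ClickSecurity model; the feature set and function names are assumptions.

```python
import math
from collections import Counter
from typing import Dict


def shannon_entropy(label: str) -> float:
    """Entropy in bits per character of the label's character distribution."""
    if not label:
        return 0.0
    n = len(label)
    counts = Counter(label)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def domain_features(domain: str) -> Dict[str, float]:
    """Features for the leftmost label of a domain (assumed already stripped of subdomains)."""
    label = domain.split(".")[0].lower()
    digits = sum(ch.isdigit() for ch in label)
    return {
        "length": float(len(label)),
        "entropy": shannon_entropy(label),
        "digit_ratio": digits / len(label) if label else 0.0,
    }
```

Features like these would then be fed to a trained classifier, which is the part that gets wrapped in a REST endpoint and deployed.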
34.
dga_model_endpoint := MAAS_GET_ENDPOINT('dga')
dga_result_map := MAAS_MODEL_APPLY(dga_model_endpoint, { 'host' : domain_without_subdomains })
dga_result := MAP_GET('is_malicious', dga_result_map)
is_dga := dga_result != null && dga_result == 'dga'
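For readers less familiar with Stellar, the same flow can be written as an ordinary Python function. This is a sketch of the logic only: `get_endpoint` and `model_apply` are hypothetical callables standing in for the `MAAS_GET_ENDPOINT` and `MAAS_MODEL_APPLY` functions, injected as parameters so the example runs without a deployed model.

```python
from typing import Any, Callable, Optional


def is_dga(domain_without_subdomains: str,
           get_endpoint: Callable[[str], Any],
           model_apply: Callable[[Any, dict], Optional[dict]]) -> bool:
    """Mirror of the Stellar statements: discover, apply, extract, compare."""
    endpoint = get_endpoint("dga")                                     # MAAS_GET_ENDPOINT
    result_map = model_apply(endpoint, {"host": domain_without_subdomains})  # MAAS_MODEL_APPLY
    result = (result_map or {}).get("is_malicious")                    # MAP_GET
    # A missing endpoint or missing field yields None, which is treated as "not DGA".
    return result is not None and result == "dga"
```

The null check matters in practice: if no model is registered, the message should pass through unflagged rather than fail.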
40. Next Steps
Going forward, we’d like to add a few things to make this an even more complete solution:
• A UI for model deployment
• Model monitoring with dashboards
• Alternative endpoint formats (e.g. gRPC)
• Automatic conversion of common model output formats to an endpoint
• A searchable model registry
41. Questions
Thanks for your attention! Don’t forget to come to the cybersecurity Birds of a Feather session on Thursday.
• Find me at http://caseystella.com
• Twitter handle: @casey_stella
• Email address: cstella@hortonworks.com