SlideShare a Scribd company logo
1 of 41
Download to read offline
Model as a Service
Modern Streaming Data Science with Apache Metron
Casey Stella
@casey_stella
2017
Casey Stella@casey_stella (Hortonworks) Model as a Service 2017
Introduction
Hi, I’m Casey Stella!
Casey Stella@casey_stella (Hortonworks) Model as a Service 2017
Apache Metron
• Apache Metron is a streaming cybersecurity analytics solution. Roughly speaking:
Casey Stella@casey_stella (Hortonworks) Model as a Service 2017
Apache Metron
• Apache Metron is a streaming cybersecurity analytics solution. Roughly speaking:
◦ We ingest network data from many different sources.
Casey Stella@casey_stella (Hortonworks) Model as a Service 2017
Apache Metron
• Apache Metron is a streaming cybersecurity analytics solution. Roughly speaking:
◦ We ingest network data from many different sources.
◦ That data is then normalized and enriched
Casey Stella@casey_stella (Hortonworks) Model as a Service 2017
Apache Metron
• Apache Metron is a streaming cybersecurity analytics solution. Roughly speaking:
◦ We ingest network data from many different sources.
◦ That data is then normalized and enriched
◦ Potential threats will be identified in the enriched data as it streams by and the
likelihood of the threat will be ranked
Casey Stella@casey_stella (Hortonworks) Model as a Service 2017
Apache Metron
• Apache Metron is a streaming cybersecurity analytics solution. Roughly speaking:
◦ We ingest network data from many different sources.
◦ That data is then normalized and enriched
◦ Potential threats will be identified in the enriched data as it streams by and the
likelihood of the threat will be ranked
◦ The enriched data will be indexed for further analysis in a realtime index and in a batch
index
Casey Stella@casey_stella (Hortonworks) Model as a Service 2017
Apache Metron
• Apache Metron is a streaming cybersecurity analytics solution. Roughly speaking:
◦ We ingest network data from many different sources.
◦ That data is then normalized and enriched
◦ Potential threats will be identified in the enriched data as it streams by and the
likelihood of the threat will be ranked
◦ The enriched data will be indexed for further analysis in a realtime index and in a batch
index
• Better enrichment =⇒ more context =⇒ better insights
Casey Stella@casey_stella (Hortonworks) Model as a Service 2017
Enrichment =⇒ Data Science
Threat identification is a hard thing, it turns out. The key to the solution is enabling
better threat identification.
Casey Stella@casey_stella (Hortonworks) Model as a Service 2017
Enrichment =⇒ Data Science
Threat identification is a hard thing, it turns out. The key to the solution is enabling
better threat identification.
• Fixed rules are too coarse and multiply aggressively.
Casey Stella@casey_stella (Hortonworks) Model as a Service 2017
Enrichment =⇒ Data Science
Threat identification is a hard thing, it turns out. The key to the solution is enabling
better threat identification.
• Fixed rules are too coarse and multiply aggressively.
• Statistical baselining can adapt better than rules in some circumstances, but has
scale challenges as well
Casey Stella@casey_stella (Hortonworks) Model as a Service 2017
Enrichment =⇒ Data Science
Threat identification is a hard thing, it turns out. The key to the solution is enabling
better threat identification.
• Fixed rules are too coarse and multiply aggressively.
• Statistical baselining can adapt better than rules in some circumstances, but has
scale challenges as well
• Machine learning can adapt better, but can be hard to scale at high velocity
Casey Stella@casey_stella (Hortonworks) Model as a Service 2017
Enrichment =⇒ Data Science
Threat identification is a hard thing, it turns out. The key to the solution is enabling
better threat identification.
• Fixed rules are too coarse and multiply aggressively.
• Statistical baselining can adapt better than rules in some circumstances, but has
scale challenges as well
• Machine learning can adapt better, but can be hard to scale at high velocity
The ideal solution is to enable the mixing of rules, statistical baselining and machine
learning.
Casey Stella@casey_stella (Hortonworks) Model as a Service 2017
Characteristics of a Solution
• There’s a lot of data and it comes at you fast
Casey Stella@casey_stella (Hortonworks) Model as a Service 2017
Characteristics of a Solution
• There’s a lot of data and it comes at you fast
◦ Build atop scalable infrastructure: Storm + Hadoop
Casey Stella@casey_stella (Hortonworks) Model as a Service 2017
Characteristics of a Solution
• There’s a lot of data and it comes at you fast
◦ Build atop scalable infrastructure: Storm + Hadoop
• Data Scientists rarely standardize on one technology or library for all problems
Casey Stella@casey_stella (Hortonworks) Model as a Service 2017
Characteristics of a Solution
• There’s a lot of data and it comes at you fast
◦ Build atop scalable infrastructure: Storm + Hadoop
• Data Scientists rarely standardize on one technology or library for all problems
◦ Bring your own Language and Library
Casey Stella@casey_stella (Hortonworks) Model as a Service 2017
Characteristics of a Solution
• There’s a lot of data and it comes at you fast
◦ Build atop scalable infrastructure: Storm + Hadoop
• Data Scientists rarely standardize on one technology or library for all problems
◦ Bring your own Language and Library
◦ Model interaction can be done via exposing models as REST endpoints
Casey Stella@casey_stella (Hortonworks) Model as a Service 2017
Characteristics of a Solution
• There’s a lot of data and it comes at you fast
◦ Build atop scalable infrastructure: Storm + Hadoop
• Data Scientists rarely standardize on one technology or library for all problems
◦ Bring your own Language and Library
◦ Model interaction can be done via exposing models as REST endpoints
• Data Science can be computationally expensive
Casey Stella@casey_stella (Hortonworks) Model as a Service 2017
Characteristics of a Solution
• There’s a lot of data and it comes at you fast
◦ Build atop scalable infrastructure: Storm + Hadoop
• Data Scientists rarely standardize on one technology or library for all problems
◦ Bring your own Language and Library
◦ Model interaction can be done via exposing models as REST endpoints
• Data Science can be computationally expensive
◦ If we assume models are state-free, they look a lot like traditional microservices and can
be scaled by replication
Casey Stella@casey_stella (Hortonworks) Model as a Service 2017
Characteristics of a Solution
• There’s a lot of data and it comes at you fast
◦ Build atop scalable infrastructure: Storm + Hadoop
• Data Scientists rarely standardize on one technology or library for all problems
◦ Bring your own Language and Library
◦ Model interaction can be done via exposing models as REST endpoints
• Data Science can be computationally expensive
◦ If we assume models are state-free, they look a lot like traditional microservices and can
be scaled by replication
• Integration of new models and retirement of old models must be seamless
Casey Stella@casey_stella (Hortonworks) Model as a Service 2017
Characteristics of a Solution
• There’s a lot of data and it comes at you fast
◦ Build atop scalable infrastructure: Storm + Hadoop
• Data Scientists rarely standardize on one technology or library for all problems
◦ Bring your own Language and Library
◦ Model interaction can be done via exposing models as REST endpoints
• Data Science can be computationally expensive
◦ If we assume models are state-free, they look a lot like traditional microservices and can
be scaled by replication
• Integration of new models and retirement of old models must be seamless
◦ Registry and Autodiscovery through zookeeper
Casey Stella@casey_stella (Hortonworks) Model as a Service 2017
Characteristics of a Solution
• There’s a lot of data and it comes at you fast
◦ Build atop scalable infrastructure: Storm + Hadoop
• Data Scientists rarely standardize on one technology or library for all problems
◦ Bring your own Language and Library
◦ Model interaction can be done via exposing models as REST endpoints
• Data Science can be computationally expensive
◦ If we assume models are state-free, they look a lot like traditional microservices and can
be scaled by replication
• Integration of new models and retirement of old models must be seamless
◦ Registry and Autodiscovery through zookeeper
• Many instances of many models will be running at once
Casey Stella@casey_stella (Hortonworks) Model as a Service 2017
Characteristics of a Solution
• There’s a lot of data and it comes at you fast
◦ Build atop scalable infrastructure: Storm + Hadoop
• Data Scientists rarely standardize on one technology or library for all problems
◦ Bring your own Language and Library
◦ Model interaction can be done via exposing models as REST endpoints
• Data Science can be computationally expensive
◦ If we assume models are state-free, they look a lot like traditional microservices and can
be scaled by replication
• Integration of new models and retirement of old models must be seamless
◦ Registry and Autodiscovery through zookeeper
• Many instances of many models will be running at once
◦ Share resources fairly by using Yarn for deployment and model lifecycle management
Casey Stella@casey_stella (Hortonworks) Model as a Service 2017
Requirements of a Model Creator
We want to ask as little of data scientists and model creators as possible:
Casey Stella@casey_stella (Hortonworks) Model as a Service 2017
Requirements of a Model Creator
We want to ask as little of data scientists and model creators as possible:
• Handle training your model wherever and however you like
Casey Stella@casey_stella (Hortonworks) Model as a Service 2017
Requirements of a Model Creator
We want to ask as little of data scientists and model creators as possible:
• Handle training your model wherever and however you like
• Make your model stateless
Casey Stella@casey_stella (Hortonworks) Model as a Service 2017
Requirements of a Model Creator
We want to ask as little of data scientists and model creators as possible:
• Handle training your model wherever and however you like
• Make your model stateless
• Provide a REST interface serving up your model
Casey Stella@casey_stella (Hortonworks) Model as a Service 2017
Requirements of a Model Creator
We want to ask as little of data scientists and model creators as possible:
• Handle training your model wherever and however you like
• Make your model stateless
• Provide a REST interface serving up your model
From there, MaaS will handle deployment, management and model auto-discovery
Casey Stella@casey_stella (Hortonworks) Model as a Service 2017
Model as a Service: Stellar
Now that we have a deployment and autodiscovery solution, we need to interact with the
models. We think of Stellar as like Excel functions that we can run on streaming data:
• Compose a rich set of built-in functions with user defined functions
• Provide simple primitives around the functions: boolean operations, conditionals,
numerical computation.
Model interaction is enabled as a set of Stellar functions that discover and interact with
model endpoints.
Casey Stella@casey_stella (Hortonworks) Model as a Service 2017
Model as a Service: Architecture
Casey Stella@casey_stella (Hortonworks) Model as a Service 2017
Interlude: Domain Generating Algorithms
Domain Generating Algorithms are used by botnets to communicate with compromised
computers in an evasive way. Traffic to a fixed command and control host would be
noticed and firewall rules could be modified to cut communication channels cleanly.
Instead, the botnet command and control hosts must move around to evade detection.
Typically this involves periodically generating a synthetic domain in a repeatable way
and having the compromised computers attempt to connect to a few candidate synthetic
domains daily with some moderate hope of guessing the right one. This traffic is small
enough to be lost in the shuffle of a large organization, making the evasion effective.
Casey Stella@casey_stella (Hortonworks) Model as a Service 2017
Demo
• One solution for detecting synthetic domains is to construct a classifier that must be
run against every domain going across the network.
• At the BSides DFW Conference in 2013, ClickSecurity presented a model written in
Python for detecting synthetic domains.
• We can expose this model as a REST endpoint, deploy via MaaS and interact with
the model via Stellar
Casey Stella@casey_stella (Hortonworks) Model as a Service 2017
dga_model_endpoint := MAAS_GET_ENDPOINT('dga')
dga_result_map := MAAS_MODEL_APPLY( dga_model_endpoint
, { 'host' : domain_without_subdomains }
)
dga_result := MAP_GET('is_malicious', dga_result_map)
is_dga := dga_result != null && dga_result == 'dga'
Casey Stella@casey_stella (Hortonworks) Model as a Service 2017
Next Steps
Going forward, we’d like to add a few things to make this an even more complete
solution
Casey Stella@casey_stella (Hortonworks) Model as a Service 2017
Next Steps
Going forward, we’d like to add a few things to make this an even more complete
solution
• A UI for model deployment
Casey Stella@casey_stella (Hortonworks) Model as a Service 2017
Next Steps
Going forward, we’d like to add a few things to make this an even more complete
solution
• A UI for model deployment
• Model Monitoring w/ dashboards
Casey Stella@casey_stella (Hortonworks) Model as a Service 2017
Next Steps
Going forward, we’d like to add a few things to make this an even more complete
solution
• A UI for model deployment
• Model Monitoring w/ dashboards
• Alternative endpoint formats (e.g. gRPC)
Casey Stella@casey_stella (Hortonworks) Model as a Service 2017
Next Steps
Going forward, we’d like to add a few things to make this an even more complete
solution
• A UI for model deployment
• Model Monitoring w/ dashboards
• Alternative endpoint formats (e.g. gRPC)
• Automatic conversion of common model output formats to an endpoint
Casey Stella@casey_stella (Hortonworks) Model as a Service 2017
Next Steps
Going forward, we’d like to add a few things to make this an even more complete
solution
• A UI for model deployment
• Model Monitoring w/ dashboards
• Alternative endpoint formats (e.g. gRPC)
• Automatic conversion of common model output formats to an endpoint
• A searchable model registry
Casey Stella@casey_stella (Hortonworks) Model as a Service 2017
Questions
Thanks for your attention! Don’t forget to come to the cybersecurity Bird of a Feather
session Thursday.
• Find me at http://caseystella.com
• Twitter handle: @casey_stella
• Email address: cstella@hortonworks.com
Casey Stella@casey_stella (Hortonworks) Model as a Service 2017

More Related Content

What's hot

How to Use Innovative Data Handling and Processing Techniques to Drive Alpha ...
How to Use Innovative Data Handling and Processing Techniques to Drive Alpha ...How to Use Innovative Data Handling and Processing Techniques to Drive Alpha ...
How to Use Innovative Data Handling and Processing Techniques to Drive Alpha ...
DataWorks Summit
 
Saving the elephant—now, not later
Saving the elephant—now, not laterSaving the elephant—now, not later
Saving the elephant—now, not later
DataWorks Summit
 
Lightning Fast Analytics with Hive LLAP and Druid
Lightning Fast Analytics with Hive LLAP and DruidLightning Fast Analytics with Hive LLAP and Druid
Lightning Fast Analytics with Hive LLAP and Druid
DataWorks Summit
 

What's hot (20)

Innovation in the Enterprise Rent-A-Car Data Warehouse
Innovation in the Enterprise Rent-A-Car Data WarehouseInnovation in the Enterprise Rent-A-Car Data Warehouse
Innovation in the Enterprise Rent-A-Car Data Warehouse
 
Build Big Data Enterprise solutions faster on Azure HDInsight
Build Big Data Enterprise solutions faster on Azure HDInsightBuild Big Data Enterprise solutions faster on Azure HDInsight
Build Big Data Enterprise solutions faster on Azure HDInsight
 
Breaking the Silos: Storage for Analytics & AI
Breaking the Silos: Storage for Analytics & AIBreaking the Silos: Storage for Analytics & AI
Breaking the Silos: Storage for Analytics & AI
 
Modernizing Business Processes with Big Data: Real-World Use Cases for Produc...
Modernizing Business Processes with Big Data: Real-World Use Cases for Produc...Modernizing Business Processes with Big Data: Real-World Use Cases for Produc...
Modernizing Business Processes with Big Data: Real-World Use Cases for Produc...
 
Addressing Enterprise Customer Pain Points with a Data Driven Architecture
Addressing Enterprise Customer Pain Points with a Data Driven ArchitectureAddressing Enterprise Customer Pain Points with a Data Driven Architecture
Addressing Enterprise Customer Pain Points with a Data Driven Architecture
 
Data Driving Yahoo Mail Growth and Evolution with a 50 PB Hadoop Warehouse
Data Driving Yahoo Mail Growth and Evolution with a 50 PB Hadoop WarehouseData Driving Yahoo Mail Growth and Evolution with a 50 PB Hadoop Warehouse
Data Driving Yahoo Mail Growth and Evolution with a 50 PB Hadoop Warehouse
 
How to Use Innovative Data Handling and Processing Techniques to Drive Alpha ...
How to Use Innovative Data Handling and Processing Techniques to Drive Alpha ...How to Use Innovative Data Handling and Processing Techniques to Drive Alpha ...
How to Use Innovative Data Handling and Processing Techniques to Drive Alpha ...
 
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Building an Event-oriented...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Building an Event-oriented...Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Building an Event-oriented...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Building an Event-oriented...
 
Data infrastructure architecture for medium size organization: tips for colle...
Data infrastructure architecture for medium size organization: tips for colle...Data infrastructure architecture for medium size organization: tips for colle...
Data infrastructure architecture for medium size organization: tips for colle...
 
Visualizing Big Data in Realtime
Visualizing Big Data in RealtimeVisualizing Big Data in Realtime
Visualizing Big Data in Realtime
 
Extreme Sports & Beyond: Exploring a new frontier in data with GoPro
Extreme Sports & Beyond: Exploring a new frontier in data with GoProExtreme Sports & Beyond: Exploring a new frontier in data with GoPro
Extreme Sports & Beyond: Exploring a new frontier in data with GoPro
 
Machine Learning for Any Size of Data, Any Type of Data
Machine Learning for Any Size of Data, Any Type of DataMachine Learning for Any Size of Data, Any Type of Data
Machine Learning for Any Size of Data, Any Type of Data
 
Bringing it All Together: Apache Metron (Incubating) as a Case Study of a Mod...
Bringing it All Together: Apache Metron (Incubating) as a Case Study of a Mod...Bringing it All Together: Apache Metron (Incubating) as a Case Study of a Mod...
Bringing it All Together: Apache Metron (Incubating) as a Case Study of a Mod...
 
Big Data Ready Enterprise
Big Data Ready Enterprise Big Data Ready Enterprise
Big Data Ready Enterprise
 
Protect your Private Data in your Hadoop Clusters with ORC Column Encryption
Protect your Private Data in your Hadoop Clusters with ORC Column EncryptionProtect your Private Data in your Hadoop Clusters with ORC Column Encryption
Protect your Private Data in your Hadoop Clusters with ORC Column Encryption
 
Saving the elephant—now, not later
Saving the elephant—now, not laterSaving the elephant—now, not later
Saving the elephant—now, not later
 
Lightning Fast Analytics with Hive LLAP and Druid
Lightning Fast Analytics with Hive LLAP and DruidLightning Fast Analytics with Hive LLAP and Druid
Lightning Fast Analytics with Hive LLAP and Druid
 
Using Hadoop to build a Data Quality Service for both real-time and batch data
Using Hadoop to build a Data Quality Service for both real-time and batch dataUsing Hadoop to build a Data Quality Service for both real-time and batch data
Using Hadoop to build a Data Quality Service for both real-time and batch data
 
Data & analytics challenges in a microservice architecture
Data & analytics challenges in a microservice architectureData & analytics challenges in a microservice architecture
Data & analytics challenges in a microservice architecture
 
Data Science lifecycle with Apache Zeppelin and Spark by Moonsoo Lee
Data Science lifecycle with Apache Zeppelin and Spark by Moonsoo LeeData Science lifecycle with Apache Zeppelin and Spark by Moonsoo Lee
Data Science lifecycle with Apache Zeppelin and Spark by Moonsoo Lee
 

Similar to MaaS (Model as a Service): Modern Streaming Data Science with Apache Metron (Incubating)

Sharing a Startup’s Big Data Lessons
Sharing a Startup’s Big Data LessonsSharing a Startup’s Big Data Lessons
Sharing a Startup’s Big Data Lessons
George Stathis
 

Similar to MaaS (Model as a Service): Modern Streaming Data Science with Apache Metron (Incubating) (20)

Cassandra Summit 2014: Social Media Security Company Nexgate Relies on Cassan...
Cassandra Summit 2014: Social Media Security Company Nexgate Relies on Cassan...Cassandra Summit 2014: Social Media Security Company Nexgate Relies on Cassan...
Cassandra Summit 2014: Social Media Security Company Nexgate Relies on Cassan...
 
Architecting Your First Big Data Implementation
Architecting Your First Big Data ImplementationArchitecting Your First Big Data Implementation
Architecting Your First Big Data Implementation
 
Spring + QueryDSL + MongoDB Presentation
Spring + QueryDSL + MongoDB PresentationSpring + QueryDSL + MongoDB Presentation
Spring + QueryDSL + MongoDB Presentation
 
2022 Trends in Enterprise Analytics
2022 Trends in Enterprise Analytics2022 Trends in Enterprise Analytics
2022 Trends in Enterprise Analytics
 
Sharing a Startup’s Big Data Lessons
Sharing a Startup’s Big Data LessonsSharing a Startup’s Big Data Lessons
Sharing a Startup’s Big Data Lessons
 
Horses for Courses: Database Roundtable
Horses for Courses: Database RoundtableHorses for Courses: Database Roundtable
Horses for Courses: Database Roundtable
 
Dremio introduction
Dremio introductionDremio introduction
Dremio introduction
 
Using SparkML to Power a DSaaS (Data Science as a Service): Spark Summit East...
Using SparkML to Power a DSaaS (Data Science as a Service): Spark Summit East...Using SparkML to Power a DSaaS (Data Science as a Service): Spark Summit East...
Using SparkML to Power a DSaaS (Data Science as a Service): Spark Summit East...
 
Amazon QuickSight
Amazon QuickSightAmazon QuickSight
Amazon QuickSight
 
Incorporating the Data Lake into Your Analytic Architecture
Incorporating the Data Lake into Your Analytic ArchitectureIncorporating the Data Lake into Your Analytic Architecture
Incorporating the Data Lake into Your Analytic Architecture
 
How to Survive as a Data Architect in a Polyglot Database World
How to Survive as a Data Architect in a Polyglot Database WorldHow to Survive as a Data Architect in a Polyglot Database World
How to Survive as a Data Architect in a Polyglot Database World
 
SRV318_Research at PNNL Powered by AWS
SRV318_Research at PNNL Powered by AWSSRV318_Research at PNNL Powered by AWS
SRV318_Research at PNNL Powered by AWS
 
Research at PNNL: Powered by AWS - SRV318 - re:Invent 2017
Research at PNNL: Powered by AWS - SRV318 - re:Invent 2017Research at PNNL: Powered by AWS - SRV318 - re:Invent 2017
Research at PNNL: Powered by AWS - SRV318 - re:Invent 2017
 
Social Security Company Nexgate's Success Relies on Apache Cassandra
Social Security Company Nexgate's Success Relies on Apache CassandraSocial Security Company Nexgate's Success Relies on Apache Cassandra
Social Security Company Nexgate's Success Relies on Apache Cassandra
 
Options for Data Prep - A Survey of the Current Market
Options for Data Prep - A Survey of the Current MarketOptions for Data Prep - A Survey of the Current Market
Options for Data Prep - A Survey of the Current Market
 
Practical Machine Learning in Information Security
Practical Machine Learning in Information SecurityPractical Machine Learning in Information Security
Practical Machine Learning in Information Security
 
Big Data Paris : Hadoop and NoSQL
Big Data Paris : Hadoop and NoSQLBig Data Paris : Hadoop and NoSQL
Big Data Paris : Hadoop and NoSQL
 
Breed data scientists_ A Presentation.pptx
Breed data scientists_ A Presentation.pptxBreed data scientists_ A Presentation.pptx
Breed data scientists_ A Presentation.pptx
 
2017 bio it world
2017 bio it world2017 bio it world
2017 bio it world
 
Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)
 

More from DataWorks Summit

HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
DataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
DataWorks Summit
 

More from DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Recently uploaded

Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...
panagenda
 
Hyatt driving innovation and exceptional customer experiences with FIDO passw...
Hyatt driving innovation and exceptional customer experiences with FIDO passw...Hyatt driving innovation and exceptional customer experiences with FIDO passw...
Hyatt driving innovation and exceptional customer experiences with FIDO passw...
FIDO Alliance
 
Harnessing Passkeys in the Battle Against AI-Powered Cyber Threats.pptx
Harnessing Passkeys in the Battle Against AI-Powered Cyber Threats.pptxHarnessing Passkeys in the Battle Against AI-Powered Cyber Threats.pptx
Harnessing Passkeys in the Battle Against AI-Powered Cyber Threats.pptx
FIDO Alliance
 
Microsoft BitLocker Bypass Attack Method.pdf
Microsoft BitLocker Bypass Attack Method.pdfMicrosoft BitLocker Bypass Attack Method.pdf
Microsoft BitLocker Bypass Attack Method.pdf
Overkill Security
 

Recently uploaded (20)

Generative AI Use Cases and Applications.pdf
Generative AI Use Cases and Applications.pdfGenerative AI Use Cases and Applications.pdf
Generative AI Use Cases and Applications.pdf
 
Event-Driven Architecture Masterclass: Integrating Distributed Data Stores Ac...
Event-Driven Architecture Masterclass: Integrating Distributed Data Stores Ac...Event-Driven Architecture Masterclass: Integrating Distributed Data Stores Ac...
Event-Driven Architecture Masterclass: Integrating Distributed Data Stores Ac...
 
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...
 
Observability Concepts EVERY Developer Should Know (DevOpsDays Seattle)
Observability Concepts EVERY Developer Should Know (DevOpsDays Seattle)Observability Concepts EVERY Developer Should Know (DevOpsDays Seattle)
Observability Concepts EVERY Developer Should Know (DevOpsDays Seattle)
 
Navigating the Large Language Model choices_Ravi Daparthi
Navigating the Large Language Model choices_Ravi DaparthiNavigating the Large Language Model choices_Ravi Daparthi
Navigating the Large Language Model choices_Ravi Daparthi
 
Hyatt driving innovation and exceptional customer experiences with FIDO passw...
Hyatt driving innovation and exceptional customer experiences with FIDO passw...Hyatt driving innovation and exceptional customer experiences with FIDO passw...
Hyatt driving innovation and exceptional customer experiences with FIDO passw...
 
How to Check CNIC Information Online with Pakdata cf
How to Check CNIC Information Online with Pakdata cfHow to Check CNIC Information Online with Pakdata cf
How to Check CNIC Information Online with Pakdata cf
 
AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)
AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)
AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)
 
WebRTC and SIP not just audio and video @ OpenSIPS 2024
WebRTC and SIP not just audio and video @ OpenSIPS 2024WebRTC and SIP not just audio and video @ OpenSIPS 2024
WebRTC and SIP not just audio and video @ OpenSIPS 2024
 
Introduction to FIDO Authentication and Passkeys.pptx
Introduction to FIDO Authentication and Passkeys.pptxIntroduction to FIDO Authentication and Passkeys.pptx
Introduction to FIDO Authentication and Passkeys.pptx
 
Top 10 CodeIgniter Development Companies
Top 10 CodeIgniter Development CompaniesTop 10 CodeIgniter Development Companies
Top 10 CodeIgniter Development Companies
 
Harnessing Passkeys in the Battle Against AI-Powered Cyber Threats.pptx
Harnessing Passkeys in the Battle Against AI-Powered Cyber Threats.pptxHarnessing Passkeys in the Battle Against AI-Powered Cyber Threats.pptx
Harnessing Passkeys in the Battle Against AI-Powered Cyber Threats.pptx
 
الأمن السيبراني - ما لا يسع للمستخدم جهله
الأمن السيبراني - ما لا يسع للمستخدم جهلهالأمن السيبراني - ما لا يسع للمستخدم جهله
الأمن السيبراني - ما لا يسع للمستخدم جهله
 
2024 May Patch Tuesday
2024 May Patch Tuesday2024 May Patch Tuesday
2024 May Patch Tuesday
 
Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...
Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...
Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...
 
JavaScript Usage Statistics 2024 - The Ultimate Guide
JavaScript Usage Statistics 2024 - The Ultimate GuideJavaScript Usage Statistics 2024 - The Ultimate Guide
JavaScript Usage Statistics 2024 - The Ultimate Guide
 
ERP Contender Series: Acumatica vs. Sage Intacct
ERP Contender Series: Acumatica vs. Sage IntacctERP Contender Series: Acumatica vs. Sage Intacct
ERP Contender Series: Acumatica vs. Sage Intacct
 
Event-Driven Architecture Masterclass: Challenges in Stream Processing
Event-Driven Architecture Masterclass: Challenges in Stream ProcessingEvent-Driven Architecture Masterclass: Challenges in Stream Processing
Event-Driven Architecture Masterclass: Challenges in Stream Processing
 
The Ultimate Prompt Engineering Guide for Generative AI: Get the Most Out of ...
The Ultimate Prompt Engineering Guide for Generative AI: Get the Most Out of ...The Ultimate Prompt Engineering Guide for Generative AI: Get the Most Out of ...
The Ultimate Prompt Engineering Guide for Generative AI: Get the Most Out of ...
 
Microsoft BitLocker Bypass Attack Method.pdf
Microsoft BitLocker Bypass Attack Method.pdfMicrosoft BitLocker Bypass Attack Method.pdf
Microsoft BitLocker Bypass Attack Method.pdf
 

MaaS (Model as a Service): Modern Streaming Data Science with Apache Metron (Incubating)

  • 1. Model as a Service Modern Streaming Data Science with Apache Metron Casey Stella @casey_stella 2017 Casey Stella@casey_stella (Hortonworks) Model as a Service 2017
  • 2. Introduction Hi, I’m Casey Stella! Casey Stella@casey_stella (Hortonworks) Model as a Service 2017
  • 3. Apache Metron • Apache Metron is a streaming cybersecurity analytics solution. Roughly speaking: Casey Stella@casey_stella (Hortonworks) Model as a Service 2017
  • 4. Apache Metron • Apache Metron is a streaming cybersecurity analytics solution. Roughly speaking: ◦ We ingest network data from many different sources. Casey Stella@casey_stella (Hortonworks) Model as a Service 2017
  • 5. Apache Metron • Apache Metron is a streaming cybersecurity analytics solution. Roughly speaking: ◦ We ingest network data from many different sources. ◦ That data is then normalized and enriched Casey Stella@casey_stella (Hortonworks) Model as a Service 2017
  • 6. Apache Metron • Apache Metron is a streaming cybersecurity analytics solution. Roughly speaking: ◦ We ingest network data from many different sources. ◦ That data is then normalized and enriched ◦ Potential threats will be identified in the enriched data as it streams by and the likelihood of the threat will be ranked Casey Stella@casey_stella (Hortonworks) Model as a Service 2017
  • 7. Apache Metron • Apache Metron is a streaming cybersecurity analytics solution. Roughly speaking: ◦ We ingest network data from many different sources. ◦ That data is then normalized and enriched ◦ Potential threats will be identified in the enriched data as it streams by and the likelihood of the threat will be ranked ◦ The enriched data will be indexed for further analysis in a realtime index and in a batch index Casey Stella@casey_stella (Hortonworks) Model as a Service 2017
  • 8. Apache Metron • Apache Metron is a streaming cybersecurity analytics solution. Roughly speaking: ◦ We ingest network data from many different sources. ◦ That data is then normalized and enriched ◦ Potential threats will be identified in the enriched data as it streams by and the likelihood of the threat will be ranked ◦ The enriched data will be indexed for further analysis in a realtime index and in a batch index • Better enrichment =⇒ more context =⇒ better insights Casey Stella@casey_stella (Hortonworks) Model as a Service 2017
  • 9. Enrichment =⇒ Data Science Threat identification is a hard thing, it turns out. The key to the solution is enabling better threat identification. Casey Stella@casey_stella (Hortonworks) Model as a Service 2017
  • 10. Enrichment =⇒ Data Science Threat identification is a hard thing, it turns out. The key to the solution is enabling better threat identification. • Fixed rules are too coarse and multiply aggressively. Casey Stella@casey_stella (Hortonworks) Model as a Service 2017
  • 11. Enrichment =⇒ Data Science Threat identification is a hard thing, it turns out. The key to the solution is enabling better threat identification. • Fixed rules are too coarse and multiply aggressively. • Statistical baselining can adapt better than rules in some circumstances, but has scale challenges as well Casey Stella@casey_stella (Hortonworks) Model as a Service 2017
  • 12. Enrichment =⇒ Data Science Threat identification is a hard thing, it turns out. The key to the solution is enabling better threat identification. • Fixed rules are too coarse and multiply aggressively. • Statistical baselining can adapt better than rules in some circumstances, but has scale challenges as well • Machine learning can adapt better, but can be hard to scale at high velocity Casey Stella@casey_stella (Hortonworks) Model as a Service 2017
  • 13. Enrichment =⇒ Data Science Threat identification is a hard thing, it turns out. The key to the solution is enabling better threat identification. • Fixed rules are too coarse and multiply aggressively. • Statistical baselining can adapt better than rules in some circumstances, but has scale challenges as well • Machine learning can adapt better, but can be hard to scale at high velocity The ideal solution is to enable the mixing of rules, statistical baselining and machine learning. Casey Stella@casey_stella (Hortonworks) Model as a Service 2017
  • 14. Characteristics of a Solution • There’s a lot of data and it comes at you fast Casey Stella@casey_stella (Hortonworks) Model as a Service 2017
  • 15. Characteristics of a Solution • There’s a lot of data and it comes at you fast ◦ Build atop scalable infrastructure: Storm + Hadoop Casey Stella@casey_stella (Hortonworks) Model as a Service 2017
  • 16. Characteristics of a Solution • There’s a lot of data and it comes at you fast ◦ Build atop scalable infrastructure: Storm + Hadoop • Data Scientists rarely standardize on one technology or library for all problems Casey Stella@casey_stella (Hortonworks) Model as a Service 2017
  • 17. Characteristics of a Solution • There’s a lot of data and it comes at you fast ◦ Build atop scalable infrastructure: Storm + Hadoop • Data Scientists rarely standardize on one technology or library for all problems ◦ Bring your own Language and Library Casey Stella@casey_stella (Hortonworks) Model as a Service 2017
  • 18. Characteristics of a Solution • There’s a lot of data and it comes at you fast ◦ Build atop scalable infrastructure: Storm + Hadoop • Data Scientists rarely standardize on one technology or library for all problems ◦ Bring your own Language and Library ◦ Model interaction can be done via exposing models as REST endpoints Casey Stella@casey_stella (Hortonworks) Model as a Service 2017
  • 19. Characteristics of a Solution • There’s a lot of data and it comes at you fast ◦ Build atop scalable infrastructure: Storm + Hadoop • Data Scientists rarely standardize on one technology or library for all problems ◦ Bring your own Language and Library ◦ Model interaction can be done via exposing models as REST endpoints • Data Science can be computationally expensive Casey Stella@casey_stella (Hortonworks) Model as a Service 2017
  • 20. Characteristics of a Solution • There’s a lot of data and it comes at you fast ◦ Build atop scalable infrastructure: Storm + Hadoop • Data Scientists rarely standardize on one technology or library for all problems ◦ Bring your own Language and Library ◦ Model interaction can be done via exposing models as REST endpoints • Data Science can be computationally expensive ◦ If we assume models are state-free, they look a lot like traditional microservices and can be scaled by replication Casey Stella@casey_stella (Hortonworks) Model as a Service 2017
  • 21. Characteristics of a Solution • There’s a lot of data and it comes at you fast ◦ Build atop scalable infrastructure: Storm + Hadoop • Data Scientists rarely standardize on one technology or library for all problems ◦ Bring your own Language and Library ◦ Model interaction can be done via exposing models as REST endpoints • Data Science can be computationally expensive ◦ If we assume models are state-free, they look a lot like traditional microservices and can be scaled by replication • Integration of new models and retirement of old models must be seamless Casey Stella@casey_stella (Hortonworks) Model as a Service 2017
  • 22. Characteristics of a Solution • There’s a lot of data and it comes at you fast ◦ Build atop scalable infrastructure: Storm + Hadoop • Data Scientists rarely standardize on one technology or library for all problems ◦ Bring your own Language and Library ◦ Model interaction can be done via exposing models as REST endpoints • Data Science can be computationally expensive ◦ If we assume models are state-free, they look a lot like traditional microservices and can be scaled by replication • Integration of new models and retirement of old models must be seamless ◦ Registry and Autodiscovery through zookeeper Casey Stella@casey_stella (Hortonworks) Model as a Service 2017
  • 23. Characteristics of a Solution • There’s a lot of data and it comes at you fast ◦ Build atop scalable infrastructure: Storm + Hadoop • Data Scientists rarely standardize on one technology or library for all problems ◦ Bring your own Language and Library ◦ Model interaction can be done via exposing models as REST endpoints • Data Science can be computationally expensive ◦ If we assume models are state-free, they look a lot like traditional microservices and can be scaled by replication • Integration of new models and retirement of old models must be seamless ◦ Registry and Autodiscovery through zookeeper • Many instances of many models will be running at once Casey Stella@casey_stella (Hortonworks) Model as a Service 2017
  • 24. Characteristics of a Solution • There’s a lot of data and it comes at you fast ◦ Build atop scalable infrastructure: Storm + Hadoop • Data Scientists rarely standardize on one technology or library for all problems ◦ Bring your own Language and Library ◦ Model interaction can be done via exposing models as REST endpoints • Data Science can be computationally expensive ◦ If we assume models are state-free, they look a lot like traditional microservices and can be scaled by replication • Integration of new models and retirement of old models must be seamless ◦ Registry and Autodiscovery through zookeeper • Many instances of many models will be running at once ◦ Share resources fairly by using Yarn for deployment and model lifecycle management Casey Stella@casey_stella (Hortonworks) Model as a Service 2017
  • 25. Requirements of a Model Creator We want to ask as little of data scientists and model creators as possible: Casey Stella@casey_stella (Hortonworks) Model as a Service 2017
  • 26. Requirements of a Model Creator We want to ask as little of data scientists and model creators as possible: • Handle training your model wherever and however you like Casey Stella@casey_stella (Hortonworks) Model as a Service 2017
  • 27. Requirements of a Model Creator We want to ask as little of data scientists and model creators as possible: • Handle training your model wherever and however you like • Make your model stateless Casey Stella@casey_stella (Hortonworks) Model as a Service 2017
  • 28. Requirements of a Model Creator We want to ask as little of data scientists and model creators as possible: • Handle training your model wherever and however you like • Make your model stateless • Provide a REST interface serving up your model Casey Stella@casey_stella (Hortonworks) Model as a Service 2017
  • 29. Requirements of a Model Creator We want to ask as little of data scientists and model creators as possible: • Handle training your model wherever and however you like • Make your model stateless • Provide a REST interface serving up your model From there, MaaS will handle deployment, management and model auto-discovery Casey Stella@casey_stella (Hortonworks) Model as a Service 2017
  • 30. Model as a Service: Stellar Now that we have a deployment and autodiscovery solution, we need to interact with the models. We think of Stellar as like Excel functions that we can run on streaming data: • Compose a rich set of built-in functions with user defined functions • Provide simple primitives around the functions: boolean operations, conditionals, numerical computation. Model interaction is enabled as a set of Stellar functions that discover and interact with model endpoints. Casey Stella@casey_stella (Hortonworks) Model as a Service 2017
  • 31. Model as a Service: Architecture Casey Stella@casey_stella (Hortonworks) Model as a Service 2017
  • 32. Interlude: Domain Generating Algorithms Domain Generating Algorithms are used by botnets to communicate with compromised computers in an evasive way. Traffic to a fixed command and control host would be noticed and firewall rules could be modified to cut communication channels cleanly. Instead, the botnet command and control hosts must move around to evade detection. Typically this involves periodically generating a synthetic domain in a repeatable way and having the compromised computers attempt to connect to a few candidate synthetic domains daily with some moderate hope of guessing the right one. This traffic is small enough to be lost in the shuffle of a large organization, making the evasion effective. Casey Stella@casey_stella (Hortonworks) Model as a Service 2017
  • 33. Demo • One solution for detecting synthetic domains is to construct a classifier that must be run against every domain going across the network. • At the BSides DFW Conference in 2013, ClickSecurity presented a model written in Python for detecting synthetic domains. • We can expose this model as a REST endpoint, deploy via MaaS and interact with the model via Stellar Casey Stella@casey_stella (Hortonworks) Model as a Service 2017
  • 34. dga_model_endpoint := MAAS_GET_ENDPOINT('dga') dga_result_map := MAAS_MODEL_APPLY( dga_model_endpoint , { 'host' : domain_without_subdomains } ) dga_result := MAP_GET('is_malicious', dga_result_map) is_dga := dga_result != null && dga_result == 'dga' Casey Stella@casey_stella (Hortonworks) Model as a Service 2017
  • 35. Next Steps Going forward, we’d like to add a few things to make this an even more complete solution Casey Stella@casey_stella (Hortonworks) Model as a Service 2017
  • 36. Next Steps Going forward, we’d like to add a few things to make this an even more complete solution • A UI for model deployment Casey Stella@casey_stella (Hortonworks) Model as a Service 2017
  • 37. Next Steps Going forward, we’d like to add a few things to make this an even more complete solution • A UI for model deployment • Model Monitoring w/ dashboards Casey Stella@casey_stella (Hortonworks) Model as a Service 2017
  • 38. Next Steps Going forward, we’d like to add a few things to make this an even more complete solution • A UI for model deployment • Model Monitoring w/ dashboards • Alternative endpoint formats (e.g. gRPC) Casey Stella@casey_stella (Hortonworks) Model as a Service 2017
  • 39. Next Steps Going forward, we’d like to add a few things to make this an even more complete solution • A UI for model deployment • Model Monitoring w/ dashboards • Alternative endpoint formats (e.g. gRPC) • Automatic conversion of common model output formats to an endpoint Casey Stella@casey_stella (Hortonworks) Model as a Service 2017
  • 40. Next Steps Going forward, we’d like to add a few things to make this an even more complete solution • A UI for model deployment • Model Monitoring w/ dashboards • Alternative endpoint formats (e.g. gRPC) • Automatic conversion of common model output formats to an endpoint • A searchable model registry Casey Stella@casey_stella (Hortonworks) Model as a Service 2017
  • 41. Questions Thanks for your attention! Don’t forget to come to the cybersecurity Bird of a Feather session Thursday. • Find me at http://caseystella.com • Twitter handle: @casey_stella • Email address: cstella@hortonworks.com Casey Stella@casey_stella (Hortonworks) Model as a Service 2017