SlideShare a Scribd company logo
1 of 46
© Rocana, Inc. All Rights Reserved. | 1
Joey Echeverria, Platform Technical Lead
San Francisco Hadoop Users Group, June 14th 2016
San Francisco, CA
Streaming ETL for All
© Rocana, Inc. All Rights Reserved. | 2
Slides
http://bit.ly/streaming-etl-slides
© Rocana, Inc. All Rights Reserved. | 3
Context
© Rocana, Inc. All Rights Reserved. | 4
Joey
• Where I work: Rocana – Platform Technical Lead
• Where I used to work: Cloudera (’11-’15), NSA
• Distributed systems, security, data processing, big data
© Rocana, Inc. All Rights Reserved. | 5
© Rocana, Inc. All Rights Reserved. | 6
History
© Rocana, Inc. All Rights Reserved. | 7
Spark
Impala
“Legacy” data architecture
HDFS
Avro/Parquet FilesFlume/Sqoop
Data Producers
MapReduc
e
Visualization/Query
© Rocana, Inc. All Rights Reserved. | 8
Flink
Storm
Stream data architecture
Kafka
Avro Serialized
Recrods
Data Producers Spark Streaming
Real-time Visualization
HDFS
Avro/Parquet FilesKafka Consumers
© Rocana, Inc. All Rights Reserved. | 9
Flink
Storm
Stream data architecture
Kafka
Avro Serialized
Recrods
Data Producers Spark Streaming
Real-time Visualization
HDFS
Avro/Parquet FilesKafka Consumers
© Rocana, Inc. All Rights Reserved. | 10
Stream processing
A primer
© Rocana, Inc. All Rights Reserved. | 11
Stream processing
• Filter
• Extract
• Project
• Aggregate
• Join
• Model
© Rocana, Inc. All Rights Reserved. | 12
Stream processing
• Filter
• Extract
• Project
• Aggregate
• Join
• Model
© Rocana, Inc. All Rights Reserved. | 13
Stream processing
• Filter
• Extract
• Project
• Aggregate
• Join
• Model
• Data transformation
© Rocana, Inc. All Rights Reserved. | 14
Apache Storm
• "Distributed real-time computation system"
• Applications packaged into topologies (think MapReduce job)
• Topologies operate over streams of tuples
• Spout: source of a stream
• Bolt: arbitrary operation such as filtering, aggregating, joining, or
executing arbitrary functions
© Rocana, Inc. All Rights Reserved. | 15
Apache Spark
• Supports batch and stream processing
• Continuous stream of records discretized into a DStream
• DStream: a sequence of RDDs (batches of records)
• Micro-batch
© Rocana, Inc. All Rights Reserved. | 16
Apache Flink
• Supports batch and stream processing
• DataStream: unbounded collection of records
• Operations can apply to individual records or windows of records
• Supports record-at-a-time processing (like Storm)
© Rocana, Inc. All Rights Reserved. | 17
Apache Kafka
• Pub-sub messaging system implemented as a distributed commit log
• Popular as a source and sink for data streams
• Scalability, durability, and easy-to-understand delivery guarantees
• Can do stream processing directly in Kafka consumers
• Kafka Streams
© Rocana, Inc. All Rights Reserved. | 18
Data transformation
© Rocana, Inc. All Rights Reserved. | 19
Filter
filter
© Rocana, Inc. All Rights Reserved. | 20
Extract
127.0.0.1 Mozilla/5.0 laura [31/Mar/2016] "GET /index.html HTTP/1.0" 200 2326
ts: 1436576671000
body: <binary blob>
event_type_id: 100
...
extract
ts: 1436576671000
body: <binary blob>
event_type_id: 100
attributes: {
ip: "127.0.0.1"
user_agent: "Mozilla/5.0"
user_id: "laura"
date: "[31/March/2016]"
request: "GET /index.html HTTP/1.0"
status_code: "200"
size: "2326"
}
© Rocana, Inc. All Rights Reserved. | 21
Project
ts: 1436576671000
body: <binary blob>
event_type_id: 100
attributes: {
ip: "127.0.0.1"
user_agent: "Mozilla/5.0"
user_id: "laura"
date: "[31/March/2016]"
request: "GET /index.html HTTP/1.0"
status_code: "200"
size: "2326"
}
ts: 1459444413000
ip: "127.0.0.1"
user_agent: "Mozilla/5.0"
user_id: "laura"
request: "GET /index.html HTTP/1.0"
status_code: 200
size: 2326
project
© Rocana, Inc. All Rights Reserved. | 22
Problem
© Rocana, Inc. All Rights Reserved. | 23
Who
• Developers
• Data engineers
• Sysadmins
• Analysts
© Rocana, Inc. All Rights Reserved. | 24
Tools
© Rocana, Inc. All Rights Reserved. | 25
The dark art of data science
• Feature engineering
• “Getting a mess of raw data that can be used as input to a machine
learning algorithm” - @josh_wills
• Video from Midwest.io 2014
© Rocana, Inc. All Rights Reserved. | 26
Data transformation for all
© Rocana, Inc. All Rights Reserved. | 27
Rocana Transform
• Library
• Java
• Rocana configuration
• JSON + comments + specific numeric types - excess quoting
© Rocana, Inc. All Rights Reserved. | 28
Data model
• Event schema
• id: A globally unique identifier for this event
• ts: Epoch timestamp in milliseconds
• event_type_id: ID indicating the type of the event
• location: Location from which the event was generated
• host: Hostname, IP, or other device identifier from which the event was
generated
• service: Service or process from which the event was generated
• body: Raw event content in bytes
• attributes: Event type-specific key/value pairs
© Rocana, Inc. All Rights Reserved. | 29
Example event
{
"id": "JRHAIDMLCKLEAPMIQDHFLO3MXYXV7NVBEJNDKZGS2XVSEINGGBHA====",
"event_type_id": 100,
"ts": 1436576671000,
"location": "aws/us-west-2a",
"host": "example01.rocana.com",
"service": "dhclient",
"body": "<36>Jul 10 18:04:31 gs09.example.com dhclient[865] DHCPACK from …",
"attributes": {
"syslog_timestamp": "1436576671000",
"syslog_process": "dhclient",
"syslog_pid": "865",
"syslog_facility": "3",
"syslog_severity": "6",
"syslog_hostname": "example01",
"syslog_message": "DHCPACK from 10.10.1.1 (xid=0x5c64bdb0)"
}
}
© Rocana, Inc. All Rights Reserved. | 30
Filter, extract, and flatten
© Rocana, Inc. All Rights Reserved. | 31
Filter, extract, and flatten
• Filter out events without type id 100
• Filter out events without hostname prefix "ex"
• Extract a numeric prefix from the syslog message
• Flatten syslog attributes to top-level fields in a different avro schema
© Rocana, Inc. All Rights Reserved. | 32
Filter, extract, and flatten
{
load-event: {},
// Filter by event_type_id
filter: { expression: "${event_type_id == 100}" },
// Extract hostname prefix
regex: { ... },
filter: { expression: "${host_prefix.match.group.1 == 'ex'}",
// Extract a numeric prefix from the syslog message
regex: { ... },
// Build flattened record
build-avro-record: { ... },
// Accumulate output record
accumulate-output: {
value: "${output_record}"
}
}
© Rocana, Inc. All Rights Reserved. | 33
Extract hostname prefix
{
load-event: {},
filter: { expression: "${event_type_id == 100}" },
regex: {
pattern: "^(.{2}).*$",
value: "${attr.syslog_hostname}",
destination: "host_prefix"
},
filter: { expression: "${host_prefix.match.group.1 == 'ex'}",
...
}
© Rocana, Inc. All Rights Reserved. | 34
Extract numeric prefix
...
filter: { expression: "${host_prefix.match.group.1 == 'ex'}",
regex: {
pattern: "^([0-9]*)",
value: "${attributes['syslog_message']}",
destination: "msg",
match-actions: {
set-values: { extracted_field: "${msg.match.group.1}" }
},
no-match-actions: {
set-values: { extracted_field: "" }
}
},
...
© Rocana, Inc. All Rights Reserved. | 35
Build flattened record
...
build-avro-record: {
schema-uri: "resource:avro-schemas/flattened-syslog.avsc",
destination: "output_record",
field-mapping: {
ts: "${ts}",
event_type_id: "${event_type_id}",
source: "${source}",
syslog_facility: "${convert:toInt(attributes['syslog_facility'])}",
syslog_severity: "${convert:toInt(attributes['syslog_severity'])}",
...
syslog_message: "${attributes['syslog_message']}",
syslog_pid: "${convert:toInt(attributes['syslog_pid)}",
extracted_field: "${extracted_field}"
},
},
...
© Rocana, Inc. All Rights Reserved. | 36
Extract metrics from log data
© Rocana, Inc. All Rights Reserved. | 37
Extract metrics
• Input: HTTP status logs
• Extract request latency
• Extract counts by HTTP status code
• Metric types
• Guage: A value that varies over time (think latency, CPU %, etc.)
• Counter: A value that accumulates over time (think event volume, status codes,
etc.)
© Rocana, Inc. All Rights Reserved. | 38
Example metric event
{
"id": "JRHAIDMLCKLEAPMIQDHFLO3MXBBQ7NVBEJNDKZGS2XVSEINGGBHA====",
"event_type_id": 107,
"ts": 1436576671000,
"location": "aws/us-west-2a",
"host": "web01.rocana.com",
"service": "httpd",
"attributes": {
"m.http.request.latency": "4.2000000000E1|g",
"m.http.status.401.count": "1.0000000000E0|c",
}
}
© Rocana, Inc. All Rights Reserved. | 39
Extract metrics
{
load-event: {},
build-metric: {
gauge-mapping: {
http.request.latency: "${convert:toDouble(attributes['latency'])}"
},
destination: "latency_metric"
},
accumulate-output: { value: "${latency_metric}" },
build-metric: {
dynamic-counter-mapping: [
"${string:format('http.status.%s.count', attributes['sc_status'])}", 1D
],
destination: "status_metric"
},
accumulate-output: { value: "${status_metric}" }
}
© Rocana, Inc. All Rights Reserved. | 40
Architecture
© Rocana, Inc. All Rights Reserved. | 41
Java action objects
Architecture
Configuration file Java action objects Context
Variables
Driver
1. Parse config
2. Initialize
context
5. Copy output
3. Execute actions
4. Read/write
variables
© Rocana, Inc. All Rights Reserved. | 42
Custom actions
• Actions loaded at runtime using Java services framework
• Add your jar to the classpath
• Custom actions appear as top-level keywords just like regular actions
• Implement the execute() method of the Action interface
• Implement the build() method of the ActionBuilder interface
© Rocana, Inc. All Rights Reserved. | 43
Custom actions
• Parse custom log formats
• Cisco ACS
• Citrix
• Juniper
• Customer-specific formats
• Lookup IP addresses in the MaxMind GeoIP2 database
• Reference dataset lookups
• Device id to device name
© Rocana, Inc. All Rights Reserved. | 44
Putting it all together
• Stream processing is causing us to re-think how we analyze data
• Limiting accessibility of data transformation side increases costs and
decreases velocity
• Reduce your reliance on developers to code custom pipelines
• Re-use transformation configuration in any stream processing framework
or batch job
© Rocana, Inc. All Rights Reserved. | 45
Coming soon
• Rocana transform will be released under the ASL 2.0
• The configuration library is available today:
• https://github.com/scalingdata/rocana-configuration
© Rocana, Inc. All Rights Reserved. | 46
Questions?

More Related Content

What's hot

Lambda architecture with Spark
Lambda architecture with SparkLambda architecture with Spark
Lambda architecture with SparkVincent GALOPIN
 
Lambda architecture on Spark, Kafka for real-time large scale ML
Lambda architecture on Spark, Kafka for real-time large scale MLLambda architecture on Spark, Kafka for real-time large scale ML
Lambda architecture on Spark, Kafka for real-time large scale MLhuguk
 
Spark Summit EU talk by Stephan Kessler
Spark Summit EU talk by Stephan KesslerSpark Summit EU talk by Stephan Kessler
Spark Summit EU talk by Stephan KesslerSpark Summit
 
Deep Dive into the New Features of Apache Spark 3.1
Deep Dive into the New Features of Apache Spark 3.1Deep Dive into the New Features of Apache Spark 3.1
Deep Dive into the New Features of Apache Spark 3.1Databricks
 
Lambda architecture: from zero to One
Lambda architecture: from zero to OneLambda architecture: from zero to One
Lambda architecture: from zero to OneSerg Masyutin
 
Real time ETL processing using Spark streaming
Real time ETL processing using Spark streamingReal time ETL processing using Spark streaming
Real time ETL processing using Spark streamingdatamantra
 
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch AnalysisNoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch AnalysisHelena Edelson
 
Kappa Architecture on Apache Kafka and Querona: datamass.io
Kappa Architecture on Apache Kafka and Querona: datamass.ioKappa Architecture on Apache Kafka and Querona: datamass.io
Kappa Architecture on Apache Kafka and Querona: datamass.ioPiotr Czarnas
 
Streaming Microservices With Akka Streams And Kafka Streams
Streaming Microservices With Akka Streams And Kafka StreamsStreaming Microservices With Akka Streams And Kafka Streams
Streaming Microservices With Akka Streams And Kafka StreamsLightbend
 
Serverless and Streaming: Building ‘eBay’ by ‘Turning the Database Inside Out’
Serverless and Streaming: Building ‘eBay’ by ‘Turning the Database Inside Out’ Serverless and Streaming: Building ‘eBay’ by ‘Turning the Database Inside Out’
Serverless and Streaming: Building ‘eBay’ by ‘Turning the Database Inside Out’ confluent
 
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)Helena Edelson
 
Reactive dashboard’s using apache spark
Reactive dashboard’s using apache sparkReactive dashboard’s using apache spark
Reactive dashboard’s using apache sparkRahul Kumar
 
2015 01-17 Lambda Architecture with Apache Spark, NextML Conference
2015 01-17 Lambda Architecture with Apache Spark, NextML Conference2015 01-17 Lambda Architecture with Apache Spark, NextML Conference
2015 01-17 Lambda Architecture with Apache Spark, NextML ConferenceDB Tsai
 
Using Spark, Kafka, Cassandra and Akka on Mesos for Real-Time Personalization
Using Spark, Kafka, Cassandra and Akka on Mesos for Real-Time PersonalizationUsing Spark, Kafka, Cassandra and Akka on Mesos for Real-Time Personalization
Using Spark, Kafka, Cassandra and Akka on Mesos for Real-Time PersonalizationPatrick Di Loreto
 
Growing the Delta Ecosystem to Rust and Python with Delta-RS
Growing the Delta Ecosystem to Rust and Python with Delta-RSGrowing the Delta Ecosystem to Rust and Python with Delta-RS
Growing the Delta Ecosystem to Rust and Python with Delta-RSDatabricks
 
Designing Payloads for Event-Driven Systems | Lorna Mitchell, Aiven
Designing Payloads for Event-Driven Systems | Lorna Mitchell, AivenDesigning Payloads for Event-Driven Systems | Lorna Mitchell, Aiven
Designing Payloads for Event-Driven Systems | Lorna Mitchell, AivenHostedbyConfluent
 
Intro to YARN (Hadoop 2.0) & Apex as YARN App (Next Gen Big Data)
Intro to YARN (Hadoop 2.0) & Apex as YARN App (Next Gen Big Data)Intro to YARN (Hadoop 2.0) & Apex as YARN App (Next Gen Big Data)
Intro to YARN (Hadoop 2.0) & Apex as YARN App (Next Gen Big Data)Apache Apex
 
Future of data visualization
Future of data visualizationFuture of data visualization
Future of data visualizationhadoopsphere
 
How Credit Karma Makes Real-Time Decisions For 60 Million Users With Akka Str...
How Credit Karma Makes Real-Time Decisions For 60 Million Users With Akka Str...How Credit Karma Makes Real-Time Decisions For 60 Million Users With Akka Str...
How Credit Karma Makes Real-Time Decisions For 60 Million Users With Akka Str...Lightbend
 

What's hot (20)

How to deploy Apache Spark 
to Mesos/DCOS
How to deploy Apache Spark 
to Mesos/DCOSHow to deploy Apache Spark 
to Mesos/DCOS
How to deploy Apache Spark 
to Mesos/DCOS
 
Lambda architecture with Spark
Lambda architecture with SparkLambda architecture with Spark
Lambda architecture with Spark
 
Lambda architecture on Spark, Kafka for real-time large scale ML
Lambda architecture on Spark, Kafka for real-time large scale MLLambda architecture on Spark, Kafka for real-time large scale ML
Lambda architecture on Spark, Kafka for real-time large scale ML
 
Spark Summit EU talk by Stephan Kessler
Spark Summit EU talk by Stephan KesslerSpark Summit EU talk by Stephan Kessler
Spark Summit EU talk by Stephan Kessler
 
Deep Dive into the New Features of Apache Spark 3.1
Deep Dive into the New Features of Apache Spark 3.1Deep Dive into the New Features of Apache Spark 3.1
Deep Dive into the New Features of Apache Spark 3.1
 
Lambda architecture: from zero to One
Lambda architecture: from zero to OneLambda architecture: from zero to One
Lambda architecture: from zero to One
 
Real time ETL processing using Spark streaming
Real time ETL processing using Spark streamingReal time ETL processing using Spark streaming
Real time ETL processing using Spark streaming
 
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch AnalysisNoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis
 
Kappa Architecture on Apache Kafka and Querona: datamass.io
Kappa Architecture on Apache Kafka and Querona: datamass.ioKappa Architecture on Apache Kafka and Querona: datamass.io
Kappa Architecture on Apache Kafka and Querona: datamass.io
 
Streaming Microservices With Akka Streams And Kafka Streams
Streaming Microservices With Akka Streams And Kafka StreamsStreaming Microservices With Akka Streams And Kafka Streams
Streaming Microservices With Akka Streams And Kafka Streams
 
Serverless and Streaming: Building ‘eBay’ by ‘Turning the Database Inside Out’
Serverless and Streaming: Building ‘eBay’ by ‘Turning the Database Inside Out’ Serverless and Streaming: Building ‘eBay’ by ‘Turning the Database Inside Out’
Serverless and Streaming: Building ‘eBay’ by ‘Turning the Database Inside Out’
 
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
 
Reactive dashboard’s using apache spark
Reactive dashboard’s using apache sparkReactive dashboard’s using apache spark
Reactive dashboard’s using apache spark
 
2015 01-17 Lambda Architecture with Apache Spark, NextML Conference
2015 01-17 Lambda Architecture with Apache Spark, NextML Conference2015 01-17 Lambda Architecture with Apache Spark, NextML Conference
2015 01-17 Lambda Architecture with Apache Spark, NextML Conference
 
Using Spark, Kafka, Cassandra and Akka on Mesos for Real-Time Personalization
Using Spark, Kafka, Cassandra and Akka on Mesos for Real-Time PersonalizationUsing Spark, Kafka, Cassandra and Akka on Mesos for Real-Time Personalization
Using Spark, Kafka, Cassandra and Akka on Mesos for Real-Time Personalization
 
Growing the Delta Ecosystem to Rust and Python with Delta-RS
Growing the Delta Ecosystem to Rust and Python with Delta-RSGrowing the Delta Ecosystem to Rust and Python with Delta-RS
Growing the Delta Ecosystem to Rust and Python with Delta-RS
 
Designing Payloads for Event-Driven Systems | Lorna Mitchell, Aiven
Designing Payloads for Event-Driven Systems | Lorna Mitchell, AivenDesigning Payloads for Event-Driven Systems | Lorna Mitchell, Aiven
Designing Payloads for Event-Driven Systems | Lorna Mitchell, Aiven
 
Intro to YARN (Hadoop 2.0) & Apex as YARN App (Next Gen Big Data)
Intro to YARN (Hadoop 2.0) & Apex as YARN App (Next Gen Big Data)Intro to YARN (Hadoop 2.0) & Apex as YARN App (Next Gen Big Data)
Intro to YARN (Hadoop 2.0) & Apex as YARN App (Next Gen Big Data)
 
Future of data visualization
Future of data visualizationFuture of data visualization
Future of data visualization
 
How Credit Karma Makes Real-Time Decisions For 60 Million Users With Akka Str...
How Credit Karma Makes Real-Time Decisions For 60 Million Users With Akka Str...How Credit Karma Makes Real-Time Decisions For 60 Million Users With Akka Str...
How Credit Karma Makes Real-Time Decisions For 60 Million Users With Akka Str...
 

Viewers also liked

Building production spark streaming applications
Building production spark streaming applicationsBuilding production spark streaming applications
Building production spark streaming applicationsJoey Echeverria
 
Spark to Production @Windward
Spark to Production @WindwardSpark to Production @Windward
Spark to Production @WindwardDemi Ben-Ari
 
Spark and Hadoop at Production Scale-(Anil Gadre, MapR)
Spark and Hadoop at Production Scale-(Anil Gadre, MapR)Spark and Hadoop at Production Scale-(Anil Gadre, MapR)
Spark and Hadoop at Production Scale-(Anil Gadre, MapR)Spark Summit
 
Spark summit2014 techtalk - testing spark
Spark summit2014 techtalk - testing sparkSpark summit2014 techtalk - testing spark
Spark summit2014 techtalk - testing sparkAnu Shetty
 
Test Automation and Continuous Integration
Test Automation and Continuous Integration Test Automation and Continuous Integration
Test Automation and Continuous Integration TestCampRO
 
Distributed Testing Environment
Distributed Testing EnvironmentDistributed Testing Environment
Distributed Testing EnvironmentŁukasz Morawski
 
Anatomy of Relevance - From Data to Action: Presented by Saïd Radhouani, Yell...
Anatomy of Relevance - From Data to Action: Presented by Saïd Radhouani, Yell...Anatomy of Relevance - From Data to Action: Presented by Saïd Radhouani, Yell...
Anatomy of Relevance - From Data to Action: Presented by Saïd Radhouani, Yell...Lucidworks
 
Production Readiness Testing Using Spark
Production Readiness Testing Using SparkProduction Readiness Testing Using Spark
Production Readiness Testing Using SparkSalesforce Engineering
 
Ektron 8.5 RC - Search
Ektron 8.5 RC - SearchEktron 8.5 RC - Search
Ektron 8.5 RC - SearchBillCavaUs
 
Solr JDBC - Lucene/Solr Revolution 2016
Solr JDBC - Lucene/Solr Revolution 2016Solr JDBC - Lucene/Solr Revolution 2016
Solr JDBC - Lucene/Solr Revolution 2016Kevin Risden
 
Events, Signals, and Recommendations
Events, Signals, and RecommendationsEvents, Signals, and Recommendations
Events, Signals, and RecommendationsLucidworks
 
Netflix Global Search - Lucene Revolution
Netflix Global Search - Lucene RevolutionNetflix Global Search - Lucene Revolution
Netflix Global Search - Lucene Revolutionivan provalov
 
Interactive workflow management using Azkaban
Interactive workflow management using AzkabanInteractive workflow management using Azkaban
Interactive workflow management using Azkabandatamantra
 
Autocomplete Multi-Language Search Using Ngram and EDismax Phrase Queries: Pr...
Autocomplete Multi-Language Search Using Ngram and EDismax Phrase Queries: Pr...Autocomplete Multi-Language Search Using Ngram and EDismax Phrase Queries: Pr...
Autocomplete Multi-Language Search Using Ngram and EDismax Phrase Queries: Pr...Lucidworks
 
Galene - LinkedIn's Search Architecture: Presented by Diego Buthay & Sriram S...
Galene - LinkedIn's Search Architecture: Presented by Diego Buthay & Sriram S...Galene - LinkedIn's Search Architecture: Presented by Diego Buthay & Sriram S...
Galene - LinkedIn's Search Architecture: Presented by Diego Buthay & Sriram S...Lucidworks
 
Airbnb Search Architecture: Presented by Maxim Charkov, Airbnb
Airbnb Search Architecture: Presented by Maxim Charkov, AirbnbAirbnb Search Architecture: Presented by Maxim Charkov, Airbnb
Airbnb Search Architecture: Presented by Maxim Charkov, AirbnbLucidworks
 
Spark: Interactive To Production
Spark: Interactive To ProductionSpark: Interactive To Production
Spark: Interactive To ProductionJen Aman
 
Spark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersSpark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersDatabricks
 

Viewers also liked (20)

Building production spark streaming applications
Building production spark streaming applicationsBuilding production spark streaming applications
Building production spark streaming applications
 
Spark to Production @Windward
Spark to Production @WindwardSpark to Production @Windward
Spark to Production @Windward
 
Spark and Hadoop at Production Scale-(Anil Gadre, MapR)
Spark and Hadoop at Production Scale-(Anil Gadre, MapR)Spark and Hadoop at Production Scale-(Anil Gadre, MapR)
Spark and Hadoop at Production Scale-(Anil Gadre, MapR)
 
Spark summit2014 techtalk - testing spark
Spark summit2014 techtalk - testing sparkSpark summit2014 techtalk - testing spark
Spark summit2014 techtalk - testing spark
 
Test Automation and Continuous Integration
Test Automation and Continuous Integration Test Automation and Continuous Integration
Test Automation and Continuous Integration
 
Distributed Testing Environment
Distributed Testing EnvironmentDistributed Testing Environment
Distributed Testing Environment
 
Anatomy of Relevance - From Data to Action: Presented by Saïd Radhouani, Yell...
Anatomy of Relevance - From Data to Action: Presented by Saïd Radhouani, Yell...Anatomy of Relevance - From Data to Action: Presented by Saïd Radhouani, Yell...
Anatomy of Relevance - From Data to Action: Presented by Saïd Radhouani, Yell...
 
Search at Twitter
Search at TwitterSearch at Twitter
Search at Twitter
 
Production Readiness Testing Using Spark
Production Readiness Testing Using SparkProduction Readiness Testing Using Spark
Production Readiness Testing Using Spark
 
Ektron 8.5 RC - Search
Ektron 8.5 RC - SearchEktron 8.5 RC - Search
Ektron 8.5 RC - Search
 
Solr JDBC - Lucene/Solr Revolution 2016
Solr JDBC - Lucene/Solr Revolution 2016Solr JDBC - Lucene/Solr Revolution 2016
Solr JDBC - Lucene/Solr Revolution 2016
 
Events, Signals, and Recommendations
Events, Signals, and RecommendationsEvents, Signals, and Recommendations
Events, Signals, and Recommendations
 
Netflix Global Search - Lucene Revolution
Netflix Global Search - Lucene RevolutionNetflix Global Search - Lucene Revolution
Netflix Global Search - Lucene Revolution
 
Interactive workflow management using Azkaban
Interactive workflow management using AzkabanInteractive workflow management using Azkaban
Interactive workflow management using Azkaban
 
Running Spark in Production
Running Spark in ProductionRunning Spark in Production
Running Spark in Production
 
Autocomplete Multi-Language Search Using Ngram and EDismax Phrase Queries: Pr...
Autocomplete Multi-Language Search Using Ngram and EDismax Phrase Queries: Pr...Autocomplete Multi-Language Search Using Ngram and EDismax Phrase Queries: Pr...
Autocomplete Multi-Language Search Using Ngram and EDismax Phrase Queries: Pr...
 
Galene - LinkedIn's Search Architecture: Presented by Diego Buthay & Sriram S...
Galene - LinkedIn's Search Architecture: Presented by Diego Buthay & Sriram S...Galene - LinkedIn's Search Architecture: Presented by Diego Buthay & Sriram S...
Galene - LinkedIn's Search Architecture: Presented by Diego Buthay & Sriram S...
 
Airbnb Search Architecture: Presented by Maxim Charkov, Airbnb
Airbnb Search Architecture: Presented by Maxim Charkov, AirbnbAirbnb Search Architecture: Presented by Maxim Charkov, Airbnb
Airbnb Search Architecture: Presented by Maxim Charkov, Airbnb
 
Spark: Interactive To Production
Spark: Interactive To ProductionSpark: Interactive To Production
Spark: Interactive To Production
 
Spark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersSpark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production users
 

Similar to Streaming ETL for All

Embeddable data transformation for real time streams
Embeddable data transformation for real time streamsEmbeddable data transformation for real time streams
Embeddable data transformation for real time streamsJoey Echeverria
 
Building a system for machine and event-oriented data - SF HUG Nov 2015
Building a system for machine and event-oriented data - SF HUG Nov 2015Building a system for machine and event-oriented data - SF HUG Nov 2015
Building a system for machine and event-oriented data - SF HUG Nov 2015Felicia Haggarty
 
Fabric - Realtime stream processing framework
Fabric - Realtime stream processing frameworkFabric - Realtime stream processing framework
Fabric - Realtime stream processing frameworkShashank Gautam
 
nuclio Overview October 2017
nuclio Overview October 2017nuclio Overview October 2017
nuclio Overview October 2017iguazio
 
iguazio - nuclio overview to CNCF (Sep 25th 2017)
iguazio - nuclio overview to CNCF (Sep 25th 2017)iguazio - nuclio overview to CNCF (Sep 25th 2017)
iguazio - nuclio overview to CNCF (Sep 25th 2017)Eran Duchan
 
Simplifying Event Streaming: Tools for Location Transparency and Data Evoluti...
Simplifying Event Streaming: Tools for Location Transparency and Data Evoluti...Simplifying Event Streaming: Tools for Location Transparency and Data Evoluti...
Simplifying Event Streaming: Tools for Location Transparency and Data Evoluti...confluent
 
Building a system for machine and event-oriented data with Rocana
Building a system for machine and event-oriented data with RocanaBuilding a system for machine and event-oriented data with Rocana
Building a system for machine and event-oriented data with RocanaTreasure Data, Inc.
 
Building an Event-oriented Data Platform with Kafka, Eric Sammer
Building an Event-oriented Data Platform with Kafka, Eric Sammer Building an Event-oriented Data Platform with Kafka, Eric Sammer
Building an Event-oriented Data Platform with Kafka, Eric Sammer confluent
 
Rocana Deep Dive OC Big Data Meetup #19 Sept 21st 2016
Rocana Deep Dive OC Big Data Meetup #19 Sept 21st 2016Rocana Deep Dive OC Big Data Meetup #19 Sept 21st 2016
Rocana Deep Dive OC Big Data Meetup #19 Sept 21st 2016cdmaxime
 
Real Time Data Processing Using Spark Streaming
Real Time Data Processing Using Spark StreamingReal Time Data Processing Using Spark Streaming
Real Time Data Processing Using Spark StreamingHari Shreedharan
 
Real Time Data Processing using Spark Streaming | Data Day Texas 2015
Real Time Data Processing using Spark Streaming | Data Day Texas 2015Real Time Data Processing using Spark Streaming | Data Day Texas 2015
Real Time Data Processing using Spark Streaming | Data Day Texas 2015Cloudera, Inc.
 
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Guido Schmutz
 
Elk presentation 2#3
Elk presentation 2#3Elk presentation 2#3
Elk presentation 2#3uzzal basak
 
Building a system for machine and event-oriented data - Data Day Seattle 2015
Building a system for machine and event-oriented data - Data Day Seattle 2015Building a system for machine and event-oriented data - Data Day Seattle 2015
Building a system for machine and event-oriented data - Data Day Seattle 2015Eric Sammer
 
Apache Big Data EU 2016: Building Streaming Applications with Apache Apex
Apache Big Data EU 2016: Building Streaming Applications with Apache ApexApache Big Data EU 2016: Building Streaming Applications with Apache Apex
Apache Big Data EU 2016: Building Streaming Applications with Apache ApexApache Apex
 
Introduction to Apache NiFi 1.11.4
Introduction to Apache NiFi 1.11.4Introduction to Apache NiFi 1.11.4
Introduction to Apache NiFi 1.11.4Timothy Spann
 
Stream Processing with Apache Kafka and .NET
Stream Processing with Apache Kafka and .NETStream Processing with Apache Kafka and .NET
Stream Processing with Apache Kafka and .NETconfluent
 

Similar to Streaming ETL for All (20)

Embeddable data transformation for real time streams
Embeddable data transformation for real time streamsEmbeddable data transformation for real time streams
Embeddable data transformation for real time streams
 
Building a system for machine and event-oriented data - SF HUG Nov 2015
Building a system for machine and event-oriented data - SF HUG Nov 2015Building a system for machine and event-oriented data - SF HUG Nov 2015
Building a system for machine and event-oriented data - SF HUG Nov 2015
 
Fabric - Realtime stream processing framework
Fabric - Realtime stream processing frameworkFabric - Realtime stream processing framework
Fabric - Realtime stream processing framework
 
Spark+flume seattle
Spark+flume seattleSpark+flume seattle
Spark+flume seattle
 
nuclio Overview October 2017
nuclio Overview October 2017nuclio Overview October 2017
nuclio Overview October 2017
 
iguazio - nuclio overview to CNCF (Sep 25th 2017)
iguazio - nuclio overview to CNCF (Sep 25th 2017)iguazio - nuclio overview to CNCF (Sep 25th 2017)
iguazio - nuclio overview to CNCF (Sep 25th 2017)
 
Simplifying Event Streaming: Tools for Location Transparency and Data Evoluti...
Simplifying Event Streaming: Tools for Location Transparency and Data Evoluti...Simplifying Event Streaming: Tools for Location Transparency and Data Evoluti...
Simplifying Event Streaming: Tools for Location Transparency and Data Evoluti...
 
Building a system for machine and event-oriented data with Rocana
Building a system for machine and event-oriented data with RocanaBuilding a system for machine and event-oriented data with Rocana
Building a system for machine and event-oriented data with Rocana
 
Building an Event-oriented Data Platform with Kafka, Eric Sammer
Building an Event-oriented Data Platform with Kafka, Eric Sammer Building an Event-oriented Data Platform with Kafka, Eric Sammer
Building an Event-oriented Data Platform with Kafka, Eric Sammer
 
Rocana Deep Dive OC Big Data Meetup #19 Sept 21st 2016
Rocana Deep Dive OC Big Data Meetup #19 Sept 21st 2016Rocana Deep Dive OC Big Data Meetup #19 Sept 21st 2016
Rocana Deep Dive OC Big Data Meetup #19 Sept 21st 2016
 
Real Time Data Processing Using Spark Streaming
Real Time Data Processing Using Spark StreamingReal Time Data Processing Using Spark Streaming
Real Time Data Processing Using Spark Streaming
 
Real Time Data Processing using Spark Streaming | Data Day Texas 2015
Real Time Data Processing using Spark Streaming | Data Day Texas 2015Real Time Data Processing using Spark Streaming | Data Day Texas 2015
Real Time Data Processing using Spark Streaming | Data Day Texas 2015
 
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
 
Spark etl
Spark etlSpark etl
Spark etl
 
20170126 big data processing
20170126 big data processing20170126 big data processing
20170126 big data processing
 
Elk presentation 2#3
Elk presentation 2#3Elk presentation 2#3
Elk presentation 2#3
 
Building a system for machine and event-oriented data - Data Day Seattle 2015
Building a system for machine and event-oriented data - Data Day Seattle 2015Building a system for machine and event-oriented data - Data Day Seattle 2015
Building a system for machine and event-oriented data - Data Day Seattle 2015
 
Apache Big Data EU 2016: Building Streaming Applications with Apache Apex
Apache Big Data EU 2016: Building Streaming Applications with Apache ApexApache Big Data EU 2016: Building Streaming Applications with Apache Apex
Apache Big Data EU 2016: Building Streaming Applications with Apache Apex
 
Introduction to Apache NiFi 1.11.4
Introduction to Apache NiFi 1.11.4Introduction to Apache NiFi 1.11.4
Introduction to Apache NiFi 1.11.4
 
Stream Processing with Apache Kafka and .NET
Stream Processing with Apache Kafka and .NETStream Processing with Apache Kafka and .NET
Stream Processing with Apache Kafka and .NET
 

More from Joey Echeverria

The Future of Apache Hadoop Security
The Future of Apache Hadoop SecurityThe Future of Apache Hadoop Security
The Future of Apache Hadoop SecurityJoey Echeverria
 
Building data pipelines with kite
Building data pipelines with kiteBuilding data pipelines with kite
Building data pipelines with kiteJoey Echeverria
 
Apache Accumulo and Cloudera
Apache Accumulo and ClouderaApache Accumulo and Cloudera
Apache Accumulo and ClouderaJoey Echeverria
 
Analyzing twitter data with hadoop
Analyzing twitter data with hadoopAnalyzing twitter data with hadoop
Analyzing twitter data with hadoopJoey Echeverria
 
Hadoop in three use cases
Hadoop in three use casesHadoop in three use cases
Hadoop in three use casesJoey Echeverria
 
Scratching your own itch
Scratching your own itchScratching your own itch
Scratching your own itchJoey Echeverria
 
The power of hadoop in cloud computing
The power of hadoop in cloud computingThe power of hadoop in cloud computing
The power of hadoop in cloud computingJoey Echeverria
 
Hadoop and h base in the real world
Hadoop and h base in the real worldHadoop and h base in the real world
Hadoop and h base in the real worldJoey Echeverria
 

More from Joey Echeverria (10)

Debugging Apache Spark
Debugging Apache SparkDebugging Apache Spark
Debugging Apache Spark
 
The Future of Apache Hadoop Security
The Future of Apache Hadoop SecurityThe Future of Apache Hadoop Security
The Future of Apache Hadoop Security
 
Building data pipelines with kite
Building data pipelines with kiteBuilding data pipelines with kite
Building data pipelines with kite
 
Apache Accumulo and Cloudera
Apache Accumulo and ClouderaApache Accumulo and Cloudera
Apache Accumulo and Cloudera
 
Analyzing twitter data with hadoop
Analyzing twitter data with hadoopAnalyzing twitter data with hadoop
Analyzing twitter data with hadoop
 
Big data security
Big data securityBig data security
Big data security
 
Hadoop in three use cases
Hadoop in three use casesHadoop in three use cases
Hadoop in three use cases
 
Scratching your own itch
Scratching your own itchScratching your own itch
Scratching your own itch
 
The power of hadoop in cloud computing
The power of hadoop in cloud computingThe power of hadoop in cloud computing
The power of hadoop in cloud computing
 
Hadoop and h base in the real world
Hadoop and h base in the real worldHadoop and h base in the real world
Hadoop and h base in the real world
 

Recently uploaded

Structuring Teams and Portfolios for Success
Structuring Teams and Portfolios for SuccessStructuring Teams and Portfolios for Success
Structuring Teams and Portfolios for SuccessUXDXConf
 
IESVE for Early Stage Design and Planning
IESVE for Early Stage Design and PlanningIESVE for Early Stage Design and Planning
IESVE for Early Stage Design and PlanningIES VE
 
PLAI - Acceleration Program for Generative A.I. Startups
PLAI - Acceleration Program for Generative A.I. StartupsPLAI - Acceleration Program for Generative A.I. Startups
PLAI - Acceleration Program for Generative A.I. StartupsStefano
 
How we scaled to 80K users by doing nothing!.pdf
How we scaled to 80K users by doing nothing!.pdfHow we scaled to 80K users by doing nothing!.pdf
How we scaled to 80K users by doing nothing!.pdfSrushith Repakula
 
Powerful Start- the Key to Project Success, Barbara Laskowska
Powerful Start- the Key to Project Success, Barbara LaskowskaPowerful Start- the Key to Project Success, Barbara Laskowska
Powerful Start- the Key to Project Success, Barbara LaskowskaCzechDreamin
 
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...CzechDreamin
 
Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...
Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...
Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...CzechDreamin
 
Oauth 2.0 Introduction and Flows with MuleSoft
Oauth 2.0 Introduction and Flows with MuleSoftOauth 2.0 Introduction and Flows with MuleSoft
Oauth 2.0 Introduction and Flows with MuleSoftshyamraj55
 
The UX of Automation by AJ King, Senior UX Researcher, Ocado
The UX of Automation by AJ King, Senior UX Researcher, OcadoThe UX of Automation by AJ King, Senior UX Researcher, Ocado
The UX of Automation by AJ King, Senior UX Researcher, OcadoUXDXConf
 
Google I/O Extended 2024 Warsaw
Google I/O Extended 2024 WarsawGoogle I/O Extended 2024 Warsaw
Google I/O Extended 2024 WarsawGDSC PJATK
 
THE BEST IPTV in GERMANY for 2024: IPTVreel
THE BEST IPTV in  GERMANY for 2024: IPTVreelTHE BEST IPTV in  GERMANY for 2024: IPTVreel
THE BEST IPTV in GERMANY for 2024: IPTVreelreely ones
 
10 Differences between Sales Cloud and CPQ, Blanka Doktorová
10 Differences between Sales Cloud and CPQ, Blanka Doktorová10 Differences between Sales Cloud and CPQ, Blanka Doktorová
10 Differences between Sales Cloud and CPQ, Blanka DoktorováCzechDreamin
 
A Business-Centric Approach to Design System Strategy
A Business-Centric Approach to Design System StrategyA Business-Centric Approach to Design System Strategy
A Business-Centric Approach to Design System StrategyUXDXConf
 
Linux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdf
Linux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdfLinux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdf
Linux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdfFIDO Alliance
 
Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...
Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...
Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...FIDO Alliance
 
Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo DiehlFuture Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo DiehlPeter Udo Diehl
 
Introduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdf
Introduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdfIntroduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdf
Introduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdfFIDO Alliance
 
Syngulon - Selection technology May 2024.pdf
Syngulon - Selection technology May 2024.pdfSyngulon - Selection technology May 2024.pdf
Syngulon - Selection technology May 2024.pdfSyngulon
 
Enterprise Knowledge Graphs - Data Summit 2024
Enterprise Knowledge Graphs - Data Summit 2024Enterprise Knowledge Graphs - Data Summit 2024
Enterprise Knowledge Graphs - Data Summit 2024Enterprise Knowledge
 
Connecting the Dots in Product Design at KAYAK
Connecting the Dots in Product Design at KAYAKConnecting the Dots in Product Design at KAYAK
Connecting the Dots in Product Design at KAYAKUXDXConf
 

Recently uploaded (20)

Structuring Teams and Portfolios for Success
Structuring Teams and Portfolios for SuccessStructuring Teams and Portfolios for Success
Structuring Teams and Portfolios for Success
 
IESVE for Early Stage Design and Planning
IESVE for Early Stage Design and PlanningIESVE for Early Stage Design and Planning
IESVE for Early Stage Design and Planning
 
PLAI - Acceleration Program for Generative A.I. Startups
PLAI - Acceleration Program for Generative A.I. StartupsPLAI - Acceleration Program for Generative A.I. Startups
PLAI - Acceleration Program for Generative A.I. Startups
 
How we scaled to 80K users by doing nothing!.pdf
How we scaled to 80K users by doing nothing!.pdfHow we scaled to 80K users by doing nothing!.pdf
How we scaled to 80K users by doing nothing!.pdf
 
Powerful Start- the Key to Project Success, Barbara Laskowska
Powerful Start- the Key to Project Success, Barbara LaskowskaPowerful Start- the Key to Project Success, Barbara Laskowska
Powerful Start- the Key to Project Success, Barbara Laskowska
 
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
 
Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...
Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...
Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...
 
Oauth 2.0 Introduction and Flows with MuleSoft
Oauth 2.0 Introduction and Flows with MuleSoftOauth 2.0 Introduction and Flows with MuleSoft
Oauth 2.0 Introduction and Flows with MuleSoft
 
The UX of Automation by AJ King, Senior UX Researcher, Ocado
The UX of Automation by AJ King, Senior UX Researcher, OcadoThe UX of Automation by AJ King, Senior UX Researcher, Ocado
The UX of Automation by AJ King, Senior UX Researcher, Ocado
 
Google I/O Extended 2024 Warsaw
Google I/O Extended 2024 WarsawGoogle I/O Extended 2024 Warsaw
Google I/O Extended 2024 Warsaw
 
THE BEST IPTV in GERMANY for 2024: IPTVreel
THE BEST IPTV in  GERMANY for 2024: IPTVreelTHE BEST IPTV in  GERMANY for 2024: IPTVreel
THE BEST IPTV in GERMANY for 2024: IPTVreel
 
10 Differences between Sales Cloud and CPQ, Blanka Doktorová
10 Differences between Sales Cloud and CPQ, Blanka Doktorová10 Differences between Sales Cloud and CPQ, Blanka Doktorová
10 Differences between Sales Cloud and CPQ, Blanka Doktorová
 
A Business-Centric Approach to Design System Strategy
A Business-Centric Approach to Design System StrategyA Business-Centric Approach to Design System Strategy
A Business-Centric Approach to Design System Strategy
 
Linux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdf
Linux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdfLinux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdf
Linux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdf
 
Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...
Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...
Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...
 
Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo DiehlFuture Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
 
Introduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdf
Introduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdfIntroduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdf
Introduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdf
 
Syngulon - Selection technology May 2024.pdf
Syngulon - Selection technology May 2024.pdfSyngulon - Selection technology May 2024.pdf
Syngulon - Selection technology May 2024.pdf
 
Enterprise Knowledge Graphs - Data Summit 2024
Enterprise Knowledge Graphs - Data Summit 2024Enterprise Knowledge Graphs - Data Summit 2024
Enterprise Knowledge Graphs - Data Summit 2024
 
Connecting the Dots in Product Design at KAYAK
Connecting the Dots in Product Design at KAYAKConnecting the Dots in Product Design at KAYAK
Connecting the Dots in Product Design at KAYAK
 

Streaming ETL for All

  • 1. © Rocana, Inc. All Rights Reserved. | 1 Joey Echeverria, Platform Technical Lead San Francisco Hadoop Users Group, June 14th 2016 San Francisco, CA Streaming ETL for All
  • 2. © Rocana, Inc. All Rights Reserved. | 2 Slides http://bit.ly/streaming-etl-slides
  • 3. © Rocana, Inc. All Rights Reserved. | 3 Context
  • 4. © Rocana, Inc. All Rights Reserved. | 4 Joey • Where I work: Rocana – Platform Technical Lead • Where I used to work: Cloudera (’11-’15), NSA • Distributed systems, security, data processing, big data
  • 5. © Rocana, Inc. All Rights Reserved. | 5
  • 6. © Rocana, Inc. All Rights Reserved. | 6 History
  • 7. © Rocana, Inc. All Rights Reserved. | 7 Spark Impala “Legacy” data architecture HDFS Avro/Parquet FilesFlume/Sqoop Data Producers MapReduc e Visualization/Query
  • 8. © Rocana, Inc. All Rights Reserved. | 8 Flink Storm Stream data architecture Kafka Avro Serialized Recrods Data Producers Spark Streaming Real-time Visualization HDFS Avro/Parquet FilesKafka Consumers
  • 9. © Rocana, Inc. All Rights Reserved. | 9 Flink Storm Stream data architecture Kafka Avro Serialized Recrods Data Producers Spark Streaming Real-time Visualization HDFS Avro/Parquet FilesKafka Consumers
  • 10. © Rocana, Inc. All Rights Reserved. | 10 Stream processing A primer
  • 11. © Rocana, Inc. All Rights Reserved. | 11 Stream processing • Filter • Extract • Project • Aggregate • Join • Model
  • 12. © Rocana, Inc. All Rights Reserved. | 12 Stream processing • Filter • Extract • Project • Aggregate • Join • Model
  • 13. © Rocana, Inc. All Rights Reserved. | 13 Stream processing • Filter • Extract • Project • Aggregate • Join • Model • Data transformation
  • 14. © Rocana, Inc. All Rights Reserved. | 14 Apache Storm • "Distributed real-time computation system" • Applications packaged into topologies (think MapReduce job) • Topologies operate over streams of tuples • Spout: source of a stream • Bolt: arbitrary operation such as filtering, aggregating, joining, or executing arbitrary functions
  • 15. © Rocana, Inc. All Rights Reserved. | 15 Apache Spark • Supports batch and stream processing • Continuous stream of records discretized into a DStream • DStream: a sequence of RDDs (batches of records) • Micro-batch
  • 16. © Rocana, Inc. All Rights Reserved. | 16 Apache Flink • Supports batch and stream processing • DataStream: unbounded collection of records • Operations can apply to individual records or windows of records • Supports record-at-a-time processing (like Storm)
  • 17. © Rocana, Inc. All Rights Reserved. | 17 Apache Kafka • Pub-sub messaging system implemented as a distributed commit log • Popular as a source and sink for data streams • Scalability, durability, and easy-to-understand delivery guarantees • Can do stream processing directly in Kafka consumers • Kafka Streams
  • 18. © Rocana, Inc. All Rights Reserved. | 18 Data transformation
  • 19. © Rocana, Inc. All Rights Reserved. | 19 Filter filter
  • 20. © Rocana, Inc. All Rights Reserved. | 20 Extract 127.0.0.1 Mozilla/5.0 laura [31/Mar/2016] "GET /index.html HTTP/1.0" 200 2326 ts: 1436576671000 body: <binary blob> event_type_id: 100 ... extract ts: 1436576671000 body: <binary blob> event_type_id: 100 attributes: { ip: "127.0.0.1" user_agent: "Mozilla/5.0" user_id: "laura" date: "[31/March/2016]" request: "GET /index.html HTTP/1.0" status_code: "200" size: "2326" }
  • 21. © Rocana, Inc. All Rights Reserved. | 21 Project ts: 1436576671000 body: <binary blob> event_type_id: 100 attributes: { ip: "127.0.0.1" user_agent: "Mozilla/5.0" user_id: "laura" date: "[31/March/2016]" request: "GET /index.html HTTP/1.0" status_code: "200" size: "2326" } ts: 1459444413000 ip: "127.0.0.1" user_agent: "Mozilla/5.0" user_id: "laura" request: "GET /index.html HTTP/1.0" status_code: 200 size: 2326 project
  • 22. © Rocana, Inc. All Rights Reserved. | 22 Problem
  • 23. © Rocana, Inc. All Rights Reserved. | 23 Who • Developers • Data engineers • Sysadmins • Analysts
  • 24. © Rocana, Inc. All Rights Reserved. | 24 Tools
  • 25. © Rocana, Inc. All Rights Reserved. | 25 The dark art of data science • Feature engineering • “Getting a mess of raw data that can be used as input to a machine learning algorithm” - @josh_wills • Video from Midwest.io 2014
  • 26. © Rocana, Inc. All Rights Reserved. | 26 Data transformation for all
  • 27. © Rocana, Inc. All Rights Reserved. | 27 Rocana Transform • Library • Java • Rocana configuration • JSON + comments + specific numeric types - excess quoting
  • 28. © Rocana, Inc. All Rights Reserved. | 28 Data model • Event schema • id: A globally unique identifier for this event • ts: Epoch timestamp in milliseconds • event_type_id: ID indicating the type of the event • location: Location from which the event was generated • host: Hostname, IP, or other device identifier from which the event was generated • service: Service or process from which the event was generated • body: Raw event content in bytes • attributes: Event type-specific key/value pairs
  • 29. © Rocana, Inc. All Rights Reserved. | 29 Example event { "id": "JRHAIDMLCKLEAPMIQDHFLO3MXYXV7NVBEJNDKZGS2XVSEINGGBHA====", "event_type_id": 100, "ts": 1436576671000, "location": "aws/us-west-2a", "host": "example01.rocana.com", "service": "dhclient", "body": "<36>Jul 10 18:04:31 gs09.example.com dhclient[865] DHCPACK from …", "attributes": { "syslog_timestamp": "1436576671000", "syslog_process": "dhclient", "syslog_pid": "865", "syslog_facility": "3", "syslog_severity": "6", "syslog_hostname": "example01", "syslog_message": "DHCPACK from 10.10.1.1 (xid=0x5c64bdb0)" } }
  • 30. © Rocana, Inc. All Rights Reserved. | 30 Filter, extract, and flatten
  • 31. © Rocana, Inc. All Rights Reserved. | 31 Filter, extract, and flatten • Filter out events without type id 100 • Filter out events without hostname prefix "ex" • Extract a numeric prefix from the syslog message • Flatten syslog attributes to top-level fields in a different avro schema
  • 32. © Rocana, Inc. All Rights Reserved. | 32 Filter, extract, and flatten { load-event: {}, // Filter by event_type_id filter: { expression: "${event_type_id == 100}" }, // Extract hostname prefix regex: { ... }, filter: { expression: "${host_prefix.match.group.1 == 'ex'}", // Extract a numeric prefix from the syslog message regex: { ... }, // Build flattened record build-avro-record: { ... }, // Accumulate output record accumulate-output: { value: "${output_record}" } }
  • 33. © Rocana, Inc. All Rights Reserved. | 33 Extract hostname prefix { load-event: {}, filter: { expression: "${event_type_id == 100}" }, regex: { pattern: "^(.{2}).*$", value: "${attr.syslog_hostname}", destination: "host_prefix" }, filter: { expression: "${host_prefix.match.group.1 == 'ex'}", ... }
  • 34. © Rocana, Inc. All Rights Reserved. | 34 Extract numeric prefix ... filter: { expression: "${host_prefix.match.group.1 == 'ex'}", regex: { pattern: "^([0-9]*)", value: "${attributes['syslog_message']}", destination: "msg", match-actions: { set-values: { extracted_field: "${msg.match.group.1}" } }, no-match-actions: { set-values: { extracted_field: "" } } }, ...
  • 35. © Rocana, Inc. All Rights Reserved. | 35 Build flattened record ... build-avro-record: { schema-uri: "resource:avro-schemas/flattened-syslog.avsc", destination: "output_record", field-mapping: { ts: "${ts}", event_type_id: "${event_type_id}", source: "${source}", syslog_facility: "${convert:toInt(attributes['syslog_facility'])}", syslog_severity: "${convert:toInt(attributes['syslog_severity'])}", ... syslog_message: "${attributes['syslog_message']}", syslog_pid: "${convert:toInt(attributes['syslog_pid)}", extracted_field: "${extracted_field}" }, }, ...
  • 36. © Rocana, Inc. All Rights Reserved. | 36 Extract metrics from log data
  • 37. © Rocana, Inc. All Rights Reserved. | 37 Extract metrics • Input: HTTP status logs • Extract request latency • Extract counts by HTTP status code • Metric types • Guage: A value that varies over time (think latency, CPU %, etc.) • Counter: A value that accumulates over time (think event volume, status codes, etc.)
  • 38. © Rocana, Inc. All Rights Reserved. | 38 Example metric event { "id": "JRHAIDMLCKLEAPMIQDHFLO3MXBBQ7NVBEJNDKZGS2XVSEINGGBHA====", "event_type_id": 107, "ts": 1436576671000, "location": "aws/us-west-2a", "host": "web01.rocana.com", "service": "httpd", "attributes": { "m.http.request.latency": "4.2000000000E1|g", "m.http.status.401.count": "1.0000000000E0|c", } }
  • 39. © Rocana, Inc. All Rights Reserved. | 39 Extract metrics { load-event: {}, build-metric: { gauge-mapping: { http.request.latency: "${convert:toDouble(attributes['latency'])}" }, destination: "latency_metric" }, accumulate-output: { value: "${latency_metric}" }, build-metric: { dynamic-counter-mapping: [ "${string:format('http.status.%s.count', attributes['sc_status'])}", 1D ], destination: "status_metric" }, accumulate-output: { value: "${status_metric}" } }
  • 40. © Rocana, Inc. All Rights Reserved. | 40 Architecture
  • 41. © Rocana, Inc. All Rights Reserved. | 41 Java action objects Architecture Configuration file Java action objects Context Variables Driver 1. Parse config 2. Initialize context 5. Copy output 3. Execute actions 4. Read/write variables
  • 42. © Rocana, Inc. All Rights Reserved. | 42 Custom actions • Actions loaded at runtime using Java services framework • Add your jar to the classpath • Custom actions appear as top-level keywords just like regular actions • Implement the execute() method of the Action interface • Implement the build() method of the ActionBuilder interface
  • 43. © Rocana, Inc. All Rights Reserved. | 43 Custom actions • Parse custom log formats • Cisco ACS • Citrix • Juniper • Customer-specific formats • Lookup IP addresses in the MaxMind GeoIP2 database • Reference dataset lookups • Device id to device name
  • 44. © Rocana, Inc. All Rights Reserved. | 44 Putting it all together • Stream processing is causing us to re-think how we analyze data • Limiting accessibility of data transformation side increases costs and decreases velocity • Reduce your reliance on developers to code custom pipelines • Re-use transformation configuration in any stream processing framework or batch job
  • 45. © Rocana, Inc. All Rights Reserved. | 45 Coming soon • Rocana transform will be released under the ASL 2.0 • The configuration library is available today: • https://github.com/scalingdata/rocana-configuration
  • 46. © Rocana, Inc. All Rights Reserved. | 46 Questions?