
Streaming ETL for All


Real-time analysis starts with transforming raw data into structured records. Typically this is done with bespoke business logic custom written for each use case. Joey Echeverria presents a configuration-based, reusable library for streaming ETL that can be embedded in real-time stream-processing systems and demonstrates its real-world use cases with Apache Kafka and Apache Hadoop.


Streaming ETL for All

  1. 1. © Rocana, Inc. All Rights Reserved. | 1 Joey Echeverria, Platform Technical Lead San Francisco Hadoop Users Group, June 14th 2016 San Francisco, CA Streaming ETL for All
  2. 2. © Rocana, Inc. All Rights Reserved. | 2 Slides http://bit.ly/streaming-etl-slides
  3. 3. © Rocana, Inc. All Rights Reserved. | 3 Context
  4. 4. © Rocana, Inc. All Rights Reserved. | 4 Joey • Where I work: Rocana – Platform Technical Lead • Where I used to work: Cloudera (’11-’15), NSA • Distributed systems, security, data processing, big data
  5. 5. © Rocana, Inc. All Rights Reserved. | 5
  6. 6. © Rocana, Inc. All Rights Reserved. | 6 History
  7. 7. © Rocana, Inc. All Rights Reserved. | 7 "Legacy" data architecture: Data Producers → Flume/Sqoop → HDFS (Avro/Parquet files) → MapReduce/Spark/Impala → Visualization/Query
  8. 8. © Rocana, Inc. All Rights Reserved. | 8 Stream data architecture: Data Producers → Kafka (Avro-serialized records) → Spark Streaming / Flink / Storm → Real-time Visualization; Kafka Consumers → HDFS (Avro/Parquet files)
  9. 9. © Rocana, Inc. All Rights Reserved. | 9 Stream data architecture: Data Producers → Kafka (Avro-serialized records) → Spark Streaming / Flink / Storm → Real-time Visualization; Kafka Consumers → HDFS (Avro/Parquet files)
  10. 10. © Rocana, Inc. All Rights Reserved. | 10 Stream processing A primer
  11. 11. © Rocana, Inc. All Rights Reserved. | 11 Stream processing • Filter • Extract • Project • Aggregate • Join • Model
  12. 12. © Rocana, Inc. All Rights Reserved. | 12 Stream processing • Filter • Extract • Project • Aggregate • Join • Model
  13. 13. © Rocana, Inc. All Rights Reserved. | 13 Stream processing • Filter • Extract • Project • Aggregate • Join • Model • Data transformation
  14. 14. © Rocana, Inc. All Rights Reserved. | 14 Apache Storm • "Distributed real-time computation system" • Applications packaged into topologies (think MapReduce job) • Topologies operate over streams of tuples • Spout: source of a stream • Bolt: arbitrary operation such as filtering, aggregating, joining, or executing arbitrary functions
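
To make the spout/bolt vocabulary above concrete, here is a minimal, self-contained sketch of a Storm topology in Java. The LogSpout, FilterBolt, the canned log line, and the topology name are illustrative stand-ins, not code from the talk.

import java.util.Map;
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;
import org.apache.storm.utils.Utils;

public class StreamingEtlTopology {

  // Spout: the source of the tuple stream; here it just emits a canned log line.
  public static class LogSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;

    @SuppressWarnings("rawtypes") // raw Map keeps this compatible across Storm versions
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
      this.collector = collector;
    }

    public void nextTuple() {
      Utils.sleep(1000);
      collector.emit(new Values("127.0.0.1 laura \"GET /index.html HTTP/1.0\" 200 2326"));
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
      declarer.declare(new Fields("line"));
    }
  }

  // Bolt: an arbitrary operation over tuples -- here, a simple filter on the status code.
  public static class FilterBolt extends BaseBasicBolt {
    public void execute(Tuple tuple, BasicOutputCollector collector) {
      String line = tuple.getStringByField("line");
      if (line.contains(" 200 ")) {
        collector.emit(new Values(line));
      }
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
      declarer.declare(new Fields("line"));
    }
  }

  public static void main(String[] args) throws Exception {
    TopologyBuilder builder = new TopologyBuilder();
    builder.setSpout("logs", new LogSpout());
    builder.setBolt("filter", new FilterBolt()).shuffleGrouping("logs");
    new LocalCluster().submitTopology("streaming-etl", new Config(), builder.createTopology());
  }
}

shuffleGrouping distributes the spout's tuples evenly across the bolt's tasks, which is the simplest way to parallelize a stateless transformation like this filter.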
  15. 15. © Rocana, Inc. All Rights Reserved. | 15 Apache Spark • Supports batch and stream processing • Continuous stream of records discretized into a DStream • DStream: a sequence of RDDs (batches of records) • Micro-batch
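
A minimal sketch of the micro-batch model described above, assuming a local socket source on port 9999 as a placeholder input: each five-second batch of lines becomes one RDD inside the DStream, and the filter is applied to every batch.

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class DStreamExample {
  public static void main(String[] args) throws InterruptedException {
    SparkConf conf = new SparkConf().setAppName("streaming-etl").setMaster("local[2]");
    // Each 5-second batch of input becomes one RDD inside the DStream.
    JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(5));

    JavaDStream<String> lines = ssc.socketTextStream("localhost", 9999);
    JavaDStream<String> errors = lines.filter(line -> line.contains(" 500 "));
    errors.print();

    ssc.start();
    ssc.awaitTermination();
  }
}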
  16. 16. © Rocana, Inc. All Rights Reserved. | 16 Apache Flink • Supports batch and stream processing • DataStream: unbounded collection of records • Operations can apply to individual records or windows of records • Supports record-at-a-time processing (like Storm)
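
For comparison, a minimal Flink DataStream sketch; the socket source and the " 500 " filter condition are placeholders. Unlike the micro-batch model, the filter here is evaluated record-at-a-time.

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class DataStreamExample {
  public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    // An unbounded stream of records from a socket source (placeholder host/port).
    DataStream<String> lines = env.socketTextStream("localhost", 9999);

    // Record-at-a-time operation: the predicate runs on each element as it arrives.
    DataStream<String> errors = lines.filter(line -> line.contains(" 500 "));

    errors.print();
    env.execute("streaming-etl");
  }
}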
  17. 17. © Rocana, Inc. All Rights Reserved. | 17 Apache Kafka • Pub-sub messaging system implemented as a distributed commit log • Popular as a source and sink for data streams • Scalability, durability, and easy-to-understand delivery guarantees • Can do stream processing directly in Kafka consumers • Kafka Streams
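
Stream processing "directly in Kafka consumers", as the slide puts it, can be as simple as a transformation inside the poll loop. The broker address, topic name, and filter condition below are placeholders.

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ConsumerEtl {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
    props.put("group.id", "streaming-etl");
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

    try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
      consumer.subscribe(Collections.singletonList("events")); // placeholder topic
      while (true) {
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
        for (ConsumerRecord<String, String> record : records) {
          // The "stream processing" here is just a filter applied in the poll loop.
          if (record.value().contains(" 500 ")) {
            System.out.println(record.value());
          }
        }
      }
    }
  }
}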
  18. 18. © Rocana, Inc. All Rights Reserved. | 18 Data transformation
  19. 19. © Rocana, Inc. All Rights Reserved. | 19 Filter filter
  20. 20. © Rocana, Inc. All Rights Reserved. | 20 Extract 127.0.0.1 Mozilla/5.0 laura [31/Mar/2016] "GET /index.html HTTP/1.0" 200 2326 ts: 1436576671000 body: <binary blob> event_type_id: 100 ... extract ts: 1436576671000 body: <binary blob> event_type_id: 100 attributes: { ip: "127.0.0.1" user_agent: "Mozilla/5.0" user_id: "laura" date: "[31/Mar/2016]" request: "GET /index.html HTTP/1.0" status_code: "200" size: "2326" }
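
In plain Java, the extract step shown on this slide amounts to a regular expression that pulls named attributes out of the raw body. The pattern and class below are an illustrative sketch keyed to the slide's example line, not Rocana Transform code.

import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ExtractExample {
  // Pattern mirroring the slide's line: ip, user agent, user id, date, request, status, size.
  private static final Pattern LOG = Pattern.compile(
      "^(\\S+) (\\S+) (\\S+) \\[([^\\]]+)\\] \"([^\"]+)\" (\\d{3}) (\\d+)$");

  public static Map<String, String> extract(String line) {
    Map<String, String> attributes = new LinkedHashMap<>();
    Matcher m = LOG.matcher(line);
    if (m.matches()) {
      attributes.put("ip", m.group(1));
      attributes.put("user_agent", m.group(2));
      attributes.put("user_id", m.group(3));
      attributes.put("date", m.group(4));
      attributes.put("request", m.group(5));
      attributes.put("status_code", m.group(6));
      attributes.put("size", m.group(7));
    }
    return attributes;
  }

  public static void main(String[] args) {
    System.out.println(extract(
        "127.0.0.1 Mozilla/5.0 laura [31/Mar/2016] \"GET /index.html HTTP/1.0\" 200 2326"));
  }
}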
  21. 21. © Rocana, Inc. All Rights Reserved. | 21 Project ts: 1436576671000 body: <binary blob> event_type_id: 100 attributes: { ip: "127.0.0.1" user_agent: "Mozilla/5.0" user_id: "laura" date: "[31/Mar/2016]" request: "GET /index.html HTTP/1.0" status_code: "200" size: "2326" } ts: 1459444413000 ip: "127.0.0.1" user_agent: "Mozilla/5.0" user_id: "laura" request: "GET /index.html HTTP/1.0" status_code: 200 size: 2326 project
  22. 22. © Rocana, Inc. All Rights Reserved. | 22 Problem
  23. 23. © Rocana, Inc. All Rights Reserved. | 23 Who • Developers • Data engineers • Sysadmins • Analysts
  24. 24. © Rocana, Inc. All Rights Reserved. | 24 Tools
  25. 25. © Rocana, Inc. All Rights Reserved. | 25 The dark art of data science • Feature engineering • “Getting a mess of raw data that can be used as input to a machine learning algorithm” - @josh_wills • Video from Midwest.io 2014
  26. 26. © Rocana, Inc. All Rights Reserved. | 26 Data transformation for all
  27. 27. © Rocana, Inc. All Rights Reserved. | 27 Rocana Transform • Library • Java • Rocana configuration • JSON + comments + specific numeric types - excess quoting
  28. 28. © Rocana, Inc. All Rights Reserved. | 28 Data model • Event schema • id: A globally unique identifier for this event • ts: Epoch timestamp in milliseconds • event_type_id: ID indicating the type of the event • location: Location from which the event was generated • host: Hostname, IP, or other device identifier from which the event was generated • service: Service or process from which the event was generated • body: Raw event content in bytes • attributes: Event type-specific key/value pairs
  29. 29. © Rocana, Inc. All Rights Reserved. | 29 Example event { "id": "JRHAIDMLCKLEAPMIQDHFLO3MXYXV7NVBEJNDKZGS2XVSEINGGBHA====", "event_type_id": 100, "ts": 1436576671000, "location": "aws/us-west-2a", "host": "example01.rocana.com", "service": "dhclient", "body": "<36>Jul 10 18:04:31 gs09.example.com dhclient[865] DHCPACK from …", "attributes": { "syslog_timestamp": "1436576671000", "syslog_process": "dhclient", "syslog_pid": "865", "syslog_facility": "3", "syslog_severity": "6", "syslog_hostname": "example01", "syslog_message": "DHCPACK from 10.10.1.1 (xid=0x5c64bdb0)" } }
  30. 30. © Rocana, Inc. All Rights Reserved. | 30 Filter, extract, and flatten
  31. 31. © Rocana, Inc. All Rights Reserved. | 31 Filter, extract, and flatten • Filter out events without type id 100 • Filter out events without hostname prefix "ex" • Extract a numeric prefix from the syslog message • Flatten syslog attributes to top-level fields in a different Avro schema
  32. 32. © Rocana, Inc. All Rights Reserved. | 32 Filter, extract, and flatten { load-event: {}, // Filter by event_type_id filter: { expression: "${event_type_id == 100}" }, // Extract hostname prefix regex: { ... }, filter: { expression: "${host_prefix.match.group.1 == 'ex'}", // Extract a numeric prefix from the syslog message regex: { ... }, // Build flattened record build-avro-record: { ... }, // Accumulate output record accumulate-output: { value: "${output_record}" } }
  33. 33. © Rocana, Inc. All Rights Reserved. | 33 Extract hostname prefix { load-event: {}, filter: { expression: "${event_type_id == 100}" }, regex: { pattern: "^(.{2}).*$", value: "${attr.syslog_hostname}", destination: "host_prefix" }, filter: { expression: "${host_prefix.match.group.1 == 'ex'}", ... }
  34. 34. © Rocana, Inc. All Rights Reserved. | 34 Extract numeric prefix ... filter: { expression: "${host_prefix.match.group.1 == 'ex'}", regex: { pattern: "^([0-9]*)", value: "${attributes['syslog_message']}", destination: "msg", match-actions: { set-values: { extracted_field: "${msg.match.group.1}" } }, no-match-actions: { set-values: { extracted_field: "" } } }, ...
  35. 35. © Rocana, Inc. All Rights Reserved. | 35 Build flattened record ... build-avro-record: { schema-uri: "resource:avro-schemas/flattened-syslog.avsc", destination: "output_record", field-mapping: { ts: "${ts}", event_type_id: "${event_type_id}", source: "${source}", syslog_facility: "${convert:toInt(attributes['syslog_facility'])}", syslog_severity: "${convert:toInt(attributes['syslog_severity'])}", ... syslog_message: "${attributes['syslog_message']}", syslog_pid: "${convert:toInt(attributes['syslog_pid'])}", extracted_field: "${extracted_field}" }, }, ...
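
As a rough Java analogue of what the build-avro-record action produces, the sketch below assembles an Avro GenericRecord from the event's attributes. The flattened-syslog.avsc schema is not shown in the deck, so the record shape and field types here are assumptions reconstructed from the field-mapping above.

import java.util.Map;
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

public class FlattenedRecordSketch {

  // Assumed shape of flattened-syslog.avsc, reconstructed from the field-mapping above.
  private static final Schema FLATTENED = SchemaBuilder.record("FlattenedSyslog")
      .fields()
      .requiredLong("ts")
      .requiredInt("event_type_id")
      .requiredInt("syslog_facility")
      .requiredInt("syslog_severity")
      .requiredString("syslog_message")
      .requiredInt("syslog_pid")
      .requiredString("extracted_field")
      .endRecord();

  public static GenericRecord flatten(long ts, int eventTypeId,
      Map<String, String> attributes, String extractedField) {
    GenericRecord record = new GenericData.Record(FLATTENED);
    record.put("ts", ts);
    record.put("event_type_id", eventTypeId);
    // The convert:toInt expressions in the config become explicit parses here.
    record.put("syslog_facility", Integer.parseInt(attributes.get("syslog_facility")));
    record.put("syslog_severity", Integer.parseInt(attributes.get("syslog_severity")));
    record.put("syslog_message", attributes.get("syslog_message"));
    record.put("syslog_pid", Integer.parseInt(attributes.get("syslog_pid")));
    record.put("extracted_field", extractedField);
    return record;
  }
}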
  36. 36. © Rocana, Inc. All Rights Reserved. | 36 Extract metrics from log data
  37. 37. © Rocana, Inc. All Rights Reserved. | 37 Extract metrics • Input: HTTP status logs • Extract request latency • Extract counts by HTTP status code • Metric types • Gauge: A value that varies over time (think latency, CPU %, etc.) • Counter: A value that accumulates over time (think event volume, status codes, etc.)
  38. 38. © Rocana, Inc. All Rights Reserved. | 38 Example metric event { "id": "JRHAIDMLCKLEAPMIQDHFLO3MXBBQ7NVBEJNDKZGS2XVSEINGGBHA====", "event_type_id": 107, "ts": 1436576671000, "location": "aws/us-west-2a", "host": "web01.rocana.com", "service": "httpd", "attributes": { "m.http.request.latency": "4.2000000000E1|g", "m.http.status.401.count": "1.0000000000E0|c" } }
  39. 39. © Rocana, Inc. All Rights Reserved. | 39 Extract metrics { load-event: {}, build-metric: { gauge-mapping: { http.request.latency: "${convert:toDouble(attributes['latency'])}" }, destination: "latency_metric" }, accumulate-output: { value: "${latency_metric}" }, build-metric: { dynamic-counter-mapping: [ "${string:format('http.status.%s.count', attributes['sc_status'])}", 1D ], destination: "status_metric" }, accumulate-output: { value: "${status_metric}" } }
  40. 40. © Rocana, Inc. All Rights Reserved. | 40 Architecture
  41. 41. © Rocana, Inc. All Rights Reserved. | 41 Architecture: the Driver (1) parses the configuration file, (2) initializes the context, (3) executes the Java action objects, which (4) read and write context variables, and (5) copies the output
  42. 42. © Rocana, Inc. All Rights Reserved. | 42 Custom actions • Actions loaded at runtime using Java services framework • Add your jar to the classpath • Custom actions appear as top-level keywords just like regular actions • Implement the execute() method of the Action interface • Implement the build() method of the ActionBuilder interface
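
The deck does not show the Action and ActionBuilder interfaces themselves, so the following is only a hypothetical sketch of the shape of a custom action; the commented-out implements clauses and the Map-based context stand in for whatever the real Rocana Transform signatures are. Registering the builder through the Java services framework (typically a META-INF/services entry in your jar on the classpath) is what makes the new keyword available to the configuration.

import java.util.Map;

// Hypothetical custom action; real signatures come from Rocana Transform's
// Action and ActionBuilder interfaces, which are not shown in the deck.
public class LowercaseHostAction /* implements Action */ {

  // execute() is where a custom action reads and writes context variables.
  public void execute(Map<String, Object> context) {
    Object host = context.get("host");
    if (host != null) {
      context.put("host", host.toString().toLowerCase());
    }
  }

  // The matching builder implements ActionBuilder's build(), turning the parsed
  // configuration block for this keyword into an action instance.
  public static class Builder /* implements ActionBuilder */ {
    public LowercaseHostAction build() {
      return new LowercaseHostAction();
    }
  }
}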
  43. 43. © Rocana, Inc. All Rights Reserved. | 43 Custom actions • Parse custom log formats • Cisco ACS • Citrix • Juniper • Customer-specific formats • Look up IP addresses in the MaxMind GeoIP2 database • Reference dataset lookups • Device id to device name
  44. 44. © Rocana, Inc. All Rights Reserved. | 44 Putting it all together • Stream processing is causing us to re-think how we analyze data • Limiting the accessibility of data transformation increases costs and decreases velocity • Reduce your reliance on developers to code custom pipelines • Re-use transformation configuration in any stream processing framework or batch job
  45. 45. © Rocana, Inc. All Rights Reserved. | 45 Coming soon • Rocana Transform will be released under the ASL 2.0 • The configuration library is available today: • https://github.com/scalingdata/rocana-configuration
  46. 46. © Rocana, Inc. All Rights Reserved. | 46 Questions?
