Embeddable data transformation for real time streams

Real-time stream analysis starts with ingesting raw data and extracting structured records. While stream-processing frameworks such as Apache Spark and Apache Storm provide primitives for processing individual records, processing windows of records, and grouping/joining records, common actions such as filtering, applying regular expressions to extract data, and converting records from one schema to another are left to developers writing business logic.

Joey Echeverria presents an alternative approach based on a reusable library that provides configuration-based data transformation. This allows users to write common data-transformation rules once and reuse them in multiple contexts. A common pattern is to consume a single, raw stream and transform it using the same rules before storing it in different repositories such as Apache Solr for search and Apache Hadoop HDFS for deep storage.

Published in: Technology

  1. Embeddable data transformation for real-time streams. Joey Echeverria, Platform Technical Lead. Strata+Hadoop World, March 31st 2016, San Jose, CA. © Rocana, Inc. All Rights Reserved.
  2. Slides: http://j.mp/rocana-transform-slides
  3. Questions: http://j.mp/hw-questions
  4. Context
  5. Joey
     • Where I work: Rocana – Platform Technical Lead
     • Where I used to work: Cloudera (’11-’15), NSA
     • Distributed systems, security, data processing, big data
  6. Signing today at 1pm at the Cloudera booth
  7. History
  8. “Legacy” data architecture (diagram): Data Producers; Flume/Sqoop; HDFS (Avro/Parquet files); MapReduce, Spark, Impala; Visualization/Query
  9. Stream data architecture (diagram): Data Producers; Kafka (Avro serialized records); Kafka consumers, Storm, Flink, Spark Streaming; HDFS (Avro/Parquet files); Real-time Visualization
  11. Stream processing: A primer
  12. Stream processing
      • Filter
      • Extract
      • Project
      • Aggregate
      • Join
      • Model
  14. Stream processing
      • Filter
      • Extract
      • Project
      • Aggregate
      • Join
      • Model
      • Data transformation
  15. Apache Storm
      • "Distributed real-time computation system"
      • Applications packaged into topologies (think MapReduce job)
      • Topologies operate over streams of tuples
      • Spout: source of a stream
      • Bolt: arbitrary operation such as filtering, aggregating, joining, or executing arbitrary functions
  16. Apache Spark
      • Supports batch and stream processing
      • Continuous stream of records discretized into a DStream
      • DStream: a sequence of RDDs (batches of records)
      • Micro-batch
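The micro-batch idea on the Spark slide can be illustrated with a small sketch. This is plain Python, not Spark's DStream API: the function name `micro_batches`, the `interval_ms` parameter, and the use of each record's own `ts` field to pick a batch are all assumptions made for illustration (Spark Streaming forms batches by a configured batch interval).

```python
from itertools import groupby


def micro_batches(records, interval_ms):
    """Yield (batch_start_ms, [records]) for consecutive time intervals.

    Assumes each record is a dict with a 'ts' field (epoch milliseconds)
    and that records arrive in timestamp order.
    """
    def bucket(record):
        # Round the timestamp down to the start of its interval.
        return record["ts"] - record["ts"] % interval_ms

    for start, group in groupby(records, key=bucket):
        yield start, list(group)


# Hypothetical three-record stream, discretized into one-second batches.
stream = [
    {"ts": 1000, "value": "a"},
    {"ts": 1400, "value": "b"},
    {"ts": 2100, "value": "c"},
]
batches = list(micro_batches(stream, interval_ms=1000))
```

Each yielded list plays the role of one batch in the sequence; record-at-a-time systems such as Storm would instead hand every record to the operator as it arrives.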
  17. Apache Flink
      • Supports batch and stream processing
      • DataStream: unbounded collection of records
      • Operations can apply to individual records or windows of records
      • Supports record-at-a-time processing (like Storm)
  18. Apache Kafka
      • Pub-sub messaging system implemented as a distributed commit log
      • Popular as a source and sink for data streams
      • Scalability, durability, and easy-to-understand delivery guarantees
      • Can do stream processing directly in Kafka consumers
  19. Data transformation
  20. Filter (diagram)
  21. Extract
      Raw log line:
        127.0.0.1 Mozilla/5.0 laura [31/Mar/2016] "GET /index.html HTTP/1.0" 200 2326
      Input event:
        ts: 1436576671000
        body: <binary blob>
        event_type_id: 100
        ...
      Output event (after extract):
        ts: 1436576671000
        body: <binary blob>
        event_type_id: 100
        attributes: {
          ip: "127.0.0.1"
          user_agent: "Mozilla/5.0"
          user_id: "laura"
          date: "[31/Mar/2016]"
          request: "GET /index.html HTTP/1.0"
          status_code: "200"
          size: "2326"
        }
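As a rough illustration of the extract step, the sketch below parses the slide's log line with a regular expression and attaches the captured groups as string attributes. It is plain Python rather than Rocana Transform; the pattern `LOG_PATTERN` and the function `extract` are hypothetical and only cover the exact layout shown on the slide.

```python
import re

# Hypothetical pattern for the slide's log line; group names follow the
# attribute names shown on the slide.
LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+) (?P<user_agent>\S+) (?P<user_id>\S+) '
    r'(?P<date>\[[^\]]+\]) "(?P<request>[^"]*)" '
    r'(?P<status_code>\d{3}) (?P<size>\d+)$'
)


def extract(event):
    """Return a copy of the event with 'attributes' parsed from its body.

    All extracted values stay strings; a body that does not match the
    pattern yields an empty attributes dict.
    """
    line = event["body"].decode("utf-8")
    match = LOG_PATTERN.match(line)
    attributes = match.groupdict() if match else {}
    return {**event, "attributes": attributes}


event = {
    "ts": 1436576671000,
    "event_type_id": 100,
    "body": b'127.0.0.1 Mozilla/5.0 laura [31/Mar/2016] '
            b'"GET /index.html HTTP/1.0" 200 2326',
}
extracted = extract(event)
```

Keeping the original `body` alongside the new attributes mirrors the slide, where extraction adds fields without discarding the raw content.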
  22. Project
      Input event:
        ts: 1436576671000
        body: <binary blob>
        event_type_id: 100
        attributes: {
          ip: "127.0.0.1"
          user_agent: "Mozilla/5.0"
          user_id: "laura"
          date: "[31/Mar/2016]"
          request: "GET /index.html HTTP/1.0"
          status_code: "200"
          size: "2326"
        }
      Output record (after project):
        ts: 1459444413000
        ip: "127.0.0.1"
        user_agent: "Mozilla/5.0"
        user_id: "laura"
        request: "GET /index.html HTTP/1.0"
        status_code: 200
        size: 2326
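The project step can be sketched in the same spirit: select a subset of fields, lift them to the top level, and convert numeric strings to integers. The function `project` below is a hypothetical Python stand-in, and it simply copies `ts` through unchanged, which is an assumption (the slide's output record carries a different timestamp).

```python
def project(event):
    """Build a flat record from selected event fields and attributes.

    status_code and size are converted from strings to integers; body,
    event_type_id, and date are dropped, as on the slide.
    """
    attrs = event["attributes"]
    return {
        "ts": event["ts"],  # assumption: timestamp copied through as-is
        "ip": attrs["ip"],
        "user_agent": attrs["user_agent"],
        "user_id": attrs["user_id"],
        "request": attrs["request"],
        "status_code": int(attrs["status_code"]),
        "size": int(attrs["size"]),
    }


event = {
    "ts": 1436576671000,
    "body": b"<binary blob>",
    "event_type_id": 100,
    "attributes": {
        "ip": "127.0.0.1",
        "user_agent": "Mozilla/5.0",
        "user_id": "laura",
        "date": "[31/Mar/2016]",
        "request": "GET /index.html HTTP/1.0",
        "status_code": "200",
        "size": "2326",
    },
}
record = project(event)
```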
  23. Problem
  24. Who
      • Developers
      • Data engineers
      • Sysadmins
      • Analysts
  25. Tools
  26. The dark art of data science
      • Feature engineering
      • “Getting a mess of raw data that can be used as input to a machine learning algorithm” - @josh_wills
      • Video from Midwest.io 2014
  27. Data transformation for all
  28. Rocana Transform
      • Library
      • Java
      • Rocana configuration
      • JSON + comments + specific numeric types - excess quoting
  29. Data model
      • Event schema
        • id: A globally unique identifier for this event
        • ts: Epoch timestamp in milliseconds
        • event_type_id: ID indicating the type of the event
        • location: Location from which the event was generated
        • host: Hostname, IP, or other device identifier from which the event was generated
        • service: Service or process from which the event was generated
        • body: Raw event content in bytes
        • attributes: Event type-specific key/value pairs
  30. Example event
      {
        "id": "JRHAIDMLCKLEAPMIQDHFLO3MXYXV7NVBEJNDKZGS2XVSEINGGBHA====",
        "event_type_id": 100,
        "ts": 1436576671000,
        "location": "aws/us-west-2a",
        "host": "example01.rocana.com",
        "service": "dhclient",
        "body": "<36>Jul 10 18:04:31 gs09.example.com dhclient[865] DHCPACK from …",
        "attributes": {
          "syslog_timestamp": "1436576671000",
          "syslog_process": "dhclient",
          "syslog_pid": "865",
          "syslog_facility": "3",
          "syslog_severity": "6",
          "syslog_hostname": "example01",
          "syslog_message": "DHCPACK from 10.10.1.1 (xid=0x5c64bdb0)"
        }
      }
  31. Filter, extract, and flatten
  32. Filter, extract, and flatten
      • Filter out events without type id 100
      • Filter out events without hostname prefix "ex"
      • Extract a numeric prefix from the syslog message
      • Flatten syslog attributes to top-level fields in a different Avro schema
  33. Filter, extract, and flatten
      {
        load-event: {},

        // Filter by event_type_id
        filter: { expression: "${event_type_id == 100}" },

        // Extract hostname prefix
        regex: { ... },
        filter: { expression: "${host_prefix.match.group.1 == 'ex'}" },

        // Extract a numeric prefix from the syslog message
        regex: { ... },

        // Build flattened record
        build-avro-record: { ... },

        // Accumulate output record
        accumulate-output: { value: "${output_record}" }
      }
  34. Extract hostname prefix
      {
        load-event: {},
        filter: { expression: "${event_type_id == 100}" },
        regex: {
          pattern: "^(.{2}).*$",
          value: "${attr.syslog_hostname}",
          destination: "host_prefix"
        },
        filter: { expression: "${host_prefix.match.group.1 == 'ex'}" },
        ...
      }
  35. Extract numeric prefix
      ...
      filter: { expression: "${host_prefix.match.group.1 == 'ex'}" },
      regex: {
        pattern: "^([0-9]*)",
        value: "${attributes['syslog_message']}",
        destination: "msg",
        match-actions: {
          set-values: { extracted_field: "${msg.match.group.1}" }
        },
        no-match-actions: {
          set-values: { extracted_field: "" }
        }
      },
      ...
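To make the control flow of the filter and regex actions concrete, here is a minimal Python sketch of the same sequence. It does not use Rocana Transform; the function `transform` and its return convention (`None` for a filtered-out event, otherwise the value of `extracted_field`) are assumptions made for illustration. The two regular expressions are copied from the slides.

```python
import re


def transform(event):
    """Return extracted_field, or None if a filter drops the event."""
    # Filter by event_type_id.
    if event["event_type_id"] != 100:
        return None

    attrs = event["attributes"]

    # Extract the two-character hostname prefix, then filter on it.
    host_prefix = re.match(r"^(.{2}).*$", attrs["syslog_hostname"])
    if host_prefix is None or host_prefix.group(1) != "ex":
        return None

    # Extract a numeric prefix from the syslog message.
    msg = re.match(r"^([0-9]*)", attrs["syslog_message"])
    if msg:
        return msg.group(1)  # match-actions branch
    return ""                # no-match-actions branch


def make_event(event_type_id, hostname, message):
    """Hypothetical helper that builds a minimal event for the examples."""
    return {
        "event_type_id": event_type_id,
        "attributes": {"syslog_hostname": hostname, "syslog_message": message},
    }


kept = transform(make_event(100, "example01", "42 leases renewed"))
no_digits = transform(make_event(100, "example01", "DHCPACK from 10.10.1.1"))
wrong_type = transform(make_event(107, "example01", "42 leases renewed"))
wrong_host = transform(make_event(100, "web01", "42 leases renewed"))
```

Because the slide's pattern uses `*`, it also matches an empty prefix, so a message without leading digits yields an empty string through the match branch.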
  36. Build flattened record
      ...
      build-avro-record: {
        schema-uri: "resource:avro-schemas/flattened-syslog.avsc",
        destination: "output_record",
        field-mapping: {
          ts: "${ts}",
          event_type_id: "${event_type_id}",
          source: "${source}",
          syslog_facility: "${convert:toInt(attributes['syslog_facility'])}",
          syslog_severity: "${convert:toInt(attributes['syslog_severity'])}",
          ...
          syslog_message: "${attributes['syslog_message']}",
          syslog_pid: "${convert:toInt(attributes['syslog_pid'])}",
          extracted_field: "${extracted_field}"
        },
      },
      ...
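The field mapping above amounts to copying some fields and converting others to integers. A minimal Python sketch of that flattening follows; a plain dict stands in for the Avro record (no schema is loaded or validated), the function name `flatten` is hypothetical, and the `source` field and the fields elided on the slide are left out because their origin is not shown.

```python
def flatten(event, extracted_field):
    """Lift syslog attributes to top-level fields, converting numeric ones."""
    attrs = event["attributes"]
    return {
        "ts": event["ts"],
        "event_type_id": event["event_type_id"],
        "syslog_facility": int(attrs["syslog_facility"]),
        "syslog_severity": int(attrs["syslog_severity"]),
        "syslog_message": attrs["syslog_message"],
        "syslog_pid": int(attrs["syslog_pid"]),
        "extracted_field": extracted_field,
    }


# Values taken from the example event slide.
event = {
    "ts": 1436576671000,
    "event_type_id": 100,
    "attributes": {
        "syslog_pid": "865",
        "syslog_facility": "3",
        "syslog_severity": "6",
        "syslog_message": "DHCPACK from 10.10.1.1 (xid=0x5c64bdb0)",
    },
}
output_record = flatten(event, extracted_field="")
```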
  37. Extract metrics from log data
  38. Extract metrics
      • Input: HTTP status logs
      • Extract request latency
      • Extract counts by HTTP status code
      • Metric types
        • Gauge: A value that varies over time (think latency, CPU %, etc.)
        • Counter: A value that accumulates over time (think event volume, status codes, etc.)
  39. Example metric event
      {
        "id": "JRHAIDMLCKLEAPMIQDHFLO3MXBBQ7NVBEJNDKZGS2XVSEINGGBHA====",
        "event_type_id": 107,
        "ts": 1436576671000,
        "location": "aws/us-west-2a",
        "host": "web01.rocana.com",
        "service": "httpd",
        "attributes": {
          "m.http.request.latency": "4.2000000000E1|g",
          "m.http.status.401.count": "1.0000000000E0|c"
        }
      }
  40. Extract metrics
      {
        load-event: {},
        build-metric: {
          gauge-mapping: {
            http.request.latency: "${convert:toDouble(attributes['latency'])}"
          },
          destination: "latency_metric"
        },
        accumulate-output: { value: "${latency_metric}" },
        build-metric: {
          dynamic-counter-mapping: [
            "${string:format('http.status.%s.count', attributes['sc_status'])}", 1D
          ],
          destination: "status_metric"
        },
        accumulate-output: { value: "${status_metric}" }
      }
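The metric configuration can be read as: turn the `latency` attribute into a gauge, and emit a counter of 1 whose name embeds the status code. The Python sketch below imitates that and reproduces the string encoding seen on the example metric event slide. Both function names are hypothetical, the `m.` prefix and `|g` / `|c` suffixes are inferred from that slide only, and returning one combined dict (rather than two separate metric events) is a simplification.

```python
def encode_metric(value, kind):
    """Format a value like the slide's '4.2000000000E1|g'.

    kind is 'g' for a gauge or 'c' for a counter.
    """
    # Python renders 42.0 as '4.2000000000E+01'; normalize the exponent
    # to the shorter form used on the slide.
    mantissa, exponent = f"{value:.10E}".split("E")
    return f"{mantissa}E{int(exponent)}|{kind}"


def extract_metrics(attributes):
    """Build metric attributes from a parsed HTTP log record."""
    latency = float(attributes["latency"])
    counter_name = "http.status.%s.count" % attributes["sc_status"]
    return {
        "m.http.request.latency": encode_metric(latency, "g"),
        "m." + counter_name: encode_metric(1.0, "c"),
    }


metrics = extract_metrics({"latency": "42", "sc_status": "401"})
```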
  41. Architecture
  42. Architecture (diagram): Configuration file; Driver; Java action objects; Context; Variables
      1. Parse config
      2. Initialize context
      3. Execute actions
      4. Read/write variables
      5. Copy output
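The five steps in the architecture diagram suggest a simple driver loop. The sketch below shows one way such a loop could look in Python; it is not the library's Java implementation. The action names `set-value` and `accumulate-output`, their parameters, and the context layout are all hypothetical, and the configuration is assumed to be already parsed into an ordered list of (action name, parameters) pairs.

```python
def set_value(context, params):
    """Hypothetical action: write a variable into the context."""
    context["variables"][params["name"]] = params["value"]


def accumulate_output(context, params):
    """Hypothetical action: append a variable's value to the output."""
    context["output"].append(context["variables"][params["variable"]])


ACTIONS = {"set-value": set_value, "accumulate-output": accumulate_output}


def run(config, event):
    # 1. Parse config: resolve each action name to a callable.
    steps = [(ACTIONS[name], params) for name, params in config]
    # 2. Initialize context.
    context = {"variables": {"event": event}, "output": []}
    # 3. Execute actions in order; 4. each reads/writes context variables.
    for action, params in steps:
        action(context, params)
    # 5. Copy output.
    return list(context["output"])


config = [
    ("set-value", {"name": "greeting", "value": "hello"}),
    ("accumulate-output", {"variable": "greeting"}),
]
result = run(config, event={"event_type_id": 100})
```

A list of pairs is used instead of a dict because the same action keyword (for example `filter`) can appear more than once in a configuration.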
  43. Custom actions
      • Actions loaded at runtime using Java services framework
      • Add your jar to the classpath
      • Custom actions appear as top-level keywords just like regular actions
      • Implement the execute() method of the Action interface
      • Implement the build() method of the ActionBuilder interface
  44. Custom actions
      • Parse custom log formats
        • Cisco ACS
        • Citrix
        • Juniper
        • Customer-specific formats
      • Look up IP addresses in the MaxMind GeoIP2 database
      • Reference dataset lookups
        • Device id to device name
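The builder/action split described above can be sketched as follows. This is a Python analogy, not the library's Java interfaces: only the names `Action`, `ActionBuilder`, `execute()`, and `build()` come from the slide, while their signatures, the `REGISTRY` dict (standing in for runtime discovery through the Java services framework), the `register` decorator, and the `device-name-lookup` keyword are assumptions.

```python
from abc import ABC, abstractmethod


class Action(ABC):
    @abstractmethod
    def execute(self, context):
        """Read and write variables in the context."""


class ActionBuilder(ABC):
    keyword = None  # top-level configuration keyword for the action

    @abstractmethod
    def build(self, params):
        """Create an Action from its configuration parameters."""


REGISTRY = {}


def register(builder_cls):
    """Stand-in for service discovery: map keyword to a builder instance."""
    REGISTRY[builder_cls.keyword] = builder_cls()
    return builder_cls


class DeviceNameLookup(Action):
    """Reference dataset lookup: device id to device name."""

    def __init__(self, table):
        self.table = table

    def execute(self, context):
        device_id = context.get("device_id")
        context["device_name"] = self.table.get(device_id, "unknown")


@register
class DeviceNameLookupBuilder(ActionBuilder):
    keyword = "device-name-lookup"

    def build(self, params):
        return DeviceNameLookup(params["table"])


action = REGISTRY["device-name-lookup"].build({"table": {"d1": "router-1"}})
context = {"device_id": "d1"}
action.execute(context)
```

Separating the builder from the action lets configuration parsing and validation happen once, while `execute()` stays cheap enough to call per record.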
  45. Putting it all together
      • Stream processing is causing us to re-think how we analyze data
      • Limiting access to the data transformation side increases costs and decreases velocity
      • Reduce your reliance on developers to code custom pipelines
      • Re-use transformation configuration in any stream processing framework or batch job
  46. Coming soon
      • Rocana Transform will be released under the ASL 2.0
      • The base configuration library is available today: https://github.com/scalingdata/rocana-configuration
  47. Questions?
      • Signing "Hadoop Security" today at 1pm at the Cloudera booth
