Building a system for machine and event-oriented data - SF HUG Nov 2015

Presentation by Joey Echeverria of Rocana

Published in: Technology

  1. Building a System for Machine and Event-Oriented Data
     JOEY ECHEVERRIA | @fwiffo | November 4th, 2015
     San Francisco Hadoop Users Group
     © Rocana, Inc. All Rights Reserved.
  2. Context
  3. Joey
     • Where I work: Rocana – Director of Engineering
     • Where I used to work: Cloudera ('11 – '15), NSA
     • Distributed systems, security, data processing, "big data"
  4. Free stuff!
     • Tweet @rocanainc with #SFHUG – best three tweets get a book
  5. What we do
     • Build a system for the operation of modern data centers
     • Triage and diagnostics, exploration, trends, advanced analytics of complex systems
     • Our data:
       • logs, metrics, human activity, anything that occurs in the data center
     • "Enterprise Software" (i.e. we build for others)
     • Today: how we built what we built
  6. Our typical customer use cases
     • >100K events / sec (8.6B events / day), sub-second end-to-end latency, full-fidelity retention, critical use cases
     • Quality of service - "are credit card transactions happening fast enough?"
     • Fraud detection - "detect, investigate, prosecute, and learn from fraud."
     • Forensic diagnostics - "what really caused the outage last Friday?"
     • Security - "who's doing what, where, when, why, and how, and is that ok?"
     • User behavior - "capture and correlate user behavior with system performance, then feed it to downstream systems in real time."
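The daily figure on this slide follows directly from the per-second rate. A minimal sketch of the arithmetic, assuming a constant sustained rate (the function name `events_per_day` is illustrative, not from the talk):

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds in a day


def events_per_day(events_per_second: int) -> int:
    """Daily volume for a constant per-second event rate."""
    return events_per_second * SECONDS_PER_DAY


if __name__ == "__main__":
    # 100K events/sec sustained for a full day
    print(f"{events_per_day(100_000):,} events/day")
```

At 100K events per second this comes to roughly 8.6 billion events per day, which is the number quoted on the slide.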
  7. 10,000 foot view
  8. High level architecture
  9. Guarantees
     • No single point of failure exists
     • All components scale horizontally[1]
     • Data retention and latency are a function of cost, not tech[1]
     • Every event is delivered provided no more than N - 1 failures occur (where N is the Kafka replication level)
     • All operations, including upgrade, are online[2]
     • Every event is (or appears to be) delivered exactly once[3]
     [1] We're positive there's a limit, but thus far it has been cost.
     [2] From the user's perspective, at a system level.
     [3] When queried via our UI. Lots of details here.
  10. Events
  11. Modeling our world
      • Everything is an event
      • Each event contains a timestamp, type, location, host, service, body, and type-specific attributes (k/v pairs)
      • Build specialized aggregates as necessary - just optimized views of the data
  12. Event schema
        {
          id: string,
          ts: long,
          event_type_id: int,
          location: string,
          host: string,
          service: string,
          body: [ null, bytes ],
          attributes: map<string>
        }
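To make the schema concrete, here is a minimal sketch of the same record as a Python dataclass. It is an illustration, not Rocana's code: the class name `Event` is assumed, and reading `ts` as epoch milliseconds is an inference from the later query slide, which bins by `ts / 60000`.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass(frozen=True)
class Event:
    """Illustrative mirror of the event schema shown on the slide."""
    id: str
    ts: int                 # observation time; epoch milliseconds is an assumption
    event_type_id: int      # tells downstream systems how to read body/attributes
    location: str
    host: str
    service: str
    body: Optional[bytes] = None                              # [ null, bytes ] union
    attributes: Dict[str, str] = field(default_factory=dict)  # map<string>
```

A consumer could then build an event with `Event(id="e-1", ts=1446595200000, event_type_id=100, location="us-west-2a", host="w2a-demo-1", service="sshd")`, leaving `body` null and `attributes` empty until a parser fills them in.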
  13. Event types
      • Some event types are standard
        • syslog, http, log4j, generic text record, …
      • Users define custom event types
      • Producers populate event type
      • Transformations can turn one event type into another
      • Event type metadata tells downstream systems how to interpret body and attributes
  14. Ex: generic syslog event
        event_type_id: 100, // rfc3164, rfc5424 (syslog)
        body: … // raw syslog message bytes
        attributes: { // extracted fields from body
          syslog_message: "DHCPACK from 10.10.0.1 (xid=0x45b63bdc)",
          syslog_severity: "6", // info severity
          syslog_facility: "3", // daemon facility
          syslog_process: "dhclient",
          syslog_pid: "668",
          …
        }
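The attributes above could be extracted from a raw RFC 3164 line along the following lines. This is a hedged sketch rather than the parser used in the talk: the regex, the function name `parse_rfc3164`, and the sample line are all assumptions. The one fixed rule it relies on is from RFC 3164: the `<PRI>` value equals facility * 8 + severity.

```python
import re

# Simplified RFC 3164 pattern: <PRI>Mmm dd hh:mm:ss host process[pid]: message
_SYSLOG = re.compile(
    r"^<(?P<pri>\d{1,3})>"
    r"(?P<timestamp>[A-Z][a-z]{2}\s+\d{1,2}\s\d{2}:\d{2}:\d{2})\s"
    r"(?P<host>\S+)\s"
    r"(?P<process>[^\[:\s]+)(?:\[(?P<pid>\d+)\])?:\s"
    r"(?P<message>.*)$"
)


def parse_rfc3164(line: str) -> dict:
    """Return slide-style string attributes for one syslog line."""
    m = _SYSLOG.match(line)
    if m is None:
        raise ValueError("not an RFC 3164 style line")
    pri = int(m.group("pri"))
    attrs = {
        "syslog_message": m.group("message"),
        "syslog_severity": str(pri % 8),   # low three bits of PRI
        "syslog_facility": str(pri // 8),  # remaining bits of PRI
        "syslog_process": m.group("process"),
    }
    if m.group("pid") is not None:
        attrs["syslog_pid"] = m.group("pid")
    return attrs
```

A PRI of 30 decodes to facility 3 (daemon) and severity 6 (info), the values shown on the slide. Values stay strings because the schema's attributes are a string map.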
  15. Ex: generic http event
        event_type_id: 102, // generic http event
        body: … // raw http log message bytes
        attributes: {
          http_req_method: "GET",
          http_req_vhost: "w2a-demo-02",
          http_req_path: "/api/v1/search?q=service%3Asshd&p=1&s=200",
          http_req_query: "q=service%3Asshd&p=1&s=200",
          http_resp_code: "200",
          …
        }
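As an illustration of how `http_req_query` relates to `http_req_path` in this example, the sketch below splits a request target with Python's standard `urllib.parse`. The helper name `http_request_attributes` is hypothetical. Following the slide, `http_req_path` keeps the full request target, query string included.

```python
from urllib.parse import parse_qs, urlsplit


def http_request_attributes(method: str, vhost: str, target: str, status: int) -> dict:
    """Build slide-style string attributes from parsed request fields."""
    return {
        "http_req_method": method,
        "http_req_vhost": vhost,
        "http_req_path": target,                   # full target, as on the slide
        "http_req_query": urlsplit(target).query,  # text after the '?'
        "http_resp_code": str(status),
    }


def decoded_query(target: str) -> dict:
    """Percent-decode the query string into a dict of value lists."""
    return parse_qs(urlsplit(target).query)
```

Decoding the query of the example target yields `q` as `service:sshd`, since `%3A` is the percent-encoding of a colon.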
  16. Consumers
  17. Consumers
      • …do most of the work
      • Parallelism
      • Kafka offset management
      • Message de-duplication
      • Transformation (embedded library)
      • Dead letter queue support
      • Downstream system knowledge
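The responsibilities listed above can be sketched as a single processing loop. This is a simplified, in-memory illustration and not Rocana's consumer: it uses no Kafka client, `records` stands in for a partition's `(offset, event)` pairs, and the names `consume`, `sink`, `dead_letters`, and `seen_ids` are all assumptions.

```python
from typing import Callable, Iterable, List, Set, Tuple


def consume(
    records: Iterable[Tuple[int, dict]],
    transform: Callable[[dict], dict],
    sink: List[dict],
    dead_letters: List[dict],
    seen_ids: Set[str],
) -> int:
    """Process records in order and return the last offset handled.

    Returns -1 if there were no records. A real consumer would commit
    the returned offset back to Kafka once the sink write is durable.
    """
    last_offset = -1
    for offset, event in records:
        if event["id"] not in seen_ids:        # message de-duplication
            try:
                sink.append(transform(event))  # embedded transformation
            except Exception:
                dead_letters.append(event)     # park events that fail to parse
            seen_ids.add(event["id"])
        last_offset = offset                   # offset management
    return last_offset
```

Returning the offset instead of committing inside the loop keeps the commit decision with the caller. That is one way to get at-least-once delivery, with de-duplication by `id` making it appear exactly-once downstream.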
  18. Inside a consumer
  19. Metrics and time series
  20. Aggregation
      • Mostly for time series metrics
      • Two halves: on write and on query
      • Data model: (dimensions) => (aggregates)
      • On write
        • reduce(a: A, b: A) => A over window
        • Store "base" aggregates, all associative and commutative
      • On query
        • Perform same aggregate or derivative aggregates
        • Group by the same dimensions
        • SQL (Impala)
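The on-write half, `reduce(a: A, b: A) => A`, can be sketched with the five base aggregates named on the later example slide (count, sum, min, max, last). The type and function names here are illustrative. One assumption to flag: `last` is only commutative if it is decided by observation time rather than arrival order, so this sketch carries a `last_ts` alongside it.

```python
from dataclasses import dataclass
from functools import reduce
from typing import Iterable, Tuple


@dataclass(frozen=True)
class BaseAggregate:
    count: int
    value_sum: float
    value_min: float
    value_max: float
    last_ts: int        # observation time of value_last (assumed field)
    value_last: float


def of(ts: int, value: float) -> BaseAggregate:
    """Lift one observation into an aggregate."""
    return BaseAggregate(1, value, value, value, ts, value)


def merge(a: BaseAggregate, b: BaseAggregate) -> BaseAggregate:
    """Associative, commutative combine of two aggregates."""
    # Pick "last" by (timestamp, value) so argument order cannot matter.
    newer = a if (a.last_ts, a.value_last) >= (b.last_ts, b.value_last) else b
    return BaseAggregate(
        a.count + b.count,
        a.value_sum + b.value_sum,
        min(a.value_min, b.value_min),
        max(a.value_max, b.value_max),
        newer.last_ts,
        newer.value_last,
    )


def aggregate(observations: Iterable[Tuple[int, float]]) -> BaseAggregate:
    """Reduce (ts, value) pairs for one window; requires at least one pair."""
    return reduce(merge, (of(ts, v) for ts, v in observations))
```

Because `merge` gives the same answer in any order and grouping, partial aggregates from different consumers or different flushes can be combined again at query time. Derivative aggregates such as the mean fall out as `value_sum / count`.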
  21. Aside: late arriving data (it's a thing)
      • Never trust a (wall) clock
      • Producer determines observation time; the rest of the system always uses it
      • Data that shows up late is always processed according to observation time
      • Aggregation consequences
        • The same time window can appear multiple times
        • Solution: aggregate every N seconds, potentially generating multiple aggregates for the same time bin
      • This is real and you must deal with it
        • Do what we did, or
        • Build a system that mutates/replaces aggregates already output, or
        • Delay aggregate output for some slop time; drop it if late data shows up
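A minimal sketch of the "aggregate every N seconds" approach, assuming a simple count per time bin. The class name `FlushingCounter` and its methods are hypothetical, and the caller decides when to flush. Events are binned by observation time, so a late event produces a second row for a bin that was already flushed.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


class FlushingCounter:
    """Counts events per time bin and emits partial aggregates on flush."""

    def __init__(self, window_ms: int = 60_000) -> None:
        self.window_ms = window_ms
        self.pending: Dict[int, int] = defaultdict(int)

    def observe(self, ts: int) -> None:
        """Bin by the producer's observation time, never by arrival time."""
        bin_start = ts - ts % self.window_ms
        self.pending[bin_start] += 1

    def flush(self) -> List[Tuple[int, int]]:
        """Emit (bin_start, count) rows for everything seen since last flush."""
        rows = sorted(self.pending.items())
        self.pending.clear()
        return rows
```

If three events for the first minute arrive before one flush and a fourth arrives after it, that bin is emitted twice, with counts 3 and 1. Nothing already written is mutated; the duplicate rows are summed at query time.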
  22. Ex: service event volume by host and minute
      • Dimensions: ts, window, location, host, service, metric
      • On write, aggregates: count, sum, min, max, last
        • epoch, 60000, us-west-2a, w2a-demo-1, sshd, event_volume => 17, 42, 1, 10, 8
      • On query:
          SELECT floor(ts / 60000) as bin, loc, host, service, metric, sum(value_sum)
          FROM metrics
          WHERE ts BETWEEN x AND y AND metric = "event_volume"
          GROUP BY bin, loc, host, service, metric
      • If late arriving data existed in events, the same dimensions would repeat with another set of aggregates and would be rolled up as a result of the group by
      • tl;dr: normal window aggregation operations
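The effect of the GROUP BY on repeated rows can be illustrated in plain Python. This is a sketch of the query's logic over in-memory rows, not of Impala itself; the function name `rollup` and the row layout (dicts keyed by the query's column names) are assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Key = Tuple[int, str, str, str, str]


def rollup(rows: Iterable[dict], start: int, end: int, metric: str) -> Dict[Key, float]:
    """Sum value_sum per (bin, loc, host, service, metric), like the query."""
    totals: Dict[Key, float] = defaultdict(int)
    for r in rows:
        # BETWEEN is inclusive on both ends.
        if start <= r["ts"] <= end and r["metric"] == metric:
            key = (r["ts"] // 60000, r["loc"], r["host"], r["service"], r["metric"])
            totals[key] += r["value_sum"]
    return dict(totals)
```

Two stored rows for the same minute and dimensions, one from the original flush with `value_sum` 42 and one from late data with `value_sum` 3, come back as a single result row with a sum of 45.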
  23. Extension, pain, and advice
  24. Extending the system
      • Custom producers
      • Custom consumers
      • Event types
      • Parser / transformation plugins
      • Custom metric definition and aggregate functions
      • Custom processing jobs on landed data
  25. Pain (aka: the struggle is real)
      • Lots of tradeoffs when picking a stream processing solution
        • Apache Samza: right features, but low-level programming model, not supported by vendors, missing security features.
        • Apache Storm: too rigid, too slow. Not supported by all Hadoop vendors.
        • Apache Spark Streaming: tons of issues initially, but lots of community energy. Improving.
        • @digitallogic: "My heart says Samza, but my head says Spark Streaming."
        • Our (current) needs are meager; do work inside consumers.
      • Stack complexity, (relative im)maturity
      • Scaling SolrCloud to billions of events per day
  26. If you're going to try this…
      • Read all the literature on stream processing[1]
      • Treat it like the distributed systems problem it is
      • Understand, make, and make good on guarantees
      • Find the right abstractions
      • Never trust the hand waving or "hello worlds"
      • Fully evaluate the projects/products in this space
      • Understand it's not just about search
      [1] Wait, like all of it? Yeah, like all of it.
  27. Things I didn't talk about
      • Reprocessing data when bad code / transformations are detected
      • Dealing with data quality issues ("the struggle is real" part 2)
      • The user interface and all the fancy analytics
        • data visualization and exploration
        • event search
        • anomalous trend and event detection
        • metric, source, and event correlation
        • motif finding
        • noise reduction and dithering
      • Event delivery semantics (e.g. at least once, exactly once, etc.)
      • Alerting
  28. Questions?
      @fwiffo | batman@rocana.com