Building a Lambda Architecture with Elasticsearch at Yieldbot


2014-05-06: Presentation to the Boston Elasticsearch Meetup on Yieldbot's use of Elasticsearch in a Lambda Architecture

Published in: Engineering, Technology, Business

  1. Building a Lambda Architecture with Elasticsearch at Yieldbot. May 06, 2014. Richard Shea, CTO (@shearic); David White, Platform Architect (@dtabwhite).
  2. Lambda Architecture Summary: batch computation layer (canonical example: Hadoop -> HBase); real-time computation layer (canonical example: Storm -> Cassandra); serving layer (query HBase, query Cassandra, mix and return).
  3. Our Use Case: clickstreams of events (pageviews, impressions, clicks, etc.); events contain attributes; aggregating counts and performance breakdowns by several dimensions.
  4. Our Prior Approach: batch (HBase) and real-time (Redis); two different types of systems; two different access patterns; query ability limited.
  5. Kafka: persisted event queue; consumers keep track of their offset; horizontally scalable, topics can be partitioned, etc.
  6. Real-time Layer of Lambda with ES: daily index of “raw” events, where each event is a document; Elasticsearch Kafka River to index; real-time processing is trivial, just indexing events; aggregation of real-time info is pushed to query time.
  7. Batch Layer of Lambda with ES: monthly index of aggregated data documents; hourly, re-index events from the archive, which covers real-time issues; aggregate desired breakdowns into documents; when done, note the most recent hour completed.
  8. Serving Layer of Lambda with ES: query aggregated data documents as much as possible; query raw events from the last aggregated hour available to the present; combine aggregated and raw query results and return; we use Node.js, a natural fit.
  9. Why Elasticsearch? Real-time: calculations are query-time and flexible; real-time is simple. Batch: some pre-calculation; query-time ties it together. Serving: queries are flexible; batch and real-time query access patterns are similar.
  10. More Elasticsearch Goodies: Kibana (mostly real-time events; aggregated documents useful too); snapshotting for backups; real-time data daily indexes are optimized.
  11. Future: ES aggregations; split cluster with Tribe Nodes; aggregation via Spark.
  12. Good Lessons: use index aliases; build in an operational plan to re-index; doc_values for raw events and high-cardinality query results.
  13. Thank You
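
The serving-layer flow described on slide 8 (query aggregated documents up to the last completed batch hour, query raw events from there to the present, then combine) can be sketched roughly as below. This is a minimal illustration in Python, not the speakers' Node.js code. The function names, the `events-` index prefix, and the `YYYY.MM.DD` date pattern are all assumptions made for the example, and the actual Elasticsearch queries are left out: the sketch only covers how a requested time range might be split at the batch watermark and how two sets of counts might be merged.

```python
from datetime import datetime, timedelta


def split_range(start, end, last_aggregated_hour):
    """Split [start, end) into an aggregated part and a raw part.

    last_aggregated_hour is the start of the most recent hour the batch
    job has marked as completed (hypothetical representation), so
    aggregated documents are assumed to cover everything before
    last_aggregated_hour + 1 hour.
    """
    watermark = last_aggregated_hour + timedelta(hours=1)
    cut = min(max(watermark, start), end)  # clamp the watermark into the range
    aggregated = (start, cut) if cut > start else None
    raw = (cut, end) if end > cut else None
    return aggregated, raw


def daily_indices(start, end, prefix="events-"):
    """Names of the daily raw-event indexes touched by [start, end).

    The prefix and date pattern are illustrative assumptions.
    """
    day = start.date()
    last = (end - timedelta(microseconds=1)).date()
    names = []
    while day <= last:
        names.append(prefix + day.strftime("%Y.%m.%d"))
        day += timedelta(days=1)
    return names


def merge_counts(aggregated_counts, raw_counts):
    """Sum per-key counts from the aggregated and raw query results."""
    merged = dict(aggregated_counts)
    for key, n in raw_counts.items():
        merged[key] = merged.get(key, 0) + n
    return merged


if __name__ == "__main__":
    agg, raw = split_range(
        datetime(2014, 5, 5, 0, 0),
        datetime(2014, 5, 6, 15, 30),
        last_aggregated_hour=datetime(2014, 5, 6, 13, 0),
    )
    print("aggregated range:", agg)
    print("raw range:", raw, "in", daily_indices(*raw))
    print(merge_counts({"clicks": 10, "impressions": 200},
                       {"clicks": 2, "pageviews": 5}))
```

In this sketch the watermark is the only state the serving layer needs from the batch layer, which matches the "note the most recent hour completed" step on slide 7. If the batch job falls behind, the raw range simply grows and more of the aggregation happens at query time.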