Insight DE project

Project for the Insight Data Engineering Fellowship

Published in: Technology

  1. YeezyScore: A comparison of stream processing software. By: Kat Chuang @katychuang
  2. 10 mins
  3. High level overview
     Processing models: Batch, Streaming, Microbatching
                           Storm Trident     Spark Streaming
     Released              2011              2010
     Delivery semantics    Exactly once      Exactly once
     State management      Yes               Yes
     Latency               Seconds           Seconds
     Output                MapState          Resilient Distributed Dataset (RDD)
     Throughput            10k/node/sec?     400k/node/sec?
  4. Test Cases
     Metrics:
       1. Does every message pass through the pipeline?
       2. How long does each message take to process?
     Data:
       1. Timestamps
  5. Pipelines [Diagram: each pipeline takes Timestamp1 as input and emits the pair (Timestamp1, Timestamp2)]
  6. Metric 1: Does every message pass through the pipeline? [Figure: scatterplot]
  7. Metric 2: How long does each message take to process? [Figure: scatterplot]
  8. Storm Trident vs. Spark Streaming
     Storm Trident: Stream processing framework that also does micro-batching. Great for transforming or computing as data flows in. Complex event processing (CEP), continuous computation. Task-parallel computations, e.g. reading Twitter streams.
     Spark Streaming: Batch processing framework that also does micro-batching. Great for combining with historical data. ML algorithms included. Requires an HDFS-backed data source. Data-parallel computations, e.g. offering recommendations.
  9. Kat Chuang, Data Engineering Fellow #DE-2015c. hello@katychuang.com. GitHub: katychuang. Twitter: katychuang. IG: katychuang.nyc
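
The two test-case metrics (whether every message passes through the pipeline, and how long each message takes) could be computed from the timestamp data roughly as follows. This is a minimal sketch, not the code used in the project: it assumes each message is identified by its Timestamp1, that the pipeline emits (Timestamp1, Timestamp2) pairs, and that timestamps are in seconds. The function name `pipeline_metrics` and its return keys are hypothetical.

```python
def pipeline_metrics(sent, received):
    """Summarize delivery and latency for one pipeline run.

    sent:     list of Timestamp1 values written into the pipeline (assumed unique)
    received: list of (Timestamp1, Timestamp2) pairs read from the pipeline output
    """
    seen = {t1 for t1, _ in received}
    # Metric 1: which messages never came out the other end?
    missing = [t1 for t1 in sent if t1 not in seen]
    # Metric 2: per-message processing time is Timestamp2 - Timestamp1.
    latencies = [t2 - t1 for t1, t2 in received]
    return {
        "delivered_fraction": (len(sent) - len(missing)) / len(sent) if sent else 0.0,
        "missing": missing,
        "mean_latency": sum(latencies) / len(latencies) if latencies else None,
        "max_latency": max(latencies) if latencies else None,
    }


# Illustrative values only, not measurements from the project.
sent = [1.0, 2.0, 3.0, 4.0]
received = [(1.0, 1.5), (2.0, 2.25), (4.0, 5.0)]
report = pipeline_metrics(sent, received)
```

Plotting Timestamp1 against the latency for each received pair would give a scatterplot of the kind the slides describe.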

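Both frameworks in the comparison are described as doing micro-batching. As a rough illustration of the idea only (this is not how Storm Trident or Spark Streaming implement it), the sketch below groups timestamped events into fixed-width time windows and hands each window over as one small batch. The function name `micro_batches` and the `interval` parameter are hypothetical.

```python
from collections import defaultdict


def micro_batches(events, interval):
    """Group (timestamp, payload) events into consecutive windows of `interval` seconds.

    Returns the non-empty batches in time order; each batch is a list of payloads.
    """
    windows = defaultdict(list)
    for ts, payload in events:
        # Events whose timestamps fall in the same window share a batch index.
        windows[int(ts // interval)].append(payload)
    return [windows[k] for k in sorted(windows)]


# Illustrative events only.
events = [(0.1, "a"), (0.4, "b"), (1.2, "c"), (2.9, "d")]
batches = micro_batches(events, interval=1.0)
```

A larger interval means fewer, bigger batches and higher per-message latency, which is one reason micro-batching systems report latencies in seconds.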