
Real time data viz with Spark Streaming, Kafka and D3.js


Presented to the Cornell Data Hackathon at the NYC Tech Campus
March 6-8, 2015

Published in: Engineering


  1. Stream processing and visualization for transaction investigation using Kafka, Spark, and D3.js. Ben Laird, Capital One Labs.
  2. C1 Labs Data Science. About me: Cornell Engineering ’07, BS Operations Research; Johns Hopkins ’12, MS Applied Math. Data engineer at Northrop Grumman and IBM: space debris tracking, NLP of intel documents, counter-IED GIS analysis. (Cornell expectations vs. Cornell reality.)
  3. Now: Data Scientist at Capital One Labs.
  4. A technical challenge: build a dynamic, rich visualization of large, streaming data. Normally, we have two options: small data, easy visualization; big data, no visualization.
  5. Data science: more than just Hadoop.
     • Understanding all the requirements of your problem, and the architecture that meets those demands, is increasingly important for a data scientist.
     • A data-processing solution doesn’t matter if you have a 1 hr load time in the browser.
     • A visualization doesn’t matter if there is no way to process and store the data.
     Pipeline stages: stream handling, stream processing, intermediate storage, web server/framework, event-based communication, browser viz.
  6. Our system must be able to process and visualize a real-time transaction stream.
     • Requirement: the system must handle 1B+ transactions.
     • Loading 1B records on the client side isn’t feasible.
     • Our data is not only big, it is live.
     • Assume a stream of 50 records/second.
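A quick back-of-envelope check makes the aggregation requirement concrete. Taking the slide's assumed rate of 50 records/second, together with the 30-second batches and 5-minute sliding window described later in the talk, the raw volume inside one window is small enough for a browser-side index even though the total stream is not:

```python
# Illustrative sizing only, using the talk's assumed numbers.
RATE = 50            # records per second
BATCH_S = 30         # micro-batch interval, in seconds
WINDOW_S = 5 * 60    # 5-minute sliding window, in seconds

records_per_batch = RATE * BATCH_S      # records arriving per micro-batch
records_per_window = RATE * WINDOW_S    # raw records one window spans

print(records_per_batch)   # 1500
print(records_per_window)  # 15000
```

15,000 raw records per window is trivial for a client-side index; 1B+ lifetime records is not, hence the need to aggregate upstream.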
  7. Proposed solution: use existing big data tools to process the stream before the web stack.
     • Apache Kafka: distributed messaging for the transaction stream.
     • Apache Spark Streaming: distributed processing of the transaction stream; aggregates to levels that the browser can handle.
     • MongoDB: intermediate storage in a capped collection for web server access.
     • Node.js: server-side framework for the web server and Mongo interaction.
     • Event-based communication: pass new data from the stream into the browser.
     • Crossfilter: client-side data index.
     • DC.js/D3.js: D3.js graphics and integration with Crossfilter.
     How and why did I pick these for our architecture?
  8. A foray into data visualization tools. From the beautiful: Minard Map, 1869.
  9. ...to the ‘not beautiful’.
  10. With most solutions, you face a trade-off between ease of use and flexibility. If you need a quick solution, or don’t need full control or customization, there are fantastic options:
      • Tableau
      • Elasticsearch Kibana
  11. D3.js provides an extremely powerful way of joining data with completely custom graphics. Limitless possibilities; complete control over data and viz. Not trivial to use, though!
  12. Bind data directly to elements in the DOM. Create graphics from scratch.
  13. It’s all about finding the right level of abstraction. Enter DC.js.
      • You don’t always want to construct bar charts from the ground up.
      • Building axes, ticks, colors, scales, bar widths, heights, projections... is too tedious sometimes.
      • DC.js adds a thin layer on top of D3.js to construct most chart types and to link charts together for fast filtering.
  14. DC.js combines D3.js with Square’s Crossfilter.
      • Built by Square.
      • JavaScript library for very fast (<50 ms) filtering of multi-dimensional datasets.
      • Developed for transaction analysis (perfect!).
      • Very fast sorting and filtering.
      • Downside: only practical up to a couple million records.
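Crossfilter itself is a JavaScript library, so the following is only a rough Python analogue of its dimension/group idea: index records by a field, filter on one dimension, then reduce the survivors. The function names and toy transactions here are mine, not Crossfilter's API:

```python
from collections import defaultdict

# Toy transactions mirroring the talk's domain (merchant, zip, amount).
txns = [
    {"merchant": "A", "zip": "10001", "amount": 25.0},
    {"merchant": "B", "zip": "10001", "amount": 40.0},
    {"merchant": "A", "zip": "14850", "amount": 10.0},
]

def dimension(records, key):
    """Index records by one field, loosely like a crossfilter dimension."""
    idx = defaultdict(list)
    for r in records:
        idx[r[key]].append(r)
    return idx

by_merchant = dimension(txns, "merchant")

# Filter on one dimension, then reduce over the survivors; roughly what
# crossfilter's filter() followed by group().reduceSum() accomplish.
selected = by_merchant["A"]
total = sum(r["amount"] for r in selected)
print(len(selected), total)  # 2 35.0
```

The real library keeps these indexes incrementally updated so filters resolve in well under 50 ms, which is why the slide caps its practical size at a couple million records.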
  15. We need some backend processing to aggregate data before we hit the web stack.
      Apache Kafka (stream handling):
      • Developed by LinkedIn.
      • Fast, scalable publish-subscribe messaging service that runs on a distributed cluster.
      Apache Spark Streaming (transaction processing):
      • Part of the larger Apache Spark compute engine.
      • Fast, in-memory stream processing over sliding windows.
      • Handles the data aggregation steps.
      • Can be used to run ML algorithms.
  16. What is Apache Spark? Write programs in terms of transformations on distributed datasets.
      Resilient Distributed Datasets (RDDs):
      • Collections of objects spread across a cluster, stored in RAM or on disk.
      • Built through parallel transformations.
      • Automatically rebuilt on failure.
      Operations:
      • Transformations (e.g. map, filter, groupBy).
      • Actions (e.g. count, collect, save).
      Source: McDonough, Spark tutorial, Spark Summit 2013.
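The transformation/action split can be sketched in plain Python, with local lists standing in for the RDDs that Spark would distribute across a cluster (this is an illustration of the programming model, not Spark itself):

```python
from functools import reduce

data = [1, 2, 3, 4, 5, 6]

# Transformations describe derived datasets (lazy in real Spark).
squared = list(map(lambda x: x * x, data))           # like
evens = list(filter(lambda x: x % 2 == 0, squared))  # like rdd.filter(...)

# Actions actually materialize a result.
count = len(evens)                         # like rdd.count()
total = reduce(lambda a, b: a + b, evens)  # like rdd.reduce(_ + _)
print(count, total)  # 3 56
```

In real Spark the transformations build a lineage graph, which is what lets lost partitions be rebuilt automatically on failure.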
  17. Word Count in Spark vs Java MapReduce
      scala> val rdd = sc.textFile("all_text_corpus.txt")
      scala> val allWords = rdd.flatMap(sentence => sentence.split(" "))
      scala> val counts = => (word, 1)).reduceByKey(_ + _)
      scala>{ case (k, v) => (v, k) }.sortByKey(ascending = false).map{ case (v, k) => (k, v) }.take(25)
      Array(("",70230), (the,63641), (and,38896), (of,34986), (to,31743), (a,22481), (in,18710), (his,14712), (was,13963), (that,13735), (he,13588), (I,11761), (with,11308), (had,9303), (her,8429), (not,7900), (as,7641), (it,7626), (for,7619), (at,7574), (on,7350), (is,6383), (you,6173), (be,5525), (by,5315))
  18. Word Count in Spark vs Java MapReduce (the Java MapReduce version, shown on-slide, is far longer).
  19. Transaction aggregation with Spark: batch up incoming transactions every 30 seconds, and compute the average transaction size and total number of transactions for every merchant and zip code over a 5-minute sliding window. Write the batched results to MongoDB.
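What that windowed aggregation computes can be sketched in plain Python. Spark Streaming would perform this incrementally across the cluster (e.g. with a windowed reduce by key); the timestamps, field layout, and function name below are illustrative, not the talk's actual code:

```python
from collections import defaultdict

WINDOW_S = 5 * 60  # 5-minute sliding window, per the slide

# (timestamp_seconds, merchant, zip, amount): illustrative records.
txns = [
    (10,  "A", "10001", 20.0),
    (60,  "A", "10001", 40.0),
    (120, "B", "14850", 10.0),
    (340, "A", "10001", 30.0),
]

def window_stats(records, now):
    """Per (merchant, zip): transaction count and average amount
    over the records falling inside the sliding window."""
    sums = defaultdict(lambda: [0, 0.0])  # key -> [count, total amount]
    for ts, merchant, zipcode, amount in records:
        if now - WINDOW_S < ts <= now:
            acc = sums[(merchant, zipcode)]
            acc[0] += 1
            acc[1] += amount
    return {k: (c, t / c) for k, (c, t) in sums.items()}

print(window_stats(txns, now=350))
# {('A', '10001'): (2, 35.0), ('B', '14850'): (1, 10.0)}
```

Each 30-second batch would emit one such per-key summary document to MongoDB, so the browser only ever sees aggregates, never raw transactions.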
  20. MongoDB for intermediate storage.
      • Use a capped collection to immediately find the last element.
      • No costly O(N)-or-worse searches.
      • Tap into Mongo with Node.js.
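A capped collection keeps only the newest documents in insertion order, so the latest batch is always at the tail. As a rough in-process analogue (not MongoDB itself), a bounded deque shows the behavior; the cap of 3 is arbitrary for illustration:

```python
from collections import deque

CAP = 3                       # illustrative cap; Mongo caps by bytes/docs
capped = deque(maxlen=CAP)    # oldest entries fall off automatically

for batch_id in range(5):
    capped.append({"batch": batch_id})  # each Spark batch appends one doc

print(list(capped))  # [{'batch': 2}, {'batch': 3}, {'batch': 4}]
print(capped[-1])    # newest batch in O(1): {'batch': 4}
```

The same property is what makes the web server cheap: it reads the tail of the collection rather than scanning or sorting.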
  21. Node.js and an event-based channel for server-side updates.
      • Add a listener in the client-side JavaScript.
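The listener pattern the slide refers to is plain publish/subscribe: the server emits an event when a new batch lands, and the browser's registered callback folds it into the charts. This toy Python emitter is purely illustrative of the pattern, not the talk's Node.js code:

```python
class Emitter:
    """Minimal publish/subscribe: listeners register, emit pushes to all."""
    def __init__(self):
        self.listeners = {}

    def on(self, event, fn):
        self.listeners.setdefault(event, []).append(fn)

    def emit(self, event, payload):
        for fn in self.listeners.get(event, []):
            fn(payload)

received = []
bus = Emitter()
# The browser side would do the equivalent: listen for a new-data event,
# add the payload to crossfilter, and redraw the DC.js charts.
bus.on("newData", received.append)

bus.emit("newData", {"merchant": "A", "count": 2, "avg": 35.0})
print(received)  # [{'merchant': 'A', 'count': 2, 'avg': 35.0}]
```

Push-based delivery is what keeps the visualization live: the client never polls MongoDB, it just reacts to each emitted batch.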
  22. Demo!