Building a Data Pipeline from Scratch - Joe Crobak

Published on http://www.hakkalabs.co/articles/building-data-pipeline-scratch

Published in: Software


  1. BUILDING A DATA PIPELINE FROM SCRATCH. Joe Crobak (@joecrobak). Tuesday, June 24, 2014, Axium Lyceum, New York, NY
  2. INTRODUCTION: Software Engineer @ Project Florida. Previously: Foursquare, Adconion Media Group, Joost
  3. OVERVIEW: Why do we care? Defining "Data Pipeline". Events. System Architecture
  4. DATA PIPELINES ARE EVERYWHERE
  5. RECOMMENDATIONS: http://blog.linkedin.com/2010/05/12/linkedin-pymk/
  6. RECOMMENDATIONS: Clicks + Views → Recommendations. http://blog.linkedin.com/2010/05/12/linkedin-pymk/
  7. AD NETWORKS
  8. AD NETWORKS: Clicks + Impressions → User Ad Profile
  9. SEARCH: http://lucene.apache.org/solr/
  10. SEARCH: Page Rank → Search Rankings. http://www.jevans.com/pubnetmap.html
  11. A/B TESTING: https://flic.kr/p/4ieVGa
  12. A/B TESTING: A conversions + B conversions → Experiment Analysis. https://flic.kr/p/4ieVGa
  13. DATA WAREHOUSING: http://gethue.com/hadoop-ui-hue-3-6-and-the-search-dashboards-are-out/
  14. DATA WAREHOUSING: user events → Data Warehouse → key metrics. http://gethue.com/hadoop-ui-hue-3-6-and-the-search-dashboards-are-out/
  15. WHAT IS A DATA PIPELINE?
  16. DATA PIPELINE: A Data Pipeline is a unified system for capturing events for analysis and building products.
  17. DATA PIPELINE: click data, user events, web visits, email sends, … → Data Warehouse → Product Features and Ad Hoc analysis (Counting, Machine Learning, Extract-Transform-Load (ETL))
  18. DATA PIPELINE: A Data Pipeline is a unified system for capturing events for analysis and building products.
  19. EVENTS
  20. EVENTS: Each of these actions can be thought of as an event.
  21. COARSE-GRAINED EVENTS: Events are captured as a by-product, stored in text logs used primarily for debugging and secondarily for analysis.
  22. COARSE-GRAINED EVENTS: 127.0.0.1 - - [17/Jun/2014:01:53:16 UTC] "GET / HTTP/1.1" 200 3969 (IP Address, Timestamp, Action, Status)
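The coarse-grained log line on slide 22 has to be parsed before it can be analyzed. A minimal Python sketch (the regex and field names are illustrative, not from the talk) that extracts the annotated fields:

```python
import re

# Minimal parser for the Apache-style access-log line shown on the slide.
# Illustrative only; a production parser would cover the full log format.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<action>[^"]+)" (?P<status>\d{3}) (?P<size>\d+)'
)

def parse_log_line(line):
    """Extract IP address, timestamp, action, and status from one log line."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

parsed = parse_log_line(
    '127.0.0.1 - - [17/Jun/2014:01:53:16 UTC] "GET / HTTP/1.1" 200 3969'
)
print(parsed["ip"], parsed["status"])  # 127.0.0.1 200
```

This is exactly the "require writing parsers" drawback mentioned later: every analysis of coarse-grained logs starts with code like this.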
  23. COARSE-GRAINED EVENTS: Implicit tracking, i.e. a "page load" event is a proxy for one or more other events. E.g. the event GET /newsfeed corresponds to: App Load (but only if this is the first load this session); Timeline load; user is in "group A" of an A/B test. These implementation details have to be known at analysis time.
  24. FINE-GRAINED EVENTS: Record events like "app opened", "auto refresh", and "user pull-down refresh" rather than "GET /newsfeed".
  25. FINE-GRAINED EVENTS: Annotate events with contextual information, like the view the user was on and which button was clicked.
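A fine-grained, annotated event might look like the following sketch (the field names and helper are my own illustration, not from the talk):

```python
import json
import time
import uuid

def make_event(name, **context):
    """Wrap an explicit action name plus contextual fields into one event.
    Field names here are illustrative assumptions, not a standard."""
    return {
        "event_id": str(uuid.uuid4()),  # unique id for dedup downstream
        "name": name,                   # explicit action, not a URL proxy
        "timestamp": time.time(),
        "context": context,             # e.g. which view, which button
    }

event = make_event("button_click", view="newsfeed", button="refresh")
print(json.dumps(event))
```

Because the event carries its own context, the analyst no longer needs to know at analysis time which requests implied which actions.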
  26. FINE-GRAINED EVENTS: Decouple logging and analysis. Create events for everything!
  27. FINE-GRAINED EVENTS: A couple of schema-less formats are popular (e.g. JSON and CSV), but they have drawbacks: they make it harder to change schemas, they are inefficient, and they require writing parsers.
  28. SCHEMA: Used to describe data, providing a contract about fields and their types. Two schemas are compatible if you can read data written in schema 1 with schema 2.
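The compatibility rule on slide 28 is the heart of Avro-style schema resolution: a reader schema may add a field with a default and still read records written under the older schema. A toy Python sketch of that idea (the schema representation is invented for illustration; real Avro uses its own JSON schema language):

```python
# Avro-style schema-resolution sketch. "writer_schema" is what the data was
# written with; "reader_schema" adds a field with a default, so the two
# schemas remain compatible in the slide's sense.
writer_schema = {"fields": {"user_id": str, "action": str}}
reader_schema = {
    "fields": {"user_id": str, "action": str, "ab_group": str},
    "defaults": {"ab_group": "control"},
}

def read_record(record, reader):
    """Resolve a record against the reader schema, filling in defaults."""
    resolved = {}
    for field in reader["fields"]:
        if field in record:
            resolved[field] = record[field]
        elif field in reader.get("defaults", {}):
            resolved[field] = reader["defaults"][field]
        else:
            raise ValueError("incompatible: missing field %r" % field)
    return resolved

old_record = {"user_id": "42", "action": "click"}  # written with writer_schema
print(read_record(old_record, reader_schema))
```

If the reader adds a field without a default, the read fails, which is exactly when the two schemas stop being compatible.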
  29. SCHEMA: Facilitates automated analytics: summary statistics, session/funnel analysis, A/B testing.
  30. SCHEMA: https://engineering.twitter.com/research/publication/the-unified-logging-infrastructure-for-data-analytics-at-twitter
  31. SCHEMA: client:page:section:component:element:action, e.g. iphone:home:mentions:tweet:button:click. Count iPhone users clicking from the home page: iphone:home:*:*:*:click. Count home clicks on buttons or avatars: *:home:*:*:{button,avatar}:click
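The wildcard queries on slide 31 can be answered by compiling the pattern into a regular expression. The helper below is an illustrative sketch of that matching, not Twitter's actual implementation:

```python
import re

def pattern_to_regex(pattern):
    """Translate the slide's event-name patterns into a compiled regex:
    '*' matches any single component, '{a,b}' matches either alternative."""
    parts = []
    for comp in pattern.split(":"):
        if comp == "*":
            parts.append("[^:]+")
        elif comp.startswith("{") and comp.endswith("}"):
            parts.append("(?:%s)" % "|".join(comp[1:-1].split(",")))
        else:
            parts.append(re.escape(comp))
    return re.compile("^" + ":".join(parts) + "$")

rx = pattern_to_regex("*:home:*:*:{button,avatar}:click")
print(bool(rx.match("iphone:home:mentions:tweet:button:click")))  # True
print(bool(rx.match("iphone:home:mentions:tweet:link:click")))   # False
```

Counting events for a pattern is then just filtering event names through the compiled regex.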
  32. KEY COMPONENTS
  33. EVENT FRAMEWORK: For easily generating events from your applications.
  34. EVENT FRAMEWORK: For easily generating events from your applications.
  35. BIG MESSAGE BUS: Horizontally scalable, redundant, has APIs / easy to integrate.
  36. BIG MESSAGE BUS: Scribe (Facebook), Apache Chukwa, Apache Flume, Apache Kafka (my recommendation). Horizontally scalable, redundant, has APIs / easy to integrate.
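To make the message-bus role concrete, here is a toy in-memory stand-in. It only illustrates the API shape a Kafka-like bus offers (append to a topic log, consumers pull at their own offset), none of the scalability or redundancy; all names are invented for illustration:

```python
from collections import defaultdict

class MiniBus:
    """Toy in-memory stand-in for a Kafka-like bus: producers append to a
    per-topic log, and each consumer tracks its own read offset (pull model).
    Illustrative only; no partitioning, replication, or persistence."""
    def __init__(self):
        self.topics = defaultdict(list)
        self.offsets = defaultdict(int)  # (topic, consumer) -> next offset

    def produce(self, topic, message):
        self.topics[topic].append(message)

    def consume(self, topic, consumer):
        offset = self.offsets[(topic, consumer)]
        messages = self.topics[topic][offset:]
        self.offsets[(topic, consumer)] = len(self.topics[topic])
        return messages

bus = MiniBus()
bus.produce("user_events", {"name": "app_opened"})
bus.produce("user_events", {"name": "button_click"})
print(bus.consume("user_events", "warehouse_loader"))  # both events
print(bus.consume("user_events", "warehouse_loader"))  # [] (caught up)
```

Because each consumer keeps its own offset, the warehouse loader and an ad hoc analysis job can read the same events independently, which is the decoupling the bus buys you.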
  37. DATA PERSISTENCE: For storing your events in files for batch processing.
  38. DATA PERSISTENCE: For storing your events in files for batch processing. Kite Software Development Kit (http://kitesdk.org/) and Spring Hadoop (http://projects.spring.io/spring-hadoop/).
  39. WORKFLOW MANAGEMENT: For coordinating the tasks in your data pipeline.
  40. WORKFLOW MANAGEMENT: For coordinating the tasks in your data pipeline … or your own system written in your language of choice.
  41. SERIALIZATION FRAMEWORK: Used for converting an event to bytes on disk. Provides an efficient, cross-language framework for serializing/deserializing data.
  42. SERIALIZATION FRAMEWORK: Apache Avro (my recommendation), Apache Thrift, Protocol Buffers (Google).
  43. BATCH PROCESSING AND AD HOC ANALYSIS: Apache Hadoop (MapReduce), Apache Hive (or another SQL-on-Hadoop engine), Apache Spark
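The batch-processing style these tools share is map-then-reduce. A plain-Python sketch of counting events in that shape (the sample events are invented; Hadoop and Spark distribute exactly this computation across a cluster):

```python
from collections import Counter
from itertools import chain

# Illustrative sample events; in the pipeline these would come from
# persistent storage, not an in-memory list.
events = [
    {"name": "app_opened", "client": "iphone"},
    {"name": "button_click", "client": "iphone"},
    {"name": "app_opened", "client": "android"},
]

def map_phase(event):
    """Map each event to (key, 1) pairs, MapReduce-style."""
    yield (event["name"], 1)

def reduce_phase(pairs):
    """Sum the mapped values per key."""
    counts = Counter()
    for key, value in pairs:
        counts[key] += value
    return dict(counts)

result = reduce_phase(chain.from_iterable(map_phase(e) for e in events))
print(result)  # {'app_opened': 2, 'button_click': 1}
```

Swapping the map function is all it takes to count by client, by A/B group, or by any other event field, which is why schemas plus batch processing cover so much ad hoc analysis.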
  44. SYSTEM OVERVIEW: Application (logging framework + data serialization) → Message Bus → Persistent Storage → Data Warehouse → Ad hoc Analysis and Product; Production DB dumps also feed the warehouse; a workflow engine coordinates the data flow.
  45. SYSTEM OVERVIEW (OPINIONATED): The same architecture with specific choices: Apache Avro for serialization, Apache Kafka as the message bus, and Luigi as the workflow engine.
  46. NEXT STEPS: This architecture opens up a lot of possibilities: near-real-time computation (Apache Storm, Apache Samza (incubating), Apache Spark Streaming); sharing information between services asynchronously, e.g. to augment user profile information; cross-datacenter replication; columnar storage.
  47. LAMBDA ARCHITECTURE: Term coined by Nathan Marz (creator of Apache Storm) for hybrid batch and real-time processing. Batch processing is treated as the source of truth, and real-time processing updates models/insights between batches.
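The batch-as-source-of-truth idea can be sketched in a few lines of Python. This is a toy illustration of the query-time merge, with invented names, not a real Lambda implementation:

```python
from collections import Counter

# Batch layer: rebuilt from all events on each batch run (source of truth).
# Speed layer: incremental counts for events seen since the last batch.
batch_view = Counter()
realtime_view = Counter()

def batch_recompute(all_events):
    """Recompute the batch view from scratch and absorb the realtime delta."""
    batch_view.clear()
    batch_view.update(e["name"] for e in all_events)
    realtime_view.clear()

def realtime_update(event):
    """Speed layer: cheap incremental update between batch runs."""
    realtime_view[event["name"]] += 1

def query(name):
    """Serve queries by merging the batch view with the realtime delta."""
    return batch_view[name] + realtime_view[name]

batch_recompute([{"name": "click"}, {"name": "click"}])
realtime_update({"name": "click"})
print(query("click"))  # 3
```

Any mistake in the speed layer is wiped out by the next batch recompute, which is why the batch layer can be treated as the source of truth.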
  48. LAMBDA ARCHITECTURE: http://lambda-architecture.net/
  49. SUMMARY: Data pipelines are everywhere. It is useful to think of data as events. A unified data pipeline is very powerful. There is a plethora of open-source tools for building data pipelines.
  50. FURTHER READING: "The Unified Logging Infrastructure for Data Analytics at Twitter"; "The Log: What every software engineer should know about real-time data's unifying abstraction" (Jay Kreps, LinkedIn); "Big Data" by Nathan Marz and James Warren; "Implementing Microservice Architectures"
  51. THANK YOU: Questions? Shameless plug: www.hadoopweekly.com
  52. EXTRA SLIDES
  53. WHY KAFKA? See https://kafka.apache.org/documentation.html#design. The pull model works well; easy to configure and deploy; good JVM support; well-integrated with the LinkedIn stack.
  54. WHY LUIGI? Scripting language (you'll end up writing scripts anyway); simplicity (low learning curve); idempotency; easy to deploy.
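The idempotency point is worth making concrete. In Luigi, a task declares an output and is skipped when that output already exists; the sketch below imitates that behavior in plain Python (the class and helper are illustrative, not Luigi's actual API):

```python
import os
import tempfile

class Task:
    """Luigi-style task sketch: declares one output file; complete() checks
    for it, so re-running a pipeline only redoes missing work."""
    def __init__(self, output_path):
        self.output_path = output_path

    def complete(self):
        return os.path.exists(self.output_path)

    def run(self):
        with open(self.output_path, "w") as f:
            f.write("done\n")

def run_pipeline(tasks):
    """Run each incomplete task; return the outputs actually produced."""
    ran = []
    for task in tasks:
        if not task.complete():  # idempotency: skip finished tasks
            task.run()
            ran.append(task.output_path)
    return ran

workdir = tempfile.mkdtemp()
task = Task(os.path.join(workdir, "daily_report.txt"))
print(run_pipeline([task]))  # first run produces the output
print(run_pipeline([task]))  # [] (nothing to redo)
```

This is why a crashed pipeline can simply be rerun from the top: completed tasks are no-ops, and only the missing outputs get rebuilt.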
  55. WHY AVRO? Self-describing files; integrated with nearly everything in the ecosystem; CLI tools for dumping to JSON and CSV.
