Building a Data Pipeline from Scratch - Joe Crobak
Presentation Transcript

  • 1. BUILDING A DATA PIPELINE From Scratch 1 Joe Crobak @joecrobak. Tuesday, June 24, 2014. Axium Lyceum - New York, NY
  • 2. INTRODUCTION 2 Software Engineer @ Project Florida ! Previously: •Foursquare •Adconion Media Group •Joost
  • 3. OVERVIEW 3 Why do we care? Defining Data Pipeline Events System Architecture
  • 4. 4 DATA PIPELINES ARE EVERYWHERE
  • 5. RECOMMENDATIONS 5 http://blog.linkedin.com/2010/05/12/linkedin-pymk/
  • 6. RECOMMENDATIONS 6 Clicks Views Recommendations http://blog.linkedin.com/2010/05/12/linkedin-pymk/
  • 7. AD NETWORKS 7
  • 8. AD NETWORKS 8 Clicks Impressions User Ad Profile
  • 9. SEARCH 9 http://lucene.apache.org/solr/
  • 10. SEARCH 10 Search Rankings Page Rank http://www.jevans.com/pubnetmap.html
  • 11. A / B TESTING 11 https://flic.kr/p/4ieVGa
  • 12. A / B TESTING 12 https://flic.kr/p/4ieVGa A conversions B conversions Experiment Analysis
  • 13. DATA WAREHOUSING 13 http://gethue.com/hadoop-ui-hue-3-6-and-the-search-dashboards-are-out/
  • 14. DATA WAREHOUSING 14 http://gethue.com/hadoop-ui-hue-3-6-and-the-search-dashboards-are-out/ key metrics user events Data Warehouse
  • 15. 15 WHAT IS A DATA PIPELINE?
  • 16. DATA PIPELINE 16 A Data Pipeline is a unified system for capturing events for analysis and building products.
  • 17. DATA PIPELINE 17 click data user events Data Warehouse web visits email sends … Product Features Ad Hoc analysis •Counting •Machine Learning •Extract Transform Load (ETL)
  • 18. DATA PIPELINE 18 A Data Pipeline is a unified system for capturing events for analysis and building products.
  • 19. 19 EVENTS
  • 20. EVENTS 20 Each of these actions can be thought of as an event.
  • 21. COARSE-GRAINED EVENTS 21 •Events are captured as a by-product. •Stored in text logs used primarily for debugging and secondarily for analysis.
  • 22. COARSE-GRAINED EVENTS 22 127.0.0.1 - - [17/Jun/2014:01:53:16 UTC] "GET / HTTP/1.1" 200 3969 (IP Address, Timestamp, Action, Status) •Events are captured as a by-product. •Stored in text logs used primarily for debugging and secondarily for analysis.
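A minimal Python sketch (not from the talk) of how such a coarse-grained access-log line might be parsed back into the fields called out on the slide; the regex and field names are illustrative.

```python
import re

# Parse a combined-log-style line into IP address, timestamp, action, and status.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<action>[^"]+)" (?P<status>\d{3}) (?P<bytes>\d+)'
)

def parse_access_log_line(line):
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

line = '127.0.0.1 - - [17/Jun/2014:01:53:16 UTC] "GET / HTTP/1.1" 200 3969'
print(parse_access_log_line(line))
# {'ip': '127.0.0.1', 'timestamp': '17/Jun/2014:01:53:16 UTC',
#  'action': 'GET / HTTP/1.1', 'status': '200', 'bytes': '3969'}
```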
  • 23. COARSE-GRAINED EVENTS 23 Implicit tracking—i.e. a “page load” event is a proxy for ≥1 other event. ! e.g. the event GET /newsfeed corresponds to: •App Load (but only if this is the first time it is loaded this session) •Timeline load, user is in “group A” of an A/B Test. These implementation details have to be known at analysis time.
  • 24. FINE-GRAINED EVENTS 24 Record events like: •app opened •auto refresh •user pull down refresh ! Rather than: •GET /newsfeed
  • 25. FINE-GRAINED EVENTS 25 Annotate events with contextual information like: •view the user was on •which button was clicked
  • 26. FINE-GRAINED EVENTS 26 Decouple logging and analysis. Create events for everything!
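To illustrate “create events for everything,” here is a hedged sketch of what a fine-grained, context-annotated event could look like in Python; the field names (event_id, context, ab_test_group, and so on) are assumptions for the example, not a schema from the talk.

```python
import time
import uuid

def make_event(name, user_id, **context):
    """Build a fine-grained event annotated with contextual information
    (which view the user was on, which button was clicked, A/B group)."""
    return {
        "event_id": str(uuid.uuid4()),
        "name": name,                     # e.g. "user_pull_down_refresh"
        "user_id": user_id,
        "timestamp_ms": int(time.time() * 1000),
        "context": context,
    }

event = make_event(
    "user_pull_down_refresh",
    user_id="u123",
    view="newsfeed",
    button="refresh",
    ab_test_group="A",
)
```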
  • 27. FINE-GRAINED EVENTS 27 A couple of schema-less formats are popular (e.g. JSON and CSV), but they have drawbacks. •harder to change schemas •inefficient •require writing parsers
  • 28. SCHEMA 28 Used to describe data, providing a contract about fields and their types. ! Two schemas are compatible if you can read data written in schema 1 with schema 2.
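A small illustration of the schema-as-contract idea, assuming Avro-style record schemas written as Python dicts (the exact schemas are hypothetical): version 2 adds an optional field with a default, so data written with version 1 can still be read with version 2, i.e. the two schemas are compatible.

```python
# Version 1: the original contract about fields and their types.
EVENT_SCHEMA_V1 = {
    "type": "record",
    "name": "Event",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "user_id", "type": "string"},
        {"name": "timestamp_ms", "type": "long"},
    ],
}

# Version 2: adds an optional field with a default, so records written with
# version 1 can still be read with version 2.
EVENT_SCHEMA_V2 = {
    "type": "record",
    "name": "Event",
    "fields": EVENT_SCHEMA_V1["fields"] + [
        {"name": "ab_test_group", "type": ["null", "string"], "default": None},
    ],
}
```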
  • 29. SCHEMA 29 Facilitates automated analytics—summary statistics, session/funnel analysis, A/B testing.
  • 30. SCHEMA 30 https://engineering.twitter.com/research/publication/the-unified-logging-infrastructure-for-data-analytics-at-twitter Facilitates automated analytics—summary statistics, session/funnel analysis, A/B testing.
  • 31. SCHEMA 31 client:page:section:component:element:action e.g.: ! iphone:home:mentions:tweet:button:click! ! Count iPhone users clicking from home page: ! iphone:home:*:*:*:click! ! Count home clicks on buttons or avatars: ! *:home:*:*:{button,avatar}:click
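A toy matcher for this naming convention (my sketch, not Twitter's implementation) that supports the '*' wildcard and the '{a,b}' alternatives shown on the slide.

```python
def matches(event_name, pattern):
    """Match a colon-delimited event name against a pattern that may use
    '*' for any value and '{a,b}' for a set of allowed values."""
    name_parts = event_name.split(":")
    pattern_parts = pattern.split(":")
    if len(name_parts) != len(pattern_parts):
        return False
    for part, pat in zip(name_parts, pattern_parts):
        if pat == "*":
            continue
        if pat.startswith("{") and pat.endswith("}"):
            if part not in pat[1:-1].split(","):
                return False
        elif part != pat:
            return False
    return True

# iPhone users clicking from the home page:
assert matches("iphone:home:mentions:tweet:button:click", "iphone:home:*:*:*:click")
# Home clicks on buttons or avatars:
assert matches("web:home:feed:post:avatar:click", "*:home:*:*:{button,avatar}:click")
```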
  • 32. 32 KEY COMPONENTS
  • 33. EVENT FRAMEWORK 33 For easily generating events from your applications
  • 34. EVENT FRAMEWORK 34 For easily generating events from your applications
  • 35. BIG MESSAGE BUS 35 •Horizontally scalable •Redundant •APIs / easy to integrate
  • 36. BIG MESSAGE BUS 36 •Scribe (Facebook) •Apache Chukwa •Apache Flume •Apache Kafka* ! •Horizontally scalable •Redundant •APIs / easy to integrate * My recommendation
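A minimal producer sketch using the third-party kafka-python client (the talk does not name a client library); it JSON-encodes events for brevity, whereas the architecture described here would serialize them with Avro. The broker address and topic name are assumptions.

```python
import json
from kafka import KafkaProducer  # third-party package: kafka-python

# Assumed local broker and an "events" topic; values are JSON-encoded bytes.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

producer.send("events", {"name": "app_opened", "user_id": "u123"})
producer.flush()  # block until buffered events have been sent
```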
  • 37. DATA PERSISTENCE 37 For storing your events in files for batch processing
  • 38. DATA PERSISTENCE 38 For storing your events in files for batch processing: Kite Software Development Kit (http://kitesdk.org/), Spring Hadoop (http://projects.spring.io/spring-hadoop/)
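A plain-file sketch of date-partitioned persistence, just to show the idea; in practice Kite SDK or Spring Hadoop (and HDFS) would manage this layout for you, and the base path and file format here are hypothetical.

```python
import json
import os
from datetime import datetime, timezone

def persist_event(event, base_dir="/data/events"):
    """Append an event to a date-partitioned file so batch jobs can process
    one day's worth of events at a time."""
    day = datetime.now(timezone.utc).strftime("%Y-%m-%d")
    path = os.path.join(base_dir, f"date={day}", "events.jsonl")
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "a") as f:
        f.write(json.dumps(event) + "\n")
```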
  • 39. WORKFLOW MANAGEMENT 39 For coordinating the tasks in your data pipeline
  • 40. WORKFLOW MANAGEMENT 40 For coordinating the tasks in your data pipeline … or your own system written in your language of choice.
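Since the opinionated overview and the extra slides recommend Luigi, here is a toy Luigi task that shows the parameter/output/run contract a workflow engine gives you; the paths and the counting job itself are hypothetical.

```python
import datetime
import luigi

class DailyEventCount(luigi.Task):
    """Count the events in one day's partition and write the total to a report."""
    date = luigi.DateParameter()

    def output(self):
        return luigi.LocalTarget(f"/data/reports/event_count_{self.date}.txt")

    def run(self):
        events_path = f"/data/events/date={self.date}/events.jsonl"
        with open(events_path) as events, self.output().open("w") as out:
            out.write(str(sum(1 for _ in events)))

if __name__ == "__main__":
    luigi.build([DailyEventCount(date=datetime.date.today())], local_scheduler=True)
```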
  • 41. SERIALIZATION FRAMEWORK 41 Used for converting an Event to bytes on disk. Provides efficient, cross-language framework for serializing/deserializing data.
  • 42. SERIALIZATION FRAMEWORK 42 •Apache Avro* •Apache Thrift •Protocol Buffers (Google) Used for converting an Event to bytes on disk. Provides an efficient, cross-language framework for serializing/deserializing data.
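A minimal Avro round trip, assuming the third-party fastavro library (not named in the talk): define a record schema, write one event to a buffer, and read it back.

```python
from io import BytesIO
from fastavro import parse_schema, reader, writer  # third-party package: fastavro

schema = parse_schema({
    "type": "record",
    "name": "Event",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "user_id", "type": "string"},
        {"name": "timestamp_ms", "type": "long"},
    ],
})

buf = BytesIO()
writer(buf, schema, [{"name": "app_opened", "user_id": "u123", "timestamp_ms": 0}])
buf.seek(0)
print(list(reader(buf)))  # [{'name': 'app_opened', 'user_id': 'u123', 'timestamp_ms': 0}]
```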
  • 43. BATCH PROCESSING AND AD HOC ANALYSIS 43 •Apache Hadoop (MapReduce) •Apache Hive (or other SQL-on-Hadoop) •Apache Spark
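An ad hoc analysis sketch using Spark (one of the options above), counting events per name in a single day's partition; the input path follows the hypothetical layout from the earlier sketches.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("event-counts").getOrCreate()

# Read one day's partition of JSON events and count them by event name.
events = spark.read.json("/data/events/date=2014-06-24/events.jsonl")
events.groupBy("name").count().orderBy("count", ascending=False).show()
```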
  • 44. SYSTEM OVERVIEW 44 Application logging framework data serialization Message Bus Persistent Storage Data Warehouse Ad hoc Analysis Product data flow workflow engine Production DB dumps
  • 45. SYSTEM OVERVIEW (OPINIONATED) 45 Application logging framework data serialization Message Bus Persistent Storage Data Warehouse Ad hoc Analysis Product data flow workflow engine Production DB dumps Apache Avro (data serialization) Apache Kafka (message bus) Luigi (workflow engine)
  • 46. NEXT STEPS 46 This architecture opens up a lot of possibilities •Near-real time computation—Apache Storm, Apache Samza (incubating), Apache Spark streaming. •Sharing information between services asynchronously—e.g. to augment user profile information. •Cross-datacenter replication •Columnar storage
  • 47. LAMBDA ARCHITECTURE 47 Term coined by Nathan Marz (creator of Apache Storm) for hybrid batch and real-time processing. ! Batch processing is treated as the source of truth, and real-time processing updates models/insights between batches.
  • 48. LAMBDA ARCHITECTURE 48 http://lambda-architecture.net/
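A conceptual sketch (mine, not from lambda-architecture.net) of the query-time merge: the batch view is the source of truth, and the speed layer only covers events that arrived since the last batch run.

```python
# Batch view: recomputed from the full event history on each batch run.
batch_view = {"app_opened": 10_000}
# Real-time view: incremental counts for events since the last batch run.
realtime_view = {"app_opened": 42}

def query(metric):
    """Serve a metric by merging the batch and real-time views."""
    return batch_view.get(metric, 0) + realtime_view.get(metric, 0)

print(query("app_opened"))  # 10042
```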
  • 49. SUMMARY 49 •Data Pipelines are everywhere. •Useful to think of data as events. •A unified data pipeline is very powerful. •Plethora of open-source tools to build data pipeline.
  • 50. FURTHER READING 50 The Unified Logging Infrastructure for Data Analytics at Twitter ! The Log: What every software engineer should know about real-time data's unifying abstraction (Jay Kreps, LinkedIn) ! Big Data by Nathan Marz and James Warren ! Implementing Microservice Architectures
  • 51. THANK YOU 51 Questions? ! Shameless plug: www.hadoopweekly.com
  • 52. 52 EXTRA SLIDES
  • 53. WHY KAFKA? 53 • https://kafka.apache.org/documentation.html#design • Pull model works well • Easy to configure and deploy • Good JVM support • Well-integrated with the LinkedIn stack
  • 54. WHY LUIGI? 54 • Scripting language (you’ll end up writing scripts anyway) • Simplicity (low learning curve) • Idempotency • Easy to deploy
  • 55. WHY AVRO? 55 • Self-describing files • Integrated with nearly everything in the ecosystem • CLI tools for dumping to JSON, CSV