How to Build a Data-Driven Company: From Infrastructure to Insights

Companies like Buffer, SeatGeek, and Asana aren’t just talking about the value of data; they’re building data infrastructure that can actually deliver it. Join this 45-minute webinar to learn why these companies are investing in data and what you need to know to keep up.

Published in: Technology

  1. #datastack (Shaun)
  2. What you’re going to learn (Shaun)
       1. How top engineering organizations are building their data infrastructure
       2. The 7 core challenges of data integration
       3. Why companies like Asana, Buffer, and SeatGeek choose Redshift for their analytics warehouse
       ...and much more!
  3. Data Infrastructure: Then and Now (Dillon)
  4. The traditional approach: ETL (Dillon)
       [Diagram labels: ETL team, EDW team, BI team, end user; heavy transformation; summary tables; OLAP / silos; restricted Q&A]
  5. How companies are doing it today: ELT (Dillon)
       [Diagram labels: Extract, Load, Database; Modeling Layer (Transform at Query); Transform (and Explore!); Analytics Viz & Exploration]
       Model snippet shown on the slide:
         - name: first_purchasers
           type: single_value
           base_view: orders
           measures: [orders.customer.all]
  6. Benefits of this approach (Dillon)
       1. Redshift is performant enough to handle most transformations
       2. Users prefer performing transformations in a language they already use (SQL) or with UI
       3. Transformations are much simpler, more transparent
       4. Performing transformations alongside raw data is great for auditability
  7. Data infrastructure has geek cred (Shaun)
  8. Data infrastructure has geek cred (Shaun)
  9. Data infrastructure has geek cred (Shaun)
  10. Data infrastructure has geek cred (Shaun)
  11. What the stack looks like (Shaun): Data Integration, Data Warehouse, BI/Analytics
  12. Data Integration (Shaun)
  13. Why consolidation matters
  14. internal analytics (Shaun)
  15. Quick poll (Shaun)
       Which five data sources are your top priority to integrate or keep integrated?
       ● production databases
       ● events
       ● error logs
       ● billing
       ● email marketing
       ● CRM
       ● advertising
       ● ERP
       ● A/B testing
       ● support
  16. “A year ago, we were facing a lot of stability problems with our data processing. When there was a major shift in a graph, people immediately questioned the data integrity. It was hard to distinguish interesting insights from bugs. Data science is already an art so you need the infrastructure to give you trustworthy answers to the questions you ask. 99% correctness is not good enough. And on the data infrastructure team, we were spending a lot of time churning on fighting urgent fires, and that prevented us from making much long-term progress. It was painful.”
       - Marco Gallotta, Asana, How to Build Stable, Accessible Data Infrastructure at a Startup
  17. “Our story would end here if real-time processing were perfect. But it’s not: some events can come in days late, some time ranges need to be reprocessed after initial ingestion due to code changes or data revisions, various components of the real-time pipeline can fail, and so on.”
       - Gian Merlino, MetaMarkets, Building a Data Pipeline That Handles Billions of Events in Real-Time
  18. 7 core challenges of data integration (Shaun)
       ● Connections: Every API is a unique and special snowflake
       ● Accuracy: Ordering data on a distributed system
       ● Latency: Large object data stores (Amazon S3, Redshift) are optimized for batches, not streams
       ● Scale: Data will grow exponentially as your company grows
       ● Flexibility: You’re interacting with systems you don’t control
       ● Monitoring: Notifications for expired credentials, errors, and disruptions
       ● Maintenance: Justifying investment in ongoing maintenance/improvement
  19. Or... try Pipeline (Shaun)
       [Diagram labels: Ad Platforms, Customer Support, Web Data, Marketing Automation, CRM, Payments, Ecommerce]
  20. Warehousing Infrastructure (Shaun)
  21. Analytics warehouse (Shaun)
       Redshift is the most common analytics warehouse. Chosen by: Asana, Braintree, Looker, SeatGeek, VigLink, Buffer
  22. awesome (Shaun)
  23. Airbnb experiment (Shaun)
                                                   Hive              Redshift
       Test 1: 3 billion rows of data              28 minutes        <6 minutes
       Test 2: two joins with millions of rows     182 seconds       8 seconds
       Cost                                        $1.29/hour/node   $0.85/hour/node
  24. Periscope research (Shaun)
  25. DiamondStream’s dashboard query performance (Shaun)
  26. Business Intelligence & Analytics (Dillon)
  27. A broken model (Dillon)
       ● Feedback loop is broken
       ● Disparate reporting
       ● Non-unified decision making
       ● Versioning
       ● Reusability is lost
       [Diagram labels: Marketing, Finance, AM]
  28. Constraints of SQL (Dillon)
       SQL is versatile, but shares the same flavor as assembly-only languages such as Perl
       ● Can write but not read
       ● Promotes one-off, piecemeal analysis
       ● Disparate interpretation
  29. The critical multiplier: modeling (Dillon)
       [Diagram labels: Any SQL Data Warehouse, Modeling Layer]
       ● What’s our most successful marketing campaign?
       ● How does our Q4 pipeline look?
       ● Who are our healthiest / happiest customers?
  30. analytics (Dillon)
       ● Data access
       ● Uniform definitions
       ● A Shared View
       ● Collaboration
       ● Analytical Speed
  31. What You Can Do (Dillon)
  32. analytics tools (Dillon)
       [Slide labels: Week 1, Week 2-3, RJMetrics Pipeline, BLOCKS]
  33. marketing
  34. marketing
  35. analytics
  36. analytics
  37. Thank you!
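
The ELT flow described on slides 5 and 6 (load raw data first, then transform it with SQL at query time, next to the raw rows) can be sketched in a few lines. This is only an illustration under stated assumptions: SQLite stands in for an analytics warehouse such as Redshift, and the table name `raw_orders`, the view name `first_purchases`, and the sample rows are hypothetical, not taken from the webinar.

```python
import sqlite3

# Hypothetical raw order rows, standing in for data extracted from a source system.
RAW_ORDERS = [
    (1, "alice", "2015-01-03", 1999),
    (2, "bob", "2015-01-04", 500),
    (3, "alice", "2015-02-10", 2500),
    (4, "carol", "2015-02-11", 1200),
]


def load_raw(conn, rows):
    """Extract + Load: store the rows untransformed in a raw table."""
    conn.execute(
        "CREATE TABLE raw_orders ("
        "order_id INTEGER, customer TEXT, ordered_on TEXT, amount_cents INTEGER)"
    )
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?, ?)", rows)


def define_model(conn):
    """Transform: a SQL view evaluated at query time, alongside the raw data."""
    conn.execute(
        "CREATE VIEW first_purchases AS "
        "SELECT customer, MIN(ordered_on) AS first_order_date "
        "FROM raw_orders GROUP BY customer"
    )


def first_purchasers_by_month(conn):
    """Explore: count customers by the month of their first order."""
    query = (
        "SELECT substr(first_order_date, 1, 7) AS month, COUNT(*) "
        "FROM first_purchases "
        "GROUP BY substr(first_order_date, 1, 7) "
        "ORDER BY month"
    )
    return conn.execute(query).fetchall()


conn = sqlite3.connect(":memory:")  # SQLite is an assumption, used only for the sketch
load_raw(conn, RAW_ORDERS)
define_model(conn)
result = first_purchasers_by_month(conn)
print(result)
```

Because the transformation lives in a view rather than in a separate pre-load step, the raw rows stay available for auditing, and changing the definition does not require reloading any data.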
