
Collecting and Making Sense of Diverse Data at WayUp


My 20-minute talk at DataEngConf 2017 in NYC, in the Startup track.

Published in: Data & Analytics


  1. Collecting and Making Sense of Diverse Data at WayUp. Harlan D. Harris, PhD, Director of Data Science. DataEngConf 2017. Thanks to JJ Fliegelman (CTO) and the WayUp engineers!
  2. Why we built WayUp: the leading digital platform for employers to reach, recruit, and engage candidates in an authentic way, with a focus on college students and recent grads. One of thirty innovative companies changing the world.
  3. Talking about Choices ● Where we focus effort ○ Event Collection & Data Refinement ● Tech stack ○ Segment, Redshift, dbt, Periscope ● Warehouse table design ○ ELT, layers & abstractions ● We’re Hiring!
  4. Data Sources
  5. Why We Warehouse: support Business Analytics and Product (Data Science) ● Clean, Normalized Tables ● Abstract over Changes in Systems ● Right Type of Domain Knowledge. Data reflects the world; decisions & products reflect the world.
  6. Tech Stack for Analytics ● Segment ● Amazon Redshift ● S3/Spectrum ● dbt ● Periscope + targeted tools. Avoid vendor lock-in; design to minimize downstream impact.
  7. Event Tracking: “Actions with Meaning” ● Heap approach ○ Developers don’t make choices ○ Automatically get every load and click ○ UI changes can lose continuity ● Traditional approach ○ Developers choose what to track ○ Can miss things -- requires communication! ○ Can keep semantic continuity across changes ○ Less lock-in
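The “traditional” approach on this slide -- developers explicitly choosing named events -- can be sketched as a small tracking-plan check. This is illustrative only, not WayUp’s actual code; the event names, property keys, and `track` helper are all hypothetical:

```python
# Hypothetical tracking plan: event name -> required property keys.
# Explicit names keep semantic continuity even when the UI changes.
TRACKING_PLAN = {
    "Listing Viewed": {"listing_id", "position"},
    "Application Started": {"listing_id"},
}

def track(user_id, event, properties):
    """Validate an event against the plan before sending it downstream."""
    if event not in TRACKING_PLAN:
        raise ValueError(f"Unknown event {event!r} -- add it to the tracking plan")
    missing = TRACKING_PLAN[event] - properties.keys()
    if missing:
        raise ValueError(f"{event!r} missing properties: {sorted(missing)}")
    # In production this would forward to a collector such as Segment.
    return {"userId": user_id, "event": event, "properties": properties}

payload = track("user-42", "Listing Viewed",
                {"listing_id": "sales-123", "position": 3})
```

The plan doubles as the communication channel the slide warns about: adding a new event forces a (reviewable) edit to one shared file.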
  8. Redshift and Spectrum ● Value of familiarity, broad support ● Sweet spot in scale, room to grow ● Spectrum ○ External tables on S3 CSV ○ Query and join like internal tables ○ Avoid or delay loading until needed ○ Use transform tools to load
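An external table on S3 CSV is just DDL pointing Redshift at a prefix. A minimal sketch of generating that DDL -- the schema, table, columns, and bucket below are made up for illustration:

```python
# Build a Redshift Spectrum CREATE EXTERNAL TABLE statement for CSV on S3,
# so the files can be queried and joined like internal tables without loading.
def external_table_ddl(schema, table, columns, s3_path):
    cols = ",\n  ".join(f"{name} {ctype}" for name, ctype in columns)
    return (
        f"CREATE EXTERNAL TABLE {schema}.{table} (\n  {cols}\n)\n"
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','\n"
        "STORED AS TEXTFILE\n"
        f"LOCATION '{s3_path}';"
    )

ddl = external_table_ddl(
    "spectrum", "raw_events",
    [("event_id", "varchar(36)"), ("ts", "timestamp"), ("payload", "varchar(max)")],
    "s3://example-bucket/raw_events/",
)
print(ddl)
```

Once the table exists, “avoid or delay loading” means a plain `INSERT INTO ... SELECT` from the external table is all it takes to materialize the rows you actually need.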
  9. The ELT Pattern ● “Data Lake” in columnar database ● Piped in via Segment data loader ● Transform on-database vs. in-transit ● Requires compute power, but space is cheap ● Can be more agile, “schema on read”. “Most data transformation use cases can be much more effectively handled in-database rather than in some external processing layer” (dbt)
  10. dbt
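The core idea behind dbt: models are SELECT statements that reference each other via `{{ ref('...') }}`, and the tool derives a dependency graph and runs them in order. A toy sketch of that resolution (the model names and SQL below are hypothetical, and real dbt does far more):

```python
import re
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Hypothetical dbt-style models: name -> SQL with ref() dependencies.
MODELS = {
    "stg_users": "select * from raw.users",
    "stg_events": "select * from raw.events",
    "dim_user": "select * from {{ ref('stg_users') }}",
    "act_user": ("select * from {{ ref('dim_user') }} "
                 "join {{ ref('stg_events') }} using (user_id)"),
}

def build_order(models):
    """Extract ref() edges and return a valid model build order."""
    deps = {
        name: set(re.findall(r"\{\{\s*ref\('([^']+)'\)\s*\}\}", sql))
        for name, sql in models.items()
    }
    return list(TopologicalSorter(deps).static_order())

order = build_order(MODELS)
```

This is what makes the layered-warehouse design practical: each layer is just a model that refs the one below it, and the ordering falls out automatically.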
  11. Abstraction Layers: Raw, Lookup, Staging, Analytics, Reporting, Product (Spectrum)
  12. Dimension Tables and Activity Streams. User tables: hist_user, now_user, dim_user, act_user. The activity stream is a fact table with a specific, consistent structure (see the WeWork talk!): actor, ts, action, object, ob_type, properties -- e.g. Alice, Nov 2nd, viewed, sales-123, listing, { pos: 3 }.
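The point of the fixed fact-table shape is that every event, whatever its type, fits the same six columns. A minimal sketch of that record, using the field names and example row from the slide (the class itself is illustrative):

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class Activity:
    """One row of the activity-stream fact table."""
    actor: str                 # who acted
    ts: str                    # when (a real table would use a timestamp type)
    action: str                # what they did
    object: str                # what they did it to
    ob_type: str               # the object's type
    properties: dict[str, Any] = field(default_factory=dict)  # event-specific extras

# The example row from the slide:
row = Activity("Alice", "Nov 2nd", "viewed", "sales-123", "listing", {"pos": 3})
```

Anything event-specific (like the listing’s position on the page) goes in `properties`, so new event types need no schema change.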
  13. What We’ve Learned; Where We’re Going ● Pay close attention to what you store, and how you refine data ● Tools now are amazing ● Design with empathy and creativity ● Grow it with the business! ● Build insights & products to help our users and customers!
  14. Thanks! We’re Hiring! Data Scientist (Recommender Systems), Data Engineer (this stuff!), FS & BE Engineers (Python). @harlanh