Collecting and Making Sense of Diverse Data at WayUp
Harlan D. Harris, PhD
Director of Data Science
DataEngConf 2017
Thanks to:
JJ Fliegelman (CTO)
WayUp Engineers!
Why we built WayUp...
The leading digital platform for employers to reach, recruit, and engage candidates in an authentic way.
… with the focus on college students and recent grads.
One of thirty innovative companies changing the world.
Talking about Choices
● Where we focus effort
○ Event Collection & Data Refinement
● Tech stack
○ Segment, Redshift, dbt, Periscope
● Warehouse table design
○ ELT, layers & abstractions
● We’re Hiring!
Data Sources
Why We Warehouse
Support Business Analytics and Product (Data Science)
● Clean, Normalized Tables
● Abstract over Changes in Systems
● Right Type of Domain Knowledge
Data Reflects the World
Decisions & Products Reflect the World
Tech Stack for Analytics
● Segment
● Amazon Redshift
● S3/Spectrum
● dbt
● Periscope + targeted tools
Avoid Vendor Lock-in; Design to Minimize Downstream Impact
Event Tracking
● Heap approach
○ Developers don’t make choices
○ Automatically get every load and click
○ UI changes can lose continuity
● Traditional approach
○ Developers choose what to track
○ Can miss stuff -- requires communication!
○ Can keep semantic continuity across changes
○ Less lock-in
“Actions with Meaning”
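A minimal sketch of the traditional approach, where developers pick a small vocabulary of meaningful actions and report those rather than raw clicks. The `Action` names, the `InMemoryTracker` class, and the property keys are hypothetical illustrations, not WayUp's code or Segment's API; the tracker only buffers events in memory.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Any, Dict, List


class Action(str, Enum):
    """Semantic event names (hypothetical), independent of any UI element."""
    LISTING_VIEWED = "Listing Viewed"
    APPLICATION_SUBMITTED = "Application Submitted"


@dataclass
class TrackedEvent:
    user_id: str
    action: Action
    properties: Dict[str, Any] = field(default_factory=dict)
    ts: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


class InMemoryTracker:
    """Stand-in for a real event collector; it only buffers events."""

    def __init__(self) -> None:
        self.events: List[TrackedEvent] = []

    def track(self, user_id: str, action: Action, **properties: Any) -> TrackedEvent:
        # Only actions from the agreed vocabulary are accepted.
        if not isinstance(action, Action):
            raise ValueError(f"untracked action: {action!r}")
        event = TrackedEvent(user_id, action, dict(properties))
        self.events.append(event)
        return event


tracker = InMemoryTracker()
# An old search page and a redesigned feed both report the same action,
# so the event keeps its meaning across UI changes.
tracker.track("alice", Action.LISTING_VIEWED, listing_id="sales-123", source="search_page")
tracker.track("alice", Action.LISTING_VIEWED, listing_id="sales-123", source="feed_v2")
```

Rejecting anything outside the vocabulary is one possible design choice: it forces the "requires communication" step to happen before a new event ships.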
Redshift and Spectrum
● Value of familiarity, broad support
● Sweet spot in scale, room to grow
● Spectrum
○ External tables on S3 CSV
○ Query and join like internal tables
○ Avoid or delay loading until needed
○ Use Transform tools to load
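As an illustration of "external tables on S3 CSV", the sketch below renders a Redshift `CREATE EXTERNAL TABLE` statement from Python. The schema, table, column, and bucket names are hypothetical, and the helper does no identifier quoting, so treat it as an illustration rather than production code.

```python
from typing import Dict


def external_csv_table_ddl(
    schema: str, table: str, columns: Dict[str, str], s3_location: str
) -> str:
    """Render a Redshift Spectrum CREATE EXTERNAL TABLE statement for CSV files."""
    column_sql = ",\n".join(
        f"    {name} {sql_type}" for name, sql_type in columns.items()
    )
    return (
        f"CREATE EXTERNAL TABLE {schema}.{table} (\n"
        f"{column_sql}\n"
        ")\n"
        "ROW FORMAT DELIMITED\n"
        "FIELDS TERMINATED BY ','\n"
        "STORED AS TEXTFILE\n"
        f"LOCATION '{s3_location}';"
    )


ddl = external_csv_table_ddl(
    schema="spectrum",                           # hypothetical external schema
    table="raw_lookup",                          # hypothetical table name
    columns={"id": "integer", "label": "varchar(64)"},
    s3_location="s3://example-bucket/lookups/",  # hypothetical bucket
)
print(ddl)
```

The external schema itself has to exist first (created with `CREATE EXTERNAL SCHEMA`). Once the table is defined, it can be queried and joined like an internal table, while the files stay on S3.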
The ELT Pattern
● “Data Lake” in columnar database
● Piped in via Segment data loader
● Transform on-database vs. in-transit
● Requires compute power, but space is cheap
● Can be more agile, “schema on read”
“most data transformation use cases can be much more effectively handled in-database rather than in some external processing layer” -dbt
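The load-then-transform order can be sketched with Python's built-in `sqlite3` standing in for the warehouse (an assumption for runnability; the talk uses Redshift). Table and column names are hypothetical. Raw rows are loaded untouched, and the cleanup happens afterwards as SQL inside the database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Extract + Load: raw events land as-is, inconsistencies included.
conn.execute("CREATE TABLE raw_events (received_at TEXT, user_name TEXT, event TEXT)")
conn.executemany(
    "INSERT INTO raw_events VALUES (?, ?, ?)",
    [
        ("2017-11-02T10:00:00Z", "Alice ", "Listing Viewed"),
        ("2017-11-02T10:05:00Z", "bob", "listing viewed"),
        ("2017-11-02T10:07:00Z", "alice", "Application Submitted"),
    ],
)

# Transform: a view defined in the database normalizes the raw rows on read.
conn.execute(
    """
    CREATE VIEW stg_events AS
    SELECT received_at            AS ts,
           lower(trim(user_name)) AS user_name,
           lower(event)           AS action
    FROM raw_events
    """
)

rows = conn.execute(
    "SELECT user_name, COUNT(*) FROM stg_events "
    "WHERE action = 'listing viewed' GROUP BY user_name ORDER BY user_name"
).fetchall()
print(rows)  # [('alice', 1), ('bob', 1)]
```

Because the raw table is never modified, the transformation can be changed later and rerun against the same data.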
dbt
https://blog.fishtownanalytics.com/what-exactly-is-dbt-47ba57309068
Abstraction Layers
Layers (diagram labels):
● Raw
● Lookup
● Staging
● Analytics
● Reporting
● Product
● (Spectrum)
Dimension Tables and Activity Streams
hist_user, now_user, dim_user, act_user

actor | ts      | action | object    | ob_type | properties
Alice | Nov 2nd | viewed | sales-123 | listing | { pos: 3 }

fact table with specific, consistent structure (see WeWork talk!)
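The activity-stream row above can be sketched as a small record type. The six field names come from the slide; the `Activity` class, the `from_listing_view` mapper, and the raw event keys are hypothetical, showing one way a source-specific event might be mapped onto the shared structure.

```python
import json
from dataclasses import asdict, dataclass
from typing import Any, Dict


@dataclass(frozen=True)
class Activity:
    """One row of the activity stream: who did what to which object, and when."""
    actor: str
    ts: str
    action: str
    object: str
    ob_type: str
    properties: Dict[str, Any]


def from_listing_view(raw: Dict[str, Any]) -> Activity:
    """Map a hypothetical raw 'listing view' event onto the shared structure."""
    return Activity(
        actor=raw["user"],
        ts=raw["timestamp"],
        action="viewed",
        object=raw["listing_id"],
        ob_type="listing",
        # Anything specific to this kind of event goes into properties.
        properties={"pos": raw["position"]},
    )


row = from_listing_view(
    {"user": "Alice", "timestamp": "Nov 2nd", "listing_id": "sales-123", "position": 3}
)
print(json.dumps(asdict(row)))
```

Keeping the fixed columns identical for every action, and pushing the variable parts into `properties`, is what lets different kinds of events share one fact table.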
What We’ve Learned; Where We’re Going
● Pay close attention to what you store, and how you refine data
● Tools now are amazing
● Design with empathy and creativity
● Grow it with the business!
● Build insights & products to help our users and customers!
Thanks!
We’re Hiring!
● Data Scientist (Recommender Systems)
● Data Engineer (this stuff!)
● FS & BE Engineers (Python)
harlan@wayup.com, @harlanh
