Data Engineering at Udemy
Keeyong Han
Principal Data Architect @Udemy
Ankara Data Meetup - Bilkent Cyberpark,
August 5, 2015
About Me
• 20+ years of experience at 9 different
companies
• Currently manage the Data team at Udemy
• Prior to joining Udemy
– Manager of data/search team at Polyvore
– Director of Engineering at Yahoo Search
– Started career at Samsung Electronics in Korea
Agenda
• Typical Evolution of Data Processing
• Data Engineering at Udemy
• Lessons Learned
TYPICAL EVOLUTION OF
DATA PROCESSING
From a small start-up
In the beginning
• You don’t have any data 
• So no data infrastructure or data science
– The most important thing is to survive and to keep
iterating
After a struggle you have some data
• You have survived and now have some
data to work with
– Data analysts are hired
– They want to analyze the data
Then …
• You don’t know exactly where the data is
• You find your data but
– It is not clean and is missing key information
– It is likely not in the format you want
• You store it in non-optimal storage
– MySQL is likely used to store all kinds of data
• But MySQL doesn’t scale
– You ask analysts to query MySQL
• They will kill the web site a few times 
Now what to do? (I)
• You have to find scalable, separate storage for
data analysis
– This is called a Data Warehouse or Data Analytics store
– This will be the central storage for your important data
– Udemy uses AWS Redshift
• Migrate some data out of MySQL
– Key/Value data to a NoSQL solution (Cassandra/HBase,
MongoDB, …)
– Log-type data to log files (use the Nginx log, for example)
– MySQL should hold only the key data that the
web service needs
Now what to do? (II)
• The goal is to put all data into a single
store
– This is the most important and the very first step
toward becoming a “true” data organization
– This storage should be separate from runtime
storage (MySQL, for example)
– This storage should be scalable
– Being consistent is more important than being
correct in the beginning
Now You Add More Data
• Different Ways of Collecting Data
– This is called ETL (Extract, Transform and Load)
– Different Aspects to Consider
• Size: 1KB to 20GB
• Frequency: Hourly, Daily, Weekly, Monthly
• How to collect data:
– FTP, API, Webhook, S3, HTTP, mysql command line
• You will have multiple data collection workflows
– Use cron jobs (or some scheduler) to manage them
– Udemy uses Pinball (Open Source from Pinterest)
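The three ETL steps named above can be sketched roughly as follows. This is a minimal illustration, not Udemy's pipeline: the CSV payload, the `signups` table, and the function names are all hypothetical, and an in-memory SQLite database stands in for the data warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical raw export; in practice this might arrive via FTP, an API, or S3.
RAW_CSV = """user_id,country,signup_date
1,TR,2015-08-01
2,,2015-08-02
3,us,2015-08-03
"""

def extract(text):
    """Extract: parse raw CSV text into dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: drop rows missing key fields and normalize formats."""
    cleaned = []
    for row in rows:
        if not row["country"]:
            continue  # skip rows that are missing key information
        cleaned.append((int(row["user_id"]), row["country"].upper(), row["signup_date"]))
    return cleaned

def load(conn, rows):
    """Load: write the cleaned rows into a warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS signups (user_id INTEGER, country TEXT, signup_date TEXT)"
    )
    conn.executemany("INSERT INTO signups VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")  # assumption: SQLite stands in for the warehouse
load(conn, transform(extract(RAW_CSV)))
loaded = conn.execute("SELECT user_id, country FROM signups ORDER BY user_id").fetchall()
print(loaded)
```

A scheduler (cron or otherwise) would then run a script like this once per collection interval.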
How It Will Look
[Diagram components: Your Cool Web Service; Log Files; MySQL; Key/Value; External Data Sources; ETL (shown twice); Data Warehouse]
Simple Data Import
• Just use a scripting language
– Many data sources are small and simple enough
to handle with a scripting language
• Udemy uses Python for this purpose
– Implemented a set of Python classes to handle
different types of data import
– Plan to open source this in the 1st half of 2016
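One way such a set of import classes could be organized is sketched below. Udemy's actual classes are not shown in the slides, so every name here (`Importer`, `CsvImporter`, `JsonLinesImporter`) is hypothetical; the idea is only that each source type gets its own subclass behind a shared interface.

```python
import csv
import io
import json
from abc import ABC, abstractmethod

class Importer(ABC):
    """Hypothetical base class: one subclass per type of data source."""

    def __init__(self, payload):
        self.payload = payload  # raw text already fetched from the source

    @abstractmethod
    def parse(self):
        """Return the payload as a list of dictionaries."""

    def run(self):
        rows = self.parse()
        # A real importer would load the rows into the warehouse here.
        return rows

class CsvImporter(Importer):
    def parse(self):
        return list(csv.DictReader(io.StringIO(self.payload)))

class JsonLinesImporter(Importer):
    def parse(self):
        return [json.loads(line) for line in self.payload.splitlines() if line.strip()]

csv_rows = CsvImporter("id,name\n1,alice\n").run()
json_rows = JsonLinesImporter('{"id": 2, "name": "bob"}\n').run()
print(csv_rows, json_rows)
```

Adding a new source type then means writing one more `parse` method rather than a new script.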
Large Data Batch Import
• Large data import and processing will require
a more scalable solution
• Hadoop can be used for this purpose
– SQL on Hadoop: Hive, Tajo, Presto and so on
– Pig, Java MapReduce
• Spark is getting a lot of attention and we plan
to evaluate it
Realtime Data Import
• Some data is better imported as it
happens
• This requires a different type of technology
– Realtime Data Message Queue: Kafka, Kinesis
– Realtime Data Consumer: Storm, Samza, Spark
Streaming
What’s Next? (I)
• Build Summary Tables
– Having raw data tables is good but they can be too
detailed and carry too much information
– Build these tables in your Data Warehouse
• Track the performance of key metrics
– These should come from the summary tables above
– You need a dashboard tool (build one or use a 3rd-
party solution: Birst, Chartio, Tableau and so on)
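As a rough illustration of a summary table, the sketch below aggregates a raw purchases table into one row per day. The table and column names are made up for the example, and an in-memory SQLite database stands in for the warehouse.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # assumption: SQLite stands in for the warehouse
conn.execute("CREATE TABLE raw_purchases (day TEXT, user_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO raw_purchases VALUES (?, ?, ?)",
    [
        ("2015-08-01", 1, 10.0),
        ("2015-08-01", 2, 20.0),
        ("2015-08-02", 1, 15.0),
    ],
)

# Summary table: one row per day instead of one row per purchase.
conn.execute(
    """
    CREATE TABLE daily_revenue AS
    SELECT day, COUNT(*) AS purchases, SUM(amount) AS revenue
    FROM raw_purchases
    GROUP BY day
    """
)
summary = conn.execute(
    "SELECT day, purchases, revenue FROM daily_revenue ORDER BY day"
).fetchall()
print(summary)
```

A dashboard would query `daily_revenue` directly instead of scanning the raw table on every page load.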
What’s Next? (II)
• Provide this data to the Data Science team
– Draw insights and create a feedback loop
– Build machine-learned models for recommendation,
search ranking and so on
– The topic for the next session (Thanks Larry!)
• Supporting Data Science from Infrastructure
– This will require scalable infrastructure
– Example: Scoring every user/course pair in Udemy
• 7M users × 30K courses = 210B pairs to compute
– You need a scalable Serving Layer (Cassandra, HBase, …)
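The pair count above can be checked with a few lines of arithmetic. The user and course counts come from the slide; the scoring throughput is a purely hypothetical figure, included only to give a sense of why a single machine is not enough.

```python
users = 7_000_000   # 7M users, from the slide
courses = 30_000    # 30K courses, from the slide
pairs = users * courses
print(f"{pairs:,} user/course pairs")

# Hypothetical throughput, chosen only to illustrate the scale.
scores_per_second = 1_000_000
hours_single_worker = pairs / scores_per_second / 3600
print(f"{hours_single_worker:.1f} hours on one worker")
```

Even at a million scores per second, one worker would need more than two days per full pass, which is why the scoring runs on a cluster.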
DATA ENGINEERING AT UDEMY
Data Warehouse/Analytics
• We use AWS Redshift as our data warehouse
(or data analytics backend)
• What is AWS Redshift?
– Scalable PostgreSQL-based engine for up to 1.6PB of data
– Roughly 600 USD per TB per month
– Mainly for offline batch processing
– Supports bulk update (through AWS S3)
– Two types of node options: Compute vs. Storage
Kinds of Data Stored in Redshift
• 800+ tables with 2.4TB of data
• Key tables from MySQL
• Email Marketing Data
• Ads Campaign Performance Data
• SEO data from Google
• Data from Web access log
• Support Ticket Data
• A/B Test Data (Mobile, Web)
• Human-curated data from Google Spreadsheets
Details on ETL Pipelines
• All data pipelines are scheduled through
Pinball
– Every 5 minutes, hourly, daily, weekly and monthly
• Most pipelines are purely in Python
• Some use Hadoop/Hive and Hadoop/Pig for
Batch Processing
• Starting to use Kinesis for Realtime Processing
Pinball Screenshot
Batch Processing Infrastructure
• We use Hadoop 2.6 with Hive and Pig
– CDH 5.4 (community version)
• We use our own Hadoop cluster and AWS EMR
(Elastic MapReduce) at the same time
– These are used to do ETL on massive data
– These are also used to run massive model/scoring
pipelines from the Data Science team
• Plan to evaluate Spark as a potential
alternative
Realtime Processing
• Applications
– The first application is to process the web access log
– Eventually we plan to use this to generate
personalized recommendations on the fly
• Plan to use AWS Kinesis
– Evaluated Apache Kafka and AWS Kinesis
• They are very similar, but Kafka is open source while
Kinesis is a managed service from AWS
• Decided to use AWS Kinesis
• Plan to evaluate Realtime Consumers
– Samza, Storm, Spark Streaming
What is Kinesis (Kafka)?
• Realtime data processing service in AWS
– Publisher-subscriber message broker
– Very similar to Kafka
• It has two components
– One is a message queue where the stream of data is stored
• 24-hour retention period
• Pay hourly by the read/write rate of the queue
– The other is KCL (Kinesis Client Library)
• Use this to build Data Consumer applications (Data
Producers typically write through the AWS SDK or the Kinesis Producer Library)
• This can be combined with Storm, Spark Streaming, …
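The message-queue half of this design can be illustrated with a toy in-memory stream. This is not the Kinesis or Kafka API; `ToyStream` and its methods are hypothetical and only show two ideas from the slide: records are kept for a retention period, and each consumer reads from its own position.

```python
from collections import deque

RETENTION_SECONDS = 24 * 60 * 60  # 24-hour retention, as on the slide

class ToyStream:
    """Hypothetical in-memory stand-in for a message queue; not the Kinesis API."""

    def __init__(self):
        self.records = deque()  # entries are (timestamp, sequence_number, data)
        self.next_seq = 0

    def put(self, data, now):
        self.records.append((now, self.next_seq, data))
        self.next_seq += 1

    def expire(self, now):
        # Drop records older than the retention period.
        while self.records and now - self.records[0][0] > RETENTION_SECONDS:
            self.records.popleft()

    def read_after(self, seq, now):
        """Return (sequence_number, data) for records newer than `seq`."""
        self.expire(now)
        return [(s, d) for _, s, d in self.records if s > seq]

stream = ToyStream()
stream.put("GET /course/1", now=0)
stream.put("GET /course/2", now=3600)

# Two independent consumers each track their own position in the stream.
consumer_a = stream.read_after(-1, now=7200)
consumer_b = stream.read_after(0, now=7200)

# A day later the first record has aged out of the retention window.
late_reader = stream.read_after(-1, now=RETENTION_SECONDS + 1800)
print(consumer_a, consumer_b, late_reader)
```

Because the queue keeps records rather than deleting them on read, a consumer that crashes can resume from its last sequence number as long as it restarts within the retention window.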
Data Serving Layer
• Redshift isn’t a good fit for reading out data in
real time, so you need something else
• We are using (or plan to use) the following:
– Cassandra
– Redis
– ElasticSearch
– MySQL
How It Looks
[Diagram components: Udemy; Log Files (Nginx); MySQL; Key/Value (Cassandra); External Data Sources; ETL (shown twice); Data Warehouse (Redshift); Data Science Pipeline (shown twice)]
LESSONS LEARNED
• As a small start-up, survive first and then work
on data
• The starting point is to store all data in a single
location (data warehouse)
• Start with batch processing and then move to realtime
• Consider the type of data you store
– Log vs. Key/Value vs. Transactional Record
• Store data in the form of a log (change history)
– So that you can always go back and debug/replay
• Cloud is good unless you have really massive
data
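The "store data as a log" lesson can be sketched as an append-only list of changes plus a replay function. The event fields and the `replay` helper are hypothetical; the point is only that past states stay recoverable when changes are appended rather than overwritten.

```python
# Hypothetical change log: every change is appended, nothing is overwritten.
change_log = [
    {"ts": 1, "course_id": 42, "field": "price", "value": 20},
    {"ts": 2, "course_id": 42, "field": "title", "value": "Intro to SQL"},
    {"ts": 3, "course_id": 42, "field": "price", "value": 15},
]

def replay(log, up_to_ts):
    """Rebuild the state of each course as of `up_to_ts` by replaying the log."""
    state = {}
    for event in sorted(log, key=lambda e: e["ts"]):
        if event["ts"] > up_to_ts:
            break
        state.setdefault(event["course_id"], {})[event["field"]] = event["value"]
    return state

current = replay(change_log, up_to_ts=3)
as_of_ts2 = replay(change_log, up_to_ts=2)
print(current, as_of_ts2)
```

Had only the latest price been stored, the earlier value of 20 could not be recovered for debugging or for recomputing a past report.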
Q & A
Udemy is Hiring
Ankara Data Meetup - Bilkent Cyberpark,
August 5, 2015

More Related Content

What's hot

Big data bi-mature-oanyc summit
Big data bi-mature-oanyc summitBig data bi-mature-oanyc summit
Big data bi-mature-oanyc summit
Open Analytics
 
Big data-science-oanyc
Big data-science-oanycBig data-science-oanyc
Big data-science-oanyc
Open Analytics
 
Model Monitoring at Scale with Apache Spark and Verta
Model Monitoring at Scale with Apache Spark and VertaModel Monitoring at Scale with Apache Spark and Verta
Model Monitoring at Scale with Apache Spark and Verta
Databricks
 
H2O World - Data Science w/ Big Data in a Corporate Environment - Nachum Shacham
H2O World - Data Science w/ Big Data in a Corporate Environment - Nachum ShachamH2O World - Data Science w/ Big Data in a Corporate Environment - Nachum Shacham
H2O World - Data Science w/ Big Data in a Corporate Environment - Nachum Shacham
Sri Ambati
 
Machine Learning with PyCaret
Machine Learning with PyCaretMachine Learning with PyCaret
Machine Learning with PyCaret
Databricks
 
Taking Jupyter Notebooks and Apache Spark to the Next Level PixieDust with Da...
Taking Jupyter Notebooks and Apache Spark to the Next Level PixieDust with Da...Taking Jupyter Notebooks and Apache Spark to the Next Level PixieDust with Da...
Taking Jupyter Notebooks and Apache Spark to the Next Level PixieDust with Da...
Databricks
 
Data Science Salon: Kaggle 1st Place in 30 minutes: Putting AutoML to Work wi...
Data Science Salon: Kaggle 1st Place in 30 minutes: Putting AutoML to Work wi...Data Science Salon: Kaggle 1st Place in 30 minutes: Putting AutoML to Work wi...
Data Science Salon: Kaggle 1st Place in 30 minutes: Putting AutoML to Work wi...
Formulatedby
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
Databricks
 
Graph-Powered Machine Learning
Graph-Powered Machine Learning Graph-Powered Machine Learning
Graph-Powered Machine Learning
GraphAware
 
DataStax & O'Reilly Media: Large Scale Data Analytics with Spark and Cassandr...
DataStax & O'Reilly Media: Large Scale Data Analytics with Spark and Cassandr...DataStax & O'Reilly Media: Large Scale Data Analytics with Spark and Cassandr...
DataStax & O'Reilly Media: Large Scale Data Analytics with Spark and Cassandr...
DataStax Academy
 
Rakuten - Recommendation Platform
Rakuten - Recommendation PlatformRakuten - Recommendation Platform
Rakuten - Recommendation Platform
Karthik Murugesan
 
2016 Tableau in the Cloud - A Netflix Original (AWS Re:invent)
2016 Tableau in the Cloud - A Netflix Original (AWS Re:invent)2016 Tableau in the Cloud - A Netflix Original (AWS Re:invent)
2016 Tableau in the Cloud - A Netflix Original (AWS Re:invent)
Albert Wong
 
Metrics for Web Applications - Netcamp 2012
Metrics for Web Applications - Netcamp 2012Metrics for Web Applications - Netcamp 2012
Metrics for Web Applications - Netcamp 2012
Andrei Savu
 
Rakshit (Rocky) Bhatt Resume - 2022
Rakshit (Rocky) Bhatt Resume - 2022Rakshit (Rocky) Bhatt Resume - 2022
Rakshit (Rocky) Bhatt Resume - 2022
Rakshit (Rocky) Bhatt
 
H2O World - Solving Customer Churn with Machine Learning - Julian Bharadwaj
H2O World - Solving Customer Churn with Machine Learning - Julian BharadwajH2O World - Solving Customer Churn with Machine Learning - Julian Bharadwaj
H2O World - Solving Customer Churn with Machine Learning - Julian Bharadwaj
Sri Ambati
 
Data Con LA 2019 - Big Data Modeling with Spark SQL: Make data valuable by Ja...
Data Con LA 2019 - Big Data Modeling with Spark SQL: Make data valuable by Ja...Data Con LA 2019 - Big Data Modeling with Spark SQL: Make data valuable by Ja...
Data Con LA 2019 - Big Data Modeling with Spark SQL: Make data valuable by Ja...
Data Con LA
 
Pm.ais ummit 180917 final
Pm.ais ummit 180917 finalPm.ais ummit 180917 final
Pm.ais ummit 180917 final
Nisha Talagala
 
Bootstrapping of PySpark Models for Factorial A/B Tests
Bootstrapping of PySpark Models for Factorial A/B TestsBootstrapping of PySpark Models for Factorial A/B Tests
Bootstrapping of PySpark Models for Factorial A/B Tests
Databricks
 
The European Conference on Software Architecture (ECSA) 14 - IBM BigData Refe...
The European Conference on Software Architecture (ECSA) 14 - IBM BigData Refe...The European Conference on Software Architecture (ECSA) 14 - IBM BigData Refe...
The European Conference on Software Architecture (ECSA) 14 - IBM BigData Refe...
Romeo Kienzler
 
Distributed Time Travel for Feature Generation at Netflix
Distributed Time Travel for Feature Generation at NetflixDistributed Time Travel for Feature Generation at Netflix
Distributed Time Travel for Feature Generation at Netflix
sfbiganalytics
 

What's hot (20)

Big data bi-mature-oanyc summit
Big data bi-mature-oanyc summitBig data bi-mature-oanyc summit
Big data bi-mature-oanyc summit
 
Big data-science-oanyc
Big data-science-oanycBig data-science-oanyc
Big data-science-oanyc
 
Model Monitoring at Scale with Apache Spark and Verta
Model Monitoring at Scale with Apache Spark and VertaModel Monitoring at Scale with Apache Spark and Verta
Model Monitoring at Scale with Apache Spark and Verta
 
H2O World - Data Science w/ Big Data in a Corporate Environment - Nachum Shacham
H2O World - Data Science w/ Big Data in a Corporate Environment - Nachum ShachamH2O World - Data Science w/ Big Data in a Corporate Environment - Nachum Shacham
H2O World - Data Science w/ Big Data in a Corporate Environment - Nachum Shacham
 
Machine Learning with PyCaret
Machine Learning with PyCaretMachine Learning with PyCaret
Machine Learning with PyCaret
 
Taking Jupyter Notebooks and Apache Spark to the Next Level PixieDust with Da...
Taking Jupyter Notebooks and Apache Spark to the Next Level PixieDust with Da...Taking Jupyter Notebooks and Apache Spark to the Next Level PixieDust with Da...
Taking Jupyter Notebooks and Apache Spark to the Next Level PixieDust with Da...
 
Data Science Salon: Kaggle 1st Place in 30 minutes: Putting AutoML to Work wi...
Data Science Salon: Kaggle 1st Place in 30 minutes: Putting AutoML to Work wi...Data Science Salon: Kaggle 1st Place in 30 minutes: Putting AutoML to Work wi...
Data Science Salon: Kaggle 1st Place in 30 minutes: Putting AutoML to Work wi...
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Graph-Powered Machine Learning
Graph-Powered Machine Learning Graph-Powered Machine Learning
Graph-Powered Machine Learning
 
DataStax & O'Reilly Media: Large Scale Data Analytics with Spark and Cassandr...
DataStax & O'Reilly Media: Large Scale Data Analytics with Spark and Cassandr...DataStax & O'Reilly Media: Large Scale Data Analytics with Spark and Cassandr...
DataStax & O'Reilly Media: Large Scale Data Analytics with Spark and Cassandr...
 
Rakuten - Recommendation Platform
Rakuten - Recommendation PlatformRakuten - Recommendation Platform
Rakuten - Recommendation Platform
 
2016 Tableau in the Cloud - A Netflix Original (AWS Re:invent)
2016 Tableau in the Cloud - A Netflix Original (AWS Re:invent)2016 Tableau in the Cloud - A Netflix Original (AWS Re:invent)
2016 Tableau in the Cloud - A Netflix Original (AWS Re:invent)
 
Metrics for Web Applications - Netcamp 2012
Metrics for Web Applications - Netcamp 2012Metrics for Web Applications - Netcamp 2012
Metrics for Web Applications - Netcamp 2012
 
Rakshit (Rocky) Bhatt Resume - 2022
Rakshit (Rocky) Bhatt Resume - 2022Rakshit (Rocky) Bhatt Resume - 2022
Rakshit (Rocky) Bhatt Resume - 2022
 
H2O World - Solving Customer Churn with Machine Learning - Julian Bharadwaj
H2O World - Solving Customer Churn with Machine Learning - Julian BharadwajH2O World - Solving Customer Churn with Machine Learning - Julian Bharadwaj
H2O World - Solving Customer Churn with Machine Learning - Julian Bharadwaj
 
Data Con LA 2019 - Big Data Modeling with Spark SQL: Make data valuable by Ja...
Data Con LA 2019 - Big Data Modeling with Spark SQL: Make data valuable by Ja...Data Con LA 2019 - Big Data Modeling with Spark SQL: Make data valuable by Ja...
Data Con LA 2019 - Big Data Modeling with Spark SQL: Make data valuable by Ja...
 
Pm.ais ummit 180917 final
Pm.ais ummit 180917 finalPm.ais ummit 180917 final
Pm.ais ummit 180917 final
 
Bootstrapping of PySpark Models for Factorial A/B Tests
Bootstrapping of PySpark Models for Factorial A/B TestsBootstrapping of PySpark Models for Factorial A/B Tests
Bootstrapping of PySpark Models for Factorial A/B Tests
 
The European Conference on Software Architecture (ECSA) 14 - IBM BigData Refe...
The European Conference on Software Architecture (ECSA) 14 - IBM BigData Refe...The European Conference on Software Architecture (ECSA) 14 - IBM BigData Refe...
The European Conference on Software Architecture (ECSA) 14 - IBM BigData Refe...
 
Distributed Time Travel for Feature Generation at Netflix
Distributed Time Travel for Feature Generation at NetflixDistributed Time Travel for Feature Generation at Netflix
Distributed Time Travel for Feature Generation at Netflix
 

Similar to Data Engineering at Udemy

Big data.ppt
Big data.pptBig data.ppt
Big data.ppt
IdontKnow66967
 
Lecture1
Lecture1Lecture1
Lecture1
Manish Singh
 
End-to-End Data Pipelines with Apache Spark
End-to-End Data Pipelines with Apache SparkEnd-to-End Data Pipelines with Apache Spark
End-to-End Data Pipelines with Apache Spark
Burak Yavuz
 
Realtime Data Analytics
Realtime Data AnalyticsRealtime Data Analytics
Realtime Data Analytics
Bo Yang
 
LeedsSharp May 2023 - Azure Integration Services
LeedsSharp May 2023 - Azure Integration ServicesLeedsSharp May 2023 - Azure Integration Services
LeedsSharp May 2023 - Azure Integration Services
Michael Stephenson
 
Data-Driven Development Era and Its Technologies
Data-Driven Development Era and Its TechnologiesData-Driven Development Era and Its Technologies
Data-Driven Development Era and Its Technologies
SATOSHI TAGOMORI
 
AzureSynapse.pptx
AzureSynapse.pptxAzureSynapse.pptx
AzureSynapse.pptx
Udaiappa Ramachandran
 
MeasureCamp 7 Bigger Faster Data by Andrew Hood and Cameron Gray from Lynchpin
MeasureCamp 7   Bigger Faster Data by Andrew Hood and Cameron Gray from LynchpinMeasureCamp 7   Bigger Faster Data by Andrew Hood and Cameron Gray from Lynchpin
MeasureCamp 7 Bigger Faster Data by Andrew Hood and Cameron Gray from Lynchpin
Lynchpin Analytics Consultancy
 
Building Scalable Big Data Infrastructure Using Open Source Software Presenta...
Building Scalable Big Data Infrastructure Using Open Source Software Presenta...Building Scalable Big Data Infrastructure Using Open Source Software Presenta...
Building Scalable Big Data Infrastructure Using Open Source Software Presenta...
ssuserd3a367
 
EPUG UKI - Lancaster Analytics
EPUG UKI - Lancaster AnalyticsEPUG UKI - Lancaster Analytics
EPUG UKI - Lancaster Analytics
jhkrug
 
Cloud-native Semantic Layer on Data Lake
Cloud-native Semantic Layer on Data LakeCloud-native Semantic Layer on Data Lake
Cloud-native Semantic Layer on Data Lake
Databricks
 
BDA308 Serverless Analytics with Amazon Athena and Amazon QuickSight, featuri...
BDA308 Serverless Analytics with Amazon Athena and Amazon QuickSight, featuri...BDA308 Serverless Analytics with Amazon Athena and Amazon QuickSight, featuri...
BDA308 Serverless Analytics with Amazon Athena and Amazon QuickSight, featuri...
Amazon Web Services
 
Adf and ala design c sharp corner toronto chapter feb 2019 meetup nik shahriar
Adf and ala design c sharp corner toronto chapter feb 2019 meetup nik shahriarAdf and ala design c sharp corner toronto chapter feb 2019 meetup nik shahriar
Adf and ala design c sharp corner toronto chapter feb 2019 meetup nik shahriar
Nilesh Shah
 
Text Analytics & Linked Data Management As-a-Service
Text Analytics & Linked Data Management As-a-ServiceText Analytics & Linked Data Management As-a-Service
Text Analytics & Linked Data Management As-a-Service
Marin Dimitrov
 
Data warehousing in the era of Big Data: Deep Dive into Amazon Redshift
Data warehousing in the era of Big Data: Deep Dive into Amazon RedshiftData warehousing in the era of Big Data: Deep Dive into Amazon Redshift
Data warehousing in the era of Big Data: Deep Dive into Amazon Redshift
Amazon Web Services
 
advance computing and big adata analytic.pptx
advance computing and big adata analytic.pptxadvance computing and big adata analytic.pptx
advance computing and big adata analytic.pptx
TeddyIswahyudi1
 
Data Ingestion Engine
Data Ingestion EngineData Ingestion Engine
Data Ingestion Engine
Adam Doyle
 
Holistic Approach To Monitoring
Holistic Approach To MonitoringHolistic Approach To Monitoring
Holistic Approach To Monitoring
Melanie Cey
 
Jason Huang, Solutions Engineer, Qubole at MLconf ATL - 9/18/15
Jason Huang, Solutions Engineer, Qubole at MLconf ATL - 9/18/15Jason Huang, Solutions Engineer, Qubole at MLconf ATL - 9/18/15
Jason Huang, Solutions Engineer, Qubole at MLconf ATL - 9/18/15
MLconf
 
Atlanta MLConf
Atlanta MLConfAtlanta MLConf
Atlanta MLConf
Qubole
 

Similar to Data Engineering at Udemy (20)

Big data.ppt
Big data.pptBig data.ppt
Big data.ppt
 
Lecture1
Lecture1Lecture1
Lecture1
 
End-to-End Data Pipelines with Apache Spark
End-to-End Data Pipelines with Apache SparkEnd-to-End Data Pipelines with Apache Spark
End-to-End Data Pipelines with Apache Spark
 
Realtime Data Analytics
Realtime Data AnalyticsRealtime Data Analytics
Realtime Data Analytics
 
LeedsSharp May 2023 - Azure Integration Services
LeedsSharp May 2023 - Azure Integration ServicesLeedsSharp May 2023 - Azure Integration Services
LeedsSharp May 2023 - Azure Integration Services
 
Data-Driven Development Era and Its Technologies
Data-Driven Development Era and Its TechnologiesData-Driven Development Era and Its Technologies
Data-Driven Development Era and Its Technologies
 
AzureSynapse.pptx
AzureSynapse.pptxAzureSynapse.pptx
AzureSynapse.pptx
 
MeasureCamp 7 Bigger Faster Data by Andrew Hood and Cameron Gray from Lynchpin
MeasureCamp 7   Bigger Faster Data by Andrew Hood and Cameron Gray from LynchpinMeasureCamp 7   Bigger Faster Data by Andrew Hood and Cameron Gray from Lynchpin
MeasureCamp 7 Bigger Faster Data by Andrew Hood and Cameron Gray from Lynchpin
 
Building Scalable Big Data Infrastructure Using Open Source Software Presenta...
Building Scalable Big Data Infrastructure Using Open Source Software Presenta...Building Scalable Big Data Infrastructure Using Open Source Software Presenta...
Building Scalable Big Data Infrastructure Using Open Source Software Presenta...
 
EPUG UKI - Lancaster Analytics
EPUG UKI - Lancaster AnalyticsEPUG UKI - Lancaster Analytics
EPUG UKI - Lancaster Analytics
 
Cloud-native Semantic Layer on Data Lake
Cloud-native Semantic Layer on Data LakeCloud-native Semantic Layer on Data Lake
Cloud-native Semantic Layer on Data Lake
 
BDA308 Serverless Analytics with Amazon Athena and Amazon QuickSight, featuri...
BDA308 Serverless Analytics with Amazon Athena and Amazon QuickSight, featuri...BDA308 Serverless Analytics with Amazon Athena and Amazon QuickSight, featuri...
BDA308 Serverless Analytics with Amazon Athena and Amazon QuickSight, featuri...
 
Adf and ala design c sharp corner toronto chapter feb 2019 meetup nik shahriar
Adf and ala design c sharp corner toronto chapter feb 2019 meetup nik shahriarAdf and ala design c sharp corner toronto chapter feb 2019 meetup nik shahriar
Adf and ala design c sharp corner toronto chapter feb 2019 meetup nik shahriar
 
Text Analytics & Linked Data Management As-a-Service
Text Analytics & Linked Data Management As-a-ServiceText Analytics & Linked Data Management As-a-Service
Text Analytics & Linked Data Management As-a-Service
 
Data warehousing in the era of Big Data: Deep Dive into Amazon Redshift
Data warehousing in the era of Big Data: Deep Dive into Amazon RedshiftData warehousing in the era of Big Data: Deep Dive into Amazon Redshift
Data warehousing in the era of Big Data: Deep Dive into Amazon Redshift
 
advance computing and big adata analytic.pptx
advance computing and big adata analytic.pptxadvance computing and big adata analytic.pptx
advance computing and big adata analytic.pptx
 
Data Ingestion Engine
Data Ingestion EngineData Ingestion Engine
Data Ingestion Engine
 
Holistic Approach To Monitoring
Holistic Approach To MonitoringHolistic Approach To Monitoring
Holistic Approach To Monitoring
 
Jason Huang, Solutions Engineer, Qubole at MLconf ATL - 9/18/15
Jason Huang, Solutions Engineer, Qubole at MLconf ATL - 9/18/15Jason Huang, Solutions Engineer, Qubole at MLconf ATL - 9/18/15
Jason Huang, Solutions Engineer, Qubole at MLconf ATL - 9/18/15
 
Atlanta MLConf
Atlanta MLConfAtlanta MLConf
Atlanta MLConf
 

Recently uploaded

原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样
原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样
原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样
u86oixdj
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
Timothy Spann
 
The Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series DatabaseThe Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series Database
javier ramirez
 
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
g4dpvqap0
 
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
nuttdpt
 
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
dwreak4tg
 
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
slg6lamcq
 
University of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma TranscriptUniversity of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma Transcript
soxrziqu
 
Analysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performanceAnalysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performance
roli9797
 
State of Artificial intelligence Report 2023
State of Artificial intelligence Report 2023State of Artificial intelligence Report 2023
State of Artificial intelligence Report 2023
kuntobimo2016
 
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
v3tuleee
 
The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...
jerlynmaetalle
 
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
mbawufebxi
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
Timothy Spann
 
Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
TravisMalana
 
Unleashing the Power of Data_ Choosing a Trusted Analytics Platform.pdf
Unleashing the Power of Data_ Choosing a Trusted Analytics Platform.pdfUnleashing the Power of Data_ Choosing a Trusted Analytics Platform.pdf
Unleashing the Power of Data_ Choosing a Trusted Analytics Platform.pdf
Enterprise Wired
 
Nanandann Nilekani's ppt On India's .pdf
Nanandann Nilekani's ppt On India's .pdfNanandann Nilekani's ppt On India's .pdf
Nanandann Nilekani's ppt On India's .pdf
eddie19851
 
Influence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business PlanInfluence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business Plan
jerlynmaetalle
 
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
u86oixdj
 
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
oz8q3jxlp
 

Recently uploaded (20)

原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样
原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样
原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...

Data Engineering at Udemy

  • 1. Data Engineering at Udemy. Keeyong Han, Principal Data Architect @Udemy. Ankara Data Meetup - Bilkent Cyberpark, August 5, 2015
  • 2. About Me • 20+ years of experience at 9 different companies • Currently manages the Data team at Udemy • Prior to joining Udemy – Manager of the data/search team at Polyvore – Director of Engineering at Yahoo Search – Started career at Samsung Electronics in Korea
  • 3. Agenda • Typical Evolution of Data Processing • Data Engineering at Udemy • Lessons Learned
  • 4. TYPICAL EVOLUTION OF DATA PROCESSING: From a small start-up
  • 5. In the beginning • You don’t have any data • So there is no data infrastructure or data science – The most important thing is to survive and keep iterating
  • 6. After a struggle you have some data • You have survived and now have some data to work with – Data analysts are hired – They want to analyze the data
  • 7. Then … • You don’t know exactly where the data is • You find your data but – It is not clean and is missing key information – Data is likely not in the format you want • You store it in non-optimal storage – MySQL is likely used to store all kinds of data • But MySQL doesn’t scale – You ask analysts to query MySQL • They will kill the web site a few times
  • 8. (No text on this slide)
  • 9. Now what to do? (I) • You have to find scalable, separate storage for data analysis – This is called a Data Warehouse or Data Analytics – This will be the central storage for your important data – Udemy uses AWS Redshift • Migrate some data out of MySQL – Key/Value data to a NoSQL solution (Cassandra/HBase, MongoDB, …) – Log-type data (use the Nginx log, for example) – MySQL should hold only the key data that the Web service needs
  • 10. Now what to do? (II) • The goal is to put all data into a single storage – This is the most important and very first step toward becoming a “true” data organization – This storage should be separate from runtime storage (MySQL, for example) – This storage should be scalable – Being consistent is more important than being correct in the beginning
  • 11. Now You Add More Data • Different Ways of Collecting Data – This is called ETL (Extract, Transform and Load) – Different Aspects to Consider • Size: 1KB to 20GB • Frequency: Hourly, Daily, Weekly, Monthly • How to collect data: – FTP, API, Webhook, S3, HTTP, mysql command line • You will have multiple data collection workflows – Use cron jobs (or some scheduler) to manage them – Udemy uses Pinball (open source from Pinterest)
  • 12. How It Will Look [Diagram labels: Your Cool Web Service; Log Files; MySQL; Key/Value; External Data Sources; ETL; Data Warehouse]
  • 13. Simple Data Import • Just use a scripting language – Many data sources are small and simple enough to handle with a scripting language • Udemy uses Python for this purpose – Implemented a set of Python classes to handle different types of data import – Plan to open source this in the 1st half of 2016
  • 14. Large Data Batch Import • Large data import and processing will require a more scalable solution • Hadoop can be used for this purpose – SQL on Hadoop: Hive, Tajo, Presto and so on – Pig, Java MapReduce • Spark is getting a lot of attention and we plan to evaluate it
  • 15. Realtime Data Import • Some data is better imported as it happens • This requires a different type of technology – Realtime Data Message Queue: Kafka, Kinesis – Realtime Data Consumer: Storm, Samza, Spark Streaming
  • 16. What’s Next? (I) • Build Summary Tables – Having raw data tables is good, but they can be too detailed and contain too much information – Build these tables in your Data Warehouse • Track the performance of key metrics – This should come from the summary tables above – You need a dashboard tool (build one or use a 3rd-party solution: Birst, Chartio, Tableau and so on)
  • 17. What’s Next? (II) • Provide this data to the Data Science team – Draw insights and create a feedback loop – Build machine-learned models for recommendation, search ranking and so on – The topic for the next session (Thanks Larry!) • Supporting Data Science from Infrastructure – This will require scalable infrastructure – Example: Scoring every user/course pair in Udemy • 7M users × 30K courses = 210B pairs to compute – You need a scalable Serving Layer (Cassandra, HBase, …)
  • 18. DATA ENGINEERING AT UDEMY
  • 19. Data Warehouse/Analytics • We use AWS Redshift as our data warehouse (or data analytics backend) • What is AWS Redshift? – Scalable PostgreSQL engine for up to 1.6PB of data – Roughly 600 USD per TB per month – Mainly for offline batch processing – Supports bulk update (through AWS S3) – Two types of options: Compute vs. Storage
  • 20. Kinds of Data Stored in Redshift • 800+ tables with 2.4TB of data • Key tables from MySQL • Email Marketing Data • Ads Campaign Performance Data • SEO data from Google • Data from the Web access log • Support Ticket Data • A/B Test Data (Mobile, Web) • Human-curated data from Google Spreadsheets
  • 21. Details on ETL Pipelines • All data pipelines are scheduled through Pinball – Every 5 minutes, hourly, daily, weekly and monthly • Most pipelines are purely in Python • Some use Hadoop/Hive and Hadoop/Pig for Batch Processing • Starting to use Kinesis for Realtime Processing
  • 22. Pinball Screenshot
  • 23. Batch Processing Infrastructure • We use Hadoop 2.6 with Hive and Pig – CDH 5.4 (community version) • We use our own Hadoop cluster and AWS EMR (Elastic MapReduce) at the same time – This is used to do ETL on massive data – This is also used to run massive model/scoring pipelines from the Data Science team • Plan to evaluate Spark as a potential alternative
  • 24. Realtime Processing • Applications – The first application is to process the web access log – Eventually we plan to use this to generate personalized recommendations on the fly • Plan to use AWS Kinesis – Evaluated Apache Kafka and AWS Kinesis • They are very similar, but Kafka is open source while Kinesis is a managed service from AWS • Decided to use AWS Kinesis • Plan to evaluate Realtime Consumers – Samza, Storm, Spark Streaming
  • 25. What is Kinesis (Kafka)? • Realtime data processing service in AWS – Publisher-Subscriber message broker – Very similar to Kafka • It has two components – One is the message queue where the stream of data is stored • 24-hour retention period • Pay hourly by the read/write rate of the queue – The other is KCL (Kinesis Client Library) • Use this to build a Data Producer application or a Data Consumer application • This can be combined with Storm, Spark Streaming, …
  • 26. Data Serving Layer • Redshift isn’t a good fit for reading data out in realtime, so you need something else • We are using (or plan to use) the following: – Cassandra – Redis – Elasticsearch – MySQL
  • 27. How It Looks [Diagram labels: Udemy; Log Files (Nginx); MySQL; Key/Value (Cassandra); External Data Sources; ETL; Data Warehouse (Redshift); Data Science Pipeline]
  • 28. LESSONS LEARNED
  • 29. • As a small start-up, survive first and then work on data • The starting point is to store all data in a single location (data warehouse) • Start with batch processing and then move to realtime • Consider the type of data you store – Log vs. Key/Value vs. Transactional Record • Store data in the form of a log (change history) – So that you can always go back and debug/replay • Cloud is good unless you have really massive data
  • 30. Q & A. Udemy is hiring.
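
The ETL and summary-table steps from slides 11, 13 and 16 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Udemy's actual import classes: the CSV sample, the `enrollments` and `daily_enrollments` table names, and the function names are all hypothetical, and an in-memory SQLite database stands in for the data warehouse (Redshift in the talk).

```python
import csv
import io
import sqlite3

# Hypothetical raw export; in practice this might arrive via FTP, an API, or S3.
RAW_CSV = """user_id,course_id,enrolled_at
1,101,2015-08-01T09:15:00
2,101,2015-08-01T17:40:00
1,102,2015-08-02T08:05:00
,103,2015-08-02T10:00:00
"""


def extract(raw_text):
    """Extract: parse the raw CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(raw_text)))


def transform(rows):
    """Transform: drop rows missing a user_id and keep only the date part."""
    cleaned = []
    for row in rows:
        if not row["user_id"]:
            continue  # unclean record: key information is missing
        cleaned.append(
            (int(row["user_id"]), int(row["course_id"]), row["enrolled_at"][:10])
        )
    return cleaned


def load(conn, records):
    """Load: append the cleaned records to the raw table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS enrollments "
        "(user_id INTEGER, course_id INTEGER, enrolled_on TEXT)"
    )
    conn.executemany("INSERT INTO enrollments VALUES (?, ?, ?)", records)


def build_daily_summary(conn):
    """Rebuild a small summary table that a dashboard could query directly."""
    conn.execute("DROP TABLE IF EXISTS daily_enrollments")
    conn.execute(
        "CREATE TABLE daily_enrollments AS "
        "SELECT enrolled_on, COUNT(*) AS enrollments "
        "FROM enrollments GROUP BY enrolled_on"
    )


conn = sqlite3.connect(":memory:")  # assumption: stand-in for the warehouse
load(conn, transform(extract(RAW_CSV)))
build_daily_summary(conn)
summary = conn.execute(
    "SELECT enrolled_on, enrollments FROM daily_enrollments ORDER BY enrolled_on"
).fetchall()
print(summary)
```

A scheduler (cron, or Pinball as in the talk) would run a script like this on each pipeline's interval; the summary table is rebuilt from the raw table so that dashboards never have to scan the detailed rows.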

Editor's Notes

  1. Logging format. Don’t try to take a snapshot and then aggregate.
  2. Add diagram. MySQL was likely used to store all data and was queried by data analysts.
  3. Add diagram. MySQL was likely used to store all data and was queried by data analysts. What happens when you don’t have this: everyone does their own analysis and derives their own conclusions, wasting resources on many one-off efforts.
  4. Realtime recommendation
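
The first note, together with the "store data in the form of log (change history)" lesson on slide 29, can be illustrated with a small replay sketch. The event fields (`ts`, `course_id`, `field`, `value`) and the `replay` function are hypothetical and chosen only for this example; the sketch also assumes the log is already ordered by timestamp.

```python
# Hypothetical change log: each entry records one change, oldest first.
CHANGE_LOG = [
    {"ts": "2015-08-01", "course_id": 101, "field": "price", "value": 20},
    {"ts": "2015-08-02", "course_id": 102, "field": "price", "value": 35},
    {"ts": "2015-08-03", "course_id": 101, "field": "price", "value": 15},
]


def replay(log, up_to=None):
    """Rebuild per-course state by applying changes in order.

    If up_to is given (an ISO date string), stop before the first later
    change, which reconstructs the state as of that date.
    """
    state = {}
    for event in log:
        if up_to is not None and event["ts"] > up_to:
            break  # the log is assumed sorted, so all later entries are newer
        state.setdefault(event["course_id"], {})[event["field"]] = event["value"]
    return state


current = replay(CHANGE_LOG)
as_of_aug_2 = replay(CHANGE_LOG, up_to="2015-08-02")
print(current)
print(as_of_aug_2)
```

A snapshot would keep only the latest price of course 101; with the log, the earlier value is still recoverable, which is what makes debugging and replaying possible.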