SlideShare a Scribd company logo
themidgame-tubea data platform for finding YouTube influencers on brand names
Pedro Moy
Data Sources / Scraping
2
19M videos
500 GB
18K brands
Data Pipeline
3
YARN
AWS Elastic MapReduce
Thrift API
MapReduce MapReduce
Design Choices
1. AWS Elastic MapReduce (EMR) vs. Mesosphere (with Docker)
 zero deployment efforts
 EMR with YARN, dynamically scale your cluster
2. batch-only design
 lower-maintenance (no Kafka, no Storm/Spark Streaming)
3. S3+HDFS together
 cheap unlimited storage with HDFS batch performance
 safely shut down your cluster
4
System Scalability Test With Spot Instances
5
Slave
Nodes
Batch
Process
Time
Overall
Throughput
Cost
Rate
Total
Cost
5 48 mins 10.4 GB/min $2.1/h $1.7
10 32 mins 15.6 GB/min $3.8/h $2.0
15 21 mins 23.8 GB/min $5.5/h $1.9
Load: +500 GB / +19M videos
Cluster Setup: Master: m3.xlarge
Slave: m2.4xlarge (bided at $0.1/h from $1.0/h)
Scalability Challenges & Salted Tables
6
Data Column
Channel Name
Channel ID
Video Title
Transcript Word Set
Word Frequency
RowKey = SALT+Brand+MetricType+…
…NumMetric+VideoID
SALTING = predefining “table” (region) splits using the SALT
HBase Schema
RowKey0
RowKey1
RowLey2
…
Born in Brazil
Love mixing with different cultures!
Former Data Analytics Engineer
Boston MA
Studying machine learning (part-time)
OMSCS, Georgia Tech
MSc, Geophysics
King Abdullah University of Science and Tech.
BSc, Mechanical Engineering
University of Massachusetts Lowell
7
System Scalability Test With Spot Instances[2]
8
Core EC2
Instances
Spark
Executors
Spark
Jobs Time
Batch
Process Time
Overall
Throughput
Total Cost Savings
5 119 23 mins 48 mins (25) 10.4 GB/min $2.1/h ($1.7) 68%
10 239 12 mins 32 mins (20) 15.6 GB/min $3.8/h ($2.0) 70%
15 359 7 mins 21 mins (14) 23.8 GB/min $5.5/h ($1.9) 70%
Load: +500 GB / +19M videos
Cluster Setup: Master:
Core:
Spark Setup: 2 GB Memory/Executor
m3.xlarge 15 GB RAM 13 units 4 cores
m2.4xlarge 68 GB RAM 26 units 8 cores

More Related Content

What's hot

HaaS: HPCC Systems as a Service – BYOD to the Cloud Party
HaaS: HPCC Systems as a Service – BYOD to the Cloud PartyHaaS: HPCC Systems as a Service – BYOD to the Cloud Party
HaaS: HPCC Systems as a Service – BYOD to the Cloud Party
HPCC Systems
 
Building maps for apps in the cloud - a Softlayer Use Case
Building maps for  apps in the cloud - a Softlayer Use CaseBuilding maps for  apps in the cloud - a Softlayer Use Case
Building maps for apps in the cloud - a Softlayer Use Case
Timan Rebel
 
Deep Dive on Amazon EC2
Deep Dive on Amazon EC2Deep Dive on Amazon EC2
Deep Dive on Amazon EC2
Amazon Web Services
 
Enhance! Real-time webcam video super-resolution
Enhance! Real-time webcam video super-resolutionEnhance! Real-time webcam video super-resolution
Enhance! Real-time webcam video super-resolution
Edward Kim
 
Gx4
Gx4Gx4
06.09.2017 Computer Science, Machine Learning & Statistiks Meetup - MULTI-GPU...
06.09.2017 Computer Science, Machine Learning & Statistiks Meetup - MULTI-GPU...06.09.2017 Computer Science, Machine Learning & Statistiks Meetup - MULTI-GPU...
06.09.2017 Computer Science, Machine Learning & Statistiks Meetup - MULTI-GPU...
Zalando adtech lab
 
Amazon RDS for MySQL – Diagnostics, Security, and Data Migration (DAT302) | A...
Amazon RDS for MySQL – Diagnostics, Security, and Data Migration (DAT302) | A...Amazon RDS for MySQL – Diagnostics, Security, and Data Migration (DAT302) | A...
Amazon RDS for MySQL – Diagnostics, Security, and Data Migration (DAT302) | A...
Amazon Web Services
 
AWS RDS Benchmark - CMG Brasil 2012
AWS RDS Benchmark - CMG Brasil 2012AWS RDS Benchmark - CMG Brasil 2012
AWS RDS Benchmark - CMG Brasil 2012
Rodrigo Campos
 
Agx 2
Agx 2Agx 2
Applying Recursive Temporal Blocking for Stencil Computations to Deeper Memor...
Applying Recursive Temporal Blocking for Stencil Computations to Deeper Memor...Applying Recursive Temporal Blocking for Stencil Computations to Deeper Memor...
Applying Recursive Temporal Blocking for Stencil Computations to Deeper Memor...
Tokyo Institute of Technology
 
Bruno Silva - eMedLab: Merging HPC and Cloud for Biomedical Research
Bruno Silva - eMedLab: Merging HPC and Cloud for Biomedical ResearchBruno Silva - eMedLab: Merging HPC and Cloud for Biomedical Research
Bruno Silva - eMedLab: Merging HPC and Cloud for Biomedical Research
Danny Abukalam
 
(PFC302) Performance Benchmarking on AWS | AWS re:Invent 2014
(PFC302) Performance Benchmarking on AWS | AWS re:Invent 2014(PFC302) Performance Benchmarking on AWS | AWS re:Invent 2014
(PFC302) Performance Benchmarking on AWS | AWS re:Invent 2014
Amazon Web Services
 
Threading Successes 04 Hellgate
Threading Successes 04   HellgateThreading Successes 04   Hellgate
Threading Successes 04 Hellgateguest40fc7cd
 
Intel optimized tensorflow, distributed deep learning
Intel optimized tensorflow, distributed deep learningIntel optimized tensorflow, distributed deep learning
Intel optimized tensorflow, distributed deep learning
geetachauhan
 
20180926 kubeflow-meetup-1-kubeflow-operators-Preferred Networks-Shingo Omura
20180926 kubeflow-meetup-1-kubeflow-operators-Preferred Networks-Shingo Omura20180926 kubeflow-meetup-1-kubeflow-operators-Preferred Networks-Shingo Omura
20180926 kubeflow-meetup-1-kubeflow-operators-Preferred Networks-Shingo Omura
Preferred Networks
 
Cloud Friendly Hadoop and Hive
Cloud Friendly Hadoop and HiveCloud Friendly Hadoop and Hive
Cloud Friendly Hadoop and HiveDataWorks Summit
 
Whirr devdown
Whirr devdownWhirr devdown
Whirr devdown
Puppet
 
Cassandra Performance Benchmark
Cassandra Performance BenchmarkCassandra Performance Benchmark
Cassandra Performance Benchmark
Bigstep
 

What's hot (18)

HaaS: HPCC Systems as a Service – BYOD to the Cloud Party
HaaS: HPCC Systems as a Service – BYOD to the Cloud PartyHaaS: HPCC Systems as a Service – BYOD to the Cloud Party
HaaS: HPCC Systems as a Service – BYOD to the Cloud Party
 
Building maps for apps in the cloud - a Softlayer Use Case
Building maps for  apps in the cloud - a Softlayer Use CaseBuilding maps for  apps in the cloud - a Softlayer Use Case
Building maps for apps in the cloud - a Softlayer Use Case
 
Deep Dive on Amazon EC2
Deep Dive on Amazon EC2Deep Dive on Amazon EC2
Deep Dive on Amazon EC2
 
Enhance! Real-time webcam video super-resolution
Enhance! Real-time webcam video super-resolutionEnhance! Real-time webcam video super-resolution
Enhance! Real-time webcam video super-resolution
 
Gx4
Gx4Gx4
Gx4
 
06.09.2017 Computer Science, Machine Learning & Statistiks Meetup - MULTI-GPU...
06.09.2017 Computer Science, Machine Learning & Statistiks Meetup - MULTI-GPU...06.09.2017 Computer Science, Machine Learning & Statistiks Meetup - MULTI-GPU...
06.09.2017 Computer Science, Machine Learning & Statistiks Meetup - MULTI-GPU...
 
Amazon RDS for MySQL – Diagnostics, Security, and Data Migration (DAT302) | A...
Amazon RDS for MySQL – Diagnostics, Security, and Data Migration (DAT302) | A...Amazon RDS for MySQL – Diagnostics, Security, and Data Migration (DAT302) | A...
Amazon RDS for MySQL – Diagnostics, Security, and Data Migration (DAT302) | A...
 
AWS RDS Benchmark - CMG Brasil 2012
AWS RDS Benchmark - CMG Brasil 2012AWS RDS Benchmark - CMG Brasil 2012
AWS RDS Benchmark - CMG Brasil 2012
 
Agx 2
Agx 2Agx 2
Agx 2
 
Applying Recursive Temporal Blocking for Stencil Computations to Deeper Memor...
Applying Recursive Temporal Blocking for Stencil Computations to Deeper Memor...Applying Recursive Temporal Blocking for Stencil Computations to Deeper Memor...
Applying Recursive Temporal Blocking for Stencil Computations to Deeper Memor...
 
Bruno Silva - eMedLab: Merging HPC and Cloud for Biomedical Research
Bruno Silva - eMedLab: Merging HPC and Cloud for Biomedical ResearchBruno Silva - eMedLab: Merging HPC and Cloud for Biomedical Research
Bruno Silva - eMedLab: Merging HPC and Cloud for Biomedical Research
 
(PFC302) Performance Benchmarking on AWS | AWS re:Invent 2014
(PFC302) Performance Benchmarking on AWS | AWS re:Invent 2014(PFC302) Performance Benchmarking on AWS | AWS re:Invent 2014
(PFC302) Performance Benchmarking on AWS | AWS re:Invent 2014
 
Threading Successes 04 Hellgate
Threading Successes 04   HellgateThreading Successes 04   Hellgate
Threading Successes 04 Hellgate
 
Intel optimized tensorflow, distributed deep learning
Intel optimized tensorflow, distributed deep learningIntel optimized tensorflow, distributed deep learning
Intel optimized tensorflow, distributed deep learning
 
20180926 kubeflow-meetup-1-kubeflow-operators-Preferred Networks-Shingo Omura
20180926 kubeflow-meetup-1-kubeflow-operators-Preferred Networks-Shingo Omura20180926 kubeflow-meetup-1-kubeflow-operators-Preferred Networks-Shingo Omura
20180926 kubeflow-meetup-1-kubeflow-operators-Preferred Networks-Shingo Omura
 
Cloud Friendly Hadoop and Hive
Cloud Friendly Hadoop and HiveCloud Friendly Hadoop and Hive
Cloud Friendly Hadoop and Hive
 
Whirr devdown
Whirr devdownWhirr devdown
Whirr devdown
 
Cassandra Performance Benchmark
Cassandra Performance BenchmarkCassandra Performance Benchmark
Cassandra Performance Benchmark
 

Similar to themidgame-tube-slides

(SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Inven...
(SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Inven...(SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Inven...
(SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Inven...
Amazon Web Services
 
What's New in H2O Driverless AI? - Arno Candel - H2O AI World London 2018
What's New in H2O Driverless AI? - Arno Candel - H2O AI World London 2018What's New in H2O Driverless AI? - Arno Candel - H2O AI World London 2018
What's New in H2O Driverless AI? - Arno Candel - H2O AI World London 2018
Sri Ambati
 
Building an Amazon Datawarehouse and Using Business Intelligence Analytics Tools
Building an Amazon Datawarehouse and Using Business Intelligence Analytics ToolsBuilding an Amazon Datawarehouse and Using Business Intelligence Analytics Tools
Building an Amazon Datawarehouse and Using Business Intelligence Analytics Tools
Amazon Web Services
 
How to run your Hadoop Cluster in 10 minutes
How to run your Hadoop Cluster in 10 minutesHow to run your Hadoop Cluster in 10 minutes
How to run your Hadoop Cluster in 10 minutes
Vladimir Simek
 
Sizing MongoDB on AWS with Wired Tiger-Patrick and Vigyan-Final
Sizing MongoDB on AWS with Wired Tiger-Patrick and Vigyan-FinalSizing MongoDB on AWS with Wired Tiger-Patrick and Vigyan-Final
Sizing MongoDB on AWS with Wired Tiger-Patrick and Vigyan-FinalVigyan Jain
 
Galaxy CloudMan performance on AWS
Galaxy CloudMan performance on AWSGalaxy CloudMan performance on AWS
Galaxy CloudMan performance on AWS
Enis Afgan
 
The future of tape
The future of tapeThe future of tape
The future of tape
Josef Weingand
 
Red Hat Storage Day New York - New Reference Architectures
Red Hat Storage Day New York - New Reference ArchitecturesRed Hat Storage Day New York - New Reference Architectures
Red Hat Storage Day New York - New Reference Architectures
Red_Hat_Storage
 
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...
Databricks
 
Re invent 2018 meetup presentation
Re invent 2018 meetup presentationRe invent 2018 meetup presentation
Re invent 2018 meetup presentation
Eliran Yamin
 
Lessons learned scaling big data in cloud
Lessons learned   scaling big data in cloudLessons learned   scaling big data in cloud
Lessons learned scaling big data in cloud
Vijay Rayapati
 
IBM Tape Update Dezember18 - TS1160
IBM Tape Update Dezember18 - TS1160IBM Tape Update Dezember18 - TS1160
IBM Tape Update Dezember18 - TS1160
Josef Weingand
 
AWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best Practices
AWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best PracticesAWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best Practices
AWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best Practices
Amazon Web Services
 
GPU cloud with Job scheduler and Container
GPU cloud with Job scheduler and ContainerGPU cloud with Job scheduler and Container
GPU cloud with Job scheduler and Container
Andrew Yongjoon Kong
 
Getting Started with Amazon Redshift
 Getting Started with Amazon Redshift Getting Started with Amazon Redshift
Getting Started with Amazon Redshift
Amazon Web Services
 
Drizzle—Low Latency Execution for Apache Spark: Spark Summit East talk by Shi...
Drizzle—Low Latency Execution for Apache Spark: Spark Summit East talk by Shi...Drizzle—Low Latency Execution for Apache Spark: Spark Summit East talk by Shi...
Drizzle—Low Latency Execution for Apache Spark: Spark Summit East talk by Shi...
Spark Summit
 
Cs264 intro-to-cloud-computing
Cs264 intro-to-cloud-computingCs264 intro-to-cloud-computing
Cs264 intro-to-cloud-computing
kartiko edhi
 
[Harvard CS264] 08a - Cloud Computing, Amazon EC2, MIT StarCluster (Justin Ri...
[Harvard CS264] 08a - Cloud Computing, Amazon EC2, MIT StarCluster (Justin Ri...[Harvard CS264] 08a - Cloud Computing, Amazon EC2, MIT StarCluster (Justin Ri...
[Harvard CS264] 08a - Cloud Computing, Amazon EC2, MIT StarCluster (Justin Ri...
npinto
 
Putting Kafka Together with the Best of Google Cloud Platform
Putting Kafka Together with the Best of Google Cloud Platform Putting Kafka Together with the Best of Google Cloud Platform
Putting Kafka Together with the Best of Google Cloud Platform
confluent
 
Day of Cloud: Amazon EC2
Day of Cloud: Amazon EC2Day of Cloud: Amazon EC2
Day of Cloud: Amazon EC2
cmcavoy
 

Similar to themidgame-tube-slides (20)

(SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Inven...
(SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Inven...(SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Inven...
(SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Inven...
 
What's New in H2O Driverless AI? - Arno Candel - H2O AI World London 2018
What's New in H2O Driverless AI? - Arno Candel - H2O AI World London 2018What's New in H2O Driverless AI? - Arno Candel - H2O AI World London 2018
What's New in H2O Driverless AI? - Arno Candel - H2O AI World London 2018
 
Building an Amazon Datawarehouse and Using Business Intelligence Analytics Tools
Building an Amazon Datawarehouse and Using Business Intelligence Analytics ToolsBuilding an Amazon Datawarehouse and Using Business Intelligence Analytics Tools
Building an Amazon Datawarehouse and Using Business Intelligence Analytics Tools
 
How to run your Hadoop Cluster in 10 minutes
How to run your Hadoop Cluster in 10 minutesHow to run your Hadoop Cluster in 10 minutes
How to run your Hadoop Cluster in 10 minutes
 
Sizing MongoDB on AWS with Wired Tiger-Patrick and Vigyan-Final
Sizing MongoDB on AWS with Wired Tiger-Patrick and Vigyan-FinalSizing MongoDB on AWS with Wired Tiger-Patrick and Vigyan-Final
Sizing MongoDB on AWS with Wired Tiger-Patrick and Vigyan-Final
 
Galaxy CloudMan performance on AWS
Galaxy CloudMan performance on AWSGalaxy CloudMan performance on AWS
Galaxy CloudMan performance on AWS
 
The future of tape
The future of tapeThe future of tape
The future of tape
 
Red Hat Storage Day New York - New Reference Architectures
Red Hat Storage Day New York - New Reference ArchitecturesRed Hat Storage Day New York - New Reference Architectures
Red Hat Storage Day New York - New Reference Architectures
 
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...
 
Re invent 2018 meetup presentation
Re invent 2018 meetup presentationRe invent 2018 meetup presentation
Re invent 2018 meetup presentation
 
Lessons learned scaling big data in cloud
Lessons learned   scaling big data in cloudLessons learned   scaling big data in cloud
Lessons learned scaling big data in cloud
 
IBM Tape Update Dezember18 - TS1160
IBM Tape Update Dezember18 - TS1160IBM Tape Update Dezember18 - TS1160
IBM Tape Update Dezember18 - TS1160
 
AWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best Practices
AWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best PracticesAWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best Practices
AWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best Practices
 
GPU cloud with Job scheduler and Container
GPU cloud with Job scheduler and ContainerGPU cloud with Job scheduler and Container
GPU cloud with Job scheduler and Container
 
Getting Started with Amazon Redshift
 Getting Started with Amazon Redshift Getting Started with Amazon Redshift
Getting Started with Amazon Redshift
 
Drizzle—Low Latency Execution for Apache Spark: Spark Summit East talk by Shi...
Drizzle—Low Latency Execution for Apache Spark: Spark Summit East talk by Shi...Drizzle—Low Latency Execution for Apache Spark: Spark Summit East talk by Shi...
Drizzle—Low Latency Execution for Apache Spark: Spark Summit East talk by Shi...
 
Cs264 intro-to-cloud-computing
Cs264 intro-to-cloud-computingCs264 intro-to-cloud-computing
Cs264 intro-to-cloud-computing
 
[Harvard CS264] 08a - Cloud Computing, Amazon EC2, MIT StarCluster (Justin Ri...
[Harvard CS264] 08a - Cloud Computing, Amazon EC2, MIT StarCluster (Justin Ri...[Harvard CS264] 08a - Cloud Computing, Amazon EC2, MIT StarCluster (Justin Ri...
[Harvard CS264] 08a - Cloud Computing, Amazon EC2, MIT StarCluster (Justin Ri...
 
Putting Kafka Together with the Best of Google Cloud Platform
Putting Kafka Together with the Best of Google Cloud Platform Putting Kafka Together with the Best of Google Cloud Platform
Putting Kafka Together with the Best of Google Cloud Platform
 
Day of Cloud: Amazon EC2
Day of Cloud: Amazon EC2Day of Cloud: Amazon EC2
Day of Cloud: Amazon EC2
 

Recently uploaded

LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
DanBrown980551
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
Elena Simperl
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
Product School
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
BookNet Canada
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
James Anderson
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
Ralf Eggert
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Product School
 
ODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User GroupODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User Group
CatarinaPereira64715
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
Product School
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
Prayukth K V
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
Laura Byrne
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
Jemma Hussein Allen
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
Alan Dix
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
91mobiles
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
UiPathCommunity
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
Paul Groth
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
Product School
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
Cheryl Hung
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
DianaGray10
 

Recently uploaded (20)

LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
 
ODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User GroupODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User Group
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
 

themidgame-tube-slides

  • 1. themidgame-tubea data platform for finding YouTube influencers on brand names Pedro Moy
  • 2. Data Sources / Scraping 2 19M videos 500 GB 18K brands
  • 3. Data Pipeline 3 YARN AWS Elastic MapReduce Thrift API MapReduce MapReduce
  • 4. Design Choices 1. AWS Elastic MapReduce (EMR) vs. Mesosphere (with Docker)  zero deployment efforts  EMR with YARN, dynamically scale your cluster 2. batch-only design  lower-maintenance (no Kafka, no Storm/Spark Streaming) 3. S3+HDFS together  cheap unlimited storage with HDFS batch performance  safely shut down your cluster 4
  • 5. System Scalability Test With Spot Instances 5 Slave Nodes Batch Process Time Overall Throughput Cost Rate Total Cost 5 48 mins 10.4 GB/min $2.1/h $1.7 10 32 mins 15.6 GB/min $3.8/h $2.0 15 21 mins 23.8 GB/min $5.5/h $1.9 Load: +500 GB / +19M videos Cluster Setup: Master: m3.xlarge Slave: m2.4xlarge (bided at $0.1/h from $1.0/h)
  • 6. Scalability Challenges & Salted Tables 6 Data Column Channel Name Channel ID Video Title Transcript Word Set Word Frequency RowKey = SALT+Brand+MetricType+… …NumMetric+VideoID SALTING = predefining “table” (region) splits using the SALT HBase Schema RowKey0 RowKey1 RowLey2 …
  • 7. Born in Brazil Love mixing with different cultures! Former Data Analytics Engineer Boston MA Studying machine learning (part-time) OMSCS, Georgia Tech MSc, Geophysics King Abdullah University of Science and Tech. BSc, Mechanical Engineering University of Massachusetts Lowell 7
  • 8. System Scalability Test With Spot Instances[2] 8 Core EC2 Instances Spark Executors Spark Jobs Time Batch Process Time Overall Throughput Total Cost Savings 5 119 23 mins 48 mins (25) 10.4 GB/min $2.1/h ($1.7) 68% 10 239 12 mins 32 mins (20) 15.6 GB/min $3.8/h ($2.0) 70% 15 359 7 mins 21 mins (14) 23.8 GB/min $5.5/h ($1.9) 70% Load: +500 GB / +19M videos Cluster Setup: Master: Core: Spark Setup: 2 GB Memory/Executor m3.xlarge 15 GB RAM 13 units 4 cores m2.4xlarge 68 GB RAM 26 units 8 cores