Getting to 1.5M Ads per Second
How DataXu Manages Big Data
AWS, DataXu, Qubole
March 30th, 2015
Today’s speakers
Yekesa Kosuru
VP of Engineering,
DataXu
Ashish Dubey
Solutions Architect,
Qubole
Scott Ward
Solutions Architect,
AWS
Agenda
• AWS: Big Data, Technologies & Techniques for working
productively with Data at any scale
• Qubole: Big Data Delivered as a Service
• DataXu: Leveraging Big Data to Understand & Engage Customers
Housekeeping
• The recording link will be distributed to all registrants via email after
the webinar next week
• Please submit your questions and comments using the Chat with
Presenters box located at the bottom left corner of your screen
Technologies and techniques for working
productively with data, at any scale.
Big Data
Creating Value from Data Assets
Recommendations,
Collective Intelligence
Machine Learning
Visualization
Dashboards
Business Intelligence
Measuring Functionality
and Services
Ad Hoc Queries
A/B Testing
Hypothesis Testing &
Predictions
Statistical
Analysis
Learning from Social
Media Conversations
Sentiment Analysis
Big Data Lifecycle
• Generation
• Collection & storage
• Analytics & computation
• Collaboration & sharing
Big Data + AWS
Big Data | AWS Cloud
Potentially massive data sets | Massive, virtually unlimited capacity
Iterative, experimental style of data manipulation and analysis | Iterative, experimental style of infrastructure deployment/usage
Frequently not a steady-state workload; peaks and valleys | Efficient with highly variable workloads
Time to results is key | Parallel compute clusters from single data source
Hard to configure/manage | Managed services for data storage and analysis
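The "peaks and valleys" row can be made concrete with a little arithmetic. The sketch below compares a cluster sized for the peak hour all day against one resized every hour. The demand profile and the per-node price are made-up illustration values, not AWS pricing.

```python
# Hypothetical nodes needed per hour over one day: quiet nights, a midday peak.
hourly_nodes_needed = [2] * 8 + [10] * 4 + [40] * 4 + [10] * 4 + [2] * 4

PRICE_PER_NODE_HOUR = 0.25  # assumed price in dollars, for illustration only


def fixed_cost(demand, price):
    """Cost when the cluster is sized for the peak hour around the clock."""
    return max(demand) * len(demand) * price


def elastic_cost(demand, price):
    """Cost when the cluster is resized to match each hour's demand."""
    return sum(demand) * price


fixed = fixed_cost(hourly_nodes_needed, PRICE_PER_NODE_HOUR)
elastic = elastic_cost(hourly_nodes_needed, PRICE_PER_NODE_HOUR)
print(f"fixed: ${fixed:.2f}  elastic: ${elastic:.2f}")
```

With this invented profile the peak-sized cluster costs $240.00 per day and the elastic one $66.00; the gap depends entirely on how spiky the real workload is.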
AWS Data Services
Data
• Volume: Gigabytes, Terabytes, Petabytes
• Variety: Structured, Unstructured, Text, Binary
• Velocity: Millisecond, Second, Minute, Hour, Day
Services
• EC2, EBS: Instance Storage
• Redshift, RDS: SQL Stores
• EMR: Hadoop
• DynamoDB: NoSQL
• Kinesis: Stream
• S3, CloudFront, Glacier: Storage Services
• ElastiCache: Caching
• Data Pipeline: Orchestrate
Amazon Elastic MapReduce
Hosted Hadoop Framework
• Easy to use and fully managed
• Secure
• Resizable clusters to support processing needs
• Support for EC2 spot instances
• Use many query tools to support analysis of
your data
– Hive, Pig, HBase, Spark, BI tools, etc.
• EMRFS for an S3-backed data store
• Direct integration with other AWS data stores
– S3, Redshift, DynamoDB
Amazon EMR Architecture
[Figure: EMR cluster diagram. Labels: Master instance group, Core instance group, Task instance group, HDFS, Amazon S3, Amazon Redshift, Amazon DynamoDB]
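As a rough illustration of the three instance groups, the sketch below builds a group list in the shape that boto3's EMR `run_job_flow` call accepts under `Instances["InstanceGroups"]`. It only constructs the data and makes no AWS call. The instance type, counts and bid price are placeholder assumptions; putting the task group on spot reflects the deck's "support for EC2 spot instances" bullet, since in EMR core nodes store HDFS data while task nodes only compute.

```python
def emr_instance_groups(core_count, task_count, task_bid_price):
    """Return master/core/task groups shaped like run_job_flow's InstanceGroups.

    Instance types are placeholders; BidPrice is a string, as the API expects.
    """
    return [
        {"Name": "master", "InstanceRole": "MASTER", "Market": "ON_DEMAND",
         "InstanceType": "m3.xlarge", "InstanceCount": 1},
        # Core nodes hold HDFS blocks, so they stay on on-demand capacity here.
        {"Name": "core", "InstanceRole": "CORE", "Market": "ON_DEMAND",
         "InstanceType": "m3.xlarge", "InstanceCount": core_count},
        # Task nodes hold no HDFS data, so losing a spot node loses no blocks.
        {"Name": "task", "InstanceRole": "TASK", "Market": "SPOT",
         "BidPrice": task_bid_price,
         "InstanceType": "m3.xlarge", "InstanceCount": task_count},
    ]


groups = emr_instance_groups(core_count=4, task_count=10, task_bid_price="0.10")
for group in groups:
    print(group["InstanceRole"], group["Market"], group["InstanceCount"])
```

Resizing a cluster then amounts to changing `InstanceCount` on the core or task group rather than rebuilding the cluster.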
EMR Security
• Security groups for master and
slave instances
• Instances launch in your VPC
• Encrypt data in S3
• Control who can access S3 data
• API requests must be signed with a key
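The request-signing bullet can be sketched with the standard library. The slide does not say which signing scheme it means; the code below assumes AWS Signature Version 4 and shows only its signing-key derivation and final HMAC step. The secret key, date, region and string to sign are dummy values, and building the canonical request is left out.

```python
import hashlib
import hmac


def _hmac_sha256(key: bytes, message: str) -> bytes:
    return hmac.new(key, message.encode("utf-8"), hashlib.sha256).digest()


def derive_signing_key(secret_key: str, date_stamp: str, region: str, service: str) -> bytes:
    """Derive a Signature Version 4 signing key by chaining four HMACs."""
    k_date = _hmac_sha256(("AWS4" + secret_key).encode("utf-8"), date_stamp)
    k_region = _hmac_sha256(k_date, region)
    k_service = _hmac_sha256(k_region, service)
    return _hmac_sha256(k_service, "aws4_request")


def sign(signing_key: bytes, string_to_sign: str) -> str:
    """Return the hex signature that would go into the Authorization header."""
    return hmac.new(signing_key, string_to_sign.encode("utf-8"), hashlib.sha256).hexdigest()


# Dummy inputs for illustration only; never hard-code a real secret key.
key = derive_signing_key("EXAMPLE-SECRET", "20150330", "us-east-1", "elasticmapreduce")
print(sign(key, "example string to sign"))
```

In practice the AWS SDKs and CLI do this signing automatically; the sketch is only meant to show that the secret itself never travels with the request.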
Amazon Redshift
Petabyte Scale Data Warehouse
• Fully managed data warehouse solution
• Able to achieve petabyte scale at $1000
per TB per year
• Integrates with existing data warehouse
tools
• Scales through columnar storage and
parallel query execution
• Data load directly from S3
• Integration with Amazon EMR
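To put the quoted price in perspective, here is the arithmetic for a few warehouse sizes. It uses the $1000 per TB per year figure from the slide and assumes decimal units (1 PB = 1000 TB); real bills depend on node type and reservation terms.

```python
PRICE_PER_TB_YEAR = 1000  # dollars, the figure quoted on the slide
TB_PER_PB = 1000          # decimal units assumed


def yearly_cost(terabytes):
    """Yearly storage cost in dollars at the quoted per-TB price."""
    return terabytes * PRICE_PER_TB_YEAR


for size_tb in (50, 250, 1 * TB_PER_PB):
    yearly = yearly_cost(size_tb)
    print(f"{size_tb:>5} TB: ${yearly:,} per year, about ${yearly / 12:,.0f} per month")
```

At that rate a full petabyte comes to $1,000,000 per year.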
Amazon Redshift Architecture
• Leader Node
– SQL endpoint
– Stores metadata
– Coordinates query execution
• Compute Nodes
– Local, columnar storage
– Execute queries in parallel
– Load, backup, restore via Amazon S3
– Parallel load from Amazon DynamoDB,
Amazon EMR, Amazon S3, HDFS/SSH
• Two hardware platforms
– Optimized for data processing
– DW1: HDD; scale from 2TB to 1.6PB
– DW2: SSD; scale from 160GB to 256TB
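The split between a coordinating leader node and parallel compute nodes can be illustrated with a toy scatter-gather average. This is only a sketch of the pattern, not Redshift's implementation: threads stand in for compute nodes, and the round-robin slicing is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_slices(rows, node_count):
    """Distribute rows round-robin across the compute nodes."""
    return [rows[i::node_count] for i in range(node_count)]


def compute_node_partial(slice_rows, column):
    """Work done on one compute node: a partial sum and a row count."""
    return sum(row[column] for row in slice_rows), len(slice_rows)


def leader_average(rows, column, node_count=4):
    """Leader role: fan the query out, then merge the partial results.

    Assumes rows is non-empty.
    """
    slices = split_into_slices(rows, node_count)
    with ThreadPoolExecutor(max_workers=node_count) as pool:
        partials = list(pool.map(lambda s: compute_node_partial(s, column), slices))
    total = sum(partial_sum for partial_sum, _ in partials)
    count = sum(partial_count for _, partial_count in partials)
    return total / count


rows = [{"price": p} for p in range(1, 101)]
print(leader_average(rows, "price"))
```

Each node returns only a sum and a count, so the leader merges a handful of small values instead of reading every row itself.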
[Figure: Redshift cluster diagram. Labels: JDBC/ODBC, 10 GigE (HPC), Ingestion, Backup, Restore]
Amazon Redshift Security
• SSL to secure data in transit
• Encryption to secure data at rest
– AES-256; hardware accelerated
– All blocks on disks and in Amazon S3 encrypted
– HSM/CloudHSM
• No direct access to compute nodes
• Amazon VPC support
[Figure: the same cluster diagram with the added labels Customer VPC and Internal Security Group]
2014 Usage Statistics for Qubole on AWS:
• Total QCUH processed in 2014 = 40.6 million
• Total nodes managed in 2014 = 2.5 million
• Total PB processed in 2014 = 519
[Figure: Qubole platform overview]
Users: Operations Analyst, Marketing Ops Analyst, Data Architects, Business Users, Product Support, Customer Support, Developer, Sales Ops, Product Managers
Components: Developer Tools, Service Management, Data Workbench, Cloud Data Platform, BI & DW Systems
• SDK
• API
• Analysis
• Security
• Job Scheduler
• Data Governance
• Analytics templates
• Monitoring
• Support
• Collaboration
• Workflow & Map/Reduce
• Auto Scaling
• Cloud Optimization
• Data Connectors
• YARN
• Presto & Hive
• Spark & Pig
Hadoop Ecosystem (Apache Open Source)
Qubole Cluster Settings
Qubole Cluster Set up with AWS Credentials
Qubole Query Types
Qubole Dashboard
DataXu Introduction
Disruptive on-demand software platform relied upon by the world’s
leading brands
A petabyte scale marketing cloud that enables Fortune 500 brands to
manage data, insight and action to maximize Marketing ROI
The industry’s #1 rated programmatic marketing technology
spun out of MIT by the founders
One of the fastest growing companies in the Inc. 500
DataXu Quick Statistics
Big data + Real time decisions
Big Data Processing
• 13 petabytes of data
• 20 terabytes/day consumer data intake
Real-Time Decisioning
• 42 billion decisions per second
• 1,500,000 inbound queries per second
• Dozens of algorithms across mobile, social, native, display, video and TV
Predictive Modeling
• Executing 10,000+ investments simultaneously
• 10M variables considered per investment decision using next-gen machine learning
Enterprise Cloud Infrastructure
• 14 data centers
• 35,000+ CPU cores
Patent portfolio for real-time decision systems
Exclusive license from MIT to Algebra Of Systems IPR
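Two of the headline figures are easier to grasp when converted to other units. The arithmetic below takes the 1,500,000 queries per second and 20 terabytes/day from the slide and assumes decimal terabytes and a perfectly even load over the day, which real traffic will not have.

```python
SECONDS_PER_DAY = 24 * 60 * 60

queries_per_second = 1_500_000         # from the slide
intake_bytes_per_day = 20 * 10**12     # 20 TB/day, decimal units assumed

# Even load over the day is assumed for both conversions.
queries_per_day = queries_per_second * SECONDS_PER_DAY
intake_mb_per_second = intake_bytes_per_day / SECONDS_PER_DAY / 10**6

print(f"{queries_per_day:,} queries per day")
print(f"{intake_mb_per_second:.1f} MB/s average intake")
```

That works out to 129,600,000,000 inbound queries per day and an average intake of roughly 231.5 MB/s.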
Programmatic buying exploits real-time signals to drive greater ROI.
• Analyze: analyze the attributes available at bidding time (context, geo, O.S., time, demo, etc.)
• Appraise: assess the value of each impression to determine a bid price and the creative to serve
• Optimize: learn from served impressions to adjust future bidding and creative delivery
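The analyze, appraise, optimize loop can be sketched as a toy bidder. This is not DataXu's algorithm: the segment key, the smoothed conversion-rate estimate, the prior values and the value per conversion are all assumptions chosen to keep the example small.

```python
from dataclasses import dataclass


@dataclass
class SegmentStats:
    impressions: int = 0
    conversions: int = 0

    def conversion_rate(self, prior_rate=0.01, prior_weight=100):
        """Smoothed estimate so that new segments start at the prior rate."""
        return (self.conversions + prior_rate * prior_weight) / (self.impressions + prior_weight)


def analyze(attributes):
    """Analyze: reduce the attributes seen at bid time to a segment key."""
    return (attributes["geo"], attributes["os"])


def appraise(stats, value_per_conversion):
    """Appraise: bid the expected value of the impression."""
    return stats.conversion_rate() * value_per_conversion


def optimize(stats, converted):
    """Optimize: fold the observed outcome back into the segment's statistics."""
    stats.impressions += 1
    stats.conversions += int(converted)


table = {}
segment = analyze({"geo": "US", "os": "iOS"})
stats = table.setdefault(segment, SegmentStats())
print("initial bid:", appraise(stats, value_per_conversion=50.0))
for i in range(100):
    optimize(stats, converted=(i < 5))  # pretend 5 of 100 impressions converted
print("bid after feedback:", appraise(stats, value_per_conversion=50.0))
```

As conversions accumulate for a segment, its bid rises above the prior-based starting point, which is the feedback loop the slide describes.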
DataXu in the Cloud
• On-premise and Cloud
• Why Cloud/AWS
– Automation, API driven
– All Data in One Place
– Improved Testability
– Deep Security
– Breadth and Depth of Services
– Costs, Pay As You Go
– Auto Scaling (Scalability, Elasticity)
– Disaster Recovery and Business Continuity
DataXu Data Flows in AWS
[Figure: data flow diagram]
Stages: Producers, Continuous Processing, Storage, Analytics
Producers: CDN, Real Time Bidding, Retargeting Platform
Components: Streaming Data Collection, Kinesis, S3, Redshift, Qubole, Machine Learning
Consumers: Analysts, Data Scientists, Engineers
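For the streaming stage, it helps to see how a Kinesis stream spreads records across shards: the partition key is hashed with MD5 into a 128-bit integer, and each shard owns a range of that hash space. The sketch below assumes the shards split the space evenly, and the partition keys are hypothetical.

```python
import hashlib
from collections import Counter

HASH_SPACE = 2**128  # MD5 produces a 128-bit hash key


def hash_key(partition_key: str) -> int:
    """Map a partition key to its 128-bit hash key."""
    return int.from_bytes(hashlib.md5(partition_key.encode("utf-8")).digest(), "big")


def shard_for(partition_key: str, shard_count: int) -> int:
    """Index of the shard whose range holds the key, assuming evenly split ranges."""
    return hash_key(partition_key) * shard_count // HASH_SPACE


# Hypothetical partition keys, for example one per user or device.
keys = [f"user-{n}" for n in range(10_000)]
print(Counter(shard_for(key, shard_count=4) for key in keys))
```

Records with the same partition key always land on the same shard, so a key with many distinct values is what keeps the load balanced.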
Why Qubole
Managed Service
• Auto Scaling
• Spot Pricing
• No Opex
• Redundant Clusters
• Data Security
Single Unified Interface
• Rich Unified Experience
• Data Discovery tool
• Query Templates
• Administration and Monitoring
Performance Optimizations
• Overall better performance than other
Hadoop clusters in the cloud
Automation
• Workflow, Scheduler
• SDK
Support
• 24x7 support with deep expertise
Unified Experience
Users: Operations Analyst, Marketing Ops Analyst, Data Architect, Business Users, Product Support, Customer Support, Developer, Sales Ops, Product Managers
Ease of use for anyone
Qubole Cluster Best Practices
• Use VPC, pick AZs appropriately to match reservations
• Use hybrid spot pricing strategy
• Use tags for better reporting
• Seek Qubole help for cluster tuning
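One way to read the hybrid spot strategy is a stable on-demand base topped up with cheaper spot nodes, with tags attached so the spend can be reported per team. The planner below is a hypothetical sketch: it does not use the Qubole API, and the prices, spot fraction and tag names are invented for illustration.

```python
def plan_cluster(total_nodes, spot_fraction, on_demand_price, spot_price, tags):
    """Split a cluster between on-demand and spot nodes and estimate hourly cost."""
    spot_nodes = int(total_nodes * spot_fraction)
    on_demand_nodes = total_nodes - spot_nodes
    hourly_cost = on_demand_nodes * on_demand_price + spot_nodes * spot_price
    return {
        "on_demand_nodes": on_demand_nodes,
        "spot_nodes": spot_nodes,
        "hourly_cost": hourly_cost,
        "tags": dict(tags),  # carried along so cost reports can group by tag
    }


plan = plan_cluster(
    total_nodes=20,
    spot_fraction=0.5,
    on_demand_price=0.25,  # assumed dollars per node-hour
    spot_price=0.05,       # assumed dollars per node-hour
    tags={"team": "analytics", "workload": "etl"},
)
print(plan)
```

With these assumed prices the mixed cluster costs $3.00 per hour against $5.00 for twenty on-demand nodes, at the price of possibly losing the spot half when the market moves.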
Data Security & Privacy
• AWS offers comprehensive data security
• Security & Privacy
– VPC
– IAM Policies, Users, Roles
– S3 Buckets, Bucket Policies & HTTPS
– Security Groups, Whitelist IP CIDR
– Key Management Service & CloudHSM
– Server Side and Client Side Encryption
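The bucket policy, HTTPS and IP whitelist items can be combined in a single S3 bucket policy. The sketch below builds such a policy as a Python dict and adds a local CIDR check using the standard library. The bucket name and CIDR range are placeholders, and a blanket source-IP deny like this one may need exceptions (for example for traffic arriving through VPC endpoints), so treat it as a starting point rather than a finished policy.

```python
import ipaddress
import json


def https_only_bucket_policy(bucket, allowed_cidrs):
    """Deny any S3 request that is not over HTTPS or not from an allowed CIDR."""
    resources = [f"arn:aws:s3:::{bucket}", f"arn:aws:s3:::{bucket}/*"]
    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Sid": "DenyInsecureTransport", "Effect": "Deny", "Principal": "*",
             "Action": "s3:*", "Resource": resources,
             "Condition": {"Bool": {"aws:SecureTransport": "false"}}},
            {"Sid": "DenyOutsideAllowedCidrs", "Effect": "Deny", "Principal": "*",
             "Action": "s3:*", "Resource": resources,
             "Condition": {"NotIpAddress": {"aws:SourceIp": list(allowed_cidrs)}}},
        ],
    }


def ip_is_whitelisted(ip, allowed_cidrs):
    """Local check that mirrors the policy's source-IP condition."""
    address = ipaddress.ip_address(ip)
    return any(address in ipaddress.ip_network(cidr) for cidr in allowed_cidrs)


ALLOWED = ["203.0.113.0/24"]  # documentation-only address range used as a placeholder
print(json.dumps(https_only_bucket_policy("example-bucket", ALLOWED), indent=2))
print(ip_is_whitelisted("203.0.113.7", ALLOWED))
```

Keeping the policy as data makes it easy to review and version alongside the IAM policies and roles listed above.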
Right tool for right workload
Use Case / Technology
• Large scale ETL
• Interactive Discovery Queries
• Machine Learning/Real time queries
• High Performance DW Queries/Reporting backend
Questions?
DataXu
Yekesa Kosuru
ykosuru@dataxu.com
www.dataxu.com
Qubole
Ashish Dubey
adubey@qubole.com
www.qubole.com
AWS
Scott Ward
scotward@amazon.com
aws.amazon.com

Getting to 1.5M Ads/sec: How DataXu manages Big Data

  • 1. Getting to 1.5M Ads per Second How DataXu Manages Big Data AWS, DataXu, Qubole March 30th, 2015
  • 2. Today’s speakers Yekesa Kosuru VP of Engineering, DataXu Ashish Dubey Solutions Architect, Qubole Scott Ward Solutions Architect, AWS
  • 3. Agenda • AWS: Big Data, Technologies & Techniques for working productively with Data at any scale • Qubole: Big Data Delivered as a Service • DataXu: Leveraging Big Data to Understand & Engage Customers
  • 4. Housekeeping • The recording link will be distributed to all registrants via email after the webinar next week • Please submit your questions and comments using the Chat with Presenters box located at the bottom left corner of your screen
  • 5. Agenda • AWS: Big Data, Technologies & Techniques for working productively with Data at any scale • Qubole: Big Data Delivered as a Service • DataXu: Leveraging Big Data to Understand & Engage Customers
  • 6.
  • 7. Technologies and techniques for working productively with data, at any scale. Big Data
  • 8. Creating Value from Data Assets Recommendations, Collective Intelligence Machine Learning Visualization Dashboards Business Intelligence Measuring Functionality and Services Ad Hoc Queries A/B Testing Hypothesis Testing & Predictions Statistical Analysis Learning from Social Media Conversations Sentiment Analysis SOCIAL BIG DATA Machine Learning Dashboards Business Intelligence Ad Hoc Queries A/B Testing Statistical Analysis Sentiment Analysis
  • 9. Generation Collection & storage Analytics & computation Collaboration & sharing Big Data Lifecycle
  • 10. Big Data AWS Cloud Potentially Massive Data Sets Massive, virtually unlimited capacity Iterative, experimental style of data manipulation and analysis Iterative, experimental style of infrastructure deployment/usage Frequently not a steady-state workload; peaks and valleys Efficient with highly variable workloads Time to results is key Parallel compute clusters from single data source Hard to configure/manage Managed services for data storage and analysis Big Data + AWS
  • 11. AWS Data Services Data Velocity Variety Volume Structured, Unstructured, Text, Binary Gigabytes, Terabytes, Petabytes Millisecond, Second, Minute, Hour, Day EC2EBS Instance Storage RedshiftRDS SQL Stores EMR Hadoop DynamoDB NoSQL Kinesis Stream Storage Services S3 Cloud Front Glacier Elasticache Caching Data Pipeline Orchestrate
  • 12. Amazon Elastic MapReduce: Hosted Hadoop Framework • Easy to use and fully managed • Secure • Resizable clusters to support processing needs • Support for EC2 Spot Instances • Use many query tools to support analysis of your data – Hive, Pig, HBase, Spark, BI tools, etc. • EMRFS for an S3-backed data store • Direct integration with other AWS data stores – S3, Redshift, DynamoDB
  • 13. Master instance group Task instance group Core instance group HDFS HDFS Amazon S3 Amazon Redshift Amazon DynamoDB Amazon EMR Architecture
  • 14. EMR Security • Security groups for master and slave instances • Instances launch in your VPC • Encrypt data in S3 • Control who can access S3 data • API requests require a signed key Master instance group Task instance group Core instance group HDFS HDFS Amazon S3 Amazon Redshift Amazon DynamoDB
  • 15. Amazon Redshift Petabyte Scale Data Warehouse • Fully managed data warehouse solution • Able to achieve petabyte scale at $1000 per TB per year • Integrates with existing data warehouse tools • Scales through columnar storage and parallel query execution • Data load directly from S3 • Integration with Amazon EMR
  • 16. Amazon Redshift Architecture • Leader Node – SQL endpoint – Stores metadata – Coordinates query execution • Compute Nodes – Local, columnar storage – Execute queries in parallel – Load, backup, restore via Amazon S3 – Parallel load from Amazon DynamoDB, Amazon EMR, Amazon S3, HDFS/SSH • Two hardware platforms – Optimized for data processing – DW1: HDD; scale from 2TB to 1.6PB – DW2: SSD; scale from 160GB to 256TB 10 GigE (HPC) Ingestion Backup Restore JDBC/ODBC
  • 17. • SSL to secure data in transit • Encryption to secure data at rest – AES-256; hardware accelerated – All blocks on disks and in Amazon S3 encrypted – HSM/CloudHSM • No direct access to compute nodes • Amazon VPC support 10 GigE (HPC) Ingestion Backup Restore Customer VPC Internal Security Group JDBC/ODBC Amazon Redshift Security
  • 19. Agenda • AWS: Big Data, Technologies & Techniques for working productively with Data at any scale • Qubole: Big Data Delivered as a Service • DataXu: Leveraging Big Data to Understand & Engage Customers
  • 20. 2014 Usage Statistics for Qubole on AWS: • Total QCUH processed in 2014 = 40.6 million • Total nodes managed in 2014 = 2.5 million • Total PB processed in 2014 = 519 Operations Analyst Marketing Ops Analyst Data Architects Business Users Product Support Customer Support Developer Sales Ops Product Managers Developer Tools Service Management Data Workbench Cloud Data Platform BI & DW Systems • SDK • API • Analysis • Security • Job Scheduler • Data Governance • Analytics templates • Monitoring • Support • Collaboration • Workflow & Map/Reduce • Auto Scaling • Cloud Optimization • Data Connectors • YARN • Presto & Hive • Spark & Pig Hadoop Ecosystem (Apache Open Source)
  • 22. Qubole Cluster Set up with AWS Credentials
  • 25. Agenda Slide • AWS: Big Data, Technologies & Techniques for working productively with Data at any scale • Qubole: Big Data Delivered as a Service • DataXu: Leveraging Big Data to Understand & Engage Customers
  • 26. DataXu Introduction • Disruptive on-demand software platform relied upon by the world’s leading brands • A petabyte scale marketing cloud that enables Fortune 500 brands to manage data, insight and action to maximize Marketing ROI • The industry’s #1 rated programmatic marketing technology spun out of MIT by the founders • One of the fastest growing companies in the Inc. 500
  • 27. DataXu Quick Statistics: Big data + Real time decisions • Big Data Processing: 13 petabytes of data; 20 terabytes/day consumer data intake • Real-Time Decisioning: 42 billion decisions per second; 1,500,000 Inbound Queries Per Second; Dozens of algorithms across mobile, social, native, display, video and TV • Predictive Modeling: Executing 10,000+ investments simultaneously; 10M variables considered per investment decision using next gen machine learning • Enterprise Cloud Infrastructure: 14 data centers; 35,000+ CPU cores • Patent portfolio for real-time decision systems • Exclusive license from MIT to Algebra Of Systems IPR
  • 28. Programmatic buying exploits real time signals to drive greater ROI. • Analyze: Analyze the attributes available at bidding time (Context, Geo, O.S., Time, Demo, etc.) • Appraise: Assess the value of each impression to determine a bid price and the creative to serve • Optimize: Learn from served impressions to adjust future bidding and creative delivery
  • 29. DataXu in the Cloud (AWS) • On-premise and Cloud • Why Cloud/AWS – Automation, API driven – All Data in One Place – Improved Testability – Deep Security – Breadth and Depth of Services – Costs, Pay As You Go – Auto Scaling (Scalability, Elasticity) – Disaster Recovery and Business Continuity
  • 30. DataXu Data Flows in AWS Producers Continuous Processing Storage Analytics CDN Real Time Bidding Retargeting Platform Qubole Kinesis S3 Redshift Machine Learning Streaming Data Collection Analysts Data Scientists Engineers
  • 31. Why Qubole Managed Service • Auto Scaling • Spot Pricing • No Opex • Redundant Clusters • Data Security Single Unified Interface • Rich Unified Experience • Data Discovery tool • Query Templates • Administration and Monitoring Performance Optimizations • Overall better performance than other Hadoop clusters in the cloud Automation • Workflow, Scheduler • SDK Support • 24 X 7 deep expertise support
  • 33. Qubole Cluster Best Practices • Use VPC; pick AZs appropriately to match reservations • Use a hybrid spot pricing strategy • Use tags for better reporting • Seek Qubole help for cluster tuning
  • 34. Data Security & Privacy • AWS offers comprehensive data security • Security & Privacy – VPC – IAM Policies, Users, Roles – S3 Buckets, Bucket Policies & HTTPS – Security Groups, Whitelist IP CIDR – Key Management Service & CloudHSM – Server Side and Client Side Encryption
  • 35. Right tool for right workload • Large scale ETL • Interactive Discovery Queries • Machine Learning/Real time queries • High Performance DW Queries/Reporting backend • Use Case / Technology
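
The Redshift figures on slides 15 and 16 can be turned into a rough yearly cost estimate. The sketch below is only an illustration: it assumes decimal units (1 PB = 1000 TB) and applies the quoted "$1000 per TB per year" as a flat rate to both hardware platforms, which actual pricing may not follow. The function and dictionary names are hypothetical.

```python
# Rough yearly cost at the headline rate quoted on slide 15.
# Assumptions (not from the deck): decimal units (1 PB = 1000 TB) and a flat
# rate applied to every platform, which real pricing would not follow exactly.
PRICE_PER_TB_YEAR_USD = 1000


def yearly_cost_usd(capacity_tb, price_per_tb_year=PRICE_PER_TB_YEAR_USD):
    """Return the estimated yearly cost in USD for a capacity given in TB."""
    return capacity_tb * price_per_tb_year


# Maximum capacities listed on slide 16.
PLATFORM_MAX_TB = {
    "DW1 (HDD)": 1600,  # 1.6 PB
    "DW2 (SSD)": 256,   # 256 TB
}

for platform, capacity_tb in PLATFORM_MAX_TB.items():
    cost = yearly_cost_usd(capacity_tb)
    print(f"{platform}: {capacity_tb} TB -> ${cost:,} per year")
```

At the flat rate, a fully scaled DW1 cluster would come to about $1.6 million per year and a fully scaled DW2 cluster to about $256,000 per year.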

Editor's Notes

  1. 1.5 million ad requests per second. Billions of impressions per month, petabytes of data. ~10 ms round-trip average response time, 100 ms max. Serving in 50+ countries around the world. Over 20 TB of data collected per day. Integrated with over 30 exchanges around the world.
  2. No HDFS: there is no reliable way to auto-scale with it.
  3. Pretty innovative use of spot. Qubole has put thought into cost-effective spot pricing and auto scaling. Talk about auto scaling and cost optimization. HDFS does not make sense in Qubole; don't rely on it.
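
The headline figures in note 1 imply a few derived numbers that help convey the scale. The following is a back-of-the-envelope sketch, not a measurement: it treats 1.5 million requests per second as a sustained rate, uses the ~10 ms average response time with Little's law (in-flight requests = arrival rate × time in system), and assumes decimal units (1 TB = 1,000,000 MB). All function names are hypothetical.

```python
# Back-of-the-envelope numbers derived from editor's note 1.
# Assumptions (not from the deck): the request rate is sustained all day,
# and storage uses decimal units (1 TB = 1,000,000 MB).
SECONDS_PER_DAY = 86_400


def in_flight_requests(requests_per_sec, avg_latency_ms):
    """Little's law: average requests in flight = rate * time in system."""
    return requests_per_sec * avg_latency_ms / 1000


def requests_per_day(requests_per_sec):
    """Total requests over one day at a constant rate."""
    return requests_per_sec * SECONDS_PER_DAY


def intake_mb_per_sec(tb_per_day):
    """Average ingest bandwidth in MB/s for a daily volume given in TB."""
    return tb_per_day * 1_000_000 / SECONDS_PER_DAY


print(f"In flight at any moment: {in_flight_requests(1_500_000, 10):,.0f}")
print(f"Requests per day:        {requests_per_day(1_500_000):,}")
print(f"Average intake:          {intake_mb_per_sec(20):.1f} MB/s")
```

Under these assumptions, roughly 15,000 requests are in flight at any moment, about 129.6 billion requests arrive per day, and 20 TB per day averages to about 231 MB/s of ingest.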