Big Data on EC2: Mashing Tech in the Cloud

•

1 like•655 views

This document discusses how a startup serving widgets on popular online publications scaled their infrastructure using Amazon Web Services to handle spikes in traffic from over 1 billion users sharing over 10 billion URLs. They used a hub-and-spoke architecture with components like Cascading, Amazon Elastic MapReduce, and AsterData to analyze user sharing patterns in a cost-effective and horizontally scalable way.

Technology

Scenario…

✤ Given: >10^6 publishers, >10^9 users, >10^10 urls

✤ Early-stage start-up, < 25 people, seeking to minimize spend on Ops
and capex for data centers

✤ Serving widgets on page views for popular online publications: ESPN,
HuffPost, FOX, CS Monitor, CBS Marketwatch, Wired, TechCrunch, etc.

✤ Spikes in popularity of stories leads to elastic demands throughout
the system architecture: serving API, logging, DW, BI, etc.

✤ Business needs to improve user experience by analyzing how people
share online content

ScaleCamp 2009-06-09

System Architecture

✤ 100% infrastructure based on Amazon AWS

✤ Each component designed for cost-effective, horizontal scale-out

✤ AsterData: infrastructure based on a “hub-and-spoke” pattern of batch
jobs and data consolidation

✤ Cascading: abstraction layer for tying together system components

✤ Batch jobs run on Amazon Elastic MapReduce and AsterData SQL/MR

✤ Vertical search based on Bixo and Katta

ScaleCamp 2009-06-09

Cascading

✤ Syntax is for humans, APIs are for software

✤ Cascading defines apps in terms of functions applied to data flows,
incorporates end-points; whereas MR is coarse, key/values are brittle

✤ Direct benefits to team’s responsibilities, process, deliverables

✤ Also consider impact on team size, composition, staffing, training

✤ Ideal for provisioning/monitoring elastic resources, such as EMR

✤ Great potential to allow for migrating apps

✤ (see also: Cascading talk @ Hadoop Summit tomorrow, 6pm)

ScaleCamp 2009-06-09

Amazon Elastic MapReduce

✤ Can scale-out wide and select different instance types at launch

✤ Reads input from S3, writes output to S3, optionally copies logs to S3

✤ Excellent command line tools make dev/test/debug cycle more efficient

✤ Risks for DIY Hadoop clusters: time+cost to acquire leases, data locality,
etc., — EMR resolves these to make large apps more cost-effective

✤ Simplifies needs for Ops — which can be quite difficult and expensive
to staff for Big Data apps, and pose a risk to scale-out strategy

✤ See the Cascading examples: LogAnalyzer for CloudFront and Multitool

ScaleCamp 2009-06-09

AsterData nCluster Cloud Edition

✤ Scalable, fault-tolerant, relational database — directly in the cloud

✤ Takes only minutes to go from idea to a prototype which scales well;
accessible by data scientists who have less coding background

✤ Use of SQL unions on map phase, plus other SQL primitives for shuffle
and reduce phases, allows for complex MR workflows

✤ Leveraging SQL primitives during shuffle and reduce phases allows for
automated optimizations

✤ Contributed code among developer community, implementing libraries
for popular algorithms based on In-Database MapReduce

ScaleCamp 2009-06-09

ShareThis as Case Study

See also: ShareThis case study, "Cascading"
by Chris K Wensel, in…

ScaleCamp 2009-06-09

Contacts:
http://sharethis.com
@pacoid on Twitter
http://cascading.org
http://asterdata.com/product/ncluster_cloud.php
http://aws.amazon.com/elasticmapreduce
http://github.com/emi/bixo
http://katta.sourceforge.net
http://www.hadoopbook.com

ScaleCamp 2009-06-09

What's hot

PowerStream: Propelling Energy Innovation with Predictive Analytics SingleStore

High availability, real-time and scalable architecturesJampp

Modeling the Smart and Connected City of the Future with Kafka and SparkSingleStore

Processing 19 billion messages in real time and NOT dying in the processJampp

Razorfish - Amazon EMR usecaseguestda111d9

Big problems Big Data, simple solutionsClaudio Pontili

Multi objective tasks scheduling algorithm for cloud computing throughput opt...Shakas Technologies

Change Data Capture - Scale by the Bay 2019Petr Zapletal

Santa Cloud: How Netflix Does Holiday Capacity Planning - South Bay SRE Meetu...Coburn Watson

Howtomakeyourown gi sdashboardGeoMedeelel

0 supermapproductsintroductionGeoMedeelel

Practical Experiences With ArcGIS ServerWisconsin Land Information Association

02 supermapiclientforjavascriptintroductionGeoMedeelel

01 supermapiportaloverviewGeoMedeelel

Autonomous analytics on streaming dataClaudiu Barbura

01 supermapiserverintroductionGeoMedeelel

Scalable crawling with Kafka, scrapy and spark - November 2021Max Lapan

AWS Customer Presentation - Cloud MadeAmazon Web Services

Cloud computing's truly open silver lining: OpenStackAsociatia ProLinux

Spark Summit West 2017: Real-Time Image Recognition with MemSQL and SparkSingleStore

What's hot (20)

PowerStream: Propelling Energy Innovation with Predictive Analytics

High availability, real-time and scalable architectures

Modeling the Smart and Connected City of the Future with Kafka and Spark

Processing 19 billion messages in real time and NOT dying in the process

Razorfish - Amazon EMR usecase

Big problems Big Data, simple solutions

Multi objective tasks scheduling algorithm for cloud computing throughput opt...

Change Data Capture - Scale by the Bay 2019

Santa Cloud: How Netflix Does Holiday Capacity Planning - South Bay SRE Meetu...

Howtomakeyourown gi sdashboard

0 supermapproductsintroduction

Practical Experiences With ArcGIS Server

02 supermapiclientforjavascriptintroduction

01 supermapiportaloverview

Autonomous analytics on streaming data

01 supermapiserverintroduction

Scalable crawling with Kafka, scrapy and spark - November 2021

AWS Customer Presentation - Cloud Made

Cloud computing's truly open silver lining: OpenStack

Spark Summit West 2017: Real-Time Image Recognition with MemSQL and Spark

Viewers also liked

Ahead of the curve with EcommerceRahul Singh

Social Media for Atlantic Canada MarketingMediaBadger

Social media for brands.pdfAnjanette Delgado

Canadaalbinonunez

Women and Social Media in CanadaAyelet Baron

How Canadians Shopguest970d5

CanadaeconomicsNorth Gwinnett Middle School

Maglev trainsSelf-employed

Wind power forecasting accuracy and uncertainty in FinlandVTT Technical Research Centre of Finland Ltd

Kristof Coussement - The Debate: the Future of (Big) Data Analytics SoftwareBAQMaR

Animatronics PortfolioBrian Jaecker-Jones

Histmojurassicpark 090807051516-phpapp01ajeesh n

eMarketer Webinar: Demographics in Canada—Age-based Digital BehaviorseMarketer

Stan winston & animatronicsJason Seon-gyu Park

AnimatronicsPLANB1

Maglev trains pptLokesh Choudhary

The Impact of Big Data On Marketing Analytics (UpStream Software)Revolution Analytics

Ultra wideband technology (UWB)Mustafa Khaleel

Canada country presentation_presentationTeresa Maroto Ramos

Wireless Application Protocol pptgo2project

Viewers also liked (20)

Ahead of the curve with Ecommerce

Social Media for Atlantic Canada Marketing

Social media for brands.pdf

Canada

Women and Social Media in Canada

How Canadians Shop

Canadaeconomics

Maglev trains

Wind power forecasting accuracy and uncertainty in Finland

Kristof Coussement - The Debate: the Future of (Big) Data Analytics Software

Animatronics Portfolio

Histmojurassicpark 090807051516-phpapp01

eMarketer Webinar: Demographics in Canada—Age-based Digital Behaviors

Stan winston & animatronics

Animatronics

Maglev trains ppt

The Impact of Big Data On Marketing Analytics (UpStream Software)

Ultra wideband technology (UWB)

Canada country presentation_presentation

Wireless Application Protocol ppt

Similar to Big Data on EC2: Mashing Tech in the Cloud

Slides: Proven Strategies for Hybrid Cloud Computing with Mainframes — From A...DATAVERSITY

AWS Start-Up Tour 2009 / ShareThisPaco Nathan

High Performance Computing on AWS: Accelerating Innovation with virtually unl...Amazon Web Services

Building a Big Data & Analytics Platform using AWS Amazon Web Services

Cloud Computing Realities - Getting past the hype and setting your cloud stra...Compuware APM

Apache Cassandra: NoSQL in the enterprisejbellis

ESA and the CloudNetcetera

Cloud Computing ...changes everythingLew Tucker

Leveraging Mainframe Data for Modern Analyticsconfluent

Simplify Your Migration to AWS and Cut Costs by 30% with TSO LogicAmazon Web Services

Optimizing Cost and Capacity for Compute - CMP302 - Santa Clara AWS SummitAmazon Web Services

Cloud Has Become the New Normal: TCS Amazon Web Services

Zeller Edm Summit Agile Deployment Of Predictive AnalyticsRonald.Ramos

High Performance Computing on AWSAmazon Web Services

Cloud Computingwebscale

AWS re:Invent 2016: Disrupting Big Data with Cost-effective Compute (CMP302)Amazon Web Services

Giga Spaces Alternative To GAE_JavaOne 09Amnon Raviv

AWS Summit 2018 SummaryAshish Mrig

Cloud Architecture - Multi Cloud, Edge, On-PremiseAraf Karsh Hamid

Apache Hadoop India Summit 2011 talk "Making Hadoop Enterprise Ready with Am...Yahoo Developer Network

Similar to Big Data on EC2: Mashing Tech in the Cloud (20)

Slides: Proven Strategies for Hybrid Cloud Computing with Mainframes — From A...

AWS Start-Up Tour 2009 / ShareThis

High Performance Computing on AWS: Accelerating Innovation with virtually unl...

Building a Big Data & Analytics Platform using AWS

Cloud Computing Realities - Getting past the hype and setting your cloud stra...

Apache Cassandra: NoSQL in the enterprise

ESA and the Cloud

Cloud Computing ...changes everything

Leveraging Mainframe Data for Modern Analytics

Simplify Your Migration to AWS and Cut Costs by 30% with TSO Logic

Optimizing Cost and Capacity for Compute - CMP302 - Santa Clara AWS Summit

Cloud Has Become the New Normal: TCS

Zeller Edm Summit Agile Deployment Of Predictive Analytics

High Performance Computing on AWS

Cloud Computing

AWS re:Invent 2016: Disrupting Big Data with Cost-effective Compute (CMP302)

Giga Spaces Alternative To GAE_JavaOne 09

AWS Summit 2018 Summary

Cloud Architecture - Multi Cloud, Edge, On-Premise

Apache Hadoop India Summit 2011 talk "Making Hadoop Enterprise Ready with Am...

Recently uploaded

The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los

Histor y of HAM Radio presentation slidevu2urc

Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays

How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes

TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc

Automating Google Workspace (GWS) & more with Apps Scriptwesley chun

04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG

Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2

EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science

CNv6 Instructor Chapter 6 Quality of Servicegiselly40

Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun

🐬 The future of MySQL is Postgres 🐘RTylerCroy

The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad

Artificial Intelligence: Facts and MythsJoaquim Jorge

Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo

Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer

Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies

Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia

How to convert PDF to text with Nanonetsnaman860154

Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge

Recently uploaded (20)

The 7 Things I Know About Cyber Security After 25 Years | April 2024

Histor y of HAM Radio presentation slide

Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...

How to Troubleshoot Apps for the Modern Connected Worker

TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments

Automating Google Workspace (GWS) & more with Apps Script

04-2024-HHUG-Sales-and-Marketing-Alignment.pptx

Exploring the Future Potential of AI-Enabled Smartphone Processors

EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx

CNv6 Instructor Chapter 6 Quality of Service

Data Cloud, More than a CDP by Matt Robison

🐬 The future of MySQL is Postgres 🐘

The Codex of Business Writing Software for Real-World Solutions 2.pptx

Artificial Intelligence: Facts and Myths

Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...

Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024

Factors to Consider When Choosing Accounts Payable Services Providers.pptx

Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...

How to convert PDF to text with Nanonets

Driving Behavioral Change for Information Management through Data-Driven Gree...

Big Data on EC2: Mashing Tech in the Cloud

1. Big Data on EC2: Mashing Technology in the Cloud ShareThis Paco Nathan, Data Insights Team ScaleCamp 2009-06-09

2. Scenario… ✤ Given: >10^6 publishers, >10^9 users, >10^10 urls ✤ Early-stage start-up, < 25 people, seeking to minimize spend on Ops and capex for data centers ✤ Serving widgets on page views for popular online publications: ESPN, HuffPost, FOX, CS Monitor, CBS Marketwatch, Wired, TechCrunch, etc. ✤ Spikes in popularity of stories leads to elastic demands throughout the system architecture: serving API, logging, DW, BI, etc. ✤ Business needs to improve user experience by analyzing how people share online content ScaleCamp 2009-06-09

3. System Architecture ✤ 100% infrastructure based on Amazon AWS ✤ Each component designed for cost-effective, horizontal scale-out ✤ AsterData: infrastructure based on a “hub-and-spoke” pattern of batch jobs and data consolidation ✤ Cascading: abstraction layer for tying together system components ✤ Batch jobs run on Amazon Elastic MapReduce and AsterData SQL/MR ✤ Vertical search based on Bixo and Katta ScaleCamp 2009-06-09

4. ScaleCamp 2009-06-09

5. Cascading ✤ Syntax is for humans, APIs are for software ✤ Cascading defines apps in terms of functions applied to data flows, incorporates end-points; whereas MR is coarse, key/values are brittle ✤ Direct benefits to team’s responsibilities, process, deliverables ✤ Also consider impact on team size, composition, staffing, training ✤ Ideal for provisioning/monitoring elastic resources, such as EMR ✤ Great potential to allow for migrating apps ✤ (see also: Cascading talk @ Hadoop Summit tomorrow, 6pm) ScaleCamp 2009-06-09

6. Amazon Elastic MapReduce ✤ Can scale-out wide and select different instance types at launch ✤ Reads input from S3, writes output to S3, optionally copies logs to S3 ✤ Excellent command line tools make dev/test/debug cycle more efficient ✤ Risks for DIY Hadoop clusters: time+cost to acquire leases, data locality, etc., — EMR resolves these to make large apps more cost-effective ✤ Simplifies needs for Ops — which can be quite difficult and expensive to staff for Big Data apps, and pose a risk to scale-out strategy ✤ See the Cascading examples: LogAnalyzer for CloudFront and Multitool ScaleCamp 2009-06-09

7. AsterData nCluster Cloud Edition ✤ Scalable, fault-tolerant, relational database — directly in the cloud ✤ Takes only minutes to go from idea to a prototype which scales well; accessible by data scientists who have less coding background ✤ Use of SQL unions on map phase, plus other SQL primitives for shuffle and reduce phases, allows for complex MR workflows ✤ Leveraging SQL primitives during shuffle and reduce phases allows for automated optimizations ✤ Contributed code among developer community, implementing libraries for popular algorithms based on In-Database MapReduce ScaleCamp 2009-06-09

8. ShareThis as Case Study See also: ShareThis case study, "Cascading" by Chris K Wensel, in… ScaleCamp 2009-06-09

9. Contacts: http://sharethis.com @pacoid on Twitter http://cascading.org http://asterdata.com/product/ncluster_cloud.php http://aws.amazon.com/elasticmapreduce http://github.com/emi/bixo http://katta.sourceforge.net http://www.hadoopbook.com ScaleCamp 2009-06-09

Big Data on EC2: Mashing Tech in the Cloud

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (20)

Similar to Big Data on EC2: Mashing Tech in the Cloud

Similar to Big Data on EC2: Mashing Tech in the Cloud (20)

More from George Ang

More from George Ang (20)

Recently uploaded

Recently uploaded (20)

Big Data on EC2: Mashing Tech in the Cloud