We’ve been operating the service for over 3 years now, and in the last year alone we’ve operated over 2 million Hadoop clusters.
The Forrester Wave report named Amazon EMR the #1 enterprise Hadoop solution because of its integration with various data stores, its ecosystem of vendors and the number of customers the service supports.
Hi, my name is Anupam Singh. I am the Vice President of Technology at MarketShare.
MarketShare builds solutions for marketing organizations at Fortune 100 companies. Our customers provide us data and we provide cloud-based analytic applications to improve the efficiency of our customers’ marketing.
So, what are the big challenges that we face? Our entire business is based on scaling complex data modeling. Our scaling challenges are across four major dimensions. Each customer has tens of terabytes of data. The data comes from hundreds of data sources. This data has thousands of variables to analyze. And we need to do this for hundreds of customers. Let us look at the various stages to build a solution that scales.
The first stage is bringing the data together. Today’s marketing organization is faced with hundreds of data sources. Consider this picture where we bring together data from the customer’s website, the advertising logs from their vendors, revenue data from the ERP systems, variables like Seasonality & Economy. As you can see, we have to gather more than 40 data sources in this single picture. Just managing the storage for daily, weekly and monthly updates is a challenge.
A lot of this data is machine generated, and it is not ready for analytics. Each data source has to be scrubbed and cleaned through an ETL pipeline before doing analytics. Our ETL pipelines have 20-30 main stages with hundreds of sub-stages. Scheduling these and correcting data errors is one of our biggest technical challenges. We will dive deeper into this later. Once the data has been cleaned, it is ready for analytics.
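As an illustration of the scheduling problem described above, here is a minimal sketch of ordering ETL stages by their dependencies. The stage names and the dependency graph are hypothetical, not MarketShare's actual pipeline, and the standard-library `graphlib` module (Python 3.9+) stands in for whatever scheduler is really used.

```python
from graphlib import TopologicalSorter

# Hypothetical stage names; the pipeline in the talk has 20-30 main stages.
# Each key maps a stage to the set of stages it depends on.
STAGE_DEPENDENCIES = {
    "ingest_web_logs": set(),
    "ingest_ad_logs": set(),
    "scrub_web_logs": {"ingest_web_logs"},
    "scrub_ad_logs": {"ingest_ad_logs"},
    "join_sources": {"scrub_web_logs", "scrub_ad_logs"},
    "load_reporting_tables": {"join_sources"},
}


def schedule_in_waves(dependencies):
    """Group stages into waves; stages in one wave can run in parallel."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises CycleError if stages depend on each other in a loop
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # stages whose dependencies are all done
        waves.append(ready)
        sorter.done(*ready)
    return waves


waves = schedule_in_waves(STAGE_DEPENDENCIES)
for number, wave in enumerate(waves, start=1):
    print(f"wave {number}: {', '.join(wave)}")
```

Stages in the same wave have no dependencies on each other, so they are natural candidates for running side by side.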
Many of our customers have never seen these data sources in a single dashboard. Even before running the data through our proprietary modeling platform, we can help our customers get dashboards on previous data black holes.
The term data scientist has been in vogue lately. At MarketShare, we have a large team of modelers who run modeling on the cloud. As the data has been cleaned up, the modelers run thousands of different equations. Many analytic applications stop their cloud usage at reporting. At MarketShare, we believe that reporting is not enough to answer the questions. Building a predictive model is key to answering business questions on terabytes of data. We use the cloud to build custom models for each one of our customers. We use the power of distributed systems to validate these models for accuracy.
Once the models have been prepared, they are deployed in an easy-to-use application. It should be noted that reducing big data should not mean that the user is lost in a forest of reports. At MarketShare, we believe in simplifying access to Big Data. We hide the model complexity behind easy-to-use applications that let our users build many different scenarios for their business.
So, what does all this give our customers? We have been able to release many different applications on top of this analytics pipeline. The first one is marketing efficiency. The second application is Attribution. The third one is Dynamic Pricing.
So, what makes this pipeline run? Our entire analytics workflow is built using various services from Amazon as building blocks. Our applications are deployed behind the Elastic Load Balancing service. The data is stored in storage services like S3 and RDS, and we are trying out DynamoDB. Our analytics jobs are executed on dynamic clusters provided by Elastic MapReduce.
So, let us quickly go under the hood. Three years ago, we started with a Hadoop cluster to store all our data. Very quickly we noticed two important things with the cluster. The first observation is that however big we made the cluster, jobs kept running into each other. Try as we might, the cluster would get hot for some time when many different stages would start executing at the same time. The second observation was how unused the cluster was for long periods of time. So, while we were spending a lot of dollars on this large cluster, our customers were still unhappy with the response times!
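The two observations can be made concrete with a small calculation. This is a sketch with invented numbers: the job schedule, the slot counts and the 100-slot capacity are all assumptions chosen for illustration, not measurements from the talk.

```python
# Hypothetical job schedule on one shared cluster: (start_hour, end_hour, slots_needed).
JOBS = [
    (0, 2, 40),
    (1, 3, 50),
    (1, 4, 30),
    (20, 22, 20),
]
CLUSTER_SLOTS = 100  # assumed fixed capacity of the shared cluster
HOURS_IN_DAY = 24


def hourly_demand(jobs, hours=HOURS_IN_DAY):
    """Return the number of slots requested in each hour of the day."""
    demand = [0] * hours
    for start, end, slots in jobs:
        for hour in range(start, end):
            demand[hour] += slots
    return demand


demand = hourly_demand(JOBS)
peak = max(demand)
# Hours in which jobs collide and ask for more than the cluster has.
overloaded_hours = [hour for hour, slots in enumerate(demand) if slots > CLUSTER_SLOTS]
# Hours in which the cluster is paid for but does nothing.
idle_hours = sum(1 for slots in demand if slots == 0)
# Utilization counts only the work the cluster can actually serve in each hour.
utilization = sum(min(slots, CLUSTER_SLOTS) for slots in demand) / (CLUSTER_SLOTS * HOURS_IN_DAY)

print(f"peak demand: {peak} of {CLUSTER_SLOTS} slots")
print(f"overloaded hours: {overloaded_hours}")
print(f"idle hours: {idle_hours} of {HOURS_IN_DAY}")
print(f"utilization: {utilization:.1%}")
```

With these made-up inputs the same cluster is both over capacity in one hour and idle for most of the day, which is the pattern the two observations describe.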
So, what was our solution? We rewrote our entire data pipeline to run on many different clusters.
Big Data Discovery Workshop
• Brainstorm pilot use cases
• Identify data sources and formats
• Review business and financial drivers
• Recommended use cases
• Roadmap for data migration and production rollout
• Reference architecture
• Estimated pilot cost
• Next steps
EMR Bootcamp
• Interactive onsite workshop (not classroom training)
• Work with the customer to architect, install, and configure EMR
• Run and debug production job flows
• Customer’s dataset(s) must be on S3
Big Data Marketing in the AWS Cloud: Improving Cross-Media Effectiveness - Webinar
Welcome
Sheri Sullivan, Senior Marketing Manager, Global SI Ecosystem, Amazon Web Services
Webinar Overview
• Submit your questions using the Q/A tool.
• A copy of today’s presentation will be made available on:
  • AWS SlideShare Channel @ http://www.slideshare.net/AmazonWebServices/
  • AWS YouTube Channel @ http://www.youtube.com/user/AmazonWebServices
Special Note: Today’s webinar is being recorded.
What We’ll Cover
• Intro to AWS Database and Big Data Services
• Customer Use Cases and Solutions
• Delivering Cross-Media Analytics
• MarketShare Planner Platform
John Gannon, AWS Business Development Manager, email@example.com
Big Data and Databases on AWS
Managed services designed to reduce administration, accelerate deployment, and minimize the cost of analysis and experimentation
• DynamoDB: Schema-less data store that enables fast deployment of new applications without the burden of database administration
• Relational Database Service (RDS): Manage existing database applications without the effort required to provision, upgrade, backup and scale highly available instances
• ElastiCache: Accelerate data retrieval performance by caching data in memory and avoiding slower disk-based systems
• Elastic MapReduce (EMR): Hadoop-based infrastructure service enabling the parallel processing of massive amounts of data
Amazon Relational Database Service
RDS is a fully managed relational database service that is simple to deploy, easy to scale, reliable and cost-effective
• Choice of Database Engines
• Fully Managed Service
• Push Button Scalability
• Fault Tolerance with Multi-AZ
• Works with EC2 & ElastiCache
Amazon DynamoDB
DynamoDB is a fully managed NoSQL database service that provides extremely fast and predictable performance with seamless scalability
• Authors of NoSQL
• Zero Administration
• Low Latency SSDs
• Unlimited Potential Storage and Throughput
Amazon Elastic MapReduce
• Reduces complexity & cost of Hadoop Management
• Integrates with AWS Services and 3rd Party vendors
• Highly customizable
Amazon EMR is the #1 Enterprise Hadoop Solution
AWS is “the most prominent Hadoop cloud service provider” and “leads the pack (of Leaders) due to its proven, feature-rich Elastic MapReduce service…”
- The Forrester Wave™: Enterprise Hadoop Solutions, Q1 2012
Success Story
Business Challenge
• Needed a real-time analytics tool to determine dynamic live event pricing during the ticket sales life cycle
• Optimize event ticket pricing, improve yield management & generate incremental revenue
AWS Services
• Elastic Load Balancer
• Amazon Elastic MapReduce
• Amazon SimpleDB
• Amazon Simple Email Service (SES)
• Amazon CloudWatch
Business Benefits
• Ease of use, reducing developers’ infrastructure management time by 3 hours per day
• Estimated 80% cost reduction annually, compared to fixed service costs
Elastic Data Management
Multi-Cluster, Elastic, Failure Resistant
Who we are
The global marketer partner of choice for understanding, optimizing and driving revenue
Products: MarketShare Planner™, MarketShare Price™, MarketShare 360™, MarketShare Optimizer™
MarketShare Platform: Cloud modeling | SaaS infrastructure | Data connectors
• Recognized industry leader
• Cloud-based software solutions
• Over half the Fortune 100
• Strong media and agency partnerships
• Global presence
[Figure: analyst chart with axes Current Offering (Weak to Strong) and Strategy (Weak to Strong); categories Risky Bets, Contenders, Strong Performers, Leaders]
[Figure: Scale Complex Modeling. Scale dimensions: terabytes per customer, 1000+ variables, 100+ customers, 100+ data sources. Pipeline stages: Client Data (FTP), ETL, Reporting, Modeling, Sim-Opt, Application. Roles shown: Data Architect, Engineer, Modeler.]
[Figure: marketing data sources feeding the pipeline (FTP, ETL, Reporting, Modeling, Simulation, Application). Source groups: paid media (paid search, display, print, broadcast, outdoor, direct), owned media (website, in store, commerce, service and support), earned media (organic search, social media, blogs, PR, word of mouth, events), other controllable factors (brand, product, pricing, promotions) and non-controllable factors (competition, seasonality, economy, interest rates, stock market).]
Summary
Design your data pipeline for a multi-cluster environment
• Write configurable ETL to become independent, partitioned workflows
• A cluster that stays up the entire month is not elastic
Save your intermediate results in low cost storage
• Think about compression
• Do not underestimate schema complexity
Loosely coupled architecture has failure points
• Save state obsessively
• Build restart-ability into your architecture
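The summary points about partitioned workflows, compressed intermediate results, saved state and restart-ability can be sketched in a few lines. This is an illustrative sketch only: the stage functions, the `run_workflow` helper and the per-customer directory layout are hypothetical, and a local temporary directory stands in for low cost storage such as S3.

```python
import gzip
import json
import tempfile
from pathlib import Path


def run_workflow(customer, stages, workdir, fail_at=None):
    """Run one customer's stages in order, skipping stages already recorded as done.

    `stages` is a list of (name, function) pairs; each function takes the
    previous stage's rows and returns new rows. `fail_at` simulates a crash.
    """
    customer_dir = Path(workdir) / customer  # one independent partition per customer
    customer_dir.mkdir(parents=True, exist_ok=True)
    state_file = customer_dir / "state.json"
    done = json.loads(state_file.read_text()) if state_file.exists() else []

    rows, executed = [], []
    for name, func in stages:
        output = customer_dir / f"{name}.json.gz"
        if name in done:
            # Restart path: reload the compressed intermediate result instead of recomputing.
            with gzip.open(output, "rt") as handle:
                rows = json.load(handle)
            continue
        if name == fail_at:
            raise RuntimeError(f"simulated failure in {name}")
        rows = func(rows)
        with gzip.open(output, "wt") as handle:
            json.dump(rows, handle)  # intermediate result, stored compressed
        done.append(name)
        state_file.write_text(json.dumps(done))  # save state after every stage
        executed.append(name)
    return rows, executed


# Hypothetical three-stage workflow for one customer.
STAGES = [
    ("extract", lambda _: [{"spend": " 10 "}, {"spend": "x"}, {"spend": "5"}]),
    ("scrub", lambda rows: [r for r in rows if r["spend"].strip().isdigit()]),
    ("total", lambda rows: [{"total": sum(int(r["spend"]) for r in rows)}]),
]

with tempfile.TemporaryDirectory() as workdir:
    try:
        run_workflow("customer_a", STAGES, workdir, fail_at="total")
    except RuntimeError:
        pass  # the first run dies before the last stage
    result, executed_on_restart = run_workflow("customer_a", STAGES, workdir)

print(result, executed_on_restart)
```

Because each customer's state and intermediate files live in their own directory, a failed workflow can be restarted on a fresh cluster without touching any other customer's run.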
Programs to help you get started with Big Data on AWS
• Big Data Discovery Workshop: Identify and prioritize target Big Data use cases
• EMR Bootcamp: Deploy a sample use case with real customer data
• EMR Training: 3 day intensive developer training
EMR Training Schedule
• Los Angeles, CA – 10/16-10/18
• Boston, MA – 10/30-11/1
• Mountain View, CA – 11/13-11/15
• Dallas, TX – 11/27-11/29
• New York, NY – 12/11-12/13
Visit http://bit.ly/AWS_EMR_Training for class details and registration
Questions?
Contact:
William Merchan, VP, Business Development, MarketShare, wmerchan@marketshare.com
John Gannon, Business Development Manager, AWS, jgannon@amazon.com