End Note - AWS India Summit 2012
Big Data End Note for AWS Summit in India as given by Dr. Werner Vogels

Usage Rights

© All Rights Reserved

  • Elasticity works from just 1 EC2 instance to many thousands. Just dial up and down as required.
  • Horizontal scaling on commodity hardware. Perfect for Hadoop.
  • We’ve been operating the service for over 3 years now and in the last year alone we’ve operated over 2 MILLION Hadoop clusters
  • Yelp was founded in 2004 with the main goal of helping people connect with great local businesses. The Yelp community is best known for sharing in-depth reviews and insights on local businesses of every sort. In their six years of operation Yelp went from a one-city wonder (San Francisco) to an international phenomenon spanning 8 countries and nearly 50 cities. As of November 2010, Yelp had more than 39 million unique visitors to the site and, in total, more than 14 million reviews have been posted by yelpers. Yelp has established a loyal consumer following, due in large part to the fact that they are vigilant in protecting the user from shill or suspect content. Yelp uses an automated review filter to identify suspicious content and minimize exposure to the consumer. The site also features a wide range of other features that help people discover new businesses (lists, special offers, and events), and communicate with each other. Additionally, business owners and managers are able to set up free accounts to post special offers, upload photos, and message customers. The company has also been focused on developing mobile apps and was recently voted into the iTunes Apps Hall of Fame. Yelp apps are also available for Android, Blackberry, Windows 7, Palm Pre and WAP. Local search advertising makes up the majority of Yelp’s revenue stream. The search ads are colored light orange and clearly labeled “Sponsored Results.” Paying advertisers are not allowed to change or re-order their reviews.

    Yelp originally depended upon giant RAIDs to store their logs, along with a single local instance of Hadoop. When Yelp made the move to Amazon Elastic MapReduce, they replaced the RAIDs with Amazon Simple Storage Service (Amazon S3) and immediately transferred all Hadoop jobs to Amazon Elastic MapReduce. “We were running out of hard drive space and capacity on our Hadoop cluster,” says Yelp search and data-mining engineer Dave Marin. Yelp uses Amazon S3 to store daily logs and photos, generating around 100GB of logs per day. The company also uses Amazon Elastic MapReduce to power approximately 20 separate batch scripts, most of those processing the logs. Features powered by Amazon Elastic MapReduce include:

    People Who Viewed this Also Viewed
    Review highlights
    Auto complete as you type on search
    Search spelling suggestions
    Top searches

    Their jobs are written exclusively in Python, and Yelp uses their own open-source library, mrjob, to run their Hadoop streaming jobs on Amazon Elastic MapReduce, with boto to talk to Amazon S3. Yelp also uses s3cmd and the Ruby Elastic MapReduce utility for monitoring. Yelp developers advise others working with AWS to use the boto API as well as mrjob to ensure full utilization of Amazon Elastic MapReduce job flows. Yelp runs approximately 200 Elastic MapReduce jobs per day, processing 3TB of data, and is grateful for AWS technical support that helped with their Hadoop application development. Using Amazon Elastic MapReduce, Yelp was able to save $55,000 in upfront hardware costs and get up and running in a matter of days, not months. However, most important to Yelp is the opportunity cost. “With AWS, our developers can now do things they couldn’t before,” says Marin. “Our systems team can focus their energies on other challenges.”
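Yelp's jobs are Hadoop streaming jobs: a mapper and a reducer exchanging key/value pairs, with Hadoop performing the shuffle in between. A minimal sketch of that pattern in plain Python (the log format and the in-process "shuffle" are simplified illustrations, not Yelp's actual mrjob code):

```python
from collections import Counter

def mapper(log_lines):
    # Emit (search_term, 1) for each search request in an access log.
    # The "GET /search?q=..." format is a made-up stand-in for Yelp's logs.
    for line in log_lines:
        if "/search?q=" in line:
            term = line.split("/search?q=")[1].split()[0]
            yield term, 1

def reducer(pairs):
    # Sum counts per term -- the step Hadoop runs after its shuffle/sort.
    totals = Counter()
    for term, count in pairs:
        totals[term] += count
    return totals.most_common()

logs = [
    "GET /search?q=pizza HTTP/1.1",
    "GET /search?q=sushi HTTP/1.1",
    "GET /search?q=pizza HTTP/1.1",
    "GET /biz/some-cafe HTTP/1.1",
]
print(reducer(mapper(logs)))  # -> [('pizza', 2), ('sushi', 1)]
```

With mrjob, a mapper/reducer pair like this becomes an MRJob subclass that can run unchanged on a laptop or on an Elastic MapReduce job flow.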
  • The more misspelled words you collect from your customers, the better spellcheck application you can create. Yelp is using AWS services to regularly process customer-generated data to improve spell check on their web site.
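One way such a spellchecker can work (a simplified, Norvig-style sketch with invented word frequencies, not Yelp's actual algorithm): collect observed query frequencies, then suggest the most frequently observed term within one edit of the query.

```python
from collections import Counter

# Hypothetical observed search terms, including misspellings, with counts.
observed = Counter({"restaurant": 120, "restarant": 7, "burrito": 40, "burito": 5})

def edits1(word):
    """All strings one edit (delete, replace, insert) away from word."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    replaces = {a + c + b[1:] for a, b in splits if b for c in letters}
    inserts = {a + c + b for a, b in splits for c in letters}
    return deletes | replaces | inserts

def suggest(word):
    # Prefer the most frequently observed candidate within one edit.
    candidates = edits1(word) & set(observed)
    return max(candidates, key=observed.__getitem__) if candidates else word

print(suggest("restarant"))  # -> restaurant
print(suggest("burito"))     # -> burrito
```

At Yelp's scale the frequency table is built by batch MapReduce jobs over query logs rather than held in memory.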
  • The more searches you collect, the better recommendations you can provide. Yelp is using AWS services to deliver features such as hotel or restaurant recommendations, review highlights and search hints.
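Recommendation features like "People Who Viewed this Also Viewed" can be built from simple co-occurrence counts over view logs. A toy sketch (session data and names are invented; the production version runs this kind of aggregation as MapReduce jobs over far larger logs):

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical per-session view logs: which business pages each visitor viewed.
sessions = [
    ["cafe_a", "cafe_b", "bar_c"],
    ["cafe_a", "cafe_b"],
    ["bar_c", "cafe_a", "cafe_b"],
]

# Count how often each pair of businesses appears in the same session.
co_views = defaultdict(Counter)
for viewed in sessions:
    for x, y in combinations(sorted(set(viewed)), 2):
        co_views[x][y] += 1
        co_views[y][x] += 1

def also_viewed(business, n=2):
    """Top-n businesses most often co-viewed with the given one."""
    return [b for b, _ in co_views[business].most_common(n)]

print(also_viewed("cafe_a"))  # -> ['cafe_b', 'bar_c']
```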
  • AWS Case Study: Razorfish

    Razorfish, a digital advertising and marketing firm, segments users and customers based on the collection and analysis of non-personally identifiable data from browsing sessions. Doing so requires applying data mining methods across historical click streams to identify effective segmentation and categorization algorithms and techniques. These click streams are generated when a visitor navigates a web site or catalog, leaving behind patterns that can indicate a user’s interests. Algorithms are then implemented on systems that can batch execute at the appropriate scale against current data sets ranging in size from dozens of gigabytes to terabytes. The algorithms are also customized on a client-by-client basis to observe online/offline sales and customer loyalty data. Results of the analysis are loaded into ad-serving and cross-selling systems that in turn deliver the segmentation results in real time.

    A common issue Razorfish has found with customer segmentation is the need to process gigantic click stream data sets. These large data sets are often the result of holiday shopping traffic on a retail website, or sudden dramatic growth on the data network of a media or social networking site. Building in-house infrastructure to analyze these click stream datasets requires investment in expensive “headroom” to handle peak demand. Without the expensive computing resources, Razorfish risks losing clients that require Razorfish to have sufficient resources at hand during critical moments. In addition, applications that can’t scale to handle increasingly large datasets can cause delays in identifying and applying algorithms that could drive additional revenue. As the sample data set grows (i.e. more users, more pages, more clicks), fewer applications are available that can handle the load and provide a timely response. Meanwhile, as the number of clients that utilize targeted advertising grows, access to on-demand compute and storage resources becomes a requirement. It was thus imperative for Razorfish to implement customer segmentation algorithms in a way that could be applied and executed independently of the scale of the incoming data and supporting infrastructure.

    Prior to implementing the AWS-based solution, Razorfish relied on a traditional hosting environment that utilized high-cost SAN equipment for storage, a proprietary distributed log processing cluster of 30 servers, and several high-end SQL servers. In preparation for the 2009 holiday season, demand for targeted advertising increased. To support this need, Razorfish faced a potential cost of over $500,000 in additional hardware expenses, a procurement time frame of about two months, and the need for an additional senior operations/database administrator. Furthermore, due to downstream dependencies, they needed their daily processing cycle to complete within 18 hours. However, given the increased data volume, Razorfish expected their processing cycle to extend past two days for each run even after the potential investment in human and computing resources.

    To deal with the combination of huge datasets and custom segmentation targeting activities, coupled with price-sensitive clients, Razorfish decided to move away from their rigid data infrastructure status quo. This migration helped Razorfish process vast amounts of data to handle the need for rapid scaling at both the application and infrastructure levels. Razorfish selected ad-serving integration, Amazon Web Services (AWS), Amazon Elastic MapReduce (a hosted Apache Hadoop service), Cascading, and a variety of chosen applications to power their targeted advertising system based on these benefits:

    Efficient: Elastic infrastructure from AWS allows capacity to be provisioned as needed based on load, reducing cost and the risk of processing delays. Amazon Elastic MapReduce and Cascading let Razorfish focus on application development without having to worry about time-consuming set-up, management, or tuning of Hadoop clusters or the compute capacity upon which they sit.
    Ease of integration: Amazon Elastic MapReduce with Cascading allows data processing in the cloud without any changes to the underlying algorithms.
    Flexible: Hadoop with Cascading is flexible enough to allow “agile” implementation and unit testing of sophisticated algorithms.
    Adaptable: Cascading simplifies the integration of Hadoop with external ad systems.
    Scalable: AWS infrastructure helps Razorfish reliably store and process huge (petabyte-scale) data sets.

    The AWS elastic infrastructure platform allows Razorfish to manage wide variability in load by provisioning and removing capacity as needed. Mark Taylor, Program Director at Razorfish, said, “With our implementation of Amazon Elastic MapReduce and Cascading, there was no upfront investment in hardware, no hardware procurement delay, and no additional operations staff was hired. We completed development and testing of our first client project in six weeks. Our process is completely automated. Total cost of the infrastructure averages around $13,000 per month. Because of the richness of the algorithm and the flexibility of the platform to support it at scale, our first client campaign experienced a 500% increase in their return on ad spend from a similar campaign a year before.”
  • Big data, the term for scanning loads of information for possibly profitable patterns, is a growing sector of corporate technology. Mostly people think in terms of online behavior, like mouse clicks, LinkedIn affiliations and Amazon shopping choices. But other big databases in the real world, lying around for years, are there to exploit.

    A company called the Climate Corporation was formed in 2006 by two former Google employees who wanted to make use of the vast amount of free data published by the National Weather Service on heat and precipitation patterns around the country. At first they called the company WeatherBill, and used the data to sell insurance to businesses that depended heavily on the weather, from ski resorts and miniature golf courses to house painters and farmers. It did pretty well, raising more than $50 million from the likes of Google Ventures, Khosla Ventures, and Allen & Company. The problem was, it was hard to sell insurance policies to so many little businesses, even using an online shopping model. People like having their insurance explained. The answer was to get even more data, and focus on the agriculture market through the same sales force that sells federal crop insurance.

    “We took 60 years of crop yield data, and 14 terabytes of information on soil types, every two square miles for the United States, from the Department of Agriculture,” says David Friedberg, chief executive of the Climate Corporation, a name WeatherBill started using Tuesday. “We match that with the weather information for one million points the government scans with Doppler radar — this huge national infrastructure for storm warnings — and make predictions for the effect on corn, soybeans and winter wheat.”

    The product, insurance against things like drought, too much rain at the planting or the harvest, or an early freeze, is sold through 10,000 agents nationwide. The Climate Corporation, which also added Byron Dorgan, the former senator from North Dakota, to its board on Tuesday, will very likely get into insurance for specialty crops like tomatoes and grapes, which do not have federal insurance. Like the weather information, the data on soils was free for the taking. The hard and expensive part is turning the data into a product. Mr. Friedberg was an early member of the corporate development team at Google. The co-founder, Siraj Khaliq, worked in distributed computing, which involves apportioning big data computing problems across multiple machines. He works as the Climate Corporation’s chief technical officer. Out of the staff of 60 in the company’s San Francisco office (another 30 work in the field) about 12 have doctorates, in areas like environmental science and applied mathematics. “They like that this is a real-world problem, not just clicks on a Web site,” Mr. Friedberg says. He figures that the Climate Corporation is one of the world’s largest users of MapReduce, an increasingly popular software technique for making sense of very large data systems. The number crunching is performed on Amazon.com’s Amazon Web Services computers. The Climate Corporation is working with data intended to judge how different crops will react to certain soils, water and heat. It might be valuable to commodities traders as well, but Mr. Friedberg figures the better business is to expand in farming. Besides the other crops, he is looking at offering the service in Canada and Brazil, or anywhere else that he can get decent long-term data. It’s unlikely he’ll get the quality he got from the federal government, for a price anywhere near “free.”

    The Climate Corporation

    Key Takeaways: Cascading provides data scientists at The Climate Corporation a solid foundation to develop advanced machine learning applications in Cascalog that get deployed directly onto Amazon EMR clusters consisting of 2000+ cores. This results in significantly improved productivity with lower operating costs.

    Solution: Data scientists at The Climate Corporation chose to create their algorithms in Cascalog, which is a high-level Clojure-based machine learning language built on Cascading. Cascading is an advanced Java application framework that abstracts the MapReduce APIs in Apache Hadoop and provides developers with a simplified way to create powerful data processing workflows. Programming in Cascalog, data scientists create compact expressions that represent complex batch-oriented AI and machine learning workflows. This results in improved productivity for the data scientists, many of whom are mathematicians rather than computer scientists. It also gives them the ability to quickly analyze complex data sets without having to create large complicated programs in MapReduce. Furthermore, programmers at The Climate Corporation also use Cascading directly for creating jobs inside Hadoop streaming to process additional batch-oriented data workflows. All these workflows and data processing jobs are deployed directly onto Amazon Elastic MapReduce in their own dedicated clusters. Depending on the size of data sets and the complexity of the algorithms, clusters consisting of up to 200 processor cores are utilized for data normalization workflows, and clusters consisting of over 2000 processor cores are utilized for risk analysis and climate modeling workflows.

    Benefits: By utilizing Amazon Elastic MapReduce and Cascalog, data scientists at The Climate Corporation are able to focus on solving business challenges rather than worrying about setting up a complex infrastructure or trying to figure out how to use it to process the vast amounts of complex data. The Climate Corporation is able to effectively manage its costs by using Amazon Elastic MapReduce and dedicated cluster resources for each workflow individually. This allows them to utilize the resources only when they are needed, rather than invest in hardware resources and systems administrators to manage their own private shared cluster, where they would have to optimize their workflows and schedule them to avoid resource contention. Furthermore, Cascading provides data scientists at The Climate Corporation a common foundation for creating both their batch-oriented machine learning workflows in Cascalog, and Hadoop streaming workflows directly in Cascading. These applications are developed locally on the developers’ desktops, and then get instantly deployed onto dedicated Amazon Elastic MapReduce clusters for testing and production use. This minimizes the amount of iterative utilization of the cluster resources, thus allowing The Climate Corporation to manage its costs by utilizing the infrastructure for productive data processing only.
  • In 2009, the company acquired Adtuitive, a startup Internet advertising company. Adtuitive’s ad server was completely hosted on Amazon Web Services and served targeted retail ads at a rate of over 100 million requests per month. Adtuitive’s configuration included 50 Amazon Elastic Compute Cloud (Amazon EC2) instances, Amazon Elastic Block Store (Amazon EBS) volumes, Amazon CloudFront, Amazon Simple Storage Service (Amazon S3), and a data warehouse pipeline built on Amazon Elastic MapReduce. Amazon Elastic MapReduce runs a custom domain-specific language that uses the Cascading application programming interface. Today, Etsy uses Amazon Elastic MapReduce for web log analysis and recommendation algorithms. Because AWS easily and economically processes enormous amounts of data, it’s ideal for the type of processing that Etsy performs. Etsy copies its HTTP server logs every hour to Amazon S3, and syncs snapshots of the production database on a nightly basis. The combination of Amazon’s products and Etsy’s syncing/storage operation provides substantial benefits for Etsy. As Dr. Jason Davis, lead scientist at Etsy, explains, “the computing power available with [Amazon Elastic MapReduce] allows us to run these operations over dozens or even hundreds of machines without the need for owning the hardware.”
  • “Wakoopa understands what people do in their digital lives. In a privacy-conscious way, our technology tracks what websites they visit, what ads they see, or what apps they use. By using our online research dashboard, you can optimize your digital strategy accordingly. Our clients range from research firms such as TNS and Synovate to companies like Google and Sanoma. Essentially, we’re the Lonely Planet of the digital world.”
  • Kamek is a server created by Wakoopa that computes metrics (such as bounce rate or pageviews) from millions of visits and visitors, all in a couple of seconds, all in real time.
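Metrics like pageviews and bounce rate reduce to simple aggregations over visit records. A toy illustration of the arithmetic (the record format is invented; Wakoopa's actual implementation is not shown here):

```python
# Hypothetical visit records: (visitor_id, pages_viewed_in_session)
visits = [
    ("v1", 1),   # single-page session: a bounce
    ("v2", 5),
    ("v3", 1),
    ("v2", 3),
]

pageviews = sum(pages for _, pages in visits)
sessions = len(visits)
unique_visitors = len({vid for vid, _ in visits})
bounces = sum(1 for _, pages in visits if pages == 1)
bounce_rate = bounces / sessions  # single-page sessions / all sessions

print(pageviews, unique_visitors, f"{bounce_rate:.0%}")  # -> 10 3 50%
```

The "real-time" part of a system like Kamek is about maintaining these running sums incrementally as events arrive, rather than recomputing over the full history.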
  • Netflix has more than 25 million streaming members and is growing rapidly. Their end users stream movies and TV shows from smart TVs, laptops, phones, and tablets, resulting in over 50 billion events per day.
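50 billion events per day is easier to appreciate as a sustained rate; a quick back-of-the-envelope conversion:

```python
events_per_day = 50_000_000_000
seconds_per_day = 24 * 60 * 60  # 86,400

events_per_second = events_per_day / seconds_per_day
print(round(events_per_second))  # -> 578704 events per second, on average
```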
  • Netflix stores all of this data in Amazon S3, approximately 1 Petabyte.
  • AWS Case Study: Ticketmaster and MarketShare

    The Business Challenges: The Pricemaster application is a web-based tool designed to optimize live event ticket pricing, improve yield management and generate incremental revenue. The tool takes a holistic approach to maximizing ticket revenue: it optimizes pre-sale and initial pricing all the way through dynamic pricing post on-sale. However, before development could begin, MarketShare had to find an infrastructure that could support the application’s dual challenges: limited upfront capital and managing the fluctuating nature of analytic workloads.

    Amazon Web Services: After examining their options, MarketShare decided to power Pricemaster using Amazon Web Services (AWS). The AWS feature stack provides the scalability, usability, and on-demand pricing required to support the application’s intricate cluster architecture and complex MATLAB simulations. Pricemaster’s AWS environment includes four large and extra large Amazon EC2 instances supporting a variety of nodes. The pricing application’s Amazon EC2 instances are connected to a central database within Amazon RDS. In addition, Pricemaster’s AWS infrastructure includes Amazon ELB for traffic distribution, Amazon SimpleDB for non-relational data storage, Amazon Elastic MapReduce for large-scale data processing, as well as Amazon SES. The Pricemaster team monitors all of these resources with Amazon CloudWatch.

    The Business Benefits: The Pricemaster team credits AWS’s ease of use, specifically that of Amazon Elastic MapReduce and Amazon RDS, with reducing its developers’ infrastructure management time by three hours per day, valuable hours the developers can now spend expanding the capabilities of the Pricemaster solution. With AWS’s on-demand pricing, MarketShare also estimates that it reduces costs by over 80% annually, compared to fixed service costs. As the Pricemaster tool continues to grow, the company anticipates even further savings with Amazon Web Services. MarketShare continues to expand its use of AWS for partners such as Ticketmaster, saving time and money while providing a superior solution that is flexible, secure and scalable.
  • For example, one of our customers, FourSquare, has built this visualization of customer sign-ups from November 2008 to June 2011. This visualization helps understand global service adoption over time. You can create similar visualizations with packages such as ggplot2 or the base R graphics package.

End Note - AWS India Summit 2012 Presentation Transcript

  • Data without Limits. Dr. Werner Vogels, CTO, Amazon.com
  • Human Genome Project: Collaborative project to sequence every single letter of the human genetic code. 13 years and $billions to complete. Gigabyte scale datasets (transferred between sites on iPods!)
  • Beyond the Human Genome: 45+ species sequenced: mouse, rat, gorilla, rabbit, platypus, nematode, zebra fish... Compare genomes between species to identify biologically interesting areas of the genome. 100GB scale datasets. Increased computational requirements.
  • The Next Generation: New sequencing instruments lead to a dramatic drop in cost and time required to sequence a genome. Sequence and compare genetic code of individuals to find areas of variation. Much more interesting. Terabyte scale datasets. Significant computational requirements.
  • The 1000 Genomes Project: Public/private consortium to build the world’s largest collection of human genetic variation. Hugely important dataset to drive new insight into known genetic traits, and the identification of new ones. Vast, complex data and computational resources required, beyond the reach of most research groups and hospitals.
  • 1000 Genomes in the Cloud: The 1000 Genomes data made available to all on AWS. Stored for free as part of the Public Datasets program. Updated regularly. 200TB. 1700 individual genomes. As much compute and storage as required, available to all.
  • The Cloud: Helps do the science we are capable of
  • 50,000-core CycleCloud Super Computer running on the Amazon Cloud
  • How big is 50,000 cores? Why does it matter?
  • (W.H.O./Globocan 2008)
  • Every day is crucial and costly
  • Find matches in millions of keys
  • Challenge: To run a virtual screen with a higher accuracy algorithm & 21 million compounds
  • Using CycleCloud & Amazon Cloud, the impossible run finished in...
    Compute Hours of Work: 109,927 hours
    Compute Days of Work: 4,580 days
    Compute Years of Work: 12.55 years
    Ligand Count: ~21 million ligands
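The day and year figures on this slide follow directly from the 109,927 compute-hours, and dividing by the 50,000 cores shows why the run could finish in roughly 3 hours of wall-clock time:

```python
compute_hours = 109_927
cores = 50_000

# The slide's day/year figures are unit conversions of the hour count.
days = compute_hours / 24
years = compute_hours / (24 * 365)
print(round(days), round(years, 2))    # -> 4580 12.55

# Ideal wall-clock time across 50,000 cores; the actual run took ~3 hours,
# the difference being scheduling, data staging and other overhead.
print(round(compute_hours / cores, 1))  # -> 2.2
```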
  • 3 Hours for $4,828.85/hr
  • Instead of $20+ Million in Infrastructure
  • Every day is crucial and costly
  • Big Data powered by AWS. BIG DATA: The collection and analysis of large amounts of data to create a competitive advantage
  • Big Data powered by AWS. Big Data Verticals:
    Media/Advertising: Targeted Advertising; Image and Video Analysis
    Oil & Gas: Seismic Analysis
    Retail: Recommendations; Transaction Analysis
    Life Sciences: Genome Analysis; Image Processing
    Financial Services: Monte Carlo Simulations; Risk Analysis
    Security: Anti-virus; Fraud Detection; Image Recognition
    Social Network/Gaming: User Demographics; Usage Analysis; In-game metrics
  • Big Data powered by AWS. Storage Big Data Compute: Challenges start at relatively small volumes, from 100 GB to 1,000 PB
  • Big Data powered by AWS. Storage Big Data Compute: When data sets and data analytics need to scale to the point that you have to start innovating around how to collect, store, organize, analyze and share it
  • Big Data powered by AWS
  • Big Data powered by AWS. Innovation in storage (S3, Glacier, DynamoDB) and compute (EMR, HPC, Spot)
  • Storage Big Data Compute. Unconstrained data growth: 95% of the 1.2 zettabytes of data in the digital universe is unstructured. 70% of this is user-generated content. Unstructured data growth is explosive, with estimates of compound annual growth (CAGR) at 62% from 2008–2012. Source: IDC
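A 62% compound annual growth rate compounds quickly: over the four years from 2008 to 2012 it implies nearly a sevenfold increase in data volume.

```python
cagr = 0.62
years = 4  # 2008 -> 2012

growth_factor = (1 + cagr) ** years
print(round(growth_factor, 1))  # -> 6.9, i.e. ~6.9x growth over four years
```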
  • Storage Big Data Compute. Why now?
    Web sites: Blogs/Reviews/Emails/Pictures
    Social Graphs: Facebook, Linked-in, Contacts
    Application server logs: Web sites, games
    Sensor data: Weather, water, smart grids
    Images/videos: Traffic, security cameras
    Twitter: 50m tweets/day; 1,400% growth per year
  • Storage Big Data Compute. Why now? Mobile connected world (more people using, easier to collect)
    Web sites: Blogs/Reviews/Emails/Pictures
    Social Graphs: Facebook, Linked-in, Contacts
    Application server logs: Web sites, games
    Sensor data: Weather, water, smart grids
    Images/videos: Traffic, security cameras
    Twitter: 50m tweets/day; 1,400% growth per year
  • Storage Big Data Compute. Why now? More aspects of data (variety, depth, location, frequency)
    Web sites: Blogs/Reviews/Emails/Pictures
    Social Graphs: Facebook, Linked-in, Contacts
    Application server logs: Web sites, games
    Sensor data: Weather, water, smart grids
    Images/videos: Traffic, security cameras
    Twitter: 50m tweets/day; 1,400% growth per year
  • Storage Big Data Compute. Why now? Possible to understand (not just answer specific questions)
    Web sites: Blogs/Reviews/Emails/Pictures
    Social Graphs: Facebook, Linked-in, Contacts
    Application server logs: Web sites, games
    Sensor data: Weather, water, smart grids
    Images/videos: Traffic, security cameras
    Twitter: 50m tweets/day; 1,400% growth per year
  • Storage Big Data Compute. Why now? Who is your consumer really? What do people really like? What is happening socially with your products? How do people really use your product?
  • Storage Big Data Compute. Why now? More data => better results
    Web sites: Blogs/Reviews/Emails/Pictures
    Social Graphs: Facebook, Linked-in, Contacts
    Application server logs: Web sites, games
    Sensor data: Weather, water, smart grids
    Images/videos: Traffic, security cameras
    Twitter: 50m tweets/day; 1,400% growth per year
  • BIGGER IS BETTER
  • UNCERTAINTY
  • Big Data requires NO LIMITS
  • Storage Big Data Compute From one instance…
  • Storage Big Data Compute …to thousands
  • Storage Big Data Compute and back again…
  • Big Data PipelineCollect | Store | Organize | Analyze | Share
  • Storage Big Data Compute. Where do you put your slice of it? Collection - Ingestion:
    AWS Direct Connect: Dedicated bandwidth between your site and AWS
    AWS Import/Export: Physical transfer of media into and out of AWS
    Queuing: Reliable messaging for task distribution & collection
    Amazon Storage Gateway: Shrink-wrapped gateway for volume synchronization
  • Storage Big Data Compute. Where do you put your slice of it?
    Relational Database Service: Fully managed database (MySQL, Oracle, MSSQL)
    DynamoDB: NoSQL, schemaless, provisioned throughput database
    Simple Storage Service (S3): Object datastore up to 5TB per object; 99.999999999% durability
  • Storage Big Data Compute. Where do you put your slice of it? Glacier: Long term cold storage. From $0.01 per GB/month. 99.999999999% durability.
  • Storage Big Data Compute. Glacier - Full lifecycle big data management:
    Data import: Physical shipping of devices for creation of data in AWS. e.g. 50TB of seismic data created as EBS volumes in a Gluster file system.
    Computation & Visualization: HPC & EMR cluster jobs of many thousands of cores. e.g. 200TB of visualization data generated from cluster processing.
    Long term archive: Once data analysis is complete, the entire resultant dataset is placed in cold storage rather than tape. Cost effective when compared to tape; retrieval in 3-5 hours if required.
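At the $0.01 per GB/month Glacier price quoted above, archiving a dataset like the 200TB example is straightforward to estimate (a sketch assuming 1TB = 1,024GB; retrieval and request fees are ignored):

```python
price_per_gb_month = 0.01       # Glacier storage price quoted on the slide
dataset_tb = 200                # e.g. the visualization dataset above
dataset_gb = dataset_tb * 1024  # 204,800 GB

monthly_cost = dataset_gb * price_per_gb_month
print(round(monthly_cost, 2))   # -> 2048.0 dollars per month
```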
  • Storage Big Data Compute: How quickly do you need to read it?
    - DynamoDB (single-digit ms): social-scale applications; provisioned throughput performance; flexible consistency models
    - S3 (10s-100s ms): any object, any app; 99.999999999% durability; objects up to 5TB in size
    - Glacier (<5 hours): media & asset archives; extremely low cost; S3 levels of durability
    Performance Scale Price
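The latency tiers above suggest a simple decision rule. A sketch in plain Python (the service names come from the slide; the thresholds are the latency figures quoted above, used here as rough cut-offs):

```python
def choose_store(max_read_latency_ms):
    """Pick a storage service by the read latency an application can tolerate.

    Thresholds follow the slide: DynamoDB for single-digit ms,
    S3 for 10s-100s of ms, Glacier when hours are acceptable.
    """
    if max_read_latency_ms < 10:
        return "DynamoDB"   # social-scale apps, provisioned throughput
    if max_read_latency_ms < 5 * 60 * 60 * 1000:  # anything under ~5 hours
        return "S3"         # any object, any app, 11 nines of durability
    return "Glacier"        # cold archives, lowest cost

print(choose_store(5))                    # DynamoDB
print(choose_store(200))                  # S3
print(choose_store(6 * 60 * 60 * 1000))  # Glacier
```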
  • Storage Big Data Compute Operate at any scale Unlimited data Performance Scale Price
  • Storage Big Data Compute: Pay for only what you use
    - Provisioned IOPS: provisioned read/write performance per DynamoDB table / EBS volume; pay for the provisioned capacity whether used or not
    - Volume used: pay for volume stored per month & puts/gets; no capacity planning required to maintain unlimited storage
    Performance Scale Price
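The two pricing models can be contrasted with a toy calculation (the rates below are illustrative placeholders, not actual AWS prices):

```python
def provisioned_cost(capacity_units, hours, rate_per_unit_hour):
    """Provisioned model: pay for reserved capacity whether used or not."""
    return capacity_units * hours * rate_per_unit_hour

def usage_cost(gb_stored, requests, rate_per_gb_month, rate_per_request):
    """Usage model: pay only for bytes actually stored and requests made."""
    return gb_stored * rate_per_gb_month + requests * rate_per_request

# A table provisioned at 100 units costs the same at 0% or 100% utilization...
idle = provisioned_cost(100, 720, 0.0001)
busy = provisioned_cost(100, 720, 0.0001)

# ...while usage-based storage tracks actual consumption.
small = usage_cost(10, 1_000, 0.10, 0.00001)       # ~$1.01/month
large = usage_cost(1_000, 100_000, 0.10, 0.00001)  # ~$101.00/month
```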
  • Storage Big Data Compute: "Big data" changes the dynamics of computation and data sharing
    - Collection: how do I acquire it? Where do I put it?
    - Computation: what horsepower can I apply to it?
    - Collaboration: how do I work with others on it?
  • Storage Big Data Compute: "Big data" changes the dynamics of computation and data sharing
    - Collection (how do I acquire it? where do I put it?): Direct Connect, Import/Export, S3, DynamoDB
    - Computation (what horsepower can I apply to it?): EC2, GPUs, Elastic MapReduce
    - Collaboration (how do I work with others on it?): CloudFormation, Simple Workflow, S3
  • Amazon Elastic MapReduce
  • Storage Big Data Compute: Hadoop-as-a-Service, Elastic MapReduce
    - Managed, elastic Hadoop cluster
    - Integrates with S3 & DynamoDB
    - Leverage Hive & Pig analytics scripts
    - Integrates with instance types such as Spot
  • Elastic MapReduce: managed, elastic Hadoop cluster; integrates with S3 & DynamoDB; leverage Hive & Pig analytics scripts; integrates with instance types such as Spot
    - Scalable: use as many or as few compute instances running Hadoop as you want; modify the number of instances while your job flow is running
    - Integrated with other services: works seamlessly with S3 as origin and output; integrates with DynamoDB
    - Comprehensive: supports languages such as Hive and Pig for defining analytics, and allows complex definitions in Cascading, Java, Ruby, Perl, Python, PHP, R, or C++
    - Cost effective: works with Spot instance types
    - Monitoring: monitor job flows from within the management console
  • But what is it?
  • A framework that splits data into pieces, lets processing occur in parallel, and gathers the results
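Those three steps can be sketched in plain Python, with a word count standing in for any Hadoop job (this is an in-process illustration of the MapReduce model, not the EMR API):

```python
from collections import defaultdict

def split(data, n):
    """Split the input into roughly n pieces."""
    step = max(1, len(data) // n)
    return [data[i:i + step] for i in range(0, len(data), step)]

def map_piece(lines):
    """Map: emit (word, 1) for every word in one piece."""
    return [(word, 1) for line in lines for word in line.split()]

def reduce_all(mapped_pieces):
    """Reduce: gather the per-piece results and sum them."""
    totals = defaultdict(int)
    for piece in mapped_pieces:
        for word, count in piece:
            totals[word] += count
    return dict(totals)

log = ["big data", "big clusters", "big data pipelines"]
pieces = split(log, 3)                   # splits data into pieces
mapped = [map_piece(p) for p in pieces]  # processing occurs (in parallel on EMR)
print(reduce_all(mapped)["big"])         # gathers the results -> 3
```

On EMR the same shape holds, except each piece is processed on a different node of the Hadoop cluster and the shuffle happens over HDFS.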
  • Input data in S3 + DynamoDB
  • Input data + your code submitted to Elastic MapReduce
  • Elastic MapReduce provisions a name node
  • The name node manages an elastic cluster
  • The cluster nodes share data via HDFS
  • Queries & BI run against the cluster via JDBC, Pig, Hive
  • Output is written back to S3 + DynamoDB
  • Input data (S3 + DynamoDB) in, output (S3 + DynamoDB) out
  • A very large click log (e.g. TBs)
  • Lots of actions by John Smith buried in a very large click log (e.g. TBs)
  • Split the log into many small pieces
  • Process the pieces in an EMR cluster
  • Aggregate the results from all the nodes
  • What did John Smith do?
  • Insight in a fraction of the time
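The same split/process/aggregate flow applied to the click-log example, as a toy in-process sketch (on EMR each chunk would go to a different node; the log lines and user names are made up):

```python
from collections import Counter

clicks = [
    "john.smith viewed product 17",
    "jane.doe viewed product 9",
    "john.smith added product 17 to cart",
    "john.smith checked out",
]

# Split the log into many small pieces (2 lines per chunk here).
chunks = [clicks[i:i + 2] for i in range(0, len(clicks), 2)]

# Process each chunk independently, as each EMR node would.
def actions_per_user(chunk):
    return Counter(line.split()[0] for line in chunk)

partials = [actions_per_user(c) for c in chunks]

# Aggregate the results from all the nodes.
totals = sum(partials, Counter())
print(totals["john.smith"])  # 3 actions by John Smith
```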
  • 1 instance for 100 hours = 100 instances for 1 hour
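The elasticity claim is simple arithmetic: total cost depends only on instance-hours, so 1 x 100 and 100 x 1 price out identically (the $0.08/hour rate is illustrative, chosen to match the $8 figure on the next slide for 100 instance-hours):

```python
RATE = 0.08  # illustrative per-hour rate for a small instance

def cost(instances, hours, rate=RATE):
    """Total cost = instance-hours consumed x hourly rate."""
    return instances * hours * rate

serial = cost(1, 100)    # one machine grinding for 100 hours
parallel = cost(100, 1)  # a 100-node cluster finishing in 1 hour
print(serial, parallel)  # same price, answer 100x sooner
```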
  • Small instance = $8
  • Operated 2 million+ Hadoop clusters last year
  • Features powered by Amazon Elastic MapReduce: People Who Viewed This Also Viewed, review highlights, autocomplete as you type on search, search spelling suggestions, top searches, ads. 200 Elastic MapReduce jobs per day, processing 3TB of data.
  • Features driven by MapReduce
  • Storage Big Data Compute: Hadoop-as-a-Service, Elastic MapReduce. "With Amazon Elastic MapReduce, there was no upfront investment in hardware, no hardware procurement delay, and no need to hire additional operations staff. Because of the flexibility of the platform, our first new online advertising campaign experienced a 500% increase in return on ad spend from a similar campaign a year before."
  • Data analytics: 3.5 billion records, 71 million unique cookies, 1.7 million targeted ads required per day. Execute batch processing on data sets ranging in size from dozens of gigabytes to terabytes. Building in-house infrastructure to analyze these click-stream datasets requires investment in expensive "headroom" to handle peak demand. "Our first client campaign experienced a 500% increase in their return on ad spend from a similar campaign a year before." Example targeted ad (1.7 million per day): a user recently purchased a sports movie and is searching for video games.
  • "AWS gave us the flexibility to bring a massive amount of capacity online in a short period of time and allowed us to do so in an operationally straightforward way. AWS is now Shazam's cloud provider of choice." (Jason Titus, CTO). DynamoDB: over 500,000 writes per second. Amazon EMR: more than 1 million writes per second.
  • Step 1 (Tracking): we've created a unique tracking application that keeps track of all websites visited, software used, and/or ads seen. Step 2 (Panel): we invite members of a research panel to install it; we know not only their digital habits, but also their offline demographics and behavior. Step 3 (Dashboard): usage data begins to pour into the Wakoopa dashboard in real time; log in, create beautiful visualizations and useful reports.
  • Technology: panel activity data -> SQS -> EMR (Kamek*) -> metrics in RDS and data in S3 -> Wakoopa dashboard
  • Rediff uses Amazon EMR along with Amazon S3 to perform data mining, log processing and analytics for their online business. Inputs gained are used to power a better user experience on their portal. Rediff needed 12-15 hours to run this on a 10-12 node cluster on premises. AWS gave the choice and flexibility of an on-demand model which can be scaled up and down, and shortened the time required to process data.
  • More than 25 million streaming members; 50 billion events per day
  • ~1 PB of data stored in Amazon S3
  • Users over time
  • Leader in 2011 Gartner IaaS Magic Quadrant
  • Cloud enables big data collection
  • Cloud enables big data processing
  • Cloud enables big data collaboration
  • aws.amazon.com get started with the free tier