Data without Limits. Dr. Werner Vogels, CTO, Amazon.com
Human Genome Project: a collaborative project to sequence every single letter of the human genetic code. 13 years and billions of dollars to complete. Gigabyte-scale datasets (transferred between sites on iPods!).
Beyond the Human Genome: 45+ species sequenced: mouse, rat, gorilla, rabbit, platypus, nematode, zebrafish... Compare genomes between species to identify biologically interesting areas of the genome. 100 GB-scale datasets; increased computational requirements.
The Next Generation: new sequencing instruments lead to a dramatic drop in the cost and time required to sequence a genome. Sequence and compare the genetic code of individuals to find areas of variation. Much more interesting. Terabyte-scale datasets; significant computational requirements.
The 1000 Genomes Project: a public/private consortium to build the world's largest collection of human genetic variation. A hugely important dataset to drive new insight into known genetic traits and the identification of new ones. Vast, complex data and computational resources required, beyond the reach of most research groups and hospitals.
1000 Genomes in the Cloud: the 1000 Genomes data made available to all on AWS. Stored for free as part of the Public Datasets program and updated regularly. 200 TB; 1,700 individual genomes. As much compute and storage as required, available to all.
The Cloud helps do the science we are capable of.
A 50,000-core CycleCloud supercomputer running on the Amazon cloud.
How big is 50,000 cores?Why does it matter?
(Source: W.H.O./Globocan 2008)
Every day is crucial and costly
Find matches in millions of keys
Challenge: run a virtual screen with a higher-accuracy algorithm against 21 million compounds
Metric / Count: compute hours of work, 109,927 hours; compute days of work, 4,580 days; compute years of work, 12.55 years; ligand count, ~21 million ligands. Using CycleCloud and the Amazon cloud, the impossible run finished in...
3 hours, for $4,828.85/hr
Instead of $20+ million in infrastructure
Every day is crucial and costly
Big Data powered by AWS. Big data: the collection and analysis of large amounts of data to create a competitive advantage.
Big Data verticals: Social Media/Advertising, Oil & Gas, Retail, Life Sciences, Financial Services, Security, Network/Gaming. Example workloads: user demographics, targeted advertising, recommendations, Monte Carlo simulations, seismic analysis, genome analysis, fraud detection, usage analysis, anti-virus, image and video analysis, transaction analysis, risk analysis, image processing, image recognition, in-game metrics.
Storage | Big Data | Compute: the challenges start at relatively small volumes, from 100 GB up to 1,000 PB.
Big data is when data sets and data analytics need to scale to the point that you have to start innovating around how you collect, store, organize, analyze and share them.
Big Data powered by AWS
Big Data powered by AWS: innovation across Storage and Compute. On the storage side DynamoDB, Glacier and S3; on the compute side HPC, EMR and Spot.
Unconstrained data growth: 95% of the 1.2 zettabytes of data in the digital universe is unstructured, and 70% of this is user-generated content. Unstructured data growth is explosive, with estimated compound annual growth (CAGR) of 62% from 2008 to 2012. (Source: IDC)
Why now? The data sources are everywhere: web sites (blogs, reviews, emails, pictures), social graphs (Facebook, LinkedIn, contacts), application server logs (web sites, games), sensor data (weather, water, smart grids), images and video (traffic, security cameras), Twitter (50M tweets/day, growing 1,400% per year).
Why now? A mobile, connected world: more people using, easier to collect.
Why now? More aspects of data: variety, depth, location, frequency.
Why now? It becomes possible to understand, not just to answer specific questions.
Why now? Who is your consumer really? What do people really like? What is happening socially with your products? How do people really use your product?
Why now? More data leads to better results.
BIGGER IS BETTER
UNCERTAINTY
Big Data requires NO LIMITS
From one instance…
…to thousands
…and back again
Big Data Pipeline: Collect | Store | Organize | Analyze | Share
Where do you put your slice of it? Collection and ingestion: AWS Direct Connect (dedicated bandwidth between your site and AWS), AWS Import/Export (physical transfer of media into and out of AWS), queuing (reliable messaging for task distribution and collection), and AWS Storage Gateway (a shrink-wrapped gateway for volume synchronization).
Where do you put your slice of it? Relational Database Service (RDS): a fully managed database (MySQL, Oracle, MSSQL). DynamoDB: a NoSQL, schemaless database with provisioned throughput. Simple Storage Service (S3): an object datastore with objects up to 5 TB and 99.999999999% durability.
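As a rough illustration of the object-store option (not code from the talk), here is a minimal sketch of writing and reading an S3 object with the boto library of the time; the bucket name, key name, and file names are hypothetical.

    # Minimal sketch: store and fetch an object in S3 with boto.
    # Bucket and key names below are hypothetical.
    import boto
    from boto.s3.key import Key

    s3 = boto.connect_s3()                     # credentials come from the environment / boto config
    bucket = s3.create_bucket('example-big-data-bucket')

    key = Key(bucket)
    key.key = 'clicklogs/2012-06-27.log.gz'    # individual objects can be up to 5 TB
    key.set_contents_from_filename('2012-06-27.log.gz')

    # Read it back later, e.g. as EMR input or for ad-hoc analysis.
    key.get_contents_to_filename('restored.log.gz')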
Where do you put your slice of it? Glacier: long-term cold storage from $0.01 per GB/month, with 99.999999999% durability.
Glacier enables full-lifecycle big data management. Data import: physical shipping of devices to create the data in AWS (e.g. 50 TB of seismic data created as EBS volumes in a Gluster file system). Computation and visualization: HPC and EMR cluster jobs across many thousands of cores (e.g. 200 TB of visualization data generated from cluster processing). Long-term archive: once analysis is complete, the entire resultant dataset is placed in cold storage rather than on tape; cost effective compared to tape, with retrieval in 3-5 hours if required.
How quickly do you need to read it? DynamoDB: single-digit milliseconds; social-scale applications, provisioned throughput performance, flexible consistency models. S3: tens to hundreds of milliseconds; any object, any app, 99.999999999% durability, objects up to 5 TB. Glacier: under 5 hours; media and asset archives, extremely low cost, S3 levels of durability. Performance | Scale | Price.
Operate at any scale, with unlimited data. Performance | Scale | Price.
Pay for only what you use. Provisioned IOPS: provisioned read/write performance per DynamoDB table or EBS volume; you pay for the provisioned capacity whether it is used or not. Volume used: you pay for the volume stored per month plus puts and gets; no capacity planning is required to maintain unlimited storage. Performance | Scale | Price.
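To make the provisioned-capacity model concrete, a sketch of creating a DynamoDB table with explicit read and write units using the boto API of that era; the table name, key names, item attributes and the 10/5 capacity figures are all hypothetical.

    # Minimal sketch: a DynamoDB table with provisioned throughput via boto (layer2 API).
    # Table name, key names, and capacity figures are hypothetical.
    import boto.dynamodb

    ddb = boto.dynamodb.connect_to_region('us-east-1')

    schema = ddb.create_schema(hash_key_name='user_id', hash_key_proto_value=str,
                               range_key_name='timestamp', range_key_proto_value=int)

    # You pay for these read/write units whether or not they are consumed.
    table = ddb.create_table(name='click_events', schema=schema,
                             read_units=10, write_units=5)

    item = table.new_item(hash_key='john.smith', range_key=1340800000,
                          attrs={'action': 'viewed_product', 'product_id': 'B000123'})
    item.put()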
"Big data" changes the dynamics of computation and data sharing: collection (how do I acquire it, and where do I put it?), computation (what horsepower can I apply to it?), and collaboration (how do I work with others on it?).
Collection: Direct Connect, Import/Export, S3, DynamoDB. Computation: EC2, GPUs, Elastic MapReduce. Collaboration: CloudFormation, Simple Workflow, S3.
Amazon Elastic MapReduce
Hadoop as a Service: Elastic MapReduce. A managed, elastic Hadoop cluster that integrates with S3 and DynamoDB, lets you leverage Hive and Pig analytics scripts, and works with instance types such as Spot.
Elastic MapReduce features: Scalable (use as many or as few compute instances running Hadoop as you want, and modify the number of instances while your job flow is running). Integrated with other services (works seamlessly with S3 as origin and output, and integrates with DynamoDB). Comprehensive (supports languages such as Hive and Pig for defining analytics, and allows complex definitions in Cascading, Java, Ruby, Perl, Python, PHP, R, or C++). Cost effective (works with Spot instance types). Monitoring (monitor job flows from the management console).
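As an illustration of how a job flow might be launched programmatically (not the talk's own code), a minimal sketch using boto's EMR module with a Hadoop streaming step; the S3 paths, script names, and instance settings are hypothetical.

    # Minimal sketch: launch an Elastic MapReduce job flow with boto.
    # All S3 paths, script names, and instance settings are hypothetical.
    import boto.emr
    from boto.emr.step import StreamingStep

    emr = boto.emr.connect_to_region('us-east-1')

    step = StreamingStep(
        name='Count actions per user',
        mapper='s3://example-bucket/scripts/mapper.py',
        reducer='s3://example-bucket/scripts/reducer.py',
        input='s3://example-bucket/clicklogs/',
        output='s3://example-bucket/output/actions-per-user/')

    jobflow_id = emr.run_jobflow(
        name='click-log-analysis',
        log_uri='s3://example-bucket/emr-logs/',
        master_instance_type='m1.small',
        slave_instance_type='m1.small',
        num_instances=4,          # number of Hadoop nodes; can be changed while the flow runs
        steps=[step])

    print(jobflow_id)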
But what is it?
A framework that splits data into pieces, lets processing occur, and gathers the results.
How a job flows through it: input data sits in S3 and DynamoDB; your code is submitted to Elastic MapReduce; a name node coordinates an elastic cluster with HDFS; queries and BI tools connect via JDBC, Pig and Hive; and the output lands back in S3 and DynamoDB.
Worked example: a very large click log (e.g. terabytes) contains lots of actions by John Smith. Split the log into many small pieces, process them in an EMR cluster, and aggregate the results from all the nodes to answer "what did John Smith do?" Insight in a fraction of the time.
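A minimal sketch of what the two streaming scripts for this kind of job could look like, assuming a hypothetical log layout of tab-separated lines with the user in the first column and the action in the second; this is illustrative only.

    # mapper.py -- Hadoop streaming mapper (hypothetical log layout:
    # tab-separated lines with the user in column 1 and the action in column 2).
    import sys

    for line in sys.stdin:
        fields = line.rstrip('\n').split('\t')
        if len(fields) >= 2:
            user, action = fields[0], fields[1]
            # Emit one key/value pair per click; Hadoop groups the pairs by user.
            print('%s\t%s' % (user, action))

    # reducer.py -- receives the mapper output sorted by user and
    # aggregates all actions for each user into a single output line.
    import sys
    from itertools import groupby

    def parse(stream):
        for line in stream:
            user, _, action = line.rstrip('\n').partition('\t')
            yield user, action

    for user, pairs in groupby(parse(sys.stdin), key=lambda kv: kv[0]):
        actions = [action for _, action in pairs]
        print('%s\t%d actions\t%s' % (user, len(actions), ','.join(actions)))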
1 instance for 100 hours = 100 instances for 1 hour
Small instance = $8
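A likely reading of these two slides, assuming the then-current on-demand price of roughly $0.08 per small-instance hour: the job costs the same either way, 100 instance-hours x $0.08/hour = $8, whether you run one instance for 100 hours or 100 instances for one hour.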
Operated 2 million+ Hadoop clusters last year
Features powered by Amazon Elastic MapReduce: People Who Viewed This Also Viewed, review highlights, autocomplete as you type on search, search spelling suggestions, top searches, ads. Around 200 Elastic MapReduce jobs per day, processing 3 TB of data.
Features driven by MapReduce
Hadoop as a Service: Elastic MapReduce. "With Amazon Elastic MapReduce, there was no upfront investment in hardware, no hardware procurement delay, and no need to hire additional operations staff. Because of the flexibility of the platform, our first new online advertising campaign experienced a 500% increase in return on ad spend from a similar campaign a year before."
Data analytics: 3.5 billion records, 71 million unique cookies, 1.7 million targeted ads required per day. Batch processing runs over data sets ranging in size from dozens of gigabytes to terabytes; building in-house infrastructure to analyze these click-stream datasets requires investment in expensive "headroom" to handle peak demand. "Our first client campaign experienced a 500% increase in their return on ad spend from a similar campaign a year before." Example: a user who recently purchased a sports movie and is searching for video games receives a targeted ad (1.7 million per day).
"AWS gave us the flexibility to bring a massive amount of capacity online in a short period of time and allowed us to do so in an operationally straightforward way. AWS is now Shazam's cloud provider of choice." Jason Titus, CTO. DynamoDB: over 500,000 writes per second. Amazon EMR: more than 1 million writes per second.
Step 1: Tracking. We've created a unique tracking application; it keeps track of all websites visited, software used, and/or ads seen. Step 2: Panel. We invite members of a research panel to install it; we know not only their digital habits, but also their offline demographics and behavior. Step 3: Dashboard. Usage data now begins to pour into the Wakoopa dashboard in real time; log in, and create beautiful visualizations and useful reports.
Technology: panel activity data flows into AWS through SQS, is processed by EMR and Kamek*, and the resulting metrics land in RDS and S3 behind the Wakoopa dashboard.
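A rough sketch (not Wakoopa's code) of the kind of glue the SQS-to-storage leg of such a pipeline needs, using boto; the queue name, bucket name, and message format are hypothetical.

    # Minimal sketch: drain panel-activity messages from SQS and stage them
    # in S3 for later EMR processing. Queue and bucket names are hypothetical.
    import boto
    import boto.sqs

    sqs = boto.sqs.connect_to_region('us-east-1')
    s3 = boto.connect_s3()

    queue = sqs.get_queue('panel-activity')
    bucket = s3.get_bucket('example-activity-staging')

    batch = queue.get_messages(num_messages=10)
    for i, message in enumerate(batch):
        key = bucket.new_key('incoming/batch-%d.json' % i)
        key.set_contents_from_string(message.get_body())
        queue.delete_message(message)      # delete only after the event is safely stored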
Rediff uses Amazon EMR along with Amazon S3 to perform data mining, log processing and analytics for their online business; the insights gained are used to power a better user experience on their portal. Rediff needed 12-15 hours to run this on a 10-12 node cluster on premises. AWS gave them the choice and flexibility of an on-demand model that can be scaled up and down, and shortened the time required to process the data.
More than 25 million streaming members; 50 billion events per day.
S3: ~1 PB of data stored in Amazon S3.
Users over time
Leader in the 2011 Gartner IaaS Magic Quadrant
Cloud enables big data collection
Cloud enables big data processing
Cloud enables big data collaboration
aws.amazon.com: get started with the free tier
Speaker notes:
  • Elasticity works from just 1 EC2 instance to many thousands. Just dial up and down as required.
  • Horizontal scaling on commodity hardware. Perfect for Hadoop.
  • We’ve been operating the service for over 3 years now and in the last year alone we’ve operated over 2 MILLION Hadoop clusters
  • Yelp was founded in 2004 with the main goal of helping people connect with great local businesses. The Yelp community is best known for sharing in-depth reviews and insights on local businesses of every sort. In their six years of operation Yelp went from a one-city wonder (San Francisco) to an international phenomenon spanning 8 countries and nearly 50 cities. As of November 2010, Yelp had more than 39 million unique visitors to the site and, in total, more than 14 million reviews have been posted by yelpers.

    Yelp has established a loyal consumer following, due in large part to the fact that they are vigilant in protecting the user from shill or suspect content. Yelp uses an automated review filter to identify suspicious content and minimize exposure to the consumer. The site also offers a wide range of other features that help people discover new businesses (lists, special offers, and events) and communicate with each other. Additionally, business owners and managers are able to set up free accounts to post special offers, upload photos, and message customers. The company has also been focused on developing mobile apps and was recently voted into the iTunes Apps Hall of Fame. Yelp apps are also available for Android, Blackberry, Windows 7, Palm Pre and WAP. Local search advertising makes up the majority of Yelp's revenue stream. The search ads are colored light orange and clearly labeled "Sponsored Results." Paying advertisers are not allowed to change or re-order their reviews.

    Yelp originally depended upon giant RAIDs to store their logs, along with a single local instance of Hadoop. When Yelp made the move to Amazon Elastic MapReduce, they replaced the RAIDs with Amazon Simple Storage Service (Amazon S3) and immediately transferred all Hadoop jobs to Amazon Elastic MapReduce. "We were running out of hard drive space and capacity on our Hadoop cluster," says Yelp search and data-mining engineer Dave Marin.

    Yelp uses Amazon S3 to store daily logs and photos, generating around 100 GB of logs per day. The company also uses Amazon Elastic MapReduce to power approximately 20 separate batch scripts, most of those processing the logs. Features powered by Amazon Elastic MapReduce include: People Who Viewed This Also Viewed, review highlights, autocomplete as you type on search, search spelling suggestions, top searches, and ads.

    Their jobs are written exclusively in Python, and Yelp uses their own open-source library, mrjob, to run their Hadoop streaming jobs on Amazon Elastic MapReduce, with boto to talk to Amazon S3. Yelp also uses s3cmd and the Ruby Elastic MapReduce utility for monitoring. Yelp developers advise others working with AWS to use the boto API as well as mrjob to ensure full utilization of Amazon Elastic MapReduce job flows. Yelp runs approximately 200 Elastic MapReduce jobs per day, processing 3 TB of data, and is grateful for AWS technical support that helped with their Hadoop application development.

    Using Amazon Elastic MapReduce, Yelp was able to save $55,000 in upfront hardware costs and get up and running in a matter of days, not months. However, most important to Yelp is the opportunity cost. "With AWS, our developers can now do things they couldn't before," says Marin. "Our systems team can focus their energies on other challenges." To learn more, visit http://www.yelp.com/. To learn about the mrjob Python library, visit http://engineeringblog.yelp.com/2010/10/mrjob-distributed-computing-for-everybody.html
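  • Since this note mentions mrjob and boto, here is a minimal illustrative sketch of the kind of mrjob job such a setup runs; the "top searches" task and the tab-separated log layout are assumptions, not one of Yelp's actual scripts.

        # top_searches.py -- minimal mrjob sketch (illustrative only).
        # Assumes tab-separated log lines with the search term in the second column.
        from mrjob.job import MRJob

        class TopSearches(MRJob):

            def mapper(self, _, line):
                fields = line.split('\t')
                if len(fields) >= 2:
                    yield fields[1].lower(), 1     # one count per search term

            def reducer(self, term, counts):
                yield term, sum(counts)            # total occurrences per term

        if __name__ == '__main__':
            TopSearches.run()

    Run locally with "python top_searches.py input.log", or against Elastic MapReduce with "python top_searches.py -r emr s3://bucket/logs/"; the -r emr runner is mrjob's standard switch for running on EMR.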
  • The more misspelled words you collect from your customers, the better spell-check application you can create. Yelp is using AWS services to regularly process customer-generated data to improve spell check on their web site.
  • The more searches you collect, the better recommendations you can provide. Yelp is using AWS services to deliver features such as hotel or restaurant recommendations, review highlights and search hints.
  • AWS Case Study: Razorfish. Razorfish, a digital advertising and marketing firm, segments users and customers based on the collection and analysis of non-personally identifiable data from browsing sessions. Doing so requires applying data mining methods across historical click streams to identify effective segmentation and categorization algorithms and techniques. These click streams are generated when a visitor navigates a web site or catalog, leaving behind patterns that can indicate a user's interests. Algorithms are then implemented on systems that can batch-execute at the appropriate scale against current data sets ranging in size from dozens of gigabytes to terabytes. The algorithms are also customized on a client-by-client basis to observe online/offline sales and customer loyalty data. Results of the analysis are loaded into ad-serving and cross-selling systems that in turn deliver the segmentation results in real time.

    A common issue Razorfish has found with customer segmentation is the need to process gigantic click-stream data sets. These large data sets are often the result of holiday shopping traffic on a retail website, or sudden dramatic growth on the data network of a media or social networking site. Building in-house infrastructure to analyze these click-stream datasets requires investment in expensive "headroom" to handle peak demand. Without the expensive computing resources, Razorfish risks losing clients that require Razorfish to have sufficient resources at hand during critical moments. In addition, applications that can't scale to handle increasingly large datasets can cause delays in identifying and applying algorithms that could drive additional revenue. As the sample data set grows (i.e. more users, more pages, more clicks), fewer applications are available that can handle the load and provide a timely response. Meanwhile, as the number of clients that utilize targeted advertising grows, access to on-demand compute and storage resources becomes a requirement. It was thus imperative for Razorfish to implement customer segmentation algorithms in a way that could be applied and executed independently of the scale of the incoming data and supporting infrastructure.

    Prior to implementing the AWS-based solution, Razorfish relied on a traditional hosting environment that utilized high-cost SAN equipment for storage, a proprietary distributed log-processing cluster of 30 servers, and several high-end SQL servers. In preparation for the 2009 holiday season, demand for targeted advertising increased. To support this need, Razorfish faced a potential cost of over $500,000 in additional hardware expenses, a procurement time frame of about two months, and the need for an additional senior operations/database administrator. Furthermore, due to downstream dependencies, they needed their daily processing cycle to complete within 18 hours. However, given the increased data volume, Razorfish expected their processing cycle to extend past two days for each run even after the potential investment in human and computing resources. To deal with the combination of huge datasets and custom segmentation targeting activities, coupled with price-sensitive clients, Razorfish decided to move away from their rigid data infrastructure status quo. This migration helped Razorfish process vast amounts of data to handle the need for rapid scaling at both the application and infrastructure levels.

    Razorfish selected ad-serving integration, Amazon Web Services (AWS), Amazon Elastic MapReduce (a hosted Apache Hadoop service), Cascading, and a variety of chosen applications to power their targeted advertising system based on these benefits. Efficient: elastic infrastructure from AWS allows capacity to be provisioned as needed based on load, reducing cost and the risk of processing delays; Amazon Elastic MapReduce and Cascading let Razorfish focus on application development without having to worry about time-consuming set-up, management, or tuning of Hadoop clusters or the compute capacity upon which they sit. Ease of integration: Amazon Elastic MapReduce with Cascading allows data processing in the cloud without any changes to the underlying algorithms. Flexible: Hadoop with Cascading is flexible enough to allow "agile" implementation and unit testing of sophisticated algorithms. Adaptable: Cascading simplifies the integration of Hadoop with external ad systems. Scalable: AWS infrastructure helps Razorfish reliably store and process huge (petabyte) data sets.

    The AWS elastic infrastructure platform allows Razorfish to manage wide variability in load by provisioning and removing capacity as needed. Mark Taylor, Program Director at Razorfish, said, "With our implementation of Amazon Elastic MapReduce and Cascading, there was no upfront investment in hardware, no hardware procurement delay, and no additional operations staff was hired. We completed development and testing of our first client project in six weeks. Our process is completely automated. Total cost of the infrastructure averages around $13,000 per month. Because of the richness of the algorithm and the flexibility of the platform to support it at scale, our first client campaign experienced a 500% increase in their return on ad spend from a similar campaign a year before."
  • Big data, the term for scanning loads of information for possibly profitable patterns, is a growing sector of corporate technology. Mostly people think in terms of online behavior, like mouse clicks, LinkedIn affiliations and Amazon shopping choices. But other big databases in the real world, lying around for years, are there to exploit.

    A company called the Climate Corporation was formed in 2006 by two former Google employees who wanted to make use of the vast amount of free data published by the National Weather Service on heat and precipitation patterns around the country. At first they called the company WeatherBill, and used the data to sell insurance to businesses that depended heavily on the weather, from ski resorts and miniature golf courses to house painters and farmers. It did pretty well, raising more than $50 million from the likes of Google Ventures, Khosla Ventures, and Allen & Company. The problem was, it was hard to sell insurance policies to so many little businesses, even using an online shopping model. People like having their insurance explained. The answer was to get even more data, and focus on the agriculture market through the same sales force that sells federal crop insurance.

    "We took 60 years of crop yield data, and 14 terabytes of information on soil types, every two square miles for the United States, from the Department of Agriculture," says David Friedberg, chief executive of the Climate Corporation, a name WeatherBill started using Tuesday. "We match that with the weather information for one million points the government scans with Doppler radar, this huge national infrastructure for storm warnings, and make predictions for the effect on corn, soybeans and winter wheat."

    The product, insurance against things like drought, too much rain at the planting or the harvest, or an early freeze, is sold through 10,000 agents nationwide. The Climate Corporation, which also added Byron Dorgan, the former senator from North Dakota, to its board on Tuesday, will very likely get into insurance for specialty crops like tomatoes and grapes, which do not have federal insurance. Like the weather information, the data on soils was free for the taking. The hard and expensive part is turning the data into a product.

    Mr. Friedberg was an early member of the corporate development team at Google. The co-founder, Siraj Khaliq, worked in distributed computing, which involves apportioning big data computing problems across multiple machines. He works as the Climate Corporation's chief technical officer. Out of the staff of 60 in the company's San Francisco office (another 30 work in the field), about 12 have doctorates, in areas like environmental science and applied mathematics. "They like that this is a real-world problem, not just clicks on a Web site," Mr. Friedberg says. He figures that the Climate Corporation is one of the world's largest users of MapReduce, an increasingly popular software technique for making sense of very large data systems. The number crunching is performed on Amazon.com's Amazon Web Services computers.

    The Climate Corporation is working with data intended to judge how different crops will react to certain soils, water and heat. It might be valuable to commodities traders as well, but Mr. Friedberg figures the better business is to expand in farming. Besides the other crops, he is looking at offering the service in Canada and Brazil, or anywhere else that he can get decent long-term data. It's unlikely he'll get the quality he got from the federal government, for a price anywhere near "free."

    The Climate Corporation, key takeaways: Cascading provides data scientists at The Climate Corporation a solid foundation to develop advanced machine learning applications in Cascalog that get deployed directly onto Amazon EMR clusters consisting of 2,000+ cores. This results in significantly improved productivity with lower operating costs.

    Solution: Data scientists at The Climate Corporation chose to create their algorithms in Cascalog, a high-level Clojure-based machine learning language built on Cascading. Cascading is an advanced Java application framework that abstracts the MapReduce APIs in Apache Hadoop and provides developers with a simplified way to create powerful data processing workflows. Programming in Cascalog, data scientists create compact expressions that represent complex batch-oriented AI and machine learning workflows. This results in improved productivity for the data scientists, many of whom are mathematicians rather than computer scientists. It also gives them the ability to quickly analyze complex data sets without having to create large, complicated programs in MapReduce. Furthermore, programmers at The Climate Corporation also use Cascading directly for creating jobs inside Hadoop streaming to process additional batch-oriented data workflows. All these workflows and data processing jobs are deployed directly onto Amazon Elastic MapReduce into their own dedicated clusters. Depending on the size of data sets and the complexity of the algorithms, clusters consisting of up to 200 processor cores are utilized for data normalization workflows, and clusters consisting of over 2,000 processor cores are utilized for risk analysis and climate modeling workflows.

    Benefits: By utilizing Amazon Elastic MapReduce and Cascalog, data scientists at The Climate Corporation are able to focus on solving business challenges rather than worrying about setting up a complex infrastructure or trying to figure out how to use it to process the vast amounts of complex data. The Climate Corporation is able to effectively manage its costs by using Amazon Elastic MapReduce and dedicating cluster resources to each workflow individually. This allows them to utilize the resources only when they are needed, and not have to invest in hardware and systems administrators to manage their own private shared cluster, where they would have to optimize their workflows and schedule them to avoid resource contention. Furthermore, Cascading gives data scientists at The Climate Corporation a common foundation for creating both their batch-oriented machine learning workflows in Cascalog and Hadoop streaming workflows directly in Cascading. These applications are developed locally on the developers' desktops, and then get instantly deployed onto dedicated Amazon Elastic MapReduce clusters for testing and production use. This minimizes iterative utilization of the cluster resources, allowing The Climate Corporation to manage its costs by using the infrastructure for productive data processing only.
  • In 2009, the company acquired Adtuitive, a startup Internet advertising company. Adtuitive's ad server was completely hosted on Amazon Web Services and served targeted retail ads at a rate of over 100 million requests per month. Adtuitive's configuration included 50 Amazon Elastic Compute Cloud (Amazon EC2) instances, Amazon Elastic Block Store (Amazon EBS) volumes, Amazon CloudFront, Amazon Simple Storage Service (Amazon S3), and a data warehouse pipeline built on Amazon Elastic MapReduce. Amazon Elastic MapReduce runs a custom domain-specific language that uses the Cascading application programming interface.

    Today, Etsy uses Amazon Elastic MapReduce for web log analysis and recommendation algorithms. Because AWS easily and economically processes enormous amounts of data, it's ideal for the type of processing that Etsy performs. Etsy copies its HTTP server logs every hour to Amazon S3, and syncs snapshots of the production database on a nightly basis. The combination of Amazon's products and Etsy's syncing/storage operation provides substantial benefits for Etsy. As Dr. Jason Davis, lead scientist at Etsy, explains, "the computing power available with [Amazon Elastic MapReduce] allows us to run these operations over dozens or even hundreds of machines without the need for owning the hardware."
  • "Wakoopa understands what people do in their digital lives. In a privacy-conscious way, our technology tracks what websites they visit, what ads they see, or what apps they use. By using our online research dashboard, you can optimize your digital strategy accordingly. Our clients range from research firms such as TNS and Synovate to companies like Google and Sanoma. Essentially, we're the Lonely Planet of the digital world."
  • Kamek is a server created by Wakoopa that makes metrics (such as bounce rate or pageviews) out of millions of visits and visitors, all in a couple of seconds, all in real time.
  • Netflix has more than 25 million streaming members and is growing rapidly. Their end users stream movies and TV shows from smart TVs, laptops, phones, and tablets, resulting in over 50 billion events per day.
  • Netflix stores all of this data in Amazon S3, approximately 1 Petabyte.
  • AWS Case Study: Ticketmaster and MarketShare.

    The business challenge: The Pricemaster application is a web-based tool designed to optimize live event ticket pricing, improve yield management and generate incremental revenue. The tool takes a holistic approach to maximizing ticket revenue: it optimizes pre-sale and initial pricing all the way through dynamic pricing post on-sale. However, before development could begin, MarketShare had to find an infrastructure that could support the application's dual challenges: limited upfront capital and managing the fluctuating nature of analytic workloads.

    Amazon Web Services: After examining their options, MarketShare decided to power Pricemaster using Amazon Web Services (AWS). The AWS feature stack provides the scalability, usability, and on-demand pricing required to support the application's intricate cluster architecture and complex MATLAB simulations. Pricemaster's AWS environment includes four large and extra-large Amazon EC2 instances supporting a variety of nodes. The pricing application's Amazon EC2 instances are connected to a central database within Amazon RDS. In addition, Pricemaster's AWS infrastructure includes Amazon ELB for traffic distribution, Amazon SimpleDB for non-relational data storage, Amazon Elastic MapReduce for large-scale data processing, as well as Amazon SES. The Pricemaster team monitors all of these resources with Amazon CloudWatch.

    The business benefits: The Pricemaster team credits AWS's ease of use, specifically that of Amazon Elastic MapReduce and Amazon RDS, with reducing its developers' infrastructure management time by three hours per day, valuable hours the developers can now spend expanding the capabilities of the Pricemaster solution. With AWS's on-demand pricing, MarketShare also estimates that it reduces costs by over 80% annually, compared to fixed service costs. As the Pricemaster tool continues to grow, the company anticipates even further savings with Amazon Web Services. MarketShare continues to expand its use of AWS for partners such as Ticketmaster, saving time and money and providing a superior solution that is flexible, secure and scalable.
  • For example, one of our customers, Foursquare, has built this visualization of customer sign-ups from November 2008 to June 2011. This visualization helps understand global service adoption over time. You can create similar visualizations with packages such as ggplot or the R graphics package.