End Note - AWS India Summit 2012

Big Data End Note for AWS Summit in India as given by Dr. Werner Vogels

  • Elasticity works from just one EC2 instance up to many thousands; capacity can be dialed up and down as required (see the resizing sketch after these notes).
  • Horizontal scaling on commodity hardware. Perfect for Hadoop.
  • We’ve been operating the service for over three years now, and in the last year alone we’ve operated over 2 million Hadoop clusters.
  • Yelp was founded in 2004 with the main goal of helping people connect with great local businesses. The Yelp community is best known for sharing in-depth reviews and insights on local businesses of every sort. In their six years of operation Yelp went from a one-city wonder (San Francisco) to an international phenomenon spanning 8 countries and nearly 50 cities. As of November 2010, Yelp had more than 39 million unique visitors to the site, and in total more than 14 million reviews have been posted by yelpers. Yelp has established a loyal consumer following, due in large part to the fact that they are vigilant in protecting the user from shill or suspect content. Yelp uses an automated review filter to identify suspicious content and minimize exposure to the consumer. The site also offers a wide range of other features that help people discover new businesses (lists, special offers, and events) and communicate with each other. Additionally, business owners and managers are able to set up free accounts to post special offers, upload photos, and message customers. The company has also been focused on developing mobile apps and was recently voted into the iTunes Apps Hall of Fame. Yelp apps are also available for Android, Blackberry, Windows 7, Palm Pre and WAP. Local search advertising makes up the majority of Yelp’s revenue stream. The search ads are colored light orange and clearly labeled “Sponsored Results.” Paying advertisers are not allowed to change or re-order their reviews. Yelp originally depended upon giant RAIDs to store their logs, along with a single local instance of Hadoop. When Yelp made the move to Amazon Elastic MapReduce, they replaced the RAIDs with Amazon Simple Storage Service (Amazon S3) and immediately transferred all Hadoop jobs to Amazon Elastic MapReduce. “We were running out of hard drive space and capacity on our Hadoop cluster,” says Yelp search and data-mining engineer Dave Marin. Yelp uses Amazon S3 to store daily logs and photos, generating around 100GB of logs per day. The company also uses Amazon Elastic MapReduce to power approximately 20 separate batch scripts, most of them processing the logs. Features powered by Amazon Elastic MapReduce include: People Who Viewed This Also Viewed, review highlights, autocomplete as you type on search, search spelling suggestions, top searches, and ads. Their jobs are written exclusively in Python, and Yelp uses their own open-source library, mrjob, to run their Hadoop streaming jobs on Amazon Elastic MapReduce, with boto to talk to Amazon S3. Yelp also uses s3cmd and the Ruby Elastic MapReduce utility for monitoring. Yelp developers advise others working with AWS to use the boto API as well as mrjob to ensure full utilization of Amazon Elastic MapReduce job flows (see the mrjob sketch after these notes). Yelp runs approximately 200 Elastic MapReduce jobs per day, processing 3TB of data, and is grateful for the AWS technical support that helped with their Hadoop application development. Using Amazon Elastic MapReduce, Yelp was able to save $55,000 in upfront hardware costs and get up and running in a matter of days, not months. However, most important to Yelp is the opportunity cost. “With AWS, our developers can now do things they couldn’t before,” says Marin. “Our systems team can focus their energies on other challenges.” To learn more, visit http://www.yelp.com/ . To learn about the mrjob Python library, visit http://engineeringblog.yelp.com/2010/10/mrjob-distributed-computing-for-everybody.html
  • The more misspelled words you collect from your customers, the better a spellcheck application you can create. Yelp uses AWS services to regularly process customer-generated data to improve spell check on their web site.
  • The more searches you collect, the better recommendations you can provide. Yelp uses AWS services to deliver features such as hotel and restaurant recommendations, review highlights, and search hints.
  • AWS Case Study: Razorfish. Razorfish, a digital advertising and marketing firm, segments users and customers based on the collection and analysis of non-personally identifiable data from browsing sessions. Doing so requires applying data-mining methods across historical click streams to identify effective segmentation and categorization algorithms and techniques. These click streams are generated when a visitor navigates a web site or catalog, leaving behind patterns that can indicate a user’s interests. Algorithms are then implemented on systems that can batch-execute at the appropriate scale against current data sets ranging in size from dozens of gigabytes to terabytes. The algorithms are also customized on a client-by-client basis to observe online/offline sales and customer loyalty data. Results of the analysis are loaded into ad-serving and cross-selling systems that in turn deliver the segmentation results in real time. A common issue Razorfish has found with customer segmentation is the need to process gigantic click stream data sets. These large data sets are often the result of holiday shopping traffic on a retail website, or sudden dramatic growth on the data network of a media or social networking site. Building in-house infrastructure to analyze these click stream datasets requires investment in expensive “headroom” to handle peak demand. Without the expensive computing resources, Razorfish risks losing clients that require Razorfish to have sufficient resources at hand during critical moments. In addition, applications that can’t scale to handle increasingly large datasets can cause delays in identifying and applying algorithms that could drive additional revenue. As the sample data set grows (i.e. more users, more pages, more clicks), fewer applications are available that can handle the load and provide a timely response. Meanwhile, as the number of clients that utilize targeted advertising grows, access to on-demand compute and storage resources becomes a requirement. It was thus imperative for Razorfish to implement customer segmentation algorithms in a way that could be applied and executed independently of the scale of the incoming data and supporting infrastructure. Prior to implementing the AWS-based solution, Razorfish relied on a traditional hosting environment that utilized high-cost SAN equipment for storage, a proprietary distributed log-processing cluster of 30 servers, and several high-end SQL servers. In preparation for the 2009 holiday season, demand for targeted advertising increased. To support this need, Razorfish faced a potential cost of over $500,000 in additional hardware expenses, a procurement time frame of about two months, and the need for an additional senior operations/database administrator. Furthermore, due to downstream dependencies, they needed their daily processing cycle to complete within 18 hours. However, given the increased data volume, Razorfish expected their processing cycle to extend past two days for each run, even after the potential investment in human and computing resources. To deal with the combination of huge datasets and custom segmentation targeting activities, coupled with price-sensitive clients, Razorfish decided to move away from their rigid data-infrastructure status quo. This migration helped Razorfish process vast amounts of data and handle the need for rapid scaling at both the application and infrastructure levels.
Razorfish selected ad-serving integration, Amazon Web Services (AWS), Amazon Elastic MapReduce (a hosted Apache Hadoop service), Cascading, and a variety of chosen applications to power their targeted advertising system based on these benefits. Efficient: elastic infrastructure from AWS allows capacity to be provisioned as needed based on load, reducing cost and the risk of processing delays; Amazon Elastic MapReduce and Cascading let Razorfish focus on application development without having to worry about time-consuming set-up, management, or tuning of Hadoop clusters or the compute capacity on which they sit. Ease of integration: Amazon Elastic MapReduce with Cascading allows data processing in the cloud without any changes to the underlying algorithms. Flexible: Hadoop with Cascading is flexible enough to allow “agile” implementation and unit testing of sophisticated algorithms. Adaptable: Cascading simplifies the integration of Hadoop with external ad systems. Scalable: AWS infrastructure helps Razorfish reliably store and process huge (petabyte-scale) data sets. The AWS elastic infrastructure platform allows Razorfish to manage wide variability in load by provisioning and removing capacity as needed (see the EMR job-flow sketch after these notes). Mark Taylor, Program Director at Razorfish, said, “With our implementation of Amazon Elastic MapReduce and Cascading, there was no upfront investment in hardware, no hardware procurement delay, and no additional operations staff was hired. We completed development and testing of our first client project in six weeks. Our process is completely automated. Total cost of the infrastructure averages around $13,000 per month. Because of the richness of the algorithm and the flexibility of the platform to support it at scale, our first client campaign experienced a 500% increase in their return on ad spend from a similar campaign a year before.”
  • Big data, the term for scanning loads of information for possibly profitable patterns, is a growing sector of corporate technology. Mostly people think in terms of online behavior, like mouse clicks, LinkedIn affiliations and Amazon shopping choices. But other big databases in the real world, lying around for years, are there to exploit. A company called the Climate Corporation was formed in 2006 by two former Google employees who wanted to make use of the vast amount of free data published by the National Weather Service on heat and precipitation patterns around the country. At first they called the company WeatherBill, and used the data to sell insurance to businesses that depended heavily on the weather, from ski resorts and miniature golf courses to house painters and farmers. It did pretty well, raising more than $50 million from the likes of Google Ventures, Khosla Ventures, and Allen & Company. The problem was, it was hard to sell insurance policies to so many little businesses, even using an online shopping model. People like having their insurance explained. The answer was to get even more data, and focus on the agriculture market through the same sales force that sells federal crop insurance. “We took 60 years of crop yield data, and 14 terabytes of information on soil types, every two square miles for the United States, from the Department of Agriculture,” says David Friedberg, chief executive of the Climate Corporation, a name WeatherBill started using Tuesday. “We match that with the weather information for one million points the government scans with Doppler radar — this huge national infrastructure for storm warnings — and make predictions for the effect on corn, soybeans and winter wheat.” The product, insurance against things like drought, too much rain at the planting or the harvest, or an early freeze, is sold through 10,000 agents nationwide. The Climate Corporation, which also added Byron Dorgan, the former senator from North Dakota, to its board on Tuesday, will very likely get into insurance for specialty crops like tomatoes and grapes, which do not have federal insurance. Like the weather information, the data on soils was free for the taking. The hard and expensive part is turning the data into a product. Mr. Friedberg was an early member of the corporate development team at Google. The co-founder, Siraj Khaliq, worked in distributed computing, which involves apportioning big data computing problems across multiple machines. He works as the Climate Corporation’s chief technical officer. Out of the staff of 60 in the company’s San Francisco office (another 30 work in the field) about 12 have doctorates, in areas like environmental science and applied mathematics. “They like that this is a real-world problem, not just clicks on a Web site,” Mr. Friedberg says. He figures that the Climate Corporation is one of the world’s largest users of MapReduce, an increasingly popular software technique for making sense of very large data systems. The number crunching is performed on Amazon.com’s Amazon Web Services computers. The Climate Corporation is working with data intended to judge how different crops will react to certain soils, water and heat. It might be valuable to commodities traders as well, but Mr. Friedberg figures the better business is to expand in farming. Besides the other crops, he is looking at offering the service in Canada and Brazil, or anywhere else that he can get decent long-term data.
It’s unlikely he’ll get the quality he got from the federal government, for a price anywhere near “free.” The Climate Corporation. Key takeaways: Cascading provides data scientists at The Climate Corporation a solid foundation to develop advanced machine learning applications in Cascalog that get deployed directly onto Amazon EMR clusters consisting of 2,000+ cores. This results in significantly improved productivity with lower operating costs. Solution: data scientists at The Climate Corporation chose to create their algorithms in Cascalog, a high-level Clojure-based machine learning language built on Cascading. Cascading is an advanced Java application framework that abstracts the MapReduce APIs in Apache Hadoop and provides developers with a simplified way to create powerful data processing workflows. Programming in Cascalog, data scientists create compact expressions that represent complex batch-oriented AI and machine learning workflows. This results in improved productivity for the data scientists, many of whom are mathematicians rather than computer scientists. It also gives them the ability to quickly analyze complex data sets without having to create large, complicated programs in MapReduce. Furthermore, programmers at The Climate Corporation also use Cascading directly for creating jobs inside Hadoop streaming to process additional batch-oriented data workflows. All these workflows and data processing jobs are deployed directly onto Amazon Elastic MapReduce into their own dedicated clusters. Depending on the size of the data sets and the complexity of the algorithms, clusters consisting of up to 200 processor cores are utilized for data normalization workflows, and clusters consisting of over 2,000 processor cores are utilized for risk analysis and climate modeling workflows. Benefits: by utilizing Amazon Elastic MapReduce and Cascalog, data scientists at The Climate Corporation are able to focus on solving business challenges rather than worrying about setting up a complex infrastructure or trying to figure out how to use it to process the vast amounts of complex data. The Climate Corporation is able to effectively manage its costs by using Amazon Elastic MapReduce with dedicated cluster resources for each workflow individually. This allows them to utilize the resources only when they are needed, and not have to invest in hardware resources and systems administrators to manage their own private shared cluster, where they would have to optimize their workflows and schedule them to avoid resource contention. Furthermore, Cascading provides data scientists at The Climate Corporation a common foundation for creating both their batch-oriented machine learning workflows in Cascalog and their Hadoop streaming workflows directly in Cascading. These applications are developed locally on the developers’ desktops, and then get instantly deployed onto dedicated Amazon Elastic MapReduce clusters for testing and production use. This minimizes the amount of iterative utilization of the cluster resources, thus allowing The Climate Corporation to manage its costs by utilizing the infrastructure for productive data processing only.
  • In 2009, the company acquired Adtuitive, a startup Internet advertising company. Adtuitive’s ad server was completely hosted on Amazon Web Services and served targeted retail ads at a rate of over 100 million requests per month. Adtuitive’s configuration included 50 Amazon Elastic Compute Cloud (Amazon EC2) instances, Amazon Elastic Block Store (Amazon EBS) volumes, Amazon CloudFront, Amazon Simple Storage Service (Amazon S3), and a data warehouse pipeline built on Amazon Elastic MapReduce. The Amazon Elastic MapReduce pipeline runs a custom domain-specific language that uses the Cascading application programming interface. Today, Etsy uses Amazon Elastic MapReduce for web log analysis and recommendation algorithms. Because AWS easily and economically processes enormous amounts of data, it’s ideal for the type of processing that Etsy performs. Etsy copies its HTTP server logs every hour to Amazon S3, and syncs snapshots of the production database on a nightly basis (see the log-sync sketch after these notes). The combination of Amazon’s products and Etsy’s syncing/storage operation provides substantial benefits for Etsy. As Dr. Jason Davis, lead scientist at Etsy, explains, “the computing power available with [Amazon Elastic MapReduce] allows us to run these operations over dozens or even hundreds of machines without the need for owning the hardware.”
  • “Wakoopa understands what people do in their digital lives. In a privacy-conscious way, our technology tracks what websites they visit, what ads they see, or what apps they use. By using our online research dashboard, you can optimize your digital strategy accordingly. Our clients range from research firms such as TNS and Synovate to companies like Google and Sanoma. Essentially, we’re the Lonely Planet of the digital world.”
  • Kamek is a server created by Wakoopa that computes metrics (such as bounce rate or pageviews) from millions of visits and visitors, all in a couple of seconds, all in real time.
  • Netflix has more than 25 million streaming members and is growing rapidly. Their end users stream movies and TV shows from smart TVs, laptops, phones, and tablets, resulting in over 50 billion events per day.
  • Netflix stores all of this data, approximately 1 petabyte, in Amazon S3.
  • AWS Case Study: Ticketmaster and MarketShare. The business challenges: the Pricemaster application is a web-based tool designed to optimize live event ticket pricing, improve yield management and generate incremental revenue. The tool takes a holistic approach to maximizing ticket revenue: it optimizes pre-sale and initial pricing all the way through dynamic pricing post on-sale. However, before development could begin, MarketShare had to find an infrastructure that could support the application’s dual challenges: limited upfront capital and managing the fluctuating nature of analytic workloads. Amazon Web Services: after examining their options, MarketShare decided to power Pricemaster using Amazon Web Services (AWS). The AWS feature stack provides the scalability, usability, and on-demand pricing required to support the application’s intricate cluster architecture and complex MATLAB simulations. Pricemaster’s AWS environment includes four large and extra-large Amazon EC2 instances supporting a variety of nodes. The pricing application’s Amazon EC2 instances are connected to a central database within Amazon RDS. In addition, Pricemaster’s AWS infrastructure includes Amazon ELB for traffic distribution, Amazon SimpleDB for non-relational data storage, Amazon Elastic MapReduce for large-scale data processing, as well as Amazon SES. The Pricemaster team monitors all of these resources with Amazon CloudWatch. The business benefits: the Pricemaster team credits AWS’s ease of use, specifically that of Amazon Elastic MapReduce and Amazon RDS, with reducing its developers’ infrastructure management time by three hours per day—valuable hours the developers can now spend expanding the capabilities of the Pricemaster solution. With AWS’s on-demand pricing, MarketShare also estimates that it reduces costs by over 80% annually, compared to fixed service costs. As the Pricemaster tool continues to grow, the company anticipates even further savings with Amazon Web Services. MarketShare continues to expand its use of AWS for partners such as Ticketmaster, saving time and money and providing a superior solution that is flexible, secure and scalable.
  • For example, one of our customers, FourSquare, has built this visualization of customer sign-ups from November 2008 to June 2011. This visualization helps in understanding global service adoption over time. You can create similar visualizations with packages such as gplot or the R graphics package (a minimal Python analogue is sketched after these notes).
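Resizing sketch for the elasticity note above. This is a minimal illustration, assuming the classic boto library of that era; the instance-group ID and target sizes are hypothetical placeholders, not taken from any of the case studies.

```python
# Minimal sketch: dialing an EMR cluster's core group up and down with boto.
# The instance group ID below is a hypothetical placeholder for a job flow
# started earlier with conn.run_jobflow(...).
import boto.emr

conn = boto.emr.connect_to_region('us-east-1')

core_group_id = 'ig-EXAMPLECORE'   # placeholder core instance group

# Dial up to 20 nodes for a heavy batch window...
conn.modify_instance_groups([core_group_id], [20])
# ...and back down to 2 once the batch completes.
conn.modify_instance_groups([core_group_id], [2])
```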
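mrjob sketch for the Yelp note. Yelp's jobs are written in Python with their open-source mrjob library; the job below is only an illustrative sketch of a Hadoop streaming job in that style, not Yelp's actual code, and the tab-separated log format is an assumption.

```python
# Illustrative mrjob job: count how often a raw query was corrected to a
# different spelling, the kind of signal a spellcheck feature could use.
from mrjob.job import MRJob


class SearchCorrections(MRJob):

    def mapper(self, _, line):
        # Assumed log format: "<timestamp>\t<raw_query>\t<corrected_query>"
        parts = line.rstrip('\n').split('\t')
        if len(parts) == 3 and parts[1] != parts[2]:
            yield (parts[1].lower(), parts[2].lower()), 1

    def reducer(self, pair, counts):
        yield pair, sum(counts)


if __name__ == '__main__':
    SearchCorrections.run()
```

Such a job can be run locally for testing or pushed to Elastic MapReduce with mrjob's standard EMR runner, e.g. `python search_corrections.py -r emr s3://example-bucket/search-logs/ > corrections.tsv` (the bucket path is hypothetical).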
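EMR job-flow sketch for the Razorfish note. The note describes provisioning Elastic MapReduce capacity per run for Cascading jobs; below is a hedged sketch of launching a job flow with a custom JAR step via boto. The bucket, jar path, dates, and instance counts are hypothetical, and this is not Razorfish's actual setup.

```python
# Launch a transient EMR job flow that runs one custom JAR step (for example,
# a Cascading "fat jar"), then releases the capacity when the run finishes.
import boto.emr
from boto.emr.step import JarStep

conn = boto.emr.connect_to_region('us-east-1')

step = JarStep(
    name='Daily segmentation',
    jar='s3://example-bucket/jars/segmentation.jar',          # hypothetical jar
    step_args=['--input', 's3://example-bucket/clickstream/2012-06-27/',
               '--output', 's3://example-bucket/segments/2012-06-27/'])

jobflow_id = conn.run_jobflow(
    name='clickstream-segmentation',
    log_uri='s3://example-bucket/emr-logs/',
    master_instance_type='m1.large',
    slave_instance_type='m1.large',
    num_instances=20,            # provisioned for this run only
    steps=[step])

print(jobflow_id)
```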
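Log-sync sketch for the Etsy note. The note mentions copying HTTP server logs to Amazon S3 every hour; here is a minimal boto sketch of such a copy. The bucket name, local log paths, and key layout are assumptions, not Etsy's actual pipeline.

```python
# Copy the current hour's web server logs into S3 under a date/hour prefix,
# so downstream EMR jobs can select just the slices they need.
import glob
import os

import boto
from boto.s3.key import Key

conn = boto.connect_s3()
bucket = conn.get_bucket('example-log-bucket')   # hypothetical bucket

for path in glob.glob('/var/log/httpd/access_log.2012-06-27-14*'):
    key = Key(bucket)
    key.key = 'weblogs/2012/06/27/14/' + os.path.basename(path)
    key.set_contents_from_filename(path)
```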
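Plotting sketch for the FourSquare note. The note points to gplot or the R graphics package; as a rough Python analogue (an assumption, not the tool FourSquare used), a cumulative sign-up curve can be drawn with matplotlib. The data below is synthetic placeholder growth, not FourSquare's numbers.

```python
# Plot a cumulative sign-up curve over time with matplotlib (synthetic data).
from datetime import date

import matplotlib.pyplot as plt

# Monthly points from November 2008 through June 2011.
months = [date(2008 + m // 12, m % 12 + 1, 1) for m in range(10, 42)]
signups = [int(1000 * 1.25 ** i) for i in range(len(months))]   # fake growth

plt.plot(months, signups)
plt.xlabel('Month')
plt.ylabel('Cumulative sign-ups')
plt.title('Service adoption over time (synthetic data)')
plt.tight_layout()
plt.savefig('signups.png')
```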
