2. Human Genome Project
Collaborative project to sequence every single letter
of the human genetic code.
13 years and billions of dollars to complete.
Gigabyte scale datasets (transferred between sites on
iPods!)
3. Beyond the Human Genome
45+ species sequenced: mouse, rat, gorilla, rabbit,
platypus, nematode, zebrafish...
Compare genomes between species to identify
biologically interesting areas of the genome.
100GB-scale datasets. Increased computational
requirements.
4. The Next Generation
New sequencing instruments lead to a dramatic
drop in cost and time required to sequence a genome.
Sequence and compare genetic code of individuals to
find areas of variation. Much more interesting.
Terabyte scale datasets. Significant computational
requirements.
5. The 1000 Genomes Project
Public/private consortium to build world’s largest
collection of human genetic variation.
Hugely important dataset to drive new insight into
known genetic traits, and the identification of new ones.
Vast, complex data and computational resources required,
beyond reach of most research groups and hospitals.
6. 1000 Genomes in the Cloud
The 1000 Genomes data made available to all on AWS.
Stored for free as part of the Public Datasets program.
Updated regularly.
200TB. 1,700 individual genomes. As much compute and
storage as required, available to all.
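As a concrete illustration, here is a minimal sketch of browsing the public dataset from Python with boto3 (the successor to the boto library mentioned in the notes later in this deck). The bucket name "1000genomes" and anonymous read access are assumptions about how the public dataset is published.

# Minimal sketch: list a few objects from the 1000 Genomes public dataset on S3.
# Assumptions: the public bucket is named "1000genomes" and allows anonymous reads.
import boto3
from botocore import UNSIGNED
from botocore.config import Config

s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))  # no AWS credentials needed

resp = s3.list_objects_v2(Bucket="1000genomes", MaxKeys=10)
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])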
17. Challenge: To run a virtual screen with a higher
accuracy algorithm & 21 million compounds
19. Metric / Count
Compute Hours of Work: 109,927 hours
Compute Days of Work: 4,580 days
Compute Years of Work: 12.55 years
Ligand Count: ~21 million ligands
Using CycleCloud & Amazon Cloud, the impossible run finished in...
32. Big Data powered by AWS
Big Data: the collection and analysis of large amounts of data to create a competitive advantage.
33. Big Data powered by AWS
Big Data Verticals
Media/Advertising: targeted advertising, image and video processing
Oil & Gas: seismic analysis
Retail: recommendations, transaction analysis
Life Sciences: genome analysis
Financial Services: Monte Carlo simulations, risk analysis
Security: anti-virus, fraud detection, image recognition
Social Network/Gaming: user demographics, usage analysis, in-game metrics
34. Big Data powered by AWS
Storage Big Data Compute
Challenges start at relatively small volumes: from 100 GB up to 1,000 PB.
35. Big Data powered by AWS
Storage Big Data Compute
When data sets and data analytics need to scale to the
point that you have to start innovating around how to
collect, store, organize, analyze and share it
37. Big Data powered by AWS
Storage Innovation Compute
Storage: S3, DynamoDB, Glacier
Compute: HPC, EMR, Spot
38. Storage Big Data Compute
Unconstrained data growth: from GB through TB, PB and EB to ZB.
95% of the 1.2 zettabytes of data in the digital universe is unstructured.
70% of this is user-generated content.
Unstructured data growth is explosive, with estimates of compound annual growth rate (CAGR) at 62% from 2008–2012.
Source: IDC
39. Storage Big Data Compute
Why now?
Web sites: blogs, reviews, emails, pictures
Sensor data: weather, water, smart grids
Social graphs: Facebook, LinkedIn, contacts
Images/videos: traffic, security cameras
Application server logs: web sites, games
Twitter: 50m tweets/day, 1,400% growth per year
40. Storage Big Data Compute
Why now?
Mobile connected world (more people using, easier to collect)
41. Storage Big Data Compute
Why now?
More aspects of data (variety, depth, location, frequency)
42. Storage Big Data Compute
Why now?
Possible to understand (not just answer specific questions)
43. Storage Big Data Compute
Why now?
Who is your consumer really?
What do people really like?
What is happening socially with your products?
How do people really use your product?
44. Storage Big Data Compute
Why now?
More data => better results
53. Storage Big Data Compute
Where do you put your slice of it?
Collection / Ingestion:
AWS Direct Connect: dedicated bandwidth between your site and AWS
AWS Import/Export: physical transfer of media into and out of AWS
Queuing: reliable messaging for task distribution & collection
Amazon Storage Gateway: shrink-wrapped gateway for volume synchronization
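For the queuing piece, a minimal sketch of task distribution with Amazon SQS via boto3 follows; the queue name and the S3 path in the message are placeholders for illustration.

# Minimal sketch: reliable task distribution with Amazon SQS (boto3).
# The queue name "ingest-tasks" and the message body are placeholders.
import boto3

sqs = boto3.resource("sqs")
queue = sqs.create_queue(QueueName="ingest-tasks")

# Producer: hand a collection task to the worker fleet.
queue.send_message(MessageBody="s3://my-bucket/raw/clicklog-2012-06-01.gz")

# Worker: pull a task, process it, then delete it so it is not redelivered.
for msg in queue.receive_messages(WaitTimeSeconds=10, MaxNumberOfMessages=1):
    print("processing", msg.body)
    msg.delete()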
54. Storage Big Data Compute
Where do you put your slice of it?
Relational Database Service (RDS): fully managed database (MySQL, Oracle, MSSQL)
DynamoDB: NoSQL, schemaless, provisioned-throughput database
Simple Storage Service (S3): object datastore, up to 5TB per object, 99.999999999% durability
55. Storage Big Data Compute
Where do you put your slice of it?
Glacier
Long term cold storage
From $0.01 per GB/Month
99.999999999% durability
56. Storage Big Data Compute
Glacier: full lifecycle big data management
Data import: physical shipping of devices for creation of data in AWS (e.g. 50TB of seismic data created as EBS volumes in a Gluster file system)
Computation & visualization: HPC & EMR cluster jobs of many thousands of cores (e.g. 200TB of visualization data generated from cluster processing)
Long term archive: once data analysis is complete, the entire resultant dataset is placed in cold storage rather than tape (cost effective compared to tape; retrieval in 3-5 hours if required)
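One common way to do the "archive instead of tape" step is an S3 lifecycle rule that transitions result objects to the Glacier storage class. The sketch below uses boto3; the bucket name, prefix and 30-day threshold are illustrative assumptions.

# Minimal sketch: transition analysis results to Glacier automatically via an
# S3 lifecycle rule. Bucket name, prefix and the 30-day threshold are assumptions.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="my-results-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-results",
                "Filter": {"Prefix": "results/"},
                "Status": "Enabled",
                # After 30 days, move objects into cold storage instead of tape.
                "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],
            }
        ]
    },
)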
57. Storage Big Data Compute
How quickly do you need to read it?
DynamoDB (single-digit ms): social-scale applications; provisioned throughput performance; flexible consistency models
S3 (10s-100s ms): any object, any app; 99.999999999% durability; objects up to 5TB in size
Glacier (<5 hours): media & asset archives; extremely low cost; S3 levels of durability
Performance / Scale / Price
58. Storage Big Data Compute
Operate at any scale
Unlimited data
Performance / Scale / Price
59. Storage Big Data Compute
Pay for only what you use
Provisioned IOPS: provisioned read/write performance per DynamoDB table or EBS volume; pay for a given provisioned capacity whether used or not
Volume used: pay for volume stored per month plus puts/gets; no capacity planning required to maintain unlimited storage
Performance / Scale / Price
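As an illustration of provisioning read/write capacity up front, here is a minimal boto3 sketch that creates a DynamoDB table with explicit provisioned throughput; the table name, key schema and capacity numbers are assumptions, not values from this deck.

# Minimal sketch: a DynamoDB table with explicitly provisioned read/write capacity.
# Table name, key schema and the capacity units are illustrative assumptions.
import boto3

dynamodb = boto3.client("dynamodb")
dynamodb.create_table(
    TableName="click-events",
    AttributeDefinitions=[{"AttributeName": "user_id", "AttributeType": "S"}],
    KeySchema=[{"AttributeName": "user_id", "KeyType": "HASH"}],
    # You pay for this capacity whether or not it is used, so size it to the workload.
    ProvisionedThroughput={"ReadCapacityUnits": 100, "WriteCapacityUnits": 500},
)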
60. Storage Big Data Compute
"Big data" changes the dynamics of computation and data sharing
Collection: How do I acquire it? Where do I put it?
Computation: What horsepower can I apply to it?
Collaboration: How do I work with others on it?
61. Storage Big Data Compute
"Big data" changes the dynamics of computation and data sharing
Collection: How do I acquire it? Where do I put it? (Direct Connect, Import/Export, S3, DynamoDB)
Computation: What horsepower can I apply to it? (EC2, GPUs, Elastic MapReduce)
Collaboration: How do I work with others on it? (CloudFormation, Simple Workflow, S3)
64. Storage Big Data Compute
Hadoop-as-a-Service – Elastic MapReduce
Elastic MapReduce
Managed, elastic Hadoop cluster
Integrates with S3 & DynamoDB
Leverage Hive & Pig analytics scripts
Integrates with instance types such
as spot
65. Elastic MapReduce
Managed, elastic Hadoop cluster
Integrates with S3 & DynamoDB
Leverage Hive & Pig analytics scripts
Integrates with instance types such as spot
Feature / Details
Scalable: use as many or as few compute instances running Hadoop as you want; modify the number of instances while your job flow is running
Integrated with other services: works seamlessly with S3 as origin and output; integrates with DynamoDB
Comprehensive: supports languages such as Hive and Pig for defining analytics, and allows complex definitions in Cascading, Java, Ruby, Perl, Python, PHP, R, or C++
Cost effective: works with Spot instance types
Monitoring: monitor job flows from within the management console
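A minimal sketch of starting an EMR job flow programmatically with boto3 follows; the cluster name, release label, instance types and counts, S3 paths and the use of the default EMR roles are all illustrative assumptions, not values from this deck.

# Minimal sketch: launch a small EMR (Hadoop) cluster with one streaming step.
# Names, release label, instance sizes/counts and S3 paths are assumptions.
import boto3

emr = boto3.client("emr")
emr.run_job_flow(
    Name="clicklog-analysis",
    ReleaseLabel="emr-5.36.0",
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 4,                    # can be resized while the job flow runs
        "KeepJobFlowAliveWhenNoSteps": False,  # terminate when the step finishes
    },
    Steps=[{
        "Name": "aggregate-clicks",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["hadoop-streaming",
                     "-input", "s3://my-bucket/clicklog/",
                     "-output", "s3://my-bucket/output/",
                     "-mapper", "mapper.py", "-reducer", "reducer.py"],
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)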
77. Very large click log (e.g. TBs): lots of actions by John Smith.
78. Split the log into many small pieces.
79. Process in an EMR cluster.
80. Aggregate the results from all the nodes.
81. The result: what John Smith did.
82. Insight in a fraction of the time.
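The same split/process/aggregate flow can be expressed in a few lines with mrjob, the open-source library mentioned in the Yelp notes later in this deck. A minimal sketch follows; the tab-separated log layout and field names are assumptions for illustration.

# Minimal sketch: aggregate everything one user did from a huge click log,
# using mrjob. The "user_id<TAB>action" log format is an assumption.
from mrjob.job import MRJob

class UserActions(MRJob):
    def mapper(self, _, line):
        # Each mapper sees one small piece of the split log.
        parts = line.split("\t", 1)
        if len(parts) == 2:
            user_id, action = parts
            yield user_id, action

    def reducer(self, user_id, actions):
        # Aggregation step: collect what this user did across all the pieces.
        yield user_id, list(actions)

if __name__ == "__main__":
    UserActions.run()

Run locally for testing, or point it at EMR with something like: python user_actions.py -r emr s3://my-bucket/clicklog/ (paths and bucket are placeholders).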
86. Features powered by Amazon Elastic
MapReduce:
People Who Viewed this Also Viewed
Review highlights
Auto complete as you type on search
Search spelling suggestions
Top searches
Ads
200 Elastic MapReduce jobs per day
Processing 3TB of data
89. Storage Big Data Compute
Hadoop-as-a-Service – Elastic MapReduce
"With Amazon Elastic MapReduce, there
was no upfront investment in hardware, no
hardware procurement delay, and no need
to hire additional operations staff.
Because of the flexibility of the
platform, our first new online advertising
campaign experienced a 500% increase in
return on ad spend from a similar
campaign a year before.”
90. Data Analytics
3.5 billion records
71 million unique cookies
1.7 million targeted ads required per day
Execute batch processing on data sets ranging in size from dozens of Gigabytes to Terabytes.
Building in-house infrastructure to analyze these click stream datasets requires investment in expensive "headroom" to handle peak demand.
"Our first client campaign experienced a 500% increase in their return on ad spend from a similar campaign a year before."
Example: a user who recently purchased a sports movie and is searching for video games receives a targeted ad (1.7 million per day).
92. "AWS gave us the flexibility to bring a massive amount of capacity online in a short period of time and allowed us to do so in an operationally straightforward way. AWS is now Shazam's cloud provider of choice."
Jason Titus, CTO
DynamoDB: over 500,000 writes per second
Amazon EMR: more than 1 million writes per second
97. Step 1: Tracking. We've created a unique tracking application. It keeps track of all websites visited, software used, and/or ads seen.
Step 2: Panel. We invite members of a research panel to install it. We know not only their digital habits, but also their offline demographics and behavior.
Step 3: Dashboard. Usage data now begins to pour into the Wakoopa dashboard in real time. Log in, and create beautiful visualizations and useful reports.
100. Rediff uses Amazon EMR along with Amazon S3 to
perform data mining, log processing and analytics for
their online business. Inputs gained are used to power
a better user experience on their portal.
Rediff needed 12-15 hours to run this on a 10-12 node
cluster on premises. AWS gave the choice and flexibility of
an on-demand model that can be scaled up and
down, and shortened the time required to process the data.
102. More than 25 Million Streaming Members
50 Billion Events Per Day
Elasticity works from just 1 EC2 instance to many thousands. Just dial up and down as required.
Horizontal scaling on commodity hardware. Perfect for Hadoop.
We’ve been operating the service for over 3 years now and in the last year alone we’ve operated over 2 MILLION Hadoop clusters
Yelp was founded in 2004 with the main goal of helping people connect with great local businesses. The Yelp community is best known for sharing in-depth reviews and insights on local businesses of every sort. In their six years of operation Yelp went from a one-city wonder (San Francisco) to an international phenomenon spanning 8 countries and nearly 50 cities. As of November 2010, Yelp had more than 39 million unique visitors to the site and in total, more than 14 million reviews have been posted by yelpers.

Yelp has established a loyal consumer following, due in large part to the fact that they are vigilant in protecting the user from shill or suspect content. Yelp uses an automated review filter to identify suspicious content and minimize exposure to the consumer. The site also features a wide range of other features that help people discover new businesses (lists, special offers, and events), and communicate with each other. Additionally, business owners and managers are able to set up free accounts to post special offers, upload photos, and message customers.

The company has also been focused on developing mobile apps and was recently voted into the iTunes Apps Hall of Fame. Yelp apps are also available for Android, Blackberry, Windows 7, Palm Pre and WAP. Local search advertising makes up the majority of Yelp's revenue stream. The search ads are colored light orange and clearly labeled "Sponsored Results." Paying advertisers are not allowed to change or re-order their reviews.

Yelp originally depended upon giant RAIDs to store their logs, along with a single local instance of Hadoop. When Yelp made the move to Amazon Elastic MapReduce, they replaced the RAIDs with Amazon Simple Storage Service (Amazon S3) and immediately transferred all Hadoop jobs to Amazon Elastic MapReduce. "We were running out of hard drive space and capacity on our Hadoop cluster," says Yelp search and data-mining engineer Dave Marin.

Yelp uses Amazon S3 to store daily logs and photos, generating around 100GB of logs per day. The company also uses Amazon Elastic MapReduce to power approximately 20 separate batch scripts, most of those processing the logs. Features powered by Amazon Elastic MapReduce include: People Who Viewed this Also Viewed, review highlights, auto-complete as you type on search, search spelling suggestions, and top searches.

Their jobs are written exclusively in Python, while Yelp uses their own open-source library, mrjob, to run their Hadoop streaming jobs on Amazon Elastic MapReduce, with boto to talk to Amazon S3. Yelp also uses s3cmd and the Ruby Elastic MapReduce utility for monitoring. Yelp developers advise others working with AWS to use the boto API as well as mrjob to ensure full utilization of Amazon Elastic MapReduce job flows. Yelp runs approximately 200 Elastic MapReduce jobs per day, processing 3TB of data, and is grateful for AWS technical support that helped with their Hadoop application development.

Using Amazon Elastic MapReduce, Yelp was able to save $55,000 in upfront hardware costs and get up and running in a matter of days, not months. However, most important to Yelp is the opportunity cost. "With AWS, our developers can now do things they couldn't before," says Marin. "Our systems team can focus their energies on other challenges."
The more misspelled words you collect from your customers, the better the spellcheck application you can create. Yelp is using AWS services to regularly process customer-generated data to improve spell check on their web site.
The more searches you collect, the better the recommendations you can provide. Yelp is using AWS services to deliver features such as hotel or restaurant recommendations, review highlights and search hints.
AWS Case Study: Razorfish

Razorfish, a digital advertising and marketing firm, segments users and customers based on the collection and analysis of non-personally identifiable data from browsing sessions. Doing so requires applying data mining methods across historical click streams to identify effective segmentation and categorization algorithms and techniques. These click streams are generated when a visitor navigates a web site or catalog, leaving behind patterns that can indicate a user's interests. Algorithms are then implemented on systems that can batch execute at the appropriate scale against current data sets ranging in size from dozens of Gigabytes to Terabytes. The algorithms are also customized on a client-by-client basis to observe online/offline sales and customer loyalty data. Results of the analysis are loaded into ad-serving and cross-selling systems that in turn deliver the segmentation results in real time.

A common issue Razorfish has found with customer segmentation is the need to process gigantic click stream data sets. These large data sets are often the result of holiday shopping traffic on a retail website, or sudden dramatic growth on the data network of a media or social networking site. Building in-house infrastructure to analyze these click stream datasets requires investment in expensive "headroom" to handle peak demand. Without the expensive computing resources, Razorfish risks losing clients that require Razorfish to have sufficient resources at hand during critical moments. In addition, applications that can't scale to handle increasingly large datasets can cause delays in identifying and applying algorithms that could drive additional revenue. As the sample data set grows (i.e. more users, more pages, more clicks), fewer applications are available that can handle the load and provide a timely response. Meanwhile, as the number of clients that utilize targeted advertising grows, access to on-demand compute and storage resources becomes a requirement. It was thus imperative for Razorfish to implement customer segmentation algorithms in a way that could be applied and executed independently of the scale of the incoming data and supporting infrastructure.

Prior to implementing the AWS-based solution, Razorfish relied on a traditional hosting environment that utilized high-cost SAN equipment for storage, a proprietary distributed log processing cluster of 30 servers, and several high-end SQL servers. In preparation for the 2009 holiday season, demand for targeted advertising increased. To support this need, Razorfish faced a potential cost of over $500,000 in additional hardware expenses, a procurement time frame of about two months, and the need for an additional senior operations/database administrator. Furthermore, due to downstream dependencies, they needed their daily processing cycle to complete within 18 hours. However, given the increased data volume, Razorfish expected their processing cycle to extend past two days for each run even after the potential investment in human and computing resources.

To deal with the combination of huge datasets and custom segmentation targeting activities, coupled with price-sensitive clients, Razorfish decided to move away from their rigid data infrastructure status quo. This migration helped Razorfish process vast amounts of data to handle the need for rapid scaling at both the application and infrastructure levels.
Razorfish selected Ad Serving integration, Amazon Web Services (AWS), Amazon Elastic MapReduce (a hosted Apache Hadoop service), Cascading, and a variety of chosen applications to power their targeted advertising system based on these benefits:

Efficient: Elastic infrastructure from AWS allows capacity to be provisioned as needed based on load, reducing cost and the risk of processing delays. Amazon Elastic MapReduce and Cascading lets Razorfish focus on application development without having to worry about time-consuming set-up, management, or tuning of Hadoop clusters or the compute capacity upon which they sit.
Ease of integration: Amazon Elastic MapReduce with Cascading allows data processing in the cloud without any changes to the underlying algorithms.
Flexible: Hadoop with Cascading is flexible enough to allow "agile" implementation and unit testing of sophisticated algorithms.
Adaptable: Cascading simplifies the integration of Hadoop with external ad systems.
Scalable: AWS infrastructure helps Razorfish reliably store and process huge (Petabytes) data sets.

The AWS elastic infrastructure platform allows Razorfish to manage wide variability in load by provisioning and removing capacity as needed. Mark Taylor, Program Director at Razorfish, said, "With our implementation of Amazon Elastic MapReduce and Cascading, there was no upfront investment in hardware, no hardware procurement delay, and no additional operations staff was hired. We completed development and testing of our first client project in six weeks. Our process is completely automated. Total cost of the infrastructure averages around $13,000 per month. Because of the richness of the algorithm and the flexibility of the platform to support it at scale, our first client campaign experienced a 500% increase in their return on ad spend from a similar campaign a year before."
Big data, the term for scanning loads of information for possibly profitable patterns, is a growing sector of corporate technology. Mostly people think in terms of online behavior, like mouse clicks, LinkedIn affiliations and Amazon shopping choices. But other big databases in the real world, lying around for years, are there to exploit.

A company called the Climate Corporation was formed in 2006 by two former Google employees who wanted to make use of the vast amount of free data published by the National Weather Service on heat and precipitation patterns around the country. At first they called the company WeatherBill, and used the data to sell insurance to businesses that depended heavily on the weather, from ski resorts and miniature golf courses to house painters and farmers.

It did pretty well, raising more than $50 million from the likes of Google Ventures, Khosla Ventures, and Allen & Company. The problem was, it was hard to sell insurance policies to so many little businesses, even using an online shopping model. People like having their insurance explained. The answer was to get even more data, and focus on the agriculture market through the same sales force that sells federal crop insurance.

"We took 60 years of crop yield data, and 14 terabytes of information on soil types, every two square miles for the United States, from the Department of Agriculture," says David Friedberg, chief executive of the Climate Corporation, a name WeatherBill started using Tuesday. "We match that with the weather information for one million points the government scans with Doppler radar — this huge national infrastructure for storm warnings — and make predictions for the effect on corn, soybeans and winter wheat."

The product, insurance against things like drought, too much rain at the planting or the harvest, or an early freeze, is sold through 10,000 agents nationwide. The Climate Corporation, which also added Byron Dorgan, the former senator from North Dakota, to its board on Tuesday, will very likely get into insurance for specialty crops like tomatoes and grapes, which do not have federal insurance.

Like the weather information, the data on soils was free for the taking. The hard and expensive part is turning the data into a product. Mr. Friedberg was an early member of the corporate development team at Google. The co-founder, Siraj Khaliq, worked in distributed computing, which involves apportioning big data computing problems across multiple machines. He works as the Climate Corporation's chief technical officer. Out of the staff of 60 in the company's San Francisco office (another 30 work in the field) about 12 have doctorates, in areas like environmental science and applied mathematics. "They like that this is a real-world problem, not just clicks on a Web site," Mr. Friedberg says.

He figures that the Climate Corporation is one of the world's largest users of MapReduce, an increasingly popular software technique for making sense of very large data systems. The number crunching is performed on Amazon.com's Amazon Web Services computers. The Climate Corporation is working with data intended to judge how different crops will react to certain soils, water and heat. It might be valuable to commodities traders as well, but Mr. Friedberg figures the better business is to expand in farming. Besides the other crops, he is looking at offering the service in Canada and Brazil, or anywhere else that he can get decent long-term data.
It's unlikely he'll get the quality he got from the federal government, for a price anywhere near "free."

The Climate Corporation

Key Takeaways
Cascading provides data scientists at The Climate Corporation a solid foundation to develop advanced machine learning applications in Cascalog that get deployed directly onto Amazon EMR clusters consisting of 2000+ cores. This results in significantly improved productivity with lower operating costs.

Solution
Data scientists at The Climate Corporation chose to create their algorithms in Cascalog, which is a high-level Clojure-based machine learning language built on Cascading. Cascading is an advanced Java application framework that abstracts the MapReduce APIs in Apache Hadoop and provides developers with a simplified way to create powerful data processing workflows. Programming in Cascalog, data scientists create compact expressions that represent complex batch-oriented AI and machine learning workflows. This results in improved productivity for the data scientists, many of whom are mathematicians rather than computer scientists. It also gives them the ability to quickly analyze complex data sets without having to create large complicated programs in MapReduce. Furthermore, programmers at The Climate Corporation also use Cascading directly for creating jobs inside Hadoop streaming to process additional batch-oriented data workflows.

All these workflows and data processing jobs are deployed directly onto Amazon Elastic MapReduce into their own dedicated clusters. Depending on the size of data sets and the complexity of the algorithms, clusters consisting of up to 200 processor cores are utilized for data normalization workflows, and clusters consisting of over 2000 processor cores are utilized for risk analysis and climate modeling workflows.

Benefits
By utilizing Amazon Elastic MapReduce and Cascalog, data scientists at The Climate Corporation are able to focus on solving business challenges rather than worrying about setting up a complex infrastructure or trying to figure out how to use it to process the vast amounts of complex data.

The Climate Corporation is able to effectively manage its costs by using Amazon Elastic MapReduce and using dedicated cluster resources for each workflow individually. This allows them to utilize the resources only when they are needed, and not have to invest in hardware resources and systems administrators to manage their own private shared cluster where they'd have to optimize their workflows and schedule them to avoid resource contention.

Furthermore, Cascading provides data scientists at The Climate Corporation a common foundation for creating both their batch-oriented machine learning workflows in Cascalog, and Hadoop streaming workflows directly in Cascading. These applications are developed locally on the developers' desktops, and then get instantly deployed onto dedicated Amazon Elastic MapReduce clusters for testing and production use. This minimizes the amount of iterative utilization of the cluster resources, thus allowing The Climate Corporation to manage its costs by utilizing the infrastructure for productive data processing only.
In 2009, Etsy acquired Adtuitive, a startup Internet advertising company. Adtuitive's ad server was completely hosted on Amazon Web Services and served targeted retail ads at a rate of over 100 million requests per month. Adtuitive's configuration included 50 Amazon Elastic Compute Cloud (Amazon EC2) instances, Amazon Elastic Block Store (Amazon EBS) volumes, Amazon CloudFront, Amazon Simple Storage Service (Amazon S3), and a data warehouse pipeline built on Amazon Elastic MapReduce. Amazon Elastic MapReduce runs on a custom domain-specific language that uses the Cascading application programming interface.

Today, Etsy uses Amazon Elastic MapReduce for web log analysis and recommendation algorithms. Because AWS easily and economically processes enormous amounts of data, it's ideal for the type of processing that Etsy performs. Etsy copies its HTTP server logs every hour to Amazon S3, and syncs snapshots of the production database on a nightly basis. The combination of Amazon's products and Etsy's syncing/storage operation provides substantial benefits for Etsy. As Dr. Jason Davis, lead scientist at Etsy, explains, "the computing power available with [Amazon Elastic MapReduce] allows us to run these operations over dozens or even hundreds of machines without the need for owning the hardware."
To learn more, visit http://www.yelp.com/. To learn about the mrjob Python library, visit http://engineeringblog.yelp.com/2010/10/mrjob-distributed-computing-for-everybody.html
"Wakoopa understands what people do in their digital lives. In a privacy-conscious way, our technology tracks what websites they visit, what ads they see, or what apps they use. By using our online research dashboard, you can optimize your digital strategy accordingly. Our clients range from research firms such as TNS and Synovate to companies like Google and Sanoma. Essentially, we're the Lonely Planet of the digital world."
Kamek is a server created by Wakoopa that makes metrics (such as bounce-rate or pageviews) out of millions of visits and visitors, all in a couple of seconds, all in real-time.
Netflix has more than 25 million streaming members and is growing rapidly. Their end users stream movies and TV shows from smart TVs, laptops, phones, and tablets, resulting in over 50 billion events per day.
Netflix stores all of this data in Amazon S3, approximately 1 Petabyte.
AWS Case Study: Ticketmaster and MarketShare

The Business Challenges
The Pricemaster application is a web-based tool designed to optimize live event ticket pricing, improve yield management and generate incremental revenue. The tool takes a holistic approach to maximizing ticket revenue: it optimizes pre-sale and initial pricing all the way through dynamic pricing post on-sale. However, before development could begin, MarketShare had to find an infrastructure that could support the application's dual challenges: limited upfront capital and managing the fluctuating nature of analytic workloads.

Amazon Web Services
After examining their options, MarketShare decided to power Pricemaster using Amazon Web Services (AWS). The AWS feature stack provides the scalability, usability, and on-demand pricing required to support the application's intricate cluster architecture and complex MATLAB simulations. Pricemaster's AWS environment includes four large and extra large Amazon EC2 instances supporting a variety of nodes. The pricing application's Amazon EC2 instances are connected to a central database within Amazon RDS. In addition, Pricemaster's AWS infrastructure includes Amazon ELB for traffic distribution, Amazon SimpleDB for non-relational data storage, Amazon Elastic MapReduce for large-scale data processing, as well as Amazon SES. The Pricemaster team monitors all of these resources with Amazon CloudWatch.

The Business Benefits
The Pricemaster team credits AWS's ease of use, specifically that of Amazon Elastic MapReduce and Amazon RDS, with reducing its developers' infrastructure management time by three hours per day—valuable hours the developers can now spend expanding the capabilities of the Pricemaster solution. With AWS's on-demand pricing, MarketShare also estimates that it reduces costs by over 80% annually, compared to fixed service costs. As the Pricemaster tool continues to grow, the company anticipates even further savings with Amazon Web Services. MarketShare continues to expand its use of AWS for partners such as Ticketmaster, saving time and money and providing a superior solution that is flexible, secure and scalable.
For example, one of our customers, FourSquare, has built this visualization of customer sign-ups from November 2008 to June 2011. This visualization helps understand global service adoption over time. You can create similar visualizations with packages such as gplot or the R graphics package.
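For consistency with the other examples in these notes, here is a minimal Python/matplotlib sketch of a cumulative sign-up chart of the kind described above; the monthly sign-up counts are made up purely for illustration.

# Minimal sketch of a cumulative sign-up visualization like the one described
# above, using matplotlib with hypothetical monthly sign-up counts.
import matplotlib.pyplot as plt
import numpy as np

months = np.arange(1, 32)                       # roughly Nov 2008 to Jun 2011, monthly
signups = np.random.poisson(lam=5000, size=31)  # hypothetical new sign-ups per month

plt.plot(months, np.cumsum(signups))
plt.xlabel("Months since November 2008")
plt.ylabel("Cumulative sign-ups")
plt.title("Service adoption over time (illustrative data)")
plt.show()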