Data without Limits - Presentation made by Dr. Werner Vogels from Amazon Web Services
Usage Rights

© All Rights Reserved

  • Variety of sources: streaming, file shipping, disk shipping, Import/Export
  • “Wakoopa understands what people do in their digital lives. In a privacy-conscious way, our technology tracks what websites they visit, what ads they see, or what apps they use. By using our online research dashboard, you can optimize your digital strategy accordingly. Our clients range from research firms such as TNS and Synovate to companies like Google and Sanoma. Essentially, we’re the Lonely Planet of the digital world.”
  • Kamek is a server created by Wakoopa that makes metrics (such as bounce-rate or pageviews) out of millions of visits and visitors, all in a couple of seconds, all in real-time.
  • Mechanical Turk
  • It is almost never the case that any single organization has access to the advanced machine learning and statistical techniques that would allow it to extract maximum value from its data. Meanwhile, data scientists crave real-world data to develop and refine their techniques. Crowdsourcing corrects this mismatch by offering companies a cost-effective way to harness the ‘cognitive surplus’ of the world's best data scientists.
  • Etsy: In 2009, Etsy acquired Adtuitive, a startup Internet advertising company. Adtuitive’s ad server was completely hosted on Amazon Web Services and served targeted retail ads at a rate of over 100 million requests per month. Adtuitive’s configuration included 50 Amazon Elastic Compute Cloud (Amazon EC2) instances, Amazon Elastic Block Store (Amazon EBS) volumes, Amazon CloudFront, Amazon Simple Storage Service (Amazon S3), and a data warehouse pipeline built on Amazon Elastic MapReduce. The Elastic MapReduce jobs run on a custom domain-specific language that uses the Cascading application programming interface.

    Today, Etsy uses Amazon Elastic MapReduce for web log analysis and recommendation algorithms. Because AWS easily and economically processes enormous amounts of data, it’s ideal for the type of processing that Etsy performs. Etsy copies its HTTP server logs to Amazon S3 every hour, and syncs snapshots of the production database on a nightly basis. As Dr. Jason Davis, lead scientist at Etsy, explains, “the computing power available with [Amazon Elastic MapReduce] allows us to run these operations over dozens or even hundreds of machines without the need for owning the hardware.”

    Yelp: Yelp was founded in 2004 with the main goal of helping people connect with great local businesses. The Yelp community is best known for sharing in-depth reviews and insights on local businesses of every sort. In their six years of operation Yelp went from a one-city wonder (San Francisco) to an international phenomenon spanning 8 countries and nearly 50 cities. As of November 2010, Yelp had more than 39 million unique visitors to the site and, in total, more than 14 million reviews had been posted by yelpers.

    Yelp has established a loyal consumer following, due in large part to the fact that they are vigilant in protecting the user from shill or suspect content. Yelp uses an automated review filter to identify suspicious content and minimize exposure to the consumer. The site also offers a wide range of other features that help people discover new businesses (lists, special offers, and events) and communicate with each other. Additionally, business owners and managers are able to set up free accounts to post special offers, upload photos, and message customers.

    The company has also been focused on developing mobile apps and was recently voted into the iTunes Apps Hall of Fame. Yelp apps are also available for Android, Blackberry, Windows 7, Palm Pre and WAP.

    Local search advertising makes up the majority of Yelp’s revenue stream. The search ads are colored light orange and clearly labeled “Sponsored Results.” Paying advertisers are not allowed to change or re-order their reviews.

    Yelp originally depended upon giant RAIDs to store their logs, along with a single local instance of Hadoop. When Yelp made the move to Amazon Elastic MapReduce, they replaced the RAIDs with Amazon Simple Storage Service (Amazon S3) and immediately transferred all Hadoop jobs to Amazon Elastic MapReduce. “We were running out of hard drive space and capacity on our Hadoop cluster,” says Yelp search and data-mining engineer Dave Marin.

    Yelp uses Amazon S3 to store daily logs and photos, generating around 100GB of logs per day. The company also uses Amazon Elastic MapReduce to power approximately 20 separate batch scripts, most of those processing the logs. Features powered by Amazon Elastic MapReduce include: People Who Viewed this Also Viewed; review highlights; auto complete as you type on search; search spelling suggestions; top searches; ads.

    Their jobs are written exclusively in Python. Yelp uses its own open-source library, mrjob, to run its Hadoop streaming jobs on Amazon Elastic MapReduce, with boto to talk to Amazon S3. Yelp also uses s3cmd and the Ruby Elastic MapReduce utility for monitoring. Yelp developers advise others working with AWS to use the boto API as well as mrjob to ensure full utilization of Amazon Elastic MapReduce job flows. Yelp runs approximately 200 Elastic MapReduce jobs per day, processing 3TB of data, and is grateful for AWS technical support that helped with their Hadoop application development.

    Using Amazon Elastic MapReduce, Yelp was able to save $55,000 in upfront hardware costs and get up and running in a matter of days, not months. However, most important to Yelp is the opportunity cost. “With AWS, our developers can now do things they couldn’t before,” says Marin. “Our systems team can focus their energies on other challenges.”

    To learn more, visit http://www.yelp.com/. To learn about the mrjob Python library, visit http://engineeringblog.yelp.com/2010/10/mrjob-distributed-computing-for-everybody.html
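
The Yelp notes above describe Python batch jobs that process daily logs as Hadoop streaming jobs. As a rough illustration of the map/reduce pattern that could sit behind a feature such as "Top searches", here is a minimal pure-Python sketch. It deliberately does not use mrjob, boto, or Amazon Elastic MapReduce, and it is not Yelp's code: the log format (a timestamp and a query separated by a tab), the function names, and the sample data are all hypothetical assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_search_line(line: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit (query, 1) for each well-formed log line.

    Assumes a hypothetical log format: "<timestamp>\t<search query>".
    Lines that do not match are skipped.
    """
    parts = line.rstrip("\n").split("\t")
    if len(parts) == 2 and parts[1].strip():
        yield parts[1].strip().lower(), 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group values by key, as a framework would between map and reduce."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(query: str, counts: Iterable[int]) -> Tuple[str, int]:
    """Reduce step: sum the counts emitted for one query."""
    return query, sum(counts)


def top_searches(lines: Iterable[str], n: int = 3) -> List[Tuple[str, int]]:
    """Run map, shuffle, and reduce in-process and return the n top queries."""
    mapped = (pair for line in lines for pair in map_search_line(line))
    reduced = [reduce_counts(q, c) for q, c in shuffle(mapped).items()]
    # Sort by descending count, then alphabetically so ties are deterministic.
    return sorted(reduced, key=lambda qc: (-qc[1], qc[0]))[:n]


# Hypothetical sample data, for illustration only.
SAMPLE_LOG = [
    "2010-11-01T10:00:00\tpizza",
    "2010-11-01T10:00:05\tSushi",
    "2010-11-01T10:00:09\tpizza",
    "malformed line without a tab",
    "2010-11-01T10:01:00\tsushi",
    "2010-11-01T10:02:00\tpizza",
    "2010-11-01T10:03:00\ttacos",
]

if __name__ == "__main__":
    for query, count in top_searches(SAMPLE_LOG):
        print(query, count)
```

In a real Hadoop streaming job the map and reduce steps would run on separate machines and the framework would perform the grouping; the in-process `shuffle` here only stands in for that step so the sketch can run on a laptop.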

Presentation Transcript

  • Data without Limits Dr. Werner Vogels CTO, Amazon.com
  • BIG-DATA: The collection and analysis of large amounts of data to create a competitive advantage
  • BIG-DATA: When your data sets become so large that you have to start innovating how to collect, store, organize, analyze and share it
  • BIGGER IS BETTER
  • UNCERTAINTY
  • BIG-DATA REQUIRES NO LIMITS
  • COLLECT | STORE | ORGANIZE | ANALYZE | SHARE
  • Step 1: Tracking. We’ve created a unique tracking application. It keeps track of all websites visited, software used, and/or ads seen.
    Step 2: Panel. We invite members of a research panel to install it. We know not only their digital habits, but also their offline demographics and behavior.
    Step 3: Dashboard. Usage data now begins to pour into the Wakoopa dashboard in real-time. Log in, and create beautiful visualizations and useful reports.
  • Technology (architecture diagram; labels: Panel, AWS, Activity, SQS, EMR, RDS, Data, Kamek*, Metrics, S3, Wakoopa dashboard)
  • AWS DIRECT CONNECT
  • AWS IMPORT/EXPORT
  • COLLECT | STORE | ORGANIZE | ANALYZE | SHARE
  • DATA QUALITY CONTROL • Control Data • Correct Data • Validate Data • Enrich Data
  • Crowdsourcing: Mismatch between those with data and those with the skills to analyse it
  • COLLECT | STORE | ORGANIZE | ANALYZE | SHARE
  • MAPREDUCE | AMAZON ELASTIC MAPREDUCE | HADOOP
  • 30,000 CORES
  • COLLECT | STORE | ORGANIZE | ANALYZE | SHARE
  • http://aws.amazon.com/publicdatasets