Data without Limits - Dr. Werner Vogels, CTO, Amazon.com
Data Without Limit - Dr. Werner Vogels - AWS Summit 2012 Australia

Closing presentation from Dr. Werner Vogels at the AWS Summit in Sydney, May 2012

Published in: Technology, Business
  • Variety of sources: streaming, file shipping, disk shipping, Import/Export
  • “Wakoopa understands what people do in their digital lives. In a privacy-conscious way, our technology tracks what websites they visit, what ads they see, or what apps they use. By using our online research dashboard, you can optimize your digital strategy accordingly. Our clients range from research firms such as TNS and Synovate to companies like Google and Sanoma. Essentially, we’re the Lonely Planet of the digital world.”
  • Kamek is a server created by Wakoopa that makes metrics (such as bounce rate or pageviews) out of millions of visits and visitors, all in a couple of seconds, all in real time.
  • Mechanical Turk
  • In 2009, Etsy acquired Adtuitive, a startup Internet advertising company. Adtuitive’s ad server was completely hosted on Amazon Web Services and served targeted retail ads at a rate of over 100 million requests per month. Adtuitive’s configuration included 50 Amazon Elastic Compute Cloud (Amazon EC2) instances, Amazon Elastic Block Store (Amazon EBS) volumes, Amazon CloudFront, Amazon Simple Storage Service (Amazon S3), and a data warehouse pipeline built on Amazon Elastic MapReduce. The Amazon Elastic MapReduce jobs were written in a custom domain-specific language that uses the Cascading application programming interface.

    Today, Etsy uses Amazon Elastic MapReduce for web log analysis and recommendation algorithms. Because AWS easily and economically processes enormous amounts of data, it’s ideal for the type of processing that Etsy performs. Etsy copies its HTTP server logs to Amazon S3 every hour and syncs snapshots of the production database on a nightly basis. The combination of Amazon’s products and Etsy’s syncing/storage operation provides substantial benefits for Etsy. As Dr. Jason Davis, lead scientist at Etsy, explains, “the computing power available with [Amazon Elastic MapReduce] allows us to run these operations over dozens or even hundreds of machines without the need for owning the hardware.”

    Yelp was founded in 2004 with the main goal of helping people connect with great local businesses. The Yelp community is best known for sharing in-depth reviews and insights on local businesses of every sort. In their six years of operation, Yelp went from a one-city wonder (San Francisco) to an international phenomenon spanning 8 countries and nearly 50 cities. As of November 2010, Yelp had more than 39 million unique visitors to the site, and in total more than 14 million reviews have been posted by yelpers.

    Yelp has established a loyal consumer following, due in large part to the fact that they are vigilant in protecting the user from shill or suspect content. Yelp uses an automated review filter to identify suspicious content and minimize exposure to the consumer. The site also offers a wide range of other features that help people discover new businesses (lists, special offers, and events) and communicate with each other. Additionally, business owners and managers are able to set up free accounts to post special offers, upload photos, and message customers.

    The company has also been focused on developing mobile apps and was recently voted into the iTunes Apps Hall of Fame. Yelp apps are also available for Android, Blackberry, Windows 7, Palm Pre and WAP.

    Local search advertising makes up the majority of Yelp’s revenue stream. The search ads are colored light orange and clearly labeled “Sponsored Results.” Paying advertisers are not allowed to change or re-order their reviews.

    Yelp originally depended upon giant RAIDs to store their logs, along with a single local instance of Hadoop. When Yelp made the move to Amazon Elastic MapReduce, they replaced the RAIDs with Amazon Simple Storage Service (Amazon S3) and immediately transferred all Hadoop jobs to Amazon Elastic MapReduce.

    “We were running out of hard drive space and capacity on our Hadoop cluster,” says Yelp search and data-mining engineer Dave Marin.

    Yelp uses Amazon S3 to store daily logs and photos, generating around 100GB of logs per day. The company also uses Amazon Elastic MapReduce to power approximately 20 separate batch scripts, most of those processing the logs. Features powered by Amazon Elastic MapReduce include: People Who Viewed this Also Viewed; review highlights; auto complete as you type on search; search spelling suggestions; top searches; and ads.

    Their jobs are written exclusively in Python, and Yelp uses their own open-source library, mrjob, to run their Hadoop streaming jobs on Amazon Elastic MapReduce, with boto to talk to Amazon S3. Yelp also uses s3cmd and the Ruby Elastic MapReduce utility for monitoring.

    Yelp developers advise others working with AWS to use the boto API as well as mrjob to ensure full utilization of Amazon Elastic MapReduce job flows. Yelp runs approximately 200 Elastic MapReduce jobs per day, processing 3TB of data, and is grateful for AWS technical support that helped with their Hadoop application development.

    Using Amazon Elastic MapReduce, Yelp was able to save $55,000 in upfront hardware costs and get up and running in a matter of days, not months. However, most important to Yelp is the opportunity cost. “With AWS, our developers can now do things they couldn’t before,” says Marin. “Our systems team can focus their energies on other challenges.”

    To learn more, visit http://www.yelp.com/ . To learn about the mrjob Python library, visit http://engineeringblog.yelp.com/2010/10/mrjob-distributed-computing-for-everybody.html
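The log-processing batch jobs in the Yelp note follow the MapReduce pattern: a mapper emits key/value pairs for each log line, and a reducer aggregates the values per key. The following is a minimal sketch of that pattern for a "top searches" style count. It runs in-process in plain Python and does not use Hadoop, mrjob, or boto. The tab-separated log layout (`timestamp<TAB>query`), the sample data, and all function names are assumptions made for illustration, not Yelp's actual format or code.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_search_line(line: str) -> Iterator[Tuple[str, int]]:
    """Mapper: emit (query, 1) for each well-formed log line."""
    # Hypothetical layout: "<timestamp>\t<query>". Other lines are skipped.
    parts = line.rstrip("\n").split("\t")
    if len(parts) == 2 and parts[1].strip():
        yield parts[1].strip().lower(), 1


def reduce_counts(query: str, counts: Iterable[int]) -> Tuple[str, int]:
    """Reducer: sum all counts emitted for one query."""
    return query, sum(counts)


def top_searches(lines: Iterable[str], n: int = 3) -> List[Tuple[str, int]]:
    """Run map, shuffle (group by key), and reduce in a single process."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for line in lines:
        for query, one in map_search_line(line):
            grouped[query].append(one)
    totals = [reduce_counts(query, counts) for query, counts in grouped.items()]
    # Highest count first; ties are broken alphabetically for a stable order.
    return sorted(totals, key=lambda item: (-item[1], item[0]))[:n]


# Made-up sample data in the assumed layout.
SAMPLE_LOG = [
    "2010-11-01T10:00:00\tpizza",
    "2010-11-01T10:00:05\tCoffee",
    "2010-11-01T10:00:09\tpizza",
    "malformed line without a tab",
    "2010-11-01T10:01:00\tcoffee",
    "2010-11-01T10:02:00\tpizza",
]

if __name__ == "__main__":
    for query, count in top_searches(SAMPLE_LOG):
        print(query, count)
```

In a real Hadoop streaming job the grouping step is performed by the framework between the map and reduce phases; here a dictionary stands in for it so the sketch can run on a laptop.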
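The Kamek note above mentions deriving metrics such as bounce rate and pageviews from visits and visitors. As a rough illustration of what such an aggregation computes, here is a small sketch. It assumes the common definition of bounce rate (the share of visits with exactly one pageview), which the notes do not spell out, and the `Visit` and `Metrics` types and the sample data are hypothetical. It says nothing about how Kamek itself is built or how it achieves real-time speed.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class Visit:
    visitor_id: str
    pages: Tuple[str, ...]  # URLs viewed during this visit, in order


@dataclass(frozen=True)
class Metrics:
    visits: int
    visitors: int
    pageviews: int
    bounce_rate: float  # assumed definition: single-page visits / all visits


def compute_metrics(visits: Iterable[Visit]) -> Metrics:
    """Aggregate a batch of visits into a few summary metrics."""
    visit_list = [v for v in visits if v.pages]  # ignore visits with no pages
    total = len(visit_list)
    pageviews = sum(len(v.pages) for v in visit_list)
    bounces = sum(1 for v in visit_list if len(v.pages) == 1)
    visitors = len({v.visitor_id for v in visit_list})
    bounce_rate = bounces / total if total else 0.0
    return Metrics(total, visitors, pageviews, bounce_rate)


# Made-up sample data.
SAMPLE_VISITS = [
    Visit("a", ("/home",)),
    Visit("a", ("/home", "/pricing", "/signup")),
    Visit("b", ("/blog",)),
    Visit("c", ("/home", "/blog")),
]

if __name__ == "__main__":
    print(compute_metrics(SAMPLE_VISITS))
```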
  • Data Without Limit - Dr. Werner Vogels - AWS Summit 2012 Australia

    1. Data without Limits - Dr. Werner Vogels, CTO, Amazon.com
    2. http://wv.ly/4thpar
    3. DATA Intensive Centric
    4. BIG-DATA DATA Centric Intensive BIG-DATA
    5. BIG-DATA: When your data sets become so large that you have to start innovating how to collect, store, organize, analyze and share it
    6. 3Vs
    7. 3Vs: Volume, Velocity, Variety
    8. BIG-DATA: The collection and analysis of large amounts of data to create a competitive advantage
    9. BIGGER IS BETTER
    10. UNCERTAINTY
    11. BIG-DATA REQUIRES NO LIMITS
    12. COLLECT | STORE | ORGANIZE | ANALYZE |
    13. COLLECT | STORE | ORGANIZE | ANALYZE |
    14. Step 1: Tracking. We’ve created a unique tracking application. It keeps track of all websites visited, software used, and/or ads seen. Step 2: Panel. We invite members of a research panel to install it. We know not only their digital habits, but also their offline demographics and behavior. Step 3: Dashboard. Usage data now begins to pour into the Wakoopa dashboard in real-time. Log in, and create beautiful visualizations and useful reports.
    15. Technology (diagram labels): Panel, Activity Data, AWS, SQS, EMR, RDS, S3, Kamek*, Metrics, Wakoopa dashboard
    16. Direct Connect
    17. AWS IMPORT/EXPORT
    18. COLLECT | STORE | ORGANIZE | ANALYZE |
    19. Storage Muck
    20. Database Muck
    21. Amazon DynamoDB
    22. COLLECT | STORE | ORGANIZE | ANALYZE |
    23. DATA QUALITY
    24. DATA QUALITY • Control Data
    25. DATA QUALITY • Control Data • Correct Data
    26. DATA QUALITY • Control Data • Correct Data • Validate Data
    27. DATA QUALITY • Control Data • Correct Data • Validate Data • Enrich Data
    28. A large provider of business listings (over 20MM in the US) needs to determine where each data element belongs and if it is valid. 1 MM new pieces of data are reviewed a day. Data Engine Processing: 2 exception cases • Conflicting information • New information that requires validation. Exceptions are sent to Mechanical Turk: • Workers validate new information through Web and Phone research • Workers remove duplicates. New Data is published: • Data can be pushed out to the website for monetization.
    29. COLLECT | STORE | ORGANIZE | ANALYZE |
    30. Computational
    31. HADOOP, AMAZON ELASTIC MAPREDUCE
    32. Forrester Wave: Enterprise Hadoop Solutions, Q1 ’12. The Forrester Wave™ is copyrighted by Forrester Research, Inc. Forrester and Forrester Wave™ are trademarks of Forrester Research, Inc. The Forrester Wave™ is a graphical representation of Forrester’s call on a market and is plotted using a detailed spreadsheet with exposed scores, weightings, and comments. Forrester does not endorse any vendor, product, or
    33. HP
    34. COLLECT | STORE | ORGANIZE | ANALYZE |
    35. http://aws.amazon.com/publicdatsets
    36. Big Data Verticals. Media/Advertising: Targeted Advertising, Image and Video Processing. Oil & Gas: Seismic Analysis. Retail: Recommendations, Transaction Analysis. Life Sciences: Genome Analysis. Financial Services: Monte Carlo Simulations, Risk Analysis. Security: Anti-virus, Fraud Detection, Image Recognition. Social Network/Gaming: User Demographics, Usage analysis, In-game metrics.
    37. COLLECT | STORE | ORGANIZE | ANALYZE |
    38. werner@amazon.com
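Slide 28 describes a data engine that flags two exception cases (conflicting information, and new information that requires validation) and sends them to Mechanical Turk workers, while the remaining data is published. The routing decision can be sketched as below. This is only an illustration under stated assumptions: the `DataElement` shape, the lookup keyed by listing and field, and the case-insensitive comparison are all hypothetical, and the sketch only builds an in-memory review queue rather than calling any Mechanical Turk API.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Iterable, List, Tuple


class Route(Enum):
    PUBLISH = "publish"
    CONFLICTING = "conflicting information"
    NEEDS_VALIDATION = "new information that requires validation"


@dataclass(frozen=True)
class DataElement:
    listing_id: str
    field: str
    value: str


KnownData = Dict[Tuple[str, str], str]  # (listing_id, field) -> accepted value


def route_element(known: KnownData, element: DataElement) -> Route:
    """Decide whether an incoming element is an exception case."""
    key = (element.listing_id, element.field)
    if key not in known:
        # Nothing on record yet: a human should validate it first.
        return Route.NEEDS_VALIDATION
    if known[key].strip().lower() != element.value.strip().lower():
        # Disagrees with the accepted value: a human should resolve it.
        return Route.CONFLICTING
    return Route.PUBLISH


def split_batch(
    known: KnownData, batch: Iterable[DataElement]
) -> Tuple[List[DataElement], List[Tuple[DataElement, Route]]]:
    """Separate publishable elements from exceptions queued for review."""
    publish: List[DataElement] = []
    review: List[Tuple[DataElement, Route]] = []
    for element in batch:
        route = route_element(known, element)
        if route is Route.PUBLISH:
            publish.append(element)
        else:
            review.append((element, route))
    return publish, review


# Made-up sample data.
KNOWN = {("biz-1", "phone"): "555-0100"}
BATCH = [
    DataElement("biz-1", "phone", "555-0100"),
    DataElement("biz-1", "phone", "555-0199"),
    DataElement("biz-2", "phone", "555-0123"),
]

if __name__ == "__main__":
    ready, exceptions = split_batch(KNOWN, BATCH)
    print(len(ready), "to publish;", len(exceptions), "sent for review")
```

Keeping the routing decision in a pure function separates the automated check from whatever mechanism later hands the review queue to human workers.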
