Data without Limits
Dr. Werner Vogels, CTO, Amazon.com
Human Genome Project
Collaborative project to sequence every single letter of the human genetic code.
13 years and billions of dollars to complete.
Gigabyte-scale datasets (transferred between sites on iPods!)
Beyond the Human Genome
45+ species sequenced: mouse, rat, gorilla, rabbit, platypus, nematode, zebrafish...
Compare genomes between species to identify biologically interesting areas of the genome.
100 GB-scale datasets. Increased computational requirements.
The Next Generation
New sequencing instruments lead to a dramatic drop in the cost and time required to sequence a genome.
Sequence and compare the genetic code of individuals to find areas of variation. Much more interesting.
Terabyte-scale datasets. Significant computational requirements.
The 1000 Genomes Project
Public/private consortium to build the world's largest collection of human genetic variation.
Hugely important dataset to drive new insight into known genetic traits, and the identification of new ones.
Vast, complex data and computational resources required, beyond the reach of most research groups and hospitals.
1000 Genomes in the Cloud
The 1000 Genomes data made available to all on AWS. Stored for free as part of the Public Datasets program. Updated regularly.
200 TB. 1,700 individual genomes. As much compute and storage as required, available to all.
The Cloud
Helps do the science we are capable of
50,000-core CycleCloud supercomputer running on the Amazon Cloud
Challenge: to run a virtual screen with a higher-accuracy algorithm and 21 million compounds
Metric: Count
Compute Hours of Work: 109,927 hours
Compute Days of Work: 4,580 days
Compute Years of Work: 12.55 years
Ligand Count: ~21 million ligands
Using CycleCloud & Amazon Cloud, the impossible run finished in...
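The day and year figures on this slide are unit conversions from the total compute hours; a quick sketch to verify them, assuming 24-hour days and 365-day years:

```python
# Convert the slide's total compute hours into days and years
# (assuming 24-hour days and 365-day years).
compute_hours = 109_927

compute_days = compute_hours / 24
compute_years = compute_days / 365

print(round(compute_days))      # 4580 days (slide: 4,580)
print(round(compute_years, 2))  # 12.55 years (slide: 12.55)
```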
Big Data powered by AWS
Big Data: the collection and analysis of large amounts of data to create a competitive advantage
Big Data verticals:
- Media/Advertising: targeted advertising, image and video analysis
- Oil & Gas: seismic analysis, image processing
- Retail: recommendations, transaction analysis
- Life Sciences: genome analysis
- Financial Services: Monte Carlo simulations, risk analysis
- Security: anti-virus, fraud detection, image recognition
- Social Network/Gaming: user demographics, usage analysis, in-game metrics
Storage | Big Data | Compute
Big Data challenges start at relatively small volumes: from 100 GB up to 1,000 PB
When data sets and data analytics need to scale to the point that you have to start innovating around how to collect, store, organize, analyze and share it
Innovation: DynamoDB, Glacier, HPC, EMR, S3, Spot
Unconstrained data growth
95% of the 1.2 zettabytes of data in the digital universe is unstructured. 70% of this is user-generated content.
Unstructured data growth is explosive, with estimates of compound annual growth (CAGR) at 62% from 2008 to 2012.
Source: IDC
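A 62% CAGR compounds quickly; a minimal sketch of what five years at that rate implies, assuming an arbitrary starting volume of 1.0 for illustration:

```python
# Project data volume growth at a 62% compound annual growth rate (CAGR).
# The starting volume of 1.0 (arbitrary units) is an assumption.
cagr = 0.62
volume = 1.0

for year in range(2008, 2013):  # five compounding steps, 2008 through 2012
    volume *= 1 + cagr

print(round(volume, 2))  # ~11.16x growth over five years
```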
Why now?
- Web sites: blogs/reviews/emails/pictures
- Social graphs: Facebook, LinkedIn, contacts
- Application server logs: web sites, games
- Sensor data: weather, water, smart grids
- Images/videos: traffic, security cameras
- Twitter: 50m tweets/day, 1,400% growth per year
Why now? Mobile connected world (more people using, easier to collect)
Why now? More aspects of data (variety, depth, location, frequency)
Why now? Possible to understand (not just answer specific questions)
Why now?
Who is your consumer really? What do people really like? What is happening socially with your products? How do people really use your product?
Why now? More data => better results
Big Data Pipeline
Collect | Store | Organize | Analyze | Share
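The five pipeline stages can be sketched as composable functions; the stage bodies below are placeholders for illustration, not AWS service calls:

```python
# Minimal sketch of the Collect | Store | Organize | Analyze | Share
# pipeline. Each stage is a placeholder; a real system would back these
# with services such as S3, DynamoDB, or Elastic MapReduce.
def collect():
    return ["event:click", "event:view", "event:click"]

def store(records):
    return list(records)      # persist (placeholder: keep in memory)

def organize(records):
    return sorted(records)    # e.g. sort/partition ahead of analysis

def analyze(records):
    counts = {}
    for record in records:
        counts[record] = counts.get(record, 0) + 1
    return counts

def share(results):
    return results            # publish (placeholder: hand to callers)

result = share(analyze(organize(store(collect()))))
print(result)  # {'event:click': 2, 'event:view': 1}
```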
Where do you put your slice of it? Collection / Ingestion
- AWS Direct Connect: dedicated bandwidth between your site and AWS
- AWS Import/Export: physical transfer of media into and out of AWS
- Queuing: reliable messaging for task distribution & collection
- AWS Storage Gateway: shrink-wrapped gateway for volume synchronization
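The "Queuing" entry describes SQS-style reliable messaging for task distribution; a minimal in-process sketch using Python's standard-library queue (a stand-in for illustration, not a networked service):

```python
import queue
import threading

# In-process stand-in for an SQS-style work queue: a producer enqueues
# collection tasks, worker threads pull and process them independently.
tasks = queue.Queue()
results = []
lock = threading.Lock()

def worker():
    while True:
        item = tasks.get()
        if item is None:          # sentinel: no more work for this worker
            tasks.task_done()
            break
        with lock:
            results.append(item.upper())
        tasks.task_done()

threads = [threading.Thread(target=worker) for _ in range(3)]
for t in threads:
    t.start()

for name in ["ingest-a", "ingest-b", "ingest-c", "ingest-d"]:
    tasks.put(name)
for _ in threads:
    tasks.put(None)               # one sentinel per worker

tasks.join()
for t in threads:
    t.join()

print(sorted(results))  # ['INGEST-A', 'INGEST-B', 'INGEST-C', 'INGEST-D']
```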
Where do you put your slice of it?
- Relational Database Service: fully managed database (MySQL, Oracle, MSSQL)
- DynamoDB: NoSQL, schemaless, provisioned-throughput database
- Simple Storage Service (S3): object datastore up to 5 TB per object, 99.999999999% durability
Where do you put your slice of it?
- Glacier: long-term cold storage, from $0.01 per GB/month, 99.999999999% durability
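At $0.01 per GB/month, archive costs are simple multiplication; a sketch that reuses the 200 TB figure from the 1000 Genomes slide purely for illustration:

```python
# Estimate a monthly Glacier archive bill at the slide's $0.01 per
# GB/month rate. The 200 TB dataset size is borrowed from the
# 1000 Genomes slide for illustration only.
price_per_gb_month = 0.01
dataset_tb = 200
dataset_gb = dataset_tb * 1024

monthly_cost = dataset_gb * price_per_gb_month
print(monthly_cost)  # 2048.0 dollars per month
```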
Glacier: full lifecycle big data management
- Data import: physical shipping of devices for creation of data in AWS. E.g. 50 TB of seismic data created as EBS volumes in a Gluster file system.
- Computation & visualization: HPC & EMR cluster jobs of many thousands of cores. E.g. 200 TB of visualization data generated from cluster processing.
- Long-term archive: once data analysis is complete, the entire resultant dataset is placed in cold storage rather than on tape. Cost effective when compared to tape; retrieval in 3-5 hours if required.
How quickly do you need to read it?
- DynamoDB (single-digit ms): social-scale applications, provisioned throughput performance, flexible consistency models
- S3 (10s-100s of ms): any object, any app; 99.999999999% durability; objects up to 5 TB in size
- Glacier (<5 hours): media & asset archives, extremely low cost, S3 levels of durability
Performance | Scale | Price
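The three tiers form a latency/price trade-off; a hypothetical selector that picks the cheapest tier whose worst-case read latency fits (the function and thresholds are illustrative assumptions, not an AWS API):

```python
# Hypothetical selector over the slide's read-latency tiers:
# DynamoDB (single-digit ms), S3 (10s-100s of ms), Glacier (<5 hours).
# Returns the cheapest tier whose worst-case latency the caller accepts.
FIVE_HOURS_MS = 5 * 60 * 60 * 1000

def cheapest_tier(tolerable_latency_ms):
    if tolerable_latency_ms >= FIVE_HOURS_MS:
        return "Glacier"      # can wait hours: cheapest archive tier
    if tolerable_latency_ms >= 500:
        return "S3"           # sub-second to seconds is fine
    return "DynamoDB"         # needs low single-digit milliseconds

print(cheapest_tier(5))                     # DynamoDB
print(cheapest_tier(60_000))                # S3
print(cheapest_tier(24 * 60 * 60 * 1000))   # Glacier
```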
Operate at any scale: unlimited data
Pay for only what you use
- Provisioned IOPS: provisioned read/write performance per Dynamo table/EBS volume. Pay for a given provisioned capacity whether used or not.
- Volume used: pay for volume stored per month & puts/gets. No capacity planning required to maintain unlimited storage.
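Pay-per-use billing for stored volume plus requests is simple arithmetic; a sketch with made-up unit prices for illustration (not real AWS rates):

```python
# Illustrative pay-per-use monthly bill: storage volume plus request
# counts. The unit prices are assumptions for the sketch, not AWS rates.
def monthly_bill(gb_stored, puts, gets,
                 gb_price=0.10, put_price=0.00001, get_price=0.000001):
    return gb_stored * gb_price + puts * put_price + gets * get_price

bill = monthly_bill(gb_stored=500, puts=1_000_000, gets=10_000_000)
print(round(bill, 2))  # 50 + 10 + 10 = 70.0
```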
"Big data" changes the dynamics of computation and data sharing
- Collection: How do I acquire it? Where do I put it?
- Computation: What horsepower can I apply to it?
- Collaboration: How do I work with others on it?
"Big data" changes the dynamics of computation and data sharing
- Collection (Direct Connect, Import/Export, S3, DynamoDB): How do I acquire it? Where do I put it?
- Computation (EC2, GPUs, Elastic MapReduce): What horsepower can I apply to it?
- Collaboration (CloudFormation, Simple Workflow, S3): How do I work with others on it?
Hadoop-as-a-Service: Elastic MapReduce
- Managed, elastic Hadoop cluster
- Integrates with S3 & DynamoDB
- Leverage Hive & Pig analytics scripts
- Integrates with instance types such as spot
Elastic MapReduce feature details:
- Scalable: use as many or as few compute instances running Hadoop as you want. Modify the number of instances while your job flow is running.
- Integrated with other services: works seamlessly with S3 as origin and output. Integrates with DynamoDB.
- Comprehensive: supports languages such as Hive and Pig for defining analytics, and allows complex definitions in Cascading, Java, Ruby, Perl, Python, PHP, R, or C++.
- Cost effective: works with Spot instance types.
- Monitoring: monitor job flows from within the management console.
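The Hadoop jobs EMR runs follow the map/shuffle/reduce pattern; a minimal in-process word-count sketch of that pattern (an illustration of the model, not EMR's actual API):

```python
from collections import defaultdict

# In-process sketch of the MapReduce word-count pattern that Hadoop/EMR
# jobs follow: map emits (word, 1) pairs, shuffle groups them by key,
# reduce sums each group.
def map_phase(lines):
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data on AWS", "big compute big storage"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts["big"])  # 3
```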
Features powered by Amazon Elastic MapReduce:
- People Who Viewed this Also Viewed
- Review highlights
- Auto complete as you type on search
- Search spelling suggestions
- Top searches
- Ads
200 Elastic MapReduce jobs per day, processing 3 TB of data
Hadoop-as-a-Service: Elastic MapReduce
"With Amazon Elastic MapReduce, there was no upfront investment in hardware, no hardware procurement delay, and no need to hire additional operations staff. Because of the flexibility of the platform, our first new online advertising campaign experienced a 500% increase in return on ad spend from a similar campaign a year before."
Data Analytics
- 3.5 billion records
- 71 million unique cookies
- 1.7 million targeted ads required per day
Execute batch processing on data sets ranging in size from dozens of gigabytes to terabytes. Building in-house infrastructure to analyze these click-stream datasets requires investment in expensive "headroom" to handle peak demand.
"Our first client campaign experienced a 500% increase in their return on ad spend from a similar campaign a year before."
Example: a user recently purchased a sports movie and is searching for video games, so a targeted ad is served (1.7 million per day).
"AWS gave us the flexibility to bring a massive amount of capacity online in a short period of time and allowed us to do so in an operationally straightforward way. AWS is now Shazam's cloud provider of choice." Jason Titus, CTO
DynamoDB: over 500,000 writes per second
Amazon EMR: more than 1 million writes per second
Step 1: Tracking. We've created a unique tracking application. It keeps track of all websites visited, software used, and/or ads seen.
Step 2: Panel. We invite members of a research panel to install it. We know not only their digital habits, but also their offline demographics and behavior.
Step 3: Dashboard. Usage data now begins to pour into the Wakoopa dashboard in real time. Log in, and create beautiful visualizations and useful reports.
Rediff uses Amazon EMR along with Amazon S3 to perform data mining, log processing and analytics for their online business. Insights gained are used to power a better user experience on their portal.
Rediff needed 12-15 hours to run this on a 10-12 node cluster on premises. AWS gave the choice and flexibility of an on-demand model which can be scaled up and down, and shortened the time required to process data.
More than 25 Million Streaming Members
50 Billion Events Per Day