In the old days, you knew beforehand what questions you were going to ask. Those questions drove the data model, and the data model in turn drove how you stored the data and how you collected it.
Now the philosophy around data has changed. The philosophy is to collect as much data as possible before you know what questions you are going to ask and, most importantly, before you know which algorithms you are going to run, because you don't know what types of questions you might need to ask in the future. The ultimate mantra is: collect and measure everything. How you are going to refine those algorithms, how much data, how much processing power: you really don't know how many resources you will need. Big data is what clouds are for. Big data analysis and cloud computing are the perfect marriage. If you are really serious about this new style of data analysis, you should not be worried about the amount of computation. You should be completely free from those constraints. Collect and store without limits. Compute and analyze without limits. Visualize without limits.
Data is the next industrial revolution. Today, the core of any successful company is the data it manages and its ability to effectively model, analyze and process that data quickly, almost in real time, so that it can make the right decision faster and rise to the top.
Big Data is all about storing, processing, analyzing, sharing, distributing and visualizing massive amounts of data so that companies can distill knowledge from it, derive valuable business insights from that knowledge, and make better business decisions, all as quickly as possible.
Bankinter uses Amazon Web Services (AWS) as an integral part of their credit-risk simulation application, developing complex algorithms to simulate diverse scenarios in order to evaluate the financial health of their clients. The bank needs at least 400,000 simulations to get realistic results. Through the use of AWS, Bankinter brought average time-to-solution down from 23 hours to 20 minutes and dramatically reduced processing, with the ability to reduce even further when required.
The cloud is highly cost-effective because you can turn resources off and stop paying for them when you don’t need them or when your users are not accessing them. Build websites that sleep at night.
Only happens in the cloud
This is a real usage graph from one of our financial services customers during the last week of April (they have asked to remain anonymous for competitive reasons). Firms on Wall Street are finding EC2 an ideal environment to run many of their daily mission-critical grid computing and CPU-bound applications for a couple of key reasons: 1/ Flexibility: the ability to instantly access hundreds or thousands of cores increases the amount of data they can process, improving the overall quality of their models. 2/ Cost efficiency: they can complete more of their processing for less total spend (not paying for infrastructure during times of the day and weekends when it's not needed). This Wall Street firm in particular has a nightly business process where they upload the day’s market trading data into S3 and then run proprietary ‘risk management’ algorithms. This lasts ~10 hours on weeknights, when they ramp up to the equivalent of 3000 m1.smalls. During the day and on weekends, they maintain a base of roughly 300 cores to handle their always-on workloads.
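The cost-efficiency point can be made concrete with a rough instance-hours comparison. The sketch below is only an approximation under stated assumptions: the nightly peak is treated as a flat 10-hour block on 5 weeknights, ramp-up time is ignored, and "cores" and "m1.small equivalents" are treated as the same unit. The function name is illustrative, not part of any AWS tool.

```python
HOURS_PER_WEEK = 7 * 24


def weekly_instance_hours(peak, base, peak_hours_per_night, nights_per_week):
    """Instance-hours per week for a flat base load plus a flat nightly peak."""
    peak_hours = peak_hours_per_night * nights_per_week
    base_hours = HOURS_PER_WEEK - peak_hours
    return peak * peak_hours + base * base_hours


# Figures from the notes: ~3000 at peak, ~300 base, ~10 hours on weeknights.
elastic = weekly_instance_hours(peak=3000, base=300,
                                peak_hours_per_night=10, nights_per_week=5)

# Alternative: provision for the peak all week, as fixed infrastructure would.
fixed = 3000 * HOURS_PER_WEEK

print(f"elastic: {elastic} instance-hours/week")
print(f"fixed:   {fixed} instance-hours/week")
print(f"elastic / fixed = {elastic / fixed:.2f}")
```

Under these assumptions the elastic profile uses a little over a third of the instance-hours that peak-sized fixed capacity would.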
The first story is about the cost of storing and analyzing Big Data. A large retailer went to Razorfish to analyze massive amounts of clickstream logs from their website. They analyze massive datasets of clickstream logs and provide patterns to their ad serving and cross-selling engines so that they can show a targeted ad. While clickstream log analysis is not new in our industry, what I learnt from the story is that the cost of storing and analyzing big data has been significantly reduced.
Think Big Data, Think Cloud Jinesh Varia firstname.lastname@example.org Technology Evangelist
Big Agenda: #1 Why Big Data matters today? #2 How AWS addresses Big Data challenges? #3 What are enterprises doing today?
Until now, the questions you asked drove the data model. The new model is to collect as much data as possible: the “Data-First Philosophy”.
Data is the new raw material for any business, on par with capital, people, labor
Big Data: The collection and analysis of large amounts of data to create a competitive advantage
Big Data + Big Compute by the side = Big Insights. Diagram: Your Big Data: ingest data from different sources, analyze data in parallel, get Big Insights.
Big Data: #1 Why Big Data matters today? #2 How AWS addresses Big Data challenges?
Big Data Use cases Media Transcoding Retail Log Analysis Web Analytics Data Warehousing Genome Sequencing Bioinformatics Digital Advertising Financial Modeling
Big Data Analytics in the AWS Cloud. Diagram labels: Your Big Data (data from different sources); Storage: Amazon S3.
AWS Import/Export
Diagram labels: AWS Import/Export; eSATA, USB 2.0, SATA; Amazon Simple Storage Service (S3); Amazon Elastic Compute Cloud (EC2).

Available Internet Connection   Theoretical Min. Number of Days to Transfer 1TB   When to Consider
                                at 80% Network Utilization                        AWS Import/Export?
T1 (1.544Mbps)                  82 days                                           100GB or more
10Mbps                          13 days                                           600GB or more
T3 (44.736Mbps)                 3 days                                            2TB or more
100Mbps                         1 to 2 days                                       5TB or more
1000Mbps                        Less than 1 day                                   60TB or more
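The "theoretical minimum days" column can be approximated with simple arithmetic. The sketch below assumes 1 TB means 2^40 bytes and 1 Mbps means 10^6 bits per second; those unit conventions are assumptions, not something the slide states. The function name `transfer_days` is illustrative only.

```python
SECONDS_PER_DAY = 86_400
ONE_TB_BYTES = 2 ** 40  # assumption: binary terabyte


def transfer_days(link_mbps, data_bytes=ONE_TB_BYTES, utilization=0.8):
    """Days needed to move data_bytes over a link at the given utilization."""
    usable_bits_per_second = link_mbps * 1_000_000 * utilization
    return data_bytes * 8 / usable_bits_per_second / SECONDS_PER_DAY


links = [
    ("T1", 1.544),
    ("10Mbps", 10),
    ("T3", 44.736),
    ("100Mbps", 100),
    ("1000Mbps", 1000),
]

for name, mbps in links:
    print(f"{name:>9}: {transfer_days(mbps):7.2f} days")
```

Under these assumptions T1 comes out at roughly 82 days, 10 Mbps at roughly 13 days and T3 at roughly 3 days, which is in line with the figures quoted for those links.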
Ingesting large amounts of data to the cloud. Chart: data velocity (hours to days) against data volume and size (GBs to TBs). Options shown: one-time upload with constant delta updates; UDP transfer software (Aspera, Tsunami…); transfer to Amazon S3 over the Internet (multi-threaded/multi-part); AWS Import/Export.
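For the multi-threaded/multi-part option, the first step is splitting a large object into parts that can be sent in parallel. The following is a minimal planning sketch only: it computes byte ranges and makes no network calls. It relies on two documented S3 multipart limits (parts of at least 5 MiB except the last, and at most 10,000 parts per upload); the function name `plan_parts` and the 64 MiB default are hypothetical choices for illustration.

```python
MIN_PART_SIZE = 5 * 1024 ** 2  # S3 minimum part size (except the last part)
MAX_PARTS = 10_000             # S3 maximum number of parts per upload


def plan_parts(total_size, part_size=64 * 1024 ** 2):
    """Return (part_number, offset, length) tuples covering total_size bytes."""
    if total_size <= 0:
        raise ValueError("total_size must be positive")
    if part_size < MIN_PART_SIZE:
        raise ValueError("part_size is below the S3 minimum")

    # Grow the part size if needed so the plan stays within MAX_PARTS.
    smallest_allowed = -(-total_size // MAX_PARTS)  # ceiling division
    part_size = max(part_size, smallest_allowed)

    parts = []
    offset = 0
    number = 1  # S3 part numbers start at 1
    while offset < total_size:
        length = min(part_size, total_size - offset)
        parts.append((number, offset, length))
        offset += length
        number += 1
    return parts


plan = plan_parts(200 * 1024 ** 2)  # a 200 MiB object
for number, offset, length in plan:
    print(number, offset, length)
```

Each tuple could then be handed to a separate worker thread that reads that byte range and uploads it as one part.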
Big Data Analytics in the AWS Cloud. Diagram labels: Your Big Data (data from different sources); Storage: Amazon S3; Compute and Analytics: Amazon EMR (Hadoop), Amazon EC2; Optimize: Amazon EC2 Spot Instances, expand/shrink running cluster; Database: Amazon RDS, Amazon DynamoDB; real-time access to analytical reports.
Hadoop + Amazon Elastic MapReduce. Diagram labels: upload large datasets or log files directly to Amazon S3 (source data, input data); mapper/reducer code and scripts (HiveQL, Pig Latin, Cascading); Amazon Elastic MapReduce runs multiple JobFlow steps on a Hadoop cluster (name node, core nodes with HDFS, task nodes); output data to Amazon S3 and Amazon DynamoDB; BI apps query with HiveQL or Pig Latin over JDBC/ODBC.
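The mapper/reducer roles in the diagram can be illustrated with a tiny local simulation. This is a plain-Python sketch of the map, shuffle and reduce phases, not the Amazon EMR or Hadoop API; the tab-separated `user_id<TAB>url` record format and all function names are hypothetical.

```python
from collections import defaultdict


def mapper(line):
    """Emit (user_id, 1) for one log line of the form user_id<TAB>url."""
    user_id, _url = line.rstrip("\n").split("\t", 1)
    yield user_id, 1


def reducer(key, values):
    """Sum the counts emitted for one key."""
    yield key, sum(values)


def run_job(lines):
    """Run map, then group by key (shuffle), then reduce, all in memory."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)

    results = {}
    for key in sorted(groups):
        for out_key, out_value in reducer(key, groups[key]):
            results[out_key] = out_value
    return results


sample_log = ["u1\t/home", "u2\t/games", "u1\t/movies"]
print(run_job(sample_log))
```

On a real cluster the map and reduce calls would run on different nodes and the grouping step would be handled by the framework; the sketch only shows how the pieces fit together.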
This is where the cloud really shines. Diagram labels: Your Big Data (data from different sources); Storage: Amazon S3; Compute and Analytics: Amazon EMR (Hadoop), Amazon EC2; Optimize: Amazon EC2 Spot Instances, expand/shrink running cluster; Database: Amazon RDS, Amazon DynamoDB; real-time access to analytical reports.
Big Data: #1 Why Big Data matters today? #2 How AWS addresses Big Data challenges? #3 What are enterprises doing today?
#1 Reduced Time To Market: 1 instance for 500 hours = 500 instances for 1 hour. You choose where to balance cost against time.
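The equality on this slide is instance-hours arithmetic, which a few lines can spell out. The sketch assumes a perfectly parallel workload and simple per-instance-hour billing with no rounding; the $0.10 hourly price is a made-up figure for illustration, not an AWS price.

```python
def run_options(total_instance_hours, hourly_price, instance_counts):
    """Return (instances, wall_clock_hours, cost) for each cluster size."""
    options = []
    for count in instance_counts:
        wall_clock_hours = total_instance_hours / count
        cost = count * wall_clock_hours * hourly_price
        options.append((count, wall_clock_hours, cost))
    return options


# Hypothetical price; the workload size (500 instance-hours) is from the slide.
options = run_options(total_instance_hours=500, hourly_price=0.10,
                      instance_counts=[1, 50, 500])

for count, hours, cost in options:
    print(f"{count:>4} instances: {hours:6.1f} hours, ${cost:.2f}")
```

Under these assumptions the cost column stays the same while the wall-clock time shrinks with cluster size, which is the trade-off the slide refers to.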
Bank – Monte Carlo Simulations: 23 Hours to 20 Minutes. “The AWS platform was a good fit for its unlimited and flexible computational power to our risk-simulation process requirements. With AWS, we now have the power to decide how fast we want to obtain simulation results, and, more importantly, we have the ability to run simulations not possible before due to the large amount of infrastructure required.” – Castillo, Director, Bankinter
#2 Now every employee in your company can have one supercomputer
Recommendation Engine for Investment Bankers (S&P Capital IQ). Diagram labels: Microsoft SQL Server; Amazon S3: Clicks, Key Developments, Company Profiles; Amazon Elastic MapReduce: Compute User Selectivity, Compute Key Developments, Join & Score; Amazon S3: Companies You May Be Interested In. “We see continued value in using the AWS cloud because of the flexibility and the scalability. We have a long queue of projects and we envision using AWS to help us get there.” Jeff Sternberg, Data Science Lead, Capital IQ / Standard & Poors
#3 Elasticity is one of the fundamental properties of the cloud that drives many of its economic benefits
When you turn off your cloud resources, you actually stop paying for them
Elasticity in Wall Street & Amazon EC2. Chart: number of EC2 instances per day, Wednesday 4/22/2009 through Tuesday 4/28/2009. Annotations: 3000 CPUs for one firm’s risk management processes; 300 CPUs on weekends.
Clickstream log analysis. Daily batch processing requirement: 3.5 billion records, 71 million unique cookies, 1.7 million targeted ads required per day. Diagram labels: clickstream data (TBs / day); several TBs of clickstream logs a day; daily online ad spend analysis; compile results; optimize next day’s ad spend.
Clickstream Log Analysis
Example query: a user recently purchased a sports movie and is searching for video games. Analyze clickstream logs and get patterns from similar user purchase behavior. Result: targeted ad (1.7 million per day).

                    Old Way                   New Way
Setup               SAN storage               Cloud Services
                    30 servers for compute    Hadoop and Cascading
                    3 high-end SQL servers    “Ad Serving” Integration
Business results
Upfront CapEx       ~$500K                    $0
Recurring OpEx      Significant               $13K/mo.
Procurement time    2 mos.                    zero
Processing time     2 days / job              8 hours / job
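The business results above can be put side by side with some quick arithmetic. This is a rough sketch only: the old way's recurring OpEx is given as "Significant" with no figure, so it is left out here, which understates the old way's cost. "2 days" is taken as 48 hours. The variable and function names are illustrative.

```python
def cumulative_cost(upfront, monthly, months):
    """Total spend after a number of months."""
    return upfront + monthly * months


OLD_UPFRONT = 500_000  # "~$500K" upfront CapEx (approximate)
NEW_UPFRONT = 0        # "$0" upfront CapEx
NEW_MONTHLY = 13_000   # "$13K/mo." recurring OpEx

# Months of cloud OpEx needed to equal the old way's upfront spend alone.
months_to_match = OLD_UPFRONT / NEW_MONTHLY

# Processing time per job: 2 days (taken as 48 hours) against 8 hours.
speedup = 48 / 8

print(f"cloud spend after 12 months: ${cumulative_cost(NEW_UPFRONT, NEW_MONTHLY, 12):,}")
print(f"months of cloud OpEx to match old upfront CapEx: {months_to_match:.1f}")
print(f"processing speedup per job: {speedup:.0f}x")
```

Even ignoring the old way's recurring costs, it takes a bit over three years of cloud OpEx to reach the old upfront figure, while each job finishes about six times faster.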
Cloud Accelerates Big Data Analytics: 500% increase in Return on Ad Spend from last year
3 Takeaways: #1 Why Big Data matters today? Data-First Philosophy and Big Data Analytics. #2 How AWS addresses Big Data challenges? Amazon EMR, Amazon EC2, AWS Import/Export, Amazon DynamoDB. #3 What are enterprises doing today? Capital IQ, Bankinter, Razorfish.
Big Thank you! Jinesh Varia jvaria@amazon.com Twitter: @jinman