Big Data and Analytics Innovation Summit


Published in: Technology

  • 1. Cloud
  • 2. The big data pipeline; How customers are using the pipeline; The big data ecosystem on the cloud
  • 3. Generation; Collect; Store; Collaboration & sharing; Analysis and Computation
  • 4. Generation; Collect; Store; Collaboration & sharing; Analysis and Computation; lower cost, increased throughput
  • 5. Generation; Collect; Store; Collaboration & sharing; Analysis and Computation; lower cost, increased throughput; constraint
  • 6. Generated data; Available for analysis; Data volume. Sources: Gartner, User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011; IDC, Worldwide Business Analytics Software 2012–2016 Forecast and 2011 Vendor Shares
  • 7. Very high barrier to turning data into information…
  • 8. Very high barrier to turning data into information. Infrastructure capacity; Technical skills; Questions to ask; Cheap experimentation
  • 9. Amazon Web Services Cloud
  • 10. Elastic and highly scalable + No upfront capital expense + Only pay for what you use + Available on-demand = Remove constraints
  • 11. Remove constraints = More experimentation; More experimentation = More innovation; More innovation = Competitive edge
  • 12. Amazon Web Services: Removes constraints; Focus on your data; Leave undifferentiated heavy lifting to us
  • 13. big data
  • 14. Bankinter uses HPC on AWS for Monte Carlo Simulation. “Bankinter uses AWS as an integral part of our credit-risk simulation application; We need to perform at least 5,000,000 simulations to get realistic results.” Credit Data. Average simulation time went from 23 hours to 20 minutes
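To make the Monte Carlo idea on slide 14 concrete, here is a minimal sketch of a credit-loss simulation together with the speedup implied by the quoted times. It is not Bankinter's model: the portfolio, the default probability, the independence of defaults, and every function name below are assumptions chosen purely for illustration.

```python
import random


def simulate_portfolio_loss(exposures, default_prob, rng):
    """One scenario: each loan defaults independently (illustrative assumption)."""
    return sum(e for e in exposures if rng.random() < default_prob)


def estimate_expected_loss(exposures, default_prob, n_simulations, seed=0):
    """Average the simulated loss over many independent scenarios."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    total = 0.0
    for _ in range(n_simulations):
        total += simulate_portfolio_loss(exposures, default_prob, rng)
    return total / n_simulations


# Hypothetical portfolio: 50 loans of 100 units each, 2% default probability.
exposures = [100.0] * 50
default_prob = 0.02

estimate = estimate_expected_loss(exposures, default_prob, n_simulations=20_000)
analytic = sum(exposures) * default_prob  # closed-form expected loss for comparison

# Speedup implied by the slide: 23 hours down to 20 minutes.
speedup = (23 * 60) / 20

print(f"simulated expected loss: {estimate:.2f} (analytic: {analytic:.2f})")
print(f"speedup from slide figures: {speedup:.0f}x")
```

Each scenario is independent of the others, which is why this kind of workload spreads naturally across many machines: the scenarios can be split into batches and the partial sums combined at the end.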
  • 15. Challenge: Learn about customer based on what they do, rather than what they say (i.e., data exhaust); virtually unlimited data. Solution: Always-on cluster continually processes new financial data and stores results in S3. Collaborative filtering used to provide recommendations and ad-hoc queries performed using Hive.
  • 16. For illustrative purposes only.
  • 17. S&P Capital IQ; Microsoft SQL Server; Amazon S3: Companies You May Be Interested In; Amazon S3: Clicks, Key Developments, Company Profiles; Amazon Elastic MapReduce: Compute User Selectivity, Compute Key Developments, Join & Score
  • 18. Challenge: Volatile weather is deadly to crops like grapes and tomatoes. Solution: Built a predictive model based on freely available data: 60 years of crop data, 14 TB of soil data, and one million government Doppler radar points. 50 Hadoop clusters process new data as it comes into S3 each day, continuously updating the model. 150B Soil Observations; 3M Daily Weather Measurements; 850K Precision Rainfall Grids Tracked
  • 19. Simulations Each Month. Per Simulation: 10K Unique Scenarios Generated; 5 Trillion Data Points; 5-6K Node Hadoop Cluster
  • 20. AWS Import/Export; Corporate data center; Amazon Elastic MapReduce; Amazon Simple Storage Service (S3); BI Users; Clickstream data from 500+ websites and VoD platform
  • 21. More than 25 million streaming members; 50 billion events per day; 30 million plays every day; 2 billion hours of video in 3 months; 4 million ratings per day; 3 million searches; Device location, time, day, week, etc.; Social data
  • 22. 10 TB of streaming data per day
  • 23. Data consumed in multiple ways: S3; EMR; Prod Cluster (EMR); Recommendation Engine; Ad-hoc Analysis; Personalization
  • 24. Amazon DynamoDB
  • 25. “Who buys video games?”
  • 26. Per day: 3.5 billion records; 13 TB of click stream logs; 71 million unique cookies
  • 27. Results: 500% return on ad spend; 17,000% reduction in procurement time
  • 28. “Who is using our service?”
  • 29. Identified early mobile usage; Invested heavily in mobile development; Finding signal in the noise of logs; 9,432,061 unique mobile devices used the Yelp mobile app.
  • 30. Every day is crucial and costly
  • 31. Challenge: To run a virtual screen with a higher accuracy algorithm & 21 million compounds
  • 32. Metric: Count. Compute Hours of Work: 109,927 hours; Compute Days of Work: 4,580 days; Compute Years of Work: 12.55 years; Ligand Count: ~21 million ligands. Using Cycle Computing and Amazon Web Services
  • 33. 3 Hours for $4828.85/hr
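The figures on slides 32 and 33 can be cross-checked with a few lines of arithmetic. The sketch below assumes that "3 Hours for $4828.85/hr" means three wall-clock hours billed at that hourly rate; the slide does not spell this out, so the total cost, the cost per compute hour, and the implied parallelism are derived values under that assumption, not figures from the deck.

```python
# Figures taken from slides 32 and 33.
compute_hours = 109_927          # total compute hours of work
wall_clock_hours = 3             # assumed: elapsed time of the run
hourly_rate_usd = 4828.85        # assumed: cost of the whole cluster per hour

# Unit conversions, matching the table on slide 32.
compute_days = compute_hours / 24
compute_years = compute_days / 365

# Derived values (assumptions, see lead-in).
total_cost_usd = wall_clock_hours * hourly_rate_usd
implied_parallelism = compute_hours / wall_clock_hours
cost_per_compute_hour = total_cost_usd / compute_hours

print(f"compute days:  {compute_days:,.0f}")
print(f"compute years: {compute_years:.2f}")
print(f"total cost:    ${total_cost_usd:,.2f}")
print(f"implied parallel workers: {implied_parallelism:,.0f}")
print(f"cost per compute hour:    ${cost_per_compute_hour:.3f}")
```

Under this reading, roughly 12.5 years of sequential computation are compressed into an afternoon by running tens of thousands of workers at once.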
  • 34. Relational Database Service: Fully managed database (MySQL, Oracle, MSSQL). DynamoDB: NoSQL, schemaless, provisioned-throughput database. S3: Object datastore, up to 5 TB per object, 99.999999999% durability
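To give a sense of what the 99.999999999% durability figure on slide 34 means in practice, the sketch below converts it into an expected loss rate. It assumes the figure is an annual per-object durability and that losses are independent; the object count is a hypothetical example. Exact fractions are used because the loss probability is too small to compute reliably by floating-point subtraction.

```python
from fractions import Fraction

# Durability from slide 34, read here as an annual per-object figure (assumption).
durability = Fraction("99.999999999") / 100
annual_loss_probability = 1 - durability          # exactly 1 / 100,000,000,000

# Hypothetical bucket size, chosen only for illustration.
stored_objects = 10_000_000

expected_losses_per_year = stored_objects * annual_loss_probability
years_per_expected_loss = 1 / expected_losses_per_year

print(f"annual loss probability per object: {float(annual_loss_probability):.0e}")
print(f"expected objects lost per year:     {float(expected_losses_per_year):.0e}")
print(f"years per expected single loss:     {int(years_per_expected_loss):,}")
```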
  • 35. Amazon Elastic MapReduce: Map-Reduce engine; Hadoop-as-a-service; Massively parallel; Cost effective AWS wrapper
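For readers unfamiliar with the map-reduce model mentioned on slide 35, here is a single-process sketch of its three phases (map, shuffle, reduce) applied to a word count. It does not use Hadoop or any Amazon Elastic MapReduce API; the function names and sample records are hypothetical and only illustrate the programming model.

```python
from collections import defaultdict


def map_phase(record):
    """Emit a (key, 1) pair for every word in one input record."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle, and reduce in sequence on a list of records."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


# Hypothetical clickstream-like records.
counts = run_job(["click play click", "play search"])
print(counts)
```

In a real cluster the map and reduce calls run on different nodes and the shuffle moves data over the network; because each call depends only on its own input, the work parallelizes without changes to the user's functions.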
  • 36. Amazon Redshift: data warehouse service; petabyte-scale; fast and fully managed
  • 37. RDBMS; Redshift; OLTP; ERP; Reporting and BI
  • 38. Size, Query type, Hive, Redshift. 3 billion rows, simple range query: Hive 1680 seconds (28 min), Redshift 360 seconds (6 min). 1 million rows, 2 complex joins: Hive 182 seconds, Redshift 8 seconds. $13.60/hour on Redshift versus $57/hour on Hive
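The comparison on slide 38 can be turned into speedup and per-query cost figures. The sketch below assumes each query's cost is the hourly rate prorated by its run time, which ignores cluster start-up, idle time, and billing granularity; the per-query costs are therefore rough derived estimates, not numbers from the slide.

```python
# Hourly rates from slide 38, in USD.
HIVE_RATE = 57.00
REDSHIFT_RATE = 13.60

# (description, Hive seconds, Redshift seconds), as given on the slide.
benchmarks = [
    ("3 billion rows, simple range query", 1680, 360),
    ("1 million rows, 2 complex joins", 182, 8),
]


def prorated_cost(hourly_rate, seconds):
    """Cost of one query if billing were prorated per second (assumption)."""
    return hourly_rate * seconds / 3600


results = []
for name, hive_s, redshift_s in benchmarks:
    results.append({
        "query": name,
        "speedup": hive_s / redshift_s,
        "hive_cost": prorated_cost(HIVE_RATE, hive_s),
        "redshift_cost": prorated_cost(REDSHIFT_RATE, redshift_s),
    })

for r in results:
    print(f"{r['query']}: {r['speedup']:.2f}x faster, "
          f"${r['hive_cost']:.2f} on Hive vs ${r['redshift_cost']:.2f} on Redshift")
```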
  • 39. Generation; Collect; Store; Collaboration & sharing; Analysis and Computation
  • 40. Thank you! 14st, Kowloonbay International Trade & Exhibition Centre (KITEC), Hong Kong. One day free training; Walk through of services