AWS Partner Presentation - PetaByte Scale Computing on Amazon EC2 with BigData - Vishal Malik, Cognizant


Published in: Technology

  1. PetaByte Scale Computing on Amazon EC2 with Big Data
     Vishal Malik, Head, Cloud CoE, Cognizant
  2. PetaByte scale computing on Amazon EC2 with BigData
     ©2011, Cognizant
  3. Some Background
     • Only 5% of the data on the web today is structured.
     • The challenge is cutting through the noise:
         • Processing huge volumes of data and filtering at scale.
         • Turning raw unstructured data into insights using ML etc.
         • Adding relevance to data by personalizing content.
         • Applying ML to learn what a user likes and serving more of it (driving online-ad revenues, for example).
     • By 2013, we'll have 650 exabytes on the internet!
     • Real-time sentiment analysis will become more prevalent.
     • The need for a single organization to process 40+ TB of (compressed) data per day will become more prevalent.
  4. The Challenge?
     • How to scale without a significant increase in infrastructure cost (processing & storage).
     • How to do analytics in near real-time, as opposed to guesswork! Process 5TB+ (uncompressed data) in less than 1 minute! (Today it can take ~ an hour.)
     • People are asking new questions every day, hence the need to have all data in the DWH and the ability to answer these questions in near real-time. (Agile BI!)
     • A feedback loop presenting data in line with user preferences.
  5. RDBMS: The "good and the bad"
     Good for:
     • Relational data transactions
     Bad for:
     • Queues, polling, caching
     • Social graph tree traversal, NxN relationships
     • Workloads that don't require ACID for everything!
     • Scaling to petabytes of data
     Traditional SQL-based systems have these problems:
     • Replication delay & cache eviction produce inconsistent results for the end user.
     • They are slow (single threaded).
     • Locks create contention for popular data, so they can't scale to petabytes.
  6. Solution?
     A cost-effective way to process data and store data.
     Processing side: one of the most popular approaches is:
     • Use Hadoop (an open-source MR framework) for back-end distributed processing.
     • Build a lightweight SQL-like layer on top of Hadoop.
     • Access time is in micro-seconds, moving towards near-real time!
     Storage side: popular and very stable options are:
     • Use S3, SimpleDB (from Amazon's AWS) etc.
     • A private cloud using NoSQL DBs, namely HBase, CouchDB, MongoDB, Riak, Redis etc.
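The MR (MapReduce) model that slide 6 attributes to Hadoop can be illustrated with a minimal single-process sketch. This is not Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the sample log lines are hypothetical, and a real cluster would spread each phase across many machines.

```python
from collections import defaultdict

def map_phase(records, mapper):
    """Apply the mapper to every input record, yielding (key, value) pairs."""
    for record in records:
        yield from mapper(record)

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reducer):
    """Collapse each key's list of values into a single result."""
    return {key: reducer(key, values) for key, values in groups.items()}

def word_mapper(line):
    """Emit (word, 1) for every word in a line of text."""
    for word in line.lower().split():
        yield word, 1

def sum_reducer(key, values):
    """Add up the counts emitted for one word."""
    return sum(values)

def run_job(records):
    """Run map, shuffle and reduce in sequence on an in-memory input."""
    return reduce_phase(shuffle(map_phase(records, word_mapper)), sum_reducer)

# Hypothetical input standing in for raw, unstructured log data.
log_lines = ["user likes cats", "user likes dogs", "cats like fish"]
counts = run_job(log_lines)
print(counts)
```

Because each mapper call looks at one record and each reducer call looks at one key, both phases can be split across machines, which is what lets this style of processing scale out.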
  7. Current State of Storage Tiering Solutions
     Existing solutions:
     • Only a h/w based option.
     • Cost of implementation is very high, assuming RAID 6, RAID 10+0 and other costly options.
     • H/W based solutions are not user friendly, and the policies set are hidden from the user.
     • Purely based on disk storage hardware from a support perspective.
     • Storing 7TB in 6 hours or less is not possible using current disks with an 80MBytes/sec write rate.
     Innovation required?
     • Easy-to-manage storage.
     • Easy-to-implement storage systems.
     • Have a say in the policies set to move data wherever required at the disk level.
     • Visibility of what is happening to my data and how/where it is stored.
     • Better control over where/how/what is stored on my storage systems.
     Customers are asking for storage solutions that are cheaper and easier to implement and understand.
     "We're seeing a big opportunity to position iMoveS where data is growing significantly along with cost"
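The 7TB-in-6-hours point on slide 7 can be checked with a little arithmetic. The sketch below assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes), a single disk sustaining the full 80 MB/s, and ideal striping with no overhead when several disks write in parallel; none of these assumptions is stated on the slide.

```python
import math

DATA_BYTES = 7 * 10**12        # 7 TB, assuming decimal terabytes
WRITE_RATE = 80 * 10**6        # 80 MB/s sustained write rate per disk
WINDOW_SECONDS = 6 * 3600      # the 6-hour target from the slide

def seconds_to_write(data_bytes, rate_per_disk, disks=1):
    """Time to write the data if `disks` drives share the load evenly."""
    return data_bytes / (rate_per_disk * disks)

def disks_needed(data_bytes, rate_per_disk, window_seconds):
    """Smallest number of evenly loaded disks that fits the time window."""
    return math.ceil(data_bytes / (rate_per_disk * window_seconds))

single_disk_seconds = seconds_to_write(DATA_BYTES, WRITE_RATE)
single_disk_hours = single_disk_seconds / 3600
min_disks = disks_needed(DATA_BYTES, WRITE_RATE, WINDOW_SECONDS)

print(f"one disk: {single_disk_hours:.1f} hours")
print(f"disks needed for a 6-hour window: {min_disks}")
```

Under these assumptions one disk needs roughly 24 hours, about four times the target, so the write has to be spread over at least five disks or nodes to meet the window.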
  8. NoSQL DataStores…
     Make storing/retrieving of information easier to manage & use:
     • Based on access pattern, migrate data to the right storage engine based on pre-set policies. E.g. < 10% writes go to HBase. > 50GB stores go to HBase. < 50GB go to MongoDB.
     • Understand access patterns to refine and retune the policies under which data migration happens.
     Make the S/W based storage engine do all the intelligent work:
     • Performance gains
     • High availability
     • Administration & monitoring
     • Low cost/gigabyte
     All done using the iMoveS Engine:
     • S/W based checkpoint system
     • Policy based data object migration
     • Policy based data access/storage
     • Extreme scalability
     • Great for machine generated data for analysis.
     Anyone should be able to store data and not worry about replication, RAID, mirroring.
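The example policy on slide 8 can be written as a small rule function. This is only a sketch and not the iMoveS engine itself: the `DataStoreStats` type, the `choose_engine` function and the sample stores are hypothetical. The slide does not say which rule wins when they conflict, what happens at exactly 50 GB, or how to treat a store with no recorded accesses, so the ordering and tie-breaks below are assumptions.

```python
from dataclasses import dataclass

GB = 10**9                       # assuming decimal gigabytes
WRITE_FRACTION_LIMIT = 0.10      # "< 10% writes go to HBase"
SIZE_THRESHOLD_BYTES = 50 * GB   # "> 50GB stores go to HBase"

@dataclass(frozen=True)
class DataStoreStats:
    """Observed size and access counts for one data store (hypothetical type)."""
    name: str
    size_bytes: int
    reads: int
    writes: int

def choose_engine(stats: DataStoreStats) -> str:
    """Pick a target engine from the slide's example thresholds."""
    total = stats.reads + stats.writes
    # Assumption: the write-fraction rule is checked first, and only
    # when there is access data to compute it from.
    if total > 0 and stats.writes / total < WRITE_FRACTION_LIMIT:
        return "HBase"
    if stats.size_bytes > SIZE_THRESHOLD_BYTES:
        return "HBase"
    # Assumption: a store of exactly 50 GB falls through to MongoDB.
    return "MongoDB"

stores = [
    DataStoreStats("clickstream", 80 * GB, reads=100, writes=900),
    DataStoreStats("archive", 5 * GB, reads=990, writes=10),
    DataStoreStats("profiles", 5 * GB, reads=500, writes=500),
]
placement = {s.name: choose_engine(s) for s in stores}
print(placement)
```

Keeping the thresholds as named constants matches the slide's second point: once real access patterns are observed, the policy can be retuned without touching the routing logic.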
  9. Thank You