
AWS Partner Presentation - PetaByte Scale Computing on Amazon EC2 with BigData - Vishal Malik, Cognizant

  1. PetaByte Scale Computing on Amazon EC2 with Big Data. Vishal Malik, Head of Cloud CoE, Cognizant
  2. PetaByte Scale Computing on Amazon EC2 with Big Data. ©2011, Cognizant
  3. Some Background
     • Only 5% of data on the web today is structured.
     • The challenge is cutting through the noise: the ability to process huge amounts of data and filter at scale, turn raw unstructured data into insights using ML, and add relevance by personalizing content. Applying ML to learn what a user likes and serve more of it drives online-ad revenue, for example.
     • By 2013, there will be 650 exabytes of data on the internet.
     • Real-time sentiment analysis will become more prevalent.
     • The need for a single organization to process 40+ TB of (compressed) data per day will become more prevalent.
  4. The Challenge?
     • How to scale without a significant increase in infrastructure cost (processing and storage).
     • How to do analytics in near real time instead of guesswork: process 5+ TB of uncompressed data in under a minute, where today it can take around an hour (see the throughput sketch after the slides).
     • People ask new questions every day, so all data needs to be in the DWH with the ability to answer those questions in near real time (Agile BI).
     • A feedback loop that presents data in line with user preferences.
  5. RDBMS: The Good and the Bad
     • Good for: relational data and transactions.
     • Bad for: queues, polling, and caching; social graph tree traversal and NxN relationships (see the self-join sketch after the slides). You don't need ACID for everything, and it is not good for scaling to petabytes of data.
     • Traditional SQL-based systems suffer from: replication delay and cache eviction that produce inconsistent results for the end user; slow (single-threaded) execution; and locks that create contention on popular data, so they can't scale to petabytes.
  6. Solution?
     • A cost-effective way to both process data and store data.
     • Processing side, one of the most popular approaches: use Hadoop (the open-source MapReduce framework) for back-end distributed processing and build a lightweight SQL-like layer on top of it. Access times are in microseconds, moving towards near real time (see the MapReduce sketch after the slides).
     • Storage side, popular and very stable options: S3 and SimpleDB from Amazon AWS, or a private cloud using NoSQL databases such as HBase, CouchDB, MongoDB, Riak, and Redis.
  7. Current State of Storage Tiering Solutions
     • Existing solutions: only hardware-based options for storage, assuming RAID 6, RAID 10+0, and other costly choices; the cost of implementation is very high; hardware-based solutions are not user friendly, and the policies that get set are transparent to the user; they are purely disk-storage-hardware based from a support perspective; and storing 7 TB in 6 hours or less is not possible using current disks with an 80 MB/s write rate (see the disk arithmetic after the slides).
     • Customers are asking for: storage that is easy to manage; storage systems that are cheaper and easier to implement and understand; a say in the policies that move data wherever required at the disk level; visibility into what is happening to their data and how/where it is stored; and better control over where/how/what is stored on their storage systems.
     • Innovation required: "We're seeing a big opportunity to position iMoveS where data is growing significantly along with cost."
  8. NoSQL DataStores
     • Make storing and retrieving information easier to manage and use.
     • Based on access patterns, migrate data to the right storage engine using pre-set policies; e.g. workloads with < 10% writes go to HBase, stores > 50 GB go to HBase, and stores < 50 GB go to MongoDB (see the policy-routing sketch after the slides).
     • Understand access patterns in order to refine and retune the policies under which data migration happens.
     • Make the software-based storage engine do all the intelligent work: performance gains, high availability, administration and monitoring, low cost per gigabyte.
     • Anyone should be able to store data without worrying about replication, RAID, or mirroring.
     • All of this is done using the iMoveS Engine: a software-based checkpoint system with policy-based data object migration, policy-based data access/storage, extreme scalability, and a good fit for analysis of machine-generated data.
  9. Thank You. ©2011, Cognizant
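
A rough back-of-the-envelope check on the slide-4 target of processing 5+ TB of uncompressed data in under a minute. The per-worker scan rate below is an illustrative assumption, not a figure from the deck; it only shows the order of parallelism such a target implies.

```python
# Back-of-the-envelope sizing for the slide-4 goal: 5 TB processed in
# under 60 seconds. The 100 MB/s per-worker scan rate is an assumption.
data_tb = 5
target_seconds = 60
per_node_mb_per_s = 100  # assumed sustained scan rate per worker

total_mb = data_tb * 1024 * 1024              # 5 TB expressed in MB
required_mb_per_s = total_mb / target_seconds
nodes_needed = required_mb_per_s / per_node_mb_per_s

print(f"Aggregate throughput needed: {required_mb_per_s:,.0f} MB/s")
print(f"Workers needed at {per_node_mb_per_s} MB/s each: {nodes_needed:,.0f}")
# ~87,000 MB/s aggregate, i.e. on the order of 900 workers at 100 MB/s.
```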
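To illustrate the slide-5 point that social-graph traversal strains a relational store, here is a minimal sketch using Python's built-in sqlite3: each extra hop in a friends-of-friends query needs another self-join on the edge table, which gets expensive as depth and fan-out grow. The schema and data are invented for the example.

```python
import sqlite3

# Minimal sketch: every hop of a graph traversal is another self-join.
# Schema and data are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE friends (user_id TEXT, friend_id TEXT)")
conn.executemany(
    "INSERT INTO friends VALUES (?, ?)",
    [("alice", "bob"), ("bob", "carol"), ("carol", "dave")],
)

# Two-hop friends of 'alice' need one self-join; k hops need k-1 joins.
two_hops = conn.execute(
    """
    SELECT DISTINCT f2.friend_id
    FROM friends f1
    JOIN friends f2 ON f2.user_id = f1.friend_id
    WHERE f1.user_id = ?
    """,
    ("alice",),
).fetchall()
print(two_hops)  # [('carol',)]
```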
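Slide 6 names Hadoop MapReduce as the back-end processing layer. As a minimal sketch of that side, here is a word-count-style mapper and reducer written for Hadoop Streaming, which runs arbitrary scripts over stdin/stdout; the paths, file name, and the choice of counting words are illustrative, not taken from the deck.

```python
#!/usr/bin/env python
# Minimal Hadoop Streaming job: word-count-style mapper and reducer.
# Example invocation (paths and jar location are illustrative):
#   hadoop jar hadoop-streaming.jar \
#     -input /logs/raw -output /logs/counts \
#     -mapper "python wordcount.py map" -reducer "python wordcount.py reduce"
import sys

def mapper():
    # Emit "word<TAB>1" for every token read from stdin.
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

def reducer():
    # Hadoop sorts mapper output by key, so equal words arrive contiguously.
    current, count = None, 0
    for line in sys.stdin:
        word, n = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print(f"{current}\t{count}")
            current, count = word, 0
        count += int(n)
    if current is not None:
        print(f"{current}\t{count}")

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```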
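Slide 7 claims that storing 7 TB in 6 hours is not possible on current disks with an 80 MB/s write rate. A quick calculation shows why a single such disk falls short and roughly how many would have to write in parallel.

```python
# Disk arithmetic for the slide-7 claim: 7 TB written within 6 hours.
data_tb = 7
window_hours = 6
disk_mb_per_s = 80

total_mb = data_tb * 1024 * 1024
required_mb_per_s = total_mb / (window_hours * 3600)
disks_needed = required_mb_per_s / disk_mb_per_s

print(f"Required sustained write rate: {required_mb_per_s:.0f} MB/s")
print(f"80 MB/s disks writing in parallel: {disks_needed:.1f}")
# ~340 MB/s required, i.e. roughly 4-5 disks striped, before any
# replication or RAID overhead -- hence the case for tiered storage.
```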
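Slide 8 describes policy-based routing of data to storage engines (read-heavy workloads and large stores to HBase, small stores to MongoDB). The sketch below shows only the routing decision: the thresholds mirror the slide, while the function and field names are invented for illustration and nothing here talks to a real iMoveS engine or datastore.

```python
# Sketch of the policy-based routing described on slide 8. Thresholds
# mirror the slide (< 10% writes, 50 GB size cut-off); names are invented.
from dataclasses import dataclass

@dataclass
class DatasetProfile:
    name: str
    size_gb: float
    write_ratio: float  # fraction of operations that are writes

def choose_engine(profile: DatasetProfile) -> str:
    """Pick a target storage engine from pre-set policies."""
    if profile.write_ratio < 0.10:   # read-heavy: < 10% writes -> HBase
        return "hbase"
    if profile.size_gb > 50:         # large stores (> 50 GB) -> HBase
        return "hbase"
    return "mongodb"                 # small stores (< 50 GB) -> MongoDB

if __name__ == "__main__":
    for p in [
        DatasetProfile("clickstream", size_gb=800, write_ratio=0.30),
        DatasetProfile("user-prefs", size_gb=12, write_ratio=0.40),
        DatasetProfile("catalog", size_gb=20, write_ratio=0.05),
    ]:
        print(p.name, "->", choose_engine(p))
    # clickstream -> hbase, user-prefs -> mongodb, catalog -> hbase
```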
