Data Infrastructure on Hadoop - Hadoop Summit 2011 BLR


Looks into the various projects on Hadoop related to Data Infrastructure at Yahoo! Bangalore.

Published in: Technology, Education

  • This session delves into the data infrastructure that drives productivity gains by letting users focus on utilizing the data on Hadoop rather than on how to get it. We'll also look into the next generation of data infrastructure at Yahoo!
  • BIG Data is here. Data is increasing faster than compute. Turning data into insights isn't trivial. Bring personal meaning to the Web. Turning data into a competitive advantage for Yahoo! Make it relevant. All of that requires an unprecedented ability to process big data at scale, all at speeds never before thought possible and with a powerful layer of security. I'm talking about 120 terabytes of data and 100 billion events every single day on top of 70 petabytes of data. Hadoop is at the epicenter of big data and cloud computing.
  • Hadoop clusters are partitioned based on the class of users. Managing data is a challenge. We built a Data Management Solution optimized for space, time and bandwidth. Productivity gains: users focus on utilizing the data and not how to get it. Scale: 100+ TB of data and ~2000+ feeds processed daily. Support standard data loading sources: various data sources, sinks, varying interfaces. Support replication across Hadoop clusters. Data loading SLA. Reliable data loading. Guarantee data quality. Data retention management. Anonymization. Compliance archival.
  • Y! Research and Sciences; Advertising & Audience; Targeting; Ad Optimizations; Reporting; Content Agility.
  • Key Drivers: Once users can focus on utilizing the data and not how to get it, how do we optimize for space and time? How do we learn about Hadoop itself so we can build better systems? What are the KPIs that we can generate that could help us drive the next wave?
  • What: This attempts to provide insights into how Hadoop infrastructure is actually enabling the business. How: Gather all data needed to analyze Hadoop performance into one central Hive warehouse. Next: Provide insights into the usage of the Hadoop clusters to allow us to tune better and prioritize features. Generate KPIs for the clusters, such as availability, utilization, capacity planning, etc. Generate canned reports per Hadoop cluster. Provide key information to manage the data better, such as archival decisions or management of replicas. Find the Query of Death: jobs that cause the clusters to go down.
  • Utilization: The central warehouse of Hadoop logs could drive metering and reporting efforts. Infrastructure utilization; metering; reporting. How do we drive utilization of resources?
  • Key Drivers: Each byte of information is replicated, on average, by a factor of 6 (on average 2 clusters with replication factor 3). Some feeds may be copied to as many as 8 clusters. Typically, 80% of the accesses to a file occur in its first 20 weeks. Data-local maps average around 20%. Millions of metadata files take up valuable Namenode namespace/memory. Opportunities: Reduce the replication factor after data becomes cold. Use erasure coding for cold data. Aging data can be archived into Hadoop Archives as the access frequency drops. Reduce the footprint of metadata stored in files on HDFS (Howl?).
  • We are hiring!!!
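
The storage-efficiency arithmetic in the notes (2 clusters with replication factor 3 giving a factor of 6, and feeds copied to as many as 8 clusters) can be worked through with a small sketch. Everything beyond those figures is an assumption made for illustration: the 100 TB logical size, the reduced cold-data replication factor of 2, and the (10 data, 4 parity) erasure-coding layout are hypothetical and are not stated in the talk.

```python
TB = 10**12  # decimal terabyte, in bytes


def raw_footprint(logical_bytes, clusters, replication_factor):
    """Raw bytes stored when every cluster keeps `replication_factor` copies."""
    return logical_bytes * clusters * replication_factor


def erasure_coding_overhead(data_blocks, parity_blocks):
    """Storage multiplier of an erasure-coded layout (1.0 means no overhead)."""
    return (data_blocks + parity_blocks) / data_blocks


logical = 100 * TB  # hypothetical logical data size

# From the notes: on average 2 clusters, replication factor 3 => factor of 6.
baseline = raw_footprint(logical, clusters=2, replication_factor=3)

# From the notes: some feeds are copied to as many as 8 clusters.
widest_feed = raw_footprint(logical, clusters=8, replication_factor=3)

# Assumption: cold data drops to replication factor 2 on each cluster.
cold_replicated = raw_footprint(logical, clusters=2, replication_factor=2)

# Assumption: cold data is erasure coded with 10 data + 4 parity blocks.
cold_erasure_coded = logical * 2 * erasure_coding_overhead(10, 4)

for label, size in [
    ("baseline (2 clusters x 3)", baseline),
    ("widest feed (8 clusters x 3)", widest_feed),
    ("cold, replication 2", cold_replicated),
    ("cold, erasure coded (10+4)", cold_erasure_coded),
]:
    print(f"{label}: {size / TB:.0f} TB raw for {logical / TB:.0f} TB logical")
```

The point of the comparison is only that the per-cluster multiplier compounds with the number of clusters holding a feed, so lowering it for cold data saves space on every copy.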

    1. Data Infrastructure on Hadoop. Venkatesh S, Architect, Hadoop Data
    2. Outline: Big Picture; Data Infrastructure (Now, Next Wave); Questions
    3. BIG Data is here.
    4. Managing BIG Data
    5. Who is using this Data? Ads Optimization; Content Optimization; Search Index; Machine Learning (e.g. Spam filters); RSS Feeds; Site thumbnails
    6. Next Wave!
    7. Hadoop Analytics Warehouse
    8. Utilization
    9. Storage Efficiency
    10. Questions?