Apache Hadoop India Summit 2011 talk "Data Infrastructure on Hadoop" by Venkatesh S

  • This session delves into the data infrastructure that drives productivity gains by letting users focus on using the data on Hadoop rather than on how to get it. We'll also look at the next generation of data infrastructure at Yahoo!
  • BIG Data is here. Data is increasing faster than compute. Turning data into insights isn't trivial.
      Bring personal meaning to the Web
      Turn data into a competitive advantage for Yahoo!
      Make it relevant
      All of that requires an unprecedented ability to process big data at scale, all at speeds never before thought possible and with a powerful layer of security. I'm talking about 120 terabytes of data and 100 billion events every single day on top of 70 petabytes of data.
      Hadoop is at the epicenter of big data and cloud computing.
  • Hadoop clusters are partitioned based on the class of users. Managing data is a challenge. We built a Data Management Solution optimized for space, time and bandwidth.
      Productivity gains: users focus on utilizing the data and not how to get it
      Scale: 100+ TB of data and ~200+ feeds processed daily
      Support standard data loading sources: various data sources, sinks, varying interfaces
      Support replication across Hadoop clusters
      Data loading SLA
      Reliable data loading
      Guarantee data quality
      Data retention management
      Anonymization
      Compliance archival
  • Y! Research and Sciences
      Advertising & Audience
      Targeting
      Ad Optimizations
      Reporting
      Content Agility
  • Key Drivers:
      Once users can focus on utilizing the data and not on how to get it, how do we optimize for space and time?
      How do we learn about Hadoop itself so we can build better systems?
      What KPIs can we generate that could help us drive the next wave?
  • What:
      This attempts to provide insights into how Hadoop infrastructure is actually enabling the business.
    How:
      Gather all data needed to analyze Hadoop performance into one central Hive warehouse.
    Next:
      Provide insights into Hadoop cluster usage to allow us to tune better and prioritize features.
      Generate KPIs for the clusters, such as availability, utilization, capacity planning, etc.
      Generate canned reports per Hadoop cluster.
      Provide key information to manage the data better, such as archival decisions or management of replicas.
      Find the Query of Death: jobs that cause the clusters to go down.
  • Utilization: The central warehouse of Hadoop logs could drive metering and reporting efforts.
      Infrastructure utilization
      Metering
      Reporting
      How do we drive utilization of resources?
  • Key Drivers:
      Each byte of information is replicated by a factor of 6 on average (an average of 2 clusters with replication factor 3). Some feeds may be copied to as many as 8 clusters.
      Typically, 80% of the accesses to a file occur in its first 20 weeks.
      Data-local maps average around 20%.
      Millions of metadata files take up valuable Namenode namespace/memory.
    Opportunities:
      Reduce the replication factor after data becomes cold.
      Use erasure coding for cold data.
      Archive aging data into Hadoop Archives as the access frequency drops.
      Reduce the footprint of metadata stored in files on HDFS (Howl?).
  • We are hiring!!!
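
The replication arithmetic in the Storage Efficiency note (an average of 2 clusters at replication factor 3, giving a factor of 6) can be sketched as below. This is only an illustration and not something shown in the talk: the function names, the 1 TB feed size, the cold-data replication factor of 2, and the 10+4 erasure-coding layout are all hypothetical values chosen for the example.

```python
def raw_footprint_tb(logical_tb, clusters, replication_factor):
    """Raw disk used by a feed: one full copy set per cluster holding it."""
    return logical_tb * clusters * replication_factor


def erasure_overhead(data_blocks, parity_blocks):
    """Storage multiplier of an erasure code with the given block layout."""
    return (data_blocks + parity_blocks) / data_blocks


# Figures from the notes: 2 clusters on average, replication factor 3.
# The 1 TB logical size is a hypothetical feed size.
hot = raw_footprint_tb(1.0, clusters=2, replication_factor=3)

# Hypothetical policy: lower the replication factor to 2 once data is cold.
cold = raw_footprint_tb(1.0, clusters=2, replication_factor=2)

# Hypothetical layout: 10 data blocks + 4 parity blocks per stripe.
ec_multiplier = erasure_overhead(10, 4)

print(f"hot: {hot} TB raw, cold: {cold} TB raw, EC multiplier: {ec_multiplier}")
```

Under these assumed numbers, each logical terabyte occupies 6 TB of raw disk while hot and 4 TB after the replication factor is lowered, which is the kind of saving the "Opportunities" list points at.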
  • Transcript

    • 1. Data Infrastructure on Hadoop
      Venkatesh S
      Architect, Hadoop Data
    • 2. Outline
      Big Picture
      Data Infrastructure
      Next Wave
    • 3. BIG Data is here.
    • 4. Managing BIG Data
    • 5. Who is using this Data?
      Content Optimization
      Search Index
      Machine Learning (e.g. Spam filters)
      Ads Optimization
      Site thumbnails
      RSS Feeds
    • 6. Next Wave!
    • 7. Hadoop Analytics Warehouse
    • 8. Utilization
    • 9. Storage Efficiency
    • 10. Questions?