This session delves into the data infrastructure driving productivity gains by letting users focus on utilizing the data on Hadoop and not on how to get it. We'll also look into the next generation of data infrastructure at Yahoo!
Big Data is here
- Data is increasing faster than compute.
- Turning data into insights isn’t trivial.
- Bring personal meaning to the Web.
- Turn data into a competitive advantage for Yahoo!
- Make it relevant.
All of that requires an unprecedented ability to process big data at scale, at speeds never before thought possible, and with a powerful layer of security. I’m talking about 120 terabytes of data and 100 billion events every single day, on top of 70 petabytes of data.
Hadoop is at the epicenter of big data and cloud computing.
Hadoop clusters are partitioned based on the class of users.
Managing data is a challenge.
We built a Data Management Solution optimized for space, time and bandwidth:
- Productivity gains: users focus on utilizing the data and not how to get it
- Scale: 100+ TB of data and ~200+ feeds processed daily
- Support for standard data loading sources: various data sources, sinks, and varying interfaces
- Support for replication across Hadoop clusters
- Data loading SLA
- Reliable data loading
- Guaranteed data quality
- Data retention management
- Anonymization
- Compliance archival
- Y! Research and Sciences
- Advertising & Audience
- Targeting
- Ad Optimizations
- Reporting
- Content Agility
Key drivers:
- Once users focus on utilizing the data and not how to get it, how do we optimize for space and time?
- How do we learn about Hadoop itself so we can build better systems?
- What KPIs can we generate to help us drive the next wave?
What: This attempts to provide insights into how the Hadoop infrastructure is actually enabling the business.
How: Gather all data needed to analyze Hadoop performance into one central Hive warehouse.
Next:
- Provide insights into Hadoop cluster usage so we can tune better and prioritize features.
- Generate KPIs for the clusters, such as availability, utilization, capacity planning, etc.
- Generate canned reports per Hadoop cluster.
- Provide key information to manage the data better, such as archival decisions or management of replicas.
- Find the Query of Death: jobs that cause the clusters to go down.
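As a rough illustration of the KPI step above, the following sketch computes availability and utilization per cluster from periodic samples. The record layout (`cluster`, `up`, `slots_used`, `slots_total`) is a hypothetical schema chosen for this example, not the actual format of the warehouse described in the talk, where such an aggregation would more likely be expressed as a Hive query.

```python
from collections import defaultdict


def cluster_kpis(samples):
    """Aggregate per-cluster KPIs from polling samples.

    Each sample is a dict with hypothetical keys:
      cluster     - cluster name
      up          - True if the cluster answered the poll
      slots_used  - task slots busy at poll time
      slots_total - task slots configured at poll time
    """
    totals = defaultdict(lambda: {"n": 0, "up": 0, "used": 0, "total": 0})
    for s in samples:
        t = totals[s["cluster"]]
        t["n"] += 1
        if s["up"]:
            # Only count capacity for intervals in which the cluster was up.
            t["up"] += 1
            t["used"] += s["slots_used"]
            t["total"] += s["slots_total"]
    return {
        name: {
            # Fraction of polls in which the cluster was reachable.
            "availability": t["up"] / t["n"],
            # Fraction of available slot capacity that was in use.
            "utilization": t["used"] / t["total"] if t["total"] else 0.0,
        }
        for name, t in totals.items()
    }


samples = [
    {"cluster": "research", "up": True, "slots_used": 50, "slots_total": 200},
    {"cluster": "research", "up": True, "slots_used": 100, "slots_total": 200},
    {"cluster": "research", "up": False, "slots_used": 0, "slots_total": 200},
    {"cluster": "research", "up": True, "slots_used": 150, "slots_total": 200},
]
kpis = cluster_kpis(samples)
print(kpis["research"])
```

Keeping the samples in one central store is what makes this kind of per-cluster roll-up, and the canned reports built on it, a simple aggregation.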
Utilization
The central warehouse of Hadoop logs could drive metering and reporting efforts.
- Infrastructure utilization
- Metering
- Reporting
How do we drive utilization of resources?
Key drivers:
- Each byte of information is replicated by a factor of 6 on average (an average of 2 clusters, each with replication factor 3). Some feeds may be copied to as many as 8 clusters.
- Typically, 80% of the accesses to a file happen in its first 20 weeks.
- Data-local maps average around 20%.
- Millions of metadata files take up valuable Namenode namespace/memory.
Opportunities:
- Reduce the replication factor after data becomes cold.
- Use erasure coding for cold data.
- Archive aging data into Hadoop Archives as the access frequency drops.
- Reduce the footprint of metadata stored in files on HDFS (Howl?).
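The replication arithmetic above can be made concrete with a small sketch. The 2 clusters and replication factor 3 come from the slide; the 100 TB feed size, the reduced replication factor of 2, and the (10 data, 4 parity) erasure-coding layout are illustrative assumptions only, not figures from the talk.

```python
def replicated_footprint(logical_tb, clusters, replication):
    """Raw TB consumed when every block is stored `replication` times
    in each of `clusters` clusters."""
    return logical_tb * clusters * replication


def erasure_coded_footprint(logical_tb, clusters, data_blocks, parity_blocks):
    """Raw TB consumed when each stripe of `data_blocks` blocks carries
    `parity_blocks` parity blocks instead of full replicas."""
    return logical_tb * clusters * (data_blocks + parity_blocks) / data_blocks


feed_tb = 100  # assumed logical size of one feed, for illustration

# Figures from the slide: 2 clusters x replication factor 3 = 6 copies.
hot = replicated_footprint(feed_tb, clusters=2, replication=3)

# Assumed cold-data policy: drop to 2 replicas per cluster.
cold_replicated = replicated_footprint(feed_tb, clusters=2, replication=2)

# Assumed cold-data policy: a (10, 4) erasure code, i.e. 1.4x overhead.
cold_erasure = erasure_coded_footprint(feed_tb, clusters=2,
                                       data_blocks=10, parity_blocks=4)

print(hot, cold_replicated, cold_erasure)
```

Under these assumptions the same feed drops from 600 TB of raw storage to 400 TB with reduced replication, or to 280 TB with erasure coding, which is why cold data is the natural target for both opportunities.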
We are hiring!!!
Transcript of "Apache Hadoop India Summit 2011 talk "Data Infrastructure on Hadoop" by Venkatesh S"
Data Infrastructure on Hadoop
Venkatesh S
Architect, Hadoop Data