Apache Hadoop India Summit 2011 talk "Data Infrastructure on Hadoop" by Venkatesh S

Usage Rights

© All Rights Reserved

  • This session delves into the data infrastructure that drives productivity gains by letting users focus on utilizing the data on Hadoop rather than on how to get it. We'll also look into the next generation of data infrastructure at Yahoo!
  • BIG Data is here.
    Data is increasing faster than compute.
    Turning data into insights isn’t trivial.
    Bring personal meaning to the Web.
    Turning data into a competitive advantage for Yahoo!
    Make it relevant.
    All of that requires an unprecedented ability to process big data at scale, all at speeds never before thought possible and with a powerful layer of security. I’m talking about 120 terabytes of data and 100 billion events every single day on top of 70 petabytes of data.
    Hadoop is at the epicenter of big data and cloud computing.
  • Hadoop clusters are partitioned based on the class of users.
    Managing data is a challenge.
    We built a Data Management Solution optimized for space, time and bandwidth.
    Productivity gains: users focus on utilizing the data and not how to get it.
    Scale: 100+ TB of data and ~200+ feeds processed daily.
    Support standard data loading sources: various data sources, sinks, varying interfaces.
    Support replication across Hadoop clusters.
    Data loading SLA
    Reliable data loading
    Guarantee data quality
    Data retention management
    Anonymization
    Compliance Archival
  • Y! Research and Sciences
    Advertising & Audience
    Targeting
    Ad Optimizations
    Reporting
    Content Agility
  • Key Drivers:
    Once users focus on utilizing the data and not how to get it, how do we optimize for space and time?
    How do we learn about Hadoop itself so we can build better systems?
    What are the KPIs that we can generate that could help us drive the next wave?
  • What: This attempts to provide insights into how the Hadoop infrastructure is actually enabling the business.
    How: Gather all data needed to analyze Hadoop performance into one central Hive warehouse.
    Next:
    Provide insights into Hadoop cluster usage to allow us to tune better and prioritize features.
    Generate KPIs for the clusters, such as Availability, Utilization, Capacity Planning, etc.
    Generate canned reports per Hadoop cluster.
    Provide key information to manage the data better, such as archival decisions or management of replicas.
    Find the Query of Death: jobs that cause the clusters to go down.
  • Utilization
    The central warehouse of Hadoop logs could drive metering and reporting efforts.
    Infrastructure utilization
    Metering
    Reporting
    How do we drive utilization of resources?
  • Key Drivers:
    Each byte of information is replicated by a factor of 6 on average (on average 2 clusters, each with replication factor 3). Some feeds may be copied to as many as 8 clusters.
    Typically, 80% of the accesses to a file happen in the first 20 weeks.
    Data-local maps average around 20%.
    Millions of metadata files take up valuable Namenode namespace/memory.
    Opportunities:
    Reduce the replication factor after data becomes cold.
    Use erasure coding for cold data.
    Aging data can be archived into Hadoop Archives as the access frequency drops.
    Reduce the footprint of metadata stored in files on HDFS. Howl?
  • We are hiring!!!
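
The storage-efficiency notes above quote an average footprint of 6 physical copies per logical byte (2 clusters times replication factor 3) and propose lowering replication or using erasure coding for cold data. The arithmetic can be sketched as follows. This is an illustrative calculation only: the 100 TB volume, the reduced replication factor of 2, and the (10 data + 4 parity) erasure-coding layout are assumptions chosen for the example, not figures from the talk.

```python
def replicated_footprint(logical_tb, clusters, replication_factor):
    """Physical storage when every cluster keeps full replicas."""
    return logical_tb * clusters * replication_factor


def erasure_coded_footprint(logical_tb, clusters, data_blocks, parity_blocks):
    """Physical storage when each cluster stores data plus parity blocks."""
    overhead = (data_blocks + parity_blocks) / data_blocks
    return logical_tb * clusters * overhead


logical_tb = 100  # hypothetical volume of cold data, in TB

# Baseline from the notes: 2 clusters, replication factor 3.
hot = replicated_footprint(logical_tb, clusters=2, replication_factor=3)

# Assumed cold-data policy: drop replication factor to 2.
cold_rf2 = replicated_footprint(logical_tb, clusters=2, replication_factor=2)

# Assumed erasure-coding layout: 10 data blocks + 4 parity blocks.
cold_ec = erasure_coded_footprint(logical_tb, clusters=2,
                                  data_blocks=10, parity_blocks=4)

print(f"replication 3:   {hot} TB")
print(f"replication 2:   {cold_rf2} TB")
print(f"erasure (10, 4): {cold_ec} TB")
```

Under these assumptions, the same 100 TB of cold data shrinks from 600 TB of physical storage to 400 TB with reduced replication, or 280 TB with erasure coding.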

Presentation Transcript

  • Data Infrastructure on Hadoop
    Venkatesh S
    Architect, Hadoop Data
  • Outline
    Big Picture
    Data Infrastructure
    Now
    Next Wave
    Questions
  • BIG Data is here.
  • Managing BIG Data
  • Who is using this Data?
    Content Optimization
    Search Index
    Machine Learning (e.g. Spam filters)
    Ads Optimization
    Site thumbnails
    RSS Feeds
  • Next Wave!
  • Hadoop Analytics Warehouse
  • Utilization
  • Storage Efficiency
  • Questions?
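
The Hadoop Analytics Warehouse and Utilization slides describe deriving per-cluster KPIs and metering from centrally collected Hadoop logs. A minimal sketch of that kind of aggregation is shown below. Everything in it is hypothetical: the `JobRecord` fields, the cluster and user names, and the capacity figures are invented for illustration, and the talk describes a central Hive warehouse rather than in-memory Python records.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class JobRecord:
    """One job entry as it might be extracted from Hadoop logs (assumed schema)."""
    cluster: str
    user: str
    slot_hours: float


def utilization(records, capacity_slot_hours):
    """Fraction of each cluster's available slot-hours that jobs consumed."""
    used = defaultdict(float)
    for r in records:
        used[r.cluster] += r.slot_hours
    return {c: used[c] / cap for c, cap in capacity_slot_hours.items()}


def metering(records):
    """Slot-hours consumed per (cluster, user), e.g. for usage reports."""
    per_user = defaultdict(float)
    for r in records:
        per_user[(r.cluster, r.user)] += r.slot_hours
    return dict(per_user)


# Hypothetical sample data.
records = [
    JobRecord("research", "alice", 30.0),
    JobRecord("research", "bob", 10.0),
    JobRecord("production", "alice", 50.0),
]
capacity = {"research": 100.0, "production": 200.0}

print(utilization(records, capacity))
print(metering(records))
```

The same grouping could be expressed as a query over a warehouse table; the Python form is used here only to make the computation explicit.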