SlideShare a Scribd company logo
1 of 17
Download to read offline
Scaling Storage and Computation
with Apache Hadoop
Konstantin V. Shvachko
1 October 2010
What is Hadoop
• Hadoop is an ecosystem of tools for processing
“Big Data”
• Hadoop is an open source project
• Yahoo! a primary developer of Hadoop since 2006
Big Data
• Big Data management, storage and analytics
• Large datasets (PBs) do not fit one computer
– Internal (memory) sort
– External (disk) sort
– Distributed sort
• Computations that need a lot of compute power
Big Data: Examples
• Search Webmap as of 2008 @ Y!
– Raw disk used 5 PB
– 1500 nodes
• Large Hadron Collider: PBs of events
– 1 PB of data per sec, most filtered out
• 2 quadrillionth (1015) digit of πis 0
– Tsz-Wo (Nicholas) Sze
– 23 days vs 2 years before
– No data, pure CPU workload
Hadoop is the Solution
• Architecture principles:
– Linear scaling
– Reliability and Availability
– Using unreliable commodity hardware
– Computation is shipped to data
No expensive data transfers
– High performance
Hadoop Components
Data SerializationAvro
Distributed coordinationZookeeper
HDFS Distributed file system
MapReduce Distributed computation
HBase Column store
Pig Dataflow language
Hive Data warehouse
Chukwa Data Collection
MapReduce
• MapReduce – distributed computation framework
– Invented by Google researchers
• Two stages of a MR job
– Map: {<Key,Value>} -> {<K’,V’>}
– Reduce: {<K’,V’>} -> {<K’’,V’’>}
• Map – a truly distributed stage
Reduce – an aggregation, may not be distributed
• Shuffle – sort and merge,
transition from Map to Reduce
invisible to user
MapReduce Workflow
Hadoop Distributed File System
HDFS
• The name space is a hierarchy of files and directories
• Files are divided into blocks (typically 128 MB)
• Namespace (metadata) is decoupled from data
– Lots of fast namespace operations, not slowed down by
– Data streaming
• Single NameNode keeps the entire name space in RAM
• DataNodes store block replicas as files on local drives
• Blocks are replicated on 3 DataNodes for redundancy
HDFS Read
• To read a block, the client requests the list of replica
locations from the NameNode
• Then pulling data from a replica on one of the DataNodes
HDFS Write
• To write a block of a file, the client requests a list of
candidate DataNodes from the NameNode, and
organizes a write pipeline.
Replica Location Awareness
• MapReduce schedules a task assigned to process block
B to a DataNode possessing a replica of B
• Data are large, programs are small
• Local access to data
ZooKeeper
• A distributed coordination service for distributed apps
– Event coordination and notification
– Leader election
– Distributed locking
• ZooKeeper can help build HA systems
HBase
• Distributed table store on top of HDFS
– An implementation of Googl’s BigTable
• Big table is Big Data, cannot be stored on a single node
• Tables: big, sparse, loosely structured.
– Consist of rows, having unique row keys
– Has arbitrary number of columns,
– grouped into small number of column families
– Dynamic column creation
• Table is partitioned into regions
– Horizontally across rows; vertically across column families
• HBase provides structured yet flexible access to data
Pig
• A language on top of and to simplify MapReduce
• Pig speaks Pig Latin
• SQL-like language
• Pig programs are translated into a
series of MapReduce jobs
Hive
• Serves the same purpose as Pig
• Closely follows SQL standards
• Keeps metadata about Hive tables in MySQL DRBM
Hadoop User Groups

More Related Content

What's hot

Hadoop in the Cloud - The what, why and how from the experts
Hadoop in the Cloud - The what, why and how from the expertsHadoop in the Cloud - The what, why and how from the experts
Hadoop in the Cloud - The what, why and how from the expertsDataWorks Summit/Hadoop Summit
 
Apache Ambari - HDP Cluster Upgrades Operational Deep Dive and Troubleshooting
Apache Ambari - HDP Cluster Upgrades Operational Deep Dive and TroubleshootingApache Ambari - HDP Cluster Upgrades Operational Deep Dive and Troubleshooting
Apache Ambari - HDP Cluster Upgrades Operational Deep Dive and TroubleshootingDataWorks Summit/Hadoop Summit
 
Hadoop 3.0 features
Hadoop 3.0 featuresHadoop 3.0 features
Hadoop 3.0 featuresanand murari
 
Handling Kernel Upgrades at Scale - The Dirty Cow Story
Handling Kernel Upgrades at Scale - The Dirty Cow StoryHandling Kernel Upgrades at Scale - The Dirty Cow Story
Handling Kernel Upgrades at Scale - The Dirty Cow StoryDataWorks Summit
 
Lessons learned from running Spark on Docker
Lessons learned from running Spark on DockerLessons learned from running Spark on Docker
Lessons learned from running Spark on DockerDataWorks Summit
 
Sharing metadata across the data lake and streams
Sharing metadata across the data lake and streamsSharing metadata across the data lake and streams
Sharing metadata across the data lake and streamsDataWorks Summit
 
The Unbearable Lightness of Ephemeral Processing
The Unbearable Lightness of Ephemeral ProcessingThe Unbearable Lightness of Ephemeral Processing
The Unbearable Lightness of Ephemeral ProcessingDataWorks Summit
 
Migrating your clusters and workloads from Hadoop 2 to Hadoop 3
Migrating your clusters and workloads from Hadoop 2 to Hadoop 3Migrating your clusters and workloads from Hadoop 2 to Hadoop 3
Migrating your clusters and workloads from Hadoop 2 to Hadoop 3DataWorks Summit
 
Hortonworks Technical Workshop: What's New in HDP 2.3
Hortonworks Technical Workshop: What's New in HDP 2.3Hortonworks Technical Workshop: What's New in HDP 2.3
Hortonworks Technical Workshop: What's New in HDP 2.3Hortonworks
 
How to Achieve a Self-Service and Secure Multitenant Data Lake in a Large Com...
How to Achieve a Self-Service and Secure Multitenant Data Lake in a Large Com...How to Achieve a Self-Service and Secure Multitenant Data Lake in a Large Com...
How to Achieve a Self-Service and Secure Multitenant Data Lake in a Large Com...DataWorks Summit
 
Upgrade Without the Headache: Best Practices for Upgrading Hadoop in Production
Upgrade Without the Headache: Best Practices for Upgrading Hadoop in ProductionUpgrade Without the Headache: Best Practices for Upgrading Hadoop in Production
Upgrade Without the Headache: Best Practices for Upgrading Hadoop in ProductionCloudera, Inc.
 
Running Zeppelin in Enterprise
Running Zeppelin in EnterpriseRunning Zeppelin in Enterprise
Running Zeppelin in EnterpriseDataWorks Summit
 
Predictive Analytics and Machine Learning …with SAS and Apache Hadoop
Predictive Analytics and Machine Learning…with SAS and Apache HadoopPredictive Analytics and Machine Learning…with SAS and Apache Hadoop
Predictive Analytics and Machine Learning …with SAS and Apache HadoopHortonworks
 

What's hot (20)

Hadoop in the Cloud - The what, why and how from the experts
Hadoop in the Cloud - The what, why and how from the expertsHadoop in the Cloud - The what, why and how from the experts
Hadoop in the Cloud - The what, why and how from the experts
 
Apache Ambari - HDP Cluster Upgrades Operational Deep Dive and Troubleshooting
Apache Ambari - HDP Cluster Upgrades Operational Deep Dive and TroubleshootingApache Ambari - HDP Cluster Upgrades Operational Deep Dive and Troubleshooting
Apache Ambari - HDP Cluster Upgrades Operational Deep Dive and Troubleshooting
 
LLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in HiveLLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in Hive
 
Row/Column- Level Security in SQL for Apache Spark
Row/Column- Level Security in SQL for Apache SparkRow/Column- Level Security in SQL for Apache Spark
Row/Column- Level Security in SQL for Apache Spark
 
Hadoop 3.0 features
Hadoop 3.0 featuresHadoop 3.0 features
Hadoop 3.0 features
 
Apache Hive 2.0: SQL, Speed, Scale
Apache Hive 2.0: SQL, Speed, ScaleApache Hive 2.0: SQL, Speed, Scale
Apache Hive 2.0: SQL, Speed, Scale
 
Handling Kernel Upgrades at Scale - The Dirty Cow Story
Handling Kernel Upgrades at Scale - The Dirty Cow StoryHandling Kernel Upgrades at Scale - The Dirty Cow Story
Handling Kernel Upgrades at Scale - The Dirty Cow Story
 
Deep Learning using Spark and DL4J for fun and profit
Deep Learning using Spark and DL4J for fun and profitDeep Learning using Spark and DL4J for fun and profit
Deep Learning using Spark and DL4J for fun and profit
 
Apache Hadoop YARN: Past, Present and Future
Apache Hadoop YARN: Past, Present and FutureApache Hadoop YARN: Past, Present and Future
Apache Hadoop YARN: Past, Present and Future
 
Lessons learned from running Spark on Docker
Lessons learned from running Spark on DockerLessons learned from running Spark on Docker
Lessons learned from running Spark on Docker
 
Sharing metadata across the data lake and streams
Sharing metadata across the data lake and streamsSharing metadata across the data lake and streams
Sharing metadata across the data lake and streams
 
The Unbearable Lightness of Ephemeral Processing
The Unbearable Lightness of Ephemeral ProcessingThe Unbearable Lightness of Ephemeral Processing
The Unbearable Lightness of Ephemeral Processing
 
Migrating your clusters and workloads from Hadoop 2 to Hadoop 3
Migrating your clusters and workloads from Hadoop 2 to Hadoop 3Migrating your clusters and workloads from Hadoop 2 to Hadoop 3
Migrating your clusters and workloads from Hadoop 2 to Hadoop 3
 
Hortonworks Technical Workshop: What's New in HDP 2.3
Hortonworks Technical Workshop: What's New in HDP 2.3Hortonworks Technical Workshop: What's New in HDP 2.3
Hortonworks Technical Workshop: What's New in HDP 2.3
 
Streamline Hadoop DevOps with Apache Ambari
Streamline Hadoop DevOps with Apache AmbariStreamline Hadoop DevOps with Apache Ambari
Streamline Hadoop DevOps with Apache Ambari
 
How to Achieve a Self-Service and Secure Multitenant Data Lake in a Large Com...
How to Achieve a Self-Service and Secure Multitenant Data Lake in a Large Com...How to Achieve a Self-Service and Secure Multitenant Data Lake in a Large Com...
How to Achieve a Self-Service and Secure Multitenant Data Lake in a Large Com...
 
Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop
 
Upgrade Without the Headache: Best Practices for Upgrading Hadoop in Production
Upgrade Without the Headache: Best Practices for Upgrading Hadoop in ProductionUpgrade Without the Headache: Best Practices for Upgrading Hadoop in Production
Upgrade Without the Headache: Best Practices for Upgrading Hadoop in Production
 
Running Zeppelin in Enterprise
Running Zeppelin in EnterpriseRunning Zeppelin in Enterprise
Running Zeppelin in Enterprise
 
Predictive Analytics and Machine Learning …with SAS and Apache Hadoop
Predictive Analytics and Machine Learning…with SAS and Apache HadoopPredictive Analytics and Machine Learning…with SAS and Apache Hadoop
Predictive Analytics and Machine Learning …with SAS and Apache Hadoop
 

Viewers also liked

Big Data Analytics in Healthcare and Life Sciences
Big Data Analytics in Healthcare and Life SciencesBig Data Analytics in Healthcare and Life Sciences
Big Data Analytics in Healthcare and Life SciencesAli Sanousi, MD, MBA, PhD
 
The data and analytics of the new life sciences marketplace handout
The data and analytics of the new life sciences marketplace handoutThe data and analytics of the new life sciences marketplace handout
The data and analytics of the new life sciences marketplace handoutFrank Wartenberg
 
IBM_analytics across the healthcare ecosystem
IBM_analytics across the healthcare ecosystem IBM_analytics across the healthcare ecosystem
IBM_analytics across the healthcare ecosystem Heather Fraser
 
Geber Consulting - Big Data in Healthcare
Geber Consulting - Big Data in Healthcare Geber Consulting - Big Data in Healthcare
Geber Consulting - Big Data in Healthcare Martin Hiesboeck
 
IBM Insight 2014 session (4152 )- Accelerating Insights in Healthcare with “B...
IBM Insight 2014 session (4152 )- Accelerating Insights in Healthcare with “B...IBM Insight 2014 session (4152 )- Accelerating Insights in Healthcare with “B...
IBM Insight 2014 session (4152 )- Accelerating Insights in Healthcare with “B...Alex Zeltov
 
Hospital Readmission Reduction: How Important are Follow Up Calls? (Hint: Very)
Hospital Readmission Reduction: How Important are Follow Up Calls? (Hint: Very)Hospital Readmission Reduction: How Important are Follow Up Calls? (Hint: Very)
Hospital Readmission Reduction: How Important are Follow Up Calls? (Hint: Very)SironaHealth
 
Healthcare Analytics Maturity Model
Healthcare Analytics Maturity ModelHealthcare Analytics Maturity Model
Healthcare Analytics Maturity ModelFrank Wang
 
Healthcare Predictive Analytics with the OR-(Denny Lee and Ayad Shammout, Dat...
Healthcare Predictive Analytics with the OR-(Denny Lee and Ayad Shammout, Dat...Healthcare Predictive Analytics with the OR-(Denny Lee and Ayad Shammout, Dat...
Healthcare Predictive Analytics with the OR-(Denny Lee and Ayad Shammout, Dat...Spark Summit
 
Predicting Hospital Readmission Using Cascading
Predicting Hospital Readmission Using CascadingPredicting Hospital Readmission Using Cascading
Predicting Hospital Readmission Using CascadingCascading
 
Big Data, CEP and IoT : Redefining Healthcare Information Systems and Analytics
Big Data, CEP and IoT : Redefining Healthcare Information Systems and AnalyticsBig Data, CEP and IoT : Redefining Healthcare Information Systems and Analytics
Big Data, CEP and IoT : Redefining Healthcare Information Systems and AnalyticsTauseef Naquishbandi
 
Medicine of the Future—The Transformation from Reactive to Proactive (P4) Med...
Medicine of the Future—The Transformation from Reactive to Proactive (P4) Med...Medicine of the Future—The Transformation from Reactive to Proactive (P4) Med...
Medicine of the Future—The Transformation from Reactive to Proactive (P4) Med...Ryan Squire
 
Agile deployment predictive analytics on hadoop
Agile deployment predictive analytics on hadoopAgile deployment predictive analytics on hadoop
Agile deployment predictive analytics on hadoopDataWorks Summit
 
Realizing the Promise of Big Data with Hadoop - Cloudera Summer Webinar Serie...
Realizing the Promise of Big Data with Hadoop - Cloudera Summer Webinar Serie...Realizing the Promise of Big Data with Hadoop - Cloudera Summer Webinar Serie...
Realizing the Promise of Big Data with Hadoop - Cloudera Summer Webinar Serie...Cloudera, Inc.
 
Introduction to Big Data Analytics: Batch, Real-Time, and the Best of Both Wo...
Introduction to Big Data Analytics: Batch, Real-Time, and the Best of Both Wo...Introduction to Big Data Analytics: Batch, Real-Time, and the Best of Both Wo...
Introduction to Big Data Analytics: Batch, Real-Time, and the Best of Both Wo...WSO2
 
Interoperability between heterogeneous healthcare information systems by John...
Interoperability between heterogeneous healthcare information systems by John...Interoperability between heterogeneous healthcare information systems by John...
Interoperability between heterogeneous healthcare information systems by John...Luigi Ceccaroni
 
Predictive analytics for personalized healthcare
Predictive analytics for personalized healthcarePredictive analytics for personalized healthcare
Predictive analytics for personalized healthcareJohn Cai
 
Evaluating Big Data Predictive Analytics Platforms
Evaluating Big Data Predictive Analytics PlatformsEvaluating Big Data Predictive Analytics Platforms
Evaluating Big Data Predictive Analytics PlatformsTeradata Aster
 
Big Data Applications in Healthcare
Big Data Applications in Healthcare Big Data Applications in Healthcare
Big Data Applications in Healthcare PYA, P.C.
 
Overcoming Big Data Bottlenecks in Healthcare - a Predictive Analytics Case S...
Overcoming Big Data Bottlenecks in Healthcare - a Predictive Analytics Case S...Overcoming Big Data Bottlenecks in Healthcare - a Predictive Analytics Case S...
Overcoming Big Data Bottlenecks in Healthcare - a Predictive Analytics Case S...Damo Consulting Inc.
 

Viewers also liked (20)

Big Data Analytics in Healthcare and Life Sciences
Big Data Analytics in Healthcare and Life SciencesBig Data Analytics in Healthcare and Life Sciences
Big Data Analytics in Healthcare and Life Sciences
 
The data and analytics of the new life sciences marketplace handout
The data and analytics of the new life sciences marketplace handoutThe data and analytics of the new life sciences marketplace handout
The data and analytics of the new life sciences marketplace handout
 
IBM_analytics across the healthcare ecosystem
IBM_analytics across the healthcare ecosystem IBM_analytics across the healthcare ecosystem
IBM_analytics across the healthcare ecosystem
 
Geber Consulting - Big Data in Healthcare
Geber Consulting - Big Data in Healthcare Geber Consulting - Big Data in Healthcare
Geber Consulting - Big Data in Healthcare
 
IBM Insight 2014 session (4152 )- Accelerating Insights in Healthcare with “B...
IBM Insight 2014 session (4152 )- Accelerating Insights in Healthcare with “B...IBM Insight 2014 session (4152 )- Accelerating Insights in Healthcare with “B...
IBM Insight 2014 session (4152 )- Accelerating Insights in Healthcare with “B...
 
Hospital Readmission Reduction: How Important are Follow Up Calls? (Hint: Very)
Hospital Readmission Reduction: How Important are Follow Up Calls? (Hint: Very)Hospital Readmission Reduction: How Important are Follow Up Calls? (Hint: Very)
Hospital Readmission Reduction: How Important are Follow Up Calls? (Hint: Very)
 
Healthcare Analytics Maturity Model
Healthcare Analytics Maturity ModelHealthcare Analytics Maturity Model
Healthcare Analytics Maturity Model
 
Healthcare Predictive Analytics with the OR-(Denny Lee and Ayad Shammout, Dat...
Healthcare Predictive Analytics with the OR-(Denny Lee and Ayad Shammout, Dat...Healthcare Predictive Analytics with the OR-(Denny Lee and Ayad Shammout, Dat...
Healthcare Predictive Analytics with the OR-(Denny Lee and Ayad Shammout, Dat...
 
Predicting Hospital Readmission Using Cascading
Predicting Hospital Readmission Using CascadingPredicting Hospital Readmission Using Cascading
Predicting Hospital Readmission Using Cascading
 
Big Data, CEP and IoT : Redefining Healthcare Information Systems and Analytics
Big Data, CEP and IoT : Redefining Healthcare Information Systems and AnalyticsBig Data, CEP and IoT : Redefining Healthcare Information Systems and Analytics
Big Data, CEP and IoT : Redefining Healthcare Information Systems and Analytics
 
Medicine of the Future—The Transformation from Reactive to Proactive (P4) Med...
Medicine of the Future—The Transformation from Reactive to Proactive (P4) Med...Medicine of the Future—The Transformation from Reactive to Proactive (P4) Med...
Medicine of the Future—The Transformation from Reactive to Proactive (P4) Med...
 
Agile deployment predictive analytics on hadoop
Agile deployment predictive analytics on hadoopAgile deployment predictive analytics on hadoop
Agile deployment predictive analytics on hadoop
 
Realizing the Promise of Big Data with Hadoop - Cloudera Summer Webinar Serie...
Realizing the Promise of Big Data with Hadoop - Cloudera Summer Webinar Serie...Realizing the Promise of Big Data with Hadoop - Cloudera Summer Webinar Serie...
Realizing the Promise of Big Data with Hadoop - Cloudera Summer Webinar Serie...
 
Introduction to Big Data Analytics: Batch, Real-Time, and the Best of Both Wo...
Introduction to Big Data Analytics: Batch, Real-Time, and the Best of Both Wo...Introduction to Big Data Analytics: Batch, Real-Time, and the Best of Both Wo...
Introduction to Big Data Analytics: Batch, Real-Time, and the Best of Both Wo...
 
Predictive Analytics Usage and Implications in Healthcare
Predictive Analytics Usage and Implications in HealthcarePredictive Analytics Usage and Implications in Healthcare
Predictive Analytics Usage and Implications in Healthcare
 
Interoperability between heterogeneous healthcare information systems by John...
Interoperability between heterogeneous healthcare information systems by John...Interoperability between heterogeneous healthcare information systems by John...
Interoperability between heterogeneous healthcare information systems by John...
 
Predictive analytics for personalized healthcare
Predictive analytics for personalized healthcarePredictive analytics for personalized healthcare
Predictive analytics for personalized healthcare
 
Evaluating Big Data Predictive Analytics Platforms
Evaluating Big Data Predictive Analytics PlatformsEvaluating Big Data Predictive Analytics Platforms
Evaluating Big Data Predictive Analytics Platforms
 
Big Data Applications in Healthcare
Big Data Applications in Healthcare Big Data Applications in Healthcare
Big Data Applications in Healthcare
 
Overcoming Big Data Bottlenecks in Healthcare - a Predictive Analytics Case S...
Overcoming Big Data Bottlenecks in Healthcare - a Predictive Analytics Case S...Overcoming Big Data Bottlenecks in Healthcare - a Predictive Analytics Case S...
Overcoming Big Data Bottlenecks in Healthcare - a Predictive Analytics Case S...
 

Similar to Константин Швачко, Yahoo!, - Scaling Storage and Computation with Hadoop

Scaling Storage and Computation with Hadoop
Scaling Storage and Computation with HadoopScaling Storage and Computation with Hadoop
Scaling Storage and Computation with Hadoopyaevents
 
hadoop distributed file systems complete information
hadoop distributed file systems complete informationhadoop distributed file systems complete information
hadoop distributed file systems complete informationbhargavi804095
 
Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File Systemelliando dias
 
Big Data - A brief introduction
Big Data - A brief introductionBig Data - A brief introduction
Big Data - A brief introductionFrans van Noort
 
Big data and hadoop overvew
Big data and hadoop overvewBig data and hadoop overvew
Big data and hadoop overvewKunal Khanna
 
Distributed Computing with Apache Hadoop: Technology Overview
Distributed Computing with Apache Hadoop: Technology OverviewDistributed Computing with Apache Hadoop: Technology Overview
Distributed Computing with Apache Hadoop: Technology OverviewKonstantin V. Shvachko
 
Hadoop/MapReduce/HDFS
Hadoop/MapReduce/HDFSHadoop/MapReduce/HDFS
Hadoop/MapReduce/HDFSpraveen bhat
 
Bigdata workshop february 2015
Bigdata workshop  february 2015 Bigdata workshop  february 2015
Bigdata workshop february 2015 clairvoyantllc
 
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDYVenneladonthireddy1
 
Hadoop Distributed file system.pdf
Hadoop Distributed file system.pdfHadoop Distributed file system.pdf
Hadoop Distributed file system.pdfvishal choudhary
 
Big Data Unit 4 - Hadoop
Big Data Unit 4 - HadoopBig Data Unit 4 - Hadoop
Big Data Unit 4 - HadoopRojaT4
 

Similar to Константин Швачко, Yahoo!, - Scaling Storage and Computation with Hadoop (20)

Scaling Storage and Computation with Hadoop
Scaling Storage and Computation with HadoopScaling Storage and Computation with Hadoop
Scaling Storage and Computation with Hadoop
 
2. hadoop fundamentals
2. hadoop fundamentals2. hadoop fundamentals
2. hadoop fundamentals
 
Hadoop training in bangalore
Hadoop training in bangaloreHadoop training in bangalore
Hadoop training in bangalore
 
List of Engineering Colleges in Uttarakhand
List of Engineering Colleges in UttarakhandList of Engineering Colleges in Uttarakhand
List of Engineering Colleges in Uttarakhand
 
Hadoop.pptx
Hadoop.pptxHadoop.pptx
Hadoop.pptx
 
Hadoop.pptx
Hadoop.pptxHadoop.pptx
Hadoop.pptx
 
hadoop distributed file systems complete information
hadoop distributed file systems complete informationhadoop distributed file systems complete information
hadoop distributed file systems complete information
 
Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File System
 
Big Data - A brief introduction
Big Data - A brief introductionBig Data - A brief introduction
Big Data - A brief introduction
 
Big data and hadoop overvew
Big data and hadoop overvewBig data and hadoop overvew
Big data and hadoop overvew
 
Distributed Computing with Apache Hadoop: Technology Overview
Distributed Computing with Apache Hadoop: Technology OverviewDistributed Computing with Apache Hadoop: Technology Overview
Distributed Computing with Apache Hadoop: Technology Overview
 
Hadoop/MapReduce/HDFS
Hadoop/MapReduce/HDFSHadoop/MapReduce/HDFS
Hadoop/MapReduce/HDFS
 
Hadoop ppt1
Hadoop ppt1Hadoop ppt1
Hadoop ppt1
 
Bigdata workshop february 2015
Bigdata workshop  february 2015 Bigdata workshop  february 2015
Bigdata workshop february 2015
 
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY
 
Big data
Big dataBig data
Big data
 
Big data
Big dataBig data
Big data
 
Hadoop Distributed file system.pdf
Hadoop Distributed file system.pdfHadoop Distributed file system.pdf
Hadoop Distributed file system.pdf
 
Big data Hadoop
Big data  Hadoop   Big data  Hadoop
Big data Hadoop
 
Big Data Unit 4 - Hadoop
Big Data Unit 4 - HadoopBig Data Unit 4 - Hadoop
Big Data Unit 4 - Hadoop
 

More from Media Gorod

Iidf market watch_2013
Iidf market watch_2013Iidf market watch_2013
Iidf market watch_2013Media Gorod
 
E travel 2013 ufs-f
E travel 2013 ufs-fE travel 2013 ufs-f
E travel 2013 ufs-fMedia Gorod
 
Travel shop 2013
Travel shop 2013Travel shop 2013
Travel shop 2013Media Gorod
 
Kozyakov pay u_e-travel2013
Kozyakov pay u_e-travel2013Kozyakov pay u_e-travel2013
Kozyakov pay u_e-travel2013Media Gorod
 
13909772985295c7a772abc7.11863824
13909772985295c7a772abc7.1186382413909772985295c7a772abc7.11863824
13909772985295c7a772abc7.11863824Media Gorod
 
As e-travel 2013
As   e-travel 2013As   e-travel 2013
As e-travel 2013Media Gorod
 
Ishounkina internet research-projects
Ishounkina internet research-projectsIshounkina internet research-projects
Ishounkina internet research-projectsMedia Gorod
 
Orlova pay u group_290813_
Orlova pay u group_290813_Orlova pay u group_290813_
Orlova pay u group_290813_Media Gorod
 
Ep presentation (infographic 2013)
Ep presentation (infographic 2013)Ep presentation (infographic 2013)
Ep presentation (infographic 2013)Media Gorod
 
Iway slides e-travel_2013-11_ready
Iway slides e-travel_2013-11_readyIway slides e-travel_2013-11_ready
Iway slides e-travel_2013-11_readyMedia Gorod
 
Data insight e-travel2013
Data insight e-travel2013Data insight e-travel2013
Data insight e-travel2013Media Gorod
 
Электронное Правительство как Продукт
Электронное Правительство как ПродуктЭлектронное Правительство как Продукт
Электронное Правительство как ПродуктMedia Gorod
 
Lean мышление / Специфика Lean Startup
Lean мышление / Специфика Lean StartupLean мышление / Специфика Lean Startup
Lean мышление / Специфика Lean StartupMedia Gorod
 
Глобальный взгляд на мобильный мир (Nielsen)
 Глобальный взгляд на мобильный мир (Nielsen) Глобальный взгляд на мобильный мир (Nielsen)
Глобальный взгляд на мобильный мир (Nielsen)Media Gorod
 
Как россияне используют смартфоны (Nielsen)
 Как россияне используют смартфоны (Nielsen) Как россияне используют смартфоны (Nielsen)
Как россияне используют смартфоны (Nielsen)Media Gorod
 
Мобильный интернет в России (MailRuGroup)
Мобильный интернет в России (MailRuGroup) Мобильный интернет в России (MailRuGroup)
Мобильный интернет в России (MailRuGroup) Media Gorod
 

More from Media Gorod (20)

Itogi2013
Itogi2013Itogi2013
Itogi2013
 
Moneytree rus 1
Moneytree rus 1Moneytree rus 1
Moneytree rus 1
 
Iidf market watch_2013
Iidf market watch_2013Iidf market watch_2013
Iidf market watch_2013
 
E travel 2013 ufs-f
E travel 2013 ufs-fE travel 2013 ufs-f
E travel 2013 ufs-f
 
Travel shop 2013
Travel shop 2013Travel shop 2013
Travel shop 2013
 
Kozyakov pay u_e-travel2013
Kozyakov pay u_e-travel2013Kozyakov pay u_e-travel2013
Kozyakov pay u_e-travel2013
 
13909772985295c7a772abc7.11863824
13909772985295c7a772abc7.1186382413909772985295c7a772abc7.11863824
13909772985295c7a772abc7.11863824
 
As e-travel 2013
As   e-travel 2013As   e-travel 2013
As e-travel 2013
 
Ishounkina internet research-projects
Ishounkina internet research-projectsIshounkina internet research-projects
Ishounkina internet research-projects
 
E travel13
E travel13E travel13
E travel13
 
Orlova pay u group_290813_
Orlova pay u group_290813_Orlova pay u group_290813_
Orlova pay u group_290813_
 
Ep presentation (infographic 2013)
Ep presentation (infographic 2013)Ep presentation (infographic 2013)
Ep presentation (infographic 2013)
 
Iway slides e-travel_2013-11_ready
Iway slides e-travel_2013-11_readyIway slides e-travel_2013-11_ready
Iway slides e-travel_2013-11_ready
 
Data insight e-travel2013
Data insight e-travel2013Data insight e-travel2013
Data insight e-travel2013
 
Электронное Правительство как Продукт
Электронное Правительство как ПродуктЭлектронное Правительство как Продукт
Электронное Правительство как Продукт
 
Lean мышление / Специфика Lean Startup
Lean мышление / Специфика Lean StartupLean мышление / Специфика Lean Startup
Lean мышление / Специфика Lean Startup
 
Глобальный взгляд на мобильный мир (Nielsen)
 Глобальный взгляд на мобильный мир (Nielsen) Глобальный взгляд на мобильный мир (Nielsen)
Глобальный взгляд на мобильный мир (Nielsen)
 
Как россияне используют смартфоны (Nielsen)
 Как россияне используют смартфоны (Nielsen) Как россияне используют смартфоны (Nielsen)
Как россияне используют смартфоны (Nielsen)
 
Мобильный интернет в России (MailRuGroup)
Мобильный интернет в России (MailRuGroup) Мобильный интернет в России (MailRuGroup)
Мобильный интернет в России (MailRuGroup)
 
Meta Mass Media
Meta Mass MediaMeta Mass Media
Meta Mass Media
 

Recently uploaded

A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 

Recently uploaded (20)

A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 

Константин Швачко, Yahoo!, - Scaling Storage and Computation with Hadoop

  • 1. Scaling Storage and Computation with Apache Hadoop Konstantin V. Shvachko 1 October 2010
  • 2. What is Hadoop • Hadoop is an ecosystem of tools for processing “Big Data” • Hadoop is an open source project • Yahoo! a primary developer of Hadoop since 2006
  • 3. Big Data • Big Data management, storage and analytics • Large datasets (PBs) do not fit one computer – Internal (memory) sort – External (disk) sort – Distributed sort • Computations that need a lot of compute power
  • 4. Big Data: Examples • Search Webmap as of 2008 @ Y! – Raw disk used 5 PB – 1500 nodes • Large Hadron Collider: PBs of events – 1 PB of data per sec, most filtered out • 2 quadrillionth (1015) digit of πis 0 – Tsz-Wo (Nicholas) Sze – 23 days vs 2 years before – No data, pure CPU workload
  • 5. Hadoop is the Solution • Architecture principles: – Linear scaling – Reliability and Availability – Using unreliable commodity hardware – Computation is shipped to data No expensive data transfers – High performance
  • 6. Hadoop Components Data SerializationAvro Distributed coordinationZookeeper HDFS Distributed file system MapReduce Distributed computation HBase Column store Pig Dataflow language Hive Data warehouse Chukwa Data Collection
  • 7. MapReduce • MapReduce – distributed computation framework – Invented by Google researchers • Two stages of a MR job – Map: {<Key,Value>} -> {<K’,V’>} – Reduce: {<K’,V’>} -> {<K’’,V’’>} • Map – a truly distributed stage Reduce – an aggregation, may not be distributed • Shuffle – sort and merge, transition from Map to Reduce invisible to user
  • 9. Hadoop Distributed File System HDFS • The name space is a hierarchy of files and directories • Files are divided into blocks (typically 128 MB) • Namespace (metadata) is decoupled from data – Lots of fast namespace operations, not slowed down by – Data streaming • Single NameNode keeps the entire name space in RAM • DataNodes store block replicas as files on local drives • Blocks are replicated on 3 DataNodes for redundancy
  • 10. HDFS Read • To read a block, the client requests the list of replica locations from the NameNode • Then pulling data from a replica on one of the DataNodes
  • 11. HDFS Write • To write a block of a file, the client requests a list of candidate DataNodes from the NameNode, and organizes a write pipeline.
  • 12. Replica Location Awareness • MapReduce schedules a task assigned to process block B to a DataNode possessing a replica of B • Data are large, programs are small • Local access to data
  • 13. ZooKeeper • A distributed coordination service for distributed apps – Event coordination and notification – Leader election – Distributed locking • ZooKeeper can help build HA systems
  • 14. HBase • Distributed table store on top of HDFS – An implementation of Googl’s BigTable • Big table is Big Data, cannot be stored on a single node • Tables: big, sparse, loosely structured. – Consist of rows, having unique row keys – Has arbitrary number of columns, – grouped into small number of column families – Dynamic column creation • Table is partitioned into regions – Horizontally across rows; vertically across column families • HBase provides structured yet flexible access to data
  • 15. Pig • A language on top of and to simplify MapReduce • Pig speaks Pig Latin • SQL-like language • Pig programs are translated into a series of MapReduce jobs
  • 16. Hive • Serves the same purpose as Pig • Closely follows SQL standards • Keeps metadata about Hive tables in MySQL DRBM