The Family of Hadoop
            Nham Xuan Nam
     nhamxuannam [at] gmail.com
     http://namnham.blogspot.com




    Barcamp Saigon, December 13 2009
Content
   History
   Sub-projects
   HDFS
   Map Reduce
   HBase
   Hive
History
   Hadoop was created by Doug Cutting, the creator of
    Lucene.
   Lucene: open source index & search library.
   Nutch: Lucene-based web crawler.
   Jun 2003, there was a successful 100
    million page Nutch demo system.
   Nutch problem: its architecture could not
    scale to billions of pages.
History
   Oct 2003, Google published the paper
    “The Google File System”.
   In 2004, Nutch team wrote an open source implementation
    of GFS, called Nutch Distributed File System (NDFS).
   Dec 2004, Google published the paper “MapReduce:
    Simplified Data Processing on Large Clusters”.
   In 2005, Nutch team implemented MapReduce in Nutch.
   Mid 2005, all the major Nutch algorithms had been ported
    to run using MapReduce and NDFS.
History
   Feb 2006, Nutch's NDFS and the MapReduce
    implementation formed the Hadoop project.
   Doug Cutting joined Yahoo!.
   Jan 2008, Hadoop became an Apache top-level
    project.
   Feb 2008, Yahoo!'s production search index
    was generated by a 10,000-core Hadoop
    cluster.
History




Source: http://wiki.apache.org/hadoop/PoweredBy
Sub-projects
Architecture
Data Model
   Files are stored as blocks (default size: 64 MB)
   Reliability through replication
    – Each block is replicated to several datanodes
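A minimal Python sketch of the block model above. The 64 MB block size comes from the slide; the replication factor of 3, the datanode names, and the round-robin placement are illustrative assumptions (HDFS's real placement policy is not modeled here).

```python
BLOCK_SIZE = 64 * 1024 * 1024  # default block size from the slide (64 MB)


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the sizes of the blocks a file of file_size bytes occupies."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])


def place_replicas(num_blocks, datanodes, replication=3):
    """Assign each block to `replication` distinct datanodes, round-robin.

    Hypothetical placement, for illustration only.
    """
    if replication > len(datanodes):
        raise ValueError("not enough datanodes for the requested replication")
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            datanodes[(block_id + i) % len(datanodes)] for i in range(replication)
        ]
    return placement


# A 200 MB file needs three full blocks plus one 8 MB block.
blocks = split_into_blocks(200 * 1024 * 1024)
placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
```

Losing any single datanode in this sketch still leaves two copies of every block, which is the reliability argument the slide makes.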
Namenode & Datanodes
   Namenode (master)
    – manages the filesystem namespace
    – maintains the filesystem tree and metadata for all the
      files and directories in the tree.

   Datanodes (slaves)
    – store data in the local file system
    – Periodically report back to the namenode with lists of all
      existing blocks

   Clients communicate with both namenode and datanodes.
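The division of labor above can be sketched as a toy Python model. The class, method names, file path, and block ids are hypothetical; the sketch only shows that the namenode holds the namespace and learns block locations from datanode reports, after which a client would read the blocks from the datanodes directly.

```python
class Namenode:
    """Toy master: keeps the namespace and learns block locations from reports."""

    def __init__(self):
        self.namespace = {}        # file path -> ordered list of block ids
        self.block_locations = {}  # block id -> set of datanode names

    def add_file(self, path, block_ids):
        self.namespace[path] = list(block_ids)

    def receive_block_report(self, datanode, block_ids):
        # Drop what this datanode reported previously, then record the new list.
        for holders in self.block_locations.values():
            holders.discard(datanode)
        for block_id in block_ids:
            self.block_locations.setdefault(block_id, set()).add(datanode)

    def locate(self, path):
        """What a client asks the namenode: which datanodes hold each block?"""
        return [
            (block_id, sorted(self.block_locations.get(block_id, ())))
            for block_id in self.namespace[path]
        ]


namenode = Namenode()
namenode.add_file("/barcamp/slides.pdf", ["blk_1", "blk_2"])
namenode.receive_block_report("dn1", ["blk_1", "blk_2"])
namenode.receive_block_report("dn2", ["blk_1"])
locations = namenode.locate("/barcamp/slides.pdf")
```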
Data Flow
Data Flow
Accessibility
   FileSystem Java API
    – org.apache.hadoop.fs.*

   Web Interface

   Commands for HDFS users
$ hadoop dfs -mkdir /barcamp

$ hadoop dfs -ls /barcamp

   Commands for HDFS admins
$ hadoop dfsadmin -report

$ hadoop dfsadmin -refreshNodes
Programming Model
Programming Model
   Data is a stream of keys and values
   Map

    – Input: <key1,value1> pairs from data source

    – Output: intermediate <key2,value2> pairs

   Reduce
    – Called once per key, in sorted key order
       Input: <key2, list of value2>

       Output: <key3,value3> pairs
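The model above can be sketched in a few lines of single-process Python. This is an illustration of the key/value contract, not Hadoop's API: `run_mapreduce`, the sample records, and the max-score job are all hypothetical, and an in-memory sort stands in for the distributed shuffle.

```python
from itertools import groupby
from operator import itemgetter


def run_mapreduce(records, mapper, reducer):
    """Run mapper over (key1, value1) records, group by key2, then reduce."""
    intermediate = []
    for key1, value1 in records:
        intermediate.extend(mapper(key1, value1))
    intermediate.sort(key=itemgetter(0))  # stand-in for the shuffle/sort phase
    output = []
    for key2, group in groupby(intermediate, key=itemgetter(0)):
        values = [value for _, value in group]
        output.extend(reducer(key2, values))  # once per key, in sorted key order
    return output


def parse_score(line_number, line):
    """Map: <line number, "name score"> -> <name, score>."""
    name, score = line.split()
    return [(name, int(score))]


def max_score(name, scores):
    """Reduce: <name, [scores]> -> <name, highest score>."""
    return [(name, max(scores))]


records = [(0, "b 5"), (1, "a 3"), (2, "a 7")]
result = run_mapreduce(records, parse_score, max_score)
```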
Data Flow
WordCount Example
 File01:                                  File02:
 Hello Barcamp Hello Everyone             Hello Hadoop Hello Everyone

<_, Hello Barcamp Hello Everyone>       <_, Hello Hadoop Hello Everyone>




         <Hello,     2>                          <Hello,     2>
         <Barcamp, 1>                            <Hadoop,    1>
         <Everyone,  1>                          <Everyone,  1>

                          <Barcamp,    [1]>
                          <Hadoop,     [1]>
                          <Hello,      [2,2]>
                          <Everyone,   [1,1]>




                            <Barcamp, 1>
                            <Hadoop,    1>
                            <Hello,     4>
                            <Everyone,  2>
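The figure can be traced step by step in Python. Its map outputs already show <Hello, 2>, which suggests counts are summed locally per file before the shuffle (a combiner step); that reading is an assumption, and the function names below are hypothetical.

```python
from collections import Counter, defaultdict

files = {
    "File01": "Hello Barcamp Hello Everyone",
    "File02": "Hello Hadoop Hello Everyone",
}


def map_with_combiner(text):
    """Count each word within one file: map plus local combining."""
    return dict(Counter(text.split()))


def shuffle(map_outputs):
    """Group the per-file counts by word: <word, [count, ...]>."""
    grouped = defaultdict(list)
    for output in map_outputs:
        for word, count in output.items():
            grouped[word].append(count)
    return dict(grouped)


def reduce_counts(grouped):
    """Sum the grouped counts: <word, total>."""
    return {word: sum(counts) for word, counts in grouped.items()}


map_outputs = [map_with_combiner(text) for text in files.values()]
grouped = shuffle(map_outputs)
totals = reduce_counts(grouped)
```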
MapReduce in Hadoop
   JobTracker (master)
    – handling all jobs.
    – scheduling tasks on the slaves.
    – monitoring & re-executing tasks.

   TaskTrackers (slaves)
    – execute the tasks.

   Task
    – run an individual map or reduce.
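A rough Python sketch of the scheduling and re-execution roles listed above. It is not Hadoop's scheduler: the round-robin assignment, the retry limit, and all names are assumptions chosen to show a failed task being handed out again.

```python
from collections import deque


def run_job(tasks, trackers, execute, max_attempts=3):
    """Hand tasks to trackers round-robin; re-queue a task when it fails.

    `execute(tracker, task)` returns True on success. All names are hypothetical.
    """
    pending = deque((task, 1) for task in tasks)
    completed = {}  # task -> tracker that finished it
    turn = 0
    while pending:
        task, attempt = pending.popleft()
        tracker = trackers[turn % len(trackers)]
        turn += 1
        if execute(tracker, task):
            completed[task] = tracker
        elif attempt < max_attempts:
            pending.append((task, attempt + 1))  # re-execute on a later turn
        else:
            raise RuntimeError(f"task {task} failed {max_attempts} times")
    return completed


def flaky_execute(tracker, task):
    """Simulate a cluster where tracker "tt2" always fails."""
    return tracker != "tt2"


result = run_job(["map-0", "map-1", "reduce-0"], ["tt1", "tt2", "tt3"], flaky_execute)
```

Here `map-1` first lands on the failing tracker, is re-queued, and completes on `tt1`.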
MapReduce in Hadoop
Introduction
   Nov 2006, Google released the paper “Bigtable: A
    Distributed Storage System for Structured Data”
   Bigtable: distributed, column-oriented store, built on top of
    Google File System.
   HBase: open source implementation of BigTable, built on
    top of HDFS.
Data Model
   Data are stored in tables of rows and columns.
   Cells are “versioned”
→ Data are addressed by row/column/version key.
   Table rows are sorted by row key, the table's primary key.
   Columns are grouped into column families.
→ A column name has the form “<family>:<label>”
   Tables are stored in regions.
   Region: a row range [start-key : end-key)
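The addressing scheme above can be sketched as a toy in-memory table in Python. This is not the HBase client API: the class, the integer versions (HBase uses timestamps), and the sample rows are assumptions for illustration.

```python
import bisect


class Table:
    """Toy table: cells addressed by (row, "family:label", version)."""

    def __init__(self, families):
        self.families = set(families)
        self.row_keys = []  # kept sorted, like the table's primary key
        self.cells = {}     # (row, column) -> {version: value}

    def put(self, row, column, value, version):
        family, _, _label = column.partition(":")
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        index = bisect.bisect_left(self.row_keys, row)
        if index == len(self.row_keys) or self.row_keys[index] != row:
            self.row_keys.insert(index, row)
        self.cells.setdefault((row, column), {})[version] = value

    def get(self, row, column, version=None):
        """Return the value at `version`, or the newest version if omitted."""
        versions = self.cells[(row, column)]
        return versions[max(versions) if version is None else version]

    def scan(self, start_key, end_key):
        """Row keys in the half-open range [start_key, end_key)."""
        lo = bisect.bisect_left(self.row_keys, start_key)
        hi = bisect.bisect_left(self.row_keys, end_key)
        return self.row_keys[lo:hi]


table = Table(families=["info"])
table.put("row-b", "info:city", "Saigon", version=1)
table.put("row-a", "info:city", "Hanoi", version=1)
table.put("row-b", "info:city", "Ho Chi Minh City", version=2)
```

The `scan` method mirrors how a region covers `[start-key : end-key)`: the start key is included and the end key is not.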
Data Model
Architecture
Architecture
   Master Server
    – assigns regions to regionservers
    – monitors the health of regionservers
    – handles administrative functions

   RegionServers
     – contain regions and handle client read/write requests

   Catalog Tables (ROOT and META)
     – maintain the current list, state, recent history, and
       location of all regions.
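Since regions are sorted row ranges, finding the regionserver for a row key is a sorted lookup. A minimal Python sketch, assuming a simplified catalog of (start key, server) pairs; the entries and server names are hypothetical and the real ROOT/META layout is richer than this.

```python
import bisect

# Hypothetical catalog contents: (region start key, regionserver), sorted by start key.
# The first region starts at the empty key; each region ends where the next begins.
CATALOG = [
    ("", "regionserver-1"),
    ("g", "regionserver-2"),
    ("p", "regionserver-1"),
]


def locate_region(row_key, catalog=CATALOG):
    """Return (start_key, server) of the region whose range contains row_key."""
    start_keys = [start for start, _ in catalog]
    index = bisect.bisect_right(start_keys, row_key) - 1
    return catalog[index]
```

A client would look up the region once, then send reads and writes for that row range straight to the returned regionserver.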
Accessibility
   Client API
    – org.apache.hadoop.hbase.client.*

   HBase Shell
$ bin/hbase shell
hbase> 

   Web Interface
Introduction
   started at Facebook
   an open source data warehousing solution
    built on top of Hadoop
   for managing and querying structured data
   Hive QL: SQL-like query language
    – compiled into map-reduce jobs
   log processing, data mining,...
Data Model
   Tables
    – analogous to tables in RDBMS
    – rows are organized into typed columns
    – all the data is stored in a directory in HDFS

   Partitions
    – determine the distribution of data within sub-directories
      of the table directory

   Buckets
    – based on the hash of a column in the table
    – Each bucket is stored as a file in the partition directory
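The table/partition/bucket layout above maps naturally onto a directory path. A Python sketch under stated assumptions: the warehouse root, the `bucket-NNNNN` file naming, the sample table, and the use of CRC32 as the hash are all illustrative stand-ins, not Hive's actual file names or hash function.

```python
import zlib

WAREHOUSE = "/user/hive/warehouse"  # assumed warehouse root


def bucket_for(value, num_buckets):
    """Pick a bucket from a hash of the column value (CRC32 as a stand-in hash)."""
    return zlib.crc32(str(value).encode("utf-8")) % num_buckets


def row_location(table, partition, bucket_value, num_buckets):
    """Build the HDFS path of the file that would hold a row.

    `partition` is a list of (column, value) pairs, one sub-directory each.
    """
    partition_dirs = "/".join(f"{column}={value}" for column, value in partition)
    bucket = bucket_for(bucket_value, num_buckets)
    return f"{WAREHOUSE}/{table}/{partition_dirs}/bucket-{bucket:05d}"


path = row_location(
    "page_views",
    [("ds", "2009-12-13"), ("country", "VN")],
    bucket_value=42,
    num_buckets=32,
)
```

A query that filters on the partition columns only has to read the matching sub-directories, which is the point of partitioning.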
Architecture
Architecture
   Metastore
    – contains metadata about data stored in Hive.
    – stored in any SQL backend or an embedded Derby database.
    – Database: a namespace for tables
    – Table metadata: column types, physical layout,...
    – Partition metadata

   Compiler

   Execution Engine

   Shell
Hive Query Language
   Data Definition (DDL) statements
    – CREATE/DROP/ALTER TABLE
    – SHOW TABLES/PARTITIONS

   Data Manipulation (DML) statements
    – LOAD DATA
    – INSERT
    – SELECT

   User Defined functions: UDF/UDAF
Hive @ Facebook
The End




Thank you!

The Family of Hadoop

  • 1. The Family of Hadoop Nham Xuan Nam nhamxuannam [at] gmail.com http://namnham.blogspot.com Barcamp Saigon, December 13 2009
  • 2. Content
      • History
      • Sub-projects
      • HDFS
      • Map Reduce
      • HBase
      • Hive
  • 3. History
      • Created by Doug Cutting, the creator of Lucene.
      • Lucene: open source index & search library.
      • Nutch: Lucene-based web crawler.
      • Jun 2003: a successful 100-million-page Nutch demo system was running.
      • Nutch problem: its architecture could not scale to billions of pages.
  • 4. History
      • Oct 2003: Google published the paper “The Google File System”.
      • In 2004, the Nutch team wrote an open source implementation of GFS, called Nutch Distributed File System (NDFS).
      • Dec 2004: Google published the paper “MapReduce: Simplified Data Processing on Large Clusters”.
      • In 2005, the Nutch team implemented MapReduce in Nutch.
      • By mid 2005, all the major Nutch algorithms had been ported to run using MapReduce and NDFS.
  • 5. History
      • Feb 2006: Nutch's NDFS and its MapReduce implementation formed the Hadoop project.
      • Doug Cutting joined Yahoo!
      • Jan 2008: Hadoop became an Apache top-level project.
      • Feb 2008: Yahoo!'s production search index was generated by a 10,000-core Hadoop cluster.
  • 10. Data Model
      • File stored as blocks (default size: 64 MB)
      • Reliability through replication
          – Each block is replicated to several datanodes
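A minimal sketch of the block-and-replica idea on this slide, not HDFS's real placement logic. The 64 MB block size comes from the slide; the replication factor of 3, the datanode names, and the round-robin placement are assumptions chosen only for illustration.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # default block size quoted on the slide (64 MB)


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the sizes (in bytes) of the blocks a file of file_size bytes occupies."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])


def place_replicas(num_blocks, datanodes, replication=3):
    """Assign each block to `replication` distinct datanodes, round-robin.

    Hypothetical policy for illustration; real HDFS placement is more involved.
    """
    if replication > len(datanodes):
        raise ValueError("need at least as many datanodes as replicas")
    return {
        block: [datanodes[(block + i) % len(datanodes)] for i in range(replication)]
        for block in range(num_blocks)
    }


# A 200 MB file occupies three full blocks plus one 8 MB block.
blocks = split_into_blocks(200 * 1024 * 1024)
placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
for block, nodes in placement.items():
    print(block, blocks[block], nodes)
```

Because every block lives on several datanodes, losing one datanode still leaves other copies of each block it held.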
  • 11. Namenode & Datanodes
      • Namenode (master)
          – manages the filesystem namespace
          – maintains the filesystem tree and metadata for all the files and directories in the tree
      • Datanodes (slaves)
          – store data in the local file system
          – periodically report back to the namenode with lists of all existing blocks
      • Clients communicate with both namenode and datanodes.
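The division of labour on this slide can be sketched as a toy in-memory model: the namenode holds only the namespace and block locations, and learns the locations from datanode block reports. The class name, method names, file path, and block ids below are all hypothetical and are not the Hadoop API.

```python
from collections import defaultdict


class Namenode:
    """Toy master: keeps the namespace and learns block locations from reports."""

    def __init__(self):
        self.namespace = {}                      # file path -> ordered block ids
        self.block_locations = defaultdict(set)  # block id -> datanode names

    def add_file(self, path, block_ids):
        self.namespace[path] = list(block_ids)

    def receive_block_report(self, datanode, block_ids):
        # A report lists ALL blocks the datanode currently holds, so it
        # replaces whatever the namenode previously believed about that node.
        for holders in self.block_locations.values():
            holders.discard(datanode)
        for block_id in block_ids:
            self.block_locations[block_id].add(datanode)

    def locate(self, path):
        # A client asks the namenode where the blocks are, then reads the
        # data itself from the listed datanodes.
        return [(b, sorted(self.block_locations[b])) for b in self.namespace[path]]


nn = Namenode()
nn.add_file("/barcamp/slides.pdf", ["blk_1", "blk_2"])
nn.receive_block_report("dn1", ["blk_1", "blk_2"])
nn.receive_block_report("dn2", ["blk_1"])
nn.receive_block_report("dn1", ["blk_2"])  # dn1 no longer reports blk_1
print(nn.locate("/barcamp/slides.pdf"))
```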
  • 14. Accessibility
      • FileSystem Java API
          – org.apache.hadoop.fs.*
      • Web Interface
      • Commands for HDFS users
          – $ hadoop dfs -mkdir /barcamp
          – $ hadoop dfs -ls /barcamp
      • Commands for HDFS admins
          – $ hadoop dfsadmin -report
          – $ hadoop dfsadmin -refreshNodes
  • 17. Programming Model
      • Data is a stream of keys and values
      • Map
          – Input: <key1,value1> pairs from data source
          – Output: intermediate <key2,value2> pairs
      • Reduce
          – Called once per key, in sorted order
          – Input: <key2, list of value2>
          – Output: <key3,value3> pairs
  • 19. WordCount Example
      • Input files
          – File01: Hello Barcamp Hello Everyone
          – File02: Hello Hadoop Hello Everyone
      • Map input
          – <_, Hello Barcamp Hello Everyone>
          – <_, Hello Hadoop Hello Everyone>
      • Map output
          – File01: <Hello, 2> <Barcamp, 1> <Everyone, 1>
          – File02: <Hello, 2> <Hadoop, 1> <Everyone, 1>
      • Reduce input
          – <Barcamp, [1]> <Hadoop, [1]> <Hello, [2,2]> <Everyone, [1,1]>
      • Reduce output
          – <Barcamp, 1> <Hadoop, 1> <Hello, 4> <Everyone, 2>
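The WordCount flow on this slide can be simulated in a few lines of plain Python. This is a single-process sketch, not Hadoop code: the function names are made up for illustration, and the map step is assumed to combine counts per file, since the slide shows <Hello, 2> rather than two <Hello, 1> pairs coming out of each map.

```python
from collections import Counter, defaultdict


def map_phase(_key, line):
    """Map: emit <word, count> pairs for one input record, combined locally."""
    return sorted(Counter(line.split()).items())


def shuffle(mapped):
    """Group all intermediate values by key; keys come out in sorted order."""
    groups = defaultdict(list)
    for pairs in mapped:
        for word, count in pairs:
            groups[word].append(count)
    return dict(sorted(groups.items()))


def reduce_phase(word, counts):
    """Reduce: called once per key with the list of its intermediate values."""
    return word, sum(counts)


files = {
    "File01": "Hello Barcamp Hello Everyone",
    "File02": "Hello Hadoop Hello Everyone",
}
mapped = [map_phase(name, text) for name, text in files.items()]
grouped = shuffle(mapped)
result = dict(reduce_phase(word, counts) for word, counts in grouped.items())
print(grouped)
print(result)
```

In a real job the map calls run in parallel on different nodes, and the framework performs the grouping between the two phases.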
  • 20. MapReduce in Hadoop
      • JobTracker (master)
          – handles all jobs
          – schedules tasks on the slaves
          – monitors and re-executes tasks
      • TaskTrackers (slaves)
          – execute the tasks
      • Task
          – runs an individual map or reduce
  • 23. Introduction
      • Nov 2006: Google released the paper “Bigtable: A Distributed Storage System for Structured Data”
      • BigTable: distributed, column-oriented store, built on top of Google File System.
      • HBase: open source implementation of BigTable, built on top of HDFS.
  • 24. Data Model
      • Data are stored in tables of rows and columns.
      • Cells are “versioned”
          → Data are addressed by row/column/version key.
      • Table rows are sorted by row key, the table's primary key.
      • Columns are grouped into column families.
          → A column name has the form “<family>:<label>”
      • Tables are stored in regions.
      • Region: a row range [start-key : end-key)
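A rough in-memory sketch of this data model, assuming nothing about HBase's actual client API: cells are addressed by row, “<family>:<label>” column, and version, and a row is mapped to the region whose half-open range [start-key : end-key) contains it. The class, its methods, and the sample keys are hypothetical.

```python
import bisect


class Table:
    """Toy table: versioned cells plus regions defined by sorted start keys."""

    def __init__(self, region_start_keys):
        # "" as the first start key means the first region begins at the
        # lowest possible row key.
        self.starts = sorted(region_start_keys)
        self.cells = {}  # (row, column) -> {version: value}

    def put(self, row, column, version, value):
        family, sep, _label = column.partition(":")
        if not sep or not family:
            raise ValueError("column must look like '<family>:<label>'")
        self.cells.setdefault((row, column), {})[version] = value

    def get(self, row, column, version=None):
        versions = self.cells[(row, column)]
        if version is None:
            version = max(versions)  # default to the newest version
        return versions[version]

    def region_for(self, row):
        # Index of the region whose [start, end) range contains the row key.
        return bisect.bisect_right(self.starts, row) - 1

    def rows(self):
        # Rows are returned sorted by row key, as on the slide.
        return sorted({row for row, _column in self.cells})


t = Table(["", "m"])  # two regions: ["", "m") and ["m", end of table)
t.put("barcamp", "info:city", 1, "Saigon")
t.put("barcamp", "info:city", 2, "Ho Chi Minh City")
t.put("zebra", "info:city", 1, "n/a")
print(t.get("barcamp", "info:city"), t.region_for("barcamp"), t.region_for("m"))
```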
  • 27. Architecture
      • Master Server
          – assigns regions to regionservers
          – monitors the health of regionservers
          – handles administrative functions
      • RegionServers
          – contain regions and handle client read/write requests
      • Catalog Tables (ROOT and META)
          – maintain the current list, state, recent history, and location of all regions
  • 28. Accessibility
      • Client API
          – org.apache.hadoop.hbase.client.*
      • HBase Shell
          – $ bin/hbase shell
          – hbase>
      • Web Interface
  • 30. Introduction
      • started at Facebook
      • an open source data warehousing solution built on top of Hadoop
      • for managing and querying structured data
      • Hive QL: SQL-like query language
          – compiled into map-reduce jobs
      • log processing, data mining, ...
  • 31. Data Model
      • Tables
          – analogous to tables in RDBMS
          – rows are organized into typed columns
          – all the data is stored in a directory in HDFS
      • Partitions
          – determine the distribution of data within sub-directories of the table directory
      • Buckets
          – based on the hash of a column in the table
          – each bucket is stored as a file in the partition directory
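A small sketch of how a row could map to a storage location under this model: table directory, then one sub-directory per partition column, then a bucket file chosen by hashing a column. The `key=value` directory naming follows Hive's usual layout; the CRC32 hash, the `bucket_NNNNN` file name, and the table and column names are stand-ins chosen for illustration, not Hive's actual hash function or file naming.

```python
import zlib


def bucket_for(value, num_buckets):
    """Pick a bucket from the hash of a column value (CRC32 as a stand-in hash)."""
    return zlib.crc32(str(value).encode("utf-8")) % num_buckets


def storage_path(table_dir, partition, bucket_value, num_buckets):
    """Build <table dir>/<partition sub-directories>/<bucket file> for one row."""
    part_dirs = "/".join(f"{key}={value}" for key, value in partition)
    bucket = bucket_for(bucket_value, num_buckets)
    return f"{table_dir}/{part_dirs}/bucket_{bucket:05d}"


# Hypothetical table partitioned by date and country, bucketed by user id.
path = storage_path(
    "/warehouse/page_views",
    [("ds", "2009-12-13"), ("country", "VN")],
    bucket_value="user_42",
    num_buckets=32,
)
print(path)
```

Rows with the same value in the bucketing column always land in the same bucket file, which is what makes bucketing useful for sampling and joins.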
  • 33. Architecture
      • Metastore
          – contains metadata about data stored in Hive
          – stored in any SQL backend or an embedded Derby
          – Database: a namespace for tables
          – Table metadata: column types, physical layout, ...
          – Partition metadata
      • Compiler
      • Execution Engine
      • Shell
  • 34. Hive Query Language
      • Data Definition (DDL) statements
          – CREATE/DROP/ALTER TABLE
          – SHOW TABLES/PARTITIONS
      • Data Manipulation (DML) statements
          – LOAD DATA
          – INSERT
          – SELECT
      • User Defined functions: UDF/UDAF