SlideShare a Scribd company logo
Scalable Computing
        with
      Hadoop

          Doug Cutting
      cutting@apache.org
    dcutting@yahoo-inc.com

            5/4/06
Seek versus Transfer
●   B-Tree
    –   requires seek per access
    –   unless to recent, cached page
    –   so can buffer & pre-sort accesses
    –   but, w/ fragmentation, must still seek per page


                                                G O U


    A B C D E F                H   I    J K L M           ...
Seek versus Transfer
●   update by merging
    –   merge sort takes log(updates), at transfer rate
    –   merging updates is linear in db size, at transfer rate
●   if 10MB/s xfer, 10ms seek, 1% update of TB db
    –   100b entries, 10kb pages, 10B entries, 1B pages
    –   seek per update requires 1000 days!
    –   seek per page requires 100 days!
    –   transfer entire db takes 1 day
Hadoop DFS
●   modelled after Google's GFS
●   single namenode
    –   maps name → <blockId>*
    –   maps blockId → <host:port>replication_level
●   many datanodes, one per disk generally
    –   map blockId → <byte>*
    –   poll namenode for replication, deletion, etc. requests
●   client code talks to both
Hadoop MapReduce
●   Platform for reliable, scalable computing.
●   All data is sequences of <key,value> pairs.
●   Programmer specifies two primary methods:
    –   map(k, v) → <k', v'>*
    –   reduce(k', <v'>*) → <k', v'>*
    –   also partition(), compare(), & others
●   All v' with same k' are reduced together, in order.
    –   bonus: built-in support for sort/merge!
MapReduce job processing

input     map tasks   reduce tasks   output
split 0   map()
split 1   map()        reduce()      part 0

split 2   map()        reduce()      part 1

split 3   map()        reduce()      part 2

split 4   map()
Example: RegexMapper
public class RegexMapper implements Mapper {
  private Pattern pattern;
  private int group;

    public void configure(JobConf job) {
      pattern = Pattern.compile(job.get("mapred.mapper.regex"));
      group = job.getInt("mapred.mapper.regex.group", 0);
    }

    public void map(WritableComparable key, Writable value,
                    OutputCollector output, Reporter reporter)
      throws IOException {
      String text = ((UTF8)value).toString();
      Matcher matcher = pattern.matcher(text);
      while (matcher.find()) {
        output.collect(new UTF8(matcher.group(group)),
                       new LongWritable(1));
      }
    }
}
Example: LongSumReducer
public class LongSumReducer implements Reducer {

    public void configure(JobConf job) {}

    public void reduce(WritableComparable key, Iterator values,
                       OutputCollector output, Reporter reporter)
      throws IOException {

        long sum = 0;

        while (values.hasNext()) {
          sum += ((LongWritable)values.next()).get();
        }

        output.collect(key, new LongWritable(sum));
    }
}
Example: main()
public static void main(String[] args) throws IOException {
  NutchConf defaults = NutchConf.get();
  JobConf job = new JobConf(defaults);

    job.setInputDir(new File(args[0]));

    job.setMapperClass(RegexMapper.class);
    job.set("mapred.mapper.regex", args[2]);
    job.set("mapred.mapper.regex.group", args[3]);

    job.setReducerClass(LongSumReducer.class);

    job.setOutputDir(args[1]);
    job.setOutputKeyClass(UTF8.class);
    job.setOutputValueClass(LongWritable.class);

    JobClient.runJob(job);
}
Nutch Algorithms
●   inject urls into a crawl db, to bootstrap it.
●   loop:
    –   generate a set of urls to fetch from crawl db;
    –   fetch a set of urls into a segment;
    –   parse fetched content of a segment;
    –   update crawl db with data parsed from a segment.
●   invert links parsed from segments
●   index segment text & inlink anchor text
Nutch on MapReduce & NDFS
●   Nutch's major algorithms converted in 2 weeks.
●   Before:
    –   several were undistributed scalabilty bottlenecks
    –   distributable algorithms were complex to manage
    –   collections larger than 100M pages impractical
●   After:
    –   all are scalable, distributed, easy to operate
    –   code is substantially smaller & simpler
    –   should permit multi-billion page collections
Data Structure: Crawl DB
●   CrawlDb is a directory of files containing:
    <URL, CrawlDatum>
●   CrawlDatum:
    <status, date, interval, failures, linkCount, ...>
●   Status:
    {db_unfetched, db_fetched, db_gone,
      linked,
      fetch_success, fetch_fail, fetch_gone}
Algorithm: Inject
●   MapReduce1: Convert input to DB format
    In: flat text file of urls
    Map(line) → <url, CrawlDatum>; status=db_unfetched
    Reduce() is identity;
    Output: directory of temporary files
●   MapReduce2: Merge into existing DB
    Input: output of Step1 and existing DB files
    Map() is identity.
    Reduce: merge CrawlDatum's into single entry
    Out: new version of DB
Algorithm: Generate
●   MapReduce1: select urls due for fetch
    In: Crawl DB files
    Map() → if date≥now, invert to <CrawlDatum, url>
    Partition by value hash (!) to randomize
    Reduce:
       compare() order by decreasing CrawlDatum.linkCount
       output only top-N most-linked entries
●   MapReduce2: prepare for fetch
    Map() is invert; Partition() by host, Reduce() is identity.
    Out: Set of <url,CrawlDatum> files to fetch in parallel
Algorithm: Fetch
●   MapReduce: fetch a set of urls
    In: <url,CrawlDatum>, partition by host, sort by hash
    Map(url,CrawlDatum) → <url, FetcherOutput>
       multi-threaded, async map implementation
       calls existing Nutch protocol plugins
    FetcherOutput: <CrawlDatum, Content>
    Reduce is identity
    Out: two files: <url,CrawlDatum>, <url,Content>
Algorithm: Parse
●   MapReduce: parse content
    In: <url, Content> files from Fetch
    Map(url, Content) → <url, Parse>
     calls existing Nutch parser plugins
    Reduce is identity.
    Parse: <ParseText, ParseData>
    Out: split in three: <url,ParseText>, <url,ParseData>
     and <url,CrawlDatum> for outlinks.
Algorithm: Update Crawl DB
●   MapReduce: integrate fetch & parse out into db
    In: <url,CrawlDatum> existing db plus fetch & parse
      out
    Map() is identity
    Reduce() merges all entries into a single new entry
       overwrite previous db status w/ new from fetch
       sum count of links from parse w/ previous from db
    Out: new crawl db
Algorithm: Invert Links
●   MapReduce: compute inlinks for all urls
    In: <url,ParseData>, containing page outlinks
    Map(srcUrl, ParseData> → <destUrl, Inlinks>
       collect a single-element Inlinks for each outlink
       limit number of outlinks per page
    Inlinks: <srcUrl, anchorText>*
    Reduce() appends inlinks
    Out: <url, Inlinks>, a complete link inversion
Algorithm: Index
●   MapReduce: create Lucene indexes
    In: multiple files, values wrapped in <Class, Object>
       <url, ParseData> from parse, for title, metadata, etc.
       <url, ParseText> from parse, for text
       <url, Inlinks> from invert, for anchors
       <url, CrawlDatum> from fetch, for fetch date
    Map() is identity
    Reduce() create a Lucene Document
       call existing Nutch indexing plugins
    Out: build Lucene index; copy to fs at end
Hadoop MapReduce Extensions
●   Split output to multiple files
    –   saves subsequent i/o, since inputs are smaller
●   Mix input value types
    –   saves MapReduce passes to convert values
●   Async Map
    –   permits multi-threaded Fetcher
●   Partition by Value
    –   facilitates selecting subsets w/ maximum key values
Thanks!




http://lucene.apache.org/hadoop/

More Related Content

What's hot

Getting started with PostGIS geographic database
Getting started with PostGIS geographic databaseGetting started with PostGIS geographic database
Getting started with PostGIS geographic database
EDINA, University of Edinburgh
 
Getting Started with PostGIS geographic database - Lasma Sietinsone, EDINA
Getting Started with PostGIS geographic database - Lasma Sietinsone, EDINAGetting Started with PostGIS geographic database - Lasma Sietinsone, EDINA
Getting Started with PostGIS geographic database - Lasma Sietinsone, EDINA
JISC GECO
 
Learning postgresql
Learning postgresqlLearning postgresql
Learning postgresql
DAVID RAUDALES
 
mesos-devoxx14
mesos-devoxx14mesos-devoxx14
mesos-devoxx14
Samir Bessalah
 
SORT & JOIN IN SPARK 2.0
SORT & JOIN IN SPARK 2.0SORT & JOIN IN SPARK 2.0
SORT & JOIN IN SPARK 2.0
Sigmoid
 
InfluxDB IOx Tech Talks: Intro to the InfluxDB IOx Read Buffer - A Read-Optim...
InfluxDB IOx Tech Talks: Intro to the InfluxDB IOx Read Buffer - A Read-Optim...InfluxDB IOx Tech Talks: Intro to the InfluxDB IOx Read Buffer - A Read-Optim...
InfluxDB IOx Tech Talks: Intro to the InfluxDB IOx Read Buffer - A Read-Optim...
InfluxData
 
“Open Data Web” – A Linked Open Data Repository Built with CKAN
“Open Data Web” – A Linked Open Data Repository Built with CKAN“Open Data Web” – A Linked Open Data Repository Built with CKAN
“Open Data Web” – A Linked Open Data Repository Built with CKAN
Chengjen Lee
 
MapDB - taking Java collections to the next level
MapDB - taking Java collections to the next levelMapDB - taking Java collections to the next level
MapDB - taking Java collections to the next level
JavaDayUA
 
Postgresql Database Administration Basic - Day2
Postgresql  Database Administration Basic  - Day2Postgresql  Database Administration Basic  - Day2
Postgresql Database Administration Basic - Day2
PoguttuezhiniVP
 
Jsonnet, terraform & packer
Jsonnet, terraform & packerJsonnet, terraform & packer
Jsonnet, terraform & packer
David Cunningham
 
Save Java memory
Save Java memorySave Java memory
Save Java memory
JavaDayUA
 
Developing and Deploying Apps with the Postgres FDW
Developing and Deploying Apps with the Postgres FDWDeveloping and Deploying Apps with the Postgres FDW
Developing and Deploying Apps with the Postgres FDW
Jonathan Katz
 
Server side geo_tools_in_drupal_pnw_2012
Server side geo_tools_in_drupal_pnw_2012Server side geo_tools_in_drupal_pnw_2012
Server side geo_tools_in_drupal_pnw_2012
Mack Hardy
 
Valerii Vasylkov Erlang. measurements and benefits.
Valerii Vasylkov Erlang. measurements and benefits.Valerii Vasylkov Erlang. measurements and benefits.
Valerii Vasylkov Erlang. measurements and benefits.
Аліна Шепшелей
 
Clickhouse Capacity Planning for OLAP Workloads, Mik Kocikowski of CloudFlare
Clickhouse Capacity Planning for OLAP Workloads, Mik Kocikowski of CloudFlareClickhouse Capacity Planning for OLAP Workloads, Mik Kocikowski of CloudFlare
Clickhouse Capacity Planning for OLAP Workloads, Mik Kocikowski of CloudFlare
Altinity Ltd
 
RethinkDB - the open-source database for the realtime web
RethinkDB - the open-source database for the realtime webRethinkDB - the open-source database for the realtime web
RethinkDB - the open-source database for the realtime web
Alex Ivanov
 
PgconfSV compression
PgconfSV compressionPgconfSV compression
PgconfSV compression
Anastasia Lubennikova
 
Polyglot ClickHouse -- ClickHouse SF Meetup Sept 10
Polyglot ClickHouse -- ClickHouse SF Meetup Sept 10Polyglot ClickHouse -- ClickHouse SF Meetup Sept 10
Polyglot ClickHouse -- ClickHouse SF Meetup Sept 10
Altinity Ltd
 
Postgresql Federation
Postgresql FederationPostgresql Federation
Postgresql Federation
Jim Mlodgenski
 
A deeper-understanding-of-spark-internals
A deeper-understanding-of-spark-internalsA deeper-understanding-of-spark-internals
A deeper-understanding-of-spark-internals
Cheng Min Chi
 

What's hot (20)

Getting started with PostGIS geographic database
Getting started with PostGIS geographic databaseGetting started with PostGIS geographic database
Getting started with PostGIS geographic database
 
Getting Started with PostGIS geographic database - Lasma Sietinsone, EDINA
Getting Started with PostGIS geographic database - Lasma Sietinsone, EDINAGetting Started with PostGIS geographic database - Lasma Sietinsone, EDINA
Getting Started with PostGIS geographic database - Lasma Sietinsone, EDINA
 
Learning postgresql
Learning postgresqlLearning postgresql
Learning postgresql
 
mesos-devoxx14
mesos-devoxx14mesos-devoxx14
mesos-devoxx14
 
SORT & JOIN IN SPARK 2.0
SORT & JOIN IN SPARK 2.0SORT & JOIN IN SPARK 2.0
SORT & JOIN IN SPARK 2.0
 
InfluxDB IOx Tech Talks: Intro to the InfluxDB IOx Read Buffer - A Read-Optim...
InfluxDB IOx Tech Talks: Intro to the InfluxDB IOx Read Buffer - A Read-Optim...InfluxDB IOx Tech Talks: Intro to the InfluxDB IOx Read Buffer - A Read-Optim...
InfluxDB IOx Tech Talks: Intro to the InfluxDB IOx Read Buffer - A Read-Optim...
 
“Open Data Web” – A Linked Open Data Repository Built with CKAN
“Open Data Web” – A Linked Open Data Repository Built with CKAN“Open Data Web” – A Linked Open Data Repository Built with CKAN
“Open Data Web” – A Linked Open Data Repository Built with CKAN
 
MapDB - taking Java collections to the next level
MapDB - taking Java collections to the next levelMapDB - taking Java collections to the next level
MapDB - taking Java collections to the next level
 
Postgresql Database Administration Basic - Day2
Postgresql  Database Administration Basic  - Day2Postgresql  Database Administration Basic  - Day2
Postgresql Database Administration Basic - Day2
 
Jsonnet, terraform & packer
Jsonnet, terraform & packerJsonnet, terraform & packer
Jsonnet, terraform & packer
 
Save Java memory
Save Java memorySave Java memory
Save Java memory
 
Developing and Deploying Apps with the Postgres FDW
Developing and Deploying Apps with the Postgres FDWDeveloping and Deploying Apps with the Postgres FDW
Developing and Deploying Apps with the Postgres FDW
 
Server side geo_tools_in_drupal_pnw_2012
Server side geo_tools_in_drupal_pnw_2012Server side geo_tools_in_drupal_pnw_2012
Server side geo_tools_in_drupal_pnw_2012
 
Valerii Vasylkov Erlang. measurements and benefits.
Valerii Vasylkov Erlang. measurements and benefits.Valerii Vasylkov Erlang. measurements and benefits.
Valerii Vasylkov Erlang. measurements and benefits.
 
Clickhouse Capacity Planning for OLAP Workloads, Mik Kocikowski of CloudFlare
Clickhouse Capacity Planning for OLAP Workloads, Mik Kocikowski of CloudFlareClickhouse Capacity Planning for OLAP Workloads, Mik Kocikowski of CloudFlare
Clickhouse Capacity Planning for OLAP Workloads, Mik Kocikowski of CloudFlare
 
RethinkDB - the open-source database for the realtime web
RethinkDB - the open-source database for the realtime webRethinkDB - the open-source database for the realtime web
RethinkDB - the open-source database for the realtime web
 
PgconfSV compression
PgconfSV compressionPgconfSV compression
PgconfSV compression
 
Polyglot ClickHouse -- ClickHouse SF Meetup Sept 10
Polyglot ClickHouse -- ClickHouse SF Meetup Sept 10Polyglot ClickHouse -- ClickHouse SF Meetup Sept 10
Polyglot ClickHouse -- ClickHouse SF Meetup Sept 10
 
Postgresql Federation
Postgresql FederationPostgresql Federation
Postgresql Federation
 
A deeper-understanding-of-spark-internals
A deeper-understanding-of-spark-internalsA deeper-understanding-of-spark-internals
A deeper-understanding-of-spark-internals
 

Viewers also liked

Dean Keynote Ladis2009
Dean Keynote Ladis2009Dean Keynote Ladis2009
Dean Keynote Ladis2009
longhao
 
Vim get start_1.0
Vim get start_1.0Vim get start_1.0
Vim get start_1.0
longhao
 
产品部内部交流平台
产品部内部交流平台产品部内部交流平台
产品部内部交流平台
longhao
 
高性能网站最佳实践
高性能网站最佳实践高性能网站最佳实践
高性能网站最佳实践longhao
 
Netputer
NetputerNetputer
Netputerlonghao
 
透明计算与云计算
透明计算与云计算透明计算与云计算
透明计算与云计算longhao
 
手机上的硬件设备及典型App
手机上的硬件设备及典型App手机上的硬件设备及典型App
手机上的硬件设备及典型App
longhao
 
Node.js长连接开发实践
Node.js长连接开发实践Node.js长连接开发实践
Node.js长连接开发实践
longhao
 
软件重构
软件重构软件重构
软件重构longhao
 
借助社会化媒体的个人成长
借助社会化媒体的个人成长借助社会化媒体的个人成长
借助社会化媒体的个人成长longhao
 
Java并发编程培训
Java并发编程培训Java并发编程培训
Java并发编程培训longhao
 
无Ued产品的易用性讨论
无Ued产品的易用性讨论无Ued产品的易用性讨论
无Ued产品的易用性讨论
longhao
 
并发编程实践
并发编程实践并发编程实践
并发编程实践longhao
 
Cw concept Cocoa
Cw concept CocoaCw concept Cocoa
Cw concept Cocoa
Meyta Nasetyowati
 
Tiempo del verbo 3
Tiempo del verbo 3Tiempo del verbo 3
Los tic
Los ticLos tic
Exeter Presentation November_2016_Web
Exeter Presentation November_2016_WebExeter Presentation November_2016_Web
Exeter Presentation November_2016_Web
Exeter Resource Corporation
 
Fashionary by WGSN
Fashionary by WGSNFashionary by WGSN
Fashionary by WGSN
Penter
 
What are the important brand architecture decisions in developing a branding ...
What are the important brand architecture decisions in developing a branding ...What are the important brand architecture decisions in developing a branding ...
What are the important brand architecture decisions in developing a branding ...
Sameer Mathur
 
High Bandwidth suspention modelling and Design LQR Full state Feedback Contro...
High Bandwidth suspention modelling and Design LQR Full state Feedback Contro...High Bandwidth suspention modelling and Design LQR Full state Feedback Contro...
High Bandwidth suspention modelling and Design LQR Full state Feedback Contro...
Idabagus Mahartana
 

Viewers also liked (20)

Dean Keynote Ladis2009
Dean Keynote Ladis2009Dean Keynote Ladis2009
Dean Keynote Ladis2009
 
Vim get start_1.0
Vim get start_1.0Vim get start_1.0
Vim get start_1.0
 
产品部内部交流平台
产品部内部交流平台产品部内部交流平台
产品部内部交流平台
 
高性能网站最佳实践
高性能网站最佳实践高性能网站最佳实践
高性能网站最佳实践
 
Netputer
NetputerNetputer
Netputer
 
透明计算与云计算
透明计算与云计算透明计算与云计算
透明计算与云计算
 
手机上的硬件设备及典型App
手机上的硬件设备及典型App手机上的硬件设备及典型App
手机上的硬件设备及典型App
 
Node.js长连接开发实践
Node.js长连接开发实践Node.js长连接开发实践
Node.js长连接开发实践
 
软件重构
软件重构软件重构
软件重构
 
借助社会化媒体的个人成长
借助社会化媒体的个人成长借助社会化媒体的个人成长
借助社会化媒体的个人成长
 
Java并发编程培训
Java并发编程培训Java并发编程培训
Java并发编程培训
 
无Ued产品的易用性讨论
无Ued产品的易用性讨论无Ued产品的易用性讨论
无Ued产品的易用性讨论
 
并发编程实践
并发编程实践并发编程实践
并发编程实践
 
Cw concept Cocoa
Cw concept CocoaCw concept Cocoa
Cw concept Cocoa
 
Tiempo del verbo 3
Tiempo del verbo 3Tiempo del verbo 3
Tiempo del verbo 3
 
Los tic
Los ticLos tic
Los tic
 
Exeter Presentation November_2016_Web
Exeter Presentation November_2016_WebExeter Presentation November_2016_Web
Exeter Presentation November_2016_Web
 
Fashionary by WGSN
Fashionary by WGSNFashionary by WGSN
Fashionary by WGSN
 
What are the important brand architecture decisions in developing a branding ...
What are the important brand architecture decisions in developing a branding ...What are the important brand architecture decisions in developing a branding ...
What are the important brand architecture decisions in developing a branding ...
 
High Bandwidth suspention modelling and Design LQR Full state Feedback Contro...
High Bandwidth suspention modelling and Design LQR Full state Feedback Contro...High Bandwidth suspention modelling and Design LQR Full state Feedback Contro...
High Bandwidth suspention modelling and Design LQR Full state Feedback Contro...
 

Similar to hadoop

Hadoop
HadoopHadoop
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
Ran Silberman
 
Barcelona MUG MongoDB + Hadoop Presentation
Barcelona MUG MongoDB + Hadoop PresentationBarcelona MUG MongoDB + Hadoop Presentation
Barcelona MUG MongoDB + Hadoop Presentation
Norberto Leite
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
Ran Silberman
 
Scala and spark
Scala and sparkScala and spark
Scala and spark
Fabio Fumarola
 
Hadoop fault tolerance
Hadoop  fault toleranceHadoop  fault tolerance
Hadoop fault tolerance
Pallav Jha
 
Hadoop Introduction
Hadoop IntroductionHadoop Introduction
Hadoop Introduction
SNEHAL MASNE
 
AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)
Paul Chao
 
Hadoop - MongoDB Webinar June 2014
Hadoop - MongoDB Webinar June 2014Hadoop - MongoDB Webinar June 2014
Hadoop - MongoDB Webinar June 2014
MongoDB
 
Meethadoop
MeethadoopMeethadoop
Meethadoop
IIIT-H
 
Apache Spark Introduction - CloudxLab
Apache Spark Introduction - CloudxLabApache Spark Introduction - CloudxLab
Apache Spark Introduction - CloudxLab
Abhinav Singh
 
Dive into spark2
Dive into spark2Dive into spark2
Dive into spark2
Gal Marder
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache Spark
Databricks
 
SF Big Analytics 20191112: How to performance-tune Spark applications in larg...
SF Big Analytics 20191112: How to performance-tune Spark applications in larg...SF Big Analytics 20191112: How to performance-tune Spark applications in larg...
SF Big Analytics 20191112: How to performance-tune Spark applications in larg...
Chester Chen
 
Google Cluster Innards
Google Cluster InnardsGoogle Cluster Innards
Google Cluster Innards
Martin Dvorak
 
Introduction To MongoDB
Introduction To MongoDBIntroduction To MongoDB
Introduction To MongoDB
ElieHannouch
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
Datio Big Data
 
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop ClustersHDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
Xiao Qin
 
Processing massive amount of data with Map Reduce using Apache Hadoop - Indi...
Processing massive amount of data with Map Reduce using Apache Hadoop  - Indi...Processing massive amount of data with Map Reduce using Apache Hadoop  - Indi...
Processing massive amount of data with Map Reduce using Apache Hadoop - Indi...
IndicThreads
 
Introduction to Sqoop Aaron Kimball Cloudera Hadoop User Group UK
Introduction to Sqoop Aaron Kimball Cloudera Hadoop User Group UKIntroduction to Sqoop Aaron Kimball Cloudera Hadoop User Group UK
Introduction to Sqoop Aaron Kimball Cloudera Hadoop User Group UK
Skills Matter
 

Similar to hadoop (20)

Hadoop
HadoopHadoop
Hadoop
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
 
Barcelona MUG MongoDB + Hadoop Presentation
Barcelona MUG MongoDB + Hadoop PresentationBarcelona MUG MongoDB + Hadoop Presentation
Barcelona MUG MongoDB + Hadoop Presentation
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
 
Scala and spark
Scala and sparkScala and spark
Scala and spark
 
Hadoop fault tolerance
Hadoop  fault toleranceHadoop  fault tolerance
Hadoop fault tolerance
 
Hadoop Introduction
Hadoop IntroductionHadoop Introduction
Hadoop Introduction
 
AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)
 
Hadoop - MongoDB Webinar June 2014
Hadoop - MongoDB Webinar June 2014Hadoop - MongoDB Webinar June 2014
Hadoop - MongoDB Webinar June 2014
 
Meethadoop
MeethadoopMeethadoop
Meethadoop
 
Apache Spark Introduction - CloudxLab
Apache Spark Introduction - CloudxLabApache Spark Introduction - CloudxLab
Apache Spark Introduction - CloudxLab
 
Dive into spark2
Dive into spark2Dive into spark2
Dive into spark2
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache Spark
 
SF Big Analytics 20191112: How to performance-tune Spark applications in larg...
SF Big Analytics 20191112: How to performance-tune Spark applications in larg...SF Big Analytics 20191112: How to performance-tune Spark applications in larg...
SF Big Analytics 20191112: How to performance-tune Spark applications in larg...
 
Google Cluster Innards
Google Cluster InnardsGoogle Cluster Innards
Google Cluster Innards
 
Introduction To MongoDB
Introduction To MongoDBIntroduction To MongoDB
Introduction To MongoDB
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop ClustersHDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
 
Processing massive amount of data with Map Reduce using Apache Hadoop - Indi...
Processing massive amount of data with Map Reduce using Apache Hadoop  - Indi...Processing massive amount of data with Map Reduce using Apache Hadoop  - Indi...
Processing massive amount of data with Map Reduce using Apache Hadoop - Indi...
 
Introduction to Sqoop Aaron Kimball Cloudera Hadoop User Group UK
Introduction to Sqoop Aaron Kimball Cloudera Hadoop User Group UKIntroduction to Sqoop Aaron Kimball Cloudera Hadoop User Group UK
Introduction to Sqoop Aaron Kimball Cloudera Hadoop User Group UK
 

Recently uploaded

GenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizationsGenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizations
kumardaparthi1024
 
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
Neo4j
 
Mind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AIMind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AI
Kumud Singh
 
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUHCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
panagenda
 
Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
Uni Systems S.M.S.A.
 
How to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For FlutterHow to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For Flutter
Daiki Mogmet Ito
 
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Speck&Tech
 
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
SOFTTECHHUB
 
RESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for studentsRESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for students
KAMESHS29
 
UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5
DianaGray10
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
Matthew Sinclair
 
Programming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup SlidesProgramming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup Slides
Zilliz
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
Safe Software
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
Kari Kakkonen
 
“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”
Claudio Di Ciccio
 
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
Neo4j
 
Presentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of GermanyPresentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of Germany
innovationoecd
 
Artificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopmentArtificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopment
Octavian Nadolu
 
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial IntelligenceAI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
IndexBug
 
UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6
DianaGray10
 

Recently uploaded (20)

GenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizationsGenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizations
 
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
 
Mind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AIMind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AI
 
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUHCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
 
Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
 
How to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For FlutterHow to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For Flutter
 
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
 
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
 
RESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for studentsRESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for students
 
UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
 
Programming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup SlidesProgramming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup Slides
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
 
“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”
 
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
 
Presentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of GermanyPresentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of Germany
 
Artificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopmentArtificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopment
 
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial IntelligenceAI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
 
UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6
 

hadoop

  • 1. Scalable Computing with Hadoop Doug Cutting cutting@apache.org dcutting@yahoo-inc.com 5/4/06
  • 2. Seek versus Transfer ● B-Tree – requires seek per access – unless to recent, cached page – so can buffer & pre-sort accesses – but, w/ fragmentation, must still seek per page G O U A B C D E F H I J K L M ...
  • 3. Seek versus Transfer ● update by merging – merge sort takes log(updates), at transfer rate – merging updates is linear in db size, at transfer rate ● if 10MB/s xfer, 10ms seek, 1% update of TB db – 100b entries, 10kb pages, 10B entries, 1B pages – seek per update requires 1000 days! – seek per page requires 100 days! – transfer entire db takes 1 day
  • 4. Hadoop DFS ● modelled after Google's GFS ● single namenode – maps name → <blockId>* – maps blockId → <host:port>replication_level ● many datanodes, one per disk generally – map blockId → <byte>* – poll namenode for replication, deletion, etc. requests ● client code talks to both
  • 5. Hadoop MapReduce ● Platform for reliable, scalable computing. ● All data is sequences of <key,value> pairs. ● Programmer specifies two primary methods: – map(k, v) → <k', v'>* – reduce(k', <v'>*) → <k', v'>* – also partition(), compare(), & others ● All v' with same k' are reduced together, in order. – bonus: built-in support for sort/merge!
  • 6. MapReduce job processing input map tasks reduce tasks output split 0 map() split 1 map() reduce() part 0 split 2 map() reduce() part 1 split 3 map() reduce() part 2 split 4 map()
  • 7. Example: RegexMapper public class RegexMapper implements Mapper { private Pattern pattern; private int group; public void configure(JobConf job) { pattern = Pattern.compile(job.get("mapred.mapper.regex")); group = job.getInt("mapred.mapper.regex.group", 0); } public void map(WritableComparable key, Writable value, OutputCollector output, Reporter reporter) throws IOException { String text = ((UTF8)value).toString(); Matcher matcher = pattern.matcher(text); while (matcher.find()) { output.collect(new UTF8(matcher.group(group)), new LongWritable(1)); } } }
  • 8. Example: LongSumReducer public class LongSumReducer implements Reducer { public void configure(JobConf job) {} public void reduce(WritableComparable key, Iterator values, OutputCollector output, Reporter reporter) throws IOException { long sum = 0; while (values.hasNext()) { sum += ((LongWritable)values.next()).get(); } output.collect(key, new LongWritable(sum)); } }
  • 9. Example: main() public static void main(String[] args) throws IOException { NutchConf defaults = NutchConf.get(); JobConf job = new JobConf(defaults); job.setInputDir(new File(args[0])); job.setMapperClass(RegexMapper.class); job.set("mapred.mapper.regex", args[2]); job.set("mapred.mapper.regex.group", args[3]); job.setReducerClass(LongSumReducer.class); job.setOutputDir(args[1]); job.setOutputKeyClass(UTF8.class); job.setOutputValueClass(LongWritable.class); JobClient.runJob(job); }
  • 10. Nutch Algorithms ● inject urls into a crawl db, to bootstrap it. ● loop: – generate a set of urls to fetch from crawl db; – fetch a set of urls into a segment; – parse fetched content of a segment; – update crawl db with data parsed from a segment. ● invert links parsed from segments ● index segment text & inlink anchor text
  • 11. Nutch on MapReduce & NDFS ● Nutch's major algorithms converted in 2 weeks. ● Before: – several were undistributed scalabilty bottlenecks – distributable algorithms were complex to manage – collections larger than 100M pages impractical ● After: – all are scalable, distributed, easy to operate – code is substantially smaller & simpler – should permit multi-billion page collections
  • 12. Data Structure: Crawl DB ● CrawlDb is a directory of files containing: <URL, CrawlDatum> ● CrawlDatum: <status, date, interval, failures, linkCount, ...> ● Status: {db_unfetched, db_fetched, db_gone, linked, fetch_success, fetch_fail, fetch_gone}
  • 13. Algorithm: Inject ● MapReduce1: Convert input to DB format In: flat text file of urls Map(line) → <url, CrawlDatum>; status=db_unfetched Reduce() is identity; Output: directory of temporary files ● MapReduce2: Merge into existing DB Input: output of Step1 and existing DB files Map() is identity. Reduce: merge CrawlDatum's into single entry Out: new version of DB
  • 14. Algorithm: Generate ● MapReduce1: select urls due for fetch In: Crawl DB files Map() → if date≥now, invert to <CrawlDatum, url> Partition by value hash (!) to randomize Reduce: compare() order by decreasing CrawlDatum.linkCount output only top-N most-linked entries ● MapReduce2: prepare for fetch Map() is invert; Partition() by host, Reduce() is identity. Out: Set of <url,CrawlDatum> files to fetch in parallel
  • 15. Algorithm: Fetch ● MapReduce: fetch a set of urls In: <url,CrawlDatum>, partition by host, sort by hash Map(url,CrawlDatum) → <url, FetcherOutput> multi-threaded, async map implementation calls existing Nutch protocol plugins FetcherOutput: <CrawlDatum, Content> Reduce is identity Out: two files: <url,CrawlDatum>, <url,Content>
  • 16. Algorithm: Parse ● MapReduce: parse content In: <url, Content> files from Fetch Map(url, Content) → <url, Parse> calls existing Nutch parser plugins Reduce is identity. Parse: <ParseText, ParseData> Out: split in three: <url,ParseText>, <url,ParseData> and <url,CrawlDatum> for outlinks.
  • 17. Algorithm: Update Crawl DB ● MapReduce: integrate fetch & parse out into db In: <url,CrawlDatum> existing db plus fetch & parse out Map() is identity Reduce() merges all entries into a single new entry overwrite previous db status w/ new from fetch sum count of links from parse w/ previous from db Out: new crawl db
  • 18. Algorithm: Invert Links ● MapReduce: compute inlinks for all urls In: <url,ParseData>, containing page outlinks Map(srcUrl, ParseData> → <destUrl, Inlinks> collect a single-element Inlinks for each outlink limit number of outlinks per page Inlinks: <srcUrl, anchorText>* Reduce() appends inlinks Out: <url, Inlinks>, a complete link inversion
  • 19. Algorithm: Index ● MapReduce: create Lucene indexes In: multiple files, values wrapped in <Class, Object> <url, ParseData> from parse, for title, metadata, etc. <url, ParseText> from parse, for text <url, Inlinks> from invert, for anchors <url, CrawlDatum> from fetch, for fetch date Map() is identity Reduce() create a Lucene Document call existing Nutch indexing plugins Out: build Lucene index; copy to fs at end
  • 20. Hadoop MapReduce Extensions ● Split output to multiple files – saves subsequent i/o, since inputs are smaller ● Mix input value types – saves MapReduce passes to convert values ● Async Map – permits multi-threaded Fetcher ● Partition by Value – facilitates selecting subsets w/ maximum key values