SlideShare a Scribd company logo
1 of 30
Comparing Distributed Indexing: To Mapreduce or Not? Richard McCreadie Craig Macdonald Iadh Ounis
Talk Outline ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
MOTIVATIONS ,[object Object],[object Object]
Why is Efficient Indexing Important? ,[object Object],[object Object],[object Object],Collection Data Year Docs Size(GB) WT2G Web 1999 240k 2.0 GOV Web 2002 1.8M 18.0 Blogs06 Blogs 2006 3M 13.0 GOV2 Web 2004 25M 425.0 ClueWeb09 Web 2009 1.2B 25,000
[object Object],[object Object],[object Object],Solutions? MapReduce
Contributions ,[object Object],[object Object],[object Object],[object Object],[object Object]
CLASSICAL INDEXING ,[object Object],[object Object],[object Object]
Classical Indexing ,[object Object],[object Object],[object Object],[object Object],(I.H. Witten, A. Moffat and T.C. Bell, 1999) Lexicon Posting List term Total docs Total frequency pointer Document number frequency
[object Object],[object Object],[object Object],Single-Pass In-Memory Indexing <> <> <> <> <> <> DISK Compressed Files % Used RAM t1 t2 t3 Indexer Final Inverted Index
How can Indexing be Distributed? ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Problems with Classical Approaches ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
MAPREDUCE INDEXING ,[object Object],[object Object]
MapReduce ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Indexing with MapReduce ,[object Object],[object Object],[object Object],[object Object],Dean & Ghemawat, MapReduce: Simplified data processing on large clusters. OSDI 2004 Emit each word in the document A big intermediate sort Lots of merging Approach  Emits Sorting Num emits per map Emit size D&G_Token Tokens Lots Lots Tiny D&G_Term Terms Lots Many Tiny Nutch Documents Little Some Average Single-Pass Posting lists Some Few Large
D&G_Token & D&G_Term ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],Emit every  token Emit every  term
Nutch Style Indexing ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],Emit every  document
Our Single-Pass Indexing Strategy ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],Emit limited  Posting-Lists
Our MapReduce Indexing Strategy (2) ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
EXPERIMENTATION & RESULTS  ,[object Object],[object Object]
Research Questions ,[object Object],[object Object],[object Object],[object Object]
Evaluation of MapReduce Indexing ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Experimental Setup ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Target Indexing Throughput Table 1 : Throughput (MB/sec) when indexing .GOV2 with m machines ,[object Object],[object Object],Indexing Strategy 1 2 4 6 8 Shared-Nothing Distributed 3 6 12 18 24 Shared-Corpus Distributed MapReduce D&G_Token MapReduce D&G_Term MapReduce Single-Pass Number of Machines Allocated
Baseline Indexing Throughput ,[object Object],[object Object],[object Object],[object Object],Table 1 : Throughput (MB/sec) when indexing .GOV2 with m machines Indexing Strategy 1 2 4 6 8 Shared-Nothing Distributed 3 6 12 18 24 Shared-Corpus Distributed 2.44 4.6 12.8 12.4 12.8 MapReduce D&G_Token MapReduce D&G_Term MapReduce Single-Pass Number of Machines Allocated
D&G_Token Indexing Throughput ,[object Object],[object Object],Table 1 : Throughput (MB/sec) when indexing .GOV2 with m machines Indexing Strategy 1 2 4 6 8 Shared-Nothing Distributed 3 6 12 18 24 Shared-Corpus Distributed 2.44 4.6 12.8 12.4 12.8 MapReduce D&G_Token - - - - - MapReduce D&G_Term MapReduce Single-Pass Number of Machines Allocated
D&G_Term Indexing Throughput ,[object Object],[object Object],[object Object],[object Object],Table 1 : Throughput (MB/sec) when indexing .GOV2 with m machines Indexing Strategy 1 2 4 6 8 Shared-Nothing Distributed 3 6 12 18 24 Shared-Corpus Distributed 2.44 4.6 12.8 12.4 12.8 MapReduce D&G_Token - - - - - MapReduce D&G_Term 1.15 1.59 4.01 4.71 6.38 MapReduce Single-Pass Number of Machines Allocated
Single-Pass Indexing Throughput ,[object Object],[object Object],[object Object],Table 1 : Throughput (MB/sec) when indexing .GOV2 with m machines Indexing Strategy 1 2 4 6 8 Shared-Nothing Distributed 3 6 12 18 24 Shared-Corpus Distributed 2.44 4.6 12.8 12.4 12.8 MapReduce D&G_Token - - - - - MapReduce D&G_Term 1.15 1.59 4.01 4.71 6.38 MapReduce Single-Pass 2.59 5.19 9.45 13.16 17.31 Number of Machines Allocated
CONCLUSION
Conclusions ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Questions? ,[object Object]

More Related Content

What's hot

Hadoop in sigmod 2011
Hadoop in sigmod 2011Hadoop in sigmod 2011
Hadoop in sigmod 2011
Bin Cai
 
Reduce Side Joins
Reduce Side Joins Reduce Side Joins
Reduce Side Joins
Edureka!
 

What's hot (17)

Parallel KNN for Big Data using Adaptive Indexing
Parallel KNN for Big Data using Adaptive IndexingParallel KNN for Big Data using Adaptive Indexing
Parallel KNN for Big Data using Adaptive Indexing
 
An Introduction To Map-Reduce
An Introduction To Map-ReduceAn Introduction To Map-Reduce
An Introduction To Map-Reduce
 
Hadoop in sigmod 2011
Hadoop in sigmod 2011Hadoop in sigmod 2011
Hadoop in sigmod 2011
 
IRJET- Review of Existing Methods in K-Means Clustering Algorithm
IRJET- Review of Existing Methods in K-Means Clustering AlgorithmIRJET- Review of Existing Methods in K-Means Clustering Algorithm
IRJET- Review of Existing Methods in K-Means Clustering Algorithm
 
Repartition join in mapreduce
Repartition join in mapreduceRepartition join in mapreduce
Repartition join in mapreduce
 
Introduction to Map-Reduce
Introduction to Map-ReduceIntroduction to Map-Reduce
Introduction to Map-Reduce
 
Computing Scientometrics in Large-Scale Academic Search Engines with MapReduce
Computing Scientometrics in Large-Scale Academic Search Engines with MapReduceComputing Scientometrics in Large-Scale Academic Search Engines with MapReduce
Computing Scientometrics in Large-Scale Academic Search Engines with MapReduce
 
Reduce Side Joins
Reduce Side Joins Reduce Side Joins
Reduce Side Joins
 
Stratosphere with big_data_analytics
Stratosphere with big_data_analyticsStratosphere with big_data_analytics
Stratosphere with big_data_analytics
 
Hadoop Map Reduce
Hadoop Map ReduceHadoop Map Reduce
Hadoop Map Reduce
 
Skyline Query Processing using Filtering in Distributed Environment
Skyline Query Processing using Filtering in Distributed EnvironmentSkyline Query Processing using Filtering in Distributed Environment
Skyline Query Processing using Filtering in Distributed Environment
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
Applying stratosphere for big data analytics
Applying stratosphere for big data analyticsApplying stratosphere for big data analytics
Applying stratosphere for big data analytics
 
Chap3 slides
Chap3 slidesChap3 slides
Chap3 slides
 
Ieeepro techno solutions ieee java project - budget-driven scheduling algor...
Ieeepro techno solutions   ieee java project - budget-driven scheduling algor...Ieeepro techno solutions   ieee java project - budget-driven scheduling algor...
Ieeepro techno solutions ieee java project - budget-driven scheduling algor...
 
Map Reduce Online
Map Reduce OnlineMap Reduce Online
Map Reduce Online
 
Sawmill - Integrating R and Large Data Clouds
Sawmill - Integrating R and Large Data CloudsSawmill - Integrating R and Large Data Clouds
Sawmill - Integrating R and Large Data Clouds
 

Viewers also liked (6)

Trec2009blog overview v9
Trec2009blog overview v9Trec2009blog overview v9
Trec2009blog overview v9
 
Content Personalization - Why it Helps
Content Personalization - Why it HelpsContent Personalization - Why it Helps
Content Personalization - Why it Helps
 
News Article Ranking : Leveraging the Wisdom of Bloggers
News Article Ranking : Leveraging the Wisdom of BloggersNews Article Ranking : Leveraging the Wisdom of Bloggers
News Article Ranking : Leveraging the Wisdom of Bloggers
 
Christmas Gifts 2010
Christmas Gifts 2010Christmas Gifts 2010
Christmas Gifts 2010
 
Crowdsourcing a News Query Classification Dataset
Crowdsourcing a News Query Classification DatasetCrowdsourcing a News Query Classification Dataset
Crowdsourcing a News Query Classification Dataset
 
Hype vs. Reality: The AI Explainer
Hype vs. Reality: The AI ExplainerHype vs. Reality: The AI Explainer
Hype vs. Reality: The AI Explainer
 

Similar to Comparing Distributed Indexing To Mapreduce or Not?

HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop ClustersHDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
Xiao Qin
 
My mapreduce1 presentation
My mapreduce1 presentationMy mapreduce1 presentation
My mapreduce1 presentation
Noha Elprince
 
2004 map reduce simplied data processing on large clusters (mapreduce)
2004 map reduce simplied data processing on large clusters (mapreduce)2004 map reduce simplied data processing on large clusters (mapreduce)
2004 map reduce simplied data processing on large clusters (mapreduce)
anh tuan
 
Hadoop Mapreduce Performance Enhancement Using In-Node Combiners
Hadoop Mapreduce Performance Enhancement Using In-Node CombinersHadoop Mapreduce Performance Enhancement Using In-Node Combiners
Hadoop Mapreduce Performance Enhancement Using In-Node Combiners
ijcsit
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
msgroner
 
Spatial Data Integrator - Software Presentation and Use Cases
Spatial Data Integrator - Software Presentation and Use CasesSpatial Data Integrator - Software Presentation and Use Cases
Spatial Data Integrator - Software Presentation and Use Cases
mathieuraj
 
Download It
Download ItDownload It
Download It
butest
 

Similar to Comparing Distributed Indexing To Mapreduce or Not? (20)

Map reduce presentation
Map reduce presentationMap reduce presentation
Map reduce presentation
 
Big data & Hadoop
Big data & HadoopBig data & Hadoop
Big data & Hadoop
 
Mapreduce script
Mapreduce scriptMapreduce script
Mapreduce script
 
E031201032036
E031201032036E031201032036
E031201032036
 
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop ClustersHDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
 
Big Data processing with Apache Spark
Big Data processing with Apache SparkBig Data processing with Apache Spark
Big Data processing with Apache Spark
 
My mapreduce1 presentation
My mapreduce1 presentationMy mapreduce1 presentation
My mapreduce1 presentation
 
Map reduce
Map reduceMap reduce
Map reduce
 
2004 map reduce simplied data processing on large clusters (mapreduce)
2004 map reduce simplied data processing on large clusters (mapreduce)2004 map reduce simplied data processing on large clusters (mapreduce)
2004 map reduce simplied data processing on large clusters (mapreduce)
 
Hadoop Mapreduce Performance Enhancement Using In-Node Combiners
Hadoop Mapreduce Performance Enhancement Using In-Node CombinersHadoop Mapreduce Performance Enhancement Using In-Node Combiners
Hadoop Mapreduce Performance Enhancement Using In-Node Combiners
 
Join Algorithms in MapReduce
Join Algorithms in MapReduceJoin Algorithms in MapReduce
Join Algorithms in MapReduce
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
MapReduce
MapReduceMapReduce
MapReduce
 
Report Hadoop Map Reduce
Report Hadoop Map ReduceReport Hadoop Map Reduce
Report Hadoop Map Reduce
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
Spatial Data Integrator - Software Presentation and Use Cases
Spatial Data Integrator - Software Presentation and Use CasesSpatial Data Integrator - Software Presentation and Use Cases
Spatial Data Integrator - Software Presentation and Use Cases
 
Lecture 1 mapreduce
Lecture 1  mapreduceLecture 1  mapreduce
Lecture 1 mapreduce
 
Mapreduce2008 cacm
Mapreduce2008 cacmMapreduce2008 cacm
Mapreduce2008 cacm
 
A data aware caching 2415
A data aware caching 2415A data aware caching 2415
A data aware caching 2415
 
Download It
Download ItDownload It
Download It
 

Recently uploaded

Recently uploaded (20)

08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 

Comparing Distributed Indexing To Mapreduce or Not?

  • 1. Comparing Distributed Indexing: To Mapreduce or Not? Richard McCreadie Craig Macdonald Iadh Ounis
  • 2.
  • 3.
  • 4.
  • 5.
  • 6.
  • 7.
  • 8.
  • 9.
  • 10.
  • 11.
  • 12.
  • 13.
  • 14.
  • 15.
  • 16.
  • 17.
  • 18.
  • 19.
  • 20.
  • 21.
  • 22.
  • 23.
  • 24.
  • 25.
  • 26.
  • 27.
  • 29.
  • 30.

Editor's Notes

  1. © Terrier Development Team, University of Glasgow, 2005
  2. © Terrier Development Team, University of Glasgow, 2005