SlideShare a Scribd company logo
1 of 9
Download to read offline
Exercise
Similarity of Documents 
•Simple inner product 
•Cosine similarity 
•Term weights 
–Standard problem in IR 
–tf-idf, BM25, etc. 
•A document d is represented as a vector Wdof term weights wt,d, which indicate the importance of each term t in the document 
di 
dj
Trivial Solution 
•A trivial solution is to take each vector and compute its similarity (dot prod) with every other vector in the collection, which means that we will load each vector o(n) times and load each term o(dft2)times 
•This will work for small collection, but our goal is to have a scalable and efficient sol for large collections
•Load weights for each term once 
•Each term contributes o(dft2)partial scores 
•Allows efficiency tricks 
Each term contributes only if appears in 
Better Solution
mapindexreduce 
Better Solution: Map-Reduce
Solution –parts we have done 
•An efficient solution to the pair wise document similarity problem, expressed as two separate MapReduce jobs 
•Indexing: 
–We build a standard inverted index (Frakesand Baeza-Yates, 1992), where each term is associated with a list of docid’sfor documents that contain it and the associated term weight. 
–Use tf-idfto compute the weights. 
–Mapping over all documents, the mapper, for each term in the document, emits the term as the key, and a tupleconsisting of the docidand term weight as the value. 
–The MapReduce runtime automatically handles the grouping of these tuples, which the reducer then writes out to disk, thus generating the postings.
Exercise -tasks 
•Task 1 
–Write map-reduce pseudo-code to compute sim(di,dj) based on the idea described in previous page. 
–Given the tf-idfcode 
–Given the inverted index computation code 
–Write the similarity code to compute sim(di,dj) based on the pseudo-code designed above. 
–Execute code on the provided data set, and test with the provided testing code. 
•Group / individual presentation (60 min)
Solution 
•An efficient solution to the pair wise document similarity problem, expressed as two separate MapReduce jobs 
•Indexing: 
–We build a standard inverted index (Frakesand Baeza-Yates, 1992), where each term is associated with a list of docid’sfor documents that contain it and the associated term weight. 
–Mapping over all documents, the mapper, for each term in the document, emits the term as the key, and a tupleconsisting of the docidand term weight as the value. 
–The MapReduce runtime automatically handles the grouping of these tuples, which the reducer then writes out to disk, thus generating the postings. 
•PairwiseSimilarity: 
–Mapping over each posting, the mappergenerates key tuplescorresponding to pairs of docidsin the postings. 
–These key tuplesare associated with the product of the corresponding term weights 
–They represent the individual term contributions to the final inner product. 
–The MapReduce runtime sorts the tuplesand then the reducer sums all the individual score contributions for a pair to generate the final similarity score.
End of session 
Day –2: Exercise

More Related Content

What's hot

Cupdf.com introduction to-data-structures-and-algorithm
Cupdf.com introduction to-data-structures-and-algorithmCupdf.com introduction to-data-structures-and-algorithm
Cupdf.com introduction to-data-structures-and-algorithmTarikuDabala1
 
Mapreduce advanced
Mapreduce advancedMapreduce advanced
Mapreduce advancedChirag Ahuja
 
Mapreduce by examples
Mapreduce by examplesMapreduce by examples
Mapreduce by examplesAndrea Iacono
 
Data structure and algorithm
Data structure and algorithmData structure and algorithm
Data structure and algorithmTrupti Agrawal
 
Dynamic Memory & Linked Lists
Dynamic Memory & Linked ListsDynamic Memory & Linked Lists
Dynamic Memory & Linked ListsAfaq Mansoor Khan
 
Introduction to Map-Reduce
Introduction to Map-ReduceIntroduction to Map-Reduce
Introduction to Map-ReduceBrendan Tierney
 
Scalable Algorithm Design with MapReduce
Scalable Algorithm Design with MapReduceScalable Algorithm Design with MapReduce
Scalable Algorithm Design with MapReducePietro Michiardi
 
The Pregel Programming Model with Spark GraphX
The Pregel Programming Model with Spark GraphXThe Pregel Programming Model with Spark GraphX
The Pregel Programming Model with Spark GraphXAndrea Iacono
 
Map Reduce
Map ReduceMap Reduce
Map Reduceschapht
 
Map Reduce data types and formats
Map Reduce data types and formatsMap Reduce data types and formats
Map Reduce data types and formatsVigen Sahakyan
 
Topic 6: MapReduce Applications
Topic 6: MapReduce ApplicationsTopic 6: MapReduce Applications
Topic 6: MapReduce ApplicationsZubair Nabi
 
Introduction to data structures and Algorithm
Introduction to data structures and AlgorithmIntroduction to data structures and Algorithm
Introduction to data structures and AlgorithmDhaval Kaneria
 
Data Analytics and Simulation in Parallel with MATLAB*
Data Analytics and Simulation in Parallel with MATLAB*Data Analytics and Simulation in Parallel with MATLAB*
Data Analytics and Simulation in Parallel with MATLAB*Intel® Software
 

What's hot (20)

Cupdf.com introduction to-data-structures-and-algorithm
Cupdf.com introduction to-data-structures-and-algorithmCupdf.com introduction to-data-structures-and-algorithm
Cupdf.com introduction to-data-structures-and-algorithm
 
Mapreduce advanced
Mapreduce advancedMapreduce advanced
Mapreduce advanced
 
Hadoop Map Reduce Arch
Hadoop Map Reduce ArchHadoop Map Reduce Arch
Hadoop Map Reduce Arch
 
Mapreduce by examples
Mapreduce by examplesMapreduce by examples
Mapreduce by examples
 
Chapter 10 ds
Chapter 10 dsChapter 10 ds
Chapter 10 ds
 
Data structure and algorithm
Data structure and algorithmData structure and algorithm
Data structure and algorithm
 
Dynamic Memory & Linked Lists
Dynamic Memory & Linked ListsDynamic Memory & Linked Lists
Dynamic Memory & Linked Lists
 
R language introduction
R language introductionR language introduction
R language introduction
 
Introduction to Map-Reduce
Introduction to Map-ReduceIntroduction to Map-Reduce
Introduction to Map-Reduce
 
Scalable Algorithm Design with MapReduce
Scalable Algorithm Design with MapReduceScalable Algorithm Design with MapReduce
Scalable Algorithm Design with MapReduce
 
Hadoop job chaining
Hadoop job chainingHadoop job chaining
Hadoop job chaining
 
Unit 3 lecture-2
Unit 3 lecture-2Unit 3 lecture-2
Unit 3 lecture-2
 
An Introduction To Map-Reduce
An Introduction To Map-ReduceAn Introduction To Map-Reduce
An Introduction To Map-Reduce
 
The Pregel Programming Model with Spark GraphX
The Pregel Programming Model with Spark GraphXThe Pregel Programming Model with Spark GraphX
The Pregel Programming Model with Spark GraphX
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
Interpreting the Data:Parallel Analysis with Sawzall
Interpreting the Data:Parallel Analysis with SawzallInterpreting the Data:Parallel Analysis with Sawzall
Interpreting the Data:Parallel Analysis with Sawzall
 
Map Reduce data types and formats
Map Reduce data types and formatsMap Reduce data types and formats
Map Reduce data types and formats
 
Topic 6: MapReduce Applications
Topic 6: MapReduce ApplicationsTopic 6: MapReduce Applications
Topic 6: MapReduce Applications
 
Introduction to data structures and Algorithm
Introduction to data structures and AlgorithmIntroduction to data structures and Algorithm
Introduction to data structures and Algorithm
 
Data Analytics and Simulation in Parallel with MATLAB*
Data Analytics and Simulation in Parallel with MATLAB*Data Analytics and Simulation in Parallel with MATLAB*
Data Analytics and Simulation in Parallel with MATLAB*
 

Similar to Hadoop exercise

Hadoop first mr job - inverted index construction
Hadoop first mr job - inverted index constructionHadoop first mr job - inverted index construction
Hadoop first mr job - inverted index constructionSubhas Kumar Ghosh
 
Map reduce
Map reduceMap reduce
Map reducexydii
 
Map reduceoriginalpaper mandatoryreading
Map reduceoriginalpaper mandatoryreadingMap reduceoriginalpaper mandatoryreading
Map reduceoriginalpaper mandatoryreadingcoolmirza143
 
15_NEW-2020-ATTENTION-ENC-DEC-TRANSFORMERS-Lect15.pptx
15_NEW-2020-ATTENTION-ENC-DEC-TRANSFORMERS-Lect15.pptx15_NEW-2020-ATTENTION-ENC-DEC-TRANSFORMERS-Lect15.pptx
15_NEW-2020-ATTENTION-ENC-DEC-TRANSFORMERS-Lect15.pptxNibrasulIslam
 
AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)Paul Chao
 
WAND Top-k Retrieval
WAND Top-k RetrievalWAND Top-k Retrieval
WAND Top-k RetrievalAndrew Zhang
 
MapReduceAlgorithms.ppt
MapReduceAlgorithms.pptMapReduceAlgorithms.ppt
MapReduceAlgorithms.pptCheeWeiTan10
 
Spark 4th Meetup Londond - Building a Product with Spark
Spark 4th Meetup Londond - Building a Product with SparkSpark 4th Meetup Londond - Building a Product with Spark
Spark 4th Meetup Londond - Building a Product with Sparksamthemonad
 
Hadoop eco system with mapreduce hive and pig
Hadoop eco system with mapreduce hive and pigHadoop eco system with mapreduce hive and pig
Hadoop eco system with mapreduce hive and pigKhanKhaja1
 
Dynamic programming
Dynamic programmingDynamic programming
Dynamic programmingJay Nagar
 
Query Optimization - Brandon Latronica
Query Optimization - Brandon LatronicaQuery Optimization - Brandon Latronica
Query Optimization - Brandon Latronica"FENG "GEORGE"" YU
 

Similar to Hadoop exercise (20)

Unit-2 Hadoop Framework.pdf
Unit-2 Hadoop Framework.pdfUnit-2 Hadoop Framework.pdf
Unit-2 Hadoop Framework.pdf
 
Unit-2 Hadoop Framework.pdf
Unit-2 Hadoop Framework.pdfUnit-2 Hadoop Framework.pdf
Unit-2 Hadoop Framework.pdf
 
Hadoop first mr job - inverted index construction
Hadoop first mr job - inverted index constructionHadoop first mr job - inverted index construction
Hadoop first mr job - inverted index construction
 
Lec1
Lec1Lec1
Lec1
 
MapReduce basics
MapReduce basicsMapReduce basics
MapReduce basics
 
Ir models
Ir modelsIr models
Ir models
 
ECE 565 FInal Project
ECE 565 FInal ProjectECE 565 FInal Project
ECE 565 FInal Project
 
Map reduce
Map reduceMap reduce
Map reduce
 
Map reduceoriginalpaper mandatoryreading
Map reduceoriginalpaper mandatoryreadingMap reduceoriginalpaper mandatoryreading
Map reduceoriginalpaper mandatoryreading
 
15_NEW-2020-ATTENTION-ENC-DEC-TRANSFORMERS-Lect15.pptx
15_NEW-2020-ATTENTION-ENC-DEC-TRANSFORMERS-Lect15.pptx15_NEW-2020-ATTENTION-ENC-DEC-TRANSFORMERS-Lect15.pptx
15_NEW-2020-ATTENTION-ENC-DEC-TRANSFORMERS-Lect15.pptx
 
AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)
 
WAND Top-k Retrieval
WAND Top-k RetrievalWAND Top-k Retrieval
WAND Top-k Retrieval
 
Introduction to MapReduce
Introduction to MapReduceIntroduction to MapReduce
Introduction to MapReduce
 
BDS_QA.pdf
BDS_QA.pdfBDS_QA.pdf
BDS_QA.pdf
 
Dynamic programming
Dynamic programmingDynamic programming
Dynamic programming
 
MapReduceAlgorithms.ppt
MapReduceAlgorithms.pptMapReduceAlgorithms.ppt
MapReduceAlgorithms.ppt
 
Spark 4th Meetup Londond - Building a Product with Spark
Spark 4th Meetup Londond - Building a Product with SparkSpark 4th Meetup Londond - Building a Product with Spark
Spark 4th Meetup Londond - Building a Product with Spark
 
Hadoop eco system with mapreduce hive and pig
Hadoop eco system with mapreduce hive and pigHadoop eco system with mapreduce hive and pig
Hadoop eco system with mapreduce hive and pig
 
Dynamic programming
Dynamic programmingDynamic programming
Dynamic programming
 
Query Optimization - Brandon Latronica
Query Optimization - Brandon LatronicaQuery Optimization - Brandon Latronica
Query Optimization - Brandon Latronica
 

More from Subhas Kumar Ghosh

07 logistic regression and stochastic gradient descent
07 logistic regression and stochastic gradient descent07 logistic regression and stochastic gradient descent
07 logistic regression and stochastic gradient descentSubhas Kumar Ghosh
 
06 how to write a map reduce version of k-means clustering
06 how to write a map reduce version of k-means clustering06 how to write a map reduce version of k-means clustering
06 how to write a map reduce version of k-means clusteringSubhas Kumar Ghosh
 
02 data warehouse applications with hive
02 data warehouse applications with hive02 data warehouse applications with hive
02 data warehouse applications with hiveSubhas Kumar Ghosh
 
05 pig user defined functions (udfs)
05 pig user defined functions (udfs)05 pig user defined functions (udfs)
05 pig user defined functions (udfs)Subhas Kumar Ghosh
 
02 naive bays classifier and sentiment analysis
02 naive bays classifier and sentiment analysis02 naive bays classifier and sentiment analysis
02 naive bays classifier and sentiment analysisSubhas Kumar Ghosh
 
Hadoop performance optimization tips
Hadoop performance optimization tipsHadoop performance optimization tips
Hadoop performance optimization tipsSubhas Kumar Ghosh
 
Hadoop secondary sort and a custom comparator
Hadoop secondary sort and a custom comparatorHadoop secondary sort and a custom comparator
Hadoop secondary sort and a custom comparatorSubhas Kumar Ghosh
 
Hadoop deconstructing map reduce job step by step
Hadoop deconstructing map reduce job step by stepHadoop deconstructing map reduce job step by step
Hadoop deconstructing map reduce job step by stepSubhas Kumar Ghosh
 
02 Hadoop deployment and configuration
02 Hadoop deployment and configuration02 Hadoop deployment and configuration
02 Hadoop deployment and configurationSubhas Kumar Ghosh
 

More from Subhas Kumar Ghosh (20)

07 logistic regression and stochastic gradient descent
07 logistic regression and stochastic gradient descent07 logistic regression and stochastic gradient descent
07 logistic regression and stochastic gradient descent
 
06 how to write a map reduce version of k-means clustering
06 how to write a map reduce version of k-means clustering06 how to write a map reduce version of k-means clustering
06 how to write a map reduce version of k-means clustering
 
05 k-means clustering
05 k-means clustering05 k-means clustering
05 k-means clustering
 
03 hive query language (hql)
03 hive query language (hql)03 hive query language (hql)
03 hive query language (hql)
 
02 data warehouse applications with hive
02 data warehouse applications with hive02 data warehouse applications with hive
02 data warehouse applications with hive
 
01 hbase
01 hbase01 hbase
01 hbase
 
06 pig etl features
06 pig etl features06 pig etl features
06 pig etl features
 
05 pig user defined functions (udfs)
05 pig user defined functions (udfs)05 pig user defined functions (udfs)
05 pig user defined functions (udfs)
 
03 pig intro
03 pig intro03 pig intro
03 pig intro
 
02 naive bays classifier and sentiment analysis
02 naive bays classifier and sentiment analysis02 naive bays classifier and sentiment analysis
02 naive bays classifier and sentiment analysis
 
Hadoop performance optimization tips
Hadoop performance optimization tipsHadoop performance optimization tips
Hadoop performance optimization tips
 
Hadoop Day 3
Hadoop Day 3Hadoop Day 3
Hadoop Day 3
 
Hadoop map reduce v2
Hadoop map reduce v2Hadoop map reduce v2
Hadoop map reduce v2
 
Hadoop secondary sort and a custom comparator
Hadoop secondary sort and a custom comparatorHadoop secondary sort and a custom comparator
Hadoop secondary sort and a custom comparator
 
Hadoop deconstructing map reduce job step by step
Hadoop deconstructing map reduce job step by stepHadoop deconstructing map reduce job step by step
Hadoop deconstructing map reduce job step by step
 
Hadoop availability
Hadoop availabilityHadoop availability
Hadoop availability
 
Hadoop scheduler
Hadoop schedulerHadoop scheduler
Hadoop scheduler
 
Hadoop data management
Hadoop data managementHadoop data management
Hadoop data management
 
02 Hadoop deployment and configuration
02 Hadoop deployment and configuration02 Hadoop deployment and configuration
02 Hadoop deployment and configuration
 
Hadoop introduction
Hadoop introductionHadoop introduction
Hadoop introduction
 

Recently uploaded

Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...ZurliaSoop
 
怎样办理旧金山城市学院毕业证(CCSF毕业证书)成绩单学校原版复制
怎样办理旧金山城市学院毕业证(CCSF毕业证书)成绩单学校原版复制怎样办理旧金山城市学院毕业证(CCSF毕业证书)成绩单学校原版复制
怎样办理旧金山城市学院毕业证(CCSF毕业证书)成绩单学校原版复制vexqp
 
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...Health
 
Harnessing the Power of GenAI for BI and Reporting.pptx
Harnessing the Power of GenAI for BI and Reporting.pptxHarnessing the Power of GenAI for BI and Reporting.pptx
Harnessing the Power of GenAI for BI and Reporting.pptxParas Gupta
 
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...Elaine Werffeli
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Researchmichael115558
 
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...Bertram Ludäscher
 
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...nirzagarg
 
Data Analyst Tasks to do the internship.pdf
Data Analyst Tasks to do the internship.pdfData Analyst Tasks to do the internship.pdf
Data Analyst Tasks to do the internship.pdftheeltifs
 
Dubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls DubaiDubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls Dubaikojalkojal131
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Valters Lauzums
 
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Klinik kandungan
 
PLE-statistics document for primary schs
PLE-statistics document for primary schsPLE-statistics document for primary schs
PLE-statistics document for primary schscnajjemba
 
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...gajnagarg
 
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...nirzagarg
 
怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制
怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制
怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制vexqp
 
Ranking and Scoring Exercises for Research
Ranking and Scoring Exercises for ResearchRanking and Scoring Exercises for Research
Ranking and Scoring Exercises for ResearchRajesh Mondal
 
Capstone in Interprofessional Informatic // IMPACT OF COVID 19 ON EDUCATION
Capstone in Interprofessional Informatic  // IMPACT OF COVID 19 ON EDUCATIONCapstone in Interprofessional Informatic  // IMPACT OF COVID 19 ON EDUCATION
Capstone in Interprofessional Informatic // IMPACT OF COVID 19 ON EDUCATIONLakpaYanziSherpa
 

Recently uploaded (20)

Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
怎样办理旧金山城市学院毕业证(CCSF毕业证书)成绩单学校原版复制
怎样办理旧金山城市学院毕业证(CCSF毕业证书)成绩单学校原版复制怎样办理旧金山城市学院毕业证(CCSF毕业证书)成绩单学校原版复制
怎样办理旧金山城市学院毕业证(CCSF毕业证书)成绩单学校原版复制
 
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
 
Harnessing the Power of GenAI for BI and Reporting.pptx
Harnessing the Power of GenAI for BI and Reporting.pptxHarnessing the Power of GenAI for BI and Reporting.pptx
Harnessing the Power of GenAI for BI and Reporting.pptx
 
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
 
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
 
Data Analyst Tasks to do the internship.pdf
Data Analyst Tasks to do the internship.pdfData Analyst Tasks to do the internship.pdf
Data Analyst Tasks to do the internship.pdf
 
Dubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls DubaiDubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls Dubai
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
Cytotec in Jeddah+966572737505) get unwanted pregnancy kit Riyadh
Cytotec in Jeddah+966572737505) get unwanted pregnancy kit RiyadhCytotec in Jeddah+966572737505) get unwanted pregnancy kit Riyadh
Cytotec in Jeddah+966572737505) get unwanted pregnancy kit Riyadh
 
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
 
PLE-statistics document for primary schs
PLE-statistics document for primary schsPLE-statistics document for primary schs
PLE-statistics document for primary schs
 
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
 
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
 
怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制
怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制
怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
Ranking and Scoring Exercises for Research
Ranking and Scoring Exercises for ResearchRanking and Scoring Exercises for Research
Ranking and Scoring Exercises for Research
 
Capstone in Interprofessional Informatic // IMPACT OF COVID 19 ON EDUCATION
Capstone in Interprofessional Informatic  // IMPACT OF COVID 19 ON EDUCATIONCapstone in Interprofessional Informatic  // IMPACT OF COVID 19 ON EDUCATION
Capstone in Interprofessional Informatic // IMPACT OF COVID 19 ON EDUCATION
 

Hadoop exercise

  • 2. Similarity of Documents •Simple inner product •Cosine similarity •Term weights –Standard problem in IR –tf-idf, BM25, etc. •A document d is represented as a vector Wdof term weights wt,d, which indicate the importance of each term t in the document di dj
  • 3. Trivial Solution •A trivial solution is to take each vector and compute its similarity (dot prod) with every other vector in the collection, which means that we will load each vector o(n) times and load each term o(dft2)times •This will work for small collection, but our goal is to have a scalable and efficient sol for large collections
  • 4. •Load weights for each term once •Each term contributes o(dft2)partial scores •Allows efficiency tricks Each term contributes only if appears in Better Solution
  • 6. Solution –parts we have done •An efficient solution to the pair wise document similarity problem, expressed as two separate MapReduce jobs •Indexing: –We build a standard inverted index (Frakesand Baeza-Yates, 1992), where each term is associated with a list of docid’sfor documents that contain it and the associated term weight. –Use tf-idfto compute the weights. –Mapping over all documents, the mapper, for each term in the document, emits the term as the key, and a tupleconsisting of the docidand term weight as the value. –The MapReduce runtime automatically handles the grouping of these tuples, which the reducer then writes out to disk, thus generating the postings.
  • 7. Exercise -tasks •Task 1 –Write map-reduce pseudo-code to compute sim(di,dj) based on the idea described in previous page. –Given the tf-idfcode –Given the inverted index computation code –Write the similarity code to compute sim(di,dj) based on the pseudo-code designed above. –Execute code on the provided data set, and test with the provided testing code. •Group / individual presentation (60 min)
  • 8. Solution •An efficient solution to the pair wise document similarity problem, expressed as two separate MapReduce jobs •Indexing: –We build a standard inverted index (Frakesand Baeza-Yates, 1992), where each term is associated with a list of docid’sfor documents that contain it and the associated term weight. –Mapping over all documents, the mapper, for each term in the document, emits the term as the key, and a tupleconsisting of the docidand term weight as the value. –The MapReduce runtime automatically handles the grouping of these tuples, which the reducer then writes out to disk, thus generating the postings. •PairwiseSimilarity: –Mapping over each posting, the mappergenerates key tuplescorresponding to pairs of docidsin the postings. –These key tuplesare associated with the product of the corresponding term weights –They represent the individual term contributions to the final inner product. –The MapReduce runtime sorts the tuplesand then the reducer sums all the individual score contributions for a pair to generate the final similarity score.
  • 9. End of session Day –2: Exercise