This document provides an overview of automatic spelling correction techniques. It discusses using character n-grams to represent words and generate spelling correction candidates based on edit distance. An information retrieval model is used where the misspelled word is the query and candidate corrections are documents. Inverted indexing is discussed as a way to implement this model using MapReduce. Reducers concatenate document IDs for each term to build the final inverted index, with optimizations needed to handle large document frequencies.
Lecture 6: Data-Intensive Computing for Text Analysis (Fall 2011)
1. Data-Intensive Computing for Text Analysis
CS395T / INF385T / LIN386M
University of Texas at Austin, Fall 2011
Lecture 6
September 29, 2011
Jason Baldridge, Department of Linguistics, University of Texas at Austin (jasonbaldridge at gmail dot com)
Matt Lease, School of Information, University of Texas at Austin (ml at ischool dot utexas dot edu)
2. Acknowledgments
Course design and slides based on Jimmy Lin’s cloud computing courses at the University of Maryland, College Park
Some figures courtesy of the following excellent Hadoop books (order yours today!)
• Chuck Lam’s Hadoop In Action (2010)
• Tom White’s Hadoop: The Definitive Guide, 2nd Edition (2010)
3. Today’s Agenda
• Automatic Spelling Correction
– Review: Information Retrieval (IR)
• Boolean Search
• Vector Space Modeling
• Inverted Indexing in MapReduce
– Probabilistic modeling via noisy channel
• Index Compression
– Order inversion in MapReduce
• In-class exercise
• Hadoop: Pipelined & Chained jobs
5. Automatic Spelling Correction
Three main stages
Error detection
Candidate generation
Candidate ranking / choose best candidate
Usage cases
Flagging possible misspellings / spell checker
Suggesting possible corrections
Automatically correcting (inferred) misspellings
• “as you type” correction
• web queries
• real-time closed captioning
• …
6. Types of spelling errors
Unknown words: “She is their favorite acress in town.”
Can be identified using a dictionary…
…but could be a valid word not in the dictionary
Dictionary could be automatically constructed from large corpora
• Filter out rare words (misspellings, or valid but unlikely)…
• Why filter out rare words that are valid?
Unknown words violating phonotactics:
e.g. “There isn’t enough room in this tonw for the both of us.”
Given dictionary, could automatically construct “n-gram dictionary”
of all character n-grams known in the language
• e.g. English words don’t end with “nw”, so flag tonw
Incorrect homophone: “She drove their.”
Valid word, wrong usage; infer appropriateness from context
Typing errors reflecting kayout of leyboard
7. Candidate generation
How to generate possible corrections for acress?
Inspiration: how do people do it?
People may suggest words like actress, across, access, acres,
caress, and cress – what do these have in common?
What about “blam” and “zigzag”?
Two standard strategies for candidate generation
Minimum edit distance
• Generate all candidates within 1+ edit step(s)
• Possible edit operations: insertion, deletion, substitution, transposition, …
• Filter through a dictionary
• See Peter Norvig’s post: http://norvig.com/spell-correct.html
Character ngrams: see next slide…
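As a concrete illustration of the minimum-edit-distance strategy above, here is a minimal Java sketch in the spirit of Norvig's post; the dictionary set and class name are illustrative, not from the lecture.

import java.util.HashSet;
import java.util.Set;

public class EditDistanceCandidates {
    private static final String ALPHABET = "abcdefghijklmnopqrstuvwxyz";

    // All strings within one insertion, deletion, substitution, or transposition
    // of the typo that also appear in the dictionary.
    public static Set<String> candidates(String typo, Set<String> dictionary) {
        Set<String> edits = new HashSet<>();
        for (int i = 0; i <= typo.length(); i++) {
            String left = typo.substring(0, i);
            String right = typo.substring(i);
            if (!right.isEmpty()) {
                edits.add(left + right.substring(1));                        // deletion
                if (right.length() > 1) {                                    // transposition
                    edits.add(left + right.charAt(1) + right.charAt(0) + right.substring(2));
                }
            }
            for (char c : ALPHABET.toCharArray()) {
                if (!right.isEmpty()) {
                    edits.add(left + c + right.substring(1));                // substitution
                }
                edits.add(left + c + right);                                 // insertion
            }
        }
        edits.retainAll(dictionary);  // filter through the dictionary
        return edits;
    }
}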
8. Character ngram Spelling Correction
Information Retrieval (IR) model
Query=typo word
Document collection = dictionary (i.e. set of valid words)
Representation: word is set of character ngrams
Let’s use n=3 (trigram), with # to mark word start/end
Examples
across: [#ac, acr, cro, ros, oss, ss#]
acress: [#ac, acr, cre, res, ess, ss#]
actress: [#ac, act, ctr, tre, res, ess, ss#]
blam: [#bl, bla, lam, am#]
mississippi: [#mi, mis, iss, ssi, sis, sip, ipp, ppi, pi#]
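A small helper (illustrative, not from the slides) that produces the boundary-marked character n-grams shown above; counting duplicates instead of collecting a list gives the weighted variant used later for ranking.

import java.util.ArrayList;
import java.util.List;

public class CharNgrams {
    // Character n-grams of a word, with '#' marking the word start and end.
    // ngrams("across", 3) -> [#ac, acr, cro, ros, oss, ss#]
    public static List<String> ngrams(String word, int n) {
        String padded = "#" + word + "#";
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= padded.length(); i++) {
            grams.add(padded.substring(i, i + n));
        }
        return grams;
    }
}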
Uhm, IR model???
Review…
9. Abstract IR Architecture
[Diagram: online path: Query → Representation Function → Query Representation; offline path: Documents → Representation Function → Document Representation → Index; a Comparison Function matches the query representation against the index to produce Results]
10. Document Boolean Representation
Example document: a news story beginning "McDonald's slims down spuds. Fast-food chain to reduce certain types of fat in its french fries with new cooking oil. NEW YORK (CNN/Money) - McDonald's Corp. is cutting the amount of 'bad' fat in its french fries nearly in half, the fast-food chain said Tuesday …"
"Bag of Words" representation (order and grammar discarded, only the terms kept): McDonalds, fat, fries, new, french, Company, said, nutrition, …
12. Inverted Index: Boolean Retrieval
Doc 1: one fish, two fish   Doc 2: red fish, blue fish   Doc 3: cat in the hat   Doc 4: green eggs and ham
Postings (term → docIDs containing it):
blue → [2]
cat → [3]
egg → [4]
fish → [1, 2]
green → [4]
ham → [4]
hat → [3]
one → [1]
red → [2]
two → [1]
13. Inverted Indexing via MapReduce
Doc 1: one fish, two fish   Doc 2: red fish, blue fish   Doc 3: cat in the hat
Map output: Doc 1 → (one, 1), (two, 1), (fish, 1); Doc 2 → (red, 2), (blue, 2), (fish, 2); Doc 3 → (cat, 3), (hat, 3)
Shuffle and Sort: aggregate values by keys
Reduce output: blue → [2], cat → [3], fish → [1, 2], hat → [3], one → [1], red → [2], two → [1]
14. Inverted Indexing in MapReduce
1: class Mapper
2: procedure Map(docid n; doc d)
3: H = new Set
4: for all term t in doc d do
5: H.add(t)
6: for all term t in H do
7: Emit(term t, n)
1: class Reducer
2: procedure Reduce(term t; Iterator<integer> docids [n1, n2, …])
3: List P = docids.values()
4: Emit(term t; P)
15. Scalability Bottleneck
Desired output format: <term, [doc1, doc2, …]>
Just emitting each <term, docID> pair won’t produce this
How to produce this without buffering?
Side-effect: write directly to HDFS instead of emitting
Complications?
• Persistent data must be cleaned up if reducer restarted…
16. Using the Inverted Index
Boolean Retrieval: to execute a Boolean query
Build query syntax tree: ( blue AND fish ) OR ham
[Tree: OR at the root, with children ham and an AND node over blue and fish]
For each clause, look up postings:
blue → [2]
fish → [1, 2]
Traverse postings and apply Boolean operator
Efficiency analysis
Start with shortest posting first
Postings traversal is linear (if postings are sorted)
• Oops… we didn’t actually do this in building our index…
17. Inverted Indexing in MapReduce
1: class Mapper
2: procedure Map(docid n; doc d)
3: H = new Set
4: for all term t in doc d do
5: H.add(t)
6: for all term t in H do
7: Emit(term t, n)
1: class Reducer
2: procedure Reduce(term t; Iterator<integer> docids [n1, n2, …])
3: List P = docids.values()
4: Emit(term t; P)
18. Inverted Indexing in MapReduce: try 2
1: class Mapper
2: procedure Map(docid n; doc d)
3: H = new Set
4: for all term t in doc d do
5: H.add(t)
6: for all term t in H do
7: Emit(term t, n)
1: class Reducer
2: procedure Reduce(term t; Iterator<integer> docids [n1, n2, …])
3: List P = docids.values()
4: Sort(P)
5: Emit(term t; P)                e.g. fish → [1, 2]
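For concreteness, here is a sketch of this algorithm against the old Hadoop API used later in the lecture, assuming the mapper receives (docid, document text) pairs and uses simple whitespace tokenization; class and method names are illustrative. The reducer still buffers and sorts all docIDs in memory, which is exactly the bottleneck discussed on the next slide.

import java.io.IOException;
import java.util.*;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class InvertedIndex {
    // Emit (term, docid) once per distinct term in the document.
    public static class IndexMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, LongWritable> {
        public void map(LongWritable docid, Text doc,
                        OutputCollector<Text, LongWritable> out, Reporter reporter)
                throws IOException {
            Set<String> seen = new HashSet<>();
            for (String term : doc.toString().toLowerCase().split("\\s+")) {
                if (!term.isEmpty() && seen.add(term)) {
                    out.collect(new Text(term), new LongWritable(docid.get()));
                }
            }
        }
    }

    // Buffer, sort, and emit the postings list for each term ("try 2" above).
    public static class IndexReducer extends MapReduceBase
            implements Reducer<Text, LongWritable, Text, Text> {
        public void reduce(Text term, Iterator<LongWritable> docids,
                           OutputCollector<Text, Text> out, Reporter reporter)
                throws IOException {
            List<Long> postings = new ArrayList<>();
            while (docids.hasNext()) {
                postings.add(docids.next().get());
            }
            Collections.sort(postings);
            out.collect(term, new Text(postings.toString()));
        }
    }
}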
19. (Another) Scalability Bottleneck
Reducer buffers all docIDs associated with a term (to sort them)
What if term occurs in many documents?
Secondary sorting
Use composite key
Partition function
Key Comparator
Side-effect: write directly to HDFS as before…
20. Inverted index for spelling correction
Like search, spelling correction must be fast
How can we quickly identify candidate corrections?
Inverted index (II): map each character ngram → list of all words containing it
#ac -> { act, across, actress, acquire, … }
acr -> { across, acrimony, macro, … }
cre -> { crest, acre, acres, … }
res -> { arrest, rest, rescue, restaurant, … }
ess -> { less, lesson, necessary, actress, … }
ss# -> { less, mess, moss, across, actress, … }
How do we build the inverted index in MapReduce?
21. Exercise
Write a MapReduce algorithm for creating an inverted
index for trigram spelling correction, given a corpus
22. Exercise
Write a MapReduce algorithm for creating an inverted
index for trigram spelling correction, given a corpus
Map(String docid, String text):
for each word w in text:
for each trigram t in w:
Emit(t, w)
Reduce(String trigram, Iterator<Text> values):
Emit(trigram, values.toSet)
Also other alternatives, e.g. in-mapper combining, pairs
Is MapReduce even necessary for this?
Dictionary vs. token frequency
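One answer to "Is MapReduce even necessary for this?": a dictionary-sized vocabulary easily fits in memory, so a single-machine index may be all that is needed. A minimal in-memory sketch, reusing the illustrative ngrams() helper from earlier:

import java.util.*;

public class TrigramIndex {
    // character trigram -> set of dictionary words containing it
    private final Map<String, Set<String>> index = new HashMap<>();

    public void add(String word) {
        for (String gram : CharNgrams.ngrams(word, 3)) {
            index.computeIfAbsent(gram, g -> new HashSet<>()).add(word);
        }
    }

    // Union of all words sharing at least one trigram with the typo (a Boolean OR query).
    public Set<String> candidates(String typo) {
        Set<String> result = new HashSet<>();
        for (String gram : CharNgrams.ngrams(typo, 3)) {
            result.addAll(index.getOrDefault(gram, Collections.emptySet()));
        }
        return result;
    }
}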
23. Spelling correction as Boolean search
Given inverted index, how to find set of possible corrections?
Compute union of all words indexed by any of its character ngrams
= Boolean search
• Query “acress” → “#ac OR acr OR cre OR res OR ess OR ss#”
Are all corrections equally likely / good?
24. Ranked Information Retrieval
Order documents by probability of relevance
Estimate relevance of each document to the query
Rank documents by relevance
How do we estimate relevance?
Vector space paradigm
Approximate relevance by vector similarity (e.g. cosine)
Represent queries and documents as vectors
Rank documents by vector similarity to the query
25. Vector Space Model
[Figure: documents d1–d5 shown as vectors in a term space with axes t1, t2, t3; θ and φ are the angles between document vectors]
Assumption: Documents that are “close” in vector space
“talk about” the same things
Retrieve documents based on how close the document
vector is to the query vector (i.e., similarity ~ “closeness”)
26. Similarity Metric
Use “angle” between the vectors
\cos(\theta) = \frac{d_j \cdot d_k}{\|d_j\| \, \|d_k\|}

sim(d_j, d_k) = \frac{d_j \cdot d_k}{\|d_j\| \, \|d_k\|} = \frac{\sum_{i=1}^{n} w_{i,j} \, w_{i,k}}{\sqrt{\sum_{i=1}^{n} w_{i,j}^2} \; \sqrt{\sum_{i=1}^{n} w_{i,k}^2}}
Given pre-normalized vectors, just compute inner product
sim(d_j, d_k) = d_j \cdot d_k = \sum_{i=1}^{n} w_{i,j} \, w_{i,k}
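A small sketch of the cosine computation above over sparse term-weight vectors (a Map from term to weight); purely illustrative.

import java.util.Map;

public class Cosine {
    public static double similarity(Map<String, Double> dj, Map<String, Double> dk) {
        double dot = 0.0, normJ = 0.0, normK = 0.0;
        for (Map.Entry<String, Double> e : dj.entrySet()) {
            dot += e.getValue() * dk.getOrDefault(e.getKey(), 0.0);  // shared terms only
            normJ += e.getValue() * e.getValue();
        }
        for (double w : dk.values()) {
            normK += w * w;
        }
        if (normJ == 0.0 || normK == 0.0) return 0.0;
        return dot / (Math.sqrt(normJ) * Math.sqrt(normK));
    }
}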
27. Boolean Character ngram correction
Boolean Information Retrieval (IR) model
Query=typo word
Document collection = dictionary (i.e. set of valid words)
Representation: word is set of character ngrams
Let’s use n=3 (trigram), with # to mark word start/end
Examples
across: [#ac, acr, cro, ros, oss, ss#]
acress: [#ac, acr, cre, res, ess, ss#]
actress: [#ac, act, ctr, tre, res, ess, ss#]
blam: [#bl, bla, lam, am#]
mississippi: [#mi, mis, iss, ssi, sis, sip, ipp, ppi, pi#]
28. Ranked Character ngram correction
Vector space Information Retrieval (IR) model
Query=typo word
Document collection = dictionary (i.e. set of valid words)
Representation: word is vector of character ngram value
Rank candidate corrections according to vector similarity (cosine)
Trigram Examples
across: [#ac, acr, cro, ros, oss, ss#]
acress: [#ac, acr, cre, res, ess, ss#]
actress: [#ac, act, ctr, tre, res, ess, ss#]
blam: [#bl, bla, lam, am#]
mississippi: [#mi, mis, (iss, 2), (ssi, 2), sis, sip, ipp, ppi, pi#]
29. Spelling Correction in Vector Space
[Figure: dictionary words shown as vectors in character-ngram space with axes t1, t2, t3; θ and φ are the angles between word vectors]
Assumption: Words that are “close together” in ngram
vector space have similar orthography
Therefore, retrieve words in the dictionary based on how
close the word is to the typo (i.e., similarity ~ “closeness”)
30. Ranked Character ngram correction
Vector space Information Retrieval (IR) model
Query=typo word
Document collection = dictionary (i.e. set of valid words)
Representation: word is vector of character ngram value
Rank candidate corrections according to vector similarity (cosine)
Trigram Examples
across: [#ac, acr, cro, ros, oss, ss#]
acress: [#ac, acr, cre, res, ess, ss#]
actress: [#ac, act, ctr, tre, res, ess, ss#]
blam: [#bl, bla, lam, am#]
mississippi: [#mi, mis, (iss, 2), (ssi, 2), sis, sip, ipp, ppi, pi#]
“value” here expresses relative importance of different
vector components for the similarity comparison
Use simple count here, what else might we do?
31. IR Term Weighting
Term weights consist of two components
Local: how important is the term in this document?
Global: how important is the term in the collection?
Here’s the intuition:
Terms that appear often in a document should get high weights
Terms that appear in many documents should get low weights
How do we capture this mathematically?
Term frequency (local)
Inverse document frequency (global)
32. TF.IDF Term Weighting
w_{i,j} = tf_{i,j} \cdot \log \frac{N}{n_i}

w_{i,j}: weight assigned to term i in document j
tf_{i,j}: number of occurrences of term i in document j
N: number of documents in the entire collection
n_i: number of documents containing term i
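The weight formula above, as a one-method illustrative sketch (names are not from the slides):

public class TfIdf {
    // w_ij = tf_ij * log(N / n_i); a document frequency of 0 means the term never occurs, so weight 0
    public static double weight(int tf, long numDocs, long docFreq) {
        if (docFreq == 0) return 0.0;
        return tf * Math.log((double) numDocs / docFreq);
    }
}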
33. Inverted Index: TF.IDF
Doc 1: one fish, two fish   Doc 2: red fish, blue fish   Doc 3: cat in the hat   Doc 4: green eggs and ham
Postings now carry term frequencies (term: df → [(docID, tf), …]):
blue: df 1 → [(2, 1)]
cat: df 1 → [(3, 1)]
egg: df 1 → [(4, 1)]
fish: df 2 → [(1, 2), (2, 2)]
green: df 1 → [(4, 1)]
ham: df 1 → [(4, 1)]
hat: df 1 → [(3, 1)]
one: df 1 → [(1, 1)]
red: df 1 → [(2, 1)]
two: df 1 → [(1, 1)]
34. Inverted Indexing via MapReduce
Doc 1: one fish, two fish   Doc 2: red fish, blue fish   Doc 3: cat in the hat
Map output: Doc 1 → (one, 1), (two, 1), (fish, 1); Doc 2 → (red, 2), (blue, 2), (fish, 2); Doc 3 → (cat, 3), (hat, 3)
Shuffle and Sort: aggregate values by keys
Reduce output: blue → [2], cat → [3], fish → [1, 2], hat → [3], one → [1], two → [1], red → [2]
35. Inverted Indexing via MapReduce (2)
Doc 1: one fish, two fish   Doc 2: red fish, blue fish   Doc 3: cat in the hat
Map output (term, (docID, tf)): Doc 1 → (one, (1,1)), (two, (1,1)), (fish, (1,2)); Doc 2 → (red, (2,1)), (blue, (2,1)), (fish, (2,2)); Doc 3 → (cat, (3,1)), (hat, (3,1))
Shuffle and Sort: aggregate values by keys
Reduce output: blue → [(2,1)], cat → [(3,1)], fish → [(1,2), (2,2)], hat → [(3,1)], one → [(1,1)], two → [(1,1)], red → [(2,1)]
37. Ranked Character ngram correction
Vector space Information Retrieval (IR) model
Query=typo word
Document collection = dictionary (i.e. set of valid words)
Representation: word is vector of character ngram value
Rank candidate corrections according to vector similarity (cosine)
Trigram Examples
across: [#ac, acr, cro, ros, oss, ss#]
acress: [#ac, acr, cre, res, ess, ss#]
actress: [#ac, act, ctr, tre, res, ess, ss#]
blam: [#bl, bla, lam, am#]
mississippi: [#mi, mis, (iss, 2), (ssi, 2), sis, sip, ipp, ppi, pi#]
“value” here expresses relative importance of different
vector components for the similarity comparison
What else might we do? TF.IDF for character n-grams?
38. TF.IDF for character n-grams
Think about what makes an ngram more discriminating
e.g. in acquire, acq and cqu are more indicative than qui and ire.
Schematically, we want something like:
• acquire: [ #ac, acq, cqu, qui, uir, ire, re# ]
Possible solution: TF-IDF, where
TF is the frequency of the ngram in the word
IDF is based on the number of words in the vocabulary that the ngram occurs in
39. Correction Beyond Orthography
So far we’ve focused on orthography alone
The context of a typo also tells us a great deal
How can we compare contexts?
40. Correction Beyond Orthography
So far we’ve focused on orthography alone
The context of a typo also tells us a great deal
How can we compare contexts?
Idea: use the co-occurrence matrices built during HW2
We have a vector of co-occurrence counts for each word
Extract a similar vector for the typo given its immediate context
• “She is their favorite acress in town.”
acress: [ she:1, is:1, their:1, favorite:1, in:1, town:1 ]
Possible enhancement: make vectors sensitive to word order
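A sketch (illustrative names, not from the slides) of building the typo's context vector from its sentence, mirroring the acress example above:

import java.util.*;

public class ContextVector {
    // Counts of the words surrounding the typo in its sentence, e.g.
    // "She is their favorite acress in town." ->
    // {she=1, is=1, their=1, favorite=1, in=1, town=1}
    public static Map<String, Integer> contextCounts(String sentence, String typo) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : sentence.toLowerCase().split("[^a-z]+")) {
            if (!token.isEmpty() && !token.equals(typo)) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }
}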
41. Combining evidence
We have orthographic similarity and contextual similarity
We can do a simple weighted combination of the two, e.g.:
sim_{combined}(d_j, d_k) = \lambda \, sim_{orth}(d_j, d_k) + (1 - \lambda) \, sim_{context}(d_j, d_k)
How to do this more efficiently?
Compute top candidates based on simOrth
Take top k for consideration with simContext
…or other way around…
The combined model might also be expressed by a similar
probabilistic model…
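One way the re-ranking described above might look in code; lambda, the top-k cutoff, and the two similarity functions are stand-ins for whatever orthographic and contextual scorers are used, so this is a sketch rather than a prescribed implementation.

import java.util.Comparator;
import java.util.List;
import java.util.function.ToDoubleFunction;
import java.util.stream.Collectors;

public class CombinedRanker {
    // Keep the top-k candidates by orthographic similarity, then re-rank them
    // by lambda * simOrth + (1 - lambda) * simContext.
    public static List<String> rank(List<String> candidates,
                                    ToDoubleFunction<String> simOrth,
                                    ToDoubleFunction<String> simContext,
                                    double lambda, int k) {
        Comparator<String> byOrth =
                Comparator.comparingDouble((String c) -> simOrth.applyAsDouble(c)).reversed();
        Comparator<String> byCombined =
                Comparator.comparingDouble((String c) ->
                        lambda * simOrth.applyAsDouble(c)
                                + (1 - lambda) * simContext.applyAsDouble(c)).reversed();
        return candidates.stream()
                .sorted(byOrth)
                .limit(k)
                .sorted(byCombined)
                .collect(Collectors.toList());
    }
}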
42. Paradigm: Noisy-Channel Modeling
\hat{s} = \arg\max_S P(S \mid O) = \arg\max_S P(S) \, P(O \mid S)
Want to recover most likely latent (correct) source
word underlying the observed (misspelled) word
P(S): language model gives probability distribution
over possible (candidate) source words
P(O|S): channel model gives probability of each
candidate source word being “corrupted” into the
observed typo
44. Probabilistic vs. vector space model
Both measure orthographic & contextual “fit” of the
candidate given the typo and its usage context
Noisy channel:
\log P(\mathrm{cand} \mid \mathrm{typo}, \mathrm{context}) \propto \log P(\mathrm{typo} \mid \mathrm{cand}) + \log P(\mathrm{cand} \mid \mathrm{context})
IR approach:
sim_{combined}(d_j, d_k) = \lambda \, sim_{orth}(d_j, d_k) + (1 - \lambda) \, sim_{context}(d_j, d_k)
Both can benefit from “big” data (i.e. bigger samples)
Better estimates of probabilities and population frequencies
Usual probabilistic vs. non-probabilistic tradeoffs
Principled theory and methodology for modeling and estimation
How to extend the feature space to include additional information?
• Typing haptics (key proximity)? Cognitive errors (e.g. homonyms)?
46. Postings Encoding
Conceptually:
fish → (1, 2), (9, 1), (21, 3), (34, 1), (35, 2), (80, 3), …   [(docID, tf) pairs]
In Practice:
• Instead of document IDs, encode deltas (or d-gaps)
• But it’s not obvious that this saves space…
fish → (1, 2), (8, 1), (12, 3), (13, 1), (1, 2), (45, 3), …   [(d-gap, tf) pairs]
47. Overview of Index Compression
Byte-aligned vs. bit-aligned
Non-parameterized bit-aligned
Unary codes
γ (gamma) codes
δ (delta) codes
Parameterized bit-aligned
Golomb codes
Want more detail? Read Managing Gigabytes by Witten, Moffat, and Bell!
48. But First... General Data Compression
Run Length Encoding
7 7 7 8 8 9 = (7, 3), (8,2), (9,1)
Binary Equivalent
0 0 0 0 0 0 1 0 0 0 1 1 0 0 0 = 6, 1, 3, 2, 3
Good with sparse binary data
Huffman Coding
Optimal when data is distributed by negative powers of two
e.g. P(a)= ½, P(b) = ¼, P(c)=1/8, P(d)=1/8
• a = 0, b = 10, c= 110, d=111
Prefix codes: no codeword is the prefix of another codeword
• If we read 0, we know it’s an “a”; the following bits start a new codeword
• Similarly 10 is a b (no other codeword starts with 10), etc.
• Prefix is 1* (i.e. path to internal nodes is all 1s, output on leaves)
49. Unary Codes
Encode number as a run of 1s, specifically…
x ≥ 1 coded as (x − 1) 1s, followed by a zero-bit terminator
1=0
2 = 10
3 = 110
4 = 1110
...
Great for small numbers… horrible for large numbers
Overly-biased for very small gaps
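An illustrative string-of-bits sketch of the unary code (a real index writer would pack bits rather than build Strings):

public class UnaryCode {
    // x >= 1 encoded as (x - 1) one-bits followed by a zero terminator: 1 -> "0", 4 -> "1110"
    public static String unary(int x) {
        StringBuilder bits = new StringBuilder();
        for (int i = 1; i < x; i++) bits.append('1');
        return bits.append('0').toString();
    }
}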
50. γ codes
x ≥ 1 is coded in two parts: unary length : offset
Start with binary encoded, remove highest-order bit = offset
Length is number of binary digits, encoded in unary
Concatenate length + offset codes
Example: 9 in binary is 1001
Offset = 001
Length = 4, in unary code = 1110
γ code = 1110:001
Another example: 7 (111 in binary)
• offset = 11, length = 3 (110 in unary) → γ code = 110:11
Analysis
Offset = ⌊log x⌋ bits
Length = ⌊log x⌋ + 1 bits (in unary)
Total = 2⌊log x⌋ + 1 bits (97 bits, 75 bits, …)
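Continuing the string-of-bits sketch, the γ code concatenates the unary-coded length with the offset (the binary form minus its leading 1):

public class GammaCode {
    // gamma(9): binary 1001 -> offset 001, length 4 in unary 1110 -> "1110001"
    public static String gamma(int x) {
        String binary = Integer.toBinaryString(x);            // e.g. "1001"
        String offset = binary.substring(1);                  // drop the leading 1
        return UnaryCode.unary(binary.length()) + offset;     // unary length, then offset
    }
}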
51. δ codes
As with γ codes, two parts: unary length & offset
Offset is same as before
Length is encoded by its γ code
Example: 9 (= 1001 in binary)
Offset = 001
Length = 4 (binary 100): offset = 00, length 3 → 110 in unary
• γ code of the length = 110:00
δ code = 110:00:001
Comparison
γ codes better for smaller numbers
δ codes better for larger numbers
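And the δ variant, which γ-codes the length instead of unary-coding it (still the same illustrative string-of-bits sketch):

public class DeltaCode {
    // delta(9): offset 001, length 4 whose gamma code is 11000 -> "11000001"
    public static String delta(int x) {
        String binary = Integer.toBinaryString(x);
        String offset = binary.substring(1);
        return GammaCode.gamma(binary.length()) + offset;     // gamma-coded length, then offset
    }
}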
52. Golomb Codes
x ≥ 1, parameter b
x encoded in two parts
Part 1: q = ⌊( x − 1 ) / b⌋, code q + 1 in unary
Part 2: remainder r < b, r = x − qb − 1, coded in truncated binary
Truncated binary defines prefix code
if b is a power of 2
• easy case: truncated binary = regular binary
else
• First 2^(⌊log b⌋ + 1) − b values encoded in ⌊log b⌋ bits
• Remaining values encoded in ⌊log b⌋ + 1 bits
Let’s see some examples
53. Golomb Code Examples
b = 3, r = [0:2]
First 2^(⌊log 3⌋ + 1) − 3 = 2^2 − 3 = 1 value, in ⌊log 3⌋ = 1 bit
First 1 value in 1 bit: 0
Remaining 3-1=2 values in 1+1=2 bits with prefix 1: 10, 11
b = 5, r = [0:4]
First 2^(⌊log 5⌋ + 1) − 5 = 2^3 − 5 = 3 values, in ⌊log 5⌋ = 2 bits
First 3 values in 2 bits: 00, 01, 10
Remaining 5-3=2 values in 2+1=3 bits with prefix 11: 110, 111
• Two prefix bits needed since single leading 1 already used in “10”
b = 6, r = [0:5]
First 2^(⌊log 6⌋ + 1) − 6 = 2^3 − 6 = 2 values, in ⌊log 6⌋ = 2 bits
First 2 values in 2 bits: 00, 01
Remaining 6-2=4 values in 2+1=3 bits with prefix 1: 100, 101, 110, 111
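A sketch of a Golomb encoder that reproduces the worked examples above; the truncated-binary helper and class names are illustrative, and it builds on the earlier unary() sketch.

public class GolombCode {
    // Golomb code for x >= 1 with parameter b: quotient q in unary, remainder r in truncated binary.
    public static String golomb(int x, int b) {
        int q = (x - 1) / b;
        int r = x - q * b - 1;                               // 0 <= r < b
        return UnaryCode.unary(q + 1) + truncatedBinary(r, b);
    }

    // First 2^(floor(log2 b) + 1) - b remainders use floor(log2 b) bits, the rest use one more bit.
    // For b = 3: 0 -> "0", 1 -> "10", 2 -> "11" (matches the example above).
    static String truncatedBinary(int r, int b) {
        int k = 31 - Integer.numberOfLeadingZeros(b);        // floor(log2 b)
        int u = (1 << (k + 1)) - b;                          // number of short codewords
        return (r < u) ? toBits(r, k) : toBits(r + u, k + 1);
    }

    private static String toBits(int value, int width) {
        StringBuilder bits = new StringBuilder();
        for (int i = width - 1; i >= 0; i--) bits.append((value >> i) & 1);
        return bits.toString();
    }
}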
55. Index Compression: Performance
Comparison of index size (bits per pointer):
          Bible   TREC
Unary       262   1918
Binary       15     20
γ          6.51   6.63
δ          6.23   6.38
Golomb     6.09   5.84
Use Golomb codes for d-gaps, γ codes for term frequencies
Optimal b ≈ 0.69 (N/df): different b for every term!
Bible: King James version of the Bible; 31,101 verses (4.3 MB)
TREC: TREC disks 1+2; 741,856 docs (2070 MB)
Witten, Moffat, Bell, Managing Gigabytes (1999)
56. Where are we without compression?
(key, values) as emitted, unsorted, with tf and positions per posting:
fish → (1, 2, [2,4]), (34, 1, [23]), (21, 3, [1,8,22]), (35, 2, [8,41]), (80, 3, [2,9,76]), (9, 1, [9])
(keys, values) with composite (term, docID) keys, sorted by the framework:
(fish, 1) → [2,4]; (fish, 9) → [9]; (fish, 21) → [1,8,22]; (fish, 34) → [23]; (fish, 35) → [8,41]; (fish, 80) → [2,9,76]
How is this different?
• Let the framework do the sorting
• Directly write postings to disk
• Term frequency implicitly stored
57. Index Compression in MapReduce
Need df to compress posting for each term
How do we compute df?
Count the # of postings in reduce(), then compress
Problem?
58. Order Inversion Pattern
In the mapper:
Emit “special” key-value pairs to keep track of df
In the reducer:
Make sure “special” key-value pairs come first: process them to
determine df
Remember: proper partitioning!
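A sketch of the partitioning piece of this pattern, using the old API and assuming a composite Text key written as "term" for the special df pair and "term,docid" for normal postings; the key format and class name are illustrative. A companion key comparator would sort the bare "term" key ahead of the "term,docid" keys.

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Partition on the term alone, so a term's "special" df pair and all of its
// normal postings land on the same reducer.
public class TermPartitioner implements Partitioner<Text, Writable> {
    public void configure(JobConf job) { }

    public int getPartition(Text key, Writable value, int numPartitions) {
        String term = key.toString().split(",", 2)[0];
        return (term.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}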
59. Getting the df: Modified Mapper
Input document: Doc 1: one fish, two fish
(key)     (value)
fish 1    [2,4]    Emit normal key-value pairs…
one 1     [1]
two 1     [3]
fish      [1]      Emit “special” key-value pairs to keep track of df…
one       [1]
two       [1]
60. Getting the df: Modified Reducer
(key)     (value)
fish      [63] [82] [27] …    First, compute the df by summing contributions from all “special” key-value pairs…
fish 1    [2,4]               Compress postings incrementally as they arrive
fish 9    [9]
fish 21   [1,8,22]            Important: properly define sort order to make sure “special” key-value pairs come first!
fish 34   [23]
fish 35   [8,41]
fish 80   [2,9,76]
…                             Write postings directly to disk
Where have we seen this before?
62. Exercise: where have all the ngrams gone?
For each observed (word) trigram in collection,
output its observed (docID, wordIndex) locations
Input
Doc 1: one fish two fish   Doc 2: one fish two salmon   Doc 3: two fish two fish
Output
one fish two → [(1,1),(2,1)]
fish two fish → [(1,2),(3,2)]
fish two salmon → [(2,2)]
two fish two → [(3,1)]
Possible Tools: pairs/stripes? combining? secondary sorting? order inversion? side effects?
63. Exercise: shingling
Given observed (docID, wordIndex) ngram locations
For each document, for each of its ngrams (in order),
give a list of the ngram locations for that ngram
Input
one fish two → [(1,1),(2,1)]
fish two fish → [(1,2),(3,2)]
fish two salmon → [(2,2)]
two fish two → [(3,1)]
Output
Doc 1 → [ [(1,1),(2,1)], [(1,2),(3,2)] ]
Doc 2 → [ [(1,1),(2,1)], [(2,2)] ]
Doc 3 → [ [(3,1)], [(1,2),(3,2)] ]
Possible Tools: pairs/stripes? combining? secondary sorting? order inversion? side effects?
64. Exercise: shingling (2)
How can we recognize when longer ngrams are
aligned across documents?
Example
doc 1: a b c d e
doc 2: a b c d f
doc 3: e b c d f
doc 4: a b c d e
Find “a b c d” in docs 1 2 and 4,
“b c d f” in 2 & 3
“a b c d e” in 1 and 4
65. typedef Pair<int docID, int position> Ngram;
class Alignment
int index // start position in this document
int length // sequence length in ngrams
int otherID // ID of other document
int otherIndex // start position in other document
class NgramExtender
Set<Alignment> alignments = empty set
index=0;
NgramExtender(int docID) { _docID = docID }
close() { foreach Alignment a, emit(_docID, a) }
AlignNgrams(List<Ngram> ngrams) // call this function iteratively in order of ngrams observed in this document
...
@inproceedings{Kolak:2008,
author = {Kolak, Okan and Schilit, Bill N.},
title = {Generating links by mining quotations},
booktitle = {19th ACM conference on Hypertext and
hypermedia},
year = {2008},
pages = {117--126}
}
66. typedef Pair<int docID, int position> Ngram;
class Alignment
int index // start position in this document
int length // sequence length in ngrams
int otherID // ID of other document
int otherIndex // start position in other document
class NgramExtender
Set<Alignment> alignments = empty set
index=0;
NgramExtender(int docID) { _docID = docID }
close() { foreach Alignment a, emit(_docID, a) }
AlignNgrams(List<Ngram> ngrams) // call this function iteratively in order of ngrams observed in this document
++index;
foreach Alignment a in alignments
Ngram next = new Ngram(a.otherID, a.otherIndex + a.length)
if (ngrams.contains(next)) // extend alignment
a.length += 1; ngrams.remove(next)
else // terminate alignment
emit _docID, (a); alignments.remove(a)
foreach ngram in ngrams
alignments.add( new Alignment( index, 1, ngram.docID, ngram.otherIndex ) )
68. Building more complex MR algorithms
Monolithic single Map + single Reduce
What we’ve done so far
Fitting all computation to this model can be difficult and ugly
We generally strive for modularization when possible
What else can we do?
Pipeline: [Map Reduce] [Map Reduce] … (multiple sequential jobs)
Chaining: [Map+ Reduce Map*]
• 1 or more Mappers
• 1 reducer
• 0 or more Mappers
Pipelined Chain: [Map+ Reduce Map*] [Map+ Reduce Map*] …
Express arbitrary dependencies between jobs
69. Modularization and WordCount
General benefits of modularization
Re-use for easier/faster development
Consistent behavior across applications
Easier/faster to maintain/extend for benefit of many applications
Even basic word count can be broken down
Pre-processing
• How will we tokenize? Perform stemming? Remove stopwords?
Main computation: count tokenized tokens and group by word
Post-processing
• Transform the values? (e.g. log-damping)
Let’s separate tokenization into its own module
Many other tasks can likely benefit
First approach: pipeline…
71. Pipeline WordCount in Hadoop
Two distinct jobs: tokenize and count
Data sharing between jobs via persistent output
Can use combiners and partitioners as usual (won’t bother here)
Let’s use SequenceFileOutputFormat rather than TextOutputFormat
sequence of binary key-value pairs; faster / smaller
tokenization output will stick around unless we delete it
Tokenize job
Just a mapper, no reducer: conf.setNumReduceTasks(0) or IdentityReducer
Output goes to directory we specify
Files will be read back in by the counting job
Output is array of tokens
We need to make a suitable Writable for String arrays
Count job
Input types defined by the input SequenceFile (don’t need to be specified)
Mapper is trivial
observes tokens from incoming data
Key: (docid) & Value: (Array of Strings, encoded as a Writable)
72. Pipeline WordCount (old Hadoop API)
Configuration conf = new Configuration();
String tmpDir1to2 = "/tmp/intermediate1to2";
// Tokenize job
JobConf tokenizationJob = new JobConf(conf);
tokenizationJob.setJarByClass(PipelineWordCount.class);
FileInputFormat.setInputPaths(tokenizationJob, new Path(inputPath));
FileOutputFormat.setOutputPath(tokenizationJob, new Path(tmpDir1to2));
tokenizationJob.setOutputFormat(SequenceFileOutputFormat.class);
tokenizationJob.setMapperClass(AggressiveTokenizerMapper.class);
tokenizationJob.setOutputKeyClass(LongWritable.class);
tokenizationJob.setOutputValueClass(TextArrayWritable.class);
tokenizationJob.setNumReduceTasks(0);
// Count job
JobConf countingJob = new JobConf(conf);
countingJob.setJarByClass(PipelineWordCount.class);
countingJob.setInputFormat(SequenceFileInputFormat.class);
FileInputFormat.setInputPaths(countingJob, new Path(tmpDir1to2));
FileOutputFormat.setOutputPath(countingJob, new Path(outputPath));
countingJob.setMapperClass(TrivialWordObserver.class);
countingJob.setReducerClass(MapRedIntSumReducer.class);
countingJob.setOutputKeyClass(Text.class);
countingJob.setOutputValueClass(IntWritable.class);
countingJob.setNumReduceTasks(reduceTasks);
JobClient.runJob(tokenizationJob);
JobClient.runJob(countingJob);
73. Pipeline jobs in Hadoop
Old API
JobClient.runJob(..) does not return until the job finishes
New API
Use Job rather than JobConf
Use job.waitForCompletion instead of JobClient.runJob
Why Old API?
In 0.20.2, chaining only possible under old API
We want to re-use the same components for chaining (next…)
74. Chaining in Hadoop
Map+ Reduce Map*
1 or more Mappers
• Can use IdentityMapper
1 reducer
• No reducers: conf.setNumReduceTasks(0)?
0 or more Mappers
Usual combiners and partitioners
By default, data passed between Mappers by usual writing of intermediate data to disk
Can always use side-effects…
There is a better, built-in way to bypass this and pass (Key,Value) pairs by reference instead
• Requires different Mapper semantics!
[Diagram: two parallel job chains, each Mapper 1 → Intermediates → Mapper 2 → Reducer → Mapper 3 → Persistent Output]
75. Hadoop: ChainMapper & ChainReducer
The example below uses JobConf objects (deprecated in Hadoop 0.20.2).
There is no undeprecated replacement in 0.20.2…
Examples here work for later versions with small changes
Configuration conf = new Configuration();
JobConf job = new JobConf(conf);
...
boolean passByRef = false; // pass output (Key,Value) pairs to next Mapper by reference?
JobConf map1Conf = new JobConf(false);
ChainMapper.addMapper(job, Map1.class, Map1InputKey.class, Map1InputValue.class,
Map1OutputKey.class, Map1OutputValue.class, passByRef, map1Conf);
JobConf map2Conf = new JobConf(false);
ChainMapper.addMapper(job, Map2.class, Map1OutputKey.class, Map1OutputValue.class,
Map2OutputKey.class, Map2OutputValue.class, passByRef, map2Conf);
JobConf reduceConf = new JobConf(false);
ChainReducer.setReducer(job, Reducer.class, Map2OutputKey.class, Map2OutputValue.class,
ReducerOutputKey.class, ReducerOutputValue.class, passByRef, reduceConf);
JobConf map3Conf = new JobConf(false);
ChainReducer.addMapper (job, Map3.class, ReducerOutputKey.class, ReducerOutputValue.class,
Map3OutputKey.class, Map3OutputValue.class, passByRef, map3Conf);
JobClient.runJob(job);
76. Chaining in Hadoop
Let’s continue our running example:
Mapper 1: Tokenize
Mapper 2: Observe (count) words
Reducer: same IntSum reducer as always
Mapper 3: Log-dampen counts
• We didn’t have this in our pipeline example but we’ll add here…
77. Chained Tokenizer + WordCount
// Set up configuration and intermediate directory location
Configuration conf = new Configuration();
JobConf chainJob = new JobConf(conf);
chainJob.setJobName("Chain job");
chainJob.setJarByClass(ChainWordCount.class); // single jar for all Mappers and Reducers…
chainJob.setNumReduceTasks(reduceTasks);
FileInputFormat.setInputPaths(chainJob, new Path(inputPath));
FileOutputFormat.setOutputPath(chainJob, new Path(outputPath));
// pass output (Key,Value) pairs to next Mapper by reference?
boolean passByRef = false;
JobConf map1 = new JobConf(false); // tokenization
ChainMapper.addMapper(chainJob, AggressiveTokenizerMapper.class,
LongWritable.class, Text.class,
LongWritable.class, TextArrayWritable.class, passByRef, map1);
JobConf map2 = new JobConf(false); // Add token observer job
ChainMapper.addMapper(chainJob, TrivialWordObserver.class,
LongWritable.class, TextArrayWritable.class,
Text.class, LongWritable.class, passByRef, map2);
JobConf reduce = new JobConf(false); // Set the int sum reducer
ChainReducer.setReducer(chainJob, LongSumReducer.class, Text.class, LongWritable.class,
Text.class, LongWritable.class, passByRef, reduce);
JobConf map3 = new JobConf(false); // log-scaling of counts
ChainReducer.addMapper(chainJob, ComputeLogMapper.class, Text.class, LongWritable.class,
Text.class, FloatWritable.class, passByRef, map3);
JobClient.runJob(chainJob);
78. Hadoop Chaining: Pass by Reference
Chaining allows possible optimization
Chained mappers run in same JVM thread, so opportunity to avoid
serialization to/from disk with pipelined jobs
Also lesser benefit of avoiding extra object destruction / construction
Gotchas
OutputCollector.collect(K k, V v) promises
not to alter the content of k and v
But if Map1 passes (k,v) by reference to Map2 via collect(),
Map2 may alter (k,v) & thereby violate the contract
What to do?
Option 1: Honor the contract – don’t alter input (k,v) in Map2
Option 2: Re-negotiate terms – don’t re-use (k,v) in Map1 after collect()
Document carefully to avoid later changes silently breaking this…
79. Setting Dependencies Between Jobs
JobControl and Job provide the mechanism
// create jobconf1 and jobconf2 as appropriate
// …
Job job1 = new Job(jobconf1);
Job job2 = new Job(jobconf2);
job2.addDependingJob(job1);
JobControl jbcntrl = new JobControl("jbcntrl");
jbcntrl.addJob(job1);
jbcntrl.addJob(job2);
jbcntrl.run();
New API: no JobConf, create Job from Configuration, …
80. Higher Level Abstractions
Pig: language and execution environment for expressing
MapReduce data flows. (pretty much the standard)
See White, Chapter 11
Cascading: another environment with a higher level of
abstraction for composing complex data flows
See White, Chapter 16, pp 539-552
Cascalog: query language based on Cascading that uses
Clojure (a JVM-based LISP variant)
Word count in Cascalog
Certainly more concise – though you need to grok the syntax.
(?<- (stdout) [?word ?count] (sentence ?s) (split ?s :> ?word) (c/count ?count))