This document summarizes a presentation about annotating millions of documents at scale using dictionary-based annotation with Apache Spark, Apache Solr, and Apache OpenNLP. The key points discussed include:
- The problem of annotating millions of documents from science corpora and the need to do it efficiently without model training.
- The architecture of SoDA (Dictionary Based Named Entity Annotator), which uses Apache Solr, SolrTextTagger, and OpenNLP for annotation and can be run on Spark for scaling.
- Performance optimizations made including combining paragraphs, tuning Solr garbage collection, and increasing the Spark cluster size, which resulted in annotating over 20 documents per second.
- Further work proposed
SoDA v2 - Named Entity Recognition from streaming textSujit Pal
Covers the services supported by SoDA v2. Includes some background on Named Entity Recognition and Resolution, popular approaches to Named Entity Recognition, hybrid approaches, scaling SoDA using Spark and Spark streaming, deployment strategies, etc.
Measuring Search Engine Quality using Spark and PythonSujit Pal
Presented at PyData Amsterdam 2016. Describes the Rewinder tool, to compare search engine configuration performance between Microsoft FAST and Apache Solr for the ScienceDirect search backend migration.
Text Classification in Python – using Pandas, scikit-learn, IPython Notebook ...Jimmy Lai
Big data analysis relies on exploiting various handy tools to gain insight from data easily. In this talk, the speaker demonstrates a data mining flow for text classification using many Python tools. The flow consists of feature extraction/selection, model training/tuning and evaluation. Various tools are used in the flow, including: Pandas for feature processing, scikit-learn for classification, IPython, Notebook for fast sketching, matplotlib for visualization.
Presentation to a combined meetup of Bay Area Lisp and Bay Area Clojure groups. Presented three Clojure projects at BackType:
Cascalog - Batch processing in Clojure
ElephantDB - Database written in Clojure
Storm - Distributed, fault-tolerant, reliable stream processing and RPC
SoDA v2 - Named Entity Recognition from streaming textSujit Pal
Covers the services supported by SoDA v2. Includes some background on Named Entity Recognition and Resolution, popular approaches to Named Entity Recognition, hybrid approaches, scaling SoDA using Spark and Spark streaming, deployment strategies, etc.
Measuring Search Engine Quality using Spark and PythonSujit Pal
Presented at PyData Amsterdam 2016. Describes the Rewinder tool, to compare search engine configuration performance between Microsoft FAST and Apache Solr for the ScienceDirect search backend migration.
Text Classification in Python – using Pandas, scikit-learn, IPython Notebook ...Jimmy Lai
Big data analysis relies on exploiting various handy tools to gain insight from data easily. In this talk, the speaker demonstrates a data mining flow for text classification using many Python tools. The flow consists of feature extraction/selection, model training/tuning and evaluation. Various tools are used in the flow, including: Pandas for feature processing, scikit-learn for classification, IPython, Notebook for fast sketching, matplotlib for visualization.
Presentation to a combined meetup of Bay Area Lisp and Bay Area Clojure groups. Presented three Clojure projects at BackType:
Cascalog - Batch processing in Clojure
ElephantDB - Database written in Clojure
Storm - Distributed, fault-tolerant, reliable stream processing and RPC
SolrCloud in Public Cloud: Scaling Compute Independently from Storage - Ilan ...Lucidworks
Running SolrCloud in Public Cloud is the future. This presentation and the code that will be contributed back to the community will allow such clusters to be highly efficient, scalable and elastic. Attendees will understand the challenges and potential of sharing index data between servers.
Speakers: Ilan Ginzburg & Yonik Seeley, Salesforce
Mathias Brandewinder, Software Engineer & Data Scientist, Clear Lines Consult...MLconf
Scripts that Scale with F# and mbrace.io:
Nothing beats interactive scripting for productive data exploration and rapid prototyping: grab data, run code, and iterate based on feedback. However, that story starts to break down once you need to process large datasets or expensive computations. Your local machine becomes the bottleneck, and your are left with a slow and unresponsive environment.
In this talk, we will demonstrate on live examples how you can have your cake and eat it, too, using mbrace.io, a free, open-source engine for scalable cloud programming. Using a simple programming model, you can keep working from your favorite scripting environment, and execute code interactively against a cluster on the Azure cloud. We will discuss the relevance of F# and mbrace in a data science and machine learning context, from parallelizing code and data processing in a functional style, to leveraging F# type providers to consume data or even run R packages.
Tom Peters, Software Engineer, Ufora at MLconf ATL 2016MLconf
Say What You Mean: Scaling Machine Learning Algorithms Directly from Source Code: Scaling machine learning applications is hard. Even with powerful systems like Spark, Tensor Flow, and Theano, the code you write has more to do with getting these systems to work at all than it does with your algorithm itself. But it doesn’t have to be this way!
In this talk, I’ll discuss an alternate approach we’ve taken with Pyfora, an open-source platform for scalable machine learning and data science in Python. I’ll show how it produces efficient, large scale machine learning implementations directly from the source code of single-threaded Python programs. Instead of programming to a complex API, you can simply say what you mean and move on. I’ll show some classes of problem where this approach truly shines, discuss some practical realities of developing the system, and I’ll talk about some future directions for the project.
Chris Fregly, Research Scientist, PipelineIO at MLconf ATL 2016MLconf
Comparing TensorFlow NLP Options: word2Vec, gloVe, RNN/LSTM, SyntaxNet, and Penn Treebank: Through code samples and demos, we’ll compare the architectures and algorithms of the various TensorFlow NLP options. We’ll explore both feed-forward and recurrent neural networks such as word2vec, gloVe, RNN/LSTM, SyntaxNet, and Penn Treebank using the latest TensorFlow libraries.
by Vikram Madan, Sr. Product Manager, AWS Deep Learning
In this workshop, we will provide cover deep learning fundamentals and focus on the powerful and scalable Apache MXNet open source deep learning framework. At the end of this tutorial you’ll be able to train your own deep neural network and fine tune existing state of the art models for image and object recognition. We’ll also deep dive on setting up your deep learning infrastructure on AWS and model deployment on AWS Lambda.
Synchronizing Clusters in Fusion: CDCR and Streaming ExpressionsLucidworks
Synchronizing a search application across multiple clusters is a complex challenge and the solution evolves with our tools (Solr and Fusion AppStudio). Paul Anderson discusses how Dynatrace's cluster synchronization strategy changed over the last two years to ensure that customers worldwide have a consistent search experience. The talk focuses on two Solr features, CDCR, and Streaming Expressions, explaining what they do well, where they fall down, and where they need to improve. Paul also covers how to modify your index pipelines and signal aggregations to support cluster synchronization.
Speakers: Paul Anderson & Daniel Drahnak, Dynatrace
Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017StampedeCon
This technical session provides a hands-on introduction to TensorFlow using Keras in the Python programming language. TensorFlow is Google’s scalable, distributed, GPU-powered compute graph engine that machine learning practitioners used for deep learning. Keras provides a Python-based API that makes it easy to create well-known types of neural networks in TensorFlow. Deep learning is a group of exciting new technologies for neural networks. Through a combination of advanced training techniques and neural network architectural components, it is now possible to train neural networks of much greater complexity. Deep learning allows a model to learn hierarchies of information in a way that is similar to the function of the human brain.
Josh Patterson, Advisor, Skymind – Deep learning for Industry at MLconf ATL 2016MLconf
DL4J and DataVec for Enterprise Deep Learning Workflows: Applications in NLP, sensor processing (IoT), image processing, and audio processing have all emerged as prime deep learning applications. In this session we will take a look at a practical review of building practical and secure Deep Learning workflows in the enterprise. We’ll see how DL4J’s DataVec tool enables scalable ETL and vectorization pipelines to be created for a single machine or scale out to Spark on Hadoop. We’ll also see how Deep Networks such as Recurrent Neural Networks are able to leverage DataVec to more quickly process data for modeling.
An introduction to Keras, a high-level neural networks library written in Python. Keras makes deep learning more accessible, is fantastic for rapid protyping, and can run on top of TensorFlow, Theano, or CNTK. These slides focus on examples, starting with logistic regression and building towards a convolutional neural network.
The presentation was given at the Austin Deep Learning meetup: https://www.meetup.com/Austin-Deep-Learning/events/237661902/
Michigan Information Retrieval Enthusiasts Group Meetup - August 19, 2010ivan provalov
Two presentation from the Michigan Information Retrieval Enthusiasts Group Meetup on August 19 by Cengage Learning search platform development team.
Scaling Performance Tuning With Lucene by John Nader discusses primary performance hot spots related to scaling to a multi-million document collection. This includes the team's experiences with memory consumption, GC tuning, query expansion, and filter performance. Discusses both the tools used to identify issues and the techniques used to address them.
Relevance Tuning Using TREC Dataset by Rohit Laungani and Ivan Provalov describes the TREC dataset used by the team to improve the relevance of the Lucene-based search platform. Goes over IBM paper and describe the approaches tried: Lexical Affinities, Stemming, Pivot Length Normalization, Sweet Spot Similarity, Term Frequency Average Normalization. Talks about Pseudo Relevance Feedback.
Braxton McKee, Founder & CEO, Ufora at MLconf SF - 11/13/15MLconf
Is Machine Learning Code for 100 Rows or a Billion the Same?: We have built an automatically distributed, implicitly parallel data science platform for running large scale machine learning applications. By abstracting away the computer science required to scale machine learning models, The Ufora platform lets data scientists focus on building data science models in simple scripting code, without having to worry about building large-scale distributed systems, their race conditions, fault-tolerance, etc. This automatic approach requires solving some interesting challenges, like optimal data layout for different ML models. For example, when a data scientist says “do a linear regression on this 100GB dataset”, Ufora needs to figure out how to automatically distribute and lay out that data across a cluster of machines in the cluster in order to minimize travel over the wire. Running a GBM against the same dataset might require a completely different layout of that data. This talk will cover how the platform works, in terms of data and thread distribution, how it generates parallel processes out of single-threaded programs, and more.
Parsl: Pervasive Parallel Programming in PythonDaniel S. Katz
a seminar presented at the School of Computer Science at the University of St Andrews 18 October 2019 (see https://blogs.cs.st-andrews.ac.uk/csblog/2019/09/25/daniel-katz-parsl/)
A presentation at Twitter's official developer conference, Chirp, about why we use the Scala programming language and how we build services in it. Provides a tour of a number of libraries and tools, both developed at Twitter and otherwise.
SolrCloud in Public Cloud: Scaling Compute Independently from Storage - Ilan ...Lucidworks
Running SolrCloud in Public Cloud is the future. This presentation and the code that will be contributed back to the community will allow such clusters to be highly efficient, scalable and elastic. Attendees will understand the challenges and potential of sharing index data between servers.
Speakers: Ilan Ginzburg & Yonik Seeley, Salesforce
Mathias Brandewinder, Software Engineer & Data Scientist, Clear Lines Consult...MLconf
Scripts that Scale with F# and mbrace.io:
Nothing beats interactive scripting for productive data exploration and rapid prototyping: grab data, run code, and iterate based on feedback. However, that story starts to break down once you need to process large datasets or expensive computations. Your local machine becomes the bottleneck, and your are left with a slow and unresponsive environment.
In this talk, we will demonstrate on live examples how you can have your cake and eat it, too, using mbrace.io, a free, open-source engine for scalable cloud programming. Using a simple programming model, you can keep working from your favorite scripting environment, and execute code interactively against a cluster on the Azure cloud. We will discuss the relevance of F# and mbrace in a data science and machine learning context, from parallelizing code and data processing in a functional style, to leveraging F# type providers to consume data or even run R packages.
Tom Peters, Software Engineer, Ufora at MLconf ATL 2016MLconf
Say What You Mean: Scaling Machine Learning Algorithms Directly from Source Code: Scaling machine learning applications is hard. Even with powerful systems like Spark, Tensor Flow, and Theano, the code you write has more to do with getting these systems to work at all than it does with your algorithm itself. But it doesn’t have to be this way!
In this talk, I’ll discuss an alternate approach we’ve taken with Pyfora, an open-source platform for scalable machine learning and data science in Python. I’ll show how it produces efficient, large scale machine learning implementations directly from the source code of single-threaded Python programs. Instead of programming to a complex API, you can simply say what you mean and move on. I’ll show some classes of problem where this approach truly shines, discuss some practical realities of developing the system, and I’ll talk about some future directions for the project.
Chris Fregly, Research Scientist, PipelineIO at MLconf ATL 2016MLconf
Comparing TensorFlow NLP Options: word2Vec, gloVe, RNN/LSTM, SyntaxNet, and Penn Treebank: Through code samples and demos, we’ll compare the architectures and algorithms of the various TensorFlow NLP options. We’ll explore both feed-forward and recurrent neural networks such as word2vec, gloVe, RNN/LSTM, SyntaxNet, and Penn Treebank using the latest TensorFlow libraries.
by Vikram Madan, Sr. Product Manager, AWS Deep Learning
In this workshop, we will provide cover deep learning fundamentals and focus on the powerful and scalable Apache MXNet open source deep learning framework. At the end of this tutorial you’ll be able to train your own deep neural network and fine tune existing state of the art models for image and object recognition. We’ll also deep dive on setting up your deep learning infrastructure on AWS and model deployment on AWS Lambda.
Synchronizing Clusters in Fusion: CDCR and Streaming ExpressionsLucidworks
Synchronizing a search application across multiple clusters is a complex challenge and the solution evolves with our tools (Solr and Fusion AppStudio). Paul Anderson discusses how Dynatrace's cluster synchronization strategy changed over the last two years to ensure that customers worldwide have a consistent search experience. The talk focuses on two Solr features, CDCR, and Streaming Expressions, explaining what they do well, where they fall down, and where they need to improve. Paul also covers how to modify your index pipelines and signal aggregations to support cluster synchronization.
Speakers: Paul Anderson & Daniel Drahnak, Dynatrace
Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017StampedeCon
This technical session provides a hands-on introduction to TensorFlow using Keras in the Python programming language. TensorFlow is Google’s scalable, distributed, GPU-powered compute graph engine that machine learning practitioners used for deep learning. Keras provides a Python-based API that makes it easy to create well-known types of neural networks in TensorFlow. Deep learning is a group of exciting new technologies for neural networks. Through a combination of advanced training techniques and neural network architectural components, it is now possible to train neural networks of much greater complexity. Deep learning allows a model to learn hierarchies of information in a way that is similar to the function of the human brain.
Josh Patterson, Advisor, Skymind – Deep learning for Industry at MLconf ATL 2016MLconf
DL4J and DataVec for Enterprise Deep Learning Workflows: Applications in NLP, sensor processing (IoT), image processing, and audio processing have all emerged as prime deep learning applications. In this session we will take a look at a practical review of building practical and secure Deep Learning workflows in the enterprise. We’ll see how DL4J’s DataVec tool enables scalable ETL and vectorization pipelines to be created for a single machine or scale out to Spark on Hadoop. We’ll also see how Deep Networks such as Recurrent Neural Networks are able to leverage DataVec to more quickly process data for modeling.
An introduction to Keras, a high-level neural networks library written in Python. Keras makes deep learning more accessible, is fantastic for rapid protyping, and can run on top of TensorFlow, Theano, or CNTK. These slides focus on examples, starting with logistic regression and building towards a convolutional neural network.
The presentation was given at the Austin Deep Learning meetup: https://www.meetup.com/Austin-Deep-Learning/events/237661902/
Michigan Information Retrieval Enthusiasts Group Meetup - August 19, 2010ivan provalov
Two presentation from the Michigan Information Retrieval Enthusiasts Group Meetup on August 19 by Cengage Learning search platform development team.
Scaling Performance Tuning With Lucene by John Nader discusses primary performance hot spots related to scaling to a multi-million document collection. This includes the team's experiences with memory consumption, GC tuning, query expansion, and filter performance. Discusses both the tools used to identify issues and the techniques used to address them.
Relevance Tuning Using TREC Dataset by Rohit Laungani and Ivan Provalov describes the TREC dataset used by the team to improve the relevance of the Lucene-based search platform. Goes over IBM paper and describe the approaches tried: Lexical Affinities, Stemming, Pivot Length Normalization, Sweet Spot Similarity, Term Frequency Average Normalization. Talks about Pseudo Relevance Feedback.
Braxton McKee, Founder & CEO, Ufora at MLconf SF - 11/13/15MLconf
Is Machine Learning Code for 100 Rows or a Billion the Same?: We have built an automatically distributed, implicitly parallel data science platform for running large scale machine learning applications. By abstracting away the computer science required to scale machine learning models, The Ufora platform lets data scientists focus on building data science models in simple scripting code, without having to worry about building large-scale distributed systems, their race conditions, fault-tolerance, etc. This automatic approach requires solving some interesting challenges, like optimal data layout for different ML models. For example, when a data scientist says “do a linear regression on this 100GB dataset”, Ufora needs to figure out how to automatically distribute and lay out that data across a cluster of machines in the cluster in order to minimize travel over the wire. Running a GBM against the same dataset might require a completely different layout of that data. This talk will cover how the platform works, in terms of data and thread distribution, how it generates parallel processes out of single-threaded programs, and more.
Parsl: Pervasive Parallel Programming in PythonDaniel S. Katz
a seminar presented at the School of Computer Science at the University of St Andrews 18 October 2019 (see https://blogs.cs.st-andrews.ac.uk/csblog/2019/09/25/daniel-katz-parsl/)
A presentation at Twitter's official developer conference, Chirp, about why we use the Scala programming language and how we build services in it. Provides a tour of a number of libraries and tools, both developed at Twitter and otherwise.
Two popular tools for doing Machine Learning on top of JVM ecosystem is H2O and SparkML. This presentation compares these two tools as Machine Learning libraries (Didn't consider Spark's Data Munjing perspective). This work was done during June of 2018.
Introduction to libre « fulltext » technologyRobert Viseur
The presentation will be based on my personal experience on SQLite, MySQL and Zend Search ; on workshops I’ve attended (PostgreSQL) and on tests conducted under my supervision (PostgreSQL, MySQL, Sphinx, Lucene, Xapian). It will cover an exhaustive overview of existing techniques, from the most basic to the more advanced, and will lead to a comparative table of the existing technology.
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...Jose Quesada (hiring)
The machine learning libraries in Apache Spark are an impressive piece of software engineering, and are maturing rapidly. What advantages does Spark.ml offer over scikit-learn? At Data Science Retreat we've taken a real-world dataset and worked through the stages of building a predictive model -- exploration, data cleaning, feature engineering, and model fitting; which would you use in production?
The machine learning libraries in Apache Spark are an impressive piece of software engineering, and are maturing rapidly. What advantages does Spark.ml offer over scikit-learn?
At Data Science Retreat we've taken a real-world dataset and worked through the stages of building a predictive model -- exploration, data cleaning, feature engineering, and model fitting -- in several different frameworks. We'll show what it's like to work with native Spark.ml, and compare it to scikit-learn along several dimensions: ease of use, productivity, feature set, and performance.
In some ways Spark.ml is still rather immature, but it also conveys new superpowers to those who know how to use it.
Building a Large Scale SEO/SEM Application with Apache SolrRahul Jain
Slides from my talk on "Building a Large Scale SEO/SEM Application with Apache Solr" in Lucene/Solr Revolution 2014 where I talk how we handle Indexing/Search of 40 billion records (documents)/month in Apache Solr with 4.6 TB compressed index data.
Abstract: We are working on building a SEO/SEM application where an end user search for a "keyword" or a "domain" and gets all the insights about these including Search engine ranking, CPC/CPM, search volume, No. of Ads, competitors details etc. in a couple of seconds. To have this intelligence, we get huge web data from various sources and after intensive processing it is 40 billion records/month in MySQL database with 4.6 TB compressed index data in Apache Solr.
Due to large volume, we faced several challenges while improving indexing performance, search latency and scaling the overall system. In this session, I will talk about our several design approaches to import data faster from MySQL, tricks & techniques to improve the indexing performance, Distributed Search, DocValues(life saver), Redis and the overall system architecture.
LuSql: (Quickly and easily) Getting your data from your DBMS into Luceneeby
Need to move your data from your DBMS to Lucene? The recently released LuSql allows you to do this in a single line. LuSql is a high performance, low use barrier application for getting DBMS data into Lucene. This presentation will introduce LuSql, and give a brief tutorial on simple to crazy complicated use cases, including per document sub-queries and out-of-band document transformations. -- Glen Newton, CISTI, National Research Council
Breakthrough OLAP performance with Cassandra and SparkEvan Chan
Find out about breakthrough architectures for fast OLAP performance querying Cassandra data with Apache Spark, including a new open source project, FiloDB.
Similar to Dictionary based Annotation at Scale with Spark, SolrTextTagger and OpenNLP (20)
Supporting Concept Search using a Clinical Healthcare Knowledge GraphSujit Pal
We describe our Dictionary based Named Entity Recognizer and Semantic Matcher that enables us to leverage our Knowledge Graph to provide Concept Search. We also describe our Named Entity Linking based Concept Recommender to support manual curation of our Knowledge Graph.
Youtube URL for talk: https://youtu.be/5UWrS_j8dDg
Google AI Hackathon: LLM based Evaluator for RAGSujit Pal
Slides accompanying project submission video for Google AI Hackathon. Describes a LCEL and DSPy based evaluation framework inspired by the RAGAS project.
Accompanying video URL: https://youtu.be/yOIU65chc98
Building Learning to Rank (LTR) search reranking models using Large Language ...Sujit Pal
Search engineers have many tools to address relevance. Older tools are typically unsupervised (statistical, rule based) and require large investments in manual tuning effort. Newer ones involve training or fine-tuning machine learning models and vector search, which require large investments in labeling documents with their relevance to queries.
Learning to Rank (LTR) models are in the latter category. However, their popularity has traditionally been limited to domains where user data can be harnessed to generate labels that are cheap and plentiful, such as e-commerce sites. In domains where this is not true, labeling often involves human experts, and results in labels that are neither cheap nor plentiful. This effectively becomes a roadblock to adoption of LTR models in these domains, in spite of their effectiveness in general.
Generative Large Language Models (LLMs) with parameters in the 70B+ range have been found to perform well at tasks that require mimicking human preferences. Labeling query-document pairs with relevance judgements for training LTR models is one such task. Using LLMs for this task opens up the possibility of obtaining a potentially unlimited number of query judgment labels, and makes LTR models a viable approach to improving the site’s search relevancy.
In this presentation, we describe work that was done to train and evaluate four LTR based re-rankers against lexical, vector, and heuristic search baselines. The models were a mix of pointwise, pairwise and listwise, and required different strategies to generate labels for them. All four models outperformed the lexical baseline, and one of the four models outperformed the vector search baseline as well. None of the models beat the heuristics baseline, although two came close – however, it is important to note that the heuristics were built up over months of trial and error and required familiarity of the search domain, whereas the LTR models were built in days and required much less familiarity.
The ability to handle long question style queries is often de rigueur for modern search engines. Search giants such as Bing and Google are addressing this by building Large Language Models (LLMs) into their search pipelines. Unfortunately, this approach requires large investments in infrastructure and involves high operational costs. It can also lead to loss of confidence when the LLM hallucinates non-factual answers.
A best practice for designing search pipelines is to make the search layer as cheap and fast as possible, and move heavyweight operations into the indexing layer. With that in mind, we present an approach that combines the use of LLMs during indexing to generate questions from passages, and matching them to incoming questions during search, using either text based or vector based matching. We believe this approach can provide good quality question answering capabilities for search applications and address the cost and confidence issues mentioned above.
Vector search goes far beyond just text, and, in this interactive workshop, you will learn how to use it for multimodal search through an in-depth look at CLIP, a vision and language model, developed by OpenAI. Sujit Pal, technology research director at Elsevier, and Raphael Pisoni, senior computer vision engineer at Partium.io, will walk you through two applications of image search and then have a panel discussion with our staff developer advocate, James, on how to use CLIP for image and text search.
Learning a Joint Embedding Representation for Image Search using Self-supervi...Sujit Pal
Image search interfaces either prompt the searcher to provide a search image (image-to-image search) or a text description of the image (text-to-image search). Image to Image search is generally implemented as a nearest neighbor search in a dense image embedding space, where the embedding is derived from Neural Networks pre-trained on a large image corpus such as ImageNet. Text to image search can be implemented via traditional (TF/IDF or BM25 based) text search against image captions or image tags.
In this presentation, we describe how we fine-tuned the OpenAI CLIP model (available from Hugging Face) to learn a joint image/text embedding representation from naturally occurring image-caption pairs in literature, using contrastive learning. We then show this model in action against a dataset of medical image-caption pairs, using the Vespa search engine to support text based (BM25), vector based (ANN) and hybrid text-to-image and image-to-image search.
The power of community: training a Transformer Language Model on a shoestringSujit Pal
I recently participated in a community event to train an ALBERT language model for the Bengali language. The event was organized by Neuropark, Hugging Face, and Yandex Research. The training was done collaboratively in a distributed manner using free GPU resources provided by Colab and Kaggle. Volunteers were recruited on Twitter and project coordination happened on Discord. At its peak, there were approximately 50 volunteers from all over the world simultaneously engaged in training the model. The distributed training was done on the Hivemind platform from Yandex Research, and the software to train the model in a data-parallel manner was developed by Hugging Face. In this talk I provide my perspective of the project as a somewhat curious participant. I will describe the Hivemind platform, the training regimen, and the evaluation of the language model on downstream tasks. I will also cover some challenges we encountered that were peculiar to the Bengali language (and Indic languages in general).
Accelerating NLP with Dask and Saturn CloudSujit Pal
Slides for talk delivered at NY NLP Meetup. Abstract -- Python has a great ecosystem of tools for natural language processing (NLP) pipelines, but challenges arise when data sizes and computational complexity grows. Best case, a pipeline is left to run overnight or even over several days. Worst case, certain analyses or computations are just not possible. Dask is a Python-native parallel processing tool that enables Python users to easily scale their code across a cluster of machines. This talk presents an example of an NLP entity extraction pipeline using SciSpacy with Dask for parallelization. This pipeline extracts named entities from the CORD-19 dataset, using trained models from the SciSpaCy project, and makes them available for downstream tasks in the form of structured Parquet files. The pipeline was built and executed on Saturn Cloud, a platform that makes it easy to launch and manage Dask clusters. The talk will present an introduction to Dask and explain how users can easily accelerate Python and NLP code across clusters of machines.
Accelerating NLP with Dask on Saturn Cloud: A case study with CORD-19Sujit Pal
Python has a great ecosystem of tools for natural language processing (NLP) pipelines, but challenges arise when data sizes and computational complexity grows. Best case, a pipeline is left to run overnight or even over several days. Worst case, certain analyses or computations are just not possible. Dask is a Python-native parallel processing tool that enables Python users to easily scale their code across a cluster of machines.
This talk presents an example of an NLP entity extraction pipeline using SciSpacy with Dask for parallelization, which was built and executed on Saturn Cloud. Saturn Cloud is an end-to-end data science and machine learning platform that provides an easy interface for Python environments and Dask clusters, removing many barriers to accessing parallel computing. This pipeline extracts named entities from the CORD-19 dataset, using trained models from the SciSpaCy project, and makes them available for downstream tasks in the form of structured Parquet files. We will provide an introduction to Dask and Saturn Cloud, then walk through the NLP code.
Leslie Smith's Papers discussion for DL Journal ClubSujit Pal
This deck discusses two papers by Dr Leslie Smith. The first paper discusses empirical findings around learning rate (LR) and other regularization parameters for neural networks, and leads to the idea of Cyclic Learning Rates (CLR). The second paper discusses CLR in depth, as well as how to estimate its parameters. The slides also covers LR Finder, a tool first introduced in the Fast.AI library to find optimal parameters for CLR, including how to run it and interpret its outputs.
Using Graph and Transformer Embeddings for Vector Based RetrievalSujit Pal
For the longest time, term-based vector representations based on whole-document statistics, such as TF-IDF, have been the staple of efficient and effective information retrieval. The popularity of Deep Learning over the past decade has resulted in the development of many interesting embedding schemes. Like term-based vector representations, these embeddings depend on structure implicit in language and user behavior. Unlike them, they leverage the distributional hypothesis, which states that the meaning of a word is determined by the context in which it appears. These embeddings have been found to better encode the semantics of the word, compared to term-based representations. Despite this, it has only recently become practical to use embeddings in Information Retrieval at scale.
In this presentation, we will describe how we applied two new embedding schemes to Scopus, Elsevier’s broad coverage database of scientific, technical, and medical literature. Both schemes are based on the distributional hypothesis but come from very different backgrounds. The first embedding is a graph embedding called node2vec, that encodes papers using citation relationships between them as specified by their authors. The second embedding leverages Transformers, a recent innovation in the area of Deep Learning, that are essentially language models trained on large bodies of text. These two embeddings exploit the signal implicit in these data sources and produce semantically rich user and content-based vector representations respectively. We will evaluate these embedding schemes and describe how we used the Vespa search engine to search these embeddings for similar documents within the Scopus dataset. Finally, we will describe how RELX staff can access these embeddings for their own data science needs, independent of the search application.
Transformer Mods for Document Length InputsSujit Pal
The Transformer architecture is responsible for many state of the art results in Natural Language Processing. A central feature behind its superior performance over Recurrent Neural Networks is its multi-headed self-attention mechanism. However, the superior performance comes at a cost, an O(n2) time and memory complexity, where n is the size of the input sequence. Because of this, it is computationally infeasible to feed large documents to the standard transformer. To overcome this limitation, a number of approaches have been proposed, which involve modifying the self-attention mechanism in interesting ways.
In this presentation, I will describe the transformer architecture, and specifically the self-attention mechanism, and then describe some of the approaches proposed to address the O(n2) complexity. Some of these approaches have also been implemented in the HuggingFace transformers library, and I will demonstrate some code for doing document level operations using one of these approaches.
Question Answering as Search - the Anserini Pipeline and Other StoriesSujit Pal
In the last couple of years, we have seen enormous breakthroughs in automated Open Domain Restricted Context Question Answering, also known as Reading Comprehension, where the task is to find an answer to a question from a single document or paragraph. A potentially more useful task is to find an answer for a question from a corpus representing an entire body of knowledge, also known as Open Domain Open Context Question Answering.
To do this, we adapted the BERTSerini architecture (Yang, et al., 2019), using it to answer questions about clinical content from our corpus of 5000+ medical textbooks. The BERTSerini pipeline consists of two components -- a BERT model fine-tuned for Question Answering, and an Anserini (Yang, Fang, and Lin, 2017) IR pipeline for Passage Retrieval. Anserini, in turn, consists of pluggable components for different kinds of query expansion and result reranking. Given a question, Anserini retrieves candidate passages, which the BERT model uses to retrieve the answer from. The best answer is determined using a combination of passage retrieval and answer scores.
Evaluating this system using a locally developed dataset of medical passages, questions, and answers, we adapted the BERT Question Answering component to our content using a combination of fine-tuning with third party SQuAD data, and pre-training the model using our medical content. However, when we replaced the canned passages with passages retrieved using the Anserini pipeline, performance dropped significantly, indicating that the relevance of the retrieved passages was a limiting factor.
The presentation will describe the actions taken to improve the relevance of passages returned by the Anserini pipeline.
Building Named Entity Recognition Models Efficiently using NERDSSujit Pal
Named Entity Recognition (NER) is foundational for many downstream NLP tasks such as Information Retrieval, Relation Extraction, Question Answering, and Knowledge Base Construction. While many high-quality pre-trained NER models exist, they usually cover a small subset of popular entities such as people, organizations, and locations. But what if we need to recognize domain specific entities such as proteins, chemical names, diseases, etc? The Open Source Named Entity Recognition for Data Scientists (NERDS) toolkit, from the Elsevier Data Science team, was built to address this need.
NERDS aims to speed up development and evaluation of NER models by providing a set of NER algorithms that are callable through the familiar scikit-learn style API. The uniform interface allows reuse of code for data ingestion and evaluation, resulting in cleaner and more maintainable NER pipelines. In addition, customizing NERDS by adding new and more advanced NER models is also very easy, just a matter of implementing a standard NER Model class.
Our presentation will describe the main features of NERDS, then walk through a demonstration of developing and evaluating NER models that recognize biomedical entities. We will then describe a Neural Network based NER algorithm (a Bi-LSTM seq2seq model written in Pytorch) that we will then integrate into the NERDS NER pipeline.
We believe NERDS addresses a real need for building domain specific NER models quickly and efficiently. NER is an active field of research, and the hope is that this presentation will spark interest and contributions of new NER algorithms and Data Adapters from the community that can in turn help to move the field forward.
Graph Techniques for Natural Language ProcessingSujit Pal
Natural Language embodies the human ability to make “infinite use of finite means” (Humboldt, 1836; Chomsky, 1965). A relatively small number of words can be combined using a grammar in myriad different ways to convey all kinds of information. Languages model inter-relationships between their words, just like graphs model inter-relationships between their vertices. It is not surprising then, that graphs are a natural tool to study Natural Language and glean useful information from it, automatically, and at scale. This presentation will focus on NLP techniques to convert raw text to graphs, and present Graph Theory based solutions to some common NLP problems. Solutions presented will use Apache Spark or Neo4j depending on problem size and scale. Examples of Graph Theory solutions presented include PageRank for Document Summarization, Link Prediction from raw text for Knowledge Graph enhancement, Label Propagation for entity classification, and Random Walk techniques to find similar documents.
Learning to Rank Presentation (v2) at LexisNexis Search GuildSujit Pal
An introduction to Learning to Rank, with case studies using RankLib with and without plugins provided by Solr and Elasticsearch. RankLib is a library of learning to rank algorithms, which includes some popular LTR algorithms such as LambdaMART, RankBoost, RankNet, etc.
Learning to Rank (LTR) presentation at RELX Search Summit 2018. Contains information about history of LTR, taxonomy of LTR algorithms, popular algorithms, and case studies of applying LTR using the TMDB dataset using Solr, Elasticsearch and without index support.
Search summit-2018-content-engineering-slidesSujit Pal
Slides accompanying content engineering tutorial presented at RELX Search Summit 2018. Contains techniques for keyword extraction using various statistical, rule based and machine learning methods, keyword de-duplication using SimHash and Dedupe, and dimensionality reduction techniques such as Topic Modeling, NMF, Word vectors, etc.
Evolving a Medical Image Similarity SearchSujit Pal
Slides for talk at Haystack Conference 2018. Covers evolution of an Image Similarity Search Proof of Concept built to identify similar medical images. Discusses various image vectorizing techniques that were considered in order to convert images into searchable entities, an evaluation strategy to rank these techniques, as well as various indexing strategies to allow searching for similar images at scale.
Adjusting primitives for graph : SHORT REPORT / NOTESSubhajit Sahu
Graph algorithms, like PageRank Compressed Sparse Row (CSR) is an adjacency-list based graph representation that is
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
Techniques to optimize the pagerank algorithm usually fall in two categories. One is to try reducing the work per iteration, and the other is to try reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, with the same in-links, helps reduce duplicate computations and thus could help reduce iteration time. Road networks often have chains which can be short-circuited before pagerank computation to improve performance. Final ranks of chain nodes can be easily calculated. This could reduce both the iteration time, and the number of iterations. If a graph has no dangling nodes, pagerank of each strongly connected component can be computed in topological order. This could help reduce the iteration time, no. of iterations, and also enable multi-iteration concurrency in pagerank computation. The combination of all of the above methods is the STICD algorithm. [sticd] For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
Explore our comprehensive data analysis project presentation on predicting product ad campaign performance. Learn how data-driven insights can optimize your marketing strategies and enhance campaign effectiveness. Perfect for professionals and students looking to understand the power of data analysis in advertising. for more details visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
Show drafts
volume_up
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...John Andrews
SlideShare Description for "Chatty Kathy - UNC Bootcamp Final Project Presentation"
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
2. Introduction
• About Me
– Work at Elsevier Labs.
– Interested in Search, NLP and Distributed Processing.
– URL: labs.elsevier.com
– Email: sujit.pal@elsevier.com
– Twitter: @palsujit
• About Elsevier
– World’s largest publisher of STM Books and Journals.
– Uses Data to inform and enable consumers of STM info.
– And like everybody else, we are hiring!
5. Problem Definition
• What is the problem?
– Annotate millions of documents from different corpora.
• 14M docs from Science Direct alone.
• More from other corpora, dependency parsing, etc.
– Critical step for Machine Reading and Knowledge Graph applications.
• Why is this such a big deal?
– Takes advantage of existing linked data.
– No model training for multiple complex STM domains.
– However, simple until done at scale.
7. Dictionary Based NE Annotator (SoDA)
• Part of Document Annotation Pipeline.
• Annotates text with Named Entities from external Dictionaries.
• Built with Open Source Components
– Apache Solr – Highly reliable, scalable and fault-tolerant search index.
– SolrTextTagger – Solr component for text tagging, uses Lucene FST technology.
– Apache OpenNLP – Machine Learning based toolkit for processing Natural Language Text.
– Apache Spark – Lightning fast, large scale data processing.
• Uses ideas from other Open Source libraries
– FuzzyWuzzy – Fuzzy String Matching like a boss.
• Contributed back to Open Source
– https://github.com/elsevierlabs-os/soda
9. How does it work (exact/case matching)?
• Uses Aho-Corasick algorithm – treats the dictionary as a FST and streams text against
it. Matches all patterns simultaneously. Diagram shows FST for vocabulary {“his”, “he”,
“her”, “hers”, “she”}.
• Michael McCandless implemented FSTs in Lucene (blog post).
• David Smiley built SolrTextTagger to use Lucene FSTs.
• SoDA uses SolrTextTagger for streaming exact and case-insensitive matching.
10. How does it work (fuzzy matching)?
• Pre-normalizes each dictionary entry into various forms
– Original – “Astrocytoma, Subependymal Giant Cell”
– Lowercased – “astrocytoma, subependymal giant cell”
– Punctuation – “astrocytoma subependymal giant cell”
– Sorted – “astrocytoma cell giant subependymal”
– Stemmed – “astrocytoma cell giant subependym”
• Uses OpenNLP to parse input text into phrases, normalizes each phrase into the desired
normalization level and matches against corresponding field.
• Caller specifies normalization level.
12. Feature Overview
• Provides JSON over HTTP interface
– Compose request as JSON document
– HTTP POST document to JSON endpoint URL (HTTP GET for URL only requests).
– Receive response as JSON document.
• Language-Agnostic and Cross-Platform.
• API can be used from standalone clients, Spark jobs and Databricks notebooks.
• Examples in Scala and Python
13. Services
• Status
– index.json – returns a JSON (suitable for health check monitoring)
• Single Lexicon Services
– annot.json – annotates a block of text in streaming manner. Supports different levels of
matching (strict to permissive).
– matchphrase.json – annotates short phrases. Supports same matching levels as annot.json.
• Multi-Lexicon Services
– dicts.json – lists all lexicons available.
– coverage.json – returns number of annotations by lexicon found for text across all available
lexicons.
• Indexing Services
– delete.json – deletes entire lexicon from index.
– add.json – adds an entry to the specified lexicon.
14. Annotation Service I/O
Example annotation request
{
“lexicon” : “countries”,
“text” : “Institute of Clean Coal Technology, East
China University”,
“matching” : “exact”
}
Example annotated response
[
{
“id” : “http://www.geonames.org/CHN”,
“lexicon” : “countries”,
“begin” : 41,
“end” : 46,
“coveredText” : “China”,
“confidence” : 1.0
}
]
15. Calling Annotation Service
• Client originally written in Python, using built-in json and requests libraries.
• For Scala client, SoDA JAR provides classes to mimic json and requests functionality in Scala.
• Input to both our (somewhat contrived) examples are: (pii: String, affStr: String) tuples as shown.
• Match against a country lexicon to find country names.
18. Annotation Service - Outputs
• Each Annotation result provides:
– Entity ID (not shown)
– Begin position in text
– End position in text
– Matched Text
– Confidence (not shown)
• Zero or more Annotations possible per input text.
19. Loading Dictionaries
• Dictionary entries represented by:
– Lexicon Name
– Entry ID (unique across lexicons)
– List of possible synonym terms
• JSON Request to add an entry for MeSH dictionary.
{ “id”: http://id.nlm.nih.gov/mesh/2015/M0021699,
“lexicon”: “mesh”,
“names”: [“Baby Tooth”, “Dentitions, Primary”, “Milk Tooth”, ...],
“commit”: false }
• Preferable to commit periodically and after batch.
22. SoDA Performance – Expected
• Test: annotate 14M docs in “reasonable time”.
– Approx. 3s/doc with SoDA+Solr on ec2 r3.large box (15.5GB RAM, 32GB SSD, 2vCPU).
– Total estimated time: 16.2 months!
• Questions
– Can we make the process faster?
– Can we scale out the process?
23. Where is the time being spent?
• Majority of time spent in Solr.
• Some time spent in SoDA (decreases slower than Solr as transactions get shorter).
• Almost no additional time spent in Spark.
24. Optimization #1: Combine Paragraphs
• Performance measured using 10K random
articles.
• Time to annotate 1 article: Mean 2.9s, Median
2.1s.
• Annotation done per paragraph, 40
paragraphs/article on average.
• Reduce HTTP network + parsing overhead by
sending full document.
• Time to annotate 1 article: Mean 1.4s, Median
0.3s.
• 2x - 7x improvement.
25. Optimization #2: Tune Solr GC
• OOB Solr would GC very frequently, slowing
down Spark and causing timeouts.
• Current Index Size: 2.1 GB
• Need to size box so approximately 75% RAM
given to OS and remaining 25% allocated to
Solr (Uwe Schindler's Blog Post).
• Heap size should be 3-4x index size (Internal
Guideline).
• Current Solr Heap Size = 8 GB
• RAM is 30.5 GB
• CMS (Concurrent Mark-Sweep) Garbage
Collection.
26. Optimization #3: Larger Spark Cluster
• Running on cluster Master + 4 Workers.
• Each worker has 1 Executor with 4 Cores.
• Number of simultaneous Solr clients = 16 (4
workers * 1 executor * 4 cores) – measured
with lsof –p in a loop on Solr server.
• Throughput increases with number of
partitions till about 2x the number of worker
cores.
• Best throughput 5 docs/sec with #-
partitions=30 for 16 cores.
28. Performance – Did we meet expectations?
• At 26 docs/sec and 14M documents, it will take our current cluster little over 6 days to annotate
against our largest dictionary (8M entries).
• Throughput scales linearly @ 1.5 docs/sec per additional worker, as long as Solr servers have
enough capacity to serve requests.
• Each Solr box (as configured) can serve sustained loads of up to 30-35 simultaneous requests.
• Number of simultaneous requests approximately equal to number of worker cores.
• Example: annotate 14M documents in 3 days.
– Throughput required: 14M / (3 * 86400) = 54 docs/s
– Number of workers: 54 / 1.5 = 36 workers
– Number of simultaneous requests (4 cores/worker) = 36 * 4 = 144
– Number of Solr servers: 144 / 32 = 4.5 = 5 servers
29. Future Work
• More Lexicons
• Investigate Lexicon-Centric scale out.
– Allows more lexicons.
– Not limited to single index.
• Move to Lucene, eliminate network
overhead.
– Asynchronous model
– Use Kafka topic with multiple partitions
– Lucene based tagging consumers
– Write output to S3.