The document describes dictionary-based named entity extraction from streaming text. It surveys named entity recognition approaches (regular-expression-based, dictionary-based, and model-based), then describes the SoDA v.2 architecture for scalable dictionary-based named entity extraction, including the Aho-Corasick algorithm, SolrTextTagger, and the services provided. Finally, it outlines future work on improving the system.
2. Agenda
• Introduction
• The Entity Resolution Problem
• Named Entity Recognition/Extraction (NER)
• SoDA v.2 Architecture
• SoDA v.2 Services
• Future Work
• Conclusion
3. Introduction
• About Me
• Work at Elsevier Labs
• Interested in Search, NLP and Machine Learning
• Email: sujit.pal@elsevier.com
• Twitter: @palsujit
• About Elsevier Labs
• Advanced Technology Group within Elsevier
• More info: https://labs.elsevier.com
• About Elsevier
• World’s largest publisher of STM books and journals
• Uses data to inform and enable consumers of STM Info
4. The Entity Resolution Problem
• Named Entity Recognition/Extraction – recognize mentions of named entities in text.
• Named Entity Resolution – resolve each mention to its root entity.
Hillary Clinton and Bill Clinton visited a diner during Clinton’s 2016 presidential campaign.
[Hillary Clinton]PERSON and [Bill Clinton]PERSON visited a [diner]LOCATION during [Clinton]PERSON’s [2016 presidential campaign]EVENT.
5. Approaches to NER
• Three major approaches
• Regular Expression (RegEx) Based
• Dictionary Based
• Model Based
• Hybrid approaches
• Combining Approaches
• Data Programming
• Active Learning
6. RegEx based NER
Pierre Vinken , 61 years old , will join the board as a nonexecutive director Nov. 29 .
• PERSON: ([A-Z][a-z]+){2,3}
• AGE: (\d){1,3}\syears\sold
• DATE: ([A-Z][a-z]{2}(\.)*)\s(\d{2})
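To make this concrete, here is a minimal Python sketch of regex-based NER. The patterns are lightly adapted from the slide (whitespace added to PERSON so it fires on multi-word names in the tokenized example):

```python
import re

# Regex-based NER sketch: the label is whichever pattern matched the span.
patterns = {
    "PERSON": r"(?:[A-Z][a-z]+ ){2,3}",   # 2-3 capitalized words
    "AGE":    r"\d{1,3}\syears\sold",
    "DATE":   r"[A-Z][a-z]{2}(\.)*\s\d{2}",
}

text = ("Pierre Vinken , 61 years old , will join the board "
        "as a nonexecutive director Nov. 29 .")
for label, pat in patterns.items():
    for m in re.finditer(pat, text):
        print(label, repr(m.group(0)), m.span())
```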
7. Dictionary Based NER
Pierre Vinken , 61 years old , will join the board as a nonexecutive director Nov. 29 .
• PERSON – dictionary of names of famous people
• DATE – dictionary of month names and abbreviations
8. Dictionary based NER – 3rd Party S/W
• Open Source
• GATE (General Architecture for Text Engineering)
• pyahocorasick
• SoDA (SOlr Dictionary Annotator)
• Commercial / Open Source
• LingPipe
10. Model based NER – Sequence Models
• Typical model structure
• Input – a sentence s, i.e. a sequence of words {x_0, x_1, …, x_n}.
• Output – a sequence Y = {y_0, y_1, …, y_n} of IOB tags.
• Hidden Markov Models – each IOB tag depends on the input variable and the previous label.
• Conditional Random Fields – each IOB tag depends on features {f_0, f_1, …, f_m} with learned weights {λ_0, λ_1, …, λ_m}, defined over the current word x_i, the current label y_i, the previous label y_{i-1}, and the entire sentence s.
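For reference, the standard linear-chain CRF formulation this describes (not shown on the slide) is:

$$ P(Y \mid s) = \frac{1}{Z(s)} \exp\Big( \sum_{i=1}^{n} \sum_{j=0}^{m} \lambda_j \, f_j(y_{i-1}, y_i, s, i) \Big) $$

where Z(s) normalizes over all possible tag sequences.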
11. Model based NER – Sequence Models (2)
• The family of Deep Learning sequence models has been used for POS tagging, phrase chunking, NER, and even language translation.
• Feature vectors for words are created using word embeddings (word2vec, GloVe, fastText, etc.).
• Performance can be improved with attention mechanisms.
• Represents the state of the art for Named Entity Recognition.
• Needs lots of data to train.
[Diagram: seq2seq tagger with an LSTM encoder reading inputs x_0, x_1, …, x_n (plus EOS) and an LSTM decoder emitting tags y_0, y_1, …, y_n.]
12. Model based NER – 3rd party S/W
• Open Source
• GATE
• Apache OpenNLP
• Stanford NER (has NLTK plugin)
• SpaCy NER
• NERDS
• Commercial
• Basis Technology Rosette Entity Extractor
• IBM Watson / Alchemy API
• Amazon Comprehend
• Azure Named Entity Recognition
13. Hybrid Approaches – combinations
• Create an initial labeled dataset by harvesting entities from large text corpora using one or more of the following:
• Weak supervision – RegEx and other pattern matching (e.g. Hearst patterns for phrases).
• Distant supervision – matching against dictionaries derived from industry-specific (public or private) ontologies.
• Unsupervised – legacy rule-based models.
• Supervised – predictions from weaker models.
• Crowdsourcing – using human experts.
• Train a powerful seq2seq model using the labeled dataset.
• Refine using human-in-the-loop active learning or other techniques.
14. Data Programming - Snorkel
• Start with noisy labels L from various sources.
• Train a generative model capable of generating probabilities P for each of the output classes based on a feature vector of the noisy labels.
• Train a final noise-aware discriminative model, with the output P of the generative model and the original data X, to predict class labels Q for the data.
• The Snorkel project (https://hazyresearch.github.io/snorkel/) pioneered this approach and provides tooling for all these steps.
Image Credit: Snorkel Project
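A toy Python sketch of the data-programming idea (illustrative only; this is not Snorkel's actual API, and the labeling functions are made up):

```python
import numpy as np

ABSTAIN, NEG, POS = -1, 0, 1

def lf_contains_md(x):   # noisy label source 1
    return POS if " MD" in x else ABSTAIN

def lf_starts_dr(x):     # noisy label source 2
    return POS if x.startswith("Dr.") else ABSTAIN

def lf_all_lower(x):     # noisy label source 3
    return NEG if x.islower() else ABSTAIN

lfs = [lf_contains_md, lf_starts_dr, lf_all_lower]

def label_matrix(examples):
    # Each row holds one example's noisy votes from all label sources.
    return np.array([[lf(x) for lf in lfs] for x in examples])

def soft_labels(L):
    """Crude stand-in for the generative model: per-example vote proportions."""
    probs = []
    for row in L:
        votes = row[row != ABSTAIN]
        p_pos = 0.5 if len(votes) == 0 else (votes == POS).mean()
        probs.append([1 - p_pos, p_pos])
    return np.array(probs)

examples = ["Dr. Jane Smith", "John Doe MD", "a diner in ohio"]
# These probabilities P become soft targets for the discriminative model.
print(soft_labels(label_matrix(examples)))
```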
15. SoDA v.2 Architecture
• Theoretical Foundations
• Aho-Corasick algorithm
• SolrTextTagger
• SoDA Architecture
• Scaling SoDA
16. Aho-Corasick Algorithm
• Implements a data structure called a “trie”, augmented with failure links.
• A state machine over characters.
• Dictionary-based NERs implement a similar state machine over the words in phrases.
Image Credit: ResearchGate
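A minimal sketch using the pyahocorasick library mentioned earlier (the lexicon entries are made up for illustration):

```python
import ahocorasick  # pip install pyahocorasick

# Build the automaton once from dictionary entries, then stream text through it.
lexicon = {"hillary clinton": "PERSON", "bill clinton": "PERSON",
           "presidential campaign": "EVENT"}

A = ahocorasick.Automaton()
for phrase, tag in lexicon.items():
    A.add_word(phrase, (tag, phrase))
A.make_automaton()  # compiles the trie plus failure links

text = ("Hillary Clinton and Bill Clinton visited a diner during "
        "Clinton's 2016 presidential campaign.").lower()
for end, (tag, phrase) in A.iter(text):  # yields (end index, stored value)
    start = end - len(phrase) + 1
    print(tag, phrase, (start, end))
# Note: this is pure substring matching; token-aware matching is one reason
# to prefer a SolrTextTagger-style approach for words-in-phrases dictionaries.
```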
17. SolrTextTagger
• Lucene’s TokenStreams are finite state automata (FSA).
• SolrTextTagger (https://github.com/OpenSextant/SolrTextTagger) dynamically builds FSAs from dictionary entries into a Finite State Transducer (FST) data structure.
• Provides a tag service to annotate incoming streaming text against the FST.
• Input is text; output is the matched dictionary entries and their offsets into the text.
• SolrTextTagger is OSS created by Lucene/Solr committer David Smiley.
Image Credit: Slides for Automata Invasion talk by Michael McCandless and Robert Muir
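A sketch of calling the tagger request handler directly over HTTP (the host, port, and collection name are assumptions; `overlaps`, `tagsLimit`, and `fl` are standard tagger parameters):

```python
import requests

# POST raw text to the tag handler; it returns matched entries with offsets.
resp = requests.post(
    "http://localhost:8983/solr/names/tag",  # hypothetical collection "names"
    params={"overlaps": "NO_SUB", "tagsLimit": 5000, "fl": "id,name", "wt": "json"},
    headers={"Content-Type": "text/plain"},
    data="Hillary Clinton and Bill Clinton visited a diner.",
)
for tag in resp.json()["tags"]:
    # Each tag carries start/end offsets and the ids of matching dictionary docs
    # (exact response layout abbreviated here).
    print(tag)
```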
18. Architecture
• Co-located with a standalone Solr server.
• Scala-based thin wrapper over SolrTextTagger.
• Provides the following services:
• unified JSON over HTTP request/response
• multiple matching styles
• multiple lexicons
• hides the details of managing SolrTextTagger
• Streaming (text) and non-streaming (phrase) matching services.
• Programmatic APIs for Scala and Python.
19. Scaling
• Install and configure Solr, SolrTextTagger, and SoDA, and create an AMI.
• Use CloudFormation (or Terraform) templates to instantiate a cluster of Solr+SoDA instances behind an Elastic Load Balancer.
• Autoscaling cluster, monitored by CloudWatch.
• New dictionaries are loaded by instantiating an EC2 instance from the AMI via Lambda, and saved back into the AMI for the next cluster build.
[Diagram: client and loader paths into the load-balanced Solr+SoDA cluster.]
20. Consuming Annotations at scale
• Synchronous – a Databricks notebook reads documents from S3, sends them to the SoDA cluster, and writes the annotations back to S3 as Parquet.
• Asynchronous – a producer reads documents from S3 onto Kafka/Kinesis streams; a consumer sends them to the SoDA cluster and writes the annotations back to S3 as Parquet.
21. SoDA Services
• Bulk Loader (backend)
• Client facing (front end)
• Index (status check)
• Add New Record into Lexicon
• Delete Lexicon or Entry
• Annotate Text against Lexicon
• List Available Lexicons
• Find coverage of incoming text against Lexicons
• Lookup by ID
• Reverse Lookup by Phrase
22. SoDA Bulk Loader
• Multithreaded loader for bulk-loading dictionaries into SoDA.
• Requires a tab-separated file, one line per dictionary entry, in the following format:
• id {TAB} primary-name {PIPE} alt-name-1 {PIPE} ... {PIPE} alt-name-n
• Script to run (on SoDA/Solr box).
• ./bulk_load.sh lexicon /path/to/input num_workers
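For illustration, a hypothetical two-entry lexicon file in that format (the separator after the id is a tab), with a corresponding loader invocation (lexicon name, path, and worker count are made up):

```
$ head -2 diseases.tsv
MESH:D003920	diabetes mellitus|diabetes|DM
MESH:D006973	hypertension|high blood pressure|HTN
$ ./bulk_load.sh diseases /path/to/diseases.tsv 4
```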
23. SoDA Health Check – index.json
• Returns a status message. Meant to be used for testing if the SoDA application is up.
• Python client code, Scala client code, and output (a Python sketch follows)
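A minimal Python client sketch for the health check (the host, port, and exact response payload are assumptions):

```python
import requests

# Health check: GET /soda/index.json; any 200 response means the service is up.
resp = requests.get("http://localhost:8080/soda/index.json")
resp.raise_for_status()
print(resp.json())  # assumed: a small JSON status payload
```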
24. Annotate Text against Lexicon – annot.json
• Annotates text against a specific lexicon and match type.
• Match types can be one of the following:
• exact – matches text spans against dictionary entries exactly.
• lower – same as exact, but matching is case-insensitive.
• stop – same as lower, but stop words are removed from both text and dictionary entries.
• stem1 – same as stop, but stemmed with the Solr minimal English stemmer.
• stem2 – same as stop, but stemmed with the Solr KStem stemmer.
• stem3 – same as stop, but stemmed with the Solr Porter stemmer.
• Input (HTTP POST; a Python sketch follows the next slide)
25. Annotate Text against Lexicon (2)
• Python client code, Scala client code, and output (sketched below)
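A Python sketch of annotating text against a lexicon (the payload field names and response shape are illustrative; consult the SoDA docs for the exact schema):

```python
import requests

payload = {
    "lexicon": "countries",   # hypothetical lexicon name
    "text": "Pierre Vinken will fly from France to the United States on Nov. 29.",
    "matchflag": "stem1",     # one of: exact, lower, stop, stem1, stem2, stem3
}
resp = requests.post("http://localhost:8080/soda/annot.json", json=payload)
for annot in resp.json():  # assumed: entries with id, begin/end offsets, covered text
    print(annot)
```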
26. List Available Lexicons – dicts.json
• Returns a list of lexicons available to annotate against.
• Python client, Scala client, and output (sketched below)
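A Python sketch for listing lexicons (response shape assumed):

```python
import requests

# List the lexicons hosted by this SoDA instance.
resp = requests.get("http://localhost:8080/soda/dicts.json")
print(resp.json())  # assumed: e.g. ["countries", "diseases", ...]
```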
27. Check Coverage – coverage.json
• This can be used to find out which lexicons are appropriate for annotating your text. The service lets you send a piece of text to all hosted lexicons and returns the number of matches found in each.
• Input (HTTP POST), Python client, and Scala client (a sketch follows the next slide)
28. Check Coverage (2)
• Output
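A Python sketch of the coverage check (payload field names and response shape are illustrative):

```python
import requests

# How many matches does each hosted lexicon find in this text?
payload = {"text": "Hypertension and diabetes are common comorbidities."}
resp = requests.post("http://localhost:8080/soda/coverage.json", json=payload)
print(resp.json())  # assumed: per-lexicon match counts, e.g. {"diseases": 2, ...}
```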
29. Lookup by ID – lookup.json
• Allows looking up a dictionary entry by lexicon and ID.
• Input (HTTP POST), Python client, and Scala client (a sketch follows the next slide)
30. Lookup by ID (2)
• Output
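A Python sketch of the lookup-by-ID service (the lexicon name, id value, and payload fields are hypothetical):

```python
import requests

# Look up a single dictionary entry by lexicon and entry id.
payload = {"lexicon": "countries", "id": "http://www.geonames.org/CHN"}
resp = requests.post("http://localhost:8080/soda/lookup.json", json=payload)
print(resp.json())  # assumed: the entry's primary name and alternate names
```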
31. Reverse Lookup by Phrase
• Matches phrases against a specific lexicon and match type.
• Match types can be one of the following:
• all match types supported by the annotation service (annot.json)
• lsort – case-insensitive matching against the phrase with its words sorted alphabetically.
• s3sort – case-insensitive matching against the phrase stemmed using the Porter stemmer (stem3) and with its words sorted alphabetically.
• Input (a Python sketch follows the next slide)
32. Reverse Lookup by Phrase (2)
• Python client, Scala client, and output (sketched below)
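A Python sketch of the reverse lookup (the endpoint name is assumed, since the slide does not give it, and the payload fields are illustrative):

```python
import requests

# Reverse lookup: find lexicon entries matching a phrase.
payload = {"lexicon": "diseases",
           "phrase": "diabetes mellitus type 2",
           "matchflag": "s3sort"}  # or any annot.json match type, or lsort
resp = requests.post("http://localhost:8080/soda/rlookup.json", json=payload)
print(resp.json())
```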
33. Future Work
• A list of open items is kept on the SoDA issues page and continuously updated as I find them (https://github.com/elsevierlabs-os/soda/issues).
• Please feel free to post issues and ideas for improvement.