Named Entity Recognition (NER) is foundational for many downstream NLP tasks such as Information Retrieval, Relation Extraction, Question Answering, and Knowledge Base Construction. While many high-quality pre-trained NER models exist, they usually cover only a small set of popular entity types such as people, organizations, and locations. But what if we need to recognize domain-specific entities such as proteins, chemical names, or diseases? The Open Source Named Entity Recognition for Data Scientists (NERDS) toolkit, from the Elsevier Data Science team, was built to address this need.
NERDS aims to speed up development and evaluation of NER models by providing a set of NER algorithms that are callable through a familiar scikit-learn style API. The uniform interface allows reuse of code for data ingestion and evaluation, resulting in cleaner and more maintainable NER pipelines. In addition, NERDS is easy to customize: adding a new or more advanced NER model is just a matter of implementing a standard NER Model class.
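To make the idea of a scikit-learn style NER interface concrete, here is a minimal sketch. It is not the NERDS API: the class name `MajorityTagNER`, the BIO tag names, and the toy data are all hypothetical, and the "model" simply memorizes the most frequent tag per token. It only illustrates the `fit` / `predict` shape that lets ingestion and evaluation code be shared across models.

```python
from collections import Counter, defaultdict


class MajorityTagNER:
    """Toy NER model (hypothetical, not part of NERDS).

    Tags each token with the label it most often carried in training,
    and falls back to "O" (outside any entity) for unseen tokens.
    """

    def fit(self, X, y):
        # X: list of token lists, y: list of matching BIO tag lists.
        counts = defaultdict(Counter)
        for tokens, tags in zip(X, y):
            for token, tag in zip(tokens, tags):
                counts[token.lower()][tag] += 1
        self.tag_for_ = {tok: c.most_common(1)[0][0] for tok, c in counts.items()}
        return self

    def predict(self, X):
        # Returns one tag list per input token list.
        return [[self.tag_for_.get(tok.lower(), "O") for tok in tokens] for tokens in X]


# Hypothetical training data with made-up entity types.
X_train = [["Aspirin", "inhibits", "COX-1"], ["BRCA1", "mutations", "cause", "cancer"]]
y_train = [["B-CHEM", "O", "B-PROT"], ["B-PROT", "O", "O", "B-DIS"]]

model = MajorityTagNER().fit(X_train, y_train)
predicted = model.predict([["aspirin", "and", "BRCA1"]])
print(predicted)
```

Because every model exposes the same two methods, swapping in a different algorithm should only require changing the line that constructs `model`.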
Our presentation will describe the main features of NERDS, then walk through a demonstration of developing and evaluating NER models that recognize biomedical entities. We will then describe a neural network based NER algorithm (a Bi-LSTM seq2seq model written in PyTorch) and integrate it into the NERDS NER pipeline.
We believe NERDS addresses a real need for building domain-specific NER models quickly and efficiently. NER is an active field of research, and we hope this presentation will spark interest and contributions of new NER algorithms and Data Adapters from the community, which can in turn help move the field forward.
Information Extraction, Named Entity Recognition, NER, text analytics, text mining, e-discovery, unstructured data, structured data, calendaring, standard evaluation per entity, standard evaluation per token, sequence classifier, sequence labeling, word shapes, semantic analysis in language technology
This presentation explains what skip-gram and CBOW are. It was given at a seminar of the Natural Language Processing Labs.
- How to build word vectors using skip-gram and CBOW.
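As a rough illustration of how the two schemes differ, the sketch below builds the training examples each one would see from a tokenized sentence. The function names and the window size are assumptions made for this example, and the neural network that actually learns the vectors is left out. Skip-gram predicts each context word from the center word, while CBOW predicts the center word from its whole context.

```python
def context_windows(tokens, window=2):
    """Yield (center, context) for every position, with up to `window` words per side."""
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        yield center, context


def skipgram_pairs(tokens, window=2):
    """Skip-gram: one (input=center, target=context word) pair per context word."""
    return [(center, ctx) for center, context in context_windows(tokens, window) for ctx in context]


def cbow_examples(tokens, window=2):
    """CBOW: one (input=all context words, target=center) example per position."""
    return [(context, center) for center, context in context_windows(tokens, window)]


sentence = "the quick brown fox".split()
sg = skipgram_pairs(sentence, window=1)
cb = cbow_examples(sentence, window=1)
print(sg)
print(cb)
```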
The Transformer is an established architecture in natural language processing that applies self-attention within a deep learning framework.
This presentation was delivered under the mentorship of Mr. Mukunthan Tharmakulasingam (University of Surrey, UK), as a part of the ScholarX program from Sustainable Education Foundation.
BERT: Bidirectional Encoder Representations from Transformers (Liangqun Lu)
BERT was developed by Google AI Language and released in October 2018. It achieved state-of-the-art results on many NLP tasks, so if you are interested in NLP, studying BERT is a good place to start.
Deep Natural Language Processing for Search and Recommender Systems (Huiji Gao)
Tutorial for KDD 2019:
Search and recommender systems process rich natural language text data such as user queries and documents. Achieving high-quality search and recommendation results requires processing and understanding such information effectively and efficiently, and natural language processing (NLP) technologies are widely deployed for this purpose. In recent years, rapidly developing deep learning models have proven successful at improving various NLP tasks, indicating their great potential for improving search and recommender systems.
In this tutorial, we summarize current efforts in deep learning for NLP in search/recommender systems. We first give an overview of search/recommender systems with NLP, then introduce the basic concepts of deep learning for NLP, covering state-of-the-art technologies in both language understanding and language generation. After that, we share our hands-on experience with LinkedIn applications. Finally, we highlight several important future trends.
BERT: Bidirectional Encoder Representations from Transformers.
BERT is a pretrained model from Google for state-of-the-art NLP tasks.
BERT can take into account both the syntactic and semantic meaning of text.
An introduction to the Transformers architecture and BERT (Suman Debnath)
The transformer is one of the most popular state-of-the-art (SOTA) deep learning architectures, used mostly for natural language processing (NLP) tasks. Since its advent, the transformer has replaced RNNs and LSTMs for various tasks. It was a major breakthrough in NLP and paved the way for new architectures such as BERT.
Basic concepts of deep learning, explaining its structure and the backpropagation method, and an introduction to autograd in PyTorch (plus data parallelism in PyTorch).
Natural Language Processing (NLP) is often taught at the academic level from the perspective of computational linguists. However, as data scientists, we have a richer view of the world of natural language - unstructured data that by its very nature has important latent information for humans. NLP practitioners have benefitted from machine learning techniques to unlock meaning from large corpora, and in this class we’ll explore how to do that particularly with Python, the Natural Language Toolkit (NLTK), and to a lesser extent, the Gensim Library.
NLTK is an excellent library for machine learning-based NLP, written in Python by experts from both academia and industry. Python allows you to create rich data applications rapidly, iterating on hypotheses. Gensim provides vector-based topic modeling, which is currently absent in both NLTK and Scikit-Learn. The combination of Python + NLTK means that you can easily add language-aware data products to your larger analytical workflows and applications.
Learning to rank (LTR) for information retrieval (IR) involves the application of machine learning models to rank artifacts, such as items to be recommended, in response to a user's need. LTR models typically employ training data, such as human relevance labels and click data, to discriminatively train towards an IR objective. The focus of this tutorial will be on the fundamentals of neural networks and their applications to learning to rank.
Recurrent Neural Networks have been shown to be very powerful models because they can propagate context over several time steps. This makes them effective for several problems in Natural Language Processing, such as Language Modelling, Tagging problems, and Speech Recognition. In this presentation we introduce the basic RNN model and discuss the vanishing gradient problem. We describe LSTM (Long Short Term Memory) and Gated Recurrent Units (GRU). We also discuss Bidirectional RNN with an example. RNN architectures can be considered deep learning systems in which the number of time steps plays the role of network depth. It is also possible to build an RNN with multiple hidden layers, each having recurrent connections from the previous time steps, so that the network represents abstraction in both time and space.
General background and conceptual explanation of word embeddings (word2vec in particular). Mostly aimed at linguists, but also understandable for non-linguists.
Leiden University, 23 March 2018
SoDA v2 - Named Entity Recognition from streaming text (Sujit Pal)
Covers the services supported by SoDA v2. Includes some background on Named Entity Recognition and Resolution, popular approaches to Named Entity Recognition, hybrid approaches, scaling SoDA using Spark and Spark streaming, deployment strategies, etc.
Using Static Binary Analysis To Find Vulnerabilities And Backdoors in Firmware (Lastline, Inc.)
Over the last few years, as the world has moved closer to realizing the idea of the Internet of Things, an increasing number of the analog things with which we used to interact every day have been replaced with connected devices. The increasingly-complex systems that drive these devices have one thing in common – they must all communicate to carry out their intended functionality. Such communication is handled by firmware embedded in the device. And firmware, like any piece of software, is susceptible to a wide range of errors and vulnerabilities.
Evgeniy Bobrov "Powered by OSS. Scalable stream processing and analysis of ... (Fwdays)
Open source technologies such as Microsoft Orleans and ElasticSearch are key elements of the YouScan architecture. In this talk I will describe how they help us cope with constantly growing volumes of social media data, and how the YouScan architecture has evolved.
Grammarly AI-NLP Club #6 - Sequence Tagging using Neural Networks - Artem Che... (Grammarly)
Speaker: Artem Chernodub, Chief Scientist at Clikque Technology and Associate Professor at Ukrainian Catholic University
Summary: Sequence Tagging is an important NLP problem that has several applications, including Named Entity Recognition, Part-of-Speech Tagging, and Argument Component Detection. In our talk, we will focus on a BiLSTM+CNN+CRF model — one of the most popular and efficient neural network-based models for tagging. We will discuss task decomposition for this model, explore the internal design of its components, and provide the ablation study for them on the well-known NER 2003 shared task dataset.
Nltk natural language toolkit overview and application @ PyCon.tw 2012 (Jimmy Lai)
These slides introduce a Python toolkit for Natural Language Processing (NLP). The author introduces several useful topics in NLTK and demonstrates them with code examples.
The presentation was given at a seminar of the InfinIT interest group High-Level Languages for Embedded Systems on 2 October 2013. Read more about the interest group here: http://infinit.dk/dk/interessegrupper/hoejniveau_sprog_til_indlejrede_systemer/hoejniveau_sprog_til_indlejrede_systemer.htm
Next-generation sequencing data format and visualization with ngs.plot 2015 (Li Shen)
An introduction to commonly used formats for next-generation sequencing (NGS) data. ngs.plot is a popular tool for visualizing and mining NGS data.
Why you should care about data layout in the file system with Cheng Lian and ... (Databricks)
Efficient data access is one of the key factors for having a high performance data processing pipeline. Determining the layout of data values in the filesystem often has fundamental impacts on the performance of data access. In this talk, we will show insights on how data layout affects the performance of data access. We will first explain how modern columnar file formats like Parquet and ORC work and explain how to use them efficiently to store data values. Then, we will present our best practice on how to store datasets, including guidelines on choosing partitioning columns and deciding how to bucket a table.
The openCypher Project - An Open Graph Query Language (Neo4j)
We want to present the openCypher project, whose purpose is to make Cypher available to everyone – every data store, every tooling provider, every application developer. openCypher is a continual work in progress. Over the next few months, we will move more and more of the language artifacts over to GitHub to make it available for everyone.
openCypher is an open source project that delivers four key artifacts released under a permissive license: (i) the Cypher reference documentation, (ii) a Technology compatibility kit (TCK), (iii) Reference implementation (a fully functional implementation of key parts of the stack needed to support Cypher inside a data platform or tool) and (iv) the Cypher language specification.
We are also seeking to make the process of specifying and evolving the Cypher query language as open as possible, and are actively seeking comments and suggestions on how to improve the Cypher query language.
The purpose of this talk is to provide more details regarding the above-mentioned aspects.
Supporting Concept Search using a Clinical Healthcare Knowledge Graph (Sujit Pal)
We describe our Dictionary based Named Entity Recognizer and Semantic Matcher that enables us to leverage our Knowledge Graph to provide Concept Search. We also describe our Named Entity Linking based Concept Recommender to support manual curation of our Knowledge Graph.
Youtube URL for talk: https://youtu.be/5UWrS_j8dDg
Google AI Hackathon: LLM based Evaluator for RAG (Sujit Pal)
Slides accompanying the project submission video for the Google AI Hackathon. Describes an LCEL and DSPy based evaluation framework inspired by the RAGAS project.
Accompanying video URL: https://youtu.be/yOIU65chc98
Building Learning to Rank (LTR) search reranking models using Large Language ... (Sujit Pal)
Search engineers have many tools to address relevance. Older tools are typically unsupervised (statistical, rule based) and require large investments in manual tuning effort. Newer ones involve training or fine-tuning machine learning models and vector search, which require large investments in labeling documents with their relevance to queries.
Learning to Rank (LTR) models are in the latter category. However, their popularity has traditionally been limited to domains where user data can be harnessed to generate labels that are cheap and plentiful, such as e-commerce sites. In domains where this is not true, labeling often involves human experts, and results in labels that are neither cheap nor plentiful. This effectively becomes a roadblock to adoption of LTR models in these domains, in spite of their effectiveness in general.
Generative Large Language Models (LLMs) with parameters in the 70B+ range have been found to perform well at tasks that require mimicking human preferences. Labeling query-document pairs with relevance judgements for training LTR models is one such task. Using LLMs for this task opens up the possibility of obtaining a potentially unlimited number of query judgment labels, and makes LTR models a viable approach to improving the site’s search relevancy.
In this presentation, we describe work that was done to train and evaluate four LTR based re-rankers against lexical, vector, and heuristic search baselines. The models were a mix of pointwise, pairwise and listwise, and required different strategies to generate labels for them. All four models outperformed the lexical baseline, and one of the four models outperformed the vector search baseline as well. None of the models beat the heuristics baseline, although two came close. However, it is important to note that the heuristics were built up over months of trial and error and required familiarity with the search domain, whereas the LTR models were built in days and required much less familiarity.
The ability to handle long question style queries is often de rigueur for modern search engines. Search giants such as Bing and Google are addressing this by building Large Language Models (LLMs) into their search pipelines. Unfortunately, this approach requires large investments in infrastructure and involves high operational costs. It can also lead to loss of confidence when the LLM hallucinates non-factual answers.
A best practice for designing search pipelines is to make the search layer as cheap and fast as possible, and move heavyweight operations into the indexing layer. With that in mind, we present an approach that combines the use of LLMs during indexing to generate questions from passages, and matching them to incoming questions during search, using either text based or vector based matching. We believe this approach can provide good quality question answering capabilities for search applications and address the cost and confidence issues mentioned above.
Vector search goes far beyond just text, and, in this interactive workshop, you will learn how to use it for multimodal search through an in-depth look at CLIP, a vision and language model, developed by OpenAI. Sujit Pal, technology research director at Elsevier, and Raphael Pisoni, senior computer vision engineer at Partium.io, will walk you through two applications of image search and then have a panel discussion with our staff developer advocate, James, on how to use CLIP for image and text search.
Learning a Joint Embedding Representation for Image Search using Self-supervi... (Sujit Pal)
Image search interfaces either prompt the searcher to provide a search image (image-to-image search) or a text description of the image (text-to-image search). Image to Image search is generally implemented as a nearest neighbor search in a dense image embedding space, where the embedding is derived from Neural Networks pre-trained on a large image corpus such as ImageNet. Text to image search can be implemented via traditional (TF/IDF or BM25 based) text search against image captions or image tags.
In this presentation, we describe how we fine-tuned the OpenAI CLIP model (available from Hugging Face) to learn a joint image/text embedding representation from naturally occurring image-caption pairs in literature, using contrastive learning. We then show this model in action against a dataset of medical image-caption pairs, using the Vespa search engine to support text based (BM25), vector based (ANN) and hybrid text-to-image and image-to-image search.
The power of community: training a Transformer Language Model on a shoestring (Sujit Pal)
I recently participated in a community event to train an ALBERT language model for the Bengali language. The event was organized by Neuropark, Hugging Face, and Yandex Research. The training was done collaboratively in a distributed manner using free GPU resources provided by Colab and Kaggle. Volunteers were recruited on Twitter and project coordination happened on Discord. At its peak, there were approximately 50 volunteers from all over the world simultaneously engaged in training the model. The distributed training was done on the Hivemind platform from Yandex Research, and the software to train the model in a data-parallel manner was developed by Hugging Face. In this talk I provide my perspective of the project as a somewhat curious participant. I will describe the Hivemind platform, the training regimen, and the evaluation of the language model on downstream tasks. I will also cover some challenges we encountered that were peculiar to the Bengali language (and Indic languages in general).
Accelerating NLP with Dask and Saturn Cloud (Sujit Pal)
Slides for talk delivered at NY NLP Meetup. Abstract -- Python has a great ecosystem of tools for natural language processing (NLP) pipelines, but challenges arise when data sizes and computational complexity grow. Best case, a pipeline is left to run overnight or even over several days. Worst case, certain analyses or computations are just not possible. Dask is a Python-native parallel processing tool that enables Python users to easily scale their code across a cluster of machines. This talk presents an example of an NLP entity extraction pipeline using SciSpaCy with Dask for parallelization. This pipeline extracts named entities from the CORD-19 dataset, using trained models from the SciSpaCy project, and makes them available for downstream tasks in the form of structured Parquet files. The pipeline was built and executed on Saturn Cloud, a platform that makes it easy to launch and manage Dask clusters. The talk will present an introduction to Dask and explain how users can easily accelerate Python and NLP code across clusters of machines.
Accelerating NLP with Dask on Saturn Cloud: A case study with CORD-19 (Sujit Pal)
Python has a great ecosystem of tools for natural language processing (NLP) pipelines, but challenges arise when data sizes and computational complexity grow. Best case, a pipeline is left to run overnight or even over several days. Worst case, certain analyses or computations are just not possible. Dask is a Python-native parallel processing tool that enables Python users to easily scale their code across a cluster of machines.
This talk presents an example of an NLP entity extraction pipeline using SciSpaCy with Dask for parallelization, which was built and executed on Saturn Cloud. Saturn Cloud is an end-to-end data science and machine learning platform that provides an easy interface for Python environments and Dask clusters, removing many barriers to accessing parallel computing. This pipeline extracts named entities from the CORD-19 dataset, using trained models from the SciSpaCy project, and makes them available for downstream tasks in the form of structured Parquet files. We will provide an introduction to Dask and Saturn Cloud, then walk through the NLP code.
Leslie Smith's Papers discussion for DL Journal Club (Sujit Pal)
This deck discusses two papers by Dr Leslie Smith. The first paper discusses empirical findings around learning rate (LR) and other regularization parameters for neural networks, and leads to the idea of Cyclic Learning Rates (CLR). The second paper discusses CLR in depth, as well as how to estimate its parameters. The slides also cover LR Finder, a tool first introduced in the Fast.AI library to find optimal parameters for CLR, including how to run it and interpret its outputs.
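For readers unfamiliar with CLR, the sketch below computes a triangular cyclic schedule, in which the learning rate climbs linearly from a lower bound to an upper bound and back once per cycle. The function name, the bounds, and the step size are illustrative assumptions, not values taken from the deck. The formula follows the triangular policy as it is commonly stated.

```python
import math


def triangular_clr(iteration, base_lr, max_lr, step_size):
    """Triangular cyclic learning rate.

    step_size is the number of iterations in half a cycle, so the rate
    goes base_lr -> max_lr -> base_lr over 2 * step_size iterations.
    """
    cycle = math.floor(1 + iteration / (2 * step_size))
    # x runs 1 -> 0 -> 1 within each cycle (distance from the peak).
    x = abs(iteration / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)


# Hypothetical bounds and step size, chosen only for illustration.
schedule = [triangular_clr(it, base_lr=0.001, max_lr=0.006, step_size=4) for it in range(9)]
for it, lr in enumerate(schedule):
    print(it, round(lr, 5))
```

In practice the two bounds are usually picked from an LR range test such as the LR Finder mentioned above, rather than set by hand.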
Using Graph and Transformer Embeddings for Vector Based Retrieval (Sujit Pal)
For the longest time, term-based vector representations based on whole-document statistics, such as TF-IDF, have been the staple of efficient and effective information retrieval. The popularity of Deep Learning over the past decade has resulted in the development of many interesting embedding schemes. Like term-based vector representations, these embeddings depend on structure implicit in language and user behavior. Unlike them, they leverage the distributional hypothesis, which states that the meaning of a word is determined by the context in which it appears. These embeddings have been found to better encode the semantics of the word, compared to term-based representations. Despite this, it has only recently become practical to use embeddings in Information Retrieval at scale.
In this presentation, we will describe how we applied two new embedding schemes to Scopus, Elsevier’s broad coverage database of scientific, technical, and medical literature. Both schemes are based on the distributional hypothesis but come from very different backgrounds. The first embedding is a graph embedding called node2vec, that encodes papers using citation relationships between them as specified by their authors. The second embedding leverages Transformers, a recent innovation in the area of Deep Learning, that are essentially language models trained on large bodies of text. These two embeddings exploit the signal implicit in these data sources and produce semantically rich user and content-based vector representations respectively. We will evaluate these embedding schemes and describe how we used the Vespa search engine to search these embeddings for similar documents within the Scopus dataset. Finally, we will describe how RELX staff can access these embeddings for their own data science needs, independent of the search application.
Transformer Mods for Document Length Inputs (Sujit Pal)
The Transformer architecture is responsible for many state of the art results in Natural Language Processing. A central feature behind its superior performance over Recurrent Neural Networks is its multi-headed self-attention mechanism. However, the superior performance comes at a cost, an O(n^2) time and memory complexity, where n is the size of the input sequence. Because of this, it is computationally infeasible to feed large documents to the standard transformer. To overcome this limitation, a number of approaches have been proposed, which involve modifying the self-attention mechanism in interesting ways.
In this presentation, I will describe the transformer architecture, and specifically the self-attention mechanism, and then describe some of the approaches proposed to address the O(n^2) complexity. Some of these approaches have also been implemented in the HuggingFace transformers library, and I will demonstrate some code for doing document level operations using one of these approaches.
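To see where the quadratic cost comes from, the following sketch implements single-head scaled dot-product self-attention in plain Python. It is a simplification made for illustration: queries, keys, and values are all the raw input vectors, with no learned projections or multiple heads, and the function name is hypothetical. The point is that the score matrix has one entry per pair of positions, so its size grows as n x n.

```python
import math


def self_attention(X):
    """Single-head self-attention with Q = K = V = X (no learned projections).

    X is a list of n vectors of dimension d. Returns (output, weights),
    where weights is the n x n attention matrix.
    """
    n, d = len(X), len(X[0])
    # One score per (query position, key position) pair: n * n entries.
    scores = [
        [sum(a * b for a, b in zip(X[i], X[j])) / math.sqrt(d) for j in range(n)]
        for i in range(n)
    ]
    weights = []
    for row in scores:
        m = max(row)  # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in row]
        total = sum(exps)
        weights.append([e / total for e in exps])
    # Each output vector is a weighted average of all input vectors.
    output = [
        [sum(weights[i][j] * X[j][k] for j in range(n)) for k in range(d)]
        for i in range(n)
    ]
    return output, weights


X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
output, weights = self_attention(X)
print(len(weights), "x", len(weights[0]), "attention matrix")
```

Doubling the sequence length quadruples the number of entries in `weights`, which is the cost that the modified attention mechanisms try to avoid.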
Question Answering as Search - the Anserini Pipeline and Other Stories (Sujit Pal)
In the last couple of years, we have seen enormous breakthroughs in automated Open Domain Restricted Context Question Answering, also known as Reading Comprehension, where the task is to find an answer to a question from a single document or paragraph. A potentially more useful task is to find an answer for a question from a corpus representing an entire body of knowledge, also known as Open Domain Open Context Question Answering.
To do this, we adapted the BERTSerini architecture (Yang, et al., 2019), using it to answer questions about clinical content from our corpus of 5000+ medical textbooks. The BERTSerini pipeline consists of two components -- a BERT model fine-tuned for Question Answering, and an Anserini (Yang, Fang, and Lin, 2017) IR pipeline for Passage Retrieval. Anserini, in turn, consists of pluggable components for different kinds of query expansion and result reranking. Given a question, Anserini retrieves candidate passages, from which the BERT model extracts the answer. The best answer is determined using a combination of passage retrieval and answer scores.
Evaluating this system using a locally developed dataset of medical passages, questions, and answers, we adapted the BERT Question Answering component to our content using a combination of fine-tuning with third party SQuAD data, and pre-training the model using our medical content. However, when we replaced the canned passages with passages retrieved using the Anserini pipeline, performance dropped significantly, indicating that the relevance of the retrieved passages was a limiting factor.
The presentation will describe the actions taken to improve the relevance of passages returned by the Anserini pipeline.
Graph Techniques for Natural Language Processing (Sujit Pal)
Natural Language embodies the human ability to make “infinite use of finite means” (Humboldt, 1836; Chomsky, 1965). A relatively small number of words can be combined using a grammar in myriad different ways to convey all kinds of information. Languages model inter-relationships between their words, just like graphs model inter-relationships between their vertices. It is not surprising then, that graphs are a natural tool to study Natural Language and glean useful information from it, automatically, and at scale. This presentation will focus on NLP techniques to convert raw text to graphs, and present Graph Theory based solutions to some common NLP problems. Solutions presented will use Apache Spark or Neo4j depending on problem size and scale. Examples of Graph Theory solutions presented include PageRank for Document Summarization, Link Prediction from raw text for Knowledge Graph enhancement, Label Propagation for entity classification, and Random Walk techniques to find similar documents.
Learning to Rank Presentation (v2) at LexisNexis Search Guild (Sujit Pal)
An introduction to Learning to Rank, with case studies using RankLib with and without plugins provided by Solr and Elasticsearch. RankLib is a library of learning to rank algorithms, which includes some popular LTR algorithms such as LambdaMART, RankBoost, RankNet, etc.
Learning to Rank (LTR) presentation at RELX Search Summit 2018. Contains information about history of LTR, taxonomy of LTR algorithms, popular algorithms, and case studies of applying LTR using the TMDB dataset using Solr, Elasticsearch and without index support.
Search summit-2018-content-engineering-slides (Sujit Pal)
Slides accompanying content engineering tutorial presented at RELX Search Summit 2018. Contains techniques for keyword extraction using various statistical, rule based and machine learning methods, keyword de-duplication using SimHash and Dedupe, and dimensionality reduction techniques such as Topic Modeling, NMF, Word vectors, etc.
Evolving a Medical Image Similarity Search (Sujit Pal)
Slides for talk at Haystack Conference 2018. Covers evolution of an Image Similarity Search Proof of Concept built to identify similar medical images. Discusses various image vectorizing techniques that were considered in order to convert images into searchable entities, an evaluation strategy to rank these techniques, as well as various indexing strategies to allow searching for similar images at scale.
Embed, Encode, Attend, Predict – applying the 4 step NLP recipe for text clas... (Sujit Pal)
Slides for talk at PyData Seattle 2017 about Matthew Honnibal's 4-step recipe for Deep Learning NLP pipelines. Description of the stages in pipeline as well as 3 examples of document classification, document similarity and sentence similarity. Examples include Keras custom layers for different types of attention.
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2... (pchutichetpong)
M Capital Group (“MCG”) expects to see demand and the changing evolution of supply, facilitated through institutional investment rotation out of offices and into work from home (“WFH”), while the ever-expanding need for data storage as global internet usage expands, with experts predicting 5.3 billion users by 2023. These market factors will be underpinned by technological changes, such as progressing cloud services and edge sites, allowing the industry to see strong expected annual growth of 13% over the next 4 years.
Whilst competitive headwinds remain, represented through the recent second bankruptcy filing of Sungard, which blames “COVID-19 and other macroeconomic trends including delayed customer spending decisions, insourcing and reductions in IT spending, energy inflation and reduction in demand for certain services”, the industry has seen key adjustments, where MCG believes that engineering cost management and technological innovation will be paramount to success.
MCG reports that the more favorable market conditions expected over the next few years, helped by the winding down of pandemic restrictions and a hybrid working environment will be driving market momentum forward. The continuous injection of capital by alternative investment firms, as well as the growing infrastructural investment from cloud service providers and social media companies, whose revenues are expected to grow over 3.6x larger by value in 2026, will likely help propel center provision and innovation. These factors paint a promising picture for the industry players that offset rising input costs and adapt to new technologies.
According to M Capital Group: “Specifically, the long-term cost-saving opportunities available from the rise of remote managing will likely aid value growth for the industry. Through margin optimization and further availability of capital for reinvestment, strong players will maintain their competitive foothold, while weaker players exit the market to balance supply and demand.”
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ... (Subhajit Sahu)
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components, and processes them in topological order, one level at a time. This enables calculation for ranks in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It however comes with a precondition of the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. Slowdown on the GPU is likely caused by a large submission of small workloads, and expected to be non-issue when the computation is performed on massive graphs.
Techniques to optimize the pagerank algorithm usually fall in two categories. One is to try reducing the work per iteration, and the other is to try reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, with the same in-links, helps reduce duplicate computations and thus could help reduce iteration time. Road networks often have chains which can be short-circuited before pagerank computation to improve performance. Final ranks of chain nodes can be easily calculated. This could reduce both the iteration time, and the number of iterations. If a graph has no dangling nodes, pagerank of each strongly connected component can be computed in topological order. This could help reduce the iteration time, no. of iterations, and also enable multi-iteration concurrency in pagerank computation. The combination of all of the above methods is the STICD algorithm. [sticd] For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
Explore our comprehensive data analysis project presentation on predicting product ad campaign performance. Learn how data-driven insights can optimize your marketing strategies and enhance campaign effectiveness. Perfect for professionals and students looking to understand the power of data analysis in advertising. for more details visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
As Europe's leading economic powerhouse and the fourth-largest economy globally, Germany stands at the forefront of innovation and industrial might. Renowned for its precision engineering and high-tech sectors, Germany's economic structure is heavily supported by a robust service industry, accounting for approximately 68% of its GDP. This economic clout and strategic geopolitical stance position Germany as a focal point in the global cyber threat landscape.
In the face of escalating global tensions, particularly those emanating from geopolitical disputes with nations like Russia and China, Germany has witnessed a significant uptick in targeted cyber operations. Our analysis indicates a marked increase in cyberattack sophistication aimed at critical infrastructure and key industrial sectors. These attacks range from ransomware campaigns to Advanced Persistent Threats (APTs), threatening national security and business integrity.
🔑 Key findings include:
🔍 Increased frequency and complexity of cyber threats.
🔍 Escalation of state-sponsored and criminally motivated cyber operations.
🔍 Active dark web exchanges of malicious tools and tactics.
Our comprehensive report delves into these challenges, using a blend of open-source and proprietary data collection techniques. By monitoring activity on critical networks and analyzing attack patterns, our team provides a detailed overview of the threats facing German entities.
This report aims to equip stakeholders across public and private sectors with the knowledge to enhance their defensive strategies, reduce exposure to cyber risks, and reinforce Germany's resilience against cyber threats.
2. About me
• Work at Elsevier Labs
• (Mostly self-taught) data scientist
• Mostly work with Deep Learning, Machine Learning, Natural Language Processing, and Search.
• Got interested in Named Entity Recognition (NER) and NERDS as part of Search and Knowledge Graph development.
I am NOT the author or maintainer of NERDS!
• Originally built by Panagiotis Eustratiadis.
• See CONTRIBUTORS.md for list of contributors.
• Open sourced by Elsevier July 3, 2018.
3. Agenda
• What can NER do for you?
• Evolution of NER techniques
• NERDS Architecture.
• NERDS Usage.
• Future Work.
5. What can NER do for you?
• In general…
• Foundational task for NLP pipelines.
• Good NERs available OOB for “standard” named entities.
• Topic Modeling, Co-reference Resolution, etc.
• Information Retrieval (IR)
• Chunk Entities into meaningful multi-word phrases.
• Understanding query intent.
• Automated Knowledge Base Construction (AKBC)
• NER extracts entities from incoming text.
• Relationship Extraction extracts relationships between entity pairs.
• Entity Relationship triple inserted into Knowledge Graph.
ConceptSearch!
7. Evolution of NER Techniques
• Traditional: Rules, Regular Expressions, Gazetteers
• Statistical: Word-based models (PMI, log-likelihood); Sequence models (Conditional Random Fields)
• Neural: Bi-LSTM, Bi-LSTM+CRF, Transformer based Models
8. Input Format – BIO Tagging
• BIO – Begin, Inside, Outside.
• Barack/B-PER Obama/I-PER is/O 44th/O United/B-LOC States/I-LOC President/O ./O
• BILOU – a tagging variant:
• U – Unit token (for single token entities)
• L – Last token in sequence, e.g. Barack/B-PER Obama/L-PER
Barack B-PER
Obama I-PER
is O
44th O
United B-LOC
States I-LOC
President O
. O
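The relationship between the two tagging schemes can be sketched in a few lines. This is a minimal illustration, not part of NERDS; the function name `bio_to_bilou` is hypothetical, and it assumes well-formed BIO input for a single sentence.

```python
def bio_to_bilou(tags):
    """Convert one sentence's BIO tags to BILOU tags."""
    bilou = []
    for i, tag in enumerate(tags):
        if tag == "O":
            bilou.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        # The entity continues only if the next tag is I- with the same label.
        continues = next_tag == "I-" + label
        if prefix == "B":
            bilou.append(("B-" if continues else "U-") + label)
        else:  # prefix == "I"
            bilou.append(("I-" if continues else "L-") + label)
    return bilou


tags = ["B-PER", "I-PER", "O", "O", "B-LOC", "I-LOC", "O", "O"]
print(bio_to_bilou(tags))
```

Single-token entities come out as U- tags, and the final token of a multi-token entity becomes an L- tag.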
9. Gazetteer – Aho Corasick
• Create in-memory data structure from dictionary.
• Stream content against data structure.
• Multiple matches with single pass.
Aho, A.V., and Corasick, M.J., 1975. Efficient String Matching: An aid to bibliographic search
[Figure: Aho-Corasick automaton with states 0 to 5; transitions labeled Barack, Obama, United, States, Airlines, and NOT(Barack, United); outputs PER, LOC, and ORG.]
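The three steps above (build the structure, stream the content, collect all matches in one pass) can be sketched with a small token-level Aho-Corasick automaton in pure Python. This is an illustrative sketch rather than the pyAhoCorasick API or the NERDS implementation; the class name `Gazetteer` and the dictionary entry "United Airlines" are assumptions made for the example.

```python
from collections import deque


class Gazetteer:
    """Token-level Aho-Corasick automaton mapping phrases to entity labels."""

    def __init__(self, entries):
        # Per state: outgoing transitions, failure link, and matched outputs.
        self.goto = [{}]
        self.fail = [0]
        self.out = [[]]
        for phrase, label in entries.items():
            self._add(phrase.split(), label)
        self._build_failure_links()

    def _add(self, tokens, label):
        state = 0
        for tok in tokens:
            if tok not in self.goto[state]:
                self.goto.append({})
                self.fail.append(0)
                self.out.append([])
                self.goto[state][tok] = len(self.goto) - 1
            state = self.goto[state][tok]
        self.out[state].append((len(tokens), label))

    def _build_failure_links(self):
        # Breadth-first: depth-1 states keep their failure link at the root.
        queue = deque(self.goto[0].values())
        while queue:
            state = queue.popleft()
            for tok, nxt in self.goto[state].items():
                queue.append(nxt)
                f = self.fail[state]
                while f and tok not in self.goto[f]:
                    f = self.fail[f]
                self.fail[nxt] = self.goto[f].get(tok, 0)
                # Inherit matches that end at the failure state.
                self.out[nxt] = self.out[nxt] + self.out[self.fail[nxt]]

    def find(self, tokens):
        """Return (start, end, label) token spans found in a single pass."""
        matches, state = [], 0
        for i, tok in enumerate(tokens):
            while state and tok not in self.goto[state]:
                state = self.fail[state]
            state = self.goto[state].get(tok, 0)
            for length, label in self.out[state]:
                matches.append((i - length + 1, i + 1, label))
        return matches


gazetteer = Gazetteer({
    "Barack Obama": "PER",
    "United States": "LOC",
    "United Airlines": "ORG",  # assumed entry, for illustration only
})
tokens = "Barack Obama is 44th United States President .".split()
print(gazetteer.find(tokens))
```

Spans are returned as token indices with an exclusive end, so each match maps directly onto a run of BIO tags.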
10. Sequence Modeling - CRF
• Sequence version of logistic regression.
• Computes optimum labeling l = (y_0, …, y_n) over entire sentence s.
• Build multiple feature functions f_j on each token, each returning a real value in range 0..1. Function parameters:
• sentence s with tokens (x_0, …, x_n) – feature can use any token, the entire sentence, or functions computed over the sentence (POS),
• current position i,
• previous and current labels y_{i-1} and y_i.
• Optimum labeling is the one with the highest probability, with probabilities computed using a softmax over the weighted sums of feature values.
• Weights w_j learned using gradient descent.
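The formula on this slide did not survive in the transcript. The standard linear-chain CRF formulation that matches the bullets is sketched below. It assumes each feature function f_j sees the sentence s, the position i, the current label y_i, and the previous label y_{i-1}; the count of feature functions m and the candidate labeling l' are notation introduced here.

```latex
% Score of a labeling l = (y_0, ..., y_n) for sentence s
\mathrm{score}(l \mid s) = \sum_{j=1}^{m} \sum_{i=0}^{n} w_j \, f_j(s, i, y_{i-1}, y_i)

% Softmax over all candidate labelings l'
p(l \mid s) = \frac{\exp\big(\mathrm{score}(l \mid s)\big)}{\sum_{l'} \exp\big(\mathrm{score}(l' \mid s)\big)}

% Optimum labeling
l^{*} = \arg\max_{l} \; p(l \mid s)
```

The weights w_j are then fit by gradient-based maximization of log p(l | s) over the training sentences.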
11. Neural Model - BiLSTM
• Input is sequence of tokens, output is sequence of BIO tags.
• Weights trained end-to-end, no feature engineering needed.
• Bidirectional LSTM gets signal from neighboring words on both sides.
[Figure: Bi-LSTM reading the tokens "Barack Obama is 44th United States President ." and emitting the tags B-PER I-PER O O B-LOC I-LOC O O.]
12. Neural Model – BiLSTM-CRF
• Same as previous model, with additional CRF layer.
• No feature engineering for CRF, unlike CRF only NER model.
• Pre-trained embeddings observed to improve performance.
[Figure: Bi-LSTM followed by a CRF layer, tagging "Barack Obama is 44th United States President ." as B-PER I-PER O O B-LOC I-LOC O O.]
13. Neural Model – adding char embeddings
• Concatenate char embedding + word embedding and feed to Bi-LSTM-CRF.
• All weights learned end-to-end.
• Handles rare / unknown words; Exploits signal in prefix/suffix.
[Figure: word embeddings concatenated with char LSTM/CNN embeddings and fed to a Bi-LSTM-CRF, tagging "Barack Obama is 44th United States President ." as B-PER I-PER O O B-LOC I-LOC O O.]
14. Neural Model – ELMo preprocessing
[Figure: contextualized word embeddings concatenated with char LSTM/CNN embeddings and fed to a Bi-LSTM-CRF, tagging "Barack Obama is 44th United States President ." as B-PER I-PER O O B-LOC I-LOC O O.]
15. Neural Model – Transformer based
• BERT = Bidirectional Encoder Representations from Transformers.
• Source of embeddings similar to ELMo in standard BiLSTM + CRF models, OR
• Fine-tune LM backed NERs such as HuggingFace’s BertForTokenClassification.
[Figure: Transformer model reading "[CLS] Barack Obama is 44th United States President ." and emitting the tags B-PER I-PER O O B-LOC I-LOC O O.]
16. More Info on NER Techniques
• High level overview on NER in series of blog posts by Tobias Sterbak (https://bit.ly/2pNdgPG).
• Traditional NER techniques covered in paper by Rahul Sharnagat (2014) -- Named Entity Recognition: A Literature Survey (https://bit.ly/2NRaCAg).
• Introduction to Neural Models in paper by Ronan Collobert and Jason Weston (2008) -- A Unified Architecture for Natural Language Processing: Deep Neural Networks with Multitask Learning (https://bit.ly/32rRYnO).
• Others (more modern papers) mentioned in slides.
18. NERDS Overview
• Framework that provides easy to use NER capabilities to Data Scientists.
• Wraps various popular third party NER models.
• Extendable, new third party NER tools can be added as needed.
• Software Engineering tooling to boost Data Science productivity.
• Looking for support, bug reports, contributions, and ideas.
19. Unification through I/O Format
[Figure: pyAhoCorasick, CRFSuite, SpaCy NER, and Anago BiLSTM all mapped to a common I/O format:]
AnnotatedDocument(
  doc: Document("Barack Obama is 44th United States President ."),
  annotations: [
    Annotation(start_offset: 0, end_offset: 12, text: "Barack Obama", label: "PER"),
    Annotation(start_offset: 21, end_offset: 34, text: "United States", label: "LOC")
  ])
20. Benefits of Unification
• Consistent API – all models are subclasses of NERModel.
• Data prep. done once per project and reused across multiple models.
• Reusable Training and Evaluation code.
• Familiar Scikit-Learn like API, and access to Scikit-Learn utility functions.
• Duck-typing allows us to build Ensembles of NER.
• Easy to benchmark NER label data.
21. Can we do better?
Data: [["Barack", "Obama", "is", "44th", "United", "States", "President", "."]]
Labels and Predictions: [["B-PER", "I-PER", "O", "O", "B-LOC", "I-LOC", "O", "O"]]
[Figure: DictionaryNER and SpacyNER, each with an I/O Convert step, alongside CrfNER and BiLstmCrfNER.]
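An "I/O Convert" step of this kind could look roughly like the sketch below, which turns token and BIO tag lists into character-offset annotations. The function name `bio_to_annotations` and the tuple layout are hypothetical and are not the NERDS converter. It assumes tokens are joined by single spaces and that end offsets are exclusive.

```python
def bio_to_annotations(tokens, tags):
    """Turn tokens + BIO tags into (start_offset, end_offset, text, label) tuples.

    Offsets index into " ".join(tokens); end_offset is exclusive.
    """
    spans = []
    offset = 0
    current = None  # [start, end, label] of the entity being built
    for token, tag in zip(tokens, tags):
        start, end = offset, offset + len(token)
        if tag.startswith("I-") and current and current[2] == tag[2:]:
            current[1] = end  # extend the open entity
        else:
            if current:
                spans.append(current)
            current = [start, end, tag[2:]] if tag != "O" else None
        offset = end + 1  # skip the single-space separator
    if current:
        spans.append(current)
    text = " ".join(tokens)
    return [(s, e, text[s:e], label) for s, e, label in spans]


tokens = ["Barack", "Obama", "is", "44th", "United", "States", "President", "."]
tags = ["B-PER", "I-PER", "O", "O", "B-LOC", "I-LOC", "O", "O"]
for annotation in bio_to_annotations(tokens, tags):
    print(annotation)
```

Models that work on character offsets, such as a dictionary matcher over raw text, would need the reverse mapping as well; that direction additionally depends on the tokenizer in use.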
22. ELMo NER Model from Anago
Data: [["Barack", "Obama", "is", "44th", "United", "States", "President", "."]]
Labels and Predictions: [["B-PER", "I-PER", "O", "O", "B-LOC", "I-LOC", "O", "O"]]
[Figure: DictionaryNER, CrfNER, SpacyNER, and BiLstmCrfNER, with two I/O Convert steps, plus the added ElmoNER.]
24. Dataset
• Bio Entity recognition task from BioNLP 2004.
• Training and Test sets provided in BIO format.
• 511,097 training examples
• 104,895 test examples.
• Entity Distribution (training set)
• 25,307 DNA
• 2,481 RNA
• 11,217 cell_line
• 15,466 cell_type
• 55,117 protein
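Loading data in this shape might look like the sketch below. It assumes one token and its tag per line, separated by whitespace, with a blank line between sentences; the actual BioNLP 2004 files may carry extra header lines that would need filtering first. The function name `read_bio` and the sample text are made up for illustration.

```python
from collections import Counter


def read_bio(lines):
    """Parse 'token tag' lines into a list of (tokens, tags) sentences."""
    sentences, tokens, tags = [], [], []
    for line in lines:
        line = line.strip()
        if not line:  # blank line ends the current sentence
            if tokens:
                sentences.append((tokens, tags))
                tokens, tags = [], []
            continue
        token, tag = line.rsplit(None, 1)
        tokens.append(token)
        tags.append(tag)
    if tokens:
        sentences.append((tokens, tags))
    return sentences


# Made-up sample in the assumed format (not taken from the dataset).
sample = "IL-2\tB-DNA\ngene\tI-DNA\nexpression\tO\n\nT\tB-cell_type\ncells\tI-cell_type\n"
sentences = read_bio(sample.splitlines())
X = [tokens for tokens, _ in sentences]
y = [tags for _, tags in sentences]

# Count entities per class by counting B- tags.
entity_counts = Counter(tag[2:] for tags in y for tag in tags if tag.startswith("B-"))
print(X, y, entity_counts)
```

The resulting `X` and `y` are lists of token lists and tag lists, the same shape as the Data and Labels examples in the architecture slides.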
25. Dictionary NER
• Wraps pyAhoCorasick Automaton
• Improvements in fork.
• Supports dictionary loading as well as fit(X, y) like other NER models.
• Handles multiple entity classes.
27. CRF NER
• Wraps sklearn_crfsuite.CRF
• Improvements in this fork:
• Removes NLTK dependency, replaces with SpaCy.
• Allows non-default features to be passed in.
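For context on what "non-default features" could mean: sklearn-crfsuite consumes each sentence as a list of per-token feature dictionaries. The sketch below builds such dictionaries with a few common NER features (word shape, prefix, suffix, neighboring words). The feature names and both function names are illustrative assumptions; how NERDS accepts custom features is not shown in the slides.

```python
def word_shape(token):
    """Collapse a token to its shape, e.g. 'Barack' -> 'Xx', '44th' -> 'dx'."""
    shape = []
    for ch in token:
        if ch.isupper():
            c = "X"
        elif ch.islower():
            c = "x"
        elif ch.isdigit():
            c = "d"
        else:
            c = ch
        if not shape or shape[-1] != c:  # collapse repeated classes
            shape.append(c)
    return "".join(shape)


def sentence_features(tokens):
    """Return one feature dict per token for a single sentence."""
    features = []
    for i, token in enumerate(tokens):
        feats = {
            "bias": 1.0,
            "lower": token.lower(),
            "shape": word_shape(token),
            "prefix3": token[:3],
            "suffix3": token[-3:],
        }
        if i > 0:
            feats["prev_lower"] = tokens[i - 1].lower()
        else:
            feats["BOS"] = True  # beginning of sentence
        if i < len(tokens) - 1:
            feats["next_lower"] = tokens[i + 1].lower()
        else:
            feats["EOS"] = True  # end of sentence
        features.append(feats)
    return features


print(sentence_features(["IL-2", "gene", "expression"])[0])
```

Word shapes and affixes tend to matter for biomedical entities, where names such as IL-2 mix capitals, digits, and punctuation.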
29. SpaCy NER
• Wraps NER provided by SpaCy toolkit.
• Improvements in this fork:
• More robust to large data sizes, uses mini-batches for training.
31. BiLSTM CRF NER
• Wraps Anago BiLSTMCRF.
• Improvements in this fork:
• Works against latest release (1.0.5) of Anago.
• No more intermittent failures due to time step mismatches.
33. Elmo NER
• Wraps Anago ELModel.
• New in this fork, available in current (dev) version of Anago.
• Needs (mandatory) base embedding for ELMo preprocessor.
35. Ensemble NER
• Max Voting
• Improvements in this fork:
• Unifies Max Voting and Weighted Max Voting NERs into single model.
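Treating plain max voting as weighted voting with equal weights is one way a single model can cover both cases. The sketch below votes per token over the tag sequences produced by several models. The function name `vote` and the example predictions are hypothetical, and this is not the NERDS ensemble code.

```python
from collections import defaultdict


def vote(predictions, weights=None):
    """Combine per-model tag sequences for one sentence by weighted max voting.

    predictions: list of tag sequences, one per model, all the same length.
    weights: optional per-model weights; equal weights give plain max voting.
    """
    if weights is None:
        weights = [1.0] * len(predictions)
    combined = []
    for token_tags in zip(*predictions):
        scores = defaultdict(float)
        for tag, weight in zip(token_tags, weights):
            scores[tag] += weight
        # Ties go to the tag proposed by the earliest model in the list.
        combined.append(max(scores, key=scores.get))
    return combined


model_a = ["B-PER", "I-PER", "O"]
model_b = ["B-PER", "O", "O"]
model_c = ["O", "I-PER", "O"]
print(vote([model_a, model_b, model_c]))
print(vote([model_a, model_b, model_c], weights=[1, 3, 1]))
```

Because voting happens token by token, the combined sequence is not guaranteed to be valid BIO (an I- tag can follow an O), so a repair pass may be needed afterwards.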
37. Results (OOTB)
• Comparison across models
• ELMo based CRF has best performance.
• SpaCy and BiLSTM have comparable performance, but CRF is competitive.
• Model based NERs outperform gazetteers.
• F1-scores range from 0.65 to 0.80
• Comparison across entity types
• Some correlation observed between data volume and F1-scores for other models.
• F1-scores range from 0.61 to 0.81
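The slides do not say whether the F1-scores above were computed per token or per entity. As one possible reading, the sketch below computes entity-level precision, recall, and F1 with exact span matching over BIO tag sequences. Both function names are hypothetical.

```python
def extract_entities(tags):
    """Return the set of (start, end, label) spans in one BIO tag sequence."""
    entities, start, label = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel flushes the last entity
        inside = tag.startswith("I-") and label == tag[2:]
        if not inside:
            if label is not None:
                entities.add((start, i, label))
            start, label = (i, tag[2:]) if tag != "O" else (None, None)
    return entities


def entity_f1(y_true, y_pred):
    """Entity-level (precision, recall, F1) over lists of tag sequences."""
    tp = fp = fn = 0
    for true_tags, pred_tags in zip(y_true, y_pred):
        gold = extract_entities(true_tags)
        pred = extract_entities(pred_tags)
        tp += len(gold & pred)
        fp += len(pred - gold)
        fn += len(gold - pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


y_true = [["B-PER", "I-PER", "O", "O", "B-LOC", "I-LOC", "O", "O"]]
y_pred = [["B-PER", "I-PER", "O", "O", "B-LOC", "O", "O", "O"]]
print(entity_f1(y_true, y_pred))
```

In this example the truncated LOC span counts as both a false positive and a false negative, which is why entity-level scores are usually stricter than token-level ones.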
39. Future Work
• Current API is only superficially Scikit-Learn like; convert to make models fully conform to Scikit-Learn Classifier API.
• Eliminate Serialization issues reported by joblib.Parallel.
• Eliminate EnsembleNER in favor of Scikit-Learn’s VotingClassifier.
• Leverage Scikit-Learn’s Model Selection classes (RandomizedSearchCV and GridSearchCV).
• Add FLAIR and BERT based NER to supported model collection.
• BRAT annotation adapter.