SlideShare a Scribd company logo
Dictionary Based Annotation at
scale with Spark, SolrTextTagger
and OpenNLP
Sujit Pal, Elsevier Labs
Introduction
• About Me
– Work at Elsevier Labs.
– Interested in Search, NLP and Distributed Processing.
– URL: labs.elsevier.com
– Email: sujit.pal@elsevier.com
– Twitter: @palsujit
• About Elsevier
– World’s largest publisher of STM Books and Journals.
– Uses Data to inform and enable consumers of STM info.
– And like everybody else, we are hiring!
Agenda
• Overview and Background
• Features and API
• Scaling out
• Q&A
Overview/Background
Problem Definition
• What is the problem?
– Annotate millions of documents from different corpora.
• 14M docs from Science Direct alone.
• More from other corpora, dependency parsing, etc.
– Critical step for Machine Reading and Knowledge Graph applications.
• Why is this such a big deal?
– Takes advantage of existing linked data.
– No model training for multiple complex STM domains.
– However, simple until done at scale.
Annotation Pipeline
Dictionary Based NE Annotator (SoDA)
• Part of Document Annotation Pipeline.
• Annotates text with Named Entities from external Dictionaries.
• Built with Open Source Components
– Apache Solr – Highly reliable, scalable and fault-tolerant search index.
– SolrTextTagger – Solr component for text tagging, uses Lucene FST technology.
– Apache OpenNLP – Machine Learning based toolkit for processing Natural Language Text.
– Apache Spark – Lightning fast, large scale data processing.
• Uses ideas from other Open Source libraries
– FuzzyWuzzy – Fuzzy String Matching like a boss.
• Contributed back to Open Source
– https://github.com/elsevierlabs-os/soda
SoDA Architecture
How does it work (exact/case matching)?
• Uses Aho-Corasick algorithm – treats the dictionary as a FST and streams text against
it. Matches all patterns simultaneously. Diagram shows FST for vocabulary {“his”, “he”,
“her”, “hers”, “she”}.
• Michael McCandless implemented FSTs in Lucene (blog post).
• David Smiley built SolrTextTagger to use Lucene FSTs.
• SoDA uses SolrTextTagger for streaming exact and case-insensitive matching.
How does it work (fuzzy matching)?
• Pre-normalizes each dictionary entry into various forms
– Original – “Astrocytoma, Subependymal Giant Cell”
– Lowercased – “astrocytoma, subependymal giant cell”
– Punctuation – “astrocytoma subependymal giant cell”
– Sorted – “astrocytoma cell giant subependymal”
– Stemmed – “astrocytoma cell giant subependym”
• Uses OpenNLP to parse input text into phrases, normalizes each phrase into the desired
normalization level and matches against corresponding field.
• Caller specifies normalization level.
Features and API
Feature Overview
• Provides JSON over HTTP interface
– Compose request as JSON document
– HTTP POST document to JSON endpoint URL (HTTP GET for URL only requests).
– Receive response as JSON document.
• Language-Agnostic and Cross-Platform.
• API can be used from standalone clients, Spark jobs and Databricks notebooks.
• Examples in Scala and Python
Services
• Status
– index.json – returns a JSON (suitable for health check monitoring)
• Single Lexicon Services
– annot.json – annotates a block of text in streaming manner. Supports different levels of
matching (strict to permissive).
– matchphrase.json – annotates short phrases. Supports same matching levels as annot.json.
• Multi-Lexicon Services
– dicts.json – lists all lexicons available.
– coverage.json – returns number of annotations by lexicon found for text across all available
lexicons.
• Indexing Services
– delete.json – deletes entire lexicon from index.
– add.json – adds an entry to the specified lexicon.
Annotation Service I/O
Example annotation request
{
“lexicon” : “countries”,
“text” : “Institute of Clean Coal Technology, East
China University”,
“matching” : “exact”
}
Example annotated response
[
{
“id” : “http://www.geonames.org/CHN”,
“lexicon” : “countries”,
“begin” : 41,
“end” : 46,
“coveredText” : “China”,
“confidence” : 1.0
}
]
Calling Annotation Service
• Client originally written in Python, using built-in json and requests libraries.
• For Scala client, SoDA JAR provides classes to mimic json and requests functionality in Scala.
• Input to both our (somewhat contrived) examples are: (pii: String, affStr: String) tuples as shown.
• Match against a country lexicon to find country names.
Annotation Service – Python Client
Annotation Service – Scala Client
Annotation Service - Outputs
• Each Annotation result provides:
– Entity ID (not shown)
– Begin position in text
– End position in text
– Matched Text
– Confidence (not shown)
• Zero or more Annotations possible per input text.
Loading Dictionaries
• Dictionary entries represented by:
– Lexicon Name
– Entry ID (unique across lexicons)
– List of possible synonym terms
• JSON Request to add an entry for MeSH dictionary.
{ “id”: http://id.nlm.nih.gov/mesh/2015/M0021699,
“lexicon”: “mesh”,
“names”: [“Baby Tooth”, “Dentitions, Primary”, “Milk Tooth”, ...],
“commit”: false }
• Preferable to commit periodically and after batch.
Loading Dictionaries – Scala Client
Scaling Out
SoDA Performance – Expected
• Test: annotate 14M docs in “reasonable time”.
– Approx. 3s/doc with SoDA+Solr on ec2 r3.large box (15.5GB RAM, 32GB SSD, 2vCPU).
– Total estimated time: 16.2 months!
• Questions
– Can we make the process faster?
– Can we scale out the process?
Where is the time being spent?
• Majority of time spent in Solr.
• Some time spent in SoDA (decreases slower than Solr as transactions get shorter).
• Almost no additional time spent in Spark.
Optimization #1: Combine Paragraphs
• Performance measured using 10K random
articles.
• Time to annotate 1 article: Mean 2.9s, Median
2.1s.
• Annotation done per paragraph, 40
paragraphs/article on average.
• Reduce HTTP network + parsing overhead by
sending full document.
• Time to annotate 1 article: Mean 1.4s, Median
0.3s.
• 2x - 7x improvement.
Optimization #2: Tune Solr GC
• OOB Solr would GC very frequently, slowing
down Spark and causing timeouts.
• Current Index Size: 2.1 GB
• Need to size box so approximately 75% RAM
given to OS and remaining 25% allocated to
Solr (Uwe Schindler's Blog Post).
• Heap size should be 3-4x index size (Internal
Guideline).
• Current Solr Heap Size = 8 GB
• RAM is 30.5 GB
• CMS (Concurrent Mark-Sweep) Garbage
Collection.
Optimization #3: Larger Spark Cluster
• Running on cluster Master + 4 Workers.
• Each worker has 1 Executor with 4 Cores.
• Number of simultaneous Solr clients = 16 (4
workers * 1 executor * 4 cores) – measured
with lsof –p in a loop on Solr server.
• Throughput increases with number of
partitions till about 2x the number of worker
cores.
• Best throughput 5 docs/sec with #-
partitions=30 for 16 cores.
Optimization #4: Solr Scaleout
• Upgrade to r3.xlarge (30.5GB RAM, 80GB
SSD, 4vCPU)
– Throughput 7.9 docs/s
• Upgrade to 2x r3.2xlarge (61GB RAM, 160GB
SSD, 8vCPU) with c3.large LB (3.75GB RAM,
32GB Disk, 2vCPU) running HAProxy.
#-workers #-
requests/serv
er
Throughput
(docs/sec)
4 8 8.62
8 16 17.334
12 24 20.64
16 32 26.845
Performance – Did we meet expectations?
• At 26 docs/sec and 14M documents, it will take our current cluster little over 6 days to annotate
against our largest dictionary (8M entries).
• Throughput scales linearly @ 1.5 docs/sec per additional worker, as long as Solr servers have
enough capacity to serve requests.
• Each Solr box (as configured) can serve sustained loads of up to 30-35 simultaneous requests.
• Number of simultaneous requests approximately equal to number of worker cores.
• Example: annotate 14M documents in 3 days.
– Throughput required: 14M / (3 * 86400) = 54 docs/s
– Number of workers: 54 / 1.5 = 36 workers
– Number of simultaneous requests (4 cores/worker) = 36 * 4 = 144
– Number of Solr servers: 144 / 32 = 4.5 = 5 servers
Future Work
• More Lexicons
• Investigate Lexicon-Centric scale out.
– Allows more lexicons.
– Not limited to single index.
• Move to Lucene, eliminate network
overhead.
– Asynchronous model
– Use Kafka topic with multiple partitions
– Lucene based tagging consumers
– Write output to S3.
Q&A
Thank you for listening!
• Questions?
• SoDA available on GitHub
– https://github.com/elsevierlabs-os/soda
• Contact me
– sujit.pal@elsevier.com

More Related Content

What's hot

Cascalog at Strange Loop
Cascalog at Strange LoopCascalog at Strange Loop
Cascalog at Strange Loop
nathanmarz
 
SolrCloud in Public Cloud: Scaling Compute Independently from Storage - Ilan ...
SolrCloud in Public Cloud: Scaling Compute Independently from Storage - Ilan ...SolrCloud in Public Cloud: Scaling Compute Independently from Storage - Ilan ...
SolrCloud in Public Cloud: Scaling Compute Independently from Storage - Ilan ...
Lucidworks
 
Mathias Brandewinder, Software Engineer & Data Scientist, Clear Lines Consult...
Mathias Brandewinder, Software Engineer & Data Scientist, Clear Lines Consult...Mathias Brandewinder, Software Engineer & Data Scientist, Clear Lines Consult...
Mathias Brandewinder, Software Engineer & Data Scientist, Clear Lines Consult...
MLconf
 
Tom Peters, Software Engineer, Ufora at MLconf ATL 2016
Tom Peters, Software Engineer, Ufora at MLconf ATL 2016Tom Peters, Software Engineer, Ufora at MLconf ATL 2016
Tom Peters, Software Engineer, Ufora at MLconf ATL 2016
MLconf
 
Evolving The Optimal Relevancy Scoring Model at Dice.com: Presented by Simon ...
Evolving The Optimal Relevancy Scoring Model at Dice.com: Presented by Simon ...Evolving The Optimal Relevancy Scoring Model at Dice.com: Presented by Simon ...
Evolving The Optimal Relevancy Scoring Model at Dice.com: Presented by Simon ...
Lucidworks
 
High Performance Machine Learning in R with H2O
High Performance Machine Learning in R with H2OHigh Performance Machine Learning in R with H2O
High Performance Machine Learning in R with H2O
Sri Ambati
 
Chris Fregly, Research Scientist, PipelineIO at MLconf ATL 2016
Chris Fregly, Research Scientist, PipelineIO at MLconf ATL 2016Chris Fregly, Research Scientist, PipelineIO at MLconf ATL 2016
Chris Fregly, Research Scientist, PipelineIO at MLconf ATL 2016
MLconf
 
Distributed Deep Learning on AWS with Apache MXNet
Distributed Deep Learning on AWS with Apache MXNetDistributed Deep Learning on AWS with Apache MXNet
Distributed Deep Learning on AWS with Apache MXNet
Amazon Web Services
 
Large-Scale Lasso and Elastic-Net Regularized Generalized Linear Models (DB T...
Large-Scale Lasso and Elastic-Net Regularized Generalized Linear Models (DB T...Large-Scale Lasso and Elastic-Net Regularized Generalized Linear Models (DB T...
Large-Scale Lasso and Elastic-Net Regularized Generalized Linear Models (DB T...
Spark Summit
 
Synchronizing Clusters in Fusion: CDCR and Streaming Expressions
Synchronizing Clusters in Fusion: CDCR and Streaming ExpressionsSynchronizing Clusters in Fusion: CDCR and Streaming Expressions
Synchronizing Clusters in Fusion: CDCR and Streaming Expressions
Lucidworks
 
Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017
Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017
Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017
StampedeCon
 
Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for lar...
Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for lar...Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for lar...
Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for lar...
Databricks
 
Building Large Scale Machine Learning Applications with Pipelines-(Evan Spark...
Building Large Scale Machine Learning Applications with Pipelines-(Evan Spark...Building Large Scale Machine Learning Applications with Pipelines-(Evan Spark...
Building Large Scale Machine Learning Applications with Pipelines-(Evan Spark...
Spark Summit
 
Josh Patterson, Advisor, Skymind – Deep learning for Industry at MLconf ATL 2016
Josh Patterson, Advisor, Skymind – Deep learning for Industry at MLconf ATL 2016Josh Patterson, Advisor, Skymind – Deep learning for Industry at MLconf ATL 2016
Josh Patterson, Advisor, Skymind – Deep learning for Industry at MLconf ATL 2016
MLconf
 
Introduction to Keras
Introduction to KerasIntroduction to Keras
Introduction to Keras
John Ramey
 
Better {ML} Together: GraphLab Create + Spark
Better {ML} Together: GraphLab Create + Spark Better {ML} Together: GraphLab Create + Spark
Better {ML} Together: GraphLab Create + Spark
Turi, Inc.
 
Michigan Information Retrieval Enthusiasts Group Meetup - August 19, 2010
Michigan Information Retrieval Enthusiasts Group Meetup - August 19, 2010Michigan Information Retrieval Enthusiasts Group Meetup - August 19, 2010
Michigan Information Retrieval Enthusiasts Group Meetup - August 19, 2010
ivan provalov
 
Braxton McKee, Founder & CEO, Ufora at MLconf SF - 11/13/15
Braxton McKee, Founder & CEO, Ufora at MLconf SF - 11/13/15Braxton McKee, Founder & CEO, Ufora at MLconf SF - 11/13/15
Braxton McKee, Founder & CEO, Ufora at MLconf SF - 11/13/15
MLconf
 
Parsl: Pervasive Parallel Programming in Python
Parsl: Pervasive Parallel Programming in PythonParsl: Pervasive Parallel Programming in Python
Parsl: Pervasive Parallel Programming in Python
Daniel S. Katz
 
Efficient top-k queries processing in column-family distributed databases
Efficient top-k queries processing in column-family distributed databasesEfficient top-k queries processing in column-family distributed databases
Efficient top-k queries processing in column-family distributed databasesRui Vieira
 

What's hot (20)

Cascalog at Strange Loop
Cascalog at Strange LoopCascalog at Strange Loop
Cascalog at Strange Loop
 
SolrCloud in Public Cloud: Scaling Compute Independently from Storage - Ilan ...
SolrCloud in Public Cloud: Scaling Compute Independently from Storage - Ilan ...SolrCloud in Public Cloud: Scaling Compute Independently from Storage - Ilan ...
SolrCloud in Public Cloud: Scaling Compute Independently from Storage - Ilan ...
 
Mathias Brandewinder, Software Engineer & Data Scientist, Clear Lines Consult...
Mathias Brandewinder, Software Engineer & Data Scientist, Clear Lines Consult...Mathias Brandewinder, Software Engineer & Data Scientist, Clear Lines Consult...
Mathias Brandewinder, Software Engineer & Data Scientist, Clear Lines Consult...
 
Tom Peters, Software Engineer, Ufora at MLconf ATL 2016
Tom Peters, Software Engineer, Ufora at MLconf ATL 2016Tom Peters, Software Engineer, Ufora at MLconf ATL 2016
Tom Peters, Software Engineer, Ufora at MLconf ATL 2016
 
Evolving The Optimal Relevancy Scoring Model at Dice.com: Presented by Simon ...
Evolving The Optimal Relevancy Scoring Model at Dice.com: Presented by Simon ...Evolving The Optimal Relevancy Scoring Model at Dice.com: Presented by Simon ...
Evolving The Optimal Relevancy Scoring Model at Dice.com: Presented by Simon ...
 
High Performance Machine Learning in R with H2O
High Performance Machine Learning in R with H2OHigh Performance Machine Learning in R with H2O
High Performance Machine Learning in R with H2O
 
Chris Fregly, Research Scientist, PipelineIO at MLconf ATL 2016
Chris Fregly, Research Scientist, PipelineIO at MLconf ATL 2016Chris Fregly, Research Scientist, PipelineIO at MLconf ATL 2016
Chris Fregly, Research Scientist, PipelineIO at MLconf ATL 2016
 
Distributed Deep Learning on AWS with Apache MXNet
Distributed Deep Learning on AWS with Apache MXNetDistributed Deep Learning on AWS with Apache MXNet
Distributed Deep Learning on AWS with Apache MXNet
 
Large-Scale Lasso and Elastic-Net Regularized Generalized Linear Models (DB T...
Large-Scale Lasso and Elastic-Net Regularized Generalized Linear Models (DB T...Large-Scale Lasso and Elastic-Net Regularized Generalized Linear Models (DB T...
Large-Scale Lasso and Elastic-Net Regularized Generalized Linear Models (DB T...
 
Synchronizing Clusters in Fusion: CDCR and Streaming Expressions
Synchronizing Clusters in Fusion: CDCR and Streaming ExpressionsSynchronizing Clusters in Fusion: CDCR and Streaming Expressions
Synchronizing Clusters in Fusion: CDCR and Streaming Expressions
 
Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017
Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017
Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017
 
Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for lar...
Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for lar...Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for lar...
Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for lar...
 
Building Large Scale Machine Learning Applications with Pipelines-(Evan Spark...
Building Large Scale Machine Learning Applications with Pipelines-(Evan Spark...Building Large Scale Machine Learning Applications with Pipelines-(Evan Spark...
Building Large Scale Machine Learning Applications with Pipelines-(Evan Spark...
 
Josh Patterson, Advisor, Skymind – Deep learning for Industry at MLconf ATL 2016
Josh Patterson, Advisor, Skymind – Deep learning for Industry at MLconf ATL 2016Josh Patterson, Advisor, Skymind – Deep learning for Industry at MLconf ATL 2016
Josh Patterson, Advisor, Skymind – Deep learning for Industry at MLconf ATL 2016
 
Introduction to Keras
Introduction to KerasIntroduction to Keras
Introduction to Keras
 
Better {ML} Together: GraphLab Create + Spark
Better {ML} Together: GraphLab Create + Spark Better {ML} Together: GraphLab Create + Spark
Better {ML} Together: GraphLab Create + Spark
 
Michigan Information Retrieval Enthusiasts Group Meetup - August 19, 2010
Michigan Information Retrieval Enthusiasts Group Meetup - August 19, 2010Michigan Information Retrieval Enthusiasts Group Meetup - August 19, 2010
Michigan Information Retrieval Enthusiasts Group Meetup - August 19, 2010
 
Braxton McKee, Founder & CEO, Ufora at MLconf SF - 11/13/15
Braxton McKee, Founder & CEO, Ufora at MLconf SF - 11/13/15Braxton McKee, Founder & CEO, Ufora at MLconf SF - 11/13/15
Braxton McKee, Founder & CEO, Ufora at MLconf SF - 11/13/15
 
Parsl: Pervasive Parallel Programming in Python
Parsl: Pervasive Parallel Programming in PythonParsl: Pervasive Parallel Programming in Python
Parsl: Pervasive Parallel Programming in Python
 
Efficient top-k queries processing in column-family distributed databases
Efficient top-k queries processing in column-family distributed databasesEfficient top-k queries processing in column-family distributed databases
Efficient top-k queries processing in column-family distributed databases
 

Similar to Dictionary based Annotation at Scale with Spark, SolrTextTagger and OpenNLP

The Why and How of Scala at Twitter
The Why and How of Scala at TwitterThe Why and How of Scala at Twitter
The Why and How of Scala at Twitter
Alex Payne
 
Elasticsearch and Spark
Elasticsearch and SparkElasticsearch and Spark
Elasticsearch and Spark
Audible, Inc.
 
Scaling Massive Elasticsearch Clusters
Scaling Massive Elasticsearch ClustersScaling Massive Elasticsearch Clusters
Scaling Massive Elasticsearch ClustersSematext Group, Inc.
 
Spark meetup TCHUG
Spark meetup TCHUGSpark meetup TCHUG
Spark meetup TCHUG
Ryan Bosshart
 
Scaling massive elastic search clusters - Rafał Kuć - Sematext
Scaling massive elastic search clusters - Rafał Kuć - SematextScaling massive elastic search clusters - Rafał Kuć - Sematext
Scaling massive elastic search clusters - Rafał Kuć - Sematext
Rafał Kuć
 
Machine Learning With H2O vs SparkML
Machine Learning With H2O vs SparkMLMachine Learning With H2O vs SparkML
Machine Learning With H2O vs SparkML
Arnab Biswas
 
Introduction to libre « fulltext » technology
Introduction to libre « fulltext » technologyIntroduction to libre « fulltext » technology
Introduction to libre « fulltext » technology
Robert Viseur
 
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
Jose Quesada (hiring)
 
SFDC Introduction to Apex
SFDC Introduction to ApexSFDC Introduction to Apex
SFDC Introduction to Apex
Sujit Kumar
 
ログ収集プラットフォーム開発におけるElasticsearchの運用
ログ収集プラットフォーム開発におけるElasticsearchの運用ログ収集プラットフォーム開発におけるElasticsearchの運用
ログ収集プラットフォーム開発におけるElasticsearchの運用
LINE Corporation
 
MongoDB Knowledge Shareing
MongoDB Knowledge ShareingMongoDB Knowledge Shareing
MongoDB Knowledge ShareingPhilip Zhong
 
Building a Large Scale SEO/SEM Application with Apache Solr
Building a Large Scale SEO/SEM Application with Apache SolrBuilding a Large Scale SEO/SEM Application with Apache Solr
Building a Large Scale SEO/SEM Application with Apache Solr
Rahul Jain
 
Case Study: Elasticsearch Ingest Using StreamSets at Cisco Intercloud
Case Study: Elasticsearch Ingest Using StreamSets at Cisco IntercloudCase Study: Elasticsearch Ingest Using StreamSets at Cisco Intercloud
Case Study: Elasticsearch Ingest Using StreamSets at Cisco Intercloud
Rick Bilodeau
 
Case Study: Elasticsearch Ingest Using StreamSets @ Cisco Intercloud
Case Study: Elasticsearch Ingest Using StreamSets @ Cisco IntercloudCase Study: Elasticsearch Ingest Using StreamSets @ Cisco Intercloud
Case Study: Elasticsearch Ingest Using StreamSets @ Cisco Intercloud
Streamsets Inc.
 
About elasticsearch
About elasticsearchAbout elasticsearch
About elasticsearch
Minsoo Jun
 
Kafka Summit NYC 2017 Introduction to Kafka Streams with a Real-life Example
Kafka Summit NYC 2017 Introduction to Kafka Streams with a Real-life ExampleKafka Summit NYC 2017 Introduction to Kafka Streams with a Real-life Example
Kafka Summit NYC 2017 Introduction to Kafka Streams with a Real-life Example
confluent
 
MongoDB Replication fundamentals - Desert Code Camp - October 2014
MongoDB Replication fundamentals - Desert Code Camp - October 2014MongoDB Replication fundamentals - Desert Code Camp - October 2014
MongoDB Replication fundamentals - Desert Code Camp - October 2014
Avinash Ramineni
 
Spark - The Ultimate Scala Collections by Martin Odersky
Spark - The Ultimate Scala Collections by Martin OderskySpark - The Ultimate Scala Collections by Martin Odersky
Spark - The Ultimate Scala Collections by Martin Odersky
Spark Summit
 
LuSql: (Quickly and easily) Getting your data from your DBMS into Lucene
LuSql: (Quickly and easily) Getting your data from your DBMS into LuceneLuSql: (Quickly and easily) Getting your data from your DBMS into Lucene
LuSql: (Quickly and easily) Getting your data from your DBMS into Lucene
eby
 
Breakthrough OLAP performance with Cassandra and Spark
Breakthrough OLAP performance with Cassandra and SparkBreakthrough OLAP performance with Cassandra and Spark
Breakthrough OLAP performance with Cassandra and Spark
Evan Chan
 

Similar to Dictionary based Annotation at Scale with Spark, SolrTextTagger and OpenNLP (20)

The Why and How of Scala at Twitter
The Why and How of Scala at TwitterThe Why and How of Scala at Twitter
The Why and How of Scala at Twitter
 
Elasticsearch and Spark
Elasticsearch and SparkElasticsearch and Spark
Elasticsearch and Spark
 
Scaling Massive Elasticsearch Clusters
Scaling Massive Elasticsearch ClustersScaling Massive Elasticsearch Clusters
Scaling Massive Elasticsearch Clusters
 
Spark meetup TCHUG
Spark meetup TCHUGSpark meetup TCHUG
Spark meetup TCHUG
 
Scaling massive elastic search clusters - Rafał Kuć - Sematext
Scaling massive elastic search clusters - Rafał Kuć - SematextScaling massive elastic search clusters - Rafał Kuć - Sematext
Scaling massive elastic search clusters - Rafał Kuć - Sematext
 
Machine Learning With H2O vs SparkML
Machine Learning With H2O vs SparkMLMachine Learning With H2O vs SparkML
Machine Learning With H2O vs SparkML
 
Introduction to libre « fulltext » technology
Introduction to libre « fulltext » technologyIntroduction to libre « fulltext » technology
Introduction to libre « fulltext » technology
 
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
 
SFDC Introduction to Apex
SFDC Introduction to ApexSFDC Introduction to Apex
SFDC Introduction to Apex
 
ログ収集プラットフォーム開発におけるElasticsearchの運用
ログ収集プラットフォーム開発におけるElasticsearchの運用ログ収集プラットフォーム開発におけるElasticsearchの運用
ログ収集プラットフォーム開発におけるElasticsearchの運用
 
MongoDB Knowledge Shareing
MongoDB Knowledge ShareingMongoDB Knowledge Shareing
MongoDB Knowledge Shareing
 
Building a Large Scale SEO/SEM Application with Apache Solr
Building a Large Scale SEO/SEM Application with Apache SolrBuilding a Large Scale SEO/SEM Application with Apache Solr
Building a Large Scale SEO/SEM Application with Apache Solr
 
Case Study: Elasticsearch Ingest Using StreamSets at Cisco Intercloud
Case Study: Elasticsearch Ingest Using StreamSets at Cisco IntercloudCase Study: Elasticsearch Ingest Using StreamSets at Cisco Intercloud
Case Study: Elasticsearch Ingest Using StreamSets at Cisco Intercloud
 
Case Study: Elasticsearch Ingest Using StreamSets @ Cisco Intercloud
Case Study: Elasticsearch Ingest Using StreamSets @ Cisco IntercloudCase Study: Elasticsearch Ingest Using StreamSets @ Cisco Intercloud
Case Study: Elasticsearch Ingest Using StreamSets @ Cisco Intercloud
 
About elasticsearch
About elasticsearchAbout elasticsearch
About elasticsearch
 
Kafka Summit NYC 2017 Introduction to Kafka Streams with a Real-life Example
Kafka Summit NYC 2017 Introduction to Kafka Streams with a Real-life ExampleKafka Summit NYC 2017 Introduction to Kafka Streams with a Real-life Example
Kafka Summit NYC 2017 Introduction to Kafka Streams with a Real-life Example
 
MongoDB Replication fundamentals - Desert Code Camp - October 2014
MongoDB Replication fundamentals - Desert Code Camp - October 2014MongoDB Replication fundamentals - Desert Code Camp - October 2014
MongoDB Replication fundamentals - Desert Code Camp - October 2014
 
Spark - The Ultimate Scala Collections by Martin Odersky
Spark - The Ultimate Scala Collections by Martin OderskySpark - The Ultimate Scala Collections by Martin Odersky
Spark - The Ultimate Scala Collections by Martin Odersky
 
LuSql: (Quickly and easily) Getting your data from your DBMS into Lucene
LuSql: (Quickly and easily) Getting your data from your DBMS into LuceneLuSql: (Quickly and easily) Getting your data from your DBMS into Lucene
LuSql: (Quickly and easily) Getting your data from your DBMS into Lucene
 
Breakthrough OLAP performance with Cassandra and Spark
Breakthrough OLAP performance with Cassandra and SparkBreakthrough OLAP performance with Cassandra and Spark
Breakthrough OLAP performance with Cassandra and Spark
 

More from Sujit Pal

Supporting Concept Search using a Clinical Healthcare Knowledge Graph
Supporting Concept Search using a Clinical Healthcare Knowledge GraphSupporting Concept Search using a Clinical Healthcare Knowledge Graph
Supporting Concept Search using a Clinical Healthcare Knowledge Graph
Sujit Pal
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAG
Sujit Pal
 
Building Learning to Rank (LTR) search reranking models using Large Language ...
Building Learning to Rank (LTR) search reranking models using Large Language ...Building Learning to Rank (LTR) search reranking models using Large Language ...
Building Learning to Rank (LTR) search reranking models using Large Language ...
Sujit Pal
 
Cheap Trick for Question Answering
Cheap Trick for Question AnsweringCheap Trick for Question Answering
Cheap Trick for Question Answering
Sujit Pal
 
Searching Across Images and Test
Searching Across Images and TestSearching Across Images and Test
Searching Across Images and Test
Sujit Pal
 
Learning a Joint Embedding Representation for Image Search using Self-supervi...
Learning a Joint Embedding Representation for Image Search using Self-supervi...Learning a Joint Embedding Representation for Image Search using Self-supervi...
Learning a Joint Embedding Representation for Image Search using Self-supervi...
Sujit Pal
 
The power of community: training a Transformer Language Model on a shoestring
The power of community: training a Transformer Language Model on a shoestringThe power of community: training a Transformer Language Model on a shoestring
The power of community: training a Transformer Language Model on a shoestring
Sujit Pal
 
Backprop Visualization
Backprop VisualizationBackprop Visualization
Backprop Visualization
Sujit Pal
 
Accelerating NLP with Dask and Saturn Cloud
Accelerating NLP with Dask and Saturn CloudAccelerating NLP with Dask and Saturn Cloud
Accelerating NLP with Dask and Saturn Cloud
Sujit Pal
 
Accelerating NLP with Dask on Saturn Cloud: A case study with CORD-19
Accelerating NLP with Dask on Saturn Cloud: A case study with CORD-19Accelerating NLP with Dask on Saturn Cloud: A case study with CORD-19
Accelerating NLP with Dask on Saturn Cloud: A case study with CORD-19
Sujit Pal
 
Leslie Smith's Papers discussion for DL Journal Club
Leslie Smith's Papers discussion for DL Journal ClubLeslie Smith's Papers discussion for DL Journal Club
Leslie Smith's Papers discussion for DL Journal Club
Sujit Pal
 
Using Graph and Transformer Embeddings for Vector Based Retrieval
Using Graph and Transformer Embeddings for Vector Based RetrievalUsing Graph and Transformer Embeddings for Vector Based Retrieval
Using Graph and Transformer Embeddings for Vector Based Retrieval
Sujit Pal
 
Transformer Mods for Document Length Inputs
Transformer Mods for Document Length InputsTransformer Mods for Document Length Inputs
Transformer Mods for Document Length Inputs
Sujit Pal
 
Question Answering as Search - the Anserini Pipeline and Other Stories
Question Answering as Search - the Anserini Pipeline and Other StoriesQuestion Answering as Search - the Anserini Pipeline and Other Stories
Question Answering as Search - the Anserini Pipeline and Other Stories
Sujit Pal
 
Building Named Entity Recognition Models Efficiently using NERDS
Building Named Entity Recognition Models Efficiently using NERDSBuilding Named Entity Recognition Models Efficiently using NERDS
Building Named Entity Recognition Models Efficiently using NERDS
Sujit Pal
 
Graph Techniques for Natural Language Processing
Graph Techniques for Natural Language ProcessingGraph Techniques for Natural Language Processing
Graph Techniques for Natural Language Processing
Sujit Pal
 
Learning to Rank Presentation (v2) at LexisNexis Search Guild
Learning to Rank Presentation (v2) at LexisNexis Search GuildLearning to Rank Presentation (v2) at LexisNexis Search Guild
Learning to Rank Presentation (v2) at LexisNexis Search Guild
Sujit Pal
 
Search summit-2018-ltr-presentation
Search summit-2018-ltr-presentationSearch summit-2018-ltr-presentation
Search summit-2018-ltr-presentation
Sujit Pal
 
Search summit-2018-content-engineering-slides
Search summit-2018-content-engineering-slidesSearch summit-2018-content-engineering-slides
Search summit-2018-content-engineering-slides
Sujit Pal
 
Evolving a Medical Image Similarity Search
Evolving a Medical Image Similarity SearchEvolving a Medical Image Similarity Search
Evolving a Medical Image Similarity Search
Sujit Pal
 

More from Sujit Pal (20)

Supporting Concept Search using a Clinical Healthcare Knowledge Graph
Supporting Concept Search using a Clinical Healthcare Knowledge GraphSupporting Concept Search using a Clinical Healthcare Knowledge Graph
Supporting Concept Search using a Clinical Healthcare Knowledge Graph
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAG
 
Building Learning to Rank (LTR) search reranking models using Large Language ...
Building Learning to Rank (LTR) search reranking models using Large Language ...Building Learning to Rank (LTR) search reranking models using Large Language ...
Building Learning to Rank (LTR) search reranking models using Large Language ...
 
Cheap Trick for Question Answering
Cheap Trick for Question AnsweringCheap Trick for Question Answering
Cheap Trick for Question Answering
 
Searching Across Images and Test
Searching Across Images and TestSearching Across Images and Test
Searching Across Images and Test
 
Learning a Joint Embedding Representation for Image Search using Self-supervi...
Learning a Joint Embedding Representation for Image Search using Self-supervi...Learning a Joint Embedding Representation for Image Search using Self-supervi...
Learning a Joint Embedding Representation for Image Search using Self-supervi...
 
The power of community: training a Transformer Language Model on a shoestring
The power of community: training a Transformer Language Model on a shoestringThe power of community: training a Transformer Language Model on a shoestring
The power of community: training a Transformer Language Model on a shoestring
 
Backprop Visualization
Backprop VisualizationBackprop Visualization
Backprop Visualization
 
Accelerating NLP with Dask and Saturn Cloud
Accelerating NLP with Dask and Saturn CloudAccelerating NLP with Dask and Saturn Cloud
Accelerating NLP with Dask and Saturn Cloud
 
Accelerating NLP with Dask on Saturn Cloud: A case study with CORD-19
Accelerating NLP with Dask on Saturn Cloud: A case study with CORD-19Accelerating NLP with Dask on Saturn Cloud: A case study with CORD-19
Accelerating NLP with Dask on Saturn Cloud: A case study with CORD-19
 
Leslie Smith's Papers discussion for DL Journal Club
Leslie Smith's Papers discussion for DL Journal ClubLeslie Smith's Papers discussion for DL Journal Club
Leslie Smith's Papers discussion for DL Journal Club
 
Using Graph and Transformer Embeddings for Vector Based Retrieval
Using Graph and Transformer Embeddings for Vector Based RetrievalUsing Graph and Transformer Embeddings for Vector Based Retrieval
Using Graph and Transformer Embeddings for Vector Based Retrieval
 
Transformer Mods for Document Length Inputs
Transformer Mods for Document Length InputsTransformer Mods for Document Length Inputs
Transformer Mods for Document Length Inputs
 
Question Answering as Search - the Anserini Pipeline and Other Stories
Question Answering as Search - the Anserini Pipeline and Other StoriesQuestion Answering as Search - the Anserini Pipeline and Other Stories
Question Answering as Search - the Anserini Pipeline and Other Stories
 
Building Named Entity Recognition Models Efficiently using NERDS
Building Named Entity Recognition Models Efficiently using NERDSBuilding Named Entity Recognition Models Efficiently using NERDS
Building Named Entity Recognition Models Efficiently using NERDS
 
Graph Techniques for Natural Language Processing
Graph Techniques for Natural Language ProcessingGraph Techniques for Natural Language Processing
Graph Techniques for Natural Language Processing
 
Learning to Rank Presentation (v2) at LexisNexis Search Guild
Learning to Rank Presentation (v2) at LexisNexis Search GuildLearning to Rank Presentation (v2) at LexisNexis Search Guild
Learning to Rank Presentation (v2) at LexisNexis Search Guild
 
Search summit-2018-ltr-presentation
Search summit-2018-ltr-presentationSearch summit-2018-ltr-presentation
Search summit-2018-ltr-presentation
 
Search summit-2018-content-engineering-slides
Search summit-2018-content-engineering-slidesSearch summit-2018-content-engineering-slides
Search summit-2018-content-engineering-slides
 
Evolving a Medical Image Similarity Search
Evolving a Medical Image Similarity SearchEvolving a Medical Image Similarity Search
Evolving a Medical Image Similarity Search
 

Recently uploaded

Adjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTESAdjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTES
Subhajit Sahu
 
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
ukgaet
 
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Subhajit Sahu
 
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
ewymefz
 
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project PresentationPredicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Boston Institute of Analytics
 
FP Growth Algorithm and its Applications
FP Growth Algorithm and its ApplicationsFP Growth Algorithm and its Applications
FP Growth Algorithm and its Applications
MaleehaSheikh2
 
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
slg6lamcq
 
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
mbawufebxi
 
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
oz8q3jxlp
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP
 
一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单
enxupq
 
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdfCh03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
haila53
 
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
ahzuo
 
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
slg6lamcq
 
Empowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptxEmpowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptx
benishzehra469
 
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
AbhimanyuSinha9
 
Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
TravisMalana
 
一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单
enxupq
 
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
ewymefz
 
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
John Andrews
 

Recently uploaded (20)

Adjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTESAdjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTES
 
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
 
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
 
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
 
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project PresentationPredicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
 
FP Growth Algorithm and its Applications
FP Growth Algorithm and its ApplicationsFP Growth Algorithm and its Applications
FP Growth Algorithm and its Applications
 
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
 
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
 
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
 
一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单
 
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdfCh03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
 
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
 
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
 
Empowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptxEmpowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptx
 
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
 
Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
 
一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单
 
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
 
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
 

Dictionary based Annotation at Scale with Spark, SolrTextTagger and OpenNLP

  • 1. Dictionary Based Annotation at scale with Spark, SolrTextTagger and OpenNLP Sujit Pal, Elsevier Labs
  • 2. Introduction • About Me – Work at Elsevier Labs. – Interested in Search, NLP and Distributed Processing. – URL: labs.elsevier.com – Email: sujit.pal@elsevier.com – Twitter: @palsujit • About Elsevier – World’s largest publisher of STM Books and Journals. – Uses Data to inform and enable consumers of STM info. – And like everybody else, we are hiring!
  • 3. Agenda • Overview and Background • Features and API • Scaling out • Q&A
  • 5. Problem Definition • What is the problem? – Annotate millions of documents from different corpora. • 14M docs from Science Direct alone. • More from other corpora, dependency parsing, etc. – Critical step for Machine Reading and Knowledge Graph applications. • Why is this such a big deal? – Takes advantage of existing linked data. – No model training for multiple complex STM domains. – However, simple until done at scale.
  • 7. Dictionary Based NE Annotator (SoDA) • Part of Document Annotation Pipeline. • Annotates text with Named Entities from external Dictionaries. • Built with Open Source Components – Apache Solr – Highly reliable, scalable and fault-tolerant search index. – SolrTextTagger – Solr component for text tagging, uses Lucene FST technology. – Apache OpenNLP – Machine Learning based toolkit for processing Natural Language Text. – Apache Spark – Lightning fast, large scale data processing. • Uses ideas from other Open Source libraries – FuzzyWuzzy – Fuzzy String Matching like a boss. • Contributed back to Open Source – https://github.com/elsevierlabs-os/soda
  • 9. How does it work (exact/case matching)? • Uses Aho-Corasick algorithm – treats the dictionary as a FST and streams text against it. Matches all patterns simultaneously. Diagram shows FST for vocabulary {“his”, “he”, “her”, “hers”, “she”}. • Michael McCandless implemented FSTs in Lucene (blog post). • David Smiley built SolrTextTagger to use Lucene FSTs. • SoDA uses SolrTextTagger for streaming exact and case-insensitive matching.
  • 10. How does it work (fuzzy matching)? • Pre-normalizes each dictionary entry into various forms – Original – “Astrocytoma, Subependymal Giant Cell” – Lowercased – “astrocytoma, subependymal giant cell” – Punctuation – “astrocytoma subependymal giant cell” – Sorted – “astrocytoma cell giant subependymal” – Stemmed – “astrocytoma cell giant subependym” • Uses OpenNLP to parse input text into phrases, normalizes each phrase into the desired normalization level and matches against corresponding field. • Caller specifies normalization level.
  • 12. Feature Overview • Provides JSON over HTTP interface – Compose request as JSON document – HTTP POST document to JSON endpoint URL (HTTP GET for URL only requests). – Receive response as JSON document. • Language-Agnostic and Cross-Platform. • API can be used from standalone clients, Spark jobs and Databricks notebooks. • Examples in Scala and Python
  • 13. Services • Status – index.json – returns a JSON (suitable for health check monitoring) • Single Lexicon Services – annot.json – annotates a block of text in streaming manner. Supports different levels of matching (strict to permissive). – matchphrase.json – annotates short phrases. Supports same matching levels as annot.json. • Multi-Lexicon Services – dicts.json – lists all lexicons available. – coverage.json – returns number of annotations by lexicon found for text across all available lexicons. • Indexing Services – delete.json – deletes entire lexicon from index. – add.json – adds an entry to the specified lexicon.
  • 14. Annotation Service I/O Example annotation request { “lexicon” : “countries”, “text” : “Institute of Clean Coal Technology, East China University”, “matching” : “exact” } Example annotated response [ { “id” : “http://www.geonames.org/CHN”, “lexicon” : “countries”, “begin” : 41, “end” : 46, “coveredText” : “China”, “confidence” : 1.0 } ]
  • 15. Calling Annotation Service • Client originally written in Python, using built-in json and requests libraries. • For Scala client, SoDA JAR provides classes to mimic json and requests functionality in Scala. • Input to both our (somewhat contrived) examples are: (pii: String, affStr: String) tuples as shown. • Match against a country lexicon to find country names.
  • 16. Annotation Service – Python Client
  • 17. Annotation Service – Scala Client
  • 18. Annotation Service - Outputs • Each Annotation result provides: – Entity ID (not shown) – Begin position in text – End position in text – Matched Text – Confidence (not shown) • Zero or more Annotations possible per input text.
  • 19. Loading Dictionaries • Dictionary entries represented by: – Lexicon Name – Entry ID (unique across lexicons) – List of possible synonym terms • JSON Request to add an entry for MeSH dictionary. { “id”: http://id.nlm.nih.gov/mesh/2015/M0021699, “lexicon”: “mesh”, “names”: [“Baby Tooth”, “Dentitions, Primary”, “Milk Tooth”, ...], “commit”: false } • Preferable to commit periodically and after batch.
  • 20. Loading Dictionaries – Scala Client
  • 22. SoDA Performance – Expected • Test: annotate 14M docs in “reasonable time”. – Approx. 3s/doc with SoDA+Solr on ec2 r3.large box (15.5GB RAM, 32GB SSD, 2vCPU). – Total estimated time: 16.2 months! • Questions – Can we make the process faster? – Can we scale out the process?
  • 23. Where is the time being spent? • Majority of time spent in Solr. • Some time spent in SoDA (decreases slower than Solr as transactions get shorter). • Almost no additional time spent in Spark.
  • 24. Optimization #1: Combine Paragraphs • Performance measured using 10K random articles. • Time to annotate 1 article: Mean 2.9s, Median 2.1s. • Annotation done per paragraph, 40 paragraphs/article on average. • Reduce HTTP network + parsing overhead by sending full document. • Time to annotate 1 article: Mean 1.4s, Median 0.3s. • 2x - 7x improvement.
  • 25. Optimization #2: Tune Solr GC • OOB Solr would GC very frequently, slowing down Spark and causing timeouts. • Current Index Size: 2.1 GB • Need to size box so approximately 75% RAM given to OS and remaining 25% allocated to Solr (Uwe Schindler's Blog Post). • Heap size should be 3-4x index size (Internal Guideline). • Current Solr Heap Size = 8 GB • RAM is 30.5 GB • CMS (Concurrent Mark-Sweep) Garbage Collection.
  • 26. Optimization #3: Larger Spark Cluster • Running on cluster Master + 4 Workers. • Each worker has 1 Executor with 4 Cores. • Number of simultaneous Solr clients = 16 (4 workers * 1 executor * 4 cores) – measured with lsof –p in a loop on Solr server. • Throughput increases with number of partitions till about 2x the number of worker cores. • Best throughput 5 docs/sec with #- partitions=30 for 16 cores.
  • 27. Optimization #4: Solr Scaleout • Upgrade to r3.xlarge (30.5GB RAM, 80GB SSD, 4vCPU) – Throughput 7.9 docs/s • Upgrade to 2x r3.2xlarge (61GB RAM, 160GB SSD, 8vCPU) with c3.large LB (3.75GB RAM, 32GB Disk, 2vCPU) running HAProxy. #-workers #- requests/serv er Throughput (docs/sec) 4 8 8.62 8 16 17.334 12 24 20.64 16 32 26.845
  • 28. Performance – Did we meet expectations? • At 26 docs/sec and 14M documents, it will take our current cluster little over 6 days to annotate against our largest dictionary (8M entries). • Throughput scales linearly @ 1.5 docs/sec per additional worker, as long as Solr servers have enough capacity to serve requests. • Each Solr box (as configured) can serve sustained loads of up to 30-35 simultaneous requests. • Number of simultaneous requests approximately equal to number of worker cores. • Example: annotate 14M documents in 3 days. – Throughput required: 14M / (3 * 86400) = 54 docs/s – Number of workers: 54 / 1.5 = 36 workers – Number of simultaneous requests (4 cores/worker) = 36 * 4 = 144 – Number of Solr servers: 144 / 32 = 4.5 = 5 servers
  • 29. Future Work • More Lexicons • Investigate Lexicon-Centric scale out. – Allows more lexicons. – Not limited to single index. • Move to Lucene, eliminate network overhead. – Asynchronous model – Use Kafka topic with multiple partitions – Lucene based tagging consumers – Write output to S3.
  • 30. Q&A
  • 31. Thank you for listening! • Questions? • SoDA available on GitHub – https://github.com/elsevierlabs-os/soda • Contact me – sujit.pal@elsevier.com