Improving Organizational Knowledge
with Natural Language Processing
Enriched Data Pipelines
Dataworks Summit, Washington D.C.
May 23, 2019
Introductions
Eric Wolok
● Over $500 million in commercial
real estate investment sales
● Specializing in Multi-Family
within the Chicago market
● ericw@partnersco.com
Jeff Zemerick
● Cloud/big-data consultant
● Apache OpenNLP PMC member
● ASF Member
● Morgantown, WV
● jeff.zemerick@mtnfog.com
2
Information is Everywhere
The answer to a question is often spread across
multiple locations.
In the real estate domain: (diagram of example data sources, ending with “Others?”)
3
Answers are Distributed
Bringing these sources together brings the full answer together.
4
Real Estate
Includes structured and unstructured data:
● Property ownership history
● Individual buyer’s purchase history
● News articles mentioning properties
● Property listings
● And so on…
This data can be widely distributed across many
sources.
Let’s bring it together.
5
Key Technologies Used
● Apache Kafka
○ Message broker for data ingest.
● Apache NiFi
○ Orchestrate the flow of data in a pipeline.
● Apache MiNiFi
○ Facilitate ingest from edge locations.
● Apache OpenNLP
○ Process unstructured text.
● Apache Superset
○ Create dashboards from the extracted
data.
● Apache ZooKeeper
○ Cluster coordination for Kafka and NiFi.
● Amazon Web Services
○ Managed by CloudFormation for
infrastructure as code.
● Docker
○ Containerization of NLP microservices.
● Relational database
○ Storage of data extracted by the pipeline.
6
Apache Kafka
● Pub/sub message broker for
streaming data.
● Can handle massive amounts of
messages.
● Allows replay of data.
7
By Ch.ko123 - Own work, CC BY 4.0, https://commons.wikimedia.org/w/index.php?curid=59871096
Apache NiFi and MiNiFi
Apache NiFi
● Provides “directed graphs for data routing,
transformation, and system mediation
logic.”
● Has a drag-and-drop interface where
“processors” are configured to form the
data flow.
Apache MiNiFi
● A sub-project of NiFi that extends NiFi’s
capabilities to IoT and edge devices.
● Facilitates data collection and ingestion
into a NiFi cluster or other system.
8
Apache OpenNLP
9
● A machine learning-based toolkit for processing natural language text.
● Supports:
○ Tokenization
○ Sentence segmentation
○ Part-of-speech tagging
○ Named entity extraction
○ Document classification
○ Chunking
○ Parsing
○ Language detection
The Unstructured Data Pipeline
Data Ingestion → NLP → Sentiment Analysis → Enrichment / Deduplication → Storage → Reporting → A happy user!
10
The Tools of the Unstructured Data Pipeline
Data Ingestion (Kafka, MiNiFi) → NLP (OpenNLP, Docker) → Sentiment Analysis (OpenNLP) → Enrichment / Deduplication (NiFi) → Storage (Database) → Reporting (Superset)
Coordinated by ZooKeeper; all running on Amazon Web Services.
11
The Architecture
(architecture diagram)
12
Data Ingestion
● Plain text is published to Kafka topics.
● Each published message is a single
document.
● Originates from:
○ Bash scripts publishing to the Kafka topic.
○ MiNiFi publishing to the Kafka topic from
database tables (QueryDatabaseTable).
13
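The ingest step above can be sketched in Python. The directory layout, topic name, and one-file-per-document convention below are assumptions for illustration; the talk used bash scripts and MiNiFi, not this code.

```python
from pathlib import Path

def documents_from_dir(dir_path):
    """Yield (filename, text) pairs -- one Kafka message per document,
    matching the slide's 'each published message is a single document'."""
    for p in sorted(Path(dir_path).glob("*.txt")):
        yield p.name, p.read_text(encoding="utf-8")

# Publishing side (sketch; requires kafka-python and a running broker):
# from kafka import KafkaProducer
# producer = KafkaProducer(bootstrap_servers="localhost:9092")
# for name, text in documents_from_dir("docs/"):
#     producer.send("documents", text.encode("utf-8"))
```

Keeping one document per message makes downstream NLP calls stateless and lets Kafka replay re-drive the whole pipeline.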
NLP Microservices
● Microservices wrap OpenNLP functionality:
○ Language detection microservice
○ Sentence extraction microservice
○ Sentence tokenization microservice
○ Named-entity extraction microservice
● Spring Boot applications as runnable jars.
○ No external dependencies other than JRE.
● Provide simple REST-based interfaces to NLP operations.
○ e.g. /api/detect, /api/sentences, /api/tokens, /api/extract
● Available on GitHub under Apache license.
14
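Chaining the four REST endpoints can be sketched as below. The endpoint paths come from the slide; the injected `post` callable (performing the HTTP request and returning the decoded JSON body) is an assumption so the chain can be shown without live services.

```python
def run_nlp_chain(text, post):
    """Call the NLP microservices in order, feeding each output forward.
    `post(path, payload)` performs the HTTP call and returns decoded JSON."""
    language = post("/api/detect", text)                  # language detection
    sentences = post("/api/sentences", text)              # sentence extraction
    tokens = [post("/api/tokens", s) for s in sentences]  # per-sentence tokenization
    entities = [post("/api/extract", t) for t in tokens]  # per-sentence NER
    return {"language": language, "sentences": sentences,
            "tokens": tokens, "entities": entities}
```

Because each service is stateless, `post` can point at a load balancer and the chain scales without any coordination between calls.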
NLP Microservices Pipelines
Language Detection routes each document to a per-language pipeline (eng, deu, others), each consisting of:
Sentence Extraction → Sentence Tokenization → Entity Extraction
Independent pipelines support multiple source languages.
15
NLP Microservices Deployment
● Deployed as containers
○ https://github.com/mtnfog/nlp-building-blocks
● Stateless
○ Each can scale independently.
○ Deployed behind a load balancer.
(diagram: a Load Balancer fronting multiple NLP containers)
16
NLP Operations - An Example
George Washington was president. He was the first president.
1. Language = eng
2. Sentences = [“George Washington was president.”, “He was
the first president.”]
3. Tokens = {[“George”, “Washington”, “was”, “president”],
[“He”, “was”, “the”, “first”, “president”]}
4. Entities = [“George Washington”]
17
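The worked example can be reproduced with toy stand-ins for each service. These naive functions only exist to show the data shapes at each step; OpenNLP's model-based components are far more robust than the heuristics below.

```python
import re

def split_sentences(text):
    # Naive stand-in for the sentence-extraction service.
    return re.split(r"(?<=\.)\s+", text.strip())

def tokenize(sentence):
    # Naive stand-in for the tokenization service; drops punctuation.
    return re.findall(r"\w+", sentence)

def find_person_candidates(token_lists):
    # Toy heuristic standing in for model-based NER:
    # collect runs of two or more capitalized tokens.
    entities, run = [], []
    for tokens in token_lists:
        for tok in tokens + [""]:        # "" sentinel flushes the final run
            if tok[:1].isupper():
                run.append(tok)
            else:
                if len(run) >= 2:
                    entities.append(" ".join(run))
                run = []
    return entities
```

Running these over the slide's sentence yields the same intermediate results: two sentences, the token lists shown, and the single entity “George Washington”.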
Managing the NLP Models
NLP models are stored externally and pulled by the containers from S3/HTTP
when first needed.
(diagram: NLP microservice containers pulling models from an S3 bucket)
Models organized in storage as: s3://nlp-models/[language]/[model-type]/[model-filename]
18
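The storage layout above maps directly to a key-builder helper. The function name and the example model-type/filename in the test are assumptions; only the path layout itself comes from the slide.

```python
def model_key(language, model_type, filename, bucket="nlp-models"):
    """Build a model's storage location following the slide's layout:
    s3://nlp-models/[language]/[model-type]/[model-filename]"""
    return f"s3://{bucket}/{language}/{model_type}/{filename}"
```

Because containers resolve models by convention at first use, models can be updated in the bucket without redeploying any code.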
Extracted Information
Entity types extracted from natural language text:
● People’s names - John Smith (model-based)
● Street addresses - 123 Main St, Anytown, CA (model-based / regex)
● Currency - $250,000, $300000 (regex)
● Phone numbers - (123) 456-7890, 123-456-7890 (regex)
19
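Simplified versions of the regex-based extractors are sketched below. The talk's actual patterns are not shown, so these are minimal patterns that cover only the slide's examples.

```python
import re

# Currency: "$250,000" or "$300000", with optional cents.
CURRENCY = re.compile(r"\$\d+(?:,\d{3})*(?:\.\d{2})?")

# Phone numbers: "(123) 456-7890" or "123-456-7890".
PHONE = re.compile(r"\(?\d{3}\)?[ -]?\d{3}-\d{4}")
```

Regex extraction complements the model-based extractors: entity types with rigid surface forms need no training data, and matches can be assigned a fixed confidence.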
Entity Confidence Thresholds
● Each entity extracted by a model has a confidence value (0.00 to 1.00) that
indicates the model’s confidence that it actually is an entity.
● Varies depending on the text and the model.
○ How well does the actual text represent the training text?
● Monitored thresholds to determine the best minimum cutoff values.
○ Entities with confidence less than a threshold are filtered out.
20
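The filtering step is a one-liner; the 0.75 default below is illustrative only, since the talk determined real cutoffs by monitoring thresholds per model.

```python
def filter_entities(entities, min_confidence=0.75):
    """Keep only entities at or above the minimum-confidence cutoff.
    0.75 is an illustrative default, not the talk's actual value."""
    return [e for e in entities if e["confidence"] >= min_confidence]
```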
Storage - Persisting the Extracted Data
● JSON responses from the microservices are converted to SQL.
○ Via NiFi’s ConvertJSONToSQL processor.
● Extracted entities are persisted to a relational database.
○ Via NiFi’s PutSQL processor.
● Extracted entities are persisted to an Elasticsearch index.
○ Via NiFi’s PutElasticsearch processor.
21
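The JSON-to-SQL conversion that NiFi's ConvertJSONToSQL processor performs can be mimicked for a single entity as below; the parameterized form is an illustration, while the table and column names follow the INSERT shown on the NiFi Flow slide.

```python
def entity_to_insert(entity, table="entities"):
    """Turn one extracted-entity JSON object into a parameterized
    INSERT statement plus its bind values."""
    sql = f"INSERT INTO {table} (TEXT, CONFIDENCE) VALUES (?, ?)"
    return sql, (entity["text"], entity["confidence"])
```

In the real pipeline this step happens inside NiFi, so no conversion code has to be written or maintained.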
Reporting
● Dashboards in Apache Superset show views on the data.
22
Entity Counts by Context
● Shows a high-level view of entity counts per context.
23
A context name is
arbitrary, e.g. a
book, a chapter, a
document, etc.
Entities by Type
● Shows entity counts by type of entity.
24
Stopped ingestion
at 1000 person
entities.
Entity Confidence Values
● Shows the distribution of entity confidence values (0.0 to 1.0).
25
Entities with
confidence = 1 are
phone numbers.
Putting It All Together - The NiFi Pipeline
26
The NiFi Flow
27
Example document: “George Washington was president. He was the first president.”
1. Language attribute set: eng
2. Tokens: [“George”, “Washington”, “was”, “president”], [“He”, “was”, “the”, “first”, “president”]
3. Entities: [“George Washington”] → {entity: text: “George Washington”, confidence: 0.9}
4. SQL: INSERT INTO entities (TEXT, CONFIDENCE) VALUES (“George Washington”, 0.9)
→ One entity sent to the database.
If no entities are found ([] → {}), nothing is sent to the database.
Design Challenges
● Repeatable
○ How to deploy it for testing, for different environments, for other clients, …?
● Scalable
○ How can we handle massive amounts of text efficiently?
● Extensible / Updatable
○ How can we ensure that the process can easily be customized or changed?
28
Repeatable
● 100% infrastructure as code.
○ AWS CloudFormation + Bash scripts = Source control in git
● Fully automated.
○ No part of the deployment/configuration is done manually.
○ Entire architecture is created by kicking off a command.
● Apache NiFi flow is version controlled in the NiFi Registry.
○ Registry is automatically connected to NiFi when NiFi is configured.
○ NiFi flow is also under source control.
29
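The “entire architecture from one command” approach might look like the fragment below. Every resource name, size, and property here is illustrative, not the talk's actual template; it only shows the CloudFormation shape of an autoscaled NLP tier.

```yaml
# Illustrative CloudFormation fragment (not the talk's real template).
AWSTemplateFormatVersion: "2010-09-09"
Description: NLP pipeline infrastructure (sketch)
Resources:
  NlpServiceGroup:
    Type: AWS::AutoScaling::AutoScalingGroup
    Properties:
      MinSize: "1"
      MaxSize: "4"
      LaunchConfigurationName: !Ref NlpLaunchConfig
  NlpLaunchConfig:
    Type: AWS::AutoScaling::LaunchConfiguration
    Properties:
      ImageId: ami-12345678   # placeholder AMI
      InstanceType: t3.medium
```

Deployed with a single command such as `aws cloudformation deploy --template-file template.yml --stack-name nlp-pipeline`, which keeps the whole environment reproducible from source control.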
Scalable
● Leveraged autoscaling from the cloud platform where appropriate.
○ NLP microservice containers and underlying hardware have scaling policies set.
● Number of Kafka, NiFi, ZooKeeper instances can be increased as needed.
30
Extensible
● Using Apache NiFi insulates us from a hard-coded solution.
○ No pipeline code to write and manage!
● We can modify the pipeline simply by adding new processors.
● Architecture is layered logically.
○ Layers can be swapped in and out if technology requirements change.
○ Layers consist of cloud-specific networking, ZooKeeper, Kafka, NiFi, etc.
● Data can be replayed through Kafka.
○ Allows for updating the data captured.
31
Summary
We presented a method for extracting information from distributed data.
● This pipeline…
○ Ingests data
○ Processes the data
○ Extracts the entities
○ Stores the entities
○ Visualizes the results
● … in a scalable, loosely-connected, and repeatable fashion.
● We can build systems, such as recommendation engines, around this data.
32
Questions?
Credits and thanks to:
● The open source projects and contributors who make this work possible.
33


Editor's Notes

  • #5 This data may be geographically distributed between different organizations or it may exist in a single organization.
  • #6 Can be distributed inside and outside an organization.
  • #19 This allows updating the models without having to redeploy code.