Getting Started with Unstructured Data

Christine Connors & Kevin Lynch
TriviumRLG LLC

Semantic Tech & Business, Washington D.C.
November 29, 2011


Meta

✤ Presenter: Christine Connors

✤ @cjmconnors

✤ Presenter: Kevin Lynch

✤ @kevinjohnlynch

✤ Principals at www.triviumrlg.com


Agenda

✤ What is unstructured data?

✤ Where do we find it?

✤ How important is it?

✤ How do we visualize it?

✤ Machine processing for actionable data

✤ Tools


What is unstructured data?

✤ Data which is

✤ Not in a database

✤ Does not adhere to a formal data model

✤ Content


Isn’t that a misnomer?

✤ Problematic term

✤ The presence of object metadata or aesthetic markup does not alone
give ‘structure’ in this sense of the word

✤ Object metadata = machine or applied properties

✤ Aesthetic markup = stylesheets; rendering information

✤ Semi-structured data is typically treated as unstructured for the
purposes of machine processing and analysis


Types of ‘un’structured data

✤ Text-based documents

✤ Word processing, presentations, email, blogs, wikis, tweets, web
pages, web components (read/write web)

✤ Audio/video files


Where do we find it?

✤ Office productivity suites

✤ Content management systems

✤ Digital asset management systems

✤ Web content management systems

✤ Wikis, blogs, comment & discussion threads

✤ Social networking tools

✤ Twitter, Yammer, instant messengers


Is it really that important?
Chart: Structured 15%, Unstructured 85%


What’s in that 80-85%?

✤ Progress reports - created in a word processor

✤ Dashboards - created in presentation software

✤ Progress reports - color coded text in a spreadsheet

✤ Brainstorming - in messaging systems

✤ Decision making - in email

✤ Business intelligence - on the web and more


How can we make the data more
actionable?

✤ Identify it

✤ Convert to a format you can work with

✤ Add structure, meaning:

✤ information extraction

✤ annotation

✤ content analytics


What about enterprise search?

✤ First line of defense

✤ Points you at the highest relevancy ranked data via pattern matching
and statistical analysis

✤ Does not assist in other visualizations or transformations without
further machine processing


Machine Processing

Diagram components: Unstructured Data; Natural Language Processing; Rules-based Classification; Statistical Analysis; Semantic Analysis; Machine Processing Platform; Federated Search; API; Index; Visualizations; Data Stores

Let’s go a little deeper...


Good News, Bad News

✤ Good: Basic text analysis tools are widely available; cheap or free

✤ Good: The range of information you can now consider has broadened;
the intelligence you can bring to bear on that information has
increased

✤ Bad: Skillsets not widely available (but they are available!)

✤ Good: You can get started right here, understanding, identifying the
sources, and possible approaches


What Data Doesn’t Do

✤ From Coco Krumme in “Beautiful Data”

✤ Data doesn’t drive everything.

✤ Note: “narrative fallacy,” “confirmation bias,” “paradox of choice”

✤ Data doesn’t: scale (cognitively), alone explain, predict

✤ The real world doesn’t create random variables

✤ Data doesn’t stand alone


Integrating Unstructured
Data

Images

From Oracle 11g presentation at www.nmoug.org/papers/11g_High_Level_April08.ppt

The Goal: Usable Knowledge

✤ Information extraction is NOT the goal

✤ Information extraction is a means to an end

✤ Knowledge discovery is the goal

✤ To this end, we will perform lots of processing to move from bits to
usable meaning


So many <near> synonyms

✤ Text analytics

✤ Content analytics

✤ Text mining

✤ Data mining

✤ Information extraction

✤ And then there’s Natural Language Processing


What’s the same?

✤ Moving from bits to meaning requires processing, and a lot of that
processing is the same, no matter what you call it

✤ We will focus primarily on textual information today


Natural Language

✤ From Peter Norvig’s “Natural Language Corpus Data” chapter in
“Beautiful Data”

✤ Google’s 1 trillion-word corpus investigating probabilistic language
models

✤ 13 million types (unique words, punctuation)

✤ 100k types cover 98% of the corpus

✤ For: word segmentation, spelling correction, language identification,
spam detection, author identification

✤ %? = “chooses pain” ; “in sufficient numbers”
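The distinction between tokens (every occurrence) and types (unique items) behind the figures above can be sketched in a few lines of Python. This is a minimal illustration on a toy string, not Norvig's code; the sample text and the function name `coverage` are assumptions made for the example.

```python
import re
from collections import Counter

def coverage(text: str, top_k: int) -> float:
    """Fraction of all tokens accounted for by the top_k most frequent types."""
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    counts = Counter(tokens)
    covered = sum(n for _, n in counts.most_common(top_k))
    return covered / len(tokens)

# Toy corpus (an assumption for illustration only).
sample = "the cat sat on the mat . the dog sat on the log ."
tokens = re.findall(r"\w+|[^\w\s]", sample.lower())

print(len(tokens), "tokens,", len(set(tokens)), "types")
print(round(coverage(sample, 3), 2), "of tokens covered by the 3 most frequent types")
```

On a real corpus the same calculation shows how a small share of the types covers most of the running text.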


Information Extraction

✤ Token identification - “tokenization”

✤ Word segmentation

✤ Sentence splitting

✤ Part-of-speech tagging - “POS” tagging (noun, verb, adverb, adjective,
etc.)

✤ Phrase identification - noun phrase

✤ Entity extraction - people, places, events, dates, organizations
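The first steps in the list above, sentence splitting and tokenization, can be approximated with regular expressions. The sketch below is deliberately naive and uses only the Python standard library; the function names and the splitting rules are assumptions for illustration, not how any particular toolkit does it.

```python
import re

def split_sentences(text: str) -> list[str]:
    # Naive rule: a sentence ends at ., ! or ? followed by whitespace and a capital letter.
    return re.split(r"(?<=[.!?])\s+(?=[A-Z])", text.strip())

def tokenize(sentence: str) -> list[str]:
    # Words (with an optional internal apostrophe), numbers, or single punctuation marks.
    return re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?|\d+(?:\.\d+)?|[^\w\s]", sentence)

text = "Joe is CEO at IBM. He is an IEEE member."
for sentence in split_sentences(text):
    print(tokenize(sentence))
```

A rule this simple breaks on abbreviations such as "Dr. Big Head", which is one reason production sentence splitters carry abbreviation lists.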



✤ Cluster analysis - group related information, where relationship may not
be known

✤ Classification - mapping to specific categories

✤ Dependency identification / Rule generation

✤ Relationship detection - e.g. “Joe” “is CEO” at “IBM”

✤ Coreference resolution (anaphoric reference resolution)

✤ e.g., “Joe is CEO at IBM. He is an IEEE member.”

✤ Summarization - key concepts or key sentences
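Relationship detection, as in the “Joe” “is CEO” at “IBM” example above, can be illustrated with a single hand-written pattern. This is a toy sketch: the pattern and the function name `extract_relations` are assumptions, and a single regular expression covers only one phrasing of the relationship.

```python
import re

# Hypothetical pattern for "<Person> is <role> at <Organization>".
PATTERN = re.compile(r"\b([A-Z][a-z]+) is (\w+(?: \w+)*?) at ([A-Z]\w+)")

def extract_relations(text: str) -> list[tuple[str, str, str]]:
    """Return (person, role, organization) triples found in the text."""
    return [(m.group(1), m.group(2), m.group(3)) for m in PATTERN.finditer(text)]

print(extract_relations("Joe is CEO at IBM. He is an IEEE member."))
```

The second sentence yields nothing here: linking "He" back to "Joe" is the job of coreference resolution, which this sketch does not attempt.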


IR and IE

✤ IR (Information Retrieval) versus IE (Information Extraction)

✤ IR retrieves documents from collections; IE retrieves facts and structured
information from collections

✤ In IR, the objects of analysis are documents; in IE, the objects of analysis
are facts

✤ IE returns knowledge at a deeper level than traditional IR

✤ Results may be imperfect, and linking them back to documents adds
value

✤ Sound familiar? (semantic web, linked data)
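The contrast between IR and IE can be shown side by side on two toy documents. In this sketch the document collection, the pattern, and the function names are all assumptions for illustration: `search` returns documents, while `extract_facts` returns facts that keep a link back to their source document.

```python
import re
from collections import defaultdict

docs = {
    "d1": "Joe is CEO at IBM.",
    "d2": "IBM opened a new lab.",
}

# IR: an inverted index maps each term to the documents that contain it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in re.findall(r"\w+", text.lower()):
        index[term].add(doc_id)

def search(term: str) -> list[str]:
    return sorted(index.get(term.lower(), set()))

# IE: a toy pattern pulls out facts and records the document each came from.
FACT = re.compile(r"([A-Z][a-z]+) is (\w+) at ([A-Z]\w+)")

def extract_facts() -> list[tuple[str, str, str, str]]:
    return [(m.group(1), m.group(2), m.group(3), doc_id)
            for doc_id, text in docs.items()
            for m in FACT.finditer(text)]

print(search("IBM"))       # documents that mention the term
print(extract_facts())     # facts, each linked back to a document
```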


Two primary system types

Knowledge Engineering

✤ Rule based

✤ Developed by experienced language engineers

✤ Make use of human intuition

✤ Require only small amount of training data

✤ Development can be very time consuming

✤ Some changes may be hard to accommodate

Learning Systems

✤ Use statistics or other machine learning

✤ Developers do not need language engineering expertise

✤ Require large amounts of annotated training data

✤ Some changes may require re-annotation of the entire training corpus

From http://gate.ac.uk/sale/talks/gate-course-may11/track-1/module-2-ie/module-2-ie.pdf


Text

Predicate
Subject Object

Two views of the semantic web
Machine learning, natural language processing, artificial intelligence and linked data

Images from Wikipedia


Named Entities

✤ What is NER?

✤ Named Entity Recognition

✤ identifying proper names in texts, and classification into a set of
predefined categories of interest

✤ Named entity recognition is the cornerstone of Information
Extraction, providing a foundation from which to build complex
information extraction systems


Named Entities

✤ Person names

✤ Organizations (companies, government organizations, committees)

✤ Locations (cities, countries, rivers)

✤ Date and time expressions

✤ Measures (percent, money, weight)

✤ Email addresses, web addresses, street addresses

✤ Some domain-specific entities: names of drugs, medical conditions,
names of ships, bibliographic references, etc.
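Some of the categories above (dates, measures, email addresses) follow surface patterns regular enough to catch with regular expressions, while person and organization names usually need lists or context. The sketch below covers only the pattern-friendly cases; the patterns, labels, and sample sentence are assumptions for illustration and are far from complete.

```python
import re

MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")

# Hypothetical, deliberately narrow patterns for a few entity types.
PATTERNS = {
    "Date": re.compile(rf"\b(?:{MONTHS}) \d{{1,2}}, \d{{4}}\b"),
    "Percent": re.compile(r"\b\d+(?:\.\d+)?%"),
    "Money": re.compile(r"\$\d+(?:,\d{3})*(?:\.\d+)?"),
    "Email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}

def find_entities(text: str) -> list[tuple[int, int, str, str]]:
    """Return (start, end, label, matched text) tuples in document order."""
    found = []
    for label, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            found.append((m.start(), m.end(), label, m.group()))
    return sorted(found)

text = "On November 29, 2011, sales rose 15% to $1,200.50; contact info@example.com."
for start, end, label, surface in find_entities(text):
    print(label, surface)
```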


NOT Named Entities
✤ Artifacts - Wall Street Journal

✤ Common nouns, referring to named entities

✤ e.g. the company, the committee

✤ Names of groups of people and things named after people

✤ e.g. the Tories, the Nobel Prize

✤ Adjectives derived from names

✤ e.g. Bulgarian, Chinese

✤ Numbers which are not times, dates, percentages or money amounts

From http://gate.ac.uk/sale/talks/ne-tutorial.ppt


Break Time!


Open Tools
✤ GATE – General Architecture for
Text Engineering, from the
University of Sheffield, with many
users and excellent documentation.

✤ GATE has customizable document
and corpus processing pipelines.
GATE is an architecture, a
framework, and a development
environment, with a clean separation
of algorithms, data, and
visualization.


GATE

✤ “The Volkswagen Beetle of language processing”

✤ “...more than a decade of collecting reusable code and building a
community has lead [to] a mature ecosystem for solving language
processing problems quickly.”

✤ Hamish Cunningham 2010


GATE – Key Features

✤ Component-based development

✤ Automatic performance measurement

✤ Clean separation between data structures and algorithms

✤ Consistent use of standard mechanisms for components to
communicate data

✤ Insulation from data formats

✤ Provision of a baseline set of language components


GATE – More...

✤ Free – open source, LGPL, Java

✤ Mature, at version 6, actively supported, 15 FTEs

✤ Comprehensive, standards-based, popular

✤ Used by thousands of companies, universities, and research
laboratories

✤ Well-known, tested, researched, and very well-documented


GATE Overview

✤ Architectural principles

✤ Non-prescriptive, theory neutral (strength and weakness)

✤ Re-use, interoperation, not reimplementation (diverse support, lots of
plugins)

✤ (Almost) everything is a component, and component sets are user-extendable

✤ Component-based development

✤ CREOLE = modified Java Beans (Collection of REusable Objects for
Language Engineering)

✤ The minimal component = 10 lines of Java, 10 lines of XML, 1 URL


GATE – Family

✤ GATE Developer – an integrated development environment for
language processing components bundled with the most widely used
Information Extraction system and a comprehensive set of plugins

✤ GATE Embedded – an object library optimized for inclusion in
diverse apps

✤ GATE Teamware – web app, a collaborative annotation environment

✤ GATE Cloud – parallel distributed processing


GATE – Embedded

From http://gate.ac.uk/g8/page/print/2/sale/talks/gate-apis.png

GATE – Teamware

✤ GATE Teamware – web app, a collaborative annotation environment
for high-volume, factory-style semantic annotation, built around workflow

✤ Running in 5 minutes with Teamware virtual server from
GATECloud.net (itself open source):

✤ Reusable project templates

✤ Project-specific roles, users

✤ Applying GATE-based processing routines

✤ Project status, annotator activity, statistics


GATE – First Cousins

✤ Ontotext KIM: UIs demonstrating the multi-paradigm approach to
information management, navigation and search

✤ Ontotext Mimir: a massively scalable multi-paradigm index built on
Ontotext’s semantic repository family, GATE’s annotation structures
database, plus full-text indexing from MG4J

✤ Ontotext FactForge: ~4B Linked Data statements, query-able


GATE – Ontotext KIM

✤ Ontotext KIM: UIs, tools, GATE Gazetteers, including a Linked Data
gazetteer (experimental)

✤ Pre-loaded knowledge base for entities

✤ Tools to upload, query, tailor the knowledge base, algorithms, UI

✤ Can crawl web, including Linked Data, creating semantic index: your
servers, theirs, or cloud

✤ Based on GATE and OWLIM



From: http://www.ontotext.com/sites/default/ﬁles/pictures/diagram.png

Structure


Patterns


Ontology


Facets


GATE – Ontotext MIMIR

✤ Ontotext Mimir: large scale indexing infrastructure supporting hybrid
search (text, annotation, meaning); massively scalable multi-paradigm
capability, combines MG4J full-text index and BigOWLIM semantic
repository; query with text, structural info, and SPARQL

✤ Integrated with GATE, customizable, scalable

✤ Open source components

✤ Can federate multiple MIMIRs

✤ Low acquisition, management cost to scale


GATE – Multi-paradigm

✤ Why “multi-paradigm?” Proliferation of retrieval technology options

✤ Full text, boolean, proximity, ranking; behavior mining, tag clouds;
concept indexing: taxonomic, ontological; annotation-based

✤ Choice depends principally on content volume + value:

✤ High volume, low (average) value: web search

✤ Medium volume, higher (personal) value: social networks, photo
sharing, tagging

✤ Low volume, high value: controlled vocabularies, taxonomies,
ontologies


GATE “Resources”

✤ Applications – groups of processes (that run on one or more
documents)

✤ Language Resources – documents or document collections (corpus,
corpora)

✤ Processing Resources – annotation tools that operate on text in
documents

✤ Applications, made up of Processing Resources, operate on Language
Resources


Plugins

✤ Applications – an application consists of any number of Processing
Resources, run sequentially over documents

✤ Plugins – a plugin is a collection of one or more Processing Resources,
bundled together.

✤ A plugin, then, must be loaded before its Processing Resources can be
used in an application.
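The idea of an application as Processing Resources run sequentially over documents can be sketched with plain functions. This is not the GATE API: the dictionary-based document, the two toy resources, and `run_application` are all assumptions made to show why the order of resources matters.

```python
import re

def tokenizer(doc: dict) -> None:
    # Adds one Token annotation per word or punctuation mark.
    doc["annotations"] += [
        {"type": "Token", "start": m.start(), "end": m.end()}
        for m in re.finditer(r"\w+|[^\w\s]", doc["text"])
    ]

def capitalized_tagger(doc: dict) -> None:
    # Reads existing Token annotations, so it must run after the tokenizer.
    token_annotations = [a for a in doc["annotations"] if a["type"] == "Token"]
    for a in token_annotations:
        if doc["text"][a["start"]].isupper():
            doc["annotations"].append(
                {"type": "Capitalized", "start": a["start"], "end": a["end"]}
            )

def run_application(resources: list, docs: list[dict]) -> list[dict]:
    """Run each resource, in order, over each document."""
    for doc in docs:
        for resource in resources:
            resource(doc)
    return docs

docs = run_application([tokenizer, capitalized_tagger],
                       [{"text": "Joe met Ann.", "annotations": []}])
print([docs[0]["text"][a["start"]:a["end"]]
       for a in docs[0]["annotations"] if a["type"] == "Capitalized"])
```

Swapping the two resources produces no Capitalized annotations at all, because the tagger finds no tokens to inspect.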


GATE – Plugins (I)


GATE – Plugins (II)


GATE


GATE Annotations

✤ Annotations are central to understanding GATE
✤ Annotations are associated with each document
✤ Each annotation has:
✤ start and end offsets
✤ an optional set of features
✤ each feature has a name and a value
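The annotation model described above (offsets into the text plus named features) can be pictured as a small data structure. The sketch below is a Python stand-in, not GATE's Java classes; the class name, the `type` field, and the sample features are assumptions for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Annotation:
    start: int      # character offset where the span begins
    end: int        # character offset just past the end of the span
    type: str       # label for the span, e.g. "Person" (assumed field)
    features: dict = field(default_factory=dict)  # optional name/value pairs

text = "Joe is CEO at IBM."
annotations = [
    Annotation(0, 3, "Person", {"rule": "FirstName"}),
    Annotation(14, 17, "Organization"),
]

# The text itself is never modified; annotations only point into it.
for a in annotations:
    print(a.type, repr(text[a.start:a.end]), a.features)
```

Because annotations sit beside the text rather than inside it, several overlapping layers can describe the same span.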

GATE Annotations


✤ NE: Named Entity recognition and typing
✤ CO: CO-reference resolution
✤ TE: Template Elements
✤ TR: Template Relations
✤ ST: Scenario Templates
✤ Example:

The shiny red rocket was fired on Tuesday. It is the brainchild of Dr. Big Head.
Dr. Head is a staff scientist at We Build Rockets Inc.

✤ NE: Entities are “rocket,” “Tuesday,” “Dr. Head” and “We Build Rockets”
✤ CO: “it” refers to the rocket; “Dr. Head” and “Dr. Big Head” are the same
✤ TE: the rocket is “shiny red” and Head’s “brainchild”
✤ TR: Dr. Head works for “We Build Rockets Inc.”
✤ ST: a rocket launching event occurred with the various participants

From http://gate.ac.uk/sale/talks/ne-tutorial.ppt

ANNIE

✤ A Nearly-New Information Extraction System, packaged with GATE,
used throughout examples, and a great place to start

✤ A collection of GATE Processing Resources to perform Information
Extraction on unstructured text

✤ “Nearly new” – the name it was given 10 years ago, and it stuck

✤ Other information extraction systems include LingPipe and
OpenNLP, independently developed NLP pipelines for which GATE includes
wrappers. All three systems are provided as pre-built applications
through the GATE File menu


ANNIE

✤ “Processing Resources” inside ANNIE:

✤ Tokenizer, sentence splitter, part-of-speech tagger, gazetteers, named
entity tagger, and an orthomatcher

✤ Also included are noun phrase and verb phrase chunkers

✤ Each “Processing Resource” inside ANNIE can be used as part of a
pipeline you create to add annotations or modify existing ones

✤ ANNIE is a highly customizable, rule-based system, with very useful
defaults


ANNIE

✤ “Processing Resources” inside ANNIE:

✤ Gazetteer – lookup annotations (lists)

✤ JAPE transducer – date, person, location, organization, money,
percent annotations

✤ Orthomatcher – adds match features to named entity annotations
(coreference matching)

✤ Document Reset – removes annotations


IE Steps in ANNIE

✤ “Tokenizer” performs Token identification and word segmentation

✤ “Sentence splitter” identifies sentences

✤ “POS” tagger performs Part-of-speech tagging – (noun, verb, adverb,
adjective)

✤ Must run Tokenizer and Sentence Splitter before POS tagger


IE Steps in ANNIE

✤ “Gazetteers” – lists of names (people, cities, groups); you can modify
or add lists

✤ Each list has features (majorType, minorType, language)

✤ Gazetteers generate “Lookup” annotations with features
corresponding to the matched list. When the text matches a gazetteer
entry, a Lookup annotation is created.

✤ Lookup annotations are used by ANNIE’s Named Entity transducer
for entity identification.
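The gazetteer step can be sketched as a list lookup that emits Lookup annotations carrying the features of the matched list. This is a plain-Python illustration rather than ANNIE's implementation; the two lists, their entries, and the function name `lookup` are assumptions.

```python
import re

# Hypothetical lists; each carries the kind of features described above.
GAZETTEER = [
    {"entries": ["Washington", "Sheffield"],
     "majorType": "location", "minorType": "city"},
    {"entries": ["IBM", "Oracle"],
     "majorType": "organization", "minorType": "company"},
]

def lookup(text: str) -> list[dict]:
    """Create one Lookup annotation for every list entry found in the text."""
    annotations = []
    for lst in GAZETTEER:
        for entry in lst["entries"]:
            for m in re.finditer(r"\b" + re.escape(entry) + r"\b", text):
                annotations.append({
                    "type": "Lookup",
                    "start": m.start(),
                    "end": m.end(),
                    "features": {"majorType": lst["majorType"],
                                 "minorType": lst["minorType"]},
                })
    return sorted(annotations, key=lambda a: a["start"])

text = "IBM opened an office in Sheffield."
for a in lookup(text):
    print(text[a["start"]:a["end"]], a["features"])
```

A later rule-based stage can then decide whether a Lookup with `majorType` "location" really is a Location in context.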


ANNIE in GATE


ANNIE Sequence

Pipeline sequence matters: tokenizer,
sentence splitter, POS tagger, gazetteer

IE Steps in ANNIE

✤ “NE Transducer” – Named Entity Transducer performs named entity
recognition (NER)

✤ Once we have built up the processing resource pipeline with the
previous steps (tokenizer, sentence splitter, POS tagger, gazetteer), we
are ready to add the transducer for named entity recognition

✤ More specific information can be added to the features now, including
the “kind” of entity, and the rules that were fired


IE Steps in ANNIE

✤ “OrthoMatcher” – orthographic co-reference matches proper names
and their variants.

✤ Will match previously unclassified names, based on relations with
classified entities

✤ Matches “Kevin Lynch” with “Dr. Lynch”

✤ Matches acronyms with expansions
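Two of the matches listed above, a name against its titled variant and an acronym against its expansion, can be approximated with simple string rules. The sketch below is an illustration only; the real OrthoMatcher applies a richer rule set, and the title list and function names here are assumptions.

```python
TITLES = {"Dr.", "Mr.", "Ms.", "Mrs.", "Prof."}

def strip_titles(name: str) -> list[str]:
    return [part for part in name.split() if part not in TITLES]

def same_person(a: str, b: str) -> bool:
    """Treat two names as coreferent if, titles removed, their last tokens agree."""
    parts_a, parts_b = strip_titles(a), strip_titles(b)
    return bool(parts_a) and bool(parts_b) and parts_a[-1] == parts_b[-1]

def acronym_of(short: str, long_form: str) -> bool:
    """True if short equals the initials of the capitalized words in long_form."""
    initials = "".join(word[0] for word in long_form.split() if word[0].isupper())
    return short == initials

print(same_person("Kevin Lynch", "Dr. Lynch"))
print(acronym_of("IBM", "International Business Machines"))
```

The last-token rule would also merge two different people who share a surname, which is why such matches are usually restricted to names within one document.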


IE Steps in ANNIE

✤ Tokenizer, sentence splitter, and OrthoMatcher are language, domain,
and application-independent

✤ Part-of-speech tagger is language-dependent and application-independent

✤ Gazetteer lists are starting points (60K entries)

✤ ANNIE is a way to get started, with a framework for identifying the
kinds of elements that matter to your work, and for quickly testing
your ideas against existing data


Annotations In Context


Rules-based Classification

✤ Once a stand-alone project, now often part of annotation services

✤ Regex, Boolean and naive Bayesian algorithms executed on tokens

✤ And, Or, Not, Near (x), Multi, Stem, Exact, Phrase, et al (vendor or
source dependent)

✤ Assigns documents to a taxonomic category

✤ Allows for greater control over depth and breadth of categories

✤ Human aided, machine processed
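A rules-based classifier of the kind described above can be sketched as Boolean rules evaluated over a document's tokens, with each rule assigning a taxonomic category. The category names, the rules, and the `near` helper below are assumptions for illustration; real products differ in operator names and syntax.

```python
import re

def tokens(text: str) -> list[str]:
    return re.findall(r"\w+", text.lower())

def near(toks: list[str], a: str, b: str, x: int) -> bool:
    """Near (x): True if a and b occur within x tokens of each other."""
    pos_a = [i for i, t in enumerate(toks) if t == a]
    pos_b = [i for i, t in enumerate(toks) if t == b]
    return any(abs(i - j) <= x for i in pos_a for j in pos_b)

# Hypothetical taxonomy: category -> rule written with And, Or, Not, Near.
RULES = {
    "Finance/Earnings": lambda t: "earnings" in t and ("quarter" in t or "annual" in t),
    "HR/Hiring": lambda t: near(t, "job", "opening", 2) and "closed" not in t,
}

def classify(text: str) -> list[str]:
    toks = tokens(text)
    return [category for category, rule in RULES.items() if rule(toks)]

print(classify("Quarter earnings beat estimates"))
print(classify("We have a new job opening in Boston"))
```

The rules are written by people and applied by the machine, which is what gives the author direct control over how broad or narrow each category is.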


Rules-based Classification


Visualization - Prefuse


Visualization - Gephi


Visualization - Cytoscape


Quick!
1. Take one large pile of text (documents, emails, tweets, patents, papers, transcripts, blogs, comments, acts of
parliament, and so on and so forth) -- call this your corpus

2. Pick a structured description of interesting things in the text (a telephone directory, or chemical taxonomy,
or something from the Linked Data cloud) -- call this your ontology

3. Use GATE Teamware to mark up a gold standard example set of annotations of the corpus (1.) relative to
the ontology (2.)

4. Use GATE Developer to build a semantic annotation pipeline to do the annotation job automatically and
measure performance against the gold standard

5. Take the pipeline from 4. and apply it to your text pile using GATE Cloud (or embed it in your own systems
using GATE Embedded)

6. Use GATE Mimir to store the annotations relative to the ontology in a multiparadigm index server. (For
techies: this sits in the backroom as a RESTful web service.)

7. Use Ontotext KIM to add semantic search, knowledge facet search, ontology browsing, entity popularity
graphing, time series graphing, annotation structure search and (last but not least) boolean full text search.
(More techy stuff: mash up these types of search with your existing UIs.)
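Measuring a pipeline against a gold standard usually comes down to precision, recall, and F1 over the annotations. The sketch below uses strict matching (same offsets and same type) on made-up annotation sets; it is an illustration of the arithmetic, not GATE's own evaluation tooling, and the function name `prf` is an assumption.

```python
def prf(gold: set, predicted: set) -> tuple[float, float, float]:
    """Precision, recall and F1 with strict matching on (start, end, type)."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Made-up annotations: (start offset, end offset, type).
gold = {(0, 3, "Person"), (14, 17, "Organization"), (22, 24, "Person")}
predicted = {(0, 3, "Person"), (14, 17, "Location")}

precision, recall, f1 = prf(gold, predicted)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Here the pipeline found the right span for the second entity but gave it the wrong type, so under strict matching it counts as both a miss and a false alarm.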


Data Warehousing /
Business Intelligence

✤ Perspective

✤ Process

✤ Use cases

✤ Implications with unstructured data


DW/BI Perspective

✤ Structured data is an incomplete version of the “truth”

✤ Until information is quantified, it is not very useful

✤ Discover facts, and give them structure

✤ Complement structured data with unstructured data; try to complete
the picture (of the business, the customer, performance)


DW/BI Process

✤ Extract, then formalize

✤ Give information structure, then associations

✤ Map to existing structures in the data warehouse
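The last step, mapping extracted information to structures that already exist in the warehouse, can be pictured as a lookup of extracted names against a dimension table. Everything in this sketch is hypothetical: the customer dimension, its keys, the alias entries, and the function name `map_to_dimension`.

```python
# Hypothetical customer dimension: normalized name or alias -> surrogate key.
CUSTOMER_DIM = {
    "ibm": 101,
    "international business machines": 101,
    "oracle": 102,
}

def map_to_dimension(entities: list[tuple[str, str]]):
    """Split (doc_id, extracted name) pairs into mapped and unmapped rows."""
    mapped, unmapped = [], []
    for doc_id, name in entities:
        key = CUSTOMER_DIM.get(name.strip().lower())
        if key is not None:
            mapped.append((doc_id, name, key))
        else:
            unmapped.append((doc_id, name))
    return mapped, unmapped

extracted = [("email-17", "IBM"), ("blog-03", "Oracle "), ("wiki-09", "Acme")]
mapped, unmapped = map_to_dimension(extracted)
print(mapped)
print(unmapped)
```

The unmapped rows are the interesting ones for a data steward: they are either new entities or spelling variants that need an alias.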


DW/BI Use Cases

✤ Report indexing (of metadata, of instances)

✤ Report sections become possible

✤ Self-service for consumers

✤ “BI Search” (of those reports)

✤ Include in portal

✤ As range of reports and users increases, unstructured data approaches
have more value


DW/BI Use Case Ideas

✤ For customers, products, complaints, locations:

✤ Voice recognition indexing

✤ RSS feeds

✤ Wikis, blogs (internal and external)

✤ Instant messages


DW/BI Implications

✤ Have to store these results

✤ Have to model these results

✤ Have to map these results to something meaningful

✤ Have to include the results in a useful way (Where? Use taxonomies?
Which ones?)

✤ Quality, cost, and complexity matter; extracted entities don’t relate
directly to performance

✤ Not a replacement, an addition to the technology


Some Technical Issues

✤ Quality

✤ Integration

✤ Concurrency

✤ Security

✤ Skills


Additional Open Tools

✤ UIMA – Unstructured Information
Management Architecture (IBM’s
Watson uses this), originated at
IBM, now an Apache project.

✤ Component software architecture
with a document processing
pipeline similar to GATE. Focus on
performance and scalability, with
distributed processing (web
services).


UIMA
UIMA’s Basic Building Blocks are Annotators. They iterate over an artifact to discover new
types based on existing ones and update the Common Analysis Structure (CAS) for
upstream processing.
Figure (chart by IBM): UIMA Common Analysis Structure (CAS), whose representation is now aligned with the XMI standard. The artifact (e.g., a document) is the sentence “Fred Center is the CEO of Center Micros”. The analysis results (i.e., artifact metadata) layered over it are parser annotations (NP, VP, PP), named entities (Person, Organization), and a CeoOf relationship with Arg1:Person and Arg2:Org.
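The annotator idea, where each annotator reads what is already in the shared analysis structure and adds new types built on existing ones, can be mimicked with a dictionary. This is not the UIMA API: the `cas` dictionary and both annotator functions are hypothetical stand-ins, and the entity spans are hard-coded in place of a real recognizer.

```python
cas = {
    "text": "Fred Center is the CEO of Center Micros",
    "annotations": [],
}

def entity_annotator(cas: dict) -> None:
    # Hard-coded entities stand in for a real named entity recognizer.
    for type_, surface in [("Person", "Fred Center"), ("Organization", "Center Micros")]:
        start = cas["text"].find(surface)
        cas["annotations"].append(
            {"type": type_, "start": start, "end": start + len(surface)}
        )

def ceo_of_annotator(cas: dict) -> None:
    # Builds a new CeoOf type from existing Person and Organization annotations.
    people = [a for a in cas["annotations"] if a["type"] == "Person"]
    orgs = [a for a in cas["annotations"] if a["type"] == "Organization"]
    for person in people:
        for org in orgs:
            if "CEO of" in cas["text"][person["end"]:org["start"]]:
                cas["annotations"].append(
                    {"type": "CeoOf", "arg1": person, "arg2": org}
                )

for annotator in (entity_annotator, ceo_of_annotator):
    annotator(cas)

print([a["type"] for a in cas["annotations"]])
```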

UIMA

Image by
IBM

Commercial Tools

✤ Oracle Data Mining (Text Mining)

✤ IBM SPSS

✤ SAS Text Miner

✤ Smartlogic

✤ Lots of acquisitions going on in the “big data” space

✤ HP acquired Autonomy

✤ Oracle acquired Endeca


A Note on Tools

✤ UIMA and GATE – comprehensive suites of capabilities, with learning
curves.

✤ Commercial tools range from unstructured capabilities inside DBMSs
like Oracle, to Business Objects business intelligence tools (Business
Objects acquired Inxight, a Xerox PARC spin-off).

✤ Your mileage will vary. The biggest differentiator is your knowledge
of your data.


Questions?


Thank you
Christine Connors
Kevin Lynch
www.triviumrlg.com


What can unstructured data look
like post-processing?

