Mining Unstructured Data: Practical Applications, from the Strata O'Reilly Making Data Work Conference March 2012

Mining Unstructured Data:
Practical Applications

Alyona Medelyan @zelandiya
Anna Divoli @annadivoli

Problem 1

New York London

How do lawyers scan, file, store & share
client’s case documents efficiently?
Images: Ambro / FreeDigitalPhotos.net

slambo_42@flickr
Anoto AB@flickr

EHR

EMR

PHR

How do doctors, patients &
researchers distribute & share
medical records efficiently?

Problem 3

The FATCA Legislation
Takes effect 1 January 2013

[Diagram labels: Foreign Financial Institution (with / without IRS agreement); annual report; U.S. account holders; U.S. ownership entities; custodian bank; with / without waiver; 30% withholding tax]

How can a financial institution find U.S. citizens
in masses of paperwork efficiently?

How much time do we actually spend on …
(average hours / week)

Searching, gathering info   17
Writing emails              14
Creating docs               13
Analyzing info              10
Reviewing docs               9
Organizing docs              7
Creating presentations       7
Editing images               6
Entering data                6
Approving docs               4
Publishing docs              4
Translating docs             1

Translates to annual costs:
Search: 17h / week = $37,000 / year

IDC: Hidden cost of information

introduction
real life problems
unstructured data & text analytics
metadata issues in legal domain
healthcare records
compliance in finance
conclusions

unstructured data: social media, news, emails, audio, images, databases, videos, literature, blogs

Linguistics, Statistics, Text Processing, Machine Learning, Natural Language Processing, Text Mining
Search, Data Extraction, Document Organization, Business Intelligence, Opinion Mining

What can one mine
from unstructured data?

keywords, tags, sentiment, genre, categories, taxonomy terms
entities: names, patterns, biochemical entities, …

Example: People, U.S. politicians; News, News about U.S. politicians

Structured & unstructured data interplay
Unique identifiers
Structured biological data
Literature references
Experts’ annotation (free text)


Legal document processing pipeline

scan → save → ocr → metadata → dms
New York, London

Images: Ambro / FreeDigitalPhotos.net, jacockshaw@flickr

Assigning metadata
(approximation)

15 docs per day
3 min per doc
0.75 h per day
240 working days per year
$200 hourly charge

$36,000 per year per lawyer

Keyword extraction
0.0027 min per doc
10 min for a year's worth of docs
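
The approximation above can be checked with a few lines of arithmetic. This is a minimal sketch that uses only the figures on the slide; the constant and function names are made up for illustration.

```python
# Figures taken from the slide; names are illustrative only.
DOCS_PER_DAY = 15
MANUAL_MIN_PER_DOC = 3
WORKING_DAYS_PER_YEAR = 240
HOURLY_CHARGE_USD = 200
AUTO_MIN_PER_DOC = 0.0027


def manual_cost_per_year():
    """Yearly cost (USD) of one lawyer assigning metadata by hand."""
    hours_per_day = DOCS_PER_DAY * MANUAL_MIN_PER_DOC / 60  # 0.75 h
    return hours_per_day * WORKING_DAYS_PER_YEAR * HOURLY_CHARGE_USD


def auto_minutes_per_year():
    """Minutes of keyword extraction needed for the same yearly volume."""
    docs_per_year = DOCS_PER_DAY * WORKING_DAYS_PER_YEAR  # 3600 docs
    return docs_per_year * AUTO_MIN_PER_DOC


if __name__ == "__main__":
    print(f"manual: ${manual_cost_per_year():,.0f} per year per lawyer")
    print(f"automatic: {auto_minutes_per_year():.2f} min per year")
```

The manual figure comes out at $36,000, and the automatic figure at 9.72 minutes, which the slide rounds to 10.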

Integrating metadata extraction with scanning
http://www.youtube.com/watch?v=kluVp25upag

Efficient (legal) document processing pipeline

keywords
tags

metadata

dms
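
As a rough illustration of the keyword step in this pipeline, the sketch below ranks words in OCR output by frequency and packages them as metadata for a DMS. It is an assumption-laden toy, not the method used in the talk or by any particular product: the stopword list, the `extract_keywords` and `build_metadata` names, and the record layout are all hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real system would use a fuller one.
STOPWORDS = {
    "the", "a", "an", "of", "and", "to", "in", "is", "for", "on",
    "by", "with", "this", "that", "be", "as", "are", "was", "at", "or",
}


def extract_keywords(text, top_n=5):
    """Return the top_n most frequent non-stopword tokens in text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS and len(t) > 2)
    return [word for word, _ in counts.most_common(top_n)]


def build_metadata(doc_id, ocr_text, top_n=5):
    """Build a hypothetical metadata record to store alongside the scan."""
    return {"id": doc_id, "keywords": extract_keywords(ocr_text, top_n)}


if __name__ == "__main__":
    sample = ("The lease agreement binds the tenant. "
              "The tenant signs the lease. Lease terms apply.")
    print(build_metadata("scan-0001", sample, top_n=2))
```

Plain frequency counting is the simplest possible baseline; it ignores phrases, synonyms, and controlled vocabularies, which is where dedicated keyword extraction tools add value.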


EMR

PHR

EHR

slambo_42@flickr Anoto AB@flickr

National Alliance for Health Information Technology (NAHIT) definitions:
EMR, EHR, PHR ?

Discontinued!

1.  Name, birth date, blood type
2.  Emergency contact(s)
3.  Primary caregiver/phone number
4.  Medicines, dosages, and how long taken
5.  Allergies/allergic reactions
6.  Date of last physical
7.  Dates/results of tests and screenings
8.  Major illnesses/surgeries and their dates
9.  Chronic diseases
10.  Family illness history
11.  …

PHI

http://www.nlm.nih.gov/medlineplus/magazine/

Medical researchers use patient records for discoveries…
Information from structured fields, but mostly from free text!
De-identification process: records with removed PHI
AMIA 2012

siliconangle.com/blog/
www.hcpro.com
www.information-age.com

“The Health Insurance Portability and Accountability Act of 1996 (HIPAA) Privacy and Security Rules”

“The Patient Safety and Quality Improvement Act of 2005 (PSQIA) Patient Safety Rule”

18 identifiers!
PHI

Names
Geographic subdivisions smaller than a State: street address, city, county, precinct, zip code…
Dates (except year): birth, admission, discharge…
Phone / Fax numbers
Email addresses
Social security #
Medical records #
Health plan beneficiary #
Accounts #
Vehicle identifiers & serial numbers, incl. license plate numbers
Device identifiers & serial numbers
URLs / IP addresses
Biometric identifiers, including finger and voice prints
Face photo images & any comparable images
Any other unique IDs etc.
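
A few of the identifier types listed above have regular surface forms, so a first pass at de-identifying free text can be sketched with pattern matching. The example below is a minimal illustration only: the patterns, the placeholder tags, and the `redact` function are assumptions made for this sketch, they cover just five U.S.-style formats, and they would miss names, addresses, dates, and anything written unconventionally.

```python
import re

# Illustrative patterns only; each covers one common U.S.-style format.
PHI_PATTERNS = [
    ("URL", re.compile(r"https?://\S+")),
    ("EMAIL", re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")),
    ("SSN", re.compile(r"(?<!\d)\d{3}-\d{2}-\d{4}(?!\d)")),
    ("PHONE", re.compile(
        r"(?<!\d)(?:\(\d{3}\)\s?|\d{3}[-. ])\d{3}[-. ]\d{4}(?!\d)")),
    ("IP", re.compile(r"(?<!\d)(?:\d{1,3}\.){3}\d{1,3}(?!\d)")),
]


def redact(text):
    """Replace each matched identifier with a bracketed placeholder tag."""
    for label, pattern in PHI_PATTERNS:
        text = pattern.sub(f"[{label}]", text)
    return text


if __name__ == "__main__":
    note = "Contact jane.doe@example.org or 555-123-4567. SSN 123-45-6789."
    print(redact(note))
```

Pattern matching like this is only a starting point. Names and geographic subdivisions in clinical free text generally need entity extraction, and a sketch of this kind should not be mistaken for a compliant de-identification process.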

slambo_42@flickr

Thanks for discussions:
Nigam Shah, Stanford
Eneida Mendonca, U Wisconsin, Madison
Irena Spasic, Cardiff University

keywords, tags
Anoto AB@flickr

The FATCA Legislation
Takes effect 1 January 2013

[Diagram labels: Foreign Financial Institution (with / without IRS agreement); annual report; U.S. account holders; U.S. ownership entities; custodian bank; with / without waiver; 30% withholding tax]

FATCA COMPLIANCE – STEP 1
Detect U.S. citizenship indicators

Recommended Solution
from FATCA Legislation:

•  “Query an electronic database using
standard queries in programming languages”

•  “Adopt similar approaches as used for the
Anti-money-laundering and Know-your-customer
requirements”

•  “Note that information, data, or files are not
electronically searchable if they are stored as
images”

walmink, thomwatson@flickr

FATCA COMPLIANCE – STEP 2
Contact client for additional info or a waiver

Actual Solution
for the FATCA Legislation:
link analysis: gather the client's data trail
ocr: convert all images to text
entity extraction: detect locations, bank numbers
analysis: auto-categorize
check: resolve inconsistencies
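
To make the entity extraction step concrete, the sketch below scans text that has already been through OCR and flags documents containing possible U.S. indicators. Everything here is an assumption for illustration: the three indicator patterns, their names, and the `flag_documents` function are hypothetical, and they are far cruder than a production entity extractor.

```python
import re

# Hypothetical indicator patterns; real indicia detection is much broader.
US_INDICATORS = {
    # Phone number with the +1 country code, e.g. "+1 512 555 0100"
    "us_phone": re.compile(r"\+1[ -]?\(?\d{3}\)?[ -]?\d{3}[ -]?\d{4}"),
    # Two-letter state code followed by a ZIP code, e.g. "TX 78701"
    "us_zip_address": re.compile(r"\b[A-Z]{2} \d{5}(?:-\d{4})?\b"),
    # Explicit mention of the country
    "us_mention": re.compile(r"\b(?:USA|United States)\b"),
}


def flag_documents(docs):
    """Map each document name to the indicators found in its text.

    docs is a dict of {name: ocr_text}. Documents without any hit are
    left out, so the result lists only clients needing follow-up.
    """
    flagged = {}
    for name, text in docs.items():
        hits = [label for label, pattern in US_INDICATORS.items()
                if pattern.search(text)]
        if hits:
            flagged[name] = hits
    return flagged


if __name__ == "__main__":
    docs = {
        "a.txt": "Mailing address: 12 Main St, Austin, TX 78701. "
                 "Tel +1 512 555 0100",
        "b.txt": "Address: 5 Queen Street, Auckland 1010, New Zealand",
    }
    print(flag_documents(docs))
```

Documents that come back flagged would then feed step 2, where the client is contacted for additional information or a waiver. Patterns this loose will produce false positives, which is why the slide pairs extraction with a check that resolves inconsistencies.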

Alyona Medelyan, PhD @zelandiya
Natural Language Processing, Text Mining, Wikipedia Mining, Machine Learning

Anna Divoli, PhD @annadivoli
Biomedical Text Mining, Search User Interfaces, Human Factors, Knowledge Discovery

Try out text analytics provided by the Pingar API!

Online demo: apidemo.pingar.com
Free Sandbox account: pingar.com/get-the-api
