Mining Unstructured Data:
Practical Applications




Alyona Medelyan @zelandiya
Anna Divoli @annadivoli
Problem 1




          New York                               London

How do lawyers scan, file, store & share
clients’ case documents efficiently?
                                           Images: Ambro / FreeDigitalPhotos.net
slambo_42@flickr
Anoto AB@flickr
                     	
  
EHR
EMR
PHR

How do doctors, patients &
researchers distribute & share
medical records efficiently?
Problem 3
The FATCA Legislation
Takes effect 1 January 2013

[Diagram labels: annual report; 30% withholding tax; waiver; Foreign Financial Institution with IRS agreement; U.S. account holders, U.S. ownership entities (with waiver / without waiver); Custodian bank without IRS agreement; 30% withholding tax]

How can a financial institution find U.S. citizens
in masses of paperwork efficiently?
How much time do we actually spend on …
(average hours / week; IDC: Hidden cost of information)

Searching, gathering info    17
Writing emails               14
Creating docs                13
Analyzing info               10
Reviewing docs                9
Organizing docs               7
Creating presentations        7
Editing images                6
Entering data                 6
Approving docs                4
Publishing docs               4
Translating docs              1

Translates to annual costs:
Search: 17h / week = $37,000 / year
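The annual search cost quoted above implies an hourly rate. A rough check in Python, assuming 52 paid weeks per year (an assumption, the slide does not state it); the function name is illustrative only:

```python
def implied_hourly_rate(hours_per_week, annual_cost, weeks_per_year=52):
    """Hourly cost implied by a weekly time budget and its annual cost.

    weeks_per_year=52 is an assumption; the slide does not give it.
    """
    return annual_cost / (hours_per_week * weeks_per_year)


# Figures from the slide: 17 h / week of searching costs $37,000 / year.
rate = implied_hourly_rate(17, 37_000)
print(f"Implied cost of search time: ${rate:.2f} per hour")
```

With a different number of working weeks the implied rate changes accordingly, so treat the result as a ballpark figure.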
introduction
unstructured data real life problems
unstructured data & text analytics
metadata in legal domain
healthcare records issues
compliance in finance
conclusions
unstructured data

Social Media, News, Emails, Audio, Images, Databases, Videos, Literature, Blogs



Linguistics                               Search
   Statistics                        Data Extraction
 Text Processing                   Document Organization
Machine Learning                  Business Intelligence
Natural Language Processing        Opinion Mining
     Text Mining
What can one mine
from unstructured data?

   keywords
   tags
   categories
   taxonomy terms
   sentiment
   genre
   entities: names, patterns, biochemical entities, …
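As one small illustration of mining keywords from free text, here is a minimal frequency-based sketch in Python. It is a deliberately simple stand-in, not the method used in the talk or by any particular product; the stopword list, the length filter and the function name are assumptions made for the example.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = frozenset({
    "the", "and", "for", "that", "with", "this", "from", "are", "was",
    "has", "have", "its", "but", "not", "you", "can", "all",
})


def extract_keywords(text, top_n=5):
    """Return the top_n most frequent non-stopword tokens in text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(
        token for token in tokens
        if len(token) > 2 and token not in STOPWORDS
    )
    return [word for word, _ in counts.most_common(top_n)]


sample = (
    "The lawyer filed the contract. The contract names the client "
    "and the lawyer signs the contract."
)
print(extract_keywords(sample, top_n=2))
```

Raw counts favour long documents and common words; weighting schemes or trained models would usually do better, but the input and output have the same shape: text in, a short list of terms out.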
  
[Figure labels: News; People; U.S. politicians; News about U.S. politicians]




Structured & unstructured data interplay
Structured biological data
Unique identifiers
Literature references
Experts’ annotation (free text)
  
Legal document processing pipeline




scan
save
ocr
metadata
dms

New York                               London
  
                                  Images: Ambro / FreeDigitalPhotos.net
jacockshaw@flickr

                    Assigning metadata
                         (approximation)

                          15 docs per day
                           3 min per doc
                           0.75 h per day
                      240 working days per year
                         $200 hourly charge

                     $36,000 per year per lawyer




                    Keyword extraction
                         0.0027 min per doc
                    10 min for a year's worth of docs
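The approximation above can be reproduced in a few lines of Python. All input numbers come from the slide; the function names are illustrative only.

```python
def annual_manual_cost(docs_per_day, minutes_per_doc, working_days, hourly_rate):
    """Yearly cost of assigning metadata by hand, per lawyer."""
    hours_per_day = docs_per_day * minutes_per_doc / 60
    return hours_per_day * working_days * hourly_rate


def annual_extraction_minutes(docs_per_day, minutes_per_doc, working_days):
    """Total minutes of automatic keyword extraction for a year of documents."""
    return docs_per_day * working_days * minutes_per_doc


# Slide figures: 15 docs/day, 3 min/doc, 240 working days, $200/hour.
print(annual_manual_cost(15, 3, 240, 200))

# Slide figure: 0.0027 min/doc for keyword extraction.
print(annual_extraction_minutes(15, 0.0027, 240))
```

The second figure works out to just under 10 minutes for the 3,600 documents a lawyer handles in a year, which the slide rounds to 10 min.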
Integrating metadata extraction with scanning
http://www.youtube.com/watch?v=kluVp25upag
  
Efficient (legal) document processing pipeline




keywords, tags
metadata
dms
  
EMR
PHR
EHR
  
	
  
 slambo_42@flickr   Anoto AB@flickr
National Alliance for Health Information Technology (NAHIT) definitions
Discontinued!

EMR    EHR    PHR    ?

       1.  Name, birth date, blood type
       2.  Emergency contact(s)
       3.  Primary caregiver/phone number
       4.  Medicines, dosages, and how long taken
       5.  Allergies/allergic reactions
       6.  Date of last physical
       7.  Dates/results of tests and screenings
       8.  Major illnesses/surgeries and their dates
       9.  Chronic diseases
       10. Family illness history
       11. …

       http://www.nlm.nih.gov/medlineplus/magazine/

PHI
de-identification process
  
Medical researchers use patient records for discoveries…

… records with removed PHI:
information from structured fields
but mostly from free text!
  




AMIA 2012

siliconangle.com/blog/
www.hcpro.com
www.information-age.com

“The Health Insurance Portability and Accountability Act of
1996 (HIPAA) Privacy and Security Rules”

“The Patient Safety and Quality Improvement Act of 2005
(PSQIA) Patient Safety Rule”
  
                          	
  
18 identifiers!
PHI

          •  Names
          •  Geographic subdivisions smaller than a State: street address,
             city, county, precinct, zip code…
          •  Dates (except year): birth, admission, discharge…
          •  Phone / Fax numbers
          •  Email addresses
          •  Social security #
          •  Medical records #
          •  Health plan beneficiary #
          •  Accounts #
          •  Vehicle identifiers & serial numbers, incl. license plate numbers
          •  Device identifiers & serial numbers
          •  URLs / IP addresses
          •  Biometric identifiers, including finger and voice prints
          •  Face photo images & any comparable images
          •  Any other unique IDs etc.
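A few of the identifier types listed above have a regular surface form and can be masked with patterns. The following Python sketch is an illustration under stated assumptions, not a compliant de-identification tool: the patterns assume U.S.-style phone and social security number formats, and the names `PHI_PATTERNS` and `redact` are hypothetical. Names, addresses and most dates need far more than regular expressions.

```python
import re

# Illustrative patterns only (assumed formats, far from exhaustive).
PHI_PATTERNS = [
    ("EMAIL", re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")),
    ("URL", re.compile(r"https?://\S+")),
    ("SSN", re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
    ("PHONE", re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b")),
    ("IP", re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")),
]


def redact(text):
    """Replace each pattern match with a bracketed placeholder label."""
    for label, pattern in PHI_PATTERNS:
        text = pattern.sub(f"[{label}]", text)
    return text


note = "Call 555-123-4567 or mail jane.doe@example.org; SSN 123-45-6789."
print(redact(note))
```

Pattern matching of this kind misses anything written in an unexpected format, so in practice it would be only one layer alongside entity extraction and manual review.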
  
slambo_42@flickr

Thanks for discussions:
   Nigam Shah, Stanford
   Eneida Mendonca, UWisconsin, Madison
   Irena Spasic, Cardiff University
  

[Figure: text → keywords, tags]
Anoto AB@flickr
FATCA COMPLIANCE – STEP 1
Detect U.S. citizenship indicators
Recommended Solution
from FATCA Legislation:




          •  “Query an electronic database using
             standard queries in programming languages”

          •  “Adopt similar approaches as used for the
             Anti-money-laundering and Know-your-customer
             requirements”

          •  “Note that information, data, or files are not
             electronically searchable if they are stored as
             images”
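The first recommendation, querying an electronic database, might look like the Python sketch below. Everything specific in it is an assumption made for illustration: the `accounts` table, its column names, the two-letter country codes and the `+1` phone prefix test are hypothetical and do not come from the legislation or the slides.

```python
import sqlite3


def demo_connection():
    """Build an in-memory database with a hypothetical accounts table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE accounts ("
        "id INTEGER PRIMARY KEY, name TEXT, citizenship TEXT, "
        "birth_country TEXT, mailing_country TEXT, phone TEXT)"
    )
    conn.executemany(
        "INSERT INTO accounts VALUES (?, ?, ?, ?, ?, ?)",
        [
            (1, "A. Example", "NZ", "NZ", "NZ", "+64 9 555 0100"),
            (2, "B. Example", "GB", "US", "GB", "+44 20 5550 0100"),
            (3, "C. Example", "DE", "DE", "US", "+1 212 555 0100"),
        ],
    )
    return conn


def find_us_indicia(conn):
    """Return (id, name) of accounts showing any assumed U.S. indicator."""
    query = """
        SELECT id, name FROM accounts
        WHERE citizenship = 'US'
           OR birth_country = 'US'
           OR mailing_country = 'US'
           OR phone LIKE '+1%'   -- +1 also covers Canada: review needed
        ORDER BY id
    """
    return conn.execute(query).fetchall()


print(find_us_indicia(demo_connection()))
```

A query like this only reaches structured fields, which is why the third quoted point about image-only files matters.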
walmink, thomwatson@flickr
  




                                  FATCA COMPLIANCE – STEP 2
                                  Contact client for additional info or a waiver
Actual Solution
for the FATCA Legislation:

link analysis        gather the trail of the client’s data
ocr                  convert all images to text
entity extraction    detect locations, bank numbers
analysis             auto-categorize
check                resolve inconsistencies
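The detection step can be sketched in Python once documents have been converted to text. This is a simplified stand-in: plain regular expressions replace real entity extraction, the two patterns (a U.S. state abbreviation followed by a ZIP code, and a `+1` phone number) are assumptions chosen for illustration, and the function names are hypothetical.

```python
import re

# Assumed surface patterns for two possible U.S. indicators.
US_ADDRESS = re.compile(r"\b[A-Z]{2} \d{5}(?:-\d{4})?\b")
US_PHONE = re.compile(r"\+1[ -]?\d{3}[ -]?\d{3}[ -]?\d{4}\b")


def us_indicators(ocr_text):
    """Return (label, matched text) pairs found in one document's text."""
    found = []
    for label, pattern in (("address", US_ADDRESS), ("phone", US_PHONE)):
        found.extend((label, match.group()) for match in pattern.finditer(ocr_text))
    return found


def flag_documents(docs):
    """Map document id to its indicators, keeping only documents with hits."""
    flagged = {}
    for doc_id, text in docs.items():
        hits = us_indicators(text)
        if hits:
            flagged[doc_id] = hits
    return flagged


docs = {
    "a.txt": "Mail to 12 Main St, New York, NY 10001. Tel +1 212 555 0100",
    "b.txt": "Queen Street, Auckland 1010",
}
print(flag_documents(docs))
```

Flagged documents would then go to the categorization and consistency checks; false positives are expected, which is what the final check step is for.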
  
Efficient FATCA Compliance
Alyona Medelyan, PhD                Anna Divoli, PhD
       @zelandiya                          @annadivoli
       Natural Language Processing         Biomedical Text Mining
       Text Mining                         Search User Interfaces
       Wikipedia Mining                    Human Factors
       Machine Learning                    Knowledge Discovery




Try out text analytics provided by the Pingar API!

             Online demo: apidemo.pingar.com
     Free Sandbox account: pingar.com/get-the-api

Mining Unstructured Data: Practical Applications, from the Strata O'Reilly Making Data Work Conference, March 2012

Elizabeth Buie - Older adults: Are we really designing for our future selves?Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?
Nexer Digital
 

Recently uploaded (20)

Mind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AIMind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AI
 
By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
 
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
 
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
 
20 Comprehensive Checklist of Designing and Developing a Website
20 Comprehensive Checklist of Designing and Developing a Website20 Comprehensive Checklist of Designing and Developing a Website
20 Comprehensive Checklist of Designing and Developing a Website
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
RESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for studentsRESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for students
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
 
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
 
Building RAG with self-deployed Milvus vector database and Snowpark Container...
Building RAG with self-deployed Milvus vector database and Snowpark Container...Building RAG with self-deployed Milvus vector database and Snowpark Container...
Building RAG with self-deployed Milvus vector database and Snowpark Container...
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
 
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
 
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex ProofszkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
 
Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?
 

Mining Unstructured Data: Practical Applications, from the Strata O'Reilly Making Data Work Conference March 2012

  • 1. Mining Unstructured Data: Practical Applications Alyona Medelyan @zelandiya Anna Divoli @annadivoli
  • 2. Problem 1 New York London How do lawyers scan, file, store & share client’s case documents efficiently? Images: Ambro / FreeDigitalPhotos.net
  • 3. EHR, EMR, PHR. How do doctors, patients & researchers distribute & share medical records efficiently? Images: slambo_42@flickr, Anoto AB@flickr
  • 4. The FATCA Legislation. Problem 3. Takes effect 1 January 2013. Diagram labels: annual report; 30% withholding tax; waiver; Foreign Financial Institution with IRS agreement; U.S. account holders; U.S. ownership entities; with waiver; without waiver; Custodian bank without IRS agreement; 30% withholding tax. How can a financial institution find U.S. citizens in masses of paperwork efficiently?
  • 5. How much time do we actually spend on … (average hours / week): Searching, gathering info 17; Writing emails 14; Creating docs 13; Analyzing info 10; Reviewing docs 9; Organizing docs 7; Creating presentations 7; Editing images 6; Entering data 6; Approving docs 4; Publishing docs 4; Translating docs 1. Translates to annual costs: Search: 17h / week = $37,000 / year. IDC: Hidden cost of information
  • 6. Outline: introduction (unstructured data & text analytics); real life problems (metadata issues in legal domain; unstructured data in healthcare records; compliance in finance); conclusions
  • 7. Social Media, News, Emails, Audio, Images, Databases, Videos, Literature, Blogs
  • 8. unstructured data: Linguistics, Search, Statistics, Data Extraction, Text Processing, Document Organization, Machine Learning, Business Intelligence, Natural Language Processing, Opinion Mining, Text Mining
  • 9. What can one mine from unstructured data? keywords; tags; sentiment; genre; categories; taxonomy terms; entities: names, patterns, biochemical entities, …
  • 10. Social Media, News, Emails, Audio, Images, Databases, Videos, Literature, Blogs
  • 11. Structured & unstructured data interplay. People; U.S. politicians; News; News about U.S. politicians. Structured biological data; unique identifiers; literature references; experts’ annotation (free text)
  • 12. Outline: introduction (unstructured data & text analytics); real life problems (metadata issues in legal domain; unstructured data in healthcare records; compliance in finance); conclusions
  • 13. Legal document processing pipeline: scan → save → ocr → metadata → dms. New York, London. Images: Ambro / FreeDigitalPhotos.net
  • 14. Assigning metadata (approximation): 15 docs per day; 3 min per doc; 0.75 h per day; 240 working days per year; $200 hourly charge; $36,000 per year per lawyer. Keyword extraction: 0.0027 min per doc; 10 min for a year's worth of docs. Image: jacockshaw@flickr
  • 15. Integrating metadata extraction with scanning: http://www.youtube.com/watch?v=kluVp25upag
  • 16. Efficient (legal) document processing pipeline: keywords, tags → metadata → dms
  • 17. Outline: introduction (unstructured data & text analytics); real life problems (metadata issues in legal domain; unstructured data in healthcare records; compliance in finance); conclusions
  • 18. EMR, PHR, EHR. Images: slambo_42@flickr, Anoto AB@flickr
  • 19. National Alliance for Health Information Technology (NAHIT) definitions: EMR, EHR, PHR? Discontinued! 1. Name, birth date, blood type; 2. Emergency contact(s); 3. Primary caregiver/phone number; 4. Medicines, dosages, and how long taken; 5. Allergies/allergic reactions; 6. Date of last physical; 7. Dates/results of tests and screenings; 8. Major illnesses/surgeries and their dates; 9. Chronic diseases; 10. Family illness history; 11. … http://www.nlm.nih.gov/medlineplus/magazine/ PHI: de-identification process
  • 20. Medical researchers use patient records for discoveries… records with removed PHI: information from structured fields, but mostly from free text! AMIA 2012
  • 21. siliconangle.com/blog/; www.hcpro.com; www.information-age.com. “The Health Insurance Portability and Accountability Act of 1996 (HIPAA) Privacy and Security Rules”; “The Patient Safety and Quality Improvement Act of 2005 (PSQIA) Patient Safety Rule”
  • 22. PHI: 18 identifiers! Names; Geographic subdivisions smaller than a State: street address, city, county, precinct, zip code…; Dates (except year): birth, admission, discharge…; Phone / Fax numbers; Email addresses; Social security #; Medical records #; Health plan beneficiary #; Accounts #; Vehicle identifiers & serial numbers, incl. license plate numbers; Device identifiers & serial numbers; URLs / IP addresses; Biometric identifiers, including finger and voice prints; Face photo images & any comparable images; Any other unique IDs etc.
  • 23. Thanks for discussions: Nigam Shah, Stanford; Eneida Mendonca, UWisconsin, Madison; Irena Spasic, Cardiff University. keywords, tags. Images: slambo_42@flickr, Anoto AB@flickr
  • 24. Outline: introduction (unstructured data & text analytics); real life problems (metadata issues in legal domain; unstructured data in healthcare records; compliance in finance); conclusions
  • 25. The FATCA Legislation. Takes effect 1 January 2013. Diagram labels: annual report; 30% withholding tax; waiver; Foreign Financial Institution with IRS agreement; U.S. account holders; U.S. ownership entities; with waiver; without waiver; Custodian bank without IRS agreement; 30% withholding tax
  • 26. FATCA COMPLIANCE – STEP 1 Detect U.S. citizenship indicators
  • 27. Recommended Solution from FATCA Legislation: •  “Query an electronic database using standard queries in programming languages” •  “Adopt similar approaches as used for the Anti-money-laundering and Know-your-customer requirements” •  “Note that information, data, or files are not electronically searchable if they are stored as images”
  • 28. FATCA COMPLIANCE – STEP 2 Contact client for additional info or a waiver. Images: walmink, thomwatson@flickr
  • 29. Actual Solution for the FATCA Legislation: link analysis: gather the trail of the client’s data; ocr: convert all images to text; entity extraction: detect locations, bank numbers; analysis: auto-categorize; check: resolve inconsistencies
  • 31. Outline: introduction (unstructured data & text analytics); real life problems (metadata issues in legal domain; unstructured data in healthcare records; compliance in finance); conclusions
  • 32. Alyona Medelyan, PhD (@zelandiya): Natural Language Processing, Text Mining, Wikipedia Mining, Machine Learning. Anna Divoli, PhD (@annadivoli): Biomedical Text Mining, Search User Interfaces, Human Factors, Knowledge Discovery. Try out text analytics provided by the Pingar API! Online demo: apidemo.pingar.com Free Sandbox account: pingar.com/get-the-api
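
The metadata-assignment estimate on slide 14 can be checked with a few lines of arithmetic. The sketch below uses only the figures from that slide; the constant and function names are illustrative and not part of the original talk.

```python
# Figures as stated on slide 14; names are illustrative assumptions.
DOCS_PER_DAY = 15
MANUAL_MIN_PER_DOC = 3
WORKING_DAYS_PER_YEAR = 240
HOURLY_CHARGE_USD = 200
AUTO_MIN_PER_DOC = 0.0027


def manual_cost_per_year():
    """Yearly cost (USD) of one lawyer assigning metadata by hand."""
    hours_per_day = DOCS_PER_DAY * MANUAL_MIN_PER_DOC / 60  # 0.75 h per day
    return hours_per_day * WORKING_DAYS_PER_YEAR * HOURLY_CHARGE_USD


def auto_minutes_per_year():
    """Minutes of automated keyword extraction for one year's documents."""
    docs_per_year = DOCS_PER_DAY * WORKING_DAYS_PER_YEAR  # 3,600 docs
    return docs_per_year * AUTO_MIN_PER_DOC


if __name__ == "__main__":
    print(f"manual: ${manual_cost_per_year():,.0f} per year per lawyer")
    print(f"automated: {auto_minutes_per_year():.2f} min per year")
```

The manual figure reproduces the slide's $36,000; the automated figure comes to just under 10 minutes, which the slide rounds to 10.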

Editor's Notes

  1. To summarize: In this talk we gave a brief overview of what text analytics is and how powerful it is when dealing with unstructured data. We presented 3 real world examples where text analytics eliminates boring, error-prone manual labor. In the legal domain, keyword and taxonomy term extraction facilitates automated metadata assignment. Healthcare benefits from automated entity extraction for de-identification (sanitization) and mining useful associations. In the area of compliance & forensics, text analytics helps with scanning massive amounts of data. No matter how much further our technology develops, we will always continue to communicate in human language. The amount of unstructured data will only increase. Already there are areas where manual analytics is not sustainable. And there will be even more need for efficient text analytics in the future.
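
As a rough illustration of the de-identification step mentioned in the notes, the sketch below masks a handful of the identifier types listed on slide 22 (email addresses, URLs, IP addresses, social security numbers, phone numbers) with regular expressions. The patterns assume simple U.S.-style formats and are hypothetical; this is not the method used in the talk, and real de-identification of free text needs much broader coverage (names, dates, addresses and so on), typically with entity extraction rather than regexes alone.

```python
import re

# Simplified, assumed patterns for a few identifier types from slide 22.
# Order matters: emails are masked before URLs and phone numbers.
PATTERNS = [
    ("EMAIL", re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")),
    ("URL", re.compile(r"\bhttps?://\S+")),
    ("IP", re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")),
    ("SSN", re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
    ("PHONE", re.compile(r"\(?\b\d{3}\)?[ .-]\d{3}[ .-]\d{4}\b")),
]


def scrub(text):
    """Replace each matched identifier with a bracketed type label."""
    for label, pattern in PATTERNS:
        text = pattern.sub(f"[{label}]", text)
    return text


if __name__ == "__main__":
    note = ("Contact jane.doe@example.org or 555-123-4567; "
            "SSN 123-45-6789, host 10.0.0.1.")
    print(scrub(note))
```

Replacing each match with a type label, rather than deleting it, keeps the sentence readable for later text mining while removing the identifying value.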