Mining Unstructured Data: Practical Applications, from the Strata O'Reilly Making Data Work Conference, March 2012

Alyona Medelyan (Pingar), Anna Divoli (Pingar)

presented at Strata O'Reilly Making Data Work Conference on March 1, 2012

The challenge of unstructured data is a top priority for organizations looking for ways to search, sort, analyze and extract knowledge from the masses of documents they store and create daily. Text mining uses knowledge-driven algorithms to make sense of documents much as a person would by reading them. Recently, text mining and analytics tools have become available via APIs, meaning that organizations can take immediate advantage of these tools. We discuss three examples of how such APIs were used to solve key business challenges.

Most organizations dream of a paperless office, but still generate and receive millions of printed documents. Digitizing these documents and intelligently sharing them is a universal enterprise challenge. Major scanning providers offer solutions that analyze scanned and OCR’d documents and then store detected information in document management systems. This works well with pre-defined forms, but human interaction is required when scanning unstructured text. We describe a prototype built for the legal vertical that scans stacks of paper documents and, on the fly, categorizes them and generates meaningful metadata.
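As a rough illustration of what generating metadata from OCR'd text can look like, here is a minimal sketch that ranks words by frequency after dropping common stopwords. It is not the method used in the prototype described above; the function name `extract_keywords`, the stopword list, and the sample text are all assumptions made for this example.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list; a real system would use a fuller one.
STOPWORDS = {"the", "and", "for", "with", "that", "this", "are", "was", "from"}


def extract_keywords(text: str, top_n: int = 5) -> list[str]:
    """Return the top_n most frequent non-stopword terms in the text."""
    words = re.findall(r"[a-z]+", text.lower())
    # Ignore stopwords and very short tokens, which carry little meaning as metadata.
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)
    return [word for word, _ in counts.most_common(top_n)]


sample = (
    "The lease agreement binds the tenant. "
    "The tenant signs the lease; the lease ends in May."
)
print(extract_keywords(sample, top_n=2))
```

Frequency counting is only a stand-in here: production keyword and taxonomy term extraction typically relies on richer linguistic and statistical features.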

In forensics, intelligence and security, manual monitoring of masses of unstructured data is not feasible. The ability to automatically identify people’s names, addresses, credit card and bank account numbers and other entities is key. We will briefly describe a case study of how a major international financial institution is taking advantage of text mining APIs in order to comply with recent legislation.
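To make the idea of automatic entity detection concrete, the sketch below finds candidate credit card numbers with a regular expression and keeps only those that pass the Luhn checksum. This is an assumption-laden illustration, not the approach used in the case study: the pattern `CARD_CANDIDATE` and the function names are hypothetical, and names or addresses would need far more than regular expressions.

```python
import re

# Hypothetical pattern: 13 to 16 digits, optionally separated by single spaces or hyphens.
CARD_CANDIDATE = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")


def luhn_valid(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_card_numbers(text: str) -> list[str]:
    """Return candidate card numbers (digits only) that pass the Luhn check."""
    hits = []
    for match in CARD_CANDIDATE.finditer(text):
        digits = re.sub(r"[ -]", "", match.group())
        if luhn_valid(digits):
            hits.append(digits)
    return hits


# 4111 1111 1111 1111 is a widely used test number that satisfies the checksum.
print(find_card_numbers("Card 4111 1111 1111 1111 was used; ref 1234 5678 9012 3456."))
```

The checksum step filters out digit runs such as reference numbers that merely look like card numbers, which reduces false positives compared with pattern matching alone.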

In healthcare, although Electronic Health Records (EHRs) have become increasingly available over the past two decades, patient confidentiality and privacy concerns have been obstacles to using the incredibly valuable information they contain to further medical research. Several approaches to assigning unique encrypted identifiers to patients’ IDs have been reported, but each comes with drawbacks. For a number of medical studies, consistent uniform ID mapping is not necessary and automated text sanitization can serve as a solution. We will demonstrate how sanitization has practical use in a medical study.
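A minimal sketch of text sanitization, assuming a handful of simple patterns: each matched identifier is replaced by a placeholder label rather than by a consistent encrypted ID. The `PHI_PATTERNS` table and the `sanitize` function are hypothetical names for this example, and the patterns cover only a few easy identifier types in U.S.-style formats; names and addresses require real entity extraction.

```python
import re

# Illustrative patterns only; real de-identification must cover many more identifier types.
PHI_PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),
}


def sanitize(text: str) -> str:
    """Replace every match of each pattern with a bracketed placeholder label."""
    for label, pattern in PHI_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


note = "Seen on 3/14/2011, SSN 123-45-6789, call 555-123-4567 or jane@example.org."
print(sanitize(note))
```

Because every identifier of a given type collapses to the same placeholder, records cannot be linked across documents, which is acceptable only for studies that do not need consistent ID mapping.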


And read a full interview with Alyona and Anna at http://radar.oreilly.com/2012/02/unstructured-data-analysis-tools.html

  • To summarize: In this talk we gave a brief overview of what text analytics is and how powerful it is when dealing with unstructured data. We presented 3 real-world examples where text analytics eliminates boring, error-prone manual labor. In the legal domain, keyword and taxonomy term extraction facilitates automated metadata assignment. Healthcare benefits from automated entity extraction for de-identification (sanitization) and mining useful associations. In the area of compliance & forensics, text analytics helps scan massive amounts of data. No matter how much further our technology develops, we will always continue to communicate in human language. The amount of unstructured data will only increase. Already there are areas where manual analytics is not sustainable. And there will be even more need for efficient text analytics in the future.

Transcript

  • 1. Mining Unstructured Data: Practical Applications. Alyona Medelyan @zelandiya, Anna Divoli @annadivoli
  • 2. Problem 1 (New York, London): How do lawyers scan, file, store & share clients’ case documents efficiently? Images: Ambro / FreeDigitalPhotos.net
  • 3. EHR, EMR, PHR: How do doctors, patients & researchers distribute & share medical records efficiently? Images: slambo_42@flickr, Anoto AB@flickr
  • 4. Problem 3: The FATCA Legislation. Takes effect 1 January 2013. Diagram labels: Foreign Financial Institution with IRS agreement; annual report; U.S. account holders; U.S. ownership entities; with waiver; without waiver; 30% withholding tax; Custodian bank without IRS agreement; 30% withholding tax. How can a financial institution find U.S. citizens in masses of paperwork efficiently?
  • 5. How much time do we actually spend on … (average hours / week): Searching, gathering info 17; Writing emails 14; Creating docs 13; Analyzing info 10; Reviewing docs 9; Organizing docs 7; Creating presentations 7; Editing images 6; Entering data 6; Approving docs 4; Publishing docs 4; Translating docs 1. Translates to annual costs: Search: 17h / week = $37,000 / year. Source: IDC: Hidden cost of information
  • 6. introduction; unstructured data real life problems; unstructured data & text analytics; metadata issues in legal domain; healthcare records; compliance in finance; conclusions
  • 7. Social Media, News, Emails, Audio, Images, Databases, Videos, Literature, Blogs
  • 8. unstructured data: Linguistics, Search, Statistics, Data Extraction, Text Processing, Document Organization, Machine Learning, Business Intelligence, Natural Language Processing, Opinion Mining, Text Mining
  • 9. What can one mine from unstructured data? keywords, tags, sentiment, genre, categories, taxonomy terms, entities (names, patterns, biochemical entities, …)
  • 10. Social Media, News, Emails, Audio, Images, Databases, Videos, Literature, Blogs
  • 11. Structured & unstructured data interplay: People; U.S. politicians; News; News about U.S. politicians. Structured biological data; unique identifiers; literature references; experts’ annotation (free text)
  • 12. introduction; unstructured data real life problems; unstructured data & text analytics; metadata issues in legal domain; healthcare records; compliance in finance; conclusions
  • 13. Legal document processing pipeline (New York, London): scan, save, ocr, metadata, dms. Images: Ambro / FreeDigitalPhotos.net
  • 14. Assigning metadata (approximation): 15 docs per day; 3 min per doc; 0.75 h per day; 240 working days per year; $200 hourly charge; $36,000 per year per lawyer. Keyword extraction: 0.0027 min per doc; 10 min for a year's worth of docs. Image: jacockshaw@flickr
  • 15. Integrating metadata extraction with scanning: http://www.youtube.com/watch?v=kluVp25upag
  • 16. Efficient (legal) document processing pipeline: keywords, tags, metadata, dms
  • 17. introduction; unstructured data real life problems; unstructured data & text analytics; metadata issues in legal domain; healthcare records; compliance in finance; conclusions
  • 18. EMR, PHR, EHR. Images: slambo_42@flickr, Anoto AB@flickr
  • 19. National Alliance for Health Information Technology (NAHIT) definitions: EMR, EHR, PHR? Discontinued! 1. Name, birth date, blood type 2. Emergency contact(s) 3. Primary caregiver/phone number 4. Medicines, dosages, and how long taken 5. Allergies/allergic reactions 6. Date of last physical 7. Dates/results of tests and screenings 8. Major illnesses/surgeries and their dates 9. Chronic diseases 10. Family illness history 11. … (http://www.nlm.nih.gov/medlineplus/magazine/). PHI: de-identification process
  • 20. Medical researchers use patient records for discoveries… records with removed PHI: information from structured fields… but mostly from free text! AMIA 2012
  • 21. “The Health Insurance Portability and Accountability Act of 1996 (HIPAA) Privacy and Security Rules”; “The Patient Safety and Quality Improvement Act of 2005 (PSQIA) Patient Safety Rule”. Sources: siliconangle.com/blog/, www.hcpro.com, www.information-age.com
  • 22. PHI: 18 identifiers! Names; Geographic subdivisions smaller than a State: street address, city, county, precinct, zip code…; Dates (except year): birth, admission, discharge…; Phone / Fax numbers; Email addresses; Social security #; Medical records #; Health plan beneficiary #; Accounts #; Vehicle identifiers & serial numbers, incl. license plate numbers; Device identifiers & serial numbers; URLs / IP addresses; Biometric identifiers, including finger and voice prints; Face photo images & any comparable images; Any other unique IDs etc.
  • 23. Thanks for discussions: Nigam Shah, Stanford; Eneida Mendonca, U Wisconsin, Madison; Irena Spasic, Cardiff University. keywords, tags. Images: slambo_42@flickr, Anoto AB@flickr
  • 24. introduction; unstructured data real life problems; unstructured data & text analytics; metadata issues in legal domain; healthcare records; compliance in finance; conclusions
  • 25. The FATCA Legislation. Takes effect 1 January 2013. Diagram labels: Foreign Financial Institution with IRS agreement; annual report; U.S. account holders; U.S. ownership entities; with waiver; without waiver; 30% withholding tax; Custodian bank without IRS agreement; 30% withholding tax
  • 26. FATCA COMPLIANCE – STEP 1: Detect U.S. citizenship indicators
  • 27. Recommended Solution from FATCA Legislation: “Query an electronic database using standard queries in programming languages”; “Adopt similar approaches as used for the Anti-money-laundering and Know-your-customer requirements”; “Note that information, data, or files are not electronically searchable if they are stored as images”
  • 28. FATCA COMPLIANCE – STEP 2: Contact client for additional info or a waiver. Images: walmink, thomwatson@flickr
  • 29. Actual Solution for the FATCA Legislation: link analysis (gather the trail of client’s data); ocr (convert all images to text); entity extraction (detect locations, bank numbers); analysis (auto-categorize); check (resolve inconsistencies)
  • 30. Efficient FATCA Compliance
  • 31. introduction; unstructured data real life problems; unstructured data & text analytics; metadata issues in legal domain; healthcare records; compliance in finance; conclusions
  • 32. Alyona Medelyan, PhD (@zelandiya): Natural Language Processing, Text Mining, Wikipedia Mining, Machine Learning. Anna Divoli, PhD (@annadivoli): Biomedical Text Mining, Search User Interfaces, Human Factors, Knowledge Discovery. Try out text analytics provided by the Pingar API! Online demo: apidemo.pingar.com. Free Sandbox account: pingar.com/get-the-api
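
The approximation on slide 14 can be reproduced with a few lines of arithmetic. The sketch below uses only the figures stated on that slide; the variable names are ours, and the inputs are the slide's own rough estimates rather than measured values.

```python
# Figures taken from slide 14 (approximation).
DOCS_PER_DAY = 15
MANUAL_MIN_PER_DOC = 3
WORKING_DAYS_PER_YEAR = 240
HOURLY_CHARGE_USD = 200
AUTO_MIN_PER_DOC = 0.0027

# Manual metadata assignment: time per day, then cost per year per lawyer.
manual_hours_per_day = DOCS_PER_DAY * MANUAL_MIN_PER_DOC / 60
manual_cost_per_year = manual_hours_per_day * WORKING_DAYS_PER_YEAR * HOURLY_CHARGE_USD

# Automated keyword extraction: total minutes for one year's worth of documents.
docs_per_year = DOCS_PER_DAY * WORKING_DAYS_PER_YEAR
auto_minutes_per_year = docs_per_year * AUTO_MIN_PER_DOC

print(f"Manual: {manual_hours_per_day} h/day, ${manual_cost_per_year:,.0f} per year per lawyer")
print(f"Automated: {auto_minutes_per_year:.2f} min for {docs_per_year} docs")
```

The automated total comes to just under the 10 minutes quoted on the slide, against 180 hours of manual work per lawyer per year.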