Getting Started with Unstructured Data

Published in: Education, Technology

  1. Getting Started with Unstructured Data. Christine Connors & Kevin Lynch, TriviumRLG LLC. Thursday, November 17, 2011
  2. Meta ✤ Presenter: Christine Connors ✤ @cjmconnors ✤ Presenter: Kevin Lynch ✤ @kevinjohnlynch ✤ Principals at www.triviumrlg.com ✤ Partnering with Dataversity
  3. Agenda ✤ What is unstructured data? ✤ Where do we find it? ✤ How important is it? ✤ How do we visualize it? ✤ Machine processing for actionable data ✤ Tools
  4. What is unstructured data? ✤ Data that ✤ Is not in a database ✤ Does not adhere to a formal data model ✤ Content
  5. Isn’t that a misnomer? ✤ Problematic term ✤ The presence of object metadata or aesthetic markup does not alone give ‘structure’ in this sense of the word ✤ Object metadata = machine or applied properties ✤ Aesthetic markup = stylesheets; rendering information ✤ Semi-structured data is typically treated as unstructured for the purposes of machine processing and analysis
  6. Types of ‘un’structured data ✤ Text-based documents ✤ Word processing, presentations, email, blogs, wikis, tweets, web pages, web components (read/write web) ✤ Audio/video files
  7. Where do we find it? ✤ Office productivity suites ✤ Content management systems ✤ Digital asset management systems ✤ Web content management systems ✤ Wikis, blogs, comment & discussion threads ✤ Social networking tools ✤ Twitter, Yammer, instant messengers
  8. Is it really that important? ✤ [Chart: structured data 15%, unstructured data 85%]
  9. What’s in that 80-85%? ✤ Progress reports - created in a word processor
  10. What’s in that 80-85%? ✤ Dashboards - created in presentation software
  11. What’s in that 80-85%? ✤ Progress reports - color-coded text in a spreadsheet
  12. What’s in that 80-85%? ✤ Brainstorming - in messaging systems ✤ Decision making - in email
  13. What’s in that 80-85%? ✤ Business intelligence - on the web and more
  14. How can we make the data more actionable? ✤ Identify it ✤ Convert to a format you can work with ✤ Add structure, meaning: ✤ information extraction ✤ annotation ✤ content analytics
  15. What about enterprise search? ✤ First line of defense ✤ Points you to the most relevant data, ranked via pattern matching and statistical analysis ✤ Does not assist in other visualizations or transformations without further machine processing
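
The "pattern matching and statistical analysis" behind relevancy ranking on slide 15 can be illustrated with a minimal TF-IDF sketch. This is an assumption about one common ranking approach, not a description of any particular search product; the function names and sample documents are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank(query, documents):
    """Return (score, doc_index) pairs sorted by descending TF-IDF score."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for index, tokens in enumerate(tokenized):
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term in tf:
                # Smoothed inverse document frequency: rarer terms weigh more.
                idf = math.log((1 + n_docs) / (1 + df[term])) + 1.0
                score += (tf[term] / len(tokens)) * idf
        scores.append((score, index))
    # Highest score first; ties broken by document order.
    return sorted(scores, key=lambda pair: (-pair[0], pair[1]))


# Hypothetical sample documents, for illustration only.
docs = [
    "Quarterly progress report for the search project",
    "Meeting notes about office furniture",
    "Search relevancy tuning notes for the enterprise search index",
]
ranking = rank("enterprise search", docs)
```

Here `ranking` orders the third document first, since it contains both query terms, and the second document last with a score of zero. Note that the output is only an ordering of documents, which is why the slide calls for further machine processing before other transformations.
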
  16. Information Extraction ✤ Token identification - “tokenization” ✤ Part-of-speech tagging - “POS” tagging (noun, verb, adverb, adjective, etc.) ✤ Phrase identification - noun phrase ✤ Entity extraction - people, places, events, dates, organizations
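
Two of the steps on slide 16, tokenization and entity extraction, can be sketched with the Python standard library alone. This is a rough illustration under stated assumptions: the organization list is a hypothetical gazetteer, the date pattern covers only one date format, and real toolkits use trained models rather than lookups. POS tagging and phrase identification are left out because they need a trained tagger.

```python
import re

# Hypothetical gazetteer: a tiny lookup list standing in for a real entity dictionary.
ORGANIZATIONS = {"IBM", "Oracle", "Autonomy"}

# Assumed date format: "Month D, YYYY" only.
DATE_PATTERN = re.compile(
    r"\b(?:January|February|March|April|May|June|July|August|"
    r"September|October|November|December) \d{1,2}, \d{4}\b"
)


def tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


def extract_entities(text):
    """Return (entity_type, surface_text) pairs: dates first, then organizations."""
    entities = [("DATE", match.group(0)) for match in DATE_PATTERN.finditer(text)]
    for token in tokenize(text):
        if token in ORGANIZATIONS:
            entities.append(("ORGANIZATION", token))
    return entities


# Made-up example sentence.
sentence = "Joe joined IBM on November 17, 2011."
tokens = tokenize(sentence)
entities = extract_entities(sentence)
```

The tokenizer splits the date into separate tokens, so the date pattern runs on the raw text instead. Choices like this one are where knowledge of your own data matters.
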
  17. Information Extraction ✤ Cluster analysis - group related information, where relationship may not be known ✤ Classification - mapping to specific categories ✤ Dependency identification / Rule generation ✤ Relationship detection - e.g. “Joe” “is CEO” at “IBM” ✤ Summarization - key concepts or key sentences
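
The relationship detection example on slide 17 (“Joe” “is CEO” at “IBM”) can be approximated with a single hand-written pattern. This is a sketch assuming sentences of the form "<Person> is [the] <role> at/of <Organization>" with capitalized names; the pattern and function name are hypothetical, and production systems typically combine entity extraction with parsing or learned models.

```python
import re

# Assumed sentence shape: "<Person> is [the] <role> at|of <Organization>".
RELATION_PATTERN = re.compile(
    r"(?P<person>[A-Z][a-z]+(?: [A-Z][a-z]+)*) is (?:the )?"
    r"(?P<role>[A-Z]{2,}|[a-z]+) (?:at|of) "
    r"(?P<org>[A-Z][A-Za-z]*(?: [A-Z][A-Za-z]*)*)"
)


def detect_relationships(text):
    """Return (person, role, organization) triples found by the pattern."""
    return [
        (m.group("person"), m.group("role"), m.group("org"))
        for m in RELATION_PATTERN.finditer(text)
    ]


triples = detect_relationships(
    "Joe is CEO at IBM. Fred Center is the CEO of Center Micros."
)
```

Each triple is a small piece of structure recovered from free text, ready to be loaded into a database or graph.
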
  18. Open Tools ✤ GATE – General Architecture for Text Engineering, from the University of Sheffield, with many users and excellent documentation. ✤ GATE has customizable document and corpus processing pipelines. GATE is an architecture, a framework, and a development environment, with a clean separation of algorithms, data, and visualization.
  19. Open Tools ✤ UIMA – Unstructured Information Management Architecture (IBM’s Watson uses this), originated at IBM, now an Apache project. ✤ Component software architecture with a document processing pipeline similar to GATE. Focus on performance and scalability, with distributed processing (web services).
  20. UIMA ✤ UIMA’s basic building blocks are annotators. They iterate over an artifact to discover new types based on existing ones and update the Common Analysis Structure (CAS) for downstream processing. ✤ The UIMA CAS representation is now aligned with the XMI standard. ✤ [Chart by IBM: the artifact (e.g., a document) “Fred Center is the CEO of Center Micros”, layered with analysis results (i.e., artifact metadata): parser annotations (NP, VP, PP), named entities (Person, Organization), and a relationship CeoOf (Arg1: Person, Arg2: Org)]
  21. UIMA ✤ [Image by IBM]
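
The annotator idea from the UIMA slides can be sketched in plain Python: each annotator reads a shared analysis structure and adds new typed annotations derived from the text or from earlier annotations. This is a conceptual toy, not the Apache UIMA API. The class and function names are hypothetical, and the lookup table stands in for a real named-entity component.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Annotation:
    type: str    # e.g. "Person", "Organization", "CeoOf"
    begin: int   # character offset where the span starts
    end: int     # character offset where the span ends (exclusive)
    features: dict = field(default_factory=dict)


@dataclass
class AnalysisStructure:
    """Toy stand-in for a CAS: the artifact text plus accumulated annotations."""
    text: str
    annotations: list = field(default_factory=list)

    def select(self, type_name):
        """Return all annotations of the given type."""
        return [a for a in self.annotations if a.type == type_name]

    def covered_text(self, annotation):
        """Return the artifact text spanned by an annotation."""
        return self.text[annotation.begin:annotation.end]


def named_entity_annotator(cas):
    """Add Person and Organization annotations from a hypothetical lookup table."""
    lookup = {"Fred Center": "Person", "Center Micros": "Organization"}
    for surface, type_name in lookup.items():
        for match in re.finditer(re.escape(surface), cas.text):
            cas.annotations.append(Annotation(type_name, match.start(), match.end()))


def ceo_relation_annotator(cas):
    """Derive CeoOf annotations from existing Person and Organization annotations."""
    for person in cas.select("Person"):
        for org in cas.select("Organization"):
            if person.end <= org.begin and "CEO of" in cas.text[person.end:org.begin]:
                cas.annotations.append(
                    Annotation("CeoOf", person.begin, org.end,
                               {"arg1": person, "arg2": org})
                )


# The pipeline: annotators run in order, each building on earlier results.
cas = AnalysisStructure("Fred Center is the CEO of Center Micros")
for annotator in (named_entity_annotator, ceo_relation_annotator):
    annotator(cas)
```

The second annotator never looks for names itself. It only combines types the first one produced, which is the "discover new types based on existing ones" behavior described on slide 20.
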
  22. Commercial Tools ✤ Oracle Data Mining (Text Mining) ✤ IBM SPSS ✤ SAS Text Miner ✤ Smartlogic ✤ Lots of acquisitions going on in the “big data” space ✤ HP acquired Autonomy ✤ Oracle acquired Endeca
  23. A Note on Tools ✤ UIMA and GATE – comprehensive suites of capabilities, with learning curves. ✤ Commercial tools range from unstructured capabilities inside DBMSs like Oracle, to business intelligence tools such as those from Business Objects (which acquired Inxight, a Xerox PARC spin-off). ✤ Your mileage will vary. The biggest differentiator is your knowledge of your data.
  24. What can unstructured data look like post-processing?
  25. Machine Processing ✤ [Diagram labels: Unstructured Data; Natural Language Processing; Rules-based Classification; Statistical Analysis; Semantic Analysis; Machine Processing Platform; Federated Search; API; Index; Visualizations; Data Stores]
  26. Questions?
  27. Thank you ✤ Christine Connors ✤ Kevin Lynch ✤ www.triviumrlg.com