Dave Kellogg at MarkLogic 2010 Government Summit

These are the slides from Dave Kellogg's presentation at the 2010 MarkLogic Government Summit in Tyson's Corner, VA.

Published in: Business, Technology, Education

  1. Slide 1. Copyright © 2010 MarkLogic® Corporation.
     Taming The Unstructured Data Problem
     Dave Kellogg, Chief Executive Officer
     11/17/10
  2. Slide 2: Topics
     • About MarkLogic
     • What do we mean by “unstructured”
     • What people do with unstructured information
     • Conclusions
  3. Slide 3: MarkLogic Government Began With a Hunch
     • A belief that Government agencies would have
       • Large amounts of
       • Unstructured information and
       • Want an open way to store it
       • And a standard way to run complex queries against it
     • Somewhere around 2005, we chose to make Government the second key sector for MarkLogic
       • The first was media/publishing
  4. Slide 4: As Hunches Go … It Was a Good One
     [Chart: vertical axis 0 to 250; horizontal axis 2004 through 2010P; labeled “Employees removed”]
  5. Slide 5
     • Media Customers
     • Government Customers
     • Financial Services and Other Customers
     200+ Customers
  6. Slide 6: Topics
     • About MarkLogic
     • What do we mean by “unstructured”
     • What people do with unstructured information
     • Conclusions
  7. Slide 7: My Database Journey
     • Lawrence Berkeley Lab
       • Seismic metadata in Ingres
     • Ingres 6.3
       • Product manager for first DBMS with user-defined types
     • BusinessObjects
       • Ran marketing for 9 years from $30M to $1B
     • MarkLogic
       • Structured/unstructured divide
       • First-class citizenship
  8. Slide 8: What Do We Mean by “Unstructured?”
     “It is estimated that about 80% of enterprise information is unstructured … and contains text and data that is not readily accessible but holds immeasurable value.” -- IDC, White Paper 9/06
     “Excuse me for saying so, but there is no such thing as unstructured information. Even the simplest information has a sequence in which there is a beginning, a middle, and an end.” -- Steven Newcomb, Topic Maps, Chapter 3.
     <enter>long debate</enter>
  9. Slide 9: The Information Continuum
     [Diagram: Information Continuum, with endpoints labeled “Unstructured” and “Structured”]
     Labels: Free text, Relational, Hierarchical, Semi-structured, Time-varying, XML, Metadata, Geospatial, Sparse, Graph, N-schema
  10. Slide 10: A Practical Definition of “Unstructured”
      You could put in:
      • Books, journals
      • Web pages
      • Message, cable traffic
      • Doctrine, procedures
      • Metadata
      • Hierarchies, graphs
      • Sparse data
      But should you?
      That which does not model well relationally
      RELATIONERTIA
  11. Slide 11: An Old Saw, Adapted
      If your only data modeling element’s a table, then every problem looks like a column
      • We believe there is a better way
        • Use XML as a means to represent unstructured information
        • Use XQuery as the language for building apps and analytics
        • Implement a specialized DBMS, purpose-built for managing vast amounts of unstructured information (MarkLogic Server)
  12. Slide 12: Topics
      • About MarkLogic
      • What do we mean by “unstructured”
      • What people do with unstructured information
      • Conclusions
  13. Slide 13: Digital Publishing: Custom Textbook Publishing
      Labels: Search, Browse, Chapters, Customize, Create
  14. Slide 14: Digital Publishing: Web 2.0 Applications
      Labels: Topics, Activity / Feed, Profiles, Social bookmarking, Social network, Targeted Ads
  15. Slide 15: Person-of-Interest Databases
      • Multi-valued attributes
        • Discard nothing: as many heights as sources
        • Repeating groups drive creation of table per attribute
      • Sparse data
        • Thousands of possible attributes of which few are known
      • Typical result
        • 500+ largely empty tables
        • Huge joins cripple query performance
      • Bonus
        • Fun attributes like body markings
        • Transliteration: Gadafi vs. Khadafi
      A seemingly simple problem made difficult by 2 things
  16. Slide 16: Metadata Catalogs
      • Digital card catalogs for tracking information assets
        • Intelligence community information sharing
        • Libraries and archives
        • Digital asset repositories
      • If you can’t search the content, search the metadata
      • Why MarkLogic?
        • Changing metadata standards
        • Evolving metadata fields
        • User-generated metadata (tagging, folksonomy)
        • Text metadata where search-style matching desirable
  17. Slide 17: Situational Awareness
      • Integrating information in real-time from multiple sources to improve operational decision making
        • Scraping websites, chat sessions, news, …
        • Integrating geospatial information
        • Pulling information from existing systems
      • Civilian and Defense applications
      • Why MarkLogic?
        • Geospatial indexing
        • Zero-latency indexing, real-time query performance
        • Ability to handle diverse content in different structures
  18. Slide 18: Intelligence Applications
      • Open source intelligence
        • Scrape and enrich publicly available Internet content
        • Load into content repository
        • Build applications that enable search and annotation
      • Cellphone exploitation
        • Collect contacts, call history, and messages
        • Quickly load into database in the field
        • Search social network for suspects
      • Link analysis
        • Analyze the graph of contacts and organizations
  19. Slide 19: Topics
      • About MarkLogic
      • What do we mean by “unstructured”
      • What people do with unstructured information
      • Conclusions
  20. Slide 20: The Relational “Data Base” Was Invented in 1970
      • Provide flexible ad hoc queries to structured data
      • Wasn’t thinking about
        • Web content
        • PDFs
        • Word files
        • SIGINT
        • RSS feeds
        • Tweets
        • 21st century challenges
  21. Slide 21: What Else Happened in 1970?
      • Super Bowl IV
      • Janis Joplin died
      • Mariah Carey was born
      • Beatles disbanded after Let It Be
      • Monday Night Football debuted
      • First episode of All My Children
      • Boeing 747 entered service
      • First F-14 Tomcat test flight
      • Gas cost $0.36/gallon
      • Storage cost over $200/megabyte
  22. Slide 23: Thank You! (And Please Follow Me At …)
      • www.kellblog.com
      • twitter.com/kellblog
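
The person-of-interest slide (Slide 15) argues that multi-valued and sparse attributes fit poorly into tables, and Slide 11 proposes XML as the representation. A minimal sketch of that idea follows, assuming a made-up record: the element names, the `source` attribute, the agency labels, and the height values are all hypothetical and only illustrate the shape of such a document. It uses Python's standard library for illustration; the slides themselves name XQuery as the query language.

```python
import xml.etree.ElementTree as ET

# Hypothetical record: element names, attributes, and values are illustrative only.
# Every reported value is kept ("discard nothing"), each tagged with its source.
PERSON_XML = """
<person id="p-001">
  <name source="agency-a">Gadafi</name>
  <name source="agency-b">Khadafi</name>
  <height unit="cm" source="agency-a">178</height>
  <height unit="cm" source="agency-b">181</height>
  <body-marking source="agency-b">scar on left forearm</body-marking>
</person>
"""


def values(person, tag):
    """Return every (source, text) pair recorded for one attribute."""
    return [(el.get("source"), el.text) for el in person.findall(tag)]


person = ET.fromstring(PERSON_XML)

# Multi-valued attributes: one document holds as many heights as sources.
heights = values(person, "height")

# Transliteration variants sit side by side as repeated elements.
names = values(person, "name")

# Sparse data: an unknown attribute is simply absent, not an empty table.
eye_colors = values(person, "eye-color")

print(heights)
print(names)
print(eye_colors)
```

In this sketch, a repeated element takes the place of a table per attribute, and an attribute that is not known costs nothing because it never appears in the document.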
