Dave Kellogg at MarkLogic 2010 Government Summit

These are the slides from Dave Kellogg's presentation at the 2010 MarkLogic Government Summit in Tyson's Corner, VA.

Published in: Business, Technology, Education

Transcript

  • 1. Taming The Unstructured Data Problem
      Dave Kellogg, Chief Executive Officer, 11/17/10
      Copyright © 2010 MarkLogic® Corporation.
  • 2. Topics
      • About MarkLogic
      • What do we mean by “unstructured”
      • What people do with unstructured information
      • Conclusions
  • 3. MarkLogic Government Began With a Hunch
      • A belief that Government agencies would have
          • Large amounts of
          • Unstructured information and
          • Want an open way to store it
          • And a standard way to run complex queries against it
      • Somewhere around 2005, we chose to make Government the second key sector for MarkLogic
          • The first was media/publishing
  • 4. As Hunches Go … It Was a Good One
      [Chart: “Employees” by year, 2004 to 2010P, vertical axis 0 to 250; marked “removed”]
  • 5. 200+ Customers
      • Media Customers
      • Government Customers
      • Financial Services and Other Customers
  • 6. Topics
      • About MarkLogic
      • What do we mean by “unstructured”
      • What people do with unstructured information
      • Conclusions
  • 7. My Database Journey
      • Lawrence Berkeley Lab
          • Seismic metadata in Ingres
      • Ingres 6.3
          • Product manager for first DBMS with user-defined types
      • BusinessObjects
          • Ran marketing for 9 years from $30M to $1B
      • MarkLogic
          • Structured/unstructured divide
          • First-class citizenship
  • 8. What Do We Mean by “Unstructured?”
      “It is estimated that about 80% of enterprise information is unstructured … and contains text and data that is not readily accessible but holds immeasurable value.”
      -- IDC, White Paper 9/06
      “Excuse me for saying so, but there is no such thing as unstructured information. Even the simplest information has a sequence in which there is a beginning, a middle, and an end.”
      -- Steven Newcomb, Topic Maps, Chapter 3.
      <enter>long debate</enter>
  • 9. The Information Continuum
      [Diagram: a continuum between “Structured” (relational) and “Unstructured” (free text). Labels placed along it: Hierarchical, Semi-structured, Time-varying, XML, Metadata, Geospatial, Sparse, Graph, N-schema]
  • 10. A Practical Definition of “Unstructured”
      • That which does not model well relationally
      • You could put in:
          • Books, journals
          • Web pages
          • Message, cable traffic
          • Doctrine, procedures
          • Metadata
          • Hierarchies, graphs
          • Sparse data
      • But should you?
      • RELATIONERTIA
  • 11. An Old Saw, Adapted
      If your only data modeling element’s a table, then every problem looks like a column
      • We believe there is a better way
          • Use XML as a means to represent unstructured information
          • Use XQuery as the language for building apps and analytics
          • Implement a specialized DBMS, purpose-built for managing vast amounts of unstructured information (MarkLogic Server)
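
To make the "XML as a means to represent unstructured information" point concrete, here is a minimal sketch in Python using only the standard library. It is not MarkLogic Server or XQuery; the document content and the element names (`cable`, `subject`, `body`, `para`, `place`, `person`) are invented for illustration. It shows text with inline markup kept as one document, then queried both as structure and as plain text.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: content and element names are made up for illustration.
DOC = """<cable>
  <subject>Field notes</subject>
  <body><para>Report filed from <place>Springfield</place> by <person>J. Doe</person>.</para></body>
</cable>"""

doc = ET.fromstring(DOC)

# Structured view: pull out every tagged person and place, wherever they occur.
people = [el.text for el in doc.iter("person")]
places = [el.text for el in doc.iter("place")]

# Text view: the same body read as running prose, with the markup ignored.
body_text = "".join(doc.find("body").itertext()).strip()

print(people, places)
print(body_text)
```

A relational design would need to choose between storing the paragraph as an opaque text column or splitting it across rows; the document form keeps both readings available.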
  • 12. Topics
      • About MarkLogic
      • What do we mean by “unstructured”
      • What people do with unstructured information
      • Conclusions
  • 13. Digital Publishing: Custom Textbook Publishing
      [Screenshot labels: Search, Browse Chapters, Customize, Create]
  • 14. Digital Publishing: Web 2.0 Applications
      [Screenshot labels: Topics, Activity / Feed, Profiles, Social bookmarking, Social network, Targeted Ads]
  • 15. Person-of-Interest Databases
      A seemingly simple problem made difficult by 2 things
      • Multi-valued attributes
          • Discard nothing: as many heights as sources
          • Repeating groups drive creation of table per attribute
      • Sparse data
          • Thousands of possible attributes of which few are known
      • Typical result
          • 500+ largely empty tables
          • Huge joins cripple query performance
      • Bonus
          • Fun attributes like body markings
          • Transliteration: Gadafi vs. Khadafi
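
The two difficulties on this slide (multi-valued attributes and sparse data) can be sketched with a single XML record. This is an illustration in Python's standard library, not the MarkLogic implementation; the record, its values, and the element and attribute names (`person`, `height`, `source`, `body-marking`, `eye-color`) are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical record: every value and name here is invented for illustration.
RECORD = """<person id="p1">
  <name source="report-a">Gadafi</name>
  <name source="report-b">Khadafi</name>
  <height unit="cm" source="report-a">180</height>
  <height unit="cm" source="report-b">183</height>
  <body-marking source="report-b">scar, left forearm</body-marking>
</person>"""

def values(person, tag):
    """Return every (source, text) pair recorded for one attribute."""
    return [(el.get("source"), el.text) for el in person.findall(tag)]

person = ET.fromstring(RECORD)

# Multi-valued: both reported heights and both spellings are kept side by side.
heights = values(person, "height")
names = values(person, "name")

# Sparse: an attribute nobody reported is simply absent, not an empty table.
eye_color = values(person, "eye-color")

print(heights, names, eye_color)
```

In a table-per-attribute design each of these lookups would be a join against a separate, mostly empty table; here the repeated and missing attributes live in the one record.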
  • 16. Metadata Catalogs
      • Digital card catalogs for tracking information assets
          • Intelligence community information sharing
          • Libraries and archives
          • Digital asset repositories
      • If you can’t search the content, search the metadata
      • Why MarkLogic?
          • Changing metadata standards
          • Evolving metadata fields
          • User-generated metadata (tagging, folksonomy)
          • Text metadata where search-style matching desirable
  • 17. Situational Awareness
      • Integrating information in real-time from multiple sources to improve operational decision making
          • Scraping websites, chat sessions, news, …
          • Integrating geospatial information
          • Pulling information from existing systems
      • Civilian and Defense applications
      • Why MarkLogic?
          • Geospatial indexing
          • Zero-latency indexing, real-time query performance
          • Ability to handle diverse content in different structures
  • 18. Intelligence Applications
      • Open source intelligence
          • Scrape and enrich publicly available Internet content
          • Load into content repository
          • Build applications that enable search and annotation
      • Cellphone exploitation
          • Collect contacts, call history, and messages
          • Quickly load into database in the field
          • Search social network for suspects
      • Link analysis
          • Analyze the graph of contacts and organizations
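
As a rough illustration of the "link analysis" bullet, the sketch below treats call records as an undirected graph and finds everyone within a given number of hops of a starting contact, using a breadth-first search. The slides do not say how such analysis is implemented; the function name `within_hops`, the sample `CALLS` data, and the hop-limit approach are assumptions made for this example.

```python
from collections import deque

def within_hops(calls, start, max_hops):
    """Breadth-first search over an undirected contact graph.

    Returns a dict mapping each contact reachable from `start`
    in at most `max_hops` steps to its hop distance.
    """
    graph = {}
    for a, b in calls:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)

    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if dist[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in graph.get(node, ()):
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)

    del dist[start]  # report only the other contacts
    return dist

# Hypothetical call history: each pair is one observed contact between two parties.
CALLS = [("A", "B"), ("B", "C"), ("C", "D"), ("E", "F")]

print(within_hops(CALLS, "A", 2))
```

With the sample data, a two-hop search from "A" reaches "B" and "C" but not "D" (three hops away) or the unconnected pair "E" and "F".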
  • 19. Topics
      • About MarkLogic
      • What do we mean by “unstructured”
      • What people do with unstructured information
      • Conclusions
  • 20. The Relational “Data Base” Was Invented in 1970
      • Provide flexible ad hoc queries to structured data
      • Wasn’t thinking about
          • Web content
          • PDFs
          • Word files
          • SIGINT
          • RSS feeds
          • Tweets
          • 21st century challenges
  • 21. What Else Happened in 1970?
      • Super Bowl IV
      • Janis Joplin died
      • Mariah Carey was born
      • Beatles disbanded after Let It Be
      • Monday Night Football debuted
      • First episode of All My Children
      • Boeing 747 entered service
      • First F-14 Tomcat test flight
      • Gas cost $0.36/gallon
      • Storage cost over $200/megabyte
  • 22. Slide 23: Thank You! (And Please Follow Me At …)
      • www.kellblog.com
      • twitter.com/kellblog
