Dave Kellogg at MarkLogic 2010 Government Summit

These are the slides from Dave Kellogg's presentation at the 2010 MarkLogic Government Summit in Tyson's Corner, VA.

Presentation Transcript

  • Taming The Unstructured Data Problem
    Dave Kellogg
    Chief Executive Officer
    11/17/10
  • Topics
    About MarkLogic
    What do we mean by “unstructured”
    What people do with unstructured information
    Conclusions
  • MarkLogic Government Began With a Hunch
    A belief that Government agencies would have
    Large amounts of
    Unstructured information and
    Want an open way to store it
    And a standard way to run complex queries against it
    Somewhere around 2005, we chose to make Government the second key sector for MarkLogic
    The first was media/publishing
  • As Hunches Go … It Was a Good One
    removed
  • Media Customers
    Government Customers
    Financial Services and Other Customers
    200+ Customers
  • Topics
    About MarkLogic
    What do we mean by “unstructured”
    What people do with unstructured information
    Conclusions
  • My Database Journey
    Lawrence Berkeley Lab
    Seismic metadata in Ingres
    Ingres 6.3
    Product manager for first DBMS with user-defined types
    BusinessObjects
    Ran marketing for 9 years from $30M to $1B
    MarkLogic
    Structured/unstructured divide
    First-class citizenship
  • What Do We Mean by “Unstructured?”
    “It is estimated that about 80% of enterprise information is unstructured
    … and contains text and data that is not readily accessible but holds immeasurable value.”
    -- IDC, White Paper 9/06
    “Excuse me for saying so, but there is no such thing as unstructured information.
    Even the simplest information has a sequence in which there is a beginning, a middle, and an end.”
    -- Steven Newcomb, Topic Maps, Chapter 3.
    <enter>long debate</enter>
  • The Information Continuum
    XML
    Metadata
    Geospatial
    Graph
    Free text
    Relational
    Time-varying
    Sparse
    N-schema
    Hierarchical
    Semi-structured
    “Unstructured”
    “Structured”
  • A Practical Definition of “Unstructured”
    That which does not model well relationally
    You could put in:
    Books, journals
    Web pages
    Message, cable traffic
    Doctrine, procedures
    Metadata
    Hierarchies, graphs
    Sparse data
    But should you?
    RELATIONERTIA
  • An Old Saw, Adapted
    If your only data modeling element’s a table, then every problem looks like a column
    We believe there is a better way
    Use XML as a means to represent unstructured information
    Use XQuery as the language for building apps and analytics
    Implement a specialized DBMS, purpose-built for managing vast amounts of unstructured information (MarkLogic Server)
  • Topics
    About MarkLogic
    What do we mean by “unstructured”
    What people do with unstructured information
    Conclusions
  • Digital Publishing: Custom Textbook Publishing
    Browse
    Chapters
    Customize
    Create
    Search
  • Digital Publishing: Web 2.0 Applications
    Profiles
    Social network
    Social bookmarking
    Targeted Ads
    Topics
    Activity / Feed
  • Person-of-Interest Databases
    A seemingly simple problem made difficult by 2 things
    Multi-valued attributes
    Discard nothing: as many heights as sources
    Repeating groups drive creation of table per attribute
    Sparse data
    Thousands of possible attributes of which few are known
    Typical result
    500+ largely empty tables
    Huge joins cripple query performance
    Bonus
    Fun attributes like body markings
    Transliteration: Gadafi vs. Khadafi
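As a rough illustration of the two difficulties named on this slide, the sketch below stores one person as a single XML document in which an attribute may repeat once per source and unknown attributes are simply absent. The element names, attribute names, sources, and values are all hypothetical, and the code uses Python's standard `xml.etree.ElementTree` module rather than MarkLogic Server or XQuery.

```python
import xml.etree.ElementTree as ET

# Hypothetical record: every element name, source, and value is illustrative.
RECORD = """
<person id="p1">
  <name source="report-a">Aleksandr Example</name>
  <name source="report-b">Alexander Example</name>
  <height unit="cm" source="report-a">178</height>
  <height unit="cm" source="report-b">181</height>
  <body-marking source="report-b">scar on left forearm</body-marking>
</person>
"""


def values(person, tag):
    """Return every (source, text) pair recorded for one attribute."""
    return [(el.get("source"), el.text) for el in person.findall(tag)]


person = ET.fromstring(RECORD)
heights = values(person, "height")        # multi-valued: one entry per source
names = values(person, "name")            # both transliterations are kept
eye_colors = values(person, "eye-color")  # sparse: unknown attribute is absent

print(heights)
print(eye_colors)
```

Nothing is discarded when sources disagree, and an attribute that was never observed costs nothing to store, which is the contrast with a table per attribute.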
  • Metadata Catalogs
    Digital card catalogs for tracking information assets
    Intelligence community information sharing
    Libraries and archives
    Digital asset repositories
    If you can’t search the content, search the metadata
    Why MarkLogic?
    Changing metadata standards
    Evolving metadata fields
    User-generated metadata (tagging, folksonomy)
    Text metadata where search-style matching is desirable
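To make "evolving metadata fields" and "search-style matching" concrete, here is a minimal sketch in plain Python. It assumes a hypothetical catalog whose entries do not share a fixed set of fields, and it matches a term against whatever text fields each entry happens to have. The entries, field names, and the `search` helper are invented for illustration and do not represent any product's API.

```python
# Hypothetical catalog entries; field names vary from record to record.
CATALOG = [
    {"id": "a1", "title": "Harbor survey 2009", "tags": ["maritime", "survey"]},
    {"id": "a2", "title": "Port logistics report", "subject": "Maritime shipping"},
    {"id": "a3", "title": "Budget memo", "keywords": ["finance"]},
]


def text_values(value):
    """Yield every string found in a metadata value (a string or a list)."""
    if isinstance(value, str):
        yield value
    elif isinstance(value, list):
        for item in value:
            yield from text_values(item)


def search(catalog, term):
    """Return ids of entries where any metadata field contains the term."""
    needle = term.lower()
    hits = []
    for entry in catalog:
        for field, value in entry.items():
            if field == "id":
                continue
            if any(needle in text.lower() for text in text_values(value)):
                hits.append(entry["id"])
                break  # one matching field is enough for this entry
    return hits


print(search(CATALOG, "maritime"))
```

The search never names a field, so a new field such as user-generated tags becomes searchable without a schema change.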
  • Situational Awareness
    Integrating information in real-time from multiple sources to improve operational decision making
    Scraping websites, chat sessions, news, …
    Integrating geospatial information
    Pulling information from existing systems
    Civilian and Defense applications
    Why MarkLogic?
    Geospatial indexing
    Zero-latency indexing, real-time query performance
    Ability to handle diverse content in different structures
  • Intelligence Applications
    Open source intelligence
    Scrape and enrich publicly available Internet content
    Load into content repository
    Build applications that enable search and annotation
    Cellphone exploitation
    Collect contacts, call history, and messages
    Quickly load into database in the field
    Search social network for suspects
    Link analysis
    Analyze the graph of contacts and organizations
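The link analysis step can be sketched as a breadth-first search over a contact graph built from call history. This is only an illustration under stated assumptions: the call records are hypothetical (caller, callee) pairs, and `build_graph` and `within_hops` are names invented here, not part of any tool mentioned in the talk.

```python
from collections import defaultdict, deque

# Hypothetical call records as (caller, callee) pairs.
CALLS = [("A", "B"), ("B", "C"), ("C", "D"), ("E", "F")]


def build_graph(calls):
    """Build an undirected contact graph from call records."""
    graph = defaultdict(set)
    for caller, callee in calls:
        graph[caller].add(callee)
        graph[callee].add(caller)
    return graph


def within_hops(graph, start, max_hops):
    """Return {contact: hop_count} for contacts reachable within max_hops."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if seen[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen[neighbor] = seen[node] + 1
                queue.append(neighbor)
    del seen[start]
    return seen


graph = build_graph(CALLS)
print(within_hops(graph, "A", 2))
```

Limiting the hop count keeps the result to a suspect's near network instead of every phone that is eventually connected.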
  • Topics
    About MarkLogic
    What do we mean by “unstructured”
    What people do with unstructured information
    Conclusions
  • The Relational “Data Base” Was Invented in 1970
    Provide flexible ad hoc queries to structured data
    Wasn’t thinking about
    Web content
    PDFs
    Word files
    SIGINT
    RSS feeds
    Tweets
    21st century challenges
  • What Else Happened in 1970?
    Super Bowl IV
    Janis Joplin died
    Mariah Carey was born
    Beatles disbanded after Let It Be
    Monday Night Football debuted
    First episode of All My Children
    Boeing 747 entered service
    First F-14 Tomcat test flight
    Gas cost $0.36/gallon
    Storage cost over $200/megabyte
  • Thank You! (And Please Follow Me At …)
    www.kellblog.com
    twitter.com/kellblog