Kellogg XML Holland Speech


Speech by Dave Kellogg of Mark Logic at XML Holland 2007 in Amsterdam.


  1. 1. Content Applications: The Key To Next-Generation Publishing. Dave Kellogg, Chief Executive Officer. 29 November 2007. Check out my blog at
  2. 2. Topics <ul><li>A brief history of databases </li></ul><ul><li>Search engines: filling the gap </li></ul><ul><li>Web 2.0 </li></ul><ul><li>Content applications </li></ul>
  3. 3. A Brief History of Database Management Systems (DBMSs) <ul><li>Hierarchical </li></ul><ul><ul><li>Data model: hierarchy </li></ul></ul><ul><ul><li>Example: IMS </li></ul></ul><ul><li>Network </li></ul><ul><ul><li>Data model: directed graph </li></ul></ul><ul><ul><li>Examples: IDMS, DBMS32 </li></ul></ul><ul><li>Relational </li></ul><ul><ul><li>Data model: table (i.e., relation) </li></ul></ul><ul><ul><li>Examples: Oracle, DB2, MySQL </li></ul></ul>Mainstream market evolution through three generations
  4. 4. Aside: The Great Irony of RDBMS Evolution <ul><li>The value proposition for the RDBMS was decision support oriented </li></ul><ul><ul><li>Prior-generation systems were great at OLTP </li></ul></ul><ul><ul><li>Answer any question – without knowing in advance what you’ll ask </li></ul></ul><ul><li>That focus was completely lost in the late 1980s / early 1990s </li></ul><ul><ul><li>TP1, DebitCredit, TPC-x, etc. benchmark wars </li></ul></ul><ul><ul><li>Use the new car to do what the old one did well </li></ul></ul><ul><li>Only a decade later did the decision support focus return </li></ul><ul><ul><li>Data warehousing </li></ul></ul><ul><ul><li>Business intelligence </li></ul></ul><ul><li>Kellogg’s rule: IT repeatedly buys insight and gets transactions </li></ul><ul><ul><li>Examples: RDBMS, ERP </li></ul></ul><ul><ul><li>Only BI delivered on the insight value proposition </li></ul></ul>
  5. 5. Specialized DBMS Markets <ul><li>Object databases </li></ul><ul><ul><li>Network underlying structure with fantastic integration to object programming languages </li></ul></ul><ul><ul><li>Died: when the programmers fought the DBAs, the DBAs won </li></ul></ul><ul><ul><li>The poster-child for special-purpose DBMS failure </li></ul></ul><ul><li>Multi-dimensional databases </li></ul><ul><ul><li>Also known as OLAP (online analytic processing) servers </li></ul></ul><ul><ul><li>Dimensionalize, aggregate, and pre-calculate answers to intersections (e.g., product, geography, time) </li></ul></ul><ul><ul><li>Approx $1B DBMS market </li></ul></ul><ul><ul><li>Products: Essbase, SQL Server Analysis Services, Qlikview </li></ul></ul><ul><ul><li>The overlooked example of special-purpose DBMS success </li></ul></ul>
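The "dimensionalize, aggregate, and pre-calculate" idea on the slide above can be illustrated with a minimal Python sketch. It is not how Essbase or any other named product works internally; the `build_cube` function, the `ALL` marker, and the sample facts are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from itertools import product

ALL = "*"  # hypothetical marker meaning "aggregated over this dimension"

def build_cube(facts):
    """Pre-calculate sums for every (product, geography, time) intersection,
    including roll-ups where a dimension is replaced by ALL."""
    cube = defaultdict(float)
    for prod, geo, period, amount in facts:
        # Each fact contributes to 2**3 cells: every combination of
        # "specific value" or ALL across the three dimensions.
        for cell in product((prod, ALL), (geo, ALL), (period, ALL)):
            cube[cell] += amount
    return dict(cube)

# Made-up sample facts: (product, geography, time, amount).
facts = [
    ("widget", "NL", "2007-Q3", 100.0),
    ("widget", "US", "2007-Q3", 250.0),
    ("gadget", "NL", "2007-Q4", 75.0),
]
cube = build_cube(facts)

# Answers are now dictionary lookups instead of scans over the facts.
widget_total = cube[("widget", ALL, ALL)]
nl_total = cube[(ALL, "NL", ALL)]
grand_total = cube[(ALL, ALL, ALL)]
```

The trade-off in this sketch is the one the slide implies: queries over known dimensions become lookups, at the cost of storing every pre-calculated intersection up front.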
  6. 6. Generational Differences and Commonalities <ul><li>Differences </li></ul><ul><li>Degree of BI / query processing </li></ul><ul><li>Ability to process unpredicted queries </li></ul><ul><li>Multi-dimensional processing </li></ul><ul><li>Elegance of language interfaces </li></ul><ul><li>Commonalities </li></ul><ul><li>Business data processing </li></ul><ul><li>Data-orientation </li></ul><ul><li>Assume consistent schema </li></ul><ul><li>Documents ignored </li></ul>
  7. 7. Schema Regularity vs. Query Flexibility [Chart: axes Schema Regularity and Query Flexibility, each running Low to Hi; plotted: NDBMS, RDBMS]
  8. 8. Topics <ul><li>A brief history of databases </li></ul><ul><li>Search engines: filling the gap </li></ul><ul><li>Web 2.0 </li></ul><ul><li>Content applications </li></ul>
  9. 9. Enter Search Engines <ul><li>Assume nothing </li></ul><ul><ul><li>Index the words, and “filter” the rest </li></ul></ul><ul><li>Don’t even know where the content is </li></ul><ul><ul><li>Spiders to find it </li></ul></ul><ul><li>Indexes without data </li></ul><ul><ul><li>Leave content in place and point to it -- not the copy of record </li></ul></ul><ul><ul><li>But build a “cache” so it’s copied anyway </li></ul></ul><ul><ul><li>404 type errors: data integrity </li></ul></ul><ul><li>Read-only </li></ul><ul><ul><li>Designed to help you find documents that might be useful </li></ul></ul><ul><li>Imprecise results </li></ul><ul><ul><li>There is no one precise “right” answer </li></ul></ul><ul><ul><li>No two engines ever return the same results, hence meta-search engines </li></ul></ul><ul><ul><li>Trade-offs in precision vs. recall </li></ul></ul><ul><ul><li>The relevancy quest </li></ul></ul>
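The precision vs. recall trade-off named on the slide above can be made concrete with a small Python sketch. The `precision_recall` helper and the document ids are hypothetical; the definitions used are the standard ones (precision is the share of returned results that are relevant, recall is the share of relevant documents that were returned).

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for one result list."""
    returned, relevant = set(returned), set(relevant)
    hits = returned & relevant
    precision = len(hits) / len(returned) if returned else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Made-up judgment: four documents are actually relevant to the query.
relevant = {"d1", "d2", "d3", "d4"}

# A narrow result list: everything returned is relevant, but half is missed.
narrow = precision_recall(["d1", "d2"], relevant)

# A broad result list: nothing is missed, but half of it is noise.
broad = precision_recall(
    ["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8"], relevant
)
```

In this toy case, widening the result list raises recall and lowers precision, which is the tension a search engine's ranking has to manage.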
  10. 10. The Relevancy Quest <ul><li>TF/IDF </li></ul><ul><ul><li>Term frequency / inverse document frequency </li></ul></ul><ul><ul><li>Webmaster effectively controls </li></ul></ul><ul><ul><li>Easily gamed by spammers </li></ul></ul><ul><li>Link authority </li></ul><ul><ul><li>Inbound link analysis </li></ul></ul><ul><ul><li>Other webmasters effectively control </li></ul></ul><ul><ul><li>Increasingly gamed by spammers </li></ul></ul><ul><li>Interactive query refinement </li></ul><ul><ul><li>Extraction, taxonomy, clustering, faceted navigation </li></ul></ul><ul><li>Search history </li></ul><ul><li>Semantic analysis </li></ul><ul><ul><li>Hakia’s SemanticRank </li></ul></ul><ul><li>Humans </li></ul><ul><ul><li>ChaCha guides </li></ul></ul>You try to guess what someone wants when they blurt two words! Is it Quixotic? Consider BI: let users access a database via a tool that generates a query language. No guessing required.
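As a rough illustration of the TF/IDF bullet on the slide above, here is a minimal Python sketch of one common textbook variant (relative term frequency multiplied by the natural log of the inverse document frequency). Real engines use many refinements; the `tf_idf` function, the whitespace tokenizer, and the sample documents are assumptions made for the example.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            # Term frequency within the document times inverse
            # document frequency across the collection.
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Made-up sample documents.
docs = [
    "xml server stores xml content",
    "relational database stores tables",
    "search engine indexes content",
]
scores = tf_idf(docs)
```

Because the term-frequency factor depends only on the page's own text, whoever writes the page can raise it by repeating a word, which is consistent with the slide's point that the webmaster effectively controls this signal.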
  11. 11. Aside: Fun with People-Powered Search <ul><li>Search: my wife … </li></ul><ul><li>Status: Looking for a guide ... connected to Pamela C </li></ul><ul><li>Pamela C: Welcome to ChaCha! </li></ul><ul><li>You: I'm looking for my wife, can you help me? </li></ul><ul><li>Pamela C: Nope </li></ul><ul><li>You: I thought you guys were experts in searching? She left an hour ago but her cell phone is off. </li></ul><ul><li>Pamela C: searchable searches online </li></ul><ul><li>You: you mean to say with all the world's information at our fingertips we can't put our heads together and find my wife? </li></ul><ul><li>Pamela C: nope </li></ul><ul><li>You: chacha sucks! </li></ul><ul><li>Pamela C: I am sorry </li></ul><ul><li>You: will you marry me Pamela? </li></ul><ul><li>Pamela C: You are married </li></ul><ul><li>You: yeah, but she's gone. I live in Utah. </li></ul><ul><li>Pamela C: Okay </li></ul><ul><li>You: awesome. </li></ul><ul><li>Pamela C: Awesome </li></ul><ul><li>Pamela C: What are you doin </li></ul><ul><li>You: crying </li></ul><ul><li>Pamela C: Why </li></ul><ul><li>You: I miss my other wife. </li></ul><ul><li>Pamela C: What is her name </li></ul><ul><li>You: Pamela B. </li></ul><ul><li>You: you are the third pamela, but I promise to love you as much as the others. </li></ul>
  12. 12. Search Viewed From a DBMS Perspective <ul><li>You can run one query (very well) </li></ul><ul><ul><li>Return links to documents where document contains [word | phrase] </li></ul></ul><ul><li>You don’t serve as the master repository for the content </li></ul><ul><ul><li>But seem to “cache” everything so it takes as much space as if you did </li></ul></ul><ul><li>Indexes can point to things that no longer exist </li></ul><ul><li>Indexes are not updated in real-time </li></ul><ul><li>There’s no locking or transactions; no read-consistent snapshots </li></ul><ul><li>I like </li></ul><ul><ul><li>How you assume nothing about the content </li></ul></ul><ul><ul><li>Your indexing and query processing for text </li></ul></ul><ul><ul><li>Your scalability architecture and clustering </li></ul></ul>My, what a funny-looking DBMS you have there!
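The "one query" and the stale-index complaint on the slide above can be sketched with a toy inverted index in Python. This is an illustrative assumption about the general technique, not a description of any particular engine; `build_index`, `search`, and the sample documents are hypothetical names and data.

```python
from collections import defaultdict

def build_index(docs):
    """Map each lowercase word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search(index, word):
    """The one query: return sorted ids of documents containing the word."""
    return sorted(index.get(word.lower(), set()))

# Made-up content left "in place" outside the index.
docs = {
    "a.html": "XML servers store content",
    "b.html": "Search engines index content",
}
index = build_index(docs)
hits = search(index, "content")

# The index is separate from the content: deleting the source does not
# update it, so it still points at a document that no longer exists.
del docs["a.html"]
stale = [d for d in search(index, "content") if d not in docs]
```

Nothing in this sketch ties the index to the content transactionally, which is the gap the slide points out when it views search from a DBMS perspective.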
  13. 13. Schema Regularity vs. Query Flexibility [Chart: axes Schema Regularity and Query Flexibility, each running Low to Hi; plotted: NDBMS, RDBMS, Enterprise Search Engine, and an open “?” position]
  14. 14. Topics <ul><li>A brief history of databases </li></ul><ul><li>Search engines: filling the gap </li></ul><ul><li>Web 2.0 </li></ul><ul><li>Content applications </li></ul>
  15. 15. Enter Web 2.0 <ul><li>People don’t just want to find documents anymore </li></ul><ul><li>They want to </li></ul><ul><ul><li>Search content </li></ul></ul><ul><ul><li>At granular level (just give me what I want) </li></ul></ul><ul><ul><li>Navigate content </li></ul></ul><ul><ul><li>See dynamic content </li></ul></ul><ul><ul><li>View content in context (of a task or problem) </li></ul></ul><ul><ul><li>Understand the social graph associated with content </li></ul></ul><ul><ul><li>Transcend the data / content boundary </li></ul></ul><ul><ul><li>Integrate and mash-up content </li></ul></ul><ul><ul><li>Interact with content </li></ul></ul><ul><li>In short, they want full-blown applications on content </li></ul><ul><ul><li>Which we, shockingly, call “content applications” </li></ul></ul>
  16. 16. Some Thoughts on Web 2.0 <ul><li>Lots of funny little companies with one or two syllable names? </li></ul><ul><li>Hype designed by venture capitalists to fuel Bubble 2.0? </li></ul><ul><li>What Tim 2 (O’Reilly) did while Tim 1 (Berners-Lee) waited for the semantic web? </li></ul><ul><li>Transformation of the web from publishing to programming platform? </li></ul><ul><li>In my opinion, it’s about </li></ul><ul><ul><li>The specialization of search: Retrevo, Kayak, PickPackGo </li></ul></ul><ul><ul><li>The wisdom of crowds: Wikis, PageRank, feedback (e.g., Digg, eBay) </li></ul></ul><ul><ul><li>The re-emergence of community: Facebook, LinkedIn, MySpace </li></ul></ul><ul><ul><li>The transformation of the web </li></ul></ul><ul><ul><ul><li>From an e-commerce and publishing platform </li></ul></ul></ul><ul><ul><ul><li>To an interactive information platform </li></ul></ul></ul><ul><ul><li>The transformation to content applications </li></ul></ul>
  17. 17. Topics <ul><li>A brief history of databases </li></ul><ul><li>Search engines: filling the gap </li></ul><ul><li>Web 2.0 </li></ul><ul><li>Content applications </li></ul>
  18. 18. Content Applications <ul><li>Task aware </li></ul><ul><ul><li>Put content in context of what you’re trying to do </li></ul></ul><ul><li>Role aware </li></ul><ul><ul><li>And in the context of who you are and what you can see </li></ul></ul><ul><li>Are read/write </li></ul><ul><li>Have classical (e.g., workflow) and web 2.0 features </li></ul><ul><li>Seamlessly blend software and content to support that task </li></ul><ul><ul><li>Help a pathologist diagnose </li></ul></ul><ul><ul><li>Help a pilot land with one engine </li></ul></ul><ul><ul><li>Help a salesperson negotiate a contract </li></ul></ul><ul><ul><li>Help a soldier defuse an explosive device </li></ul></ul><ul><ul><li>Help a traveler assemble an integrated itinerary </li></ul></ul><ul><ul><li>Help a worker perform maintenance on a PET scanner </li></ul></ul><ul><ul><li>Help a researcher explore prior research </li></ul></ul>
  19. 19. Content Apps: Top-to-Bottom XML <ul><li>My browser speaks XML, and my content’s in XML </li></ul><ul><li>So why am I doing all this mapping between ... </li></ul><ul><ul><li>XML = hierarchies </li></ul></ul><ul><ul><li>Java = objects and classes </li></ul></ul><ul><ul><li>RDBMS = tables </li></ul></ul><ul><li>Answer: you don’t have to </li></ul><ul><ul><li>XQuery = misnamed and underpositioned </li></ul></ul><ul><ul><li>Not just a query language, a full application development language </li></ul></ul><ul><ul><li>Transcends data and content </li></ul></ul>
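To illustrate the "you don't have to" point on the slide above, here is a small Python sketch that queries an XML hierarchy directly, with no mapping to tables or custom classes. It uses the standard library's `xml.etree.ElementTree` and its limited XPath subset as a stand-in; it is not XQuery and not any XML server's API, and the sample document is made up.

```python
import xml.etree.ElementTree as ET

# A made-up document: the hierarchy is the data model.
doc = ET.fromstring("""
<book>
  <title>Content Applications</title>
  <chapter id="1"><title>History</title><para>Hierarchical, network, relational.</para></chapter>
  <chapter id="2"><title>Search</title><para>Filling the gap.</para></chapter>
</book>
""")

# Navigate the hierarchy directly: titles of all chapters.
chapter_titles = [c.findtext("title") for c in doc.findall("chapter")]

# Select by content: paragraphs of the chapter whose title is "Search".
search_paras = [p.text for p in doc.findall("chapter[title='Search']/para")]
```

A relational design would need at least chapter and paragraph tables plus a join to answer the second question; here the path expression follows the document's own structure.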
  20. 20. XML: Not Just Web Content, Office Content <ul><li>Real XML formats are coming to the enterprise / office document world </li></ul><ul><ul><li>Office Open XML (OOXML) </li></ul></ul><ul><ul><li>OpenDocument Format (ODF) </li></ul></ul><ul><li>Basic entity extraction through SmartTags </li></ul><ul><li>Quark and Adobe are increasingly XML oriented </li></ul><ul><li>In 3-5 years, you will be able not only to </li></ul><ul><ul><li>Get web content in an XML format </li></ul></ul><ul><ul><li>But also most enterprise content as well </li></ul></ul>
  21. 21. The Future Content Application Platform? <ul><li>Bolted-together search engines and RDBMS </li></ul><ul><ul><li>The usual choice today </li></ul></ul><ul><ul><li>Negatives: integration work; performance / can’t push processing close enough to the data; thick middle tiers </li></ul></ul><ul><li>Extended RDBMS </li></ul><ul><ul><li>Columns of type XML </li></ul></ul><ul><ul><li>Franken-queries in SQL/XQuery hybrids </li></ul></ul><ul><ul><li>Negatives: Will never be optimized for XML and content; impossibility of general purpose data processing and content optimization; limits on extendibility </li></ul></ul><ul><ul><li>Even COBOL wasn’t forever </li></ul></ul><ul><li>XML servers </li></ul><ul><ul><li>Native XML, native XQuery </li></ul></ul><ul><ul><li>Optimized for content </li></ul></ul>} Methinks these
  22. 22. The End of One Size Fits All <ul><li>One Size Fits All: An Idea Whose Time Has Come and Gone </li></ul><ul><li>“The last 25 years of commercial DBMS development can be summed up in a single phrase: “One size fits all”. This phrase refers to the fact that the traditional DBMS architecture (originally designed and optimized for business data processing) has been used to support many data-centric applications with widely varying characteristics and requirements. </li></ul><ul><li>In this paper, we argue that this concept is no longer applicable to the database market, and that the commercial world will fracture into a collection of independent database engines, some of which may be unified by a common front-end parser.” </li></ul><ul><li>Stonebraker and Cetintemel </li></ul>
  23. 23. The Rise of Special-Purpose DBMSs <ul><li>Streams </li></ul><ul><ul><li>Streambase, Skyler, Coral8 </li></ul></ul><ul><li>Huge memory stores </li></ul><ul><ul><li>TimesTen </li></ul></ul><ul><li>Hypercubes </li></ul><ul><ul><li>Essbase, Qlikview </li></ul></ul><ul><li>XML contentbases </li></ul><ul><ul><li>MarkLogic </li></ul></ul><ul><li>Column-orientation </li></ul><ul><ul><li>Vertica </li></ul></ul><ul><li>Data warehouses </li></ul><ul><ul><li>Greenplum, Hyperroll, Teradata </li></ul></ul><ul><li>XML data / messages </li></ul><ul><ul><li>Ipedo, Tamino </li></ul></ul><ul><li>RDF triples </li></ul>Things Codd wasn’t thinking about when he invented the relational model
  24. 24. Wall Street Journal 11/14/07, Excerpts <ul><li>Start-Ups Mine Database Field </li></ul><ul><li>Most databases are based on technology that originated 30 years ago. But change is in the air. </li></ul><ul><li>A mob of start-ups have been developing variants of the software, which provides the equivalent of filing cabinets for corporate information. Customers say the offerings are generating faster answers to questions that require sifting through huge volumes of business information. </li></ul><ul><li>Some predict specialized products will find a niche. "One kind of database is not going to suit all of the different applications we are coming up with," said Donald Feinberg, [head database] analyst at market researcher Gartner Inc. </li></ul>
  25. 25. Schema Regularity vs. Query Flexibility [Chart: axes Schema Regularity and Query Flexibility, each running Low to Hi; plotted: NDBMS, RDBMS, Enterprise Search Engine, XML Server]
  26. 26. Conclusions <ul><li>I’m a 25-year career database guy </li></ul><ul><ul><li>Content has always been viewed as “funny data” </li></ul></ul><ul><li>Relational databases have had a great run </li></ul><ul><ul><li>But only diamonds are forever </li></ul></ul><ul><li>The same attributes that drove RDBMS success will drive XML server success </li></ul><ul><ul><li>Database platform needed for a new generation of applications </li></ul></ul><ul><ul><li>Ability to answer any question </li></ul></ul><ul><li>Those applications must </li></ul><ul><ul><li>Handle varying (XML) schema-ed content </li></ul></ul><ul><ul><li>Support users in tasks </li></ul></ul><ul><ul><li>Deliver insight (content analytics), not just content </li></ul></ul><ul><ul><li>Enable web 2.0 features </li></ul></ul>