XML Amsterdam 2012 Keynote

Defining Modern XML that works with Bigdata

Published in: Technology, News & Politics

  1. 1. BigData and Modern XML • Jim Fuller, Senior Engineer, Europe • email: jim.fuller@marklogic.com • twitter: @xquery • 19/09/12
  2. 2. Senior engineer • http://jim.fuller.name • http://exslt.org • @xquery • XSLT UK 2001 • http://www.xmlprague.cz • @perl6 • Perlmonks Pilgrim
  3. 3. Kickoff • XML current status • Modern XML & BigData
  4. 4. ‘ontogeny recapitulates phylogeny’ or A (very) Brief History of ML • Late 1950s: Noam Chomsky ‘generative grammars’ • 1969: Charles Goldfarb (w/ Ed Mosher and Ray Lorie) created GML • 1986: SGML formalized • 1998: XML 1.0 W3C recommendation • 1998 – 2012: A lot of stuff happened • Future: XML 2.0 … microXML ?
  5. 5. RDBMS Goliath vs XML David• Back then, XML was the proto ‘nosql’• X in AJAX• Now many ‘davids’• AJAJ
  6. 6. Documents • Back then, it wasn’t unusual for a vendor to say ‘tough luck’ with your data (pay up) • Now, most office documents are in XML
  7. 7. The ‘long tail’ of XML Vocabularies • Back then, vocabularies built with proprietary approaches • Today, 1000’s of vocabularies based on XML – ‘2012 U.S. GAAP Taxonomy Adopted by SEC; FASB Webcast April 3’
  8. 8. Anyone heard of shipdex ?
  9. 9. Back then, XML/Markup Conferences• Software Development 99 East, November 8-13, 1999, Washington D.C.• XML One Fall 99, November 8-11, 1999, Santa Clara, CA• XML 99 December 6-9, 1999, Philadelphia PA• Markup Technologies 99 Conference December 5-9, 1999, Philadelphia• Web Design 2000, February 7-9, 2000, Atlanta• XTech 2000, February 27-March 2, San Jose• Software Development 2000 West, March 20-24, 2000, San Jose• Sixteenth International Unicode Conference, Boston, March 27-30, 2000• The Ninth International World Wide Web Conference, May 15-19, 2000, Amsterdam• DL 2000: Fifth ACM Conference on Digital Libraries, June 3-6 2000, Texas• XML Europe 2000, June 12-16, Paris• Web Design World 2000, July 17-21, 2000, Seattle, Washington• MetaStructures, August 14-16, 2000, Montreal, Quebec, Canada• XML Developers Conference, August 17-18, 2000, Montreal, Quebec• Internet World Expo, October 25-27, 2000, New York City• XML 2000/Markup Technologies 2000, December 3-7, Washington• ….. Even a Geek Cruises XML Excursion - January 2001
  10. 10. Today - XML/Markup Conferences• The XML ‘parallelogram’ – Balisage – XML Summer School – XML Prague – XML Amsterdam• Xtech*• markupForum• XATA• MarkLogic World (600 ppl)• databaseX (London November 2013 ?)
  11. 11. Other important good stuff• Evolution of the Operating System – Unix is the operating system for text – Windows tried to be the operating system for binaries, then adopted xml .. Mixed bag – Java (vm) has a strong xml stack• The web changed everything to text based markup.• cheap RAM/Disk/CPU• Virtualization = scale out
  12. 12. Other important good stuff • http://googleblog.blogspot.cz/2012/02/unicode-over-60-percent-of-web.html
  13. 13. Unfair to point out failure ?• Namespaces• XLINK• WS* astronautics• Draconian error checking• XML SCHEMA• XFORMS• XSLT 1.0 (or any xml) in the browser• XHTML vs HTML5• Too many specs (modularity good, complexity bad)
  14. 14. “Winning isn’t everything. There should be no conceit in victory and no despair in defeat.” - Matt Busby • 2001 I was the RDBMS serial killer – ‘kill RDBMS’ • Define successful ? – Adoption ? – Cheaper ? – Faster ? – Better ?
  15. 15. Drill Down distraction - Why is XQuery successful / productive ? • Choose my most successful (ad hoc stories, visible success) • Functional, dynamic … work with structure, text and values … stored proc + query lang • XPATH^ • Is it possible to qualify/quantify XQuery productivity ?
  16. 16. Programming Language Productivity • Data compiled from studies by Prechelt and Garret of a particular string processing problem - public domain 2006.
  17. 17. Programming Language Productivity • Data compiled from studies by Prechelt and Garret of a particular string processing problem - public domain 2006.
  18. 18. * 28msec – 2011 http://www.28msec.com/html/home • SimpleDB: Java 2905, XQuery 572 • S3: Java 8589, XQuery 1469 • SNS: Java 2309, XQuery 455 • Total: Java 13803, XQuery 2496
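The size of the gap on this slide is easier to see as a ratio. The following is a minimal sketch in Python, assuming the figures are line counts (the later "problem with loc" slide suggests this reading); the function and variable names are illustrative and not from the talk.

```python
# Figures copied from the 28msec slide: (Java, XQuery) per component.
LOC = {
    "SimpleDB": (2905, 572),
    "S3": (8589, 1469),
    "SNS": (2309, 455),
}


def loc_ratios(table):
    """Return {component: java / xquery}, plus a 'Total' entry over all rows."""
    ratios = {name: java / xquery for name, (java, xquery) in table.items()}
    total_java = sum(java for java, _ in table.values())
    total_xquery = sum(xquery for _, xquery in table.values())
    ratios["Total"] = total_java / total_xquery
    return ratios


if __name__ == "__main__":
    for name, ratio in loc_ratios(LOC).items():
        print(f"{name}: Java is about {ratio:.1f}x the size of the XQuery version")
```

The totals computed here match the last row of the slide (13803 and 2496), which is a useful check that the table was read correctly.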
  19. 19. Developing an Enterprise Web Application in XQuery - 2009, Martin Kaufmann, Donald Kossmann • Model: Java/J2EE 3100, XQuery 240 • View: Java/J2EE 4100, XQuery 1500 • Controller: Java/J2EE 900, XQuery 1180 • Total: Java/J2EE 8100 (?), XQuery 2920 (3490)
  20. 20. Nooooo! The problem with loc • Correlation of failure with very high loc is the only certain fact with loc • That’s about it
  21. 21. An empirical comparison of C, C++, Java, Perl, Python, Rexx, and Tcl for a search/string-processing program • Lutz Prechelt (prechelt@ira.uka.de), Fakultät für Informatik, Universität Karlsruhe • Language: #loc per Function Point • C: 91 • C++: 53 • Java: 54 • Perl: 21 • * Designing and writing programs using dynamic languages tended to take half as long as well as resulting in half the code.
  22. 22. Function Point Method • Nooooo! • #loc per FP = lines of code per Function Point
  23. 23. Project Uncertainty Principle * Dilbert Comic 2003 United Features Syndicate Inc
  24. 24. Reviewed 11 projects • FP Analysis • Calc FP inputs/outputs • Calc VAF (0.65 + [ sum(Ci) / 100 ]) • AVP = VAF * sum(FP) • #loc using cloc = #loc per FP • * FP overview - http://www.softwaremetrics.com/fpafund.htm
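The calculation on this slide can be sketched in a few lines of Python. This is a minimal sketch, assuming the usual function point convention that each Ci is a 0 to 5 rating of a general system characteristic (typically 14 of them) and that the slide's AVP is the adjusted function point count. The function names and the sample project are hypothetical; they are not taken from the 11 reviewed projects.

```python
def value_adjustment_factor(ratings):
    """VAF = 0.65 + sum(Ci) / 100, with each Ci rated from 0 to 5."""
    if any(not 0 <= c <= 5 for c in ratings):
        raise ValueError("each rating must be between 0 and 5")
    return 0.65 + sum(ratings) / 100


def adjusted_function_points(fp_counts, ratings):
    """Adjusted count = VAF * sum(FP), as written on the slide."""
    return value_adjustment_factor(ratings) * sum(fp_counts)


def loc_per_fp(loc, fp_counts, ratings):
    """Lines of code (for example as counted by cloc) per adjusted function point."""
    return loc / adjusted_function_points(fp_counts, ratings)


if __name__ == "__main__":
    # Hypothetical project, invented purely for illustration.
    fp_counts = [30, 25, 20, 15, 10]   # unadjusted FP per input/output group
    ratings = [3] * 14                 # 14 characteristics, all rated "average"
    loc = 3000                         # total lines of code
    print(f"VAF        = {value_adjustment_factor(ratings):.2f}")
    print(f"loc per FP = {loc_per_fp(loc, fp_counts, ratings):.1f}")
```

Because the ratings are bounded, the adjustment factor can only move the raw count between 0.65x and 1.35x (for 14 characteristics), so the loc per FP figure is dominated by how the inputs and outputs are counted.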
  25. 25. Language: #loc per Function Point • Perl: 21 • Eiffel: 21 • SQL: 13-30 • XQuery: 27-33 • Haskell: 38 • Erlang: 40 • Python: 42-47 • Java: 50-80 • Javascript: 50-55 • Scheme: 53 • C++: 59-80 • C: 128-140 • http://www.qsm.com/resources/function-point-languages-table
  26. 26. XQuery 2011 Survey
  27. 27. Preferred Programming Language 73% 55% 45% 32% 22%
  28. 28. Which data formats do you use the most ? 95% 40% 39% 32% 27% 18% 15%
  29. 29. Do you think XQuery makes you a more productive programmer ? 67% 14% 10% 8%
  30. 30. Is XQuery more productive than (with???) Java in developing web based data applications ? 58% 22% 12% 8%
  31. 31. Time to bust one myth • xml is too slow and bloated • http://www.navioo.com/ajax/ajax_json_xml_Benchmarking.php • In data-oriented AJAJ scenarios with JSON … the best that most benchmarks show today is 30% faster with less load (so more with fewer resources)
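One way to check the "too slow and bloated" claim against your own data is to round-trip the same records through both formats and measure. The sketch below uses only the Python standard library. The sample records, element names and function names are made up for illustration and come neither from the talk nor from the linked benchmark. Timings depend heavily on the parser and the payload, so no result is quoted here.

```python
import json
import timeit
import xml.etree.ElementTree as ET

# Hypothetical sample data: 1000 small records.
RECORDS = [{"id": str(i), "name": f"item-{i}"} for i in range(1000)]


def to_json(records):
    return json.dumps({"items": records})


def from_json(text):
    return json.loads(text)["items"]


def to_xml(records):
    root = ET.Element("items")
    for rec in records:
        item = ET.SubElement(root, "item", id=rec["id"])
        ET.SubElement(item, "name").text = rec["name"]
    return ET.tostring(root, encoding="unicode")


def from_xml(text):
    return [
        {"id": item.get("id"), "name": item.findtext("name")}
        for item in ET.fromstring(text).iter("item")
    ]


if __name__ == "__main__":
    as_json, as_xml = to_json(RECORDS), to_xml(RECORDS)
    print(f"JSON payload: {len(as_json)} chars, XML payload: {len(as_xml)} chars")
    # Parse each payload 50 times; compare the numbers on your own machine.
    print("JSON parse:", timeit.timeit(lambda: from_json(as_json), number=50))
    print("XML parse: ", timeit.timeit(lambda: from_xml(as_xml), number=50))
```

The XML encoding here (an attribute for the id, a child element for the name) is one choice among several; a different mapping changes both the payload size and the parse time, which is part of why single benchmark figures should be read with care.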
  32. 32. mongodb * http://www.linkedin.com/skills/skill
  33. 33. Javascript * http://www.linkedin.com/skills/skill
  34. 34. XQuery * http://www.linkedin.com/skills/skill
  35. 35. XSLT * http://www.linkedin.com/skills/skill
  36. 36. hadoop * http://www.linkedin.com/skills/skill
  37. 37. Java * http://www.linkedin.com/skills/skill
  38. 38. JSON * http://www.linkedin.com/skills/skill
  39. 39. XML * http://www.linkedin.com/skills/skill
  40. 40. Back When SQL Was Invented…
  41. 41. born in the 90’s
  42. 42. XML ?
  43. 43. Might even be
  44. 44. Channel effect of Aging in Technology • “Average age of @guardian Facebook audience is 29. Website is 37, print paper 44. Amazing channel effect, really. #newsrw” • Babyboomers, Gen X, Y and Z • I feel a bit uneasy framing generational arguments …
  45. 45. Death of the XML Child … Overachieving Child Prodigies grow up
  46. 46. Let’s not get distracted.
  47. 47. Don’t mention the war
  48. 48. XML Hard Core - XML Hype cycle (chart; points labelled 1998, 2002, 2006, 2007, 2009 and 2012, with ‘XML’s reported death ->’ marked)
  49. 49. REST of the World - XML Hype cycle (chart; points labelled 1998, 2002, 2006, 2009 and 2012, with ‘XML’s reported death ->’ marked)
  50. 50. hype cycle • * 2012 Gartner Hype Cycle http://www.gartner.com
  51. 51. 2001 Edd Dumbill – xml.com • ‘Stop the XML hype, I want to get off: As editor of XML.com, I welcome the massive success XML has had. But things prized by the XML community — openness and interoperability — are getting swallowed up in a blaze of marketing hype. Is this the price of success, or something we can avoid?’ Source: Edd Dumbill (March 2001)
  52. 52. 2012 Edd Dumbill g+ post • ‘For many years I was the editor of XML.com, and the chair of the XML Europe conference. Today, it seems that XML’s mission to be a web language is mostly dead. I’m not saying XML is useless: it has proved itself as a more easily-used SGML, but I’m not sure it’s expanded too far outside of that.’ Source: Edd Dumbill (March 2012)
  53. 53. Current Status: XML is dead• XML fought too many battles (RDBMS, NoSQL, web developers, HTML5)• Age channeling and Hype curve in effect• But XML technology stack is embracing JSON etc …• No room for sentimentality in technology
  54. 54. XML is dead boring
  55. 55. Halftime Break
  56. 56. Big Data & Modern XML
  57. 57. What’s the problem ?
  58. 58. Is XML Applicable to Big Data ?• We know it is, that’s why I am here• Some of you already know• Need to dig into the detail• But we first need to simplify things
  59. 59. http://kensall.com/big-picture/bigpix22.html
  60. 60. * http://gigaom.com/cloud/big-data-equals-big-opportunities-for-businesses-infographic/ BigData Opportunity
  61. 61. * http://gigaom.com/cloud/big-data-equals-big-opportunities-for-businesses-infographic/ BigData Opportunity
  62. 62. managing data variability, volume & velocity is hard • You need to be a (data) scientist to build this rocket ship.
  63. 63. So what’s the problem again ? • #1 – How to Apply Modern XML to your BigData problems ? • #1a – XML Milieu too complicated, need to identify what is successful as Modern XML • #1b – BigData is a huge opportunity • #1c – BigData has a huge learning curve and high risks
  64. 64. Solving #1 – Defining Modern XML• Identify the technologies• Identify and classify the Scenarios
  65. 65. Modern XML Technology analysis • Internal survey of ML Customer projects & External survey of projects (w/ pref towards Big/Complex projects) • Informal Survey (polldaddy) • Qualitative and quantitative
  66. 66. Eisenhower - "What is important is seldom urgent and what is urgent is seldom important," • Urgent + Important: Critical • Not urgent + Important: Goals • Urgent + Not important: Interruptions • Not urgent + Not important: Distractions
  67. 67. Survey Interpretations• XML 1.0, Namespaces is important now• XProc, XHTML important now• XSLT 2 and XQuery 1 very important now• XSLT 2 and XQuery 2 in the browser near future• XQuery 3.0 important near future• SAX/DOM now, XOM possible future• XML Schema 1.0 now, 1.1 for the near future• Schematron surprising• Semweb is for the future• SVG and MathML due to web browser support• XML vocabulary has a very ‘long tail’
  68. 68. Modern XML Technology Candidates • Core: XML 1.0, Namespaces • Other: (none listed) • Transform: XSLT 2.0, XQuery 1.0 • Processing: SAX, DOM • Schema: Schematron, XML Schema 1.0 • Semantics: RDF, OWL • Vocabularies: Office Doc ML, SVG • Legend: These technologies trended highly across all analysis; Bold – could be trending due to browser impl/historical dep
  69. 69. Modern XML Tier 1 • Core: XML 1.0, Namespaces • Other: XProc • Transform: XSLT 2.0 / 3.0 / browser, XQuery 1.0 / 3.0 • Processing: SAX, DOM • Schema: Schematron, XML Schema 1.0 / 1.1 • Semantics: RDF, OWL • Vocabularies: Office Doc ML, SVG • Legend: These technologies trended highly across all analysis; Bold – could be trending due to browser impl/historical dep; Italic – strong signal, early usage, interest of unproven spec/tech
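  70. 70. Modern XML Tier 1 and Modern XML Tier 2 • Core: XML 1.0, Namespaces (Tier 1); XML Canonicalization, xml:id (Tier 2) • Other: XProc (Tier 1); XHTML* (Tier 2) • Transform: XSLT 2.0 / 3.0 / browser, XQuery 1.0 / 3.0 (Tier 1); XSLT 1.0 (Tier 2) • Processing: SAX, DOM (Tier 1); XOM, STAX (Tier 2) • Schema: Schematron, XML Schema 1.0 / 1.1 (Tier 1); RELAX-NG (Tier 2) • Semantics: RDF, OWL (Tier 1); SPARQL (Tier 2) • Vocabularies: Office Doc ML, SVG (Tier 1); MathML, Docbook, SOAP*, DITA, EPUB (Tier 2)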
  71. 71. Modern XML Tier 1 and Modern XML Tier 2 • Core: XML 1.0, Namespaces (Tier 1); XML Canonicalization, xml:id, XML infoset (Tier 2) • Other: XProc (Tier 1); XHTML* (Tier 2) • Transform: XSLT 2.0 / 3.0 / browser, XQuery 1.0 / 3.0 (Tier 1); XSLT 1.0 (Tier 2) • Processing: SAX, DOM (Tier 1); XOM, STAX (Tier 2) • Schema: Schematron, XML Schema 1.0 / 1.1 (Tier 1); RELAX-NG (Tier 2) • Semantics: RDF, OWL (Tier 1); SPARQL (Tier 2) • Vocabularies: Office Doc ML, SVG (Tier 1); MathML, Docbook, SOAP, DITA, EPUB (Tier 2) • Data Formats: XML, text, binary, JSON
  72. 72. The technology triggers • XML Database – reduce the complexity/risk of BigData – MarkLogic – eXist – Zorba – Sedna – BaseX – Others (Oracle!) • XQuery - Rapid prototyping • Avoid purist architectures, embrace heterogeneity
  73. 73. Modern XML / BigData Scenarios • Classic Scenarios – Document (xml) Database – Aggregation – Enterprise Search – Heterogeneous Content store – Publishing • BigData Scenarios – BigData ‘classic’ – Extreme personalisation – Predictive analytics – Financial analysis – Realtime analysis (management/financial) – Actionable intelligence • Semantic Web – too early to categorize but it’s for real
  74. 74. Solving Problem #2 – Focus on the Practicalities • What type of Big Data problem do you have ? – The urgent, important ones you know about – The urgent, important ones you don’t know about • Create a dedicated team (analytics, problem domain experts) to identify the latter • Assess data maturity (Data Audit) • With power comes responsibility … Ethical Analytics
  75. 75. BigData Tech Advice• Start using an XML database asap!• Don’t get distracted by the zoo … start hadooping right away• ‘Data outlives code’, spend more time on the data, clean abstractions, cogent, opening it up
  76. 76. Size appropriately • Volume – will be relative to your current capability, if the requirement is a magnitude greater past current infrastructure scaling • Velocity – Updates versus reads ? High volatility with realtime queries ? • Variety – managing versioning ? • Complexity – multiples, complex processes
  77. 77. Size Appropriately: Are you a ‘Facebook’ (Google, Yahoo…) ? • 2.5 billion content items shared per day (status updates + wall posts + photos + videos + comments) • 2.7 billion Likes per day • 300 million photos uploaded per day • 100+ petabytes of disk space in one of FB’s largest Hadoop (HDFS) clusters • 105 terabytes of data scanned via Hive, Facebook’s Hadoop query language, every 30 minutes • 70,000 queries executed on these databases per day • 500+ terabytes of new data ingested into the databases every day • Are you planning to scale out to ~180,900 servers ? • ~18000 database servers ingesting 500+ terabytes of data through a guestimated 50+ billion calls …. A day! http://www.datacenterknowledge.com/the-facebook-data-center-faq/
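To compare these headline numbers with your own workload, it helps to reduce them to per-server and per-second rates. The sketch below is plain arithmetic in Python on the figures quoted on the slide, treating the "+" values as lower bounds and using decimal units (1 TB = 1000 GB). The function names are illustrative only.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def per_server(total_per_day, servers):
    """Average daily load carried by each server."""
    return total_per_day / servers


def per_second(total_per_day):
    """Average rate per second over a full day."""
    return total_per_day / SECONDS_PER_DAY


if __name__ == "__main__":
    # Figures as quoted on the slide.
    ingest_tb_per_day = 500
    database_servers = 18_000
    calls_per_day = 50e9

    gb_per_server = per_server(ingest_tb_per_day * 1000, database_servers)
    print(f"Ingest per database server: {gb_per_server:.1f} GB/day")
    print(f"Calls per second overall:   {per_second(calls_per_day):,.0f}")
    print(f"Calls per server per sec:   "
          f"{per_second(per_server(calls_per_day, database_servers)):.1f}")
```

Substituting your own daily ingest, server count and call volume gives a quick sense of whether you are within one order of magnitude of this scale or several below it.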
  78. 78. Solving Problem #3 – Understanding the risks • Biggest mistakes seen with BigData adoption • ‘data scientists themselves don’t have much of intuition either…and that is a problem. I saw an estimate recently that said 70 to 80 percent of the results that are found in the machine learning literature, which is a key Big Data scientific field, are probably wrong because the researchers didn’t understand that they were overfitting the data’. – Alex Pentland, MIT’s “Big Data guy”
  79. 79. Summary • We reviewed some aspects of XML current status in the dataverse • Identified a subset of the XML Milieu – calling it Modern XML • Identified the scenarios where Modern XML is being brought to bear with Bigdata • Reviewed common mistakes and Risks with BigData
  80. 80. Final Thesis• Modern XML provides great foundation today – Great for ‘classic’ scenarios – Great technical positioning for addressing challenges of BigData – Great technical positioning for semweb• Adopting an XML database mitigates risk• Knowing Bigdata/Modern XML scenarios helps us mitigate risks• There is a big prize if you get BigData right
  81. 81. Avoid stereotypes • I’m a RDBMS • I’m a Protocol Buffer • I’m a Json • I’m an XML
  82. 82. Jeni Tennison XML Prague 2012 talk JSON XML RDF HTML
  83. 83. Be wary of Paradigm Shifts • RedMonk - Language divergence • Andreessen - Software is eating the world • 128bit and beyond current von Neumann/Harvard arch ? • Power Wall (at server farms/mobile devices) • The web revolution is not done yet (http://www.firebase.com/index.html)
  84. 84. Embrace change
  85. 85. ‘Form is temporary. Class is permanent’ • XML is emerging from its ‘Trough of disillusionment’, because it’s useful, productive and reacting to new requirements. • Modern XML is successful on many different measures, mature and dead boring • Modern XML can help solve your BigData problems
  86. 86. Pull the Technology Trigger – Try an XML Database Today!• MarkLogic 6 – Web dev ‘surface area’, work with JSON – REST API – Java API – Work across different data• Zorba• eXist• BaseX• Sedna
