XML Amsterdam 2012 Keynote



Defining Modern XML that works with Bigdata

Published in: Technology, News & Politics
  • First encounter with BigData – mapmaking (Gravity map of Rhode Island) late 1980’s – geophysics generates a lot of data points
  • Apologies for the gratuitous football analogies … it was either that or Jaws
  • Chomsky proposed the notion of grammar to capture the structural constraints of a particular language. A grammar is described as a set of production rules. Depending on the kind of rules one is allowed to write, Chomsky distinguished four types of grammars of decreasing complexity, from type 0 (unconstrained) to type 3 (regular grammar). While type 0 and type 1 grammars need a full-fledged Turing machine to be checked, type 2 or context free grammars (CFG) only need a stack machine, and type 3 or regular grammars only need a finite state automaton. The last two are interesting from a computer science perspective, as they require less complex algorithms.
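    The difference between the last two grammar types can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the tag-name rule and the tiny tag tokenizer are hypothetical and far looser than real XML. Checking a single name needs no memory beyond the current state (a regular language), while checking that tags nest properly needs a stack (a context-free language).

```python
import re

# Type 3 (regular): recognisable by a finite-state automaton.
# Hypothetical, simplified tag-name rule: a letter, then letters or digits.
TAG_NAME = re.compile(r"[A-Za-z][A-Za-z0-9]*")

# Simplified tokenizer for <name> and </name>; attributes are not handled.
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)>")


def is_tag_name(text: str) -> bool:
    """Finite-state check: no memory beyond the current state is needed."""
    return TAG_NAME.fullmatch(text) is not None


def is_balanced(markup: str) -> bool:
    """Type 2 (context-free) check: nesting requires a stack."""
    stack = []
    for closing, name in TAG.findall(markup):
        if not closing:
            stack.append(name)      # opening tag: remember it
        elif not stack or stack.pop() != name:
            return False            # closing tag does not match the last open one
    return not stack                # every opened tag must have been closed
```

    For example, `is_balanced("<a><b></b></a>")` holds, while `is_balanced("<a><b></a></b>")` does not, and no fixed number of states could tell the two apart for arbitrary nesting depth.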
  • Binaries replaced in most office programs.
    Avg 200 Word docs on each PC (comScore tech matrix study 2008); 1 billion * 100 billion XML files latently living on PC users' hard drives.
    Gartner study 2010 – as little as a few hundred billion XML-based MS Word docs on the web.
    What's in email, SharePoint, websites?
    These are all lowball figures … not including open source file formats, or ebooks.
    80% of all companies use some form of Office (a few years ago MS quoted that there were a billion instances of Office worldwide), with nearly half of these being versions that generate XML by default … that's a lot of XML.
    Australia: Australia's Department of Finance has released a desktop policy that required all agencies to adopt Office Open XML as the standard document format.[37]
    Belgium: Belgium's Federal Public Service for Information and Communication Technology in 2006 was evaluating the adoption of the Office Open XML format. It already then confirmed that it would consider all ISO standards to be open standards, mentioning Office Open XML as such a possible future ISO standard.[38]
    Denmark: In June 2007, the Danish Ministry of Science, Technology and Innovation recommended that beginning with January 1, 2008 public authorities must support at least one of the two word processing document formats Office Open XML or Open Document Format in all new IT solutions, where appropriate.[39]
    Germany: In Germany the Office Open XML standard is currently under observation by the Federal Commissioner for Information Technology ("Die Beauftragte der Bundesregierung für Informationstechnik"). The latest release of "SAGA" (Standards and Architectures for E-Government-Applications) includes Office Open XML file formats in both its strict and transitional variant. The ISO/IEC 29500 standard may be used to exchange complex documents when further processing is required.[40]
    Japan: On June 29, 2007, the government of Japan published a new interoperability framework which gives preference to the procurement of products that follow open standards.[41][42] On July 2 the government declared that, in their view, formats like Office Open XML, which organizations such as Ecma International and ISO had also approved, were an open standard.[43] They also said that whether the format is open is one of the preferences in choosing which software the government shall deploy.
    Lithuania: The Lithuanian Standards Board has adopted the ISO/IEC 29500:2008 Office Open XML format standard as the Lithuanian national standard. The decision was made by Technical Committee 4 Information Technology on March 5, 2009. The proposal to adopt the Office Open XML format standard was submitted by the Lithuanian Archives Department of the Government of the Republic of Lithuania.[44]
    Norway: Norway's Ministry of Government Administration and Reform is evaluating the adoption of the Office Open XML format. The ministry put the document standard under observation in December 2007.[45]
    Sweden: The Kingdom of Sweden has adopted Office Open XML as a 4 part Swedish National Standard SS-ISO/IEC 29500:2009.[46][47][48][49]
    Switzerland: In July 2007, the Swiss Federal Council announced that adherence to the SAGA.ch e-Government standards is mandatory for its departments as well as for cantons, cities and municipalities. The latest version of SAGA.ch includes Office Open XML file formats.[50]
    United Kingdom: The UK has put out an action plan for use of open standards, which includes ISO/IEC 29500 as one of several formats to be supported.[51][52]
    United States of America: On April 15, 2009, the ANSI-accredited INCITS organisation voted to adopt ISO/IEC 29500:2008 as an American National Standard.[53] The state of Massachusetts has been examining its options for implementing XML-based document processing. In early 2005, Eric Kriss, Secretary of Administration and Finance in Massachusetts, was the first government official in the United States to publicly connect open formats to a public policy purpose: "It is an overriding imperative of the American democratic system that we cannot have our public documents locked up in some kind of proprietary format, perhaps unreadable in the future, or subject to a proprietary system license that restricts access".[54] Since 2007 Massachusetts has classified Office Open XML as "Open Format" and has amended its approved technical standards list, the Enterprise Technical Reference Model (ETRM), to include Office Open XML. Massachusetts, under heavy pressure from some vendors, now formally endorses Office Open XML formats for its public records.[55]
  • http://en.wikipedia.org/wiki/List_of_XML_markup_languages
    http://www.service-architecture.com/xml/articles/specific_xml_vocabularies.html
    http://www.iso20022.org/the_iso20022_standard.page
    http://www.pcmag.com/encyclopedia_term/0,1237,t=XML+vocabulary&i=55060,00.asp
    https://www.oasis-open.org/standards#ublv2.0
    NIEM: http://www.ibm.com/developerworks/xml/library/x-NIEM1/index.html
    PMML: http://en.wikipedia.org/wiki/Predictive_Model_Markup_Language
  • The Ninth International World Wide Web Conference, May 15-19, 2000, Amsterdam had an XML Track: http://www9.org/ and http://www9.org/w9-devxml.html
  • Smaller. More focused. There are also conferences on vocabularies, but they are less about XML and more about the problem domain itself.
  • C/C++ are the languages for binaries. Java heavily adopted XML and is good at text/binaries. With HTML being the single preferred markup language.
  • HTML5 + JavaScript kills Flash. It remains to be seen what will kill PDFs. Virtualisation. Cheaper hardware/software.
  • Instead of focusing on the negatives we know about, I thought I would spend some time being more precise on the positives
  • Its ML special sauce
  • Searched around in the literature of how to measure a programming language’s productivity
  • Amazon client libraries written in XQuery have 80% less code than their equivalent written in Java.
  • Useful study on implementing an entire enterprise web application. Dave Thomas mentioned that a program getting bigger is the single worst thing. A long paper trail of software engineering studies has shown that many internal code metrics (such as methods per class, depth of inheritance tree, coupling among classes etc.) are correlated with external attributes, the most important of which is bugs. What the authors of this paper show is that when they introduce a second variable, namely the total size of the program, into the statistical analysis and control for it, the correlation between all these code metrics and bugs disappears. Furthermore, this relates to larger development teams, who by dint of their size generate large LOC, e.g. the failure rate of projects with over 300-400 developers working on them skyrockets.
  • Probably not to do with loc itself, but with the fact that larger programs usually have more features to fail!
  • The following study (related to the previous study) measured the avg number of lines of code needed to implement a single function point. Designing and writing programs using dynamic languages tends to take half as long, resulting in half the code. More code = more bugs; studies have shown a direct relationship between failure and high LOC.
  • Settled on #loc per function point. LOC: lines of code. Function points: a method of decomposing a project's requirements in the hope of being able to estimate the effort needed to do the project.
  • Before you start throwing stuff at me for mentioning LOC and FP: I do not subscribe to using LOC and FP for project estimation … though clearly there is a lot of historical analysis which I will leverage.
  • Projects have been anonymised to protect the innocent (my colleagues, clients, etc) … disclaimer: I did 4 of these XQuery projects.
    Tried to reduce the mixed language effect … but because of XQuery's ‘dsl’ness for things like data apps, no problems.
    FP range between ~250-1200.
    Took me 4 days.
    VAF in actuality remained close to 1.0.
    Methodology:
      analyzed 11 reasonably sized projects (4 were done by me)
      cloc defined lines of code
      based on the user point of view, I defined FP and summed them
      defined VAF for each project
      VAF = (TDI*0.01) + 0.65
      AVP = VAF * sum of FP
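    The methodology above can be written down as a short Python sketch. The two formulas are the ones given in these notes; everything else is an assumption for illustration: the function names are made up, the per-function FP counts, the TDI value and the line count are invented numbers (not data from the 11 projects), and the final division of #loc by the adjusted figure is my reading of "#loc per function point".

```python
def value_adjustment_factor(tdi: float) -> float:
    """VAF = (TDI * 0.01) + 0.65, as given in the notes."""
    return tdi * 0.01 + 0.65


def adjusted_function_points(fp_counts: list[float], tdi: float) -> float:
    """The notes' 'AVP': VAF multiplied by the summed function points."""
    return value_adjustment_factor(tdi) * sum(fp_counts)


def loc_per_function_point(loc: int, fp_counts: list[float], tdi: float) -> float:
    """Assumed final step: lines of code (e.g. counted with cloc) per adjusted FP."""
    return loc / adjusted_function_points(fp_counts, tdi)


# Hypothetical project, NOT one of the 11 analysed projects.
example_fp_counts = [120, 180, 200]   # invented per-feature function points
example_tdi = 35                      # invented; gives a VAF of about 1.0
example_loc = 15000                   # invented line count

example_ratio = loc_per_function_point(example_loc, example_fp_counts, example_tdi)
```

    With these invented inputs the ratio comes out at roughly 30 lines per function point; the numbers were chosen only to make the arithmetic easy to follow.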
  • Close to SQL. XQuery is a query language and a ‘good enough’ stored proc language for working with XML. Seems to match up that it's twice as productive as Java on paper. The VAF modifier ranges between .6 - 1.3 … in actuality for most of the projects it was very close to 1.0 (confirms its usage across the industry). Largest: ~15000 loc. Quite surprised by the results … they seem to confirm what people are feeling, that XQuery does the job with less code. Would need to analyze a lot more projects … probably not enough XQuery projects in existence to match the function point historical data tables for other languages.
    Threats to validity: low sample size; inaccurate FP analysis; selection bias; mixed language effect.
    Job survey demonstrates that XQuery jobs are in demand … an indirect measure that shows there is something cooking with XQuery.
    Ad hoc survey shows that a significant % of XQuery developers think they are more productive when using XQuery … specifically when programming with XQuery and Java, C++, and JS, working with XML, text, RDBMS and JSON.
    #loc/FP analysis confirms that XQuery is about as productive as SQL but has a much larger applicability … the ad hoc survey seems to indicate that this is significantly leveraged when XQuery is used in conjunction with an XML datastore.
    Findings: XQuery is a DSL, though expansive; not yet a GPL and it's unclear if it should be. Needs better docs, tooling, libraries. Is good because of fp. Is bad because of fp. Very good with XML. XQuery's most suitable purpose is in making semi-structured (i.e. XML) information repositories accessible, scrutable, and tractable. XSLT is complementary by generating the view. XRX is productive. Productive used in conjunction with Java.
  • Ran from Sept 20 – Oct 1st. 102 people responded. 15,000,000 programmers worldwide (wikipedia). ~100 people, 95% certain, confidence interval: +- 9.8% error.
    United States 43%, United Kingdom 15%, Germany 10%, France 8%, Czech Republic 3%, Netherlands 2%, Switzerland 2%.
    50% of people put their name to the poll.
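    The ±9.8% figure can be reproduced with the textbook margin-of-error formula for a proportion. The sketch below is an assumption about how the number was obtained, not something the notes state: it assumes a simple random sample, the worst-case proportion p = 0.5, and z = 1.96 for 95% confidence. The function name and the optional finite population correction are illustrative.

```python
import math


def margin_of_error(sample_size, z=1.96, p=0.5, population=None):
    """Margin of error for a proportion, assuming a simple random sample.

    z=1.96 corresponds to 95% confidence; p=0.5 is the worst case.
    If a population size is given, the finite population correction is applied.
    """
    standard_error = math.sqrt(p * (1 - p) / sample_size)
    if population is not None:
        standard_error *= math.sqrt((population - sample_size) / (population - 1))
    return z * standard_error


approx_100 = margin_of_error(100)                           # the "~100 people" case
all_102 = margin_of_error(102)                              # all 102 respondents
with_population = margin_of_error(102, population=15_000_000)
```

    For about 100 respondents this gives roughly 0.098, i.e. the ±9.8% quoted; for all 102 it is slightly smaller, and the 15,000,000 population figure makes almost no difference. A self-selected poll is not a random sample, so the figure is best read as a rough indication.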
  • This survey targeted developers who used XQuery. Strong correlation between usage of XQuery and Java and XSLT. Multiple choice (language, responses, share of responses):
      XQuery 73 22%
      Java 55 17%
      XSLT 45 14%
      Javascript 32 10%
      C++ 22 7%
      python 18 5%
      C++ 14 4%
      Perl 12 4%
      C# 12 4%
      ruby 10 3%
      php 10 3%
      Haskell 9 3%
      Scala 9 3%
      Lisp 7 2%
      Erlang 1 <1%
  • Strong correlation between XML and usage of text, RDBMS and JSON. Multiple choice (format, responses, share of responses):
      XML 95 36%
      Text 40 15%
      RDBMS 39 15%
      JSON 32 12%
      Binaries (images, video, etc) 27 10%
      Office documents 18 7%
      Semantic web stuff (RDF, owl, etc..) 15 6%
  • Single option: Yes 67 (67%), Maybe 14 (14%), No 10 (10%), Don’t know 8 (8%)
  • Ok, we’ve drilled down into Xquery … we don’t have time to drill down into every technology we deem productive … but clearly there is something to this xml stack that is real
  • http://www.navioo.com/ajax/ajax_json_xml_Benchmarking.php
    Claiming 2 to 10 times faster. Now little difference: http://www.navioo.com/ajax/examples/json/test.php
    Optimisations in the browser have helped both.
    In programming language. Evidence.
    With IE8, CSS2 started getting its act together (nightmares of IE6 fading in the distance) … earlier XSLT 1.0 looked promising, CSS3 even more promising.
    Safari/Chrome/Opera.
    Data vs document orientated … clearly only some scenarios.
    XML is too slow or bloated.
    XML is not HTML … and the whole XHTML.
    Forced XML processing with XSLT 1.0 in the browser.
    Dynamic dispatch and fp = big learning curve for most web developers.
    Tooling and browsers misinterpreted draconian well-formedness.
  • Hstore for PostgreSQL is a key value store with ACID. Dropping ACID.
  • If we told people that Goldfarb’s GML was born in the 60’s … which begot SGML hence XML it
  • Jeni Tennison evoked wonderful imagery at her XML Prague 2012 keynote
  • Though sometimes it's hard not to fight a war, when encountering people with well-meaning sentiments
  • We fought many wars: RDBMS; web browsers (browser ppl won, html5 is markup); interchange (JSON won). We are in a ‘don’t mention the war’ period. Not necessarily isolationist … the modern XML technology stack (as we will identify later) is very active in embracing JSON. Web people think textual markup is dead whilst using it? Strange irony to that, but they are just emerging from the trough of disillusionment. XML folks are embracing how to integrate with JSON … web dev ppl don’t want to know about it. Lost the war with RDBMS. Lost the war for the browser. Lost the war for interchange.
  • 2002 - lots of books, lots of adoption, lots of hype. 2006 - in December 2005, Yahoo! began offering some of its web services in JSON, and Google starts providing JSON to GData. XML’s perception tainted by financial crisis (lots of content providers going out of business). Yet XML Prague attendance doubled (and sold out between 2009-2011). Bigdata and semantic web showing that we need more.
  • WS* astronautics were shooting XML into orbit. Heavy on the Enterprise. Investment by browsers, Sun, Microsoft etc. The XML hype cycle was several years in the making (we are now on the slope of enlightenment = modern XML). Switching from relational to hierarchical (text, structure (mixed content), values = semistructured data). Though I find it a bit unfair … bigdata is mentioned on this list as if it was a ‘thing’ but it’s an underpinning. The map/reduce hype cycle? Disagree with some things. HTML5 is probably just about starting down the trough of disillusionment … Ian Hickson / Anne … HTML is looking like PDF these days (html5+js+css3) … it's great progress but not on things I consider important. http://www.itworld.com/it-managementstrategy/293397/gartner-dead-wrong-about-big-data-hype-cycle argues that the hype cycle is wrong because BigData has ‘real’ benefits … he is missing the point.
  • Don’t get upset if your pet technology goes in and out of fashion … expect this to happen a few times in your career. Sentimentality – that’s like saying you should start using goto statements because you miss them … XML needs to have a meaning, a use, a valid domain to be applied to.
  • We’ve talked about where XML has been and where it is today, as well as updated some of the older perma topics. But I mainly wanted to talk to you about XML’s place in the dataverse … as it relates to BigData.
  • Is the problem that XML is dead or XML's time is up? Not really … because XML is everywhere … it's not going anywhere soon. It's everywhere in a way that JSON will never be … which is one of the reasons for JSON's success/uptake. The problem is not XML vs JSON; we’ve been over that debate and I think everyone here can see the benefits of each data format.
  • http://kensall.com/big-picture/bigpix22.html
  • I said the NoSQL word, now I will say the other word, i.e. BigData … Curt Monash, well known db analyst, calls this polystructured … many call it unstructured, but even text data will have some structure. You have probably all heard that 85% of data goes unused. Just a 10% increase in using a company's existing data can result in giant gains.
  • Show how this relates to specific industry sectors …
  • The three v’s of data are hard to manage. When I first saw this graphic I thought it was a pair of programmers (mostly because the guys look kind of like Larry Wall), but I think these guys are business guys, and it occurred to me that we are in a strange place now where business folk are making commercial decisions based on algorithms … algorithms are absolutely crucial to our craft, but it trivializes the solution … like saying we will use hammers to build a house; of course we will use hammers. Developers need to balance their desire to learn algorithms with the reality of getting stuff done.
  • http://jimfuller2011.polldaddy.com/surveys/1906925/report/locations
    Caveat – we are talking about solutions with databases!
  • If things are in the urgent/important cell, that’s what you work on first; try to push everything into the ‘important, not urgent’ category. Never read ‘The 7 Habits of Highly Effective People’.
  • http://jimfuller2011.polldaddy.com/surveys/1906925/report
  • Items in bold are almost certainly skewed by either a large historical dependency and/or what browsers now implement. Items in italic/underlined are either in the recommendation stage or were just ‘on the line’ in terms of ranking data.
  • This is a much better subset. Items in bold are skewed by either a large historical dependency and/or browser support.
  • Data maturity
    Stage one – ‘no usable data’.
    Stage two, ‘too much data’, isn’t much better though. When you are swamped with data it will take up too much of your time to sort through it and the chances are that you will end up with many, if not most, of your insights being unrelated to your core business strategy. Before you know it you’re running around in woods that are heavily dense with trees and inhabited by wild geese.
    Stage three, ‘the right data’, is better, as you may well assume. With the ‘right’ data you can get the insights that support your primary business focus, ensuring that you have as much information to facilitate success in your chosen field as possible.
    The ‘predictive’, stage four, is one that many consider to be the optimum stage. This is where you make the transition from reactive to proactive. When you reach the predictive stage you can start to understand how certain influences in the future will affect your business and plan accordingly. A slightly banal yet illustrative example is to calculate what the expected peaks in website visitors will be following an advertising campaign so that enough bandwidth can be employed to cope. Something more complex might involve a simulation of market patterns and supply chain effects should a large scale natural disaster occur.
    The final stage, ‘strategic’, is the most data intensive.
  • So far most of the scenarios I showed are BigData … or at a minimum represent maximums for their industry sector.
    No feasibility study – initial sanity check of whether what you want to do is possible.
    No organized selection process – self selection means no support/buy-in at the various levels needed … FOSS selects itself!
    No proof of concept.
    Premature project initiation, before data is ready.
    Overfitting: in statistics and machine learning, overfitting occurs when a statistical model describes random error or noise instead of the underlying relationship. Overfitting generally occurs when a model is excessively complex, such as having too many parameters relative to the number of observations. A model which has been overfit will generally have poor predictive performance, as it can exaggerate minor fluctuations in the data.
    Seems like common sense, but our most successful clients avoided most of these mistakes, which reduced risk immeasurably.
    Shoot for the stars, but you don’t really want to build a rocketship.
    FOSS can be an important onramp to BigData, but eventually you will want to be able to create commercial partnerships.
    PoC is a scaled down version; the goal is to identify gaps in your skillset … you should help build the PoC.
    Starting a project early is common in enterprise; it’s a mistake.
    Vendors want to sell their software … resist the urge to let them do the work, and drill down into the detail with your own problem domain experts.
  • Marc Andreessen, 'software is eating the world': http://online.wsj.com/article/SB10001424053111903480904576512250915629460.html
    RedMonk reports that programming languages have never been as diverse as today.
    RedMonk Tier 1 Languages (02/12): 1 C, 2 C#, 3 C++, 4 Java, 5 JavaScript, 6 Objective-C, 7 PERL, 8 PHP, 9 Python, 10 Ruby, 11 Shellscript. Source: RedMonk.
    Tier 2 Languages (02/12): 1 ASP, 2 ActionScript, 3 Assembly, 4 Clojure, 5 CoffeeScript, 6 ColdFusion, 7 CommonLisp, 8 D, 9 Delphi, 10 EmacsLisp.
    We've been living in a fairly stable hardware bubble for 30 years, e.g. same techniques yet smaller, faster.
    The power wall: About five years ago, however, the top speed for most microprocessors peaked when their clocks hit about 3 gigahertz. The problem is not that the individual transistors themselves can't be pushed to run faster; they can. But doing so for the many millions of them found on a typical microprocessor would require that chip to dissipate impractical amounts of heat. Computer engineers call this the power wall. Given that obstacle, it's clear that all kinds of computers, including supercomputers, are not going to advance at nearly the rates they have in the past.
    Advances:
      Tissue engineering
      Terascale neuromorphic chips (memristor synapses, nanostore memory (logic and memory together))
      Many billions and probably trillions of electronic tattoos (less than a penny each in most cases) with processing, sensors, memory, wireless
      2000 qubit adiabatic quantum computers
      The human brain project (if funded would be done and if not there are other DARPA and Asian projects of comparable scale)
      Memristors at exascale (supercomputer class), petascale for very affordable systems
      Sensors even more capable
      Electronic tattoos even cheaper and more capable
      Deep robotics commercialization adoption
      Beamed power and persistent UAVs
      Megascale or gigascale adiabatic quantum computers
    Hardware:
      Optical computing - trapping, storing and manipulating light is difficult.
      Quantum computing
      Neuronal computing
      DNA computing
      Reversible computing - Normally every computational operation that involves losing a bit of information also discards the energy used to represent it. Reversible computing aims to recover and reuse this energy.
      Billiard Ball computing - involves chain reactions of electrons passing from molecule to molecule inside a circuit.
      Magnetic (NMR) computing - Every glass of water contains a computer, if you just know how to operate it.
      Glooper Computer - One of the weirdest computers ever built forsakes traditional hardware in favour of "gloopware". Andrew Adamatzky at the University of the West of England, UK, can make interfering waves of propagating ions in a chemical goo behave like logic gates, the building blocks of computers.
      Mouldy computers
      Water wave computing - Perhaps the most unlikely place to see computing power is in the ripples in a tank of water. Using a ripple tank and an overhead camera, Chrisantha Fernando and Sampsa Sojakka at the University of Sussex used wave patterns to make a type of logic gate called an "exclusive OR gate", or XOR gate.
  • Remember write once, run everywhere. Programming for the browser. Well, things change. Java was originally designed for interactive television. "Write Once, Run Anywhere" (WORA). Java Applets. Be skeptical of purity. XML is for data.
  • When extensibility is not required, XML will always lose against DSLs. Diversity makes strong ecosystems. As I’ve shown you, Modern XML is being applied to BigData problems today. It provides: a stable, fast and mature toolset of technologies to work with textual markup, text and in many cases lots of different kinds of data; a foundation for semweb; and it can be applied to a wide range of BigData scenarios.

    1. 1. BigData and Modern XML
        Jim Fuller, Senior Engineer, Europe
        email: jim.fuller@marklogic.com, twitter: @xquery
        19/09/12
    2. 2. Senior engineer
        http://jim.fuller.name
        http://exslt.org
        @xquery
        XSLT UK 2001
        http://www.xmlprague.cz
        @perl6
        Perlmonks Pilgrim
    3. 3. Kickoff
        XML current status
        Modern XML & BigData
    4. 4. ‘ontogeny recapitulates phylogeny’ or A (very) Brief History of ML
        • Late 1950s: Noam Chomsky ‘generative grammars’
        • 1969: Charles Goldfarb (w/ Ed Mosher and Ray Lorie) created GML
        • 1986: SGML formalized
        • 1998: XML 1.0 W3C recommendation
        • 1998 – 2012: A lot of stuff happened
        • Future: XML 2.0 … microXML ?
    5. 5. RDBMS Goliath vs XML David
        • Back then, XML was the proto ‘nosql’
        • X in AJAX
        • Now many ‘davids’
        • AJAJ
    6. 6. Documents
        • Back then, it wasn’t unusual for a vendor to say ‘tough luck’ with your data (pay up)
        • Now, most office documents are in XML
    7. 7. The ‘long tail’ of XML Vocabularies
        • Back then, vocabularies built with proprietary approaches
        • Today, 1000’s of vocabularies based on XML
            – ‘2012 U.S. GAAP Taxonomy Adopted by SEC; FASB Webcast April 3’
    8. 8. Anyone heard of shipdex ?
    9. 9. Back then, XML/Markup Conferences
        • Software Development 99 East, November 8-13, 1999, Washington D.C.
        • XML One Fall 99, November 8-11, 1999, Santa Clara, CA
        • XML 99 December 6-9, 1999, Philadelphia PA
        • Markup Technologies 99 Conference December 5-9, 1999, Philadelphia
        • Web Design 2000, February 7-9, 2000, Atlanta
        • XTech 2000, February 27-March 2, San Jose
        • Software Development 2000 West, March 20-24, 2000, San Jose
        • Sixteenth International Unicode Conference, Boston, March 27-30, 2000
        • The Ninth International World Wide Web Conference, May 15-19, 2000, Amsterdam
        • DL 2000: Fifth ACM Conference on Digital Libraries, June 3-6 2000, Texas
        • XML Europe 2000, June 12-16, Paris
        • Web Design World 2000, July 17-21, 2000, Seattle, Washington
        • MetaStructures, August 14-16, 2000, Montreal, Quebec, Canada
        • XML Developers Conference, August 17-18, 2000, Montreal, Quebec
        • Internet World Expo, October 25-27, 2000, New York City
        • XML 2000/Markup Technologies 2000, December 3-7, Washington
        • ….. Even a Geek Cruises XML Excursion - January 2001
    10. 10. Today - XML/Markup Conferences
        • The XML ‘parallelogram’
            – Balisage
            – XML Summer School
            – XML Prague
            – XML Amsterdam
        • Xtech*
        • markupForum
        • XATA
        • MarkLogic World (600 ppl)
        • databaseX (London November 2013 ?)
    11. 11. Other important good stuff
        • Evolution of the Operating System
            – Unix is the operating system for text
            – Windows tried to be the operating system for binaries, then adopted xml .. Mixed bag
            – Java (vm) has a strong xml stack
        • The web changed everything to text based markup.
        • cheap RAM/Disk/CPU
        • Virtualization = scale out
    12. 12. Other important good stuff
        http://googleblog.blogspot.cz/2012/02/unicode-over-60-percent-of-web.html
    13. 13. Unfair to point out failure ?
        • Namespaces
        • XLINK
        • WS* astronautics
        • Draconian error checking
        • XML SCHEMA
        • XFORMS
        • XSLT 1.0 (or any xml) in the browser
        • XHTML vs HTML5
        • Too many specs (modularity good, complexity bad)
    14. 14. “Winning isn’t everything. There should be no conceit in victory and no despair in defeat.” - Matt Busby
        • 2001 I was the RDBMS serial killer – ‘kill RDBMS’
        • Define successful ?
            – Adoption ?
            – Cheaper ?
            – Faster ?
            – Better ?
    15. 15. Drill Down distraction - Why is Xquery successful productive ?
        • Chose my most successful (adhoc stories, visible success)
        • Functional, dynamic … work with structure, text and values … stored proc + query lang
        • XPATH^
        • Is it possible to qualify/quantify Xquery productivity?
    16. 16. Programming Language Productivity
        Data compiled from studies by Prechelt and Garret of a particular string processing problem - public domain 2006.
    17. 17. Programming Language Productivity
        Data compiled from studies by Prechelt and Garret of a particular string processing problem - public domain 2006.
    18. 18. * 28msec – 2011 http://www.28msec.com/html/home
                      Java     XQuery
        SimpleDB      2905     572
        S3            8589     1469
        SNS           2309     455
        Total         13803    2496
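        The "80% less code" claim in the speaker notes can be checked against the figures on this slide. The short Python sketch below assumes the numbers are lines of code per client library (the slide does not label the unit); the variable and function names are illustrative.

```python
# Figures copied from the slide: (Java, XQuery) per Amazon client library.
# Assumption: the unit is lines of code.
client_library_loc = {
    "SimpleDB": (2905, 572),
    "S3": (8589, 1469),
    "SNS": (2309, 455),
}


def reduction(java_loc: int, xquery_loc: int) -> float:
    """Fraction of code saved by the XQuery version relative to Java."""
    return 1 - xquery_loc / java_loc


java_total = sum(java for java, _ in client_library_loc.values())
xquery_total = sum(xquery for _, xquery in client_library_loc.values())

per_library = {
    name: reduction(java, xquery)
    for name, (java, xquery) in client_library_loc.items()
}
overall = reduction(java_total, xquery_total)
```

        The totals match the bottom row of the slide (13803 and 2496), and every library, as well as the overall figure, comes out slightly above an 80% reduction.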
    19. 19. Developing an Enterprise Web Application in XQuery - 2009, Martin Kaufmann, Donald Kossmann
                      Java/J2EE    XQuery
        Model         3100         240
        View          4100         1500
        Controller    900          1180
        Total         8100 (?)     2920 (3490)
    20. 20. Nooooo! The problem with loc: correlation of failure with very high loc is the only certain fact with loc. That’s about it.
    21. 21. An empirical comparison of C, C++, Java, Perl, Python, Rexx, and Tcl for a search/string-processing program
        Lutz Prechelt (prechelt@ira.uka.de), Fakultät für Informatik, Universität Karlsruhe
        Language    #loc per Function Point
        C           91
        C++         53
        Java        54
        Perl        21
        * Designing and writing programs using dynamic languages tended to take half as long as well as resulting in half the code.
    22. 22. Function Point Method
        Nooooo! #loc per FP = Lines of code Per Function Point
    23. 23. Project Uncertainty Principle * Dilbert Comic 2003 United Features Syndicate Inc
    24. 24. Reviewed 11 projects
        FP Analysis
        Calc FP inputs/outputs
        Calc VAF (0.65 + [ (Ci) / 100])
        AVP = VAF * sum(FP)
        #loc using cloc = #loc per FP
        * FP overview - http://www.softwaremetrics.com/fpafund.htm
    25. 25. Language      #loc per Function Point
        Perl          21
        Eiffel        21
        SQL           13-30
        XQuery        27-33
        Haskell       38
        Erlang        40
        Python        42-47
        Java          50-80
        Javascript    50-55
        Scheme        53
        C++           59-80
        C             128-140
        http://www.qsm.com/resources/function-point-languages-table
    26. 26. Xquery 2011 Survey
    27. 27. Preferred Programming Language 73% 55% 45% 32% 22%
    28. 28. Which data formats do you use the most ? 95% 40% 39% 32% 27% 18% 15%
    29. 29. Do you think XQuery makes you a more productive programmer ? 67% 14% 10% 8%
    30. 30. Is XQuery more productive than (with???) Java in developing web based data applications ? 58% 22% 12% 8%
    31. 31. Time to bust one myth
        • xml is too slow and bloated
        • http://www.navioo.com/ajax/ajax_json_xml_Benchmarking.php
        • In data orientated AJAJ scenarios with JSON … the best most benchmarks show today is 30% faster with less load (so more with less resources)
    32. 32. mongodb * http://www.linkedin.com/skills/skill
    33. 33. Javascript * http://www.linkedin.com/skills/skill
    34. 34. XQuery * http://www.linkedin.com/skills/skill
    35. 35. XSLT * http://www.linkedin.com/skills/skill
    36. 36. hadoop * http://www.linkedin.com/skills/skill
    37. 37. Java * http://www.linkedin.com/skills/skill
    38. 38. JSON * http://www.linkedin.com/skills/skill
    39. 39. XML * http://www.linkedin.com/skills/skill
    40. 40. Back When SQL Was Invented…
    41. 41. born in the 90’s
    42. 42. XML ?
    43. 43. Might even be
    44. 44. Channel effect of Aging in Technology
        • “Average age of @guardian Facebook audience is 29. Website is 37, print paper 44. Amazing channel effect, really. #newsrw”
        • Babyboomers, Gen X, Y and Z
        • I feel a bit uneasy framing generational arguments …
    45. 45. Death of the XML Child … Overachieving Child Prodigies grow up
    46. 46. Let’s not get distracted.
    47. 47. Don’t mention the war
    48. 48. XML Hard Core - XML Hype cycle (chart; labelled years: 1998, 2002, 2006, 2007, 2009, 2012; annotation: XML’s reported death)
    49. 49. REST of the World - XML Hype cycle (chart; labelled years: 1998, 2002, 2006, 2009, 2012; annotation: XML’s reported death)
    50. 50. hype cycle * 2012 Gartner Hype Cycle http://www.gartner.com
    51. 51. 2001 Edd Dumbill – xml.com ‘Stop the XML hype, I want to get off. As editor of XML.com, I welcome the massive success XML has had. But things prized by the XML community – openness and interoperability – are getting swallowed up in a blaze of marketing hype. Is this the price of success, or something we can avoid?’ Source: Edd Dumbill (March 2001)
    52. 52. 2012 Edd Dumbill g+ post ‘For many years I was the editor of XML.com, and the chair of the XML Europe conference. Today, it seems that XML’s mission to be a web language is mostly dead. I’m not saying XML is useless: it has proved itself as a more easily-used SGML, but I’m not sure it’s expanded too far outside of that.’ Source: Edd Dumbill (March 2012)
    53. 53. Current Status: XML is dead• XML fought too many battles (RDBMS, NoSQL, web developers, HTML5)• Age channeling and Hype curve in effect• But XML technology stack is embracing JSON etc …• No room for sentimentality in technology
    54. 54. XML is dead boring
    55. 55. Halftime Break
    56. 56. Big Data & Modern XML
    57. 57. What’s the problem ?
    58. 58. Is XML Applicable to Big Data ?• We know it is, that’s why I am here• Some of you already know• Need to dig into the detail• But we first need to simplify things
    59. 59. http://kensall.com/big-picture/bigpix22.html
    60. 60. * http://gigaom.com/cloud/big-data-equals-big-opportunities-for-businesses-infographic/ BigData Opportunity
    61. 61. * http://gigaom.com/cloud/big-data-equals-big-opportunities-for-businesses-infographic/ BigData Opportunity
    62. 62. Managing data variability, volume & velocity is hard. You need to be a (data) scientist to build this rocket ship.
    63. 63. So what’s the problem again ? #1 – How to apply Modern XML to your BigData problems ? #1a – XML Milieu too complicated, need to identify what is successful as Modern XML #1b – BigData is a huge opportunity #1c – BigData has a huge learning curve and high risks
    64. 64. Solving #1 – Defining Modern XML• Identify the technologies• Identify and classify the Scenarios
    65. 65. Modern XML Technology analysis• Internal survey of ML Customer projects & External survey of projects (w/ pref towards Big/Complex projects)• Informal Survey (polldaddy)• Qualitative and quantitative
    66. 66. Eisenhower - "What is important is seldom urgent and what is urgent is seldom important,"• IMPORTANT and URGENT: Critical• IMPORTANT and NOT URGENT: Goals• NOT IMPORTANT and URGENT: Interruptions• NOT IMPORTANT and NOT URGENT: Distractions
    67. 67. Survey Interpretations• XML 1.0 and Namespaces are important now• XProc, XHTML important now• XSLT 2 and XQuery 1 very important now• XSLT 2 and XQuery 2 in the browser near future• XQuery 3.0 important near future• SAX/DOM now, XOM possible future• XML Schema 1.0 now, 1.1 for the near future• Schematron surprising• Semweb is for the future• SVG and MathML due to web browser support• XML vocabulary has a very ‘long tail’
    68. 68. Modern XML Technology Candidates• Core: XML 1.0, Namespaces• Other: (none)• Transform: XSLT 2.0, XQuery 1.0• Processing: SAX, DOM• Schema: Schematron, XML Schema 1.0• Semantics: RDF, OWL• Vocabularies: Office Doc ML, SVG• Notes: These technologies trended highly across all analysis. Bold – could be trending due to browser impl/historical dep
    69. 69. Modern XML Tier 1• Core: XML 1.0, Namespaces• Other: XProc• Transform: XSLT 2.0 / 3.0 / browser, XQuery 1.0 / 3.0• Processing: SAX, DOM• Schema: Schematron, XML Schema 1.0 / 1.1• Semantics: RDF, OWL• Vocabularies: Office Doc ML, SVG• Notes: These technologies trended highly across all analysis. Bold – could be trending due to browser impl/historical dep. Italic – strong signal, early usage, interest of unproven spec/tech
    70. 70. Modern XML Tier 1 and Modern XML Tier 2• Core: XML 1.0, Namespaces (Tier 1); XML Canonicalization, xml:id (Tier 2)• Other: XProc (Tier 1); XHTML* (Tier 2)• Transform: XSLT 2.0 / 3.0 / browser, XQuery 1.0 / 3.0 (Tier 1); XSLT 1.0 (Tier 2)• Processing: SAX, DOM (Tier 1); XOM, StAX (Tier 2)• Schema: Schematron, XML Schema 1.0 / 1.1 (Tier 1); RELAX-NG (Tier 2)• Semantics: RDF, OWL (Tier 1); SPARQL (Tier 2)• Vocabularies: Office Doc ML, SVG (Tier 1); MathML, Docbook, SOAP*, DITA, EPUB (Tier 2)
    71. 71. Modern XML Tier 1 and Modern XML Tier 2• Core: XML 1.0, Namespaces (Tier 1); XML Canonicalization, xml:id, XML infoset (Tier 2)• Other: XProc (Tier 1); XHTML* (Tier 2)• Transform: XSLT 2.0 / 3.0 / browser, XQuery 1.0 / 3.0 (Tier 1); XSLT 1.0 (Tier 2)• Processing: SAX, DOM (Tier 1); XOM, StAX (Tier 2)• Schema: Schematron, XML Schema 1.0 / 1.1 (Tier 1); RELAX-NG (Tier 2)• Semantics: RDF, OWL (Tier 1); SPARQL (Tier 2)• Vocabularies: Office Doc ML, SVG (Tier 1); MathML, Docbook, SOAP, DITA, EPUB (Tier 2)• Data Formats: XML, text, binary, JSON
    72. 72. The technology triggers• XML Database – reduce the complexity/risk of BigData – MarkLogic – eXist – Zorba – Sedna – BaseX – Others (Oracle!)• XQuery - Rapid prototyping• Avoid purist architectures, embrace heterogeneity
    73. 73. Modern XML / BigData Scenarios• Classic Scenarios – Document (xml) Database – Aggregation – Enterprise Search – Heterogeneous Content store – Publishing• BigData Scenarios – BigData ‘classic’ – Extreme personalisation – Predictive analytics – Financial analysis – Realtime analysis (management/financial) – Actionable intelligence• Semantic Web – too early to categorize but it’s for real
    74. 74. Solving Problem #2 – Focus on the Practicalities• What type of Big Data problem do you have ? – The urgent, important ones you know about – The urgent, important ones you don’t know about• Create a dedicated team (analytics, problem domain experts) to identify the latter• Assess data maturity (Data Audit)• With power comes responsibility … Ethical Analytics
    75. 75. BigData Tech Advice• Start using an XML database asap!• Don’t get distracted by the zoo … start hadooping right away• ‘Data outlives code’, spend more time on the data, clean abstractions, cogent, opening it up
    76. 76. Size appropriately• Volume – will be relative to your current capability, if the requirement is an order of magnitude greater than current infrastructure scaling• Velocity – Updates versus reads ? High volatility with realtime queries ?• Variety – managing versioning ?• Complexity – multiples, complex processes
    77. 77. Size Appropriately: Are you a ‘Facebook’ (Google, Yahoo…) ?• 2.5 billion content items shared per day (status updates + wall posts + photos + videos + comments)• 2.7 billion Likes per day• 300 million photos uploaded per day• 100+ petabytes of disk space in one of FB’s largest Hadoop (HDFS) clusters• 105 terabytes of data scanned via Hive, Facebook’s Hadoop query language, every 30 minutes• 70,000 queries executed on these databases per day• 500+ terabytes of new data ingested into the databases every day• Are you planning to scale out to ~180,900 servers ?• ~18000 database servers ingesting 500+ terabytes of data through a guesstimated 50+ billion calls … a day! http://www.datacenterknowledge.com/the-facebook-data-center-faq/
    78. 78. Solving Problem #3 – Understanding the risks• Biggest mistakes seen with BigData adoption• ‘data scientists themselves don’t have much of intuition either…and that is a problem. I saw an estimate recently that said 70 to 80 percent of the results that are found in the machine learning literature, which is a key Big Data scientific field, are probably wrong because the researchers didn’t understand that they were overfitting the data’. – Alex Pentland, MIT’s “Big Data guy”
    79. 79. Summary• We reviewed some aspects of XML’s current status in the dataverse• Identified a subset of the XML Milieu – calling it Modern XML• Identified the scenarios where Modern XML is being brought to bear with BigData• Reviewed common mistakes and risks with BigData
    80. 80. Final Thesis• Modern XML provides a great foundation today – Great for ‘classic’ scenarios – Great technical positioning for addressing challenges of BigData – Great technical positioning for semweb• Adopting an XML database mitigates risk• Knowing BigData/Modern XML scenarios helps us mitigate risks• There is a big prize if you get BigData right
    81. 81. Avoid stereotypes: I’m a RDBMS, I’m a Protocol Buffer, I’m a JSON, I’m an XML
    82. 82. Jeni Tennison XML Prague 2012 talk JSON XML RDF HTML
    83. 83. Be wary of Paradigm Shifts• RedMonk - Language divergence• Andreessen - Software is eating the world• 128bit and beyond current von Neumann/Harvard arch ?• Power Wall (at server farms/mobile devices)• The web revolution is not done yet (http://www.firebase.com/index.html)
    84. 84. Embrace change
    85. 85. ‘Form is temporary. Class is permanent’• XML is emerging from its ‘Trough of disillusionment’, because it’s useful, productive and reacting to new requirements.• Modern XML is successful on many different measures, mature and dead boring• Modern XML can help solve your BigData problems
    86. 86. Pull the Technology Trigger – Try an XML Database Today!• MarkLogic 6 – Web dev ‘surface area’, work with JSON – REST API – Java API – Work across different data• Zorba• eXist• BaseX• Sedna
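The "Size Appropriately" figures on slide 77 can be turned into rough per-server numbers, which makes it easier to judge how far your own workload is from a 'Facebook'. The following is a minimal back-of-envelope sketch in Python, not anything shown in the talk. It assumes decimal units (1 TB = 1,000 GB) and an even spread of load across servers, and the function names are hypothetical helpers introduced only for illustration.

```python
# Back-of-envelope sizing using the figures quoted on slide 77.
# Assumptions (not from the slides): 1 TB = 1,000 GB, and load is
# spread evenly across all database servers.

DB_SERVERS = 18_000        # "~18000 database servers"
INGEST_TB_PER_DAY = 500    # "500+ terabytes of new data ingested ... every day"
CALLS_PER_DAY = 50e9       # "guesstimated 50+ billion calls ... a day"
HIVE_SCAN_TB = 105         # "105 terabytes of data scanned via Hive"
HIVE_WINDOW_S = 30 * 60    # "... every 30 minutes"


def per_server_ingest_gb(ingest_tb=INGEST_TB_PER_DAY, servers=DB_SERVERS):
    """Average GB ingested per server per day."""
    return ingest_tb * 1000 / servers


def scan_rate_gb_per_s(scan_tb=HIVE_SCAN_TB, window_s=HIVE_WINDOW_S):
    """Average scan throughput in GB per second over the window."""
    return scan_tb * 1000 / window_s


def calls_per_server(calls=CALLS_PER_DAY, servers=DB_SERVERS):
    """Average number of calls handled per server per day."""
    return calls / servers


def load_ratio(your_tb_per_day, reference_tb_per_day=INGEST_TB_PER_DAY):
    """Your daily ingest as a fraction of the reference daily ingest."""
    return your_tb_per_day / reference_tb_per_day


if __name__ == "__main__":
    print(f"Ingest per server: {per_server_ingest_gb():.1f} GB/day")
    print(f"Hive scan rate:    {scan_rate_gb_per_s():.1f} GB/s")
    print(f"Calls per server:  {calls_per_server():,.0f} per day")
    print(f"0.5 TB/day is {load_ratio(0.5):.2%} of the reference ingest")
```

If your own daily ingest is a small fraction of the reference figure, that supports the slide's point: size to your current capability rather than planning for a 'Facebook'-scale cluster.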