Applying Semantics to Unstructured
     Data (Big and Getting Bigger)
                Wednesday, November 30, 2012
                        4:00 – 5:00
Bryan Bell
  Vice President, Enterprise Solutions, Expert
  System
Lynda Moulton
  Analyst & Consultant, LWM Technology Services
Peter O'Kelly
  Principal Analyst, O'Kelly Associates
Overall Session Agenda
• Introduction and context-setting
• "Big Data" 101 for Business
• Semantics and the Big Data Opportunity




                                           2
Big Data 101 Agenda
•   Big data in context
•   Recap
•   Risks
•   Recommendations




                                  3
Big Data in Context
• What is “big data”?
  – Unhelpfully, both “big data” and “NoSQL,” generally
    considered a key part of the big data wave, are defined
    more in terms of what they aren’t than what they are
  – A typical big data definition (Wikipedia):
     • “[…] data sets that grow so large that they become awkward to
       work with using on-hand database management tools”
  – Often associated with Gartner’s volume, variety (and
    complexity), and velocity model
     • Also value and veracity considerations

                                                                 4
Big Data in Context
• Why is big data a big deal now?
  – Commoditized hardware, software, and networking
     • Capability and price/performance curves that continue to
       defy all economic “laws”
     • Cloud services with radical new capability/cost equations
  – Maturation and uptake of related open source
    software, especially Hadoop
     • Powerful and often no- or low-cost



                                                             5
Big Data in Context
• Why is big data a big deal now (continued)?
  – Market enthusiasm for “NoSQL” systems
  – Useful and often “open source”/public domain data
    sources and services
  – Mainstreaming of semantic tools and techniques




                                                   6
A Prime Minicomputer, c1982




                              7
Fast-Forward to 2012




                       8
Fast-Forward to 2012




                       9
Fast-Forward to 2012




                       10
Fast-Forward to 2012




                       11
Fast-Forward to 2012




                       12
Google BigQuery




                  13
Hadoop
• Hadoop is often considered central to big data
  – Originating with Google’s MapReduce architecture,
    Apache Hadoop is an open source architecture for
    distributed processing on networks of commodity
    hardware
  – From Wikipedia:
     • “‘Map’ step: The master node takes the input, divides it into
       smaller sub-problems, and distributes them to worker nodes
     • ‘Reduce’ step: The master node then collects the answers to
       all the sub-problems and combines them in some way to
       form the output – the answer to the problem it was
       originally trying to solve”
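
The two steps quoted above can be sketched in a few lines of Python. This is a minimal, single-process illustration of the simplified Wikipedia description, not Hadoop's actual API: the function names (`map_step`, `worker`, `reduce_step`, `word_count`) and the word-count task are assumptions chosen for the example, and the "worker nodes" are ordinary function calls that run one after another rather than in parallel on a cluster.

```python
from collections import defaultdict


def map_step(documents, num_workers):
    """'Map' step: divide the input into smaller sub-problems, one per worker."""
    return [documents[i::num_workers] for i in range(num_workers)]


def worker(chunk):
    """A hypothetical worker node: count the words in its share of the input."""
    counts = defaultdict(int)
    for doc in chunk:
        for word in doc.lower().split():
            counts[word] += 1
    return dict(counts)


def reduce_step(partial_counts):
    """'Reduce' step: collect the workers' answers and combine them."""
    total = defaultdict(int)
    for partial in partial_counts:
        for word, count in partial.items():
            total[word] += count
    return dict(total)


def word_count(documents, num_workers=2):
    chunks = map_step(documents, num_workers)
    # On a real cluster these calls would run in parallel on separate machines.
    partials = [worker(chunk) for chunk in chunks]
    return reduce_step(partials)


if __name__ == "__main__":
    print(word_count(["big data", "big semantics", "data data"]))
```

The point of the split is that each worker needs only its own chunk, so the work can be spread across commodity hardware and only the small partial results travel back to be combined.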

                                                                   14
Hadoop
• Commercial application domains include (from
  Wikipedia)
  –   Log and/or clickstream analysis of various kinds
  –   Marketing analytics
  –   Machine learning and/or sophisticated data mining
  –   Image processing
  –   Processing of XML messages
  –   Web crawling and/or text processing
  –   General archiving, including of relational/tabular data,
      e.g. for compliance

                                                             15
Hadoop
• Hadoop is popular and rapidly evolving
  – Most leading information management vendors
    have embraced Hadoop
  – There is now a Hadoop ecosystem




                                                  16
Meanwhile, Back in the Googleplex
• Dremel, BigQuery, Spanner, and other really
  big data projects




                                                17
Meanwhile, Back in the Googleplex




                                18
Google Now




             19
A NoSQL Taxonomy
• From the NoSQL Wikipedia article:




                                      20
A View of the NoSQL Landscape




                                21
Another NoSQL Landscape View
NoSQL Perspectives
• The “NoSQL” meme confusingly conflates
   – Document database requirements
      • Best served by XML DBMS (XDBMS)
   – Physical database model decisions on which only DBAs and
     systems architects should focus
      • And which are more complementary than competitive with DBMS
   – Object databases, which have floundered for decades
      • But with which some application developers are nonetheless
        enamored, for minimized “impedance mismatch,” despite significant
        information management compromises
   – Semantic (e.g., RDF) models
      • Also more complementary than competitive with RDBMS/XDBMS
• Also consider: the “traditional” DBMS players can leverage
  the same underlying technology power curves

                                                                            23
Data as a Service
• The (single source of) truth is out there?...
   – High-quality data sources are being commoditized
   – Value is shifting to the ability to discern and leverage conceptual
     connections, not just to manage big databases
• Some resources and developments to explore
   –   Social networking graphs and activities
   –   Data.com (Salesforce.com)
   –   Data.gov
   –   Google Knowledge Graph
   –   Linked Data
   –   Microsoft Windows Azure Data Marketplace
   –   Wikidata.org
   –   Wolfram Alpha

                                                                      24
Mainstreaming Semantics
• Tools and techniques applied in search of
  more meaning, e.g.,
  – Vocabulary management
  – Disambiguation and auto-categorization
  – Text mining and analysis
  – Context and relationship analysis
• It’s still ideal to help people capture and apply
  data and metadata in context
  – Semantic tools/techniques are complementary

                                                  25
Mainstreaming Semantics
• The Semantic Web is still more vision than reality
   – But Google, Microsoft, Yahoo, and Yandex, for
     example, are improving Web searches by capturing
     and applying more metadata and relationships via
     schema.org schemas in Web pages
   – And Google’s Knowledge Graph is about “things, not
     strings,” with, as of mid-2012, “500 million objects, as
     well as more than 3.5 billion facts about and
     relationships between these different objects”
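
To make "metadata and relationships via schema.org schemas" concrete, here is a hedged sketch that describes this session as a schema.org `Event` with `Person` performers. The slide does not say which markup syntax is used; JSON-LD is assumed here as one of the syntaxes schema.org supports, and the helper name `session_as_jsonld` is hypothetical.

```python
import json


def session_as_jsonld(title, speakers):
    """Build a schema.org Event description as a JSON-LD string (illustrative only)."""
    doc = {
        "@context": "https://schema.org",
        "@type": "Event",
        "name": title,
        # Each speaker becomes a typed "thing" rather than a bare string.
        "performer": [{"@type": "Person", "name": name} for name in speakers],
    }
    return json.dumps(doc, indent=2)


if __name__ == "__main__":
    print(session_as_jsonld(
        "Applying Semantics to Unstructured Data (Big and Getting Bigger)",
        ["Bryan Bell", "Lynda Moulton", "Peter O'Kelly"],
    ))
```

Embedded in a Web page, a description like this tells a search engine that the names are people performing at an event, which is the kind of relationship plain text alone does not expose.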



                                                            26
Recap
• Commoditization and cloud
  – Very significant new opportunities
• Hadoop and related frameworks
  – Complementary to RDBMS and XDBMS
• NoSQL
  – Likely headed for meme-bust…
• Data services
  – Game-changing potential
• Semantic tools and techniques
  – Rapidly gaining momentum

                                         27
Risks
• The potential for an ever-expanding set of information silos
   – Focus on minimized redundancy and optimized integration
• GIGO (garbage in, garbage out) at super-scale
   – New opportunities for unprecedented self-inflicted damage, for
     organizations that don’t model or query effectively
• Cognitive overreach
   – The potential for information workers to create and act on
     nonsensical queries based on poorly-designed and/or
     misunderstood information models
• Skills gaps can create competitive disadvantages
   – Modeling, query formulation, and data analysis
   – Critical thinking and information literacy



                                                                  28
Recommendations
• Aim high: big data is in many respects just
  getting started…
   – A lot of technology recycling but also
     significant and disruptive innovation
• Work to build consensus among stakeholders
  on the opportunities and risks
• Focus on human skills – e.g., critical
  thinking and information literacy
   – For now, an instance of the most creative and
     powerful type of semantic big data processor
     we know of is between your ears

                                                     29

Gilbane Boston 2012 Big Data 101


Editor's Notes

  • #8 At my employer (a facilities management company in Seattle, responsible for the claims-processing back-end for Washington State Delta Dental) in 1982: added 4 MB main memory to a Prime 750 system; changed the locks on the building and office doors, due to new security risk (mega-$ upgrade)…
  • #9 Source: “How to Create a Mind,” Ray Kurzweil, p. 256
  • #10 Source: “How to Create a Mind,” Ray Kurzweil, p. 259
  • #11 Source: “How to Create a Mind,” Ray Kurzweil, p. 258
  • #12 Source: “How to Create a Mind,” Ray Kurzweil, p. 254
  • #13 Clipped from Amazon sale page 20121116
  • #14 An example of what these power curves facilitate… Source: https://developers.google.com/bigquery/docs/pricing#table Captured 2012118. Also consider Amazon Web Services, Salesforce.com’s database.com
  • #15 Image source: http://hadoop.apache.org/
  • #16 Image source: http://hadoop.apache.org/
  • #17 Image sources: http://hadoop.apache.org/ and http://www.slideshare.net/cloudera/tokyo-nosqlslidesonly?from=ss_embed
  • #18 Source: https://cloud.google.com/files/BigQueryTechnicalWP.pdf Later in the same paper: “Dremel can scan 35 billion rows without an index in tens of seconds […] parallelize queries and run them on tens of thousands of servers simultaneously”
  • #19 Source https://cloud.google.com/files/BigQueryTechnicalWP.pdf
  • #20 Google Now as an example of a big data application context – a personal experience snapshot:
    – Early morning: searched Google Maps on my iPad for the address of a nearby town's high school, where I was driving my daughter that evening for an event
    – Later, on my Google Nexus 7 tablet, Google Now presented a “card” with directions and traffic information to the school, from my current location, which it got from GPS or Wi-Fi network triangulation
    – One click away from turn-by-turn navigation
    – Also note Google Voice Search
    – All at no cost to me (except for the data I gave Google in exchange for using the services…)
    – This is a basic example. Google has much more in mind, and it’s not alone in this context: it aspires to use predictive analytics (and big data about you in the world…) to answer questions before you ask them
  • #21 Captured 20121105
  • #22 Source: http://blogs.the451group.com/information_management/2011/04/15/nosql-newsql-and-beyond/ My point: this is supposed to be a simplification, relative to RDBMS?...
  • #23 Source: http://arnon.me/2012/11/nosql-landscape-diagrams/ Another view of the NoSQL land-grab; these domains (except for “NewSQL”) all predated the “NoSQL” label
  • #24 NoSQL is sometimes also associated with open source DBMS, adding more confusion
  • #25 Snapshots: Government data: also see http://www.cityofboston.gov/open/ and other country-level services. Wolfram Alpha – captured 20121118: “Curated data: 10+ trillion pieces of data from primary sources with continuous updating”
  • #27 Google Knowledge Graph: http://googleblog.blogspot.com/2012/05/introducing-knowledge-graph-things-not.html
  • #30 Reference to Kurzweil book: a timely (and optimistic) review of how we got here, and what may be next