1. Applying Semantics to Unstructured Data (Big and Getting Bigger)
Wednesday, November 30, 2012
4:00 – 5:00
Bryan Bell
Vice President, Enterprise Solutions, Expert System
Lynda Moulton
Analyst & Consultant, LWM Technology Services
Peter O'Kelly
Principal Analyst, O'Kelly Associates
2. Overall Session Agenda
• Introduction and context-setting
• "Big Data" 101 for Business
• Semantics and the Big Data Opportunity
3. Big Data 101 Agenda
• Big data in context
• Recap
• Risks
• Recommendations
4. Big Data in Context
• What is “big data”?
– Unhelpfully, both “big data” and “NoSQL,” generally
considered a key part of the big data wave, are defined
more in terms of what they aren’t than what they are
– A typical big data definition (Wikipedia):
• “[…] data sets that grow so large that they become awkward to
work with using on-hand database management tools”
– Often associated with Gartner’s volume, variety (and
complexity), and velocity model
• Also value and veracity considerations
5. Big Data in Context
• Why is big data a big deal now?
– Commoditized hardware, software, and networking
• Capability and price/performance curves that continue to
defy all economic “laws”
• Cloud services with radical new capability/cost equations
– Maturation and uptake of related open source
software, especially Hadoop
• Powerful and often no- or low-cost
6. Big Data in Context
• Why is big data a big deal now (continued)?
– Market enthusiasm for “NoSQL” systems
– Useful and often “open source”/public domain data
sources and services
– Mainstreaming of semantic tools and techniques
14. Hadoop
• Hadoop is often considered central to big data
– Originating with Google’s MapReduce architecture,
Apache Hadoop is an open source architecture for
distributed processing on networks of commodity
hardware
– From Wikipedia:
• “‘Map’ step: The master node takes the input, divides it into
smaller sub-problems, and distributes them to worker nodes
• ‘Reduce’ step: The master node then collects the answers to
all the sub-problems and combines them in some way to
form the output – the answer to the problem it was
originally trying to solve”
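The two steps quoted above can be sketched in a few lines of Python. This is only a single-process illustration of the idea, not Hadoop's API: the function names (`map_step`, `shuffle`, `reduce_step`, `word_count`) and the sample documents are hypothetical, and a real cluster would run the map and reduce work on separate worker nodes.

```python
from collections import defaultdict


def map_step(chunk):
    """Stand-in for one worker: emit a (word, 1) pair for each word in a chunk."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_chunks):
    """Group the intermediate pairs by key so each key is reduced once."""
    grouped = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reduce_step(grouped):
    """Combine the values collected for each key into a single answer."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(documents):
    """Divide the input into sub-problems, map each one, then reduce."""
    mapped = [map_step(doc) for doc in documents]  # one sub-problem per document
    return reduce_step(shuffle(mapped))


counts = word_count(["big data is big", "data about data"])
print(counts)
```

Because each call to `map_step` depends only on its own chunk, the sub-problems can in principle be handed to different machines, which is the property the quoted description relies on.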
15. Hadoop
• Commercial application domains include (from
Wikipedia)
– Log and/or clickstream analysis of various kinds
– Marketing analytics
– Machine learning and/or sophisticated data mining
– Image processing
– Processing of XML messages
– Web crawling and/or text processing
– General archiving, including of relational/tabular data,
e.g. for compliance
16. Hadoop
• Hadoop is popular and rapidly evolving
– Most leading information management vendors
have embraced Hadoop
– There is now a Hadoop ecosystem
17. Meanwhile, Back in the Googleplex
• Dremel, BigQuery, Spanner, and other really
big data projects
23. NoSQL Perspectives
• The “NoSQL” meme confusingly conflates
– Document database requirements
• Best served by XML DBMS (XDBMS)
– Physical database model decisions on which only DBAs and
systems architects should focus
• And which are more complementary than competitive with DBMS
– Object databases, which have floundered for decades
• But with which some application developers are nonetheless
enamored, for minimized “impedance mismatch,” despite significant
information management compromises
– Semantic (e.g., RDF) models
• Also more complementary than competitive with RDBMS/XDBMS
• Also consider: the “traditional” DBMS players can leverage
the same underlying technology power curves
24. Data as a Service
• The (single source of) truth is out there?...
– High-quality data sources are being commoditized
– Value is shifting to the ability to discern and leverage conceptual
connections, not just to manage big databases
• Some resources and developments to explore
– Social networking graphs and activities
– Data.com (Salesforce.com)
– Data.gov
– Google Knowledge Graph
– Linked Data
– Microsoft Windows Azure Data Marketplace
– Wikidata.org
– Wolfram Alpha
25. Mainstreaming Semantics
• Tools and techniques applied in search of
more meaning, e.g.,
– Vocabulary management
– Disambiguation and auto-categorization
– Text mining and analysis
– Context and relationship analysis
• It’s still ideal to help people capture and apply
data and metadata in context
– Semantic tools/techniques are complementary
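As a rough illustration of how vocabulary management, disambiguation, and auto-categorization fit together, the sketch below tags a sentence using a tiny controlled vocabulary and context cues. Everything in it is an assumption made for the example: the `VOCABULARY` and `SENSES` tables, the function names, and the simple cue-counting rule. Commercial semantic tools use far richer linguistic analysis than this.

```python
import re

# Hypothetical controlled vocabulary: preferred term -> accepted synonyms.
VOCABULARY = {
    "Apache Hadoop": {"hadoop", "mapreduce"},
    "Relational DBMS": {"rdbms", "sql"},
}

# Hypothetical sense inventory: ambiguous term -> {sense: context cue words}.
SENSES = {
    "jaguar": {
        "Jaguar (animal)": {"cat", "jungle", "prey"},
        "Jaguar (car)": {"engine", "sedan", "dealer"},
    },
}


def tokenize(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def disambiguate(term, tokens):
    """Pick the sense whose cue words overlap most with the surrounding tokens."""
    token_set = set(tokens)
    scores = {sense: len(cues & token_set) for sense, cues in SENSES[term].items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None  # no cue found: leave untagged


def categorize(text):
    """Return the sorted list of preferred terms and senses found in the text."""
    tokens = tokenize(text)
    token_set = set(tokens)
    tags = set()
    for preferred, synonyms in VOCABULARY.items():
        if synonyms & token_set:
            tags.add(preferred)
    for term in SENSES:
        if term in token_set:
            sense = disambiguate(term, tokens)
            if sense:
                tags.add(sense)
    return sorted(tags)


print(categorize("The jaguar dealer stores logs in Hadoop"))
print(categorize("A jaguar stalks prey in the jungle"))
```

When no context cue is present the ambiguous term is left untagged rather than guessed, which is one place where data and metadata captured by people in context remain complementary to automated tagging.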
26. Mainstreaming Semantics
• The Semantic Web is still more vision than reality
– But Google, Microsoft, Yahoo, and Yandex, for
example, are improving Web searches by capturing
and applying more metadata and relationships via
schema.org schemas in Web pages
– And Google’s Knowledge Graph is about “things, not
strings,” with, as of mid-2012, “500 million objects, as
well as more than 3.5 billion facts about and
relationships between these different objects”
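One way to picture “things, not strings” is as a set of subject/predicate/object facts in which identifiers, rather than labels, stand for the things. The sketch below is a toy model under that assumption. The identifiers, predicates, and facts are invented for illustration and say nothing about how Google's Knowledge Graph or schema.org markup is actually stored.

```python
from collections import namedtuple

Triple = namedtuple("Triple", "subject predicate obj")

# Hypothetical facts: two different "things" share the same string label.
FACTS = [
    Triple("thing:taj_mahal_monument", "type", "Monument"),
    Triple("thing:taj_mahal_monument", "label", "Taj Mahal"),
    Triple("thing:taj_mahal_monument", "located_in", "thing:agra"),
    Triple("thing:taj_mahal_musician", "type", "Musician"),
    Triple("thing:taj_mahal_musician", "label", "Taj Mahal"),
    Triple("thing:agra", "label", "Agra"),
]


def match(facts, subject=None, predicate=None, obj=None):
    """Return every fact that agrees with the fields that were supplied."""
    return [
        fact
        for fact in facts
        if (subject is None or fact.subject == subject)
        and (predicate is None or fact.predicate == predicate)
        and (obj is None or fact.obj == obj)
    ]


def things_labeled(facts, label):
    """Resolve a string to the distinct things that carry it as a label."""
    return sorted(fact.subject for fact in match(facts, predicate="label", obj=label))


candidates = things_labeled(FACTS, "Taj Mahal")
location = match(FACTS, subject="thing:taj_mahal_monument", predicate="located_in")
print(candidates)
print(location[0].obj)
```

The string “Taj Mahal” resolves to two separate things, and a relationship such as `located_in` attaches to only one of them. A plain keyword index cannot express that distinction.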
27. Recap
• Commoditization and cloud
– Very significant new opportunities
• Hadoop and related frameworks
– Complementary to RDBMS and XDBMS
• NoSQL
– Likely headed for meme-bust…
• Data services
– Game-changing potential
• Semantic tools and techniques
– Rapidly gaining momentum
28. Risks
• The potential for an ever-expanding set of information silos
– Focus on minimized redundancy and optimized integration
• GIGO (garbage in, garbage out) at super-scale
– New opportunities for unprecedented self-inflicted damage, for
organizations that don’t model or query effectively
• Cognitive overreach
– The potential for information workers to create and act on
nonsensical queries based on poorly-designed and/or
misunderstood information models
• Skills gaps can create competitive disadvantages
– Modeling, query formulation, and data analysis
– Critical thinking and information literacy
29. Recommendations
• Aim high: big data is in many respects just
getting started…
– A lot of technology recycling but also
significant and disruptive innovation
• Work to build consensus among stakeholders
on the opportunities and risks
• Focus on human skills – e.g., critical
thinking and information literacy
– For now, an instance of the most creative and
powerful type of semantic big data processor
we know of is between your ears
Editor's Notes
At my employer (a facilities management company in Seattle, responsible for the claims-processing back-end for Washington State Delta Dental) in 1982: added 4 MB main memory to a Prime 750 system; changed the locks on the building and office doors, due to new security risk (mega-$ upgrade)…
Source: “How to Create a Mind,” Ray Kurzweil, p. 256
Source: “How to Create a Mind,” Ray Kurzweil, p. 259
Source: “How to Create a Mind,” Ray Kurzweil, p. 258
Source: “How to Create a Mind,” Ray Kurzweil, p. 254
Clipped from Amazon sale page 20121116
An example of what these power curves facilitate… Source: https://developers.google.com/bigquery/docs/pricing#table (captured 2012118). Also consider Amazon Web Services, Salesforce.com’s database.com
Source: https://cloud.google.com/files/BigQueryTechnicalWP.pdf Later in the same paper: “Dremel can scan 35 billion rows without an index in tens of seconds […] parallelize queries and run them on tens of thousands of servers simultaneously”
Google Now as an example of a big data application context – a personal experience snapshot: Early morning: searched Google Maps on my iPad for the address of a nearby town high school, where I was driving my daughter that evening for an event. Later, on my Google Nexus 7 tablet, Google Now presented a “card” with directions and traffic information to the school – from my current location, which it got from GPS or Wi-Fi network triangulation. One click away from turn-by-turn navigation. Also note Google Voice Search. All at no cost to me (except for the data I gave Google in exchange for using the services…). This is a basic example – Google has much more in mind, and it’s not alone in this context – it aspires to use predictive analytics (and big data about you in the world…) to answer questions before you ask them
Captured 20121105
Source: http://blogs.the451group.com/information_management/2011/04/15/nosql-newsql-and-beyond/ My point: this is supposed to be a simplification, relative to RDBMS?...
Source: http://arnon.me/2012/11/nosql-landscape-diagrams/ Another view of the NoSQL land-grab; these domains (except for “NewSQL”) all predated the “NoSQL” label
NoSQL is sometimes also associated with open source DBMS, adding more confusion
Snapshots: Government data: also see http://www.cityofboston.gov/open/ and other country-level services. Wolfram Alpha – captured 20121118: “Curated data: 10+ trillion pieces of data from primary sources with continuous updating”
Google Knowledge Graph: http://googleblog.blogspot.com/2012/05/introducing-knowledge-graph-things-not.html
Reference to Kurzweil book: a timely (and optimistic) review of how we got here, and what may be next