This document provides an agenda for the CITA'15 Workshop held in August 2015. The workshop schedule includes 4 sessions taking place between 8:30 am and 5:00 pm with morning and afternoon breaks. The workshop agenda covers topics such as big data analytics, open data, semantic data description using ontologies and RDF, and a case study on converting a dataset to linked open data. The format of the workshop will be interactive with exercises and discussion encouraged.
Larry will discuss what data science means in general, and more specifically at Udemy. He will describe some key data science frameworks and what it means for them to be agile. He will also discuss what it would ideally mean to be a data scientist at Udemy.
The document summarizes the typical evolution of data processing at a startup company and provides details about data engineering at Udemy. It describes how companies initially struggle with data before establishing scalable data infrastructure and workflows. At Udemy, they use AWS Redshift as their data warehouse, ingest data from various sources using Python ETL pipelines scheduled through Pinball, and use Hadoop/EMR for batch processing and AWS Kinesis for real-time processing. Lessons learned include starting with batch processing, considering the type of data, and storing data in a log format for debugging.
The document discusses how GitLab.com builds its data services and products. It describes how GitLab.com uses its own DevOps platform to build an Enterprise Data Platform that analyzes data from GitLab.com. The data team faces challenges around scaling, visibility, and speed. To address these, the team takes actions like open sourcing tools, adopting DevOps practices, and establishing roles, processes, and technologies to build a trusted data model and framework. The key takeaways emphasize continuous iteration, discipline, automation, and living the company values.
ModCloth uses Tableau to enable stakeholders across the company to access and analyze data independently. By training stakeholders in Tableau, the data team is able to focus on more complex analyses while stakeholders can answer questions with same-day data. Some challenges in training include different skill levels and goals amongst stakeholders. ModCloth addresses this through tailored trainings and office hours. Since implementing stakeholder training, the data team spends less time on routine tasks and more on modeling and products while stakeholders complete over 200 additional requests per quarter in Tableau.
OBIEE, Endeca, Hadoop and ORE Development (on Exalytics) (ODTUG 2013) - Mark Rittman
A presentation from ODTUG 2013 on tools other than OBIEE for Exalytics, focusing on analysis of non-traditional data via Endeca, "big data" via Hadoop and statistical analysis / predictive modeling through Oracle R Enterprise, and the benefits of running these tools on Oracle Exalytics
Washington DC DataOps Meetup -- Nov 2019 - DataKitchen
This document discusses challenges with current data analytics practices and how adopting a DataOps approach can help address them. It notes that current practices often involve many people using complex, fragmented toolchains which results in high error rates, slow deployment speeds, and an inability to deliver insights at the speed of business. DataOps is presented as a way to transform data analytics by applying practices from DevOps and Lean manufacturing like continuous integration, monitoring, version control systems, and reusable components. The document provides a seven step framework for implementing DataOps along with additional considerations for architecture, metrics, and collaboration.
Joe Caserta was a featured speaker, along with MIT Sloan School faculty and other industry thought-leaders. His session 'You're the New CDO, Now What?' discussed how new CDOs can accomplish their strategic objectives and overcome tactical challenges in this emerging executive leadership role.
In its tenth year, the MIT CDOIQ Symposium 2016 continues to explore the developing role of the Chief Data Officer.
For more information, visit http://casertaconcepts.com/
This document provides recommendations for launching a successful advanced analytics program in a mature industry. It recommends:
1. Starting at the top by gaining executive support and framing analytics as a strategic capability.
2. Finding the right people by hiring data scientists and multi-disciplinary talent, as well as growing internal skills, but prioritizing industry knowledge over tool expertise.
3. Getting organized by establishing processes, connecting different parts of the organization, and promoting internal evangelism for analytics.
How the world of data analytics, science and insights is failing and how the principles from Agile, DevOps, and Lean are the way forward. #DataOps Given at DevOps Enterprise Summit 2019
Many companies start their big data and AI journey by hiring a team of data scientists, giving them some data, and expecting them to work miracles. Although this may yield results, it is not an efficient way to use data scientists. We will explain the problems that occur and how to adapt the context to get business value from data scientists.
- Why data science teams might fail to deliver results
- What data scientists need to be efficient
- What talent you need in addition to data scientists
Do Agile Data in Just 5 Shocking Steps! - DataKitchen
For over 10 years, we have been doing agile for software development yet people struggle to do agile for data, BI, and analytics. After a quick review of the agile manifesto and principles, this talk looks at which agile practices have worked for data and which are still hard. Then, with analyst requirements in mind, this talk reveals the 5 shocking steps to actually do agile with data.
Knowledge extraction and incorporation are currently considered beneficial for efficient Big Data analytics. Knowledge can take part in workflow design, constraint definition, parameter selection and configuration, and human-interactive decision-making strategies. Here we present BIGOWL, an ontology to support knowledge management in Big Data analytics. BIGOWL is designed to cover a wide vocabulary of terms concerning Big Data analytics workflows, including their components and how they are connected, from data sources to analytics visualization. It also takes into consideration aspects such as parameters, restrictions and formats. This ontology defines not only the taxonomic relationships between the different concepts, but also instances representing specific individuals to guide users in the design of Big Data analytics workflows. For testing purposes, two case studies are developed: first, real-world stream processing with Spark of traffic open data for route optimization in the urban environment of New York City; and second, data mining classification of an academic dataset on local/cloud platforms. The analytics workflows resulting from the BIGOWL semantic model are validated and successfully evaluated.
Moving Past Infrastructure Limitations - Presented by MediaMath
This presentation was given at a Big Data Warehousing Meetup with Caserta Concepts, MediaMath and Qubole. You can learn more about the event here: http://www.meetup.com/Big-Data-Warehousing/events/228372516/
Event description:
At Caserta Concepts, we are firm believers in big data thriving on the cloud. The instant-on, nearly unlimited storage and computing capabilities of AWS have made it the de facto solution for a full spectrum of organizations needing to process large amounts of data.
What's more, an ecosystem of value-added platforms has emerged to further ease and democratize the implementation of cloud based solutions. Qubole has developed a great platform for easily deploying and managing ephemeral and long-lived Hadoop and Spark clusters on AWS.
Moving Past Infrastructure Limitations: Data Warehousing at MediaMath
Over the past year and a half, MediaMath has undertaken a “data liberation” effort in an attempt to leave its big-box, monolithic data warehouse behind. In this talk, Rory Sawyer, Software Engineer at MediaMath, will describe how this effort transformed MediaMath’s legacy architecture and legacy mindset, which imposed harsh inefficiencies on data sharing and utilization. The current mindset removes these inefficiencies and allows them to say “yes” to more projects and ideas.
Rory will also demo how MediaMath uses Amazon Web Services and Qubole so that infrastructure is no longer a limiting factor on what and how users query. This combination allows them to scale their resources up and down as needed while bridging different data sources and execution engines. Using and extending MediaMath’s data warehousing is no longer a privileged activity but an ability that every employee and client has.
Data kitchen 7 agile steps - big data fest 9-18-2015 - DataKitchen
This document discusses applying agile principles and practices to data and analytics teams to address the complexity they face. It outlines seven steps to doing agile data work: 1) adding tests, 2) modularizing and containerizing work, 3) using branching and merging, 4) employing multiple environments, 5) giving analysts tools to experiment, 6) using simple storage, and 7) supporting small team, feature branch, and data governance workflows. The goal is to enable rapid experimentation and integration of new data sources through these agile practices adapted for analytics teams and their unique needs.
Kelly O'Briant - DataOps in the Cloud: How To Supercharge Data Science with a... - Rehgan Avon
2018 Women in Analytics Conference
https://www.womeninanalytics.org/
Over the last year I’ve become obsessed with learning how to be a better "cloud computing evangelist to data scientists" - specifically to the R community. I’ve learned that this isn’t often an easy undertaking. Most people (data scientists or not) are skeptical of changing up the tools and workflows they’ve come to rely on when those systems seem to be working. Resistance to change increases even further with barriers to quick adoption, such as having to teach yourself a completely new technology or framework. I’d like to give a talk about how working in the cloud changes data science and how exploring these tools can lead to a world of new possibilities within the intersection of DevOps and Data Analytics.
Topics to discuss:
- Working through functionality/engineering challenges with R in a cloud environment
- Opportunities to customize and craft your ideal version of R/RStudio
- Making and embracing a decision on what is “real” about your analysis or daily work (Chapter 6 in R for Data Science)
- Running multiple R instances in the cloud (why would you want to do this?)
- Becoming an R/Data Science Collaboration wizard: Building APIs with Plumber in the Cloud
In this Strata+Hadoop World 2015 presentation, Ron Bodkin, President of Think Big, a Teradata company, explains changes for data modeling on big data systems and five important new analytic patterns becoming more commonplace as companies grow their data driven capabilities.
1. Spil Games uses a bottom-up monthly forecasting process where ARIMA models in R generate initial traffic forecasts for 500 markets/channels, which are then loaded into Tableau for exploratory analysis and adjustment.
2. Key business users explore and modify the forecasts in Tableau before the adjusted forecast is loaded back into the data warehouse.
3. Forecasting considers factors like seasonality, known events, and regressors to predict metrics like traffic, gameplays, pageviews, and advertising across markets/channels on a monthly basis.
New times, new hype. Buzzwords like big data and Hadoop have given way to AI and machine learning. But it is neither technology, old or new, nor machine learning that separates the companies that get value from data from the companies that struggle.
When big data was at its peak, several young, technology-intensive companies succeeded in absorbing big data. They acquired large Hadoop clusters, learned to master data and created valuable products with machine learning. However, big data has had a limited impact at traditional companies, and the list of long and expensive data lake and Hadoop projects is long.
The key to implementing successful projects that transform data into business value is to democratise data - making it accessible and easy to use within an organisation.
seven steps to dataops @ dataops.rocks conference Oct 2019 - DataKitchen
The document outlines seven steps for implementing DataOps to improve data analytics projects: 1) orchestrate the data journey from access to production, 2) add automated tests and monitoring, 3) use version control for code, 4) enable branching and merging of code, 5) use multiple environments, 6) reuse and containerize components, and 7) parameterize processing. It also discusses three additional steps: data architecture, inter- and intra-team collaboration, and process analytics for measurement. The goal of DataOps is to increase project success rates by integrating testing, monitoring, collaboration and automation practices across the entire data and analytics workflow.
H2O for Medicine and Intro to H2O in Python - Sri Ambati
Erin LeDell presents on machine learning for medicine using the H2O platform. She discusses how electronic health records, genomic data, medical images, and data from wearables can be used with machine learning for applications like predictive diagnostics, prognosis, and remote patient monitoring. H2O is an open-source machine learning platform that provides algorithms like deep learning, random forests, and gradient boosting in an easy-to-use interface. She demonstrates an EEG example that predicts eye state from brain signals.
Watch the companion webinar for this presentation at http://embt.co/KLopez826. In this webinar, Karen Lopez of InfoAdvisors will cover 10 tips for the modern data architect and resources for coming up to speed on these new approaches. She will share how modern data modeling approaches address both SQL (relational) and NoSQL technologies. We'll look at the role of a data modeler, and how models, processes and data governance processes can add value to enterprise big data and NoSQL development projects.
Understanding DataOps and Its Impact on Application Quality - DevOps.com
Modern-day applications are data-driven and data-rich. The infrastructure your backends run on is a critical aspect of your environment and requires unique monitoring tools and techniques. In this webinar, learn what DataOps is and how critical good DataOps is to the integrity of your application. Intelligent APM for your data is critical to the success of modern applications. In this webinar you will learn:
The power of APM tailored for Data Operations
The importance of visibility into your data infrastructure
How AIOps makes data ops actionable
Smart companies know that business intelligence surfaces insights. With complex analytics, data mining and everything in between, it takes many moving parts to serve up the big picture. The key is to provide full-stack visibility into the entire BI environment, ensuring solid service and system performance.
Learn more at http://www.insideanalysis.com
Many think that data science is like a Kaggle competition. There are, however, big differences in approach. This presentation is about designing your evaluation scheme carefully to avoid overfitting and unexpected production performance.
This document provides an overview of big data concepts and technologies for managers. It discusses problems with relational databases for large, unstructured data and introduces NoSQL databases and Hadoop as solutions. It also summarizes common big data applications, frameworks like MapReduce, Spark, and Flink, and different NoSQL database categories including key-value, column-family, document, and graph stores.
Text Analytics & Linked Data Management As-a-Service - Marin Dimitrov
slides from the talk on "Text Analytics & Linked Data Management As-a-Service with S4" from the ESWC'2015 workshop on Semantic Web Enterprise Adoption & Best Practices
full paper available at http://2015.wasabi-ws.org/papers/wasabi15_1.pdf
ALLDATA 2015 - RDF Based Linked Data Management as a DaaS Platform - Seonho Kim
Suggests a way to manage a linked data platform for use by domain-specific applications.
Best paper award - http://www.iaria.org/conferences2015/AwardsALLDATA15.html
Denodo DataFest 2016: Comparing and Contrasting Data Virtualization With Data... - Denodo
Watch the full session: Denodo DataFest 2016 sessions: https://goo.gl/Bvmvc9
Data prep and data blending are terms that have come to prominence over the last year or two. On the surface, they appear to offer functionality similar to data virtualization…but there are important differences!
In this session, you will learn:
• How data virtualization complements or contrasts technologies such as data prep and data blending
• Pros and cons of functionality provided by data prep, data catalog and data blending tools
• When and how to use these different technologies to be most effective
This session is part of the Denodo DataFest 2016 event. You can also watch more Denodo DataFest sessions on demand here: https://goo.gl/VXb6M6
This document provides an overview of a course on building a real data product. The goals are to apply skills from previous courses, use a real-world dataset to build a predictive model, and present the completed project. Students will build a book recommendation system using Goodreads data. The workflow involves ingesting data, wrangling it, building a recommender model using matrix factorization, and creating reports with the results.
Advanced Analytics and Machine Learning with Data VirtualizationDenodo
Watch here: https://bit.ly/3719Bi7
Advanced data science techniques, like machine learning, have proven an extremely useful tool for deriving valuable insights from existing data. Platforms like Spark and complex libraries for R, Python and Scala put advanced techniques at the fingertips of the data scientists. However, these data scientists spend most of their time looking for the right data and massaging it into a usable format. Data virtualization offers a new alternative to address these issues in a more efficient and agile way.
Attend this webinar and learn:
- How data virtualization can accelerate data acquisition and massaging, providing the data scientist with a powerful tool to complement their practice
- How popular tools from the data science ecosystem: Spark, Python, Zeppelin, Jupyter, etc. integrate with Denodo
- How you can use the Denodo Platform with large data volumes in an efficient way
- About the success McCormick has had as a result of seasoning the Machine Learning and Blockchain landscape with data virtualization
Visualization of data is, no doubt, a key component of our industry. The path data travels from when it is created until it takes shape in a chart is sometimes obscure and overlooked, as it tends to live on the engineering side (when volume is relevant), an area that data scientists tend to visit but not the usual web/marketing data analyst. Nowadays the options to tame all that journey and make the best of it are many, and they don't require extensive engineering knowledge. Small or Big Data, let's see what "Store, Extract, Transform, Load, Visualize" is all about.
Advanced Project Data Analytics for Improved Project Delivery - Mark Constable
Data Analytics is already beginning to impact how projects are delivered. We can now automate minute taking and capturing actions, we can use Flow to progress chase, Power BI reduces the burden of reporting.
But we are just scratching the surface. It won’t be long before we can leverage the rich dataset of experience to predict what risks are likely to occur, understand which WBS elements will be susceptible to variance, deduce what the optimum resource profile looks like, and define a schedule by leveraging data from those projects that have gone before.
The role of a project professional is about to change dramatically. In this webinar we will explore the challenges and opportunities, and how we should respond. It’s a call-to-action for the community to mobilise, help to reshape project delivery and understand the implications for you and your organisation.
Presenter Martin Paver is a Chartered Project Professional, APM Fellow and Chartered Engineer. In December 2017 he established the London Project Data Analytics meetup, which has quickly spread across the UK and expanded to 3000+ members. Martin has major project experience, including leading a billion-dollar project with a team of 220 and a multi-billion PMO with a team of 50. He has a detailed grasp of project management and combines this with a broad understanding of recent developments in the field of data science. He is on a mission to ensure that the project management profession readies itself for a transformed future.
Learning outcomes:
- Understand the implications of advanced data analytics on project delivery
- Understand the scope of which functions it is likely to impact
- Help you to develop a strategy for how you engage with it
- Understand how to leverage the benefits and opportunities that will emerge from it
Presenter:
Martin Paver, CEO & Founder, Projecting Success Ltd
This presentation discusses data visualization tools and Kwantu's approach to data visualization for monitoring and evaluation. It covers why visualizing data is useful, where to start, different visualization tools for different experience levels and budgets, and Kwantu's approach of collecting validated data at the lowest level to enable real-time reporting and visualization without additional data processing. Kwantu's technology choices include tools for building data collection forms, taxonomies, dynamic data queries, workflows, report building, and planned integration with the Kibana visualization framework to create interactive visualizations updated in real-time.
Using text analytics to manage mobile qual data - Civicom - Merlien Institute
Presented by Mike Timmerman, Business Analyst, Civicom
at Market Research in the Mobile World Europe
8 - 11 October 2013, London, Europe
This event is proudly organised by Merlien Institute
Check out our upcoming events by visiting http://www.mrmw.net
This document provides an overview of Visual Analytics Session 3. It discusses data joining and blending in Tableau. Specifically, it explains why joining or blending data is necessary when data comes from multiple sources. It then describes the different types of data joins in Tableau - inner joins, left joins, right joins, and outer joins. An example is provided to demonstrate an inner join using a primary key to connect related data between two tables. The goal is to understand how to connect different but related data sources in Tableau using common keys or variables.
This document discusses how a company called Peloton Therapeutics uses Dotmatics software to extend its drug discovery capabilities. It provides examples of using Dotmatics Exe Runner and studies to perform docking simulations and generate patent tables. It also describes how data can be loaded into databases and pivoted for analysis. Key aspects covered include configuring Oracle triggers to initiate jobs, setting up studies forms, and using the Dotmatics Pivot Engine for unpivoted data.
This document provides an overview of a lecture on big data analytics given by Dr. Ching-Yung Lin. The key points covered in the lecture include:
- Definitions and characteristics of big data based on the 3V's of volume, velocity and variety.
- Techniques used for big data such as massive parallelism, distributed storage and processing, machine learning and data visualization.
- Factors that have enabled big data to become prominent in recent years like greater data collection, open source software and commodity hardware.
- Examples of big data platforms, databases and analytics techniques including Hadoop, Spark, NoSQL databases and graph databases.
- The large and growing market for big data
This document summarizes an event hosted by the Cincinnati Tableau User Group. The agenda includes a Tableau training session led by local users covering multiple Tableau topics and techniques. This will be followed by a Q&A panel where multiple Tableau users will answer questions about Tableau Desktop and Server. The document provides details on the event hosts, trainers, topics to be covered in the training, and information on how to join the Cincinnati Tableau User Group.
Tableau Drive, a new methodology for scaling your analytic culture - Tableau Software
Tableau Drive is a methodology for scaling out self-service analytics. Drive is based on best practices from successful enterprise deployments. The methodology relies on iterative, agile methods that are faster and more effective than traditional long-cycle deployment. A cornerstone of the approach is a new model of a partnership between business and IT.
The Drive Methodology is available for free. Some organizations will choose to execute Drive themselves; others will look to Tableau Services or Tableau Partners for expert help.
Data is growing exponentially. What should business managers do to make better business decisions? I explain three key things step by step. Just start today!
Lean Analytics is a set of rules to make data science more streamlined and productive. It touches on many aspects of what a data scientist should be and how a data science project should be defined to be successful. During this presentation Richard will show where data science projects go wrong, how you should think of data science projects, what constitutes success in data science, and how you can measure progress. This session will be loaded with terms, stories and descriptions of project successes and failures. If you're wondering whether you're getting value out of data science, how to get more value out of it, and even whether you need it, then this talk is for you!
What you will take away from this session
Learn how to make your data science projects successful
Evaluate how to track progress and report on the efficacy of data science solutions
Understand the role of engineering and data scientists
Understand your options for processes and software
Presentation: Study: #Big Data in #Austria, Mario Meir-Huber, Big Data Leader Eastern Europe, Teradata GmbH & Martin Köhler, Austrian Institute of Technology, AIT (AT), at the European Data Economy Workshop, held back-to-back with SEMANTiCS2015 on 15 September 2015 in Vienna.
This document summarizes the progress made by the National Information Standards Organization (NISO) in developing standards for new metrics in scholarship, known as altmetrics. It discusses how NISO held discussions and meetings with over 400 contributors to brainstorm ideas and reach consensus on key elements needed to build trust in metrics, including defining what is counted, how it is identified, aggregation procedures, and data exchange standards. The goal is to establish standardized approaches and definitions that can facilitate consistent measurement and comparison of the broader impacts of scholarly work.
1. CITA’15 Workshop, August 2015
Semantic Enrichment of Unstructured Datasets
Bebo White
SLAC National Accelerator Laboratory / Stanford University
bebo@slac.stanford.edu
3. CITA’15 Workshop, August 2015
Workshop Agenda (1/3)
• Overview of “Big Data Analytics”
• Goals
• Common challenges
• Examples and applications
• What is missing
• Big Data and Open Data
• Characteristics of open (and semantic) data
• Usage
• Challenges
• Processes
4. CITA’15 Workshop, August 2015
Workshop Agenda (2/3)
• Semantically describing data
• Ontologies and namespaces
• Data triples
• Triplification
• Introduction to RDF(S)
• Case Study - FOAF
• Merging RDF data
• RDF tools
5. CITA’15 Workshop, August 2015
Workshop Agenda (3/3)
• PingER as a triplification case study
• Introduction to project
• PingER LOD
• Data model and process
• PingER LOD “data bloating”
• How PingER LOD extends PingER
• Summary and lessons learned
6. CITA’15 Workshop, August 2015
Workshop Format
• A workshop, not a tutorial
• Goal is to introduce concepts and terminology and provoke future research
• Must be very interactive - questions/discussion at any time
• Individual and group exercises
• Length of workshop depends on involvement
7. CITA’15 Workshop, August 2015
“High-volume, -velocity, and -variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making”
(Gartner’s definition of Big Data)
8. CITA’15 Workshop, August 2015
• Volume?
• ~ data volume worldwide in 2013 = 3.5 ZB (including 400 billion feature-length HD movies)
• Velocity?
• Every 60 sec. on Facebook - 510K posted comments; 293K status updates; 136K uploaded photos
• 30 billion shares
• 20 million apps installed
9. CITA’15 Workshop, August 2015
• Variety?
• Any type of data, both meaningful and meaningless
• Veracity?
• How is trust established?
• What does “like” really mean?
10. CITA’15 Workshop, August 2015
Evaluating “the V’s”
• A recent survey conducted by Paradigm4 indicates:
• variety, not volume, is the bigger challenge of analyzing Big Data - 71% of respondents
• Data Scientists aren’t terribly concerned with the “size” of the data currently being analyzed - tools and systems are in place to work with large datasets
• storing large amounts of structured (or semi-structured) data is not the problem; analysis is
11. CITA’15 Workshop, August 2015
Common Challenges of Harnessing Big Data
• Mining huge (?) datasets
• Shortages of Big Data experts
• Privacy, legal, and social issues
• Strategies for acquiring Big Data - a new form of currency
• BUT
12. CITA’15 Workshop, August 2015
“The theory is that you pump Big Data into the ‘black box’ of an analytics engine - most likely hidden on some unknown server in the cloud - and you get back a continuous stream of insights”
13. CITA’15 Workshop, August 2015
“When you have large amounts of data your appetite for hypotheses tends to get larger. And if it’s growing faster than the statistical strength of the data, then many of your inferences are likely to be false. They are likely to be ‘white noise.’ We have to have error bars around our predictions.”
- Michael Jordan
14. CITA’15 Workshop, August 2015
Why is “Bigger Data” Better?
• Outliers or small clusters
• Rare discrete values or classes
• Missing values
• Rare events or objects
17. CITA’15 Workshop, August 2015
Unstructured Data
• Does not have a pre-defined data model or is not organized in a pre-defined manner
• Typically text-heavy, but may contain data such as dates, numbers, and facts
• May result in irregularities and ambiguities that make it difficult to understand using traditional programs
(Ref: Wikipedia)
18. CITA’15 Workshop, August 2015
Typical Big Data Problem
• Iterate over a large number of records
• Extract something of interest from each (MAP)
• Shuffle and sort intermediate results
• Aggregate intermediate results (REDUCE)
• Generate final output
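To make the MAP, shuffle/sort, and REDUCE steps concrete, here is a minimal single-machine sketch in Python (not Hadoop; the input records are hypothetical):

```python
from collections import defaultdict

records = ["ping ok", "ping timeout", "ping ok"]  # hypothetical input records

# MAP: extract something of interest from each record
mapped = [(word, 1) for line in records for word in line.split()]

# SHUFFLE/SORT: group the intermediate (key, value) pairs by key
groups = defaultdict(list)
for key, value in sorted(mapped):
    groups[key].append(value)

# REDUCE: aggregate each group, then generate the final output
counts = {key: sum(values) for key, values in groups.items()}
print(counts)  # {'ok': 2, 'ping': 3, 'timeout': 1}
```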
21. CITA’15 Workshop, August 2015
MapReduce Implementations
• Google has a proprietary implementation in C++
• Bindings in Java, Python
• Hadoop is an open-source implementation in Java
• Development led by Yahoo!, now an Apache project
• Used in production at Yahoo!, Facebook, Twitter, LinkedIn, Netflix, etc.
• The de facto Big Data processing platform
• Lots of custom research implementations
22. CITA’15 Workshop, August 2015
An Interesting Example - “Sentiment Analysis”
• Goal - gauging mood on social network data
• Not a traditional survey or focus group
• Social sites operate 24/7
• Timeliness - not subject to time lags
• Useful to marketers, IT, customers, etc. - a limited (not general) sector
23. CITA’15 Workshop, August 2015
Difficult Comment Analysis (1/2)
• False negatives - “crying” & “crap” (negative) vs. “crying with joy” & “holy crap!” (positive)
• Relative sentiment - “I bought a Honda Accord” - great for Honda, bad for Toyota
• Compound sentiment - “I love the phone but hate the network”
• Conditional sentiment - “If someone doesn’t call me back, I’m never doing business with them again!”
24. CITA’15 Workshop, August 2015
Difficult Comment Analysis (2/2)
• Scoring sentiment - “I like it” vs. “I really like it” vs. “I love it”
• Sentiment modifiers - “I bought an iPhone today :-)” “Gotta love the telephone company ;-<”
• International/cultural sentiments
• Japanese - unique emoticons for crying - (;_;)
• Italians - effusive, grandiose
• British - drier, less effusive
28. CITA’15 Workshop, August 2015
There is a 5th “V”: VALUE
Despite its volume, veracity, etc., what does it really give us?
How can we extract insight/knowledge?
33. CITA’15 Workshop, August 2015
Looking back…
• One of the great (IMHO) insights in Web 2.0 was developing mashups
• Supported the process of converting data to knowledge/insight
• Usually done in an ad hoc manner, e.g., “screen scraping”
• Sometimes done with APIs
34. CITA’15 Workshop, August 2015
Is it possible to do “Data Programming”?
• Can processes extract from data pools the same insights that humans do?
• How do humans process collections of data?
36. CITA’15 Workshop, August 2015
• Big Data (and even “not so Big Data”) tends to be unstructured data (e.g., lists, e-mails, tweets, etc.)
• Therefore it tends to be “thin” rather than “thick”
• “Thin” means very little (if any) context - just data, little knowledge
• What can be added to change data from “thin” to “thick”?
37. CITA’15 Workshop, August 2015
9 Steps to Extract Insight from Unstructured Data (1/2)
1. Make sense of the disparate data sources*
2. Sign off on the method of analytics and find a clear way to present the results
3. Decide the technology stack for data ingestion and storage
4. Keep information in a data lake until it has to be stored in a data warehouse
5. Prepare the data for storage
38. CITA’15 Workshop, August 2015
9 Steps to Extract Insight from Unstructured Data (2/2)
6. Retrieve useful information
7. Ontology evaluation*
8. Statistical modeling and execution
9. Obtain insight from the analysis and visualize it*
40. CITA’15 Workshop, August 2015
Linked Data
• Provides access to the semantics of data items
• Based upon Semantic Web technologies and ontologies
• Designed for machines first and humans later
• Degree of structure in descriptions of things is high
42. CITA’15 Workshop, August 2015
Linked Data Pros
• Far more “parseable” and “machine processable” than raw unstructured data
• Enhances data descriptions for complex analyses
• Can contribute to the VERACITY of our data
• Wide variety of discipline/data ontologies available
43. CITA’15 Workshop, August 2015
Linked Data Cons
• Much harder to do than adding keyword metadata
• Building efficient processing applications and parsers
• Implementing effective linked data stores
44. CITA’15 Workshop, August 2015
Linked Open Data
• LOD refers to data stores of Linked Data that are published (made available online and accessed via URLs) and free to use
• Open data means it must be available to all without copyright or ownership
• There is an increasing trend towards “opening” government data (US and UK, San Francisco and more) and scientific results
• Provides unprecedented ability to build “mashup” applications
46. CITA’15 Workshop, August 2015
How do we do this?
• By defining unambiguously the relationships between data items
• By using a shared definition and meaning mechanism
• By expressing the semantics and syntax inherent in the data
55. CITA’15 Workshop, August 2015
Fundamental Concepts (1/2)
• Modeling - making sense of unorganized information/data
• Formality/Informality - the degree to which the meaning of a modeling language is given independent of the particular speaker or audience
56. CITA’15 Workshop, August 2015
Fundamental Concepts (2/2)
• Commonality and Variability - how to manage things in common and some with important differences
• Expressivity - the ability of a modeling language to express maximum variety in the model
57. CITA’15 Workshop, August 2015
Tabular Data About Elizabethan Literature and Music

ID  Title           Author       Medium  Year
1   As You Like It  Shakespeare  Play    1599
2   Hamlet          Shakespeare  Play    1604
3   Othello         Shakespeare  Play    1603
4   “Sonnet 78”     Shakespeare  Poem    1609
60. CITA’15 Workshop, August 2015
Ontology/Vocabulary (1/2)
• Provides a common background and understanding of a particular domain or field of study, and ensures a common ground among those who study the information
• A way of organizing concepts, information, and ideas that is meant to be universal within the field and allows for a common language to be spoken
61. CITA’15 Workshop, August 2015
Ontology/Vocabulary (2/2)
• A structural framework that allows concepts to be laid out in a way that makes sense
• Shows the connections and relationships between concepts in a manner that is generally accepted by the field
64. CITA’15 Workshop, August 2015
Sample Triples
Shakespeare wrote King Lear
Shakespeare wrote Macbeth
Anne Hathaway married Shakespeare
Shakespeare livedIn Stratford
Stratford isIn England
Macbeth setIn Scotland
England partOf UK
Scotland partOf UK
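These triples can be loaded into an in-memory RDF graph with Python’s rdflib library; a minimal sketch, assuming a made-up lit: namespace URI (not part of the workshop materials):

```python
from rdflib import Graph, Namespace

LIT = Namespace("http://example.org/elizabethan/")  # hypothetical namespace

g = Graph()
g.add((LIT.Shakespeare, LIT.wrote, LIT.KingLear))
g.add((LIT.Shakespeare, LIT.wrote, LIT.Macbeth))
g.add((LIT.AnneHathaway, LIT.married, LIT.Shakespeare))
g.add((LIT.Shakespeare, LIT.livedIn, LIT.Stratford))
g.add((LIT.Stratford, LIT.isIn, LIT.England))

g.bind("lit", LIT)                   # use the lit: prefix when serializing
print(g.serialize(format="turtle"))  # human-readable Turtle output
```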
66. CITA’15 Workshop, August 2015
Linked Data Technology Stack
• URIs - Uniform Resource Identifiers (a generalization of URLs)
• HTTP - HyperText Transfer Protocol
• RDF - Resource Description Framework
• RDFS/OWL - RDF Schema/Web Ontology Language
67. CITA’15 Workshop, August 2015
Linked Data Principles (1/2)
• Use URIs as names of things
• Anything, not just documents
• Information resources and non-information resources
• Use HTTP URIs
• Globally unique names, distributed ownership
• Allows people to look up those names
68. CITA’15 Workshop, August 2015
Linked Data Principles (2/2)
• Provide useful information in RDF
• When someone (or something) looks up a URI
• Include RDF links to other URIs
• To enable discovery of related information
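In practice, “looking up” an HTTP URI can be as simple as asking an RDF library to fetch and parse it. A sketch with rdflib, assuming the DBpedia URI below still dereferences to RDF via content negotiation:

```python
from rdflib import Graph

g = Graph()
# Dereference a linked data URI; the server responds with RDF about the resource
g.parse("http://dbpedia.org/resource/William_Shakespeare")

# Print every triple, including RDF links out to other URIs
for subj, pred, obj in g:
    print(subj, pred, obj)
```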
69. CITA’15 Workshop, August 2015
Plays of Shakespeare with Qnames

Subject          Predicate  Object
lit:Shakespeare  lit:wrote  lit:Hamlet
lit:Shakespeare  lit:wrote  lit:Othello
lit:Shakespeare  lit:wrote  lit:WintersTale
…                …          …
71. CITA’15 Workshop, August 2015
Triples Referring to URIs with a Variety of Namespaces

Subject           Predicate    Object
lit:Shakespeare   lit:wrote    lit:Hamlet
bio:AnneHathaway  bio:married  bio:Shakespeare
geo:Stratford     geo:isIn     geo:England
geo:England       geo:partOf   geo:UK
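Each qname is just shorthand for a full URI in some namespace; a small rdflib sketch with invented placeholder namespace URIs:

```python
from rdflib import Namespace

# Placeholder URIs standing in for published vocabularies
LIT = Namespace("http://example.org/literature/")
GEO = Namespace("http://example.org/geography/")

# A qname such as lit:wrote expands to a full URI
print(LIT.wrote)      # http://example.org/literature/wrote
print(GEO.Stratford)  # http://example.org/geography/Stratford
```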
72. CITA’15 Workshop, August 2015
Reengineering process - data to data triples (“triplification”)
[Diagram: define/acquire the data source and its meta-model; define/acquire a mapping description; apply reengineering. Data Source + Mapping yields an RDF Dataset.]
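A minimal sketch of that mapping-driven reengineering step in Python with rdflib; the inline CSV source, the column-to-predicate mapping, and the namespace are all hypothetical:

```python
import csv
from io import StringIO
from rdflib import Graph, Literal, Namespace

EX = Namespace("http://example.org/data/")  # hypothetical namespace

# Data source (an inline CSV here) plus a mapping description
source = StringIO("id,title,year\n1,Hamlet,1604\n2,Othello,1603\n")
mapping = {"title": EX.title, "year": EX.year}  # column -> predicate

# Apply reengineering: each row becomes a subject, each mapped column a triple
g = Graph()
for row in csv.DictReader(source):
    subject = EX["work/" + row["id"]]
    for column, predicate in mapping.items():
        g.add((subject, predicate, Literal(row[column])))

print(g.serialize(format="turtle"))  # the resulting RDF dataset
```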
73. CITA’15 Workshop, August 2015
RDF and the Semantic Web
• Supports the goal of the Semantic Web
• Web information/data should have exact and unambiguous meaning
• Web information/data can be understood and processed by computers
• Computers can integrate information/data from multiple sources on the Web
74. CITA’15 Workshop, August 2015
What is RDF?
• Resource Description Framework
• Provides a model for data, and a syntax, so that independent parties can exchange and use it
• Designed mainly to be read and understood by computer processors, not humans
• Written in XML
• A W3C Recommendation
• Any XML processor or parser can use it
76. CITA’15 Workshop, August 2015
Basic Ideas Behind RDF
• RDF uses Web identifiers (URIs) to identify resources
• RDF describes resources with properties and property values
• Everything is represented as triples
• The essence of RDF is the (s,p,o) triple
77. CITA’15 Workshop, August 2015
RDF Data Model
• Any expression in RDF is a collection of triples (subject, predicate, object)
• A set of triples is called an RDF graph
• The nodes of an RDF graph are its subjects and objects
• Direction is important - an edge always points to the object
• An assertion of an RDF triple says the relationship (as indicated by the predicate) holds between subject and object
• The meaning of an RDF graph is the conjunction (AND) of the statements corresponding to all the triples it contains
• RDF does not provide means to express negation (NOT) or disjunction (OR)
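Because a graph’s meaning is the conjunction of its triples, merging two RDF graphs is simply taking the union of their triple sets; a sketch with rdflib (hypothetical namespaces as before):

```python
from rdflib import Graph, Namespace

LIT = Namespace("http://example.org/literature/")
GEO = Namespace("http://example.org/geography/")

g1, g2 = Graph(), Graph()
g1.add((LIT.Shakespeare, LIT.livedIn, GEO.Stratford))
g2.add((GEO.Stratford, GEO.isIn, GEO.England))

# Merging = set union of triples; the result asserts both statements (AND)
merged = g1 + g2
print(len(merged))  # 2
```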
78. CITA’15 Workshop, August 2015
RDF Design Goals
• Having a simple data model
• Having formal semantics and provable inference
• Using an extensible URI-based vocabulary
• Using an XML-based syntax
• Supporting use of XML Schema datatypes
• Allowing anyone to make statements about any resource
80. CITA’15 Workshop, August 2015
Case Study - FOAF
• Friend-of-a-Friend
• A linked data description of a person
• More than just a blog or personal Web page
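A small sketch of a FOAF description using rdflib, which ships the real FOAF vocabulary; the person URI, homepage, and friend below are made up:

```python
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import FOAF, RDF

g = Graph()
me = URIRef("http://example.org/people/bebo#me")  # hypothetical person URI

g.add((me, RDF.type, FOAF.Person))
g.add((me, FOAF.name, Literal("Bebo White")))
g.add((me, FOAF.homepage, URIRef("http://example.org/~bebo/")))
# foaf:knows links are what make FOAF “friend of a friend”
g.add((me, FOAF.knows, URIRef("http://example.org/people/alice#me")))

print(g.serialize(format="turtle"))
```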
90. CITA’15 Workshop, August 2015
Semantic Mashups
• A mashup application using Semantic Web technologies inside
• Supplements Web 2.0 mashups by adding access to semantic data sources
• Can be either client-side or server-side
93. CITA’15 Workshop, August 2015
Case Study: PingER
• PingER (Ping End-to-end Reporting)
• Uses the Internet ping facility to monitor the performance of Internet links worldwide
• Measures:
• Short- and long-term RTT
• Packet loss percentages
• Jitter
• Lack of reachability (no response to ping)
• Throughput and quality of IP telephony (VoIP)
98. CITA’15 Workshop, August 2015
PingER Monitor Node Format (original)

Monitor Host Name        Monitor Address  Remote Name  Remote Address  Bytes  Time       Xmt  Rcv  Min  Avg  Max
minos.slac.stanford.edu  134.79.196.100   www.lbl.gov  128.3.7.14      100    870393602  10   10   6    18   125
99. CITA’15 Workshop, August 2015
• Bytes - can be 100 or 1000 (min 100); number of bytes in each ping packet
• Time - Unix epochal time, in GMT (UTC)
• Xmt - number of ping packets sent
• Rcv - number of ping packets received
• Min - minimum response time for packets sent (in milliseconds)
• Avg - average response time for packets sent (in milliseconds)
• Max - maximum response time for packets sent (in milliseconds)
100. CITA’15 Workshop, August 2015
PingER Monitor Node Format (revised)
• Same as original, plus:
• for each ping response, the sequence number is recorded
• for each ping, the RTT (round-trip time) is recorded
101. CITA’15 Workshop, August 2015
PingER Rules
• There should always be >7 tokens in the line
• If <=7 tokens, the site is considered unreachable
• If no responses to the pings are received, there are only 8 tokens and Rcv (the 8th token) will be 0
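A hedged sketch of a parser that applies those token-count rules to one measurement line; the field order follows the original-format table above, and everything beyond that layout is assumed:

```python
def parse_pinger_line(line):
    """Parse one PingER measurement line, applying the token-count rules."""
    tokens = line.split()
    if len(tokens) <= 7:   # rule: <=7 tokens means the site was unreachable
        return None
    record = {
        "monitor_name": tokens[0], "monitor_addr": tokens[1],
        "remote_name": tokens[2], "remote_addr": tokens[3],
        "bytes": int(tokens[4]), "time": int(tokens[5]),
        "xmt": int(tokens[6]), "rcv": int(tokens[7]),
    }
    if record["rcv"] > 0 and len(tokens) >= 11:  # Min/Avg/Max only if replies arrived
        record["min"], record["avg"], record["max"] = map(int, tokens[8:11])
    return record

line = ("minos.slac.stanford.edu 134.79.196.100 www.lbl.gov 128.3.7.14 "
        "100 870393602 10 10 6 18 125")
print(parse_pinger_line(line))
```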
106. CITA’15 Workshop, August 2015
Possible Uses of PingER Data
• Technical
• Economical
• Troubleshooting
• Collaboration
• Quantifying the impact of events
• Routing
107. CITA’15 Workshop, August 2015
Workshop Exercise
• Given a table of (unstructured) data
• Produce an RDF graph that reflects the content in such a way that the information intent is preserved but the data is now available for RDF operations such as merging with other linked datasets and RDF query (one possible shape is sketched after this list)
• Think of new applications that this “triplification” might add to the use of PingER data, and what parties might be interested
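One possible shape for such a graph, sketched with rdflib together with a SPARQL query over it; the pinger: namespace and property names are invented for illustration, not the project’s actual LOD vocabulary:

```python
from rdflib import Graph, Literal, Namespace

PING = Namespace("http://example.org/pinger/")  # hypothetical vocabulary

g = Graph()
m = PING["measurement/870393602"]  # one measurement, keyed by its timestamp
g.add((m, PING.monitorNode, PING["node/minos.slac.stanford.edu"]))
g.add((m, PING.remoteNode, PING["node/www.lbl.gov"]))
g.add((m, PING.avgRTT, Literal(18)))

# Once triplified, the data supports RDF query (SPARQL) and graph merging
results = g.query("""
    PREFIX pinger: <http://example.org/pinger/>
    SELECT ?m ?rtt WHERE { ?m pinger:avgRTT ?rtt }
""")
for row in results:
    print(row.m, row.rtt)
```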
122. CITA’15 Workshop, August 2015
What Did We Do?
[Diagram: data in various formats is mapped and exposed as data represented in an abstract (RDF) format, which applications can then manipulate, query, etc.]