seevl: Data-driven music discovery
Alexandre Passant, co-founder, CEO, MDG Web ltd
http://seevl.net // @seevl // alex@seevl.net // @terraces

LA SemWeb & WebSpeed Meet-up, 2 October 2012
Cross Campus, Santa Monica
a bit of background...
• Knowledge Engineering
• Social Web & Enterprise 2.0
• Sensor Networks & Real-Time
architecture
[Figure: RDF graph linking a user's interest to DBpedia resources]
Nodes: :alex, dbpedia:Beastie_Boys, dbpedia:Bad_Brains, dbpedia:Hardcore_Punk, dbpedia:Black_Flag_(band), dbpedia:Adam_Yauch, dbpedia:B._B._King, dbpedia:Category:American_vegetarians
Edges: foaf:topic_interest, p:associatedActs, p:genre (x2), p:currentMembers, skos:subject (x2)
Our approach: SLADE

• Semantic LAyer for Data Exploration
 • A framework to build data-driven apps
 • ETL from existing sources / APIs
 • Search, discovery, recommendations
 • Data access / API
 • Generic, config-based, domain-agnostic
The pipeline

[Diagram, read as a flow]
Web data sources
 -> Data extraction and interlinking
 -> Storage: entity-centric semantic knowledge base
    (artists, genres, labels, locations...)
 -> Search, discovery and recommendation engine, on top of our graph database
 -> REST-ful interface
 -> seevl products
Challenges
• Some technical challenges faced when building
  SLADE and seevl.net
 • Data models: Choosing the right schemas
 • Data access: SPARQL or API or ... ?
 • Scalability: Caching and optimisation strategies
 • User Experience: User-centric design
data models
RDF since day one
• RDF ?
 • Agile model (ideal when iterating)
 • Intuitive aspect of graph modelling
 • Standard toolkits (SPARQL / HTTP)
• OWL? RDFS?
 • Minor use of inference (type, hierarchies)
Artist data
• Music Ontology
 • Label, Genres, Influences, Origins ...
 • Collaborations between artists
 • Activity period (add-on)
• Additional models/mappings
 • e.g. Bio Vocabulary (birth/death), FOAF...
Social activities
• SIOC & SIOC-actions
 • Social graph / sub-graph
 • Action-centric activities (like, listen)
• Inferring user’s taste profile
 • Top artist, genres, labels
 • Using latest actions
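One way to read "inferring a taste profile from the latest actions" is a simple count over a user's most recent like/listen events. The sketch below is only an illustration under that assumption: the action tuples, the `taste_profile` function, and the `window` and `top` parameters are hypothetical and not taken from seevl.

```python
from collections import Counter

def taste_profile(actions, artist_genres, window=50, top=3):
    """Return the top artists and genres among a user's latest actions.

    actions: list of (timestamp, verb, artist) tuples, in any order.
    artist_genres: dict mapping artist -> list of genres.
    """
    # Keep only the most recent `window` actions.
    latest = sorted(actions, key=lambda a: a[0], reverse=True)[:window]
    artists = Counter(artist for _, _, artist in latest)
    genres = Counter(
        genre
        for _, _, artist in latest
        for genre in artist_genres.get(artist, [])
    )
    return {
        "artists": [a for a, _ in artists.most_common(top)],
        "genres": [g for g, _ in genres.most_common(top)],
    }
```

The same counting could be extended to labels by passing a second lookup table.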
Similarity / Recsys
• Graph-based similarities
 • Data-driven recommendations
 • Ranking using weight-factors
 • Explanations / tracking
• The Similarity Ontology
 • Domain-agnostic
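A graph-based similarity with weight factors and explanations could look roughly like the following. This is a minimal sketch, assuming each entity is reduced to a dict of property -> set of values; the `similarity` function and the weights are hypothetical, not the ranking seevl actually uses.

```python
def similarity(a, b, weights):
    """Score two entities by the property-values they share.

    a, b: dicts mapping property name -> set of values.
    weights: dict mapping property name -> weight factor (defaults to 1.0).
    Returns (score, explanation); explanation lists shared values per property.
    """
    score = 0.0
    explanation = {}
    for prop in a.keys() & b.keys():
        shared = a[prop] & b[prop]
        if shared:
            score += weights.get(prop, 1.0) * len(shared)
            explanation[prop] = sorted(shared)
    return score, explanation
```

Returning the shared values alongside the score is what makes the recommendation explainable ("both are tagged with this genre").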
Provenance
• Keep trace of every statement in the ETL
 • Origin, type and time of extraction
• With a low number of additional triples
 • Introducing “data-slices”
 • Multiple slices (=subgraphs) per resource
 • Quick updates (DELETE / INSERT)
Provenance and graphs
GRAPH svl:seevl_id/wikipedia/facts/extract
{
    svl:seevl_id mo:genre svl:BntvuZAy .
    svl:seevl_id/wikipedia/extract dc:created
    "2012-10-25" ; rdfs:seeAlso
    wikipedia:Social_Distortion .
}
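Because each slice lives in its own named graph, refreshing it can be a single delete-then-insert request. The following is a sketch under that assumption, using standard SPARQL 1.1 Update syntax; the `replace_slice` helper and the example IRIs are hypothetical.

```python
def replace_slice(graph_uri, triples):
    """Build a SPARQL 1.1 Update request that rewrites one data-slice.

    graph_uri: full IRI of the named graph holding the slice.
    triples: list of statements in N-Triples style, without the trailing dot.
    """
    body = " .\n    ".join(triples)
    return (
        # Drop every statement currently in the slice...
        f"DELETE WHERE {{ GRAPH <{graph_uri}> {{ ?s ?p ?o }} }} ;\n"
        # ...then write the freshly extracted ones.
        f"INSERT DATA {{\n  GRAPH <{graph_uri}> {{\n    {body} .\n  }}\n}}"
    )
```

The returned string would be sent to the store's update endpoint; no escaping or validation is done here.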
data access
SPARQL
• Pros
 • W3C Standard, Powerful
 • HTTP-based w/ SPARQL Protocol
 • SPARQL Update in 1.1
• Cons
 • Learning curve for non-RDF people
URI patterns + JSON-LD
 • Pre-defined URIs mapped to SPARQL
   query patterns, returning JSON-LD data
  • Search queries or resources description
  • Content-negotiation or ?_format=json
 • GET and POST
  • POST => SPARQL UPDATE
  • GET => SPARQL SELECT / ASK
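The mapping from URI patterns and HTTP methods to SPARQL could be sketched as a small dispatch table. Everything here is an assumption for illustration: the `PATTERNS` table, the POST pattern on `/facts`, and the `to_sparql` helper are hypothetical, and the prefixes are assumed to be declared elsewhere.

```python
import re

# Hypothetical pattern table: (HTTP method, URI regex, SPARQL template).
PATTERNS = [
    ("GET", re.compile(r"^/entity/(?P<id>\w+)/facts$"),
     "SELECT ?p ?o WHERE {{ svl:{id} ?p ?o }}"),
    ("POST", re.compile(r"^/entity/(?P<id>\w+)/facts$"),
     "INSERT DATA {{ svl:{id} {p} {o} }}"),
]

def to_sparql(method, path, **params):
    """Return the SPARQL string for the first pattern matching method and path."""
    for pattern_method, regex, template in PATTERNS:
        match = regex.match(path)
        if pattern_method == method and match:
            # Values are substituted as-is; a real service would validate them.
            return template.format(**match.groupdict(), **params)
    raise LookupError(f"no pattern for {method} {path}")
```

GET requests map to read queries and POST requests to updates, so client developers never write SPARQL themselves.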
JSON-LD

• JSON for Linking Data
 • The best of both worlds
 • JSON serialization, works with any parser
 • Additional semantics (URIs, typed links,
    etc.) with JSON-LD parsers
 • Use of context/mappings to avoid URIs
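To illustrate the "best of both worlds" point, the sketch below reads a JSON-LD document with the standard `json` module and then uses its `@context` to recover full property IRIs. The document content and the `expand_key` helper are made up for illustration; the helper is a hand-rolled lookup, not the full JSON-LD expansion algorithm.

```python
import json

doc = json.loads("""
{
  "@context": {
    "prefLabel": "http://www.w3.org/2004/02/skos/core#prefLabel",
    "genre": {"@id": "http://purl.org/ontology/mo/genre", "@type": "@id"}
  },
  "@id": "http://example.org/entity/artist-1",
  "prefLabel": "The Clash",
  "genre": "http://example.org/entity/genre-1"
}
""")

# A plain JSON consumer just reads the short keys.
label = doc["prefLabel"]

def expand_key(document, key):
    """Resolve a short key to its property IRI using the document's @context."""
    term = document["@context"][key]
    return term["@id"] if isinstance(term, dict) else term
```

The `"@type": "@id"` entry tells JSON-LD-aware parsers that the value of `genre` is a link to another resource rather than a string.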
Search

• /entity/?property=value
    • JSON-LD mappings used in URI templates
    • Works with literals, dates, resources
    • Ranking algorithm / alpha-ranking
    • Patterns defined in a single config file
Search (text)
• /entity/?prefLabel=clash&type=artist&_sort=count_desc
• Translated into
    SELECT ?x WHERE {
        ?x a mo:artist ; skos:prefLabel ?label .
        ?label bif:contains "clash" .
    }
Search (relations)
• /entity/?genre=BntvuZAy&type=artist
• Translated into
   SELECT ?x WHERE {
       ?x a mo:artist ; mo:genre svl:BntvuZAy .
   }
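The two translations above can be sketched as one config-driven function. This is an illustration only: the `MAPPINGS` and `TYPES` tables and `search_to_sparql` are hypothetical names, `bif:contains` is specific to Virtuoso, and values are not escaped.

```python
# Hypothetical config: query parameter -> (SPARQL property, kind of value).
MAPPINGS = {
    "genre": ("mo:genre", "resource"),
    "prefLabel": ("skos:prefLabel", "text"),
}
# Hypothetical config: value of the `type` parameter -> class.
TYPES = {"artist": "mo:artist"}

def search_to_sparql(params):
    """Translate search parameters into a SPARQL SELECT query string."""
    patterns = []
    if "type" in params:
        patterns.append(f"?x a {TYPES[params['type']]} .")
    for i, (name, value) in enumerate(sorted(params.items())):
        if name not in MAPPINGS:
            continue  # e.g. type, _sort: handled elsewhere
        prop, kind = MAPPINGS[name]
        if kind == "resource":
            patterns.append(f"?x {prop} svl:{value} .")
        else:
            # Full-text match on the literal bound to a fresh variable.
            patterns.append(f'?x {prop} ?v{i} . ?v{i} bif:contains "{value}" .')
    return "SELECT ?x WHERE { " + " ".join(patterns) + " }"
```

Adding a new searchable property then only means adding one entry to the mapping table.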
Resource description
• Patterns mapped to resource URI to
  retrieve subset of the resource description
 • /entity/seevl_id/infos
 • /entity/seevl_id/facts
 • /entity/seevl_id/links
 • /entity/seevl_id/related(/related_id)
scalability
Is SPARQL fast enough?
• SPARQL is very powerful, but can be slow
 • Some simple queries may lead to deep
    graph patterns or traversal queries
    depending on the modelling
 • FILTERS (e.g. text and date based queries)
    are expensive
 • Not all triple-stores are equal
Splitting queries
• “List all resources sharing common
  property-values with the current one,
  whatever that property is”
 • Fits in a single SPARQL query
 • Doesn’t scale properly
• Becomes faster when the query is split and the
  results are recomposed via internal scripts
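Splitting by property might look like the sketch below: one query lists the entity's properties, then one small query runs per property, and the results are merged in application code. The `run_query` callable (taking a SPARQL string and returning rows as tuples) and the function name are assumptions for illustration, not seevl's scripts.

```python
from collections import defaultdict

def related_by_property_slicing(entity, run_query):
    """Find resources sharing property-values with `entity`, one query per property.

    run_query: hypothetical callable wrapping a SPARQL endpoint;
    it takes a query string and returns a list of row tuples.
    Returns a dict: related resource -> set of properties it shares a value on.
    """
    props = [row[0] for row in run_query(
        f"SELECT DISTINCT ?p WHERE {{ <{entity}> ?p ?v }}")]
    related = defaultdict(set)
    for p in props:
        rows = run_query(
            f"SELECT DISTINCT ?x WHERE {{ <{entity}> <{p}> ?v . ?x <{p}> ?v . "
            f"FILTER(?x != <{entity}>) }}")
        # Recompose: remember which property linked each related resource.
        for (x,) in rows:
            related[x].add(p)
    return dict(related)
```

Each per-property query has a fixed predicate, which is typically much easier for a triple store to answer than a single pattern with a variable predicate.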
SPARQL: splitting queries
                   Direct SPARQL      Property-slicing    Complete-slicing
Artist             Queries    Time    Queries    Time     Queries    Time
Ramones                  1  139.97         20  109.51          66   37.84
Johnny Cash              1  257.81         30  152.60         135   75.35
U2                       1  155.53         22  122.91          70   44.03
The Clash                1  146.43         20  110.84          79   42.61
Bad Religion             1  104.08         23   86.49          97   47.35
The Aggrolites           1  145.92         13  114.52          28   28.33
Janis Joplin             1  230.88         27  151.00          98   62.81
SPARQL + Redis
• Started by using Memcache to store query
  results (e.g. “?x genre $y”)
  • Good, but costly for the first user
• Then, materialising results in-memory using
  Redis as a key-value cache system
  • Low indexing time (a few minutes on a laptop)
  • Increasing query-performance, real-time
SPARQL + Redis

• Redis
 • HSET to define entities (minimal data)
 • ZADD to store ordered sets of key-values,
    with our own ranking scheme
 • ZRANGE to retrieve them in the correct order
• Everything in memory, instant query results
SPARQL + Redis
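# Store the minimal entity data in a hash (one hash per entity).
self.redis.hset(entity, 'uri', uri)
self.redis.hset(entity, 'prefLabel', prefLabel)
self.redis.hset(entity, 'description', description)
# Add the entity to a sorted set (redis-py >= 3.0 takes a {member: score} mapping).
self.redis.zadd('genre:BntvuZAy', {entity: score})
...
# Read back an index range of a sorted set, ordered by score.
self.redis.zrange(pattern, start, stop, withscores=True)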
user-experience
User-experience
• Interfaces for graph-based/semantic data
 • Don’t need to be ugly!
 • As long as they’re built for users first
• Focus on vertical-UX, rather than SemWeb-UX
 • Check best practices in the domain
 • Involve HCI / non-SemWeb people
take-away message
Lessons learnt
• Don’t reinvent the wheel, check existing
  stacks and use what fits for the job
• Make it simple for your developers, using
  REST-ful interfaces and design patterns
• Accept compromises, be pragmatic
• Think of users / create personas who are not
  SemWeb-geeks when designing the UX
Questions?
http://seevl.net // @seevl
alex@seevl.net // @terraces
