
Data Integration in a Big Data Context

Data is being generated all around us – from our smart phones tracking our movement through a city to the city itself sensing various properties and reacting to various conditions. However, to maximise the potential from all this data, it needs to be combined and coerced into models that enable analysis and interpretation. In this talk I will give an overview of the techniques that I have developed for data integration: integrating streams of sensor data with background contextual data and supporting multiple interpretations of linking data together. At the end of the talk I will overview the work I will be conducting in the Administrative Data Research Centre for Scotland.

  1. Data Integration in a Big Data Context
     Alasdair J G Gray
     A.J.G.Gray@hw.ac.uk, alasdairjggray.co.uk, @gray_alasdair
  2. Data Linkage and Querying: Linking it all together!
     2 September 2015, UBDC Seminar
  3. Big Data: Volume, Velocity, Variety
     Image: http://i.kinja-img.com/gawker-media/image/upload/lvzm0afp8kik5dctxiya.jpg
  4. Purpose: Extracting Value
     [Diagram labels: Volume; Velocity; Variety; Veracity; Value; Visualization; Analytics; Big Data Technology]
     Image: http://senderocorp.com/images/uploads/bigdata_v9.png
  5. Big Data: Volume, Velocity, Variety
  6. Big Data: Volume
     More data than you can process
     • Scalable processing
     • Relative term
     • WSN query processing
  7. Big Data: Variety
     Many sources of data
     • Heterogeneous
     • Formats
     • Models
     • Reconcile meaning
  8. Big Data: Velocity
     Data constantly generated
     • Real-time processing
     • Contextualise
  9. RDF: An Integration Dream
     Source: http://www.w3.org/TR/rdf11-primer/
  10. “RDF and OWL do not solve the interoperability problem, they just lay it bare on the table!” Frank van Harmelen
      Image: https://www.flickr.com/photos/mobilestreetlife/4179063482
  11. Solent Use Case
      • Busy shipping channel
      • Two major ports
      • Complex tidal and wave patterns
  12. Estuarine Flooding
      • Financial implications; Damage; Loss of business
      • Personal factors; Emotional impact
      Flood prediction: Locations; Severity
      Requires correlating: Sea-state data; Weather forecasts; Details of sea defences
      Response Planning: Evacuation routes; Personnel deployment; …
      Requires more data: Traffic reports; Shipping; …
      Image: http://www.metro.co.uk/
  13. Flood Detection
      “Detect overtopping events in the Solent region”
      sea-level > sea-defence
      • Sea-level: sensors
      • Defence heights: databases
      Real-time sensor data: Wave, Wind, Tide
      Flood defences data (database)
  14. Response Planning
      “Provide contextual information”
      • Web feeds
      • Other sources: maps, models
      • Real-time merging of datasets
      Meteorological forecasts
      Other sources: Maps, models, …
  15. Abstract Problem
      [Diagram labels: Integrator; Stored data service (Stored data); Streaming data service (Sensor Network), shown twice]
  16. Types of Heterogeneity
      Data source; Data stream; Query capabilities; Data access; Data semantics
      [Diagram labels: Integrator; Stored data service (Stored data); Streaming data service (Sensor Network), shown twice]
  17. Querying Approach
      • Use ontologies as common model
      Requires:
      • Representation of RDF stream
      • Expressing continuous queries over an RDF stream
      • Establishing mappings between ontology models and data source schemas
      • Accessing data sources through queries over ontology model
  18. RDF Stream
      • Named graph
      • Continuously updating
      • Triples annotated with timestamp
      STREAM http://www.semsorgrid4env.eu/ccometeo.srdf
      ...
      ( <ssg4e:Obs1, rdf:type, cd:Observation>, t_i ),
      ( <ssg4e:Obs1, cd:observationResult, "34.5">, t_i ),
      ( <ssg4e:Obs2, rdf:type, cd:Observation>, t_{i+1} ),
      ( <ssg4e:Obs2, cd:observationResult, "20.3">, t_{i+1} ),
      ...
      [Diagram labels: cd:Observation; cd:observationResult; xsd:double]
  19. SPARQLStream
      “Every 5 minutes give me the wind speed observations over the last minute in the Solent Region”
      PREFIX cd: <http://www.semsorgrid4env.eu/ontologies/CoastalDefences.owl#>
      PREFIX sb: <http://www.w3.org/2009/SSN-XG/Ontologies/SensorBasis.owl#>
      PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
      RSTREAM SELECT ?windspeed ?windts
      FROM STREAM <http://www.semsorgrid4env.eu/ccometeo.srdf> [ NOW - 1 MINUTE TO NOW STEP 5 MINUTES ]
      WHERE {
        ?WindObs a cd:Observation;
                 cd:observationResult ?windspeed;
                 cd:observationResultTime ?windts;
                 cd:observedProperty ?windProperty;
                 cd:featureOfInterest ?windFeature.
        ?windFeature a cd:Feature;
                     cd:locatedInRegion cd:SolentCCO.
        ?windProperty a cd:WindSpeed.
      }
      [Diagram labels: cd:Observation; cd:observationResult; xsd:double; cd:featureOfInterest; cd:Feature; cd:observedProperty; cd:Property; cd:locatedInRegion; cd:Region]
  20. Initial Display
  21. Sensor Data
  22. Sea-state Forecast Model
  23. Drug Discovery Use Case
      “Let me compare MW, logP and PSA for launched inhibitors of human & mouse oxidoreductases”
      • Chemical Properties (Chemspider)
      • Launched drugs (Drugbank)
      • Human => Mouse (Homologene)
      • Protein Families (Enzyme)
      • Bioactivity Data (ChEMBL)
      • … other info (Uniprot/Entrez etc.)
  24. Open PHACTS Discovery Platform
      [Diagram labels: Drug Discovery Platform; Apps; Domain API; Interactive responses; Production quality integration platform; Method Calls; Standard Web Technologies]
  25. API Hits
      [Chart: number of hits (millions, axis 0 to 60) per month, January 2013 to June 2015; annotations: Public launch of 1.2 API; 1.3 API; 1.4 API; 1.5 API]
  26. Open PHACTS Data
  27. Multiple Identities
      P12047, X31045, GB:29384: are these the same thing?
      Andy Law's Third Law: “The number of unique identifiers assigned to an individual is never less than the number of Institutions involved in the study”
      Source: http://bioinformatics.roslin.ac.uk/lawslaws/
  28. Gleevec®: Imatinib Mesylate
      Databases: Drugbank, ChemSpider, PubChem
      Record labels shown: Imatinib Mesylate; Imatinib Mesylate; YLMAHDNUQAMNNX-UHFFFAOYSA-N
  29. Gleevec®: Imatinib Mesylate
      Databases: Drugbank, ChemSpider, PubChem
      Record labels shown: Imatinib Mesylate; Imatinib Mesylate; YLMAHDNUQAMNNX-UHFFFAOYSA-N
      Are these records the same? It depends upon your task!
  30. Structure Lens
      I need to perform an analysis, give me details of the active compound in Gleevec.
      [Diagram: skos:exactMatch (InChI); scale from Strict (Analysing) to Relaxed (Browsing)]
  31. Name Lens
      Which targets are known to interact with Gleevec?
      [Diagram: skos:closeMatch (Drug Name); skos:closeMatch (Drug Name); skos:exactMatch (InChI); scale from Strict (Analysing) to Relaxed (Browsing)]
  32. What is a Scientific Lens?
      A lens defines a conceptual view over the data
      • Specifies operational equivalence conditions
      Consists of:
      • Identifier (URI)
      • Title (dct:title)
      • Description (dct:description)
      • Documentation link (dcat:landingPage)
      • Creator (pav:createdBy)
      • Timestamp (pav:createdOn)
      • Equivalence rules (bdb:linksetJustification)
  33. Administrative Data Research Network
      Administrative Data Service
  34. ADRC-Scotland
      • Co-located with Farr Institute, Scottish Government and NHS.
      • Universities of Aberdeen, Dundee, Edinburgh, Glasgow, Heriot-Watt, St Andrews and Stirling.
      • Expertise in administrative data and public engagement, linkage, law and relevant computer science techniques.
      • Provide research support, facilities, training
  35. Research Focus
      • Schools, colleges and universities
      • The criminal and justice system
      • Social work services
      • Social welfare
      • Housing system
      • Transport system
      • Health system
      • Historical administrative data
      Source: http://www.gov.scot/Resource/0044/00442276-39
  36. Data Matching
      Messy data; Probabilistic matches; Schema matching
      [Figure: example records to match: John Grant (Fisherman), Fiona Sinclair, Ian Grant (Smithy), Born: 1861; Stuart Adam (Wheelwright), Morag Scott, Flora Adam (Seamstress), Born: 1866; Married: 1884; John Grant (Farmer), Fiona Grant, Iain Grant, Born: 1860]
  37. Summary
      • RDF eases data integration
      • Working on RDF stream extensions
      • Data is complex and messy
      • Requires flexibility in linking
      • Equivalence depends upon context
      • Lenses provide support for operational equivalence
      www.alasdairjggray.co.uk, A.J.G.Gray@hw.ac.uk, @gray_alasdair
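
The lens slides describe links that carry a predicate and a justification, with a lens enabling some relationships and disabling others. The following is a minimal Python sketch of that idea under stated assumptions, not the Open PHACTS implementation: the `Link` and `Lens` classes, the `equivalents` function, the record identifiers and the justification labels `"inchi"` and `"drug-name"` are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    source: str
    target: str
    predicate: str      # e.g. "skos:exactMatch" or "skos:closeMatch"
    justification: str  # hypothetical label for why the link holds


@dataclass(frozen=True)
class Lens:
    title: str
    enabled: frozenset  # justifications this lens treats as equivalence


def equivalents(start, links, lens):
    """Return every identifier reachable from start via links the lens enables."""
    active = [l for l in links if l.justification in lens.enabled]
    seen = {start}
    frontier = [start]
    while frontier:
        node = frontier.pop()
        for l in active:
            # Links are followed in both directions.
            for a, b in ((l.source, l.target), (l.target, l.source)):
                if a == node and b not in seen:
                    seen.add(b)
                    frontier.append(b)
    return seen


# Hypothetical records in three datasets (identifiers are made up).
LINKS = [
    Link("chemspider:rec1", "drugbank:rec2", "skos:exactMatch", "inchi"),
    Link("drugbank:rec2", "pubchem:rec3", "skos:closeMatch", "drug-name"),
]
STRUCTURE_LENS = Lens("Structure Lens", frozenset({"inchi"}))
NAME_LENS = Lens("Name Lens", frozenset({"inchi", "drug-name"}))

strict = equivalents("chemspider:rec1", LINKS, STRUCTURE_LENS)
relaxed = equivalents("chemspider:rec1", LINKS, NAME_LENS)
```

In this sketch the stricter lens yields a smaller equivalence set than the relaxed one, which mirrors the analysing versus browsing distinction on the slides.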

Editor's Notes

  • About Me
    Working in data integration for over a decade
    Special focus on
    Practical real-world problems
    Streaming data
    Lots of domains: Astronomy, Biology, Chemistry, Physics, Environmental science, Pharmacology, Social science, Health Informatics
  • Data from heterogeneous sources: discover relevant sources; different temporal modalities; different data models and representations
    Interlink data: common representation, align data models/schemas, identify common entities
    Query decomposition across distributed sources
    Efficient in-network processing: Save energy, increase network lifetime
    Enable new insights through novel user interfaces
  • Veracity is a cross-cutting issue
  • Deriving value from the data

    My research focuses on the top part: bringing data together
  • WSN typically resource constrained
    48k memory
    limited energy
  • Integrated with background knowledge
  • Identify things with URIs
    Reuse URIs
    Explicit meaning to relationships
    Links between datasets
    Infer hidden meaning
  • They give us a common syntax

    Rest of the talk focuses on my work to address these challenges
  • Strait of water separating Isle of Wight from English mainland

    Two high tides -> increased opportunities for getting ships in and out -> better for business
    Complex tidal pattern
    Non-standard models
  • Environmental decision support systems
    Flood emergency response:
    real-time data mash-ups
    real-time data linkage
  • Overtopping: a wave or tide exceeds the height of the sea defence: simplified as threshold in graph

    Sensor data provides current sea-state conditions
    National Flood and Coastal Defences Database (NFCDD) provides height of sea walls, etc

    Lots of forms of heterogeneity in the system
  • Contextual Data
    Weather feed provides predicted wind speed and direction,
    contextual streaming data
    Maps -> contextual visual data

    Report data in a form understandable to the user, ontology
  • User requires correlation of data from variety of sources

    Sources wrapped by a service

    Integrator includes distributed query processing (DQP)

    Queries, or data requests sent to data services
  • Data source: stored or streaming
    Data stream: acquire or receive. Control of data rate
    Query capabilities:
    Query evaluator
    Language
    Data access: pull or push
    Semantic Heterogeneity, e.g. temperature: air, sea, …
  • Previous streaming extensions to SPARQL have problems
  • Requires two triples to represent information
    Problem being addressed in W3C RDF Stream Processing Group
  • 1 of 83 business driver questions
    Took a team of 5 experienced researchers 6 hours to manually gather the answer
    At the start of the project, it could not be answered by a computer system
    Six months in: 30 s with the prototype
    Now: sub-second
  • A platform for integrated pharmacology data
    Relied upon by pharma companies
    Public domain, commercial, and private data sources
    Provides domain specific API
    Making it easy to build multiple drug discovery applications: examples developed in the project
  • Actively being used for different purposes
    Public launch April 2013
    Averaging 20 million hits a month from the start of 2015
    38 million in the last 30 days

    Heavy usage from pharma, academia, and biotech
    500+ registered users
  • Over 3 billion triples
    Hosted on beefy hardware; data in memory (aim)
  • Concept appears in multiple datasets, each with its own identifier
    This talk is about supporting the multiple identities that exist
    Rather than define a single approach, we want to support the use of multiple identifiers
  • Example drug: Gleevec, a cancer drug for leukemia

    Lookup in three popular public chemical databases -> Different results

    Chemistry is complicated, often simplified for convenience
    Data is messy!
  • Are these records the same? It depends on what you are doing with the data!
    Each captures a subtly different view of the world

    Chemistry is complicated, often simplified for convenience
    Data is messy!
  • Interested in physicochemical properties of Gleevec
  • Interested in biomedical and pharmacological properties

    sameAs != sameAs: it depends on your point of view

    Links relate individual data instances: source, target, predicate, reason.

    Links are grouped into Linksets which have VoID header providing provenance and justification for the link.
  • Lens enables certain relationships and disables others
    Alters links between the data
  • Default lens matches structures
    Only get back data associated with the structure that was entered

    Really want all information about Ibuprofen
    Need a different lens
  • ESRC funded network
    Coordinating Administrative Data Service (ADS) – led by University of Essex
    Four Administrative Data Research Centres (ADRCs), one in each UK country
    England – led by University of Southampton
    Northern Ireland – led by Queen's University Belfast
    Scotland – led by University of Edinburgh
    Wales – led by Swansea University

  • Social science example from ADRC Scotland
    Looking to apply lenses to support different interactions
  • Bird habitat monitoring, Coastal monitoring, Glacier movement, Farms, Volcanoes…

    Cost effective monitoring, high spatial/temporal resolution

    What is the underlying technology/software?
  • Trade-off of capabilities vs QoS vs Lifetime

    Every system performed its own bespoke evaluation; how do you compare them?
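
The overtopping check from the Flood Detection slide (sea-level > sea-defence) amounts to joining a stream of sensor readings with stored defence heights. The following is a rough Python sketch of that join, assuming a simple in-memory lookup table; the `Reading` type, the `overtopping_events` function, the location name and all values are hypothetical and do not come from the talk or from any SemSorGrid4Env API.

```python
from typing import Dict, Iterable, Iterator, NamedTuple


class Reading(NamedTuple):
    timestamp: int      # seconds since some epoch (illustrative)
    location: str       # hypothetical sea-defence identifier
    sea_level_m: float  # observed sea level in metres


def overtopping_events(
    stream: Iterable[Reading],
    defence_heights: Dict[str, float],
) -> Iterator[Reading]:
    """Yield readings whose sea level exceeds the stored defence height."""
    for reading in stream:
        height = defence_heights.get(reading.location)
        # Skip locations with no stored defence data rather than guessing.
        if height is not None and reading.sea_level_m > height:
            yield reading


# Stored data: defence heights, as they might come from a database.
heights = {"wall-1": 3.0}
# Streaming data: timestamped sensor readings.
readings = [
    Reading(0, "wall-1", 2.5),
    Reading(60, "wall-1", 3.2),
    Reading(120, "unmapped", 9.9),
]
events = list(overtopping_events(readings, heights))
```

A generator is used so the same function works over an unbounded stream as well as a list; windowing and ontology-based rewriting, as in the SPARQLStream query, are outside the scope of this sketch.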
