Semantics-empowered Approaches to Big Data Processing for Physical-Cyber-Social Applications

Presentation at the AAAI 2013 Fall Symposium on Semantics for Big Data, Arlington, Virginia, November 15-17, 2013

Additional related material at: http://wiki.knoesis.org/index.php/Smart_Data
Related paper at: http://www.knoesis.org/library/resource.php?id=1903

Abstract: We discuss the nature of Big Data and address the role of semantics in analyzing and processing Big Data that arises in the context of Physical-Cyber-Social Systems. We organize our research around the five V's of Big Data, where four of the V's are harnessed to produce the fifth V - value. To handle the challenge of Volume, we advocate semantic perception that can convert low-level observational data to higher-level abstractions more suitable for decision-making. To handle the challenge of Variety, we resort to the use of semantic models and annotations of data so that much of the intelligent processing can be done at a level independent of the heterogeneity of data formats and media. To handle the challenge of Velocity, we seek to use continuous semantics capability to dynamically create event- or situation-specific models and recognize new concepts, entities and facts. To handle Veracity, we explore the formalization of trust models and approaches to glean trustworthiness. The above four V's of Big Data are harnessed by semantics-empowered analytics to derive Value for supporting practical applications transcending the physical-cyber-social continuum.

  • Semantics-empowered Approaches to Big Data Processing for Physical-Cyber-Social Applications. Big Data Research: Sensor, Social, and Cyber-Physical Systems
  • Relevance of our research to Big Data and PCS applications by organizing it around the 5 V’s. Role in overcoming … challenge: Volume; Variety; Both, by combining probabilistic and logical knowledge
  • Size, rate of flow/accumulation and change, (syntactic and semantic) heterogeneity, trustworthiness, end-use (develop techniques to harness data to derive value in the presence of these challenges)
  • Huge amount of raw data generated by continuous monitoring => actionable nuggets for decision making. Michael J. Fox Foundation Parkinson’s disease challenge: diagnosis and progression. Embarrassingly parallel computations (MapReduce programming model) can be implemented on distributed fault-tolerant architectures/systems (HDFS + Hadoop) using redundant storage and computations: an answer for homogeneous data. Semantics-based approaches are needed to deal with variety or to transcend abstraction levels. Check engine light signals/alerts: on detecting an anomaly/problem => further analysis/action.
  • What does semantic perception entail? Making sense of large amounts of low-level data and communicating it in a meaningful way, e.g., ranges, aggregate/statistical measures. GOAL: “Buzz it up!” Semantic Perception: converting sensory observations to abstractions. Using the perception cycle and domain models: derive explanations, determine focus to disambiguate and discriminate for taking actions. Hybrid reasoning: interleaved abductive and deductive components. [**complex domain models reflecting comorbidities : high-fidelity models**] [**Gleaning Patterns from data**] [**Personalization**]
  • Saffir-Simpson Hurricane Wind Scale. Hurricane/Typhoon/Cyclone (5 categories) / Tropical storm / Tropical depression vs Tsunami
  • ParkinsonMild(person) = Tremor(person) ∧ PoorBalance(person); ParkinsonModerate(person) = MoveSlow(person) ∧ PoorSleep(person) ∧ MonotoneSpeech(person); ParkinsonAdvanced(person) = Fall(person). Loss of speech / food intake impossible / lack of balance => is there value in continuous monitoring? => Signatures for proactive control? Dataset characteristics: 8 weeks of data from 5 sensors on a smartphone, collected for 16 patients, resulting in ~12 GB (with a lot of missing data).
  • A cardiologist evaluates the risk based on periodic monitoring data (+ human-sensed health info inputs). Reduce preventable readmissions: 25% of patients are readmitted 30 days after discharge; 50% of patients are readmitted after 60mo.
  • EVIDENCE-BASED approach to diagnosis, treatment and control. Environmental: CO, CO2, NO, pollen counts, mold, dust, smoke, etc. Physiological: Wheezometer (breathing), heart rate, etc. 25 million people in the U.S. are diagnosed with asthma (7 million are children) [1]. 300 million people suffer from asthma worldwide [2]. Asthma-related healthcare costs alone are around $50 billion a year [2]. 155,000 hospital admissions and 593,000 emergency department visits in 2006 [3].
  • Current predictions and long-term planning
  • Point of this slide: correlations
  • An Efficient Bit Vector Approach to Semantics-Based Machine Perception in Resource-Constrained Devices. Resources: memory, CPU, power, … Healthcare use case: privacy, mobility, cheap onboard sensors, personalization, power, and convenience considerations dominate. Abstracting and summarizing multimodal machine-sensed observations + human observations for actionable and human-accessible situational awareness and decision making. Characteristics of a big data problem.
  • The perception cycle contains interleaved, iterative execution of two primary phases. Explanation (abductive): translating low-level signals into high-level abstractions; inference to the best explanation. Discrimination (deductive): focusing attention on those properties that will help distinguish between multiple possible explanations; used to intelligently task sensors and collect additional observations (rather than the brute-force approach of blindly collecting all observations). Ask the human relevant questions.
  • Solving the information overload problem: improving relevance (both recall and precision). E.g., in the context of important/unfolding events, disaster scenarios, …, learn to rank and select relevant hashtags for improved crawling and filtering. Use keywords to carve out a relevant model of the domain for scalable and more focused information crawling, disambiguation and extraction in the face of a rapidly unfolding event. Leveraging Semantics for Detection of Event-Descriptors on Twitter.
  • Use seed keywords and tweets to carve out a relevant model from Wikipedia pages: Doozer. Track dynamically unfolding events.
  • Syntactic: different data formats. Semantic: conceptual models. Semantic: multimodal sensing + different conceptual models. Complementary and corroborative information => complete and reliable/robust. “Semantics Empowered Web 3.0” book.
  • Variety challenge: sources of heterogeneity (Addl: UOM, table captions). Use text-based metadata to help mediate.
  • Semantics at different levels of detail and developed in stages: ease of use by domain experts; faster and wider adoption, promoting evolution; low upfront cost to support. Shallow semantics has wider applicability to a range of documents/data and appeals to a broader community. Bottom line: “Learn to Walk before we Run”. Controlled vocabularies; traceability (provenance). Past research: we have dealt with the top-down UMLS ontology vs bottom-up facts from PubMed in HPCO (literature-based discovery -> LBD). RECALL: materials and process specs typically describe composition, processing, testing, and packaging of material. Formalizing a procedure (a process or a test) as an aggregation of characteristic/parameter-value pairs = LOD; this eventually allows combining and comparing specs. Biomaterials use case: gold surface affinity of peptide sequence.
  • Use case: Materials and Process specs. Compact structures for sharing information: minimize duplication.
  • AMS 4928N. http://www.youtube.com/watch?v=D8U4G5kcpcM http://www.ndt-ed.org/EducationResources/CommunityCollege/Materials/Mechanical/Mechanical.htm Most structural materials are anisotropic, which means that their material properties vary with orientation. In products such as sheet and plate, the rolling direction is called the longitudinal direction, the width of the product is called the (long) transverse direction, and the thickness is called the short transverse direction.
  • In content extraction from tables, a human extractor formalizes the data using “predefined” tables, and a wizard then generates LOD from it. The human extractor is responsible for gleaning the semantics (manual part). The wizard is responsible for the mechanical translation (automatic part). The yardstick of success is the extent to which regular parts of the table can be automatically assimilated and translated, while leaving more complex parts for manual guidance.
  • Event, disease, human-comprehensible features … Slow traffic vs the reason for it (accident vs tree fall): semantics to data: sensors monitoring traffic space. Cardiology use case: how a patient is feeling (giddy, depressed, etc.).
  • Idea: glean statistical correlations from data (PGM) and enrich/validate them using symbolic knowledge (manually curated): orient undirected links, delete conflicting links, and complement nodes and links. Explicit declarative knowledge obviates the need to generate it, especially in the context of sparse/skewed data, PLUS it will be reliable. Structure learning uncovers qualitative conditional dependencies; integrate with declarative information using progressively expressive graphical models at the same abstraction level. Parameter learning uses the refined structure to estimate a better-fitting model.
  • Discovering “unexpected” correlations, and then seeking a transparent basis for them, seems worthy of pursuit. For instance, consider the controversies surrounding assertions such as ‘smoking causes cancer’, ‘high debt causes low growth’, ‘low growth causes high debt’, and ‘religious fanaticism breeds terrorists’.
  • E.g., tides and ebbs caused by the alignment of the earth, sun and moon around full moon and new moon; “anomalous” orbits of solar system planets w.r.t. the “circular” motion of stars in geocentric theory (‘planet’ is ‘wanderer’ in Greek) explained by heliocentrism (Copernicus) and the theory of gravitation; correlation of time period and distance of planets (Kepler); and the “anomalous” precession of Mercury’s orbit clarified by the General Theory of Relativity (Einstein). C-peptide protein can be used to estimate insulin produced by a patient’s pancreas. ANOMALY (Copernicus) and REGULARITY (Kepler) => CAUSE (Newton) => (Newtonian Mechanics) => (General Theory of Relativity). Bold claims all the time in politics. Beer vs diapers; Walmart’s hurricanes vs Pop-Tarts. (4) Stress/spicy foods are correlated with peptic ulcers, but the latter are caused by Helicobacter pylori, as demonstrated by the Nobel Prize-winning work of Marshall and Warren. ORIENTATION UNCLEAR: ‘high debt causes low growth’, ‘low growth causes high debt’. (5) Since the 1950s, both the atmospheric carbon dioxide level and obesity levels have increased sharply. (6) Pavlovian learning-induced conditional reflex, and some of the financial market moves, seem to be classic cases of correlation turning into causation! PARADOXES: THE SEEDS OF PROGRESS. Zeno’s paradox, hydrostatic paradox, light speed constant in all reference frames, CBR, expanding universe, …
  • Different forms of trust; what features contribute to trust; how do we combine trust. Trust propagation: aggregation and chaining. Application-specific basis / axiomatic. E-commerce examples: risk tolerance (propensity to trust) + trustworthiness = trust
  • complementary and corroboratory
  • Biggest hurdle in ML: a significant training dataset. Training big data: tweets with emotion hashtags (provided by the tweet creator). Learn a domain model to associate emotion hashtags with tweet content. Glean/predict emotions from “untagged” tweets using this model.
  • EMR
  • Semantic Perception: Hybrid Abductive/Deductive Reasoning (Volume). Cost-benefit trade-off and continuum of semantic models to manage heterogeneity (Variety). Hybrid Knowledge Representation and Reasoning: Probabilistic + Logical: structure + parameter estimation (Variety).
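
The Parkinson's disease rules in the notes above (ParkinsonMild, ParkinsonModerate, ParkinsonAdvanced) read as conjunctions over symptom abstractions. A minimal Python sketch of evaluating them, assuming each symptom has already been abstracted from the raw sensor data into a present/absent flag; the `STAGE_RULES` table and the `stages` function are illustrative names, not taken from the original work:

```python
# Each stage is defined by the symptom abstractions that must all hold,
# mirroring the conjunctions in the slide notes. Names are illustrative.
STAGE_RULES = {
    "ParkinsonMild": {"Tremor", "PoorBalance"},
    "ParkinsonModerate": {"MoveSlow", "PoorSleep", "MonotoneSpeech"},
    "ParkinsonAdvanced": {"Fall"},
}


def stages(observed_symptoms):
    """Return, in sorted order, the stages whose required symptoms are all observed."""
    observed = set(observed_symptoms)
    return sorted(
        stage for stage, required in STAGE_RULES.items() if required <= observed
    )


if __name__ == "__main__":
    print(stages({"Tremor", "PoorBalance"}))
    print(stages({"Tremor", "Fall"}))
```

The sketch only checks the rules as stated; it says nothing about how the symptom flags themselves would be derived from smartphone sensors.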
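
The perception cycle described in the notes above alternates explanation (which features account for what was observed?) and discrimination (which further property would tell the remaining explanations apart?). The following Python sketch shows one way this loop could look. It is a simplification under stated assumptions: the weather knowledge base `KB` is a toy, each explanation is a single feature, and a property reported absent rules out every feature that expects it. None of these names come from the original work.

```python
# Toy background knowledge (hypothetical): feature -> properties it accounts for.
KB = {
    "blizzard": {"freezing_temperature", "snow", "high_wind"},
    "flurry": {"freezing_temperature", "snow"},
    "rain_storm": {"rain", "high_wind"},
}


def explain(observed):
    """Explanation: features that account for every observed property."""
    return {f for f, props in KB.items() if observed <= props}


def discriminate(candidates, observed):
    """Discrimination: unobserved properties expected by some, but not all, candidates."""
    expected = set().union(*(KB[f] for f in candidates))
    return {
        p for p in expected - observed
        if any(p not in KB[f] for f in candidates)
    }


def perceive(initial, sense):
    """Run the cycle; `sense(prop)` stands in for tasking a sensor or asking a human."""
    observed = set(initial)
    candidates = explain(observed)
    while len(candidates) > 1:
        questions = discriminate(candidates, observed)
        if not questions:
            break  # nothing left that could separate the candidates
        prop = sorted(questions)[0]
        if sense(prop):
            observed.add(prop)
            candidates = explain(observed)
        else:
            # Assumption: an absent property rules out features that expect it.
            candidates = {f for f in candidates if prop not in KB[f]}
    return candidates


if __name__ == "__main__":
    print(perceive({"snow"}, lambda prop: prop == "high_wind"))
```

Only properties that can shrink the candidate set are ever requested, which is the point the notes make about avoiding blind collection of all observations.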
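
The hybrid idea in the notes above (orient undirected links, delete conflicting links, complement nodes and links) can be illustrated on plain sets before any probabilistic machinery is involved. This is a hedged sketch, not the authors' method: `refine_structure`, its arguments, and the traffic variables are all hypothetical, and the learned links are assumed to come from some structure-learning step that is not shown.

```python
def refine_structure(learned_links, known_causes, ruled_out):
    """Combine data-driven links with curated knowledge.

    learned_links: set of frozenset({a, b}) undirected correlations from data.
    known_causes:  set of (cause, effect) pairs from a curated knowledge base.
    ruled_out:     set of frozenset({a, b}) links the knowledge base contradicts.
    Returns (oriented, added, unresolved).
    """
    oriented, unresolved = set(), set()
    for link in learned_links:
        if link in ruled_out:
            continue  # delete links that conflict with declarative knowledge
        matches = {(c, e) for (c, e) in known_causes if frozenset((c, e)) == link}
        if matches:
            oriented |= matches  # orient the link using the known direction
        else:
            unresolved.add(link)  # keep as undirected; knowledge is silent
    added = set(known_causes) - oriented  # complement with links the data missed
    return oriented, added, unresolved


if __name__ == "__main__":
    learned = {
        frozenset({"accident", "slow_traffic"}),
        frozenset({"rain", "slow_traffic"}),
        frozenset({"ice_cream_sales", "slow_traffic"}),
    }
    known = {("accident", "slow_traffic"), ("scheduled_event", "slow_traffic")}
    conflicts = {frozenset({"ice_cream_sales", "slow_traffic"})}
    print(refine_structure(learned, known, conflicts))
```

Parameter learning would then run on the refined structure; that step is outside the scope of this sketch.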
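
The emotion-identification notes above rely on hashtags supplied by the tweet author as free training labels. A small Python sketch of that labeling step follows, together with a deliberately naive word-count model so the example runs end to end. The `EMOTION_TAGS` mapping, the function names, and the sample tweets are made up for illustration; the original work's features and classifier are not described in the slides and are not reproduced here.

```python
from collections import Counter, defaultdict

# Hypothetical mapping from emotion hashtags to emotion labels.
EMOTION_TAGS = {"#happy": "joy", "#joy": "joy", "#sad": "sadness", "#angry": "anger"}


def label_tweet(tweet):
    """Return (text without emotion hashtags, emotion), or None if untagged or ambiguous."""
    tokens = tweet.lower().split()
    emotions = {EMOTION_TAGS[t] for t in tokens if t in EMOTION_TAGS}
    if len(emotions) != 1:
        return None
    text = " ".join(t for t in tokens if t not in EMOTION_TAGS)
    return text, emotions.pop()


def train(tweets):
    """Count words per emotion, using the hashtags as labels."""
    counts = defaultdict(Counter)
    for tweet in tweets:
        labeled = label_tweet(tweet)
        if labeled:
            text, emotion = labeled
            counts[emotion].update(text.split())
    return counts


def predict(counts, tweet):
    """Pick the emotion whose training words overlap most with the tweet, if any."""
    words = tweet.lower().split()
    scores = {emotion: sum(c[w] for w in words) for emotion, c in counts.items()}
    best = max(scores, key=scores.get, default=None)
    return best if best is not None and scores[best] > 0 else None


if __name__ == "__main__":
    model = train([
        "Got the job today #happy",
        "Missed my flight again #sad",
        "Lovely weather today #joy",
    ])
    print(predict(model, "flight delayed again"))
```

The hashtag is stripped from the training text so that the model has to learn from the remaining content, which is what makes it usable on untagged tweets.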

Presentation Transcript

  • Semantics-empowered Big Data Processing for PCS Applications Krishnaprasad Thirunarayan (T. K. Prasad) and Amit Sheth Kno.e.sis – Ohio Center of Excellence in Knowledge-enabled Computing Wright State University, Dayton, OH-45435
  • Outline • 5 V’s of Big Data Research • Semantic Perception for Scalability • Lightweight semantics to manage heterogeneity – Cost-benefit trade-off and continuum • Hybrid Knowledge Representation and Reasoning – Anomaly, Correlation, Causation 11/15/2013 Prasad
  • 5 V’s of Big Data Research: Volume, Velocity, Variety, Veracity, Value. Big Data => Smart Data
  • Volume: Assorted Examples. Check engine light analogy
  • Volume: Semantic Perception
  • Weather Use Case
  • Parkinson’s Disease Use Case
  • Heart Failure Use Case
  • Asthma Use Case
  • Traffic Use Case
  • Heterogeneity in a Physical-Cyber-Social System [Figure: 511.org sources covering traffic monitoring (slow-moving traffic), link descriptions, scheduled events, and schedule information]
  • Volume with a Twist: Resource-constrained reasoning on mobile devices
  • Perception Cycle* that exploits background knowledge / domain models [Figure: cycle of Observe Property and Perceive Feature, linked by (1) Explanation and (2) Discrimination, drawing on Prior Knowledge]. Explanation: abstracting raw data for human comprehension. Discrimination: focus generation for disambiguation and action (incl. human in the loop). * Based on Neisser’s cognitive model of perception
  • Virtues of Our Approach to Semantic Perception Blends simplicity, effectiveness, and scalability. • Declarative specification of explanation and discrimination; • With applications (e.g., to healthcare) that are of contemporary relevance and interdisciplinary; • Using encodings/algorithms that are significant (asymptotic order of magnitude gain) and necessary (“tractable” due to time/memory reduction for typical problem sizes); and • Prototyped using extant PCs and mobile devices.
  • Efficiency Improvement (evaluation on a mobile device) • Problem size increased from 10’s to 1000’s of nodes • Time reduced from minutes to milliseconds • Complexity growth reduced from polynomial, O(n^3) < x < O(n^4), to linear, O(n)
  • Volume and Velocity • Lightweight semantics-based Adaptive/Continuous Filtering: disaster response use case • Building domain models dynamically
  • Dynamic Model Creation: Continuous Semantics
  • Variety: Syntactic and semantic heterogeneity • in textual and sensor data • in (legacy) materials data • in (long tail) geosciences data
  • Variety (What?): Materials/Geosciences Use Case • Structured Data (e.g., relational) • Semi-structured, Heterogeneous Documents (e.g., publications and technical specs, which usually include text, numerics, maps and images) • Tabular data (e.g., ad hoc spreadsheets and complex tables incorporating “irregular” entries)
  • Variety (How?/Why?): Granularity of Semantics & Applications • Lightweight semantics: File and document-level annotation to enable discovery and sharing • Richer semantics: Data-level annotation and extraction for semantic search and summarization • Fine-grained semantics: Data integration, interoperability and reasoning in Linked Open Data. Cost-benefit trade-off and continuum
  • Challenges Associated with Typical Spreadsheet/Table • Meant for human consumption • Irregular: not a simple rectangular grid • Heterogeneous: all rows not interpreted similarly • Complex: meaning of each row and each column is context dependent • Footnotes modify meaning of entries (esp. in materials and process specifications)
  • Practical Semi-Automatic Content Extraction • DESIGN: Develop regular data structures that can be used to formalize tabular information. – Provide a natural expression of data – Provide semantics to data, thereby removing potential ambiguities – Enable automatic translation • USE: Manual population of regular tables and automatic translation into LOD
  • Variety (What?): Sensor Data Use Case. Develop/learn domain models to exploit complementary and corroborative information • To relate patterns in multimodal data to “situation” • To integrate machine-sensed and human-sensed data
  • Variety: Hybrid KRR. Blending data-driven models with declarative knowledge – Data-driven: bottom-up, correlation-based, statistical – Declarative: top-down, causal/taxonomical, logical – Refine structure to better estimate parameters. E.g., Traffic Analytics using PGMs + KBs
  • Variety (Why?): Hybrid KRR. “Data can help compensate for our overconfidence in our own intuitions and reduce the extent to which our desires distort our perceptions.” -- David Brooks of New York Times. However, inferred correlations require clear justification that they are not coincidental, to inspire confidence.
  • Correlations vs Causation vs Anomalies • Correlations due to common cause or origin • Coincidental due to data skew or misrepresentation • Coincidental new discovery • Strong correlation vs causation • Anomalous and accidental • Correlation turning into causation
  • Correlations vs Causation vs Anomalies • Correlations due to common cause or origin – E.g., Planets: Copernicus > Kepler > Newton > Einstein • Coincidental due to data skew or misrepresentation – E.g., Tall policy claims made by politicians! • Coincidental new discovery – E.g., Hurricanes and Strawberry Pop-Tarts sales • Strong correlation vs causation – E.g., Spicy foods vs Helicobacter pylori: stomach ulcers • Anomalous and accidental – E.g., CO2 levels and obesity • Correlation turning into causation – E.g., Pavlovian learning: conditional reflex
  • Veracity: A lot of existing work on trust ontologies, metrics and models, and on provenance tracking • Homogeneous data: statistical techniques • Heterogeneous data: semantic models
  • Veracity. Machine sensing: objective, quantitative, but prone to environmental effects, battery life, … Human sensing: subjective, qualitative, but prone to bias, perceptual errors, rumors, … Open problem: improving trustworthiness by combining machine sensing and human sensing – E.g., 2002 Überlingen mid-air collision: pilot incorrectly followed traffic controller advice over the electronic TCAS system recommendation
  • (More on) Value: Learning domain models from “big data” for prediction. E.g., Harnessing Twitter "Big Data" for Automatic Emotion Identification
  • (More on) Value: Discovering gaps and enriching domain models using data. E.g., data-driven knowledge acquisition method for domain knowledge enrichment in healthcare
  • Conclusions • Glimpse of our research organized around the 5 V’s of Big Data • Discussed role in harnessing Value – Semantic Perception (Volume) – Continuum of Semantic models to manage Heterogeneity (Variety) – Hybrid KRR: Probabilistic + Logical (Variety) – Continuous Semantics (Velocity) – Trust Models (Veracity)
  • Thank you, and please visit us at http://knoesis.org/ Kno.e.sis – Ohio Center of Excellence in Knowledge-enabled Computing, Wright State University, Dayton, Ohio, USA. Special thanks to: Pramod Anantharam and Cory Henson
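
The efficiency slide in the transcript reports a move from polynomial to linear growth on a mobile device, and the notes name a bit vector approach. The sketch below illustrates only the general encoding idea, assuming properties are mapped to bit positions so that "does this feature account for all observations?" becomes a single bitwise test per feature. The property list, the `FEATURES` table, and the function names are hypothetical; this is not the authors' algorithm.

```python
# Hypothetical property vocabulary; each property gets one bit position.
PROPERTIES = ["freezing_temperature", "snow", "rain", "high_wind"]
BIT = {prop: 1 << i for i, prop in enumerate(PROPERTIES)}


def encode(props):
    """Encode a collection of property names as an integer bit vector."""
    mask = 0
    for prop in props:
        mask |= BIT[prop]
    return mask


# Toy domain model: feature -> bit vector of the properties it accounts for.
FEATURES = {
    "blizzard": encode(["freezing_temperature", "snow", "high_wind"]),
    "flurry": encode(["freezing_temperature", "snow"]),
    "rain_storm": encode(["rain", "high_wind"]),
}


def explain(observed_mask):
    """One pass over the features; each check is a single AND and comparison."""
    return [
        feature for feature, mask in FEATURES.items()
        if (observed_mask & mask) == observed_mask
    ]


if __name__ == "__main__":
    print(explain(encode(["snow"])))
    print(explain(encode(["snow", "high_wind"])))
```

With this encoding the explanation step does a fixed amount of work per feature, which is the kind of saving that matters on memory- and power-constrained devices.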
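
The content-extraction slide in the transcript splits the work into a manual step (a person fills in a regular table) and an automatic step (translation into LOD). A minimal Python sketch of the automatic step follows, assuming the regular table is a mapping of characteristic/parameter names to values and the output is N-Triples-style text. The `http://example.org/spec/` namespace, the spec identifier, and the sample values are invented for illustration and do not describe any real specification.

```python
from urllib.parse import quote

BASE = "http://example.org/spec/"  # hypothetical namespace


def row_to_triples(spec_id, row):
    """Translate one regular table row into N-Triples-style statements.

    spec_id: identifier of the spec the row belongs to (becomes the subject).
    row:     mapping of characteristic/parameter name -> value.
    """
    subject = f"<{BASE}{quote(spec_id)}>"
    triples = []
    for name, value in row.items():
        predicate = f"<{BASE}{quote(name)}>"
        # Escape backslashes and double quotes inside the literal.
        text = str(value).replace("\\", "\\\\").replace('"', '\\"')
        triples.append(f'{subject} {predicate} "{text}" .')
    return triples


if __name__ == "__main__":
    # Made-up row, as a human extractor might enter it in a predefined table.
    row = {"material": "titanium alloy", "test": "tensile", "min_strength_ksi": 130}
    for triple in row_to_triples("SPEC-001", row):
        print(triple)
```

Irregular parts of a table (footnotes, context-dependent rows) are exactly what this mechanical step cannot handle, which is why the slides leave them to manual guidance.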