Semantics-Empowered Understanding, Analysis and Mining of Nontraditional and Unstructured Data

Amit Sheth, "Semantics-Empowered Understanding, Analysis and Mining of Nontraditional and Unstructured Data,"
WSU & AFRL Window-on-Science Seminar on Data Mining, August 05, 2009.
http://wiki.knoesis.org/index.php/Seminar_on_Data_Mining#Semantics_empowered_Understanding.2C_Analysis_and_Mining_of_Nontraditional_and_Unstructured_Data

  • Microblogs are one of the most powerful ways of talking of CSD
  • Implicit social context is created by people responding to other messages. In this example, we show how the system can identify that it is Nariman and not Hareemane.
  • In the scenario, what techniques and technologies are being brought together? Semantic + Social Computing + Mobile Web
  • Users are shown two images along with labels. Labels are obtained from GI or a similar data source. Users add relationships. When 2 users agree, the labels are tagged with this relationship. Given multiple relationships, the system will learn using ML techniques.

Semantics-Empowered Understanding, Analysis and Mining of Nontraditional and Unstructured Data: Presentation Transcript

  • Semantics-Empowered Understanding, Analysis and Mining of Nontraditional and Unstructured Data
    WSU & AFRL Window-on-Science Seminar on Data Mining
    Amit P. Sheth,
    LexisNexis Ohio Eminent Scholar
    Director, Kno.e.sis center, Wright State University
    knoesis.org
    Thanks: K. Gomadam, M. Nagarajan, C. Thomas, C. Henson, C. Ramakrishnan, P. Jain and Kno.e.sis Researchers
  • Data & Knowledge Ecosystem
    Situational Awareness
    Decision Support
    Insight
    Knowledge Discovery
    Analysis (eg Patterns)
    Understanding & Perception
    Data Mining
    Integration
    Search
    Browsing
    Multimedia Data
    Structured,
    Semistructured
    Unstructured
    Data
    Textual Data: Scientific Literature, Web Pages, News, Blogs,
    Reports, Wiki, Forums, Comments, Tweets
    Experimental Data
    Observational Data
    Transactional Data
  • Some examples of R&D we have done
    Semantic Search & Ranking of Stories and Reports – connecting the dots applications (insider threat, financial risk analysis)
    Mining of biomedical (scientific) literature (extraction of entities and relationships) – discovering hidden public knowledge
    Semantic Integration, Analysis and Decision Support over Sensor Data
    Extracting taxonomy/domain model from Wikipedia
    Discovering Hidden Relationships (insights) in Community Created Content (Wikipedia)
  • Understanding User Generated Content (on Social Networking Sites)*
    What are people talking about
    How people write
    Why people write
    With application to
    • Artist Popularity Ranking
    • Advertisement on Social Media
    • Identifying Social Signals – spatio-temporal-thematic analysis of Citizen Sensor Data
    * Meena Nagarajan
  • Search
    Integration
    Analysis
    Discovery
    Question
    Answering
    Situational
    Awareness
    Domain Models
    Patterns / Inference / Reasoning
    RDB
    Relationship Web
    Meta data / Semantic Annotations
    Metadata Extraction
    Multimedia Content and Web data
    Text
    Sensor Data
    Structured and Semi-structured data
  • Insider threat demo (semantic search/querying, ranking, …)
  • Knowledge Discovery from Scientific Literature
    Cartic Ramakrishnan
  • What Knowledge Discovery is NOT
    Search
    Keyword-in-document-out
    Keywords are fully specified features of expected outcome
    Searching for prospective mining sites
    Mining
    Know where to look
    Underspecified characteristics of what is sought are available
    Patterns
  • What is knowledge discovery?
    “knowledge discovery is more like sifting through a warehouse filled with small gears, levers, etc., none of which is particularly valuable by itself. After appropriate assembly, however, a Rolex watch emerges from the disparate parts.” – James Caruther
    “discovery is often described as more opportunistic search in a less well-defined space, leading to a psychological element of surprise” – James Buchanan
    Opportunistic search over an ill-defined space leading to surprising but useful emergent knowledge
  • Element of surprise – Swanson’s discoveries
    Stress
    ?
    Swanson’s
    Discoveries
    Magnesium
    Migraine
    Calcium
    Channel
    Blockers
    Spreading Cortical Depression
    11 possible associations found
    PubMed
    Associations Discovered based on keyword searches
    followed by manual analysis of text to establish possible relevant relationships
  • Knowledge Discovery over text
    Text
    Assigning interpretation to text
    Semantic metadata
    in the form of
    semi-structured data
    Extraction of
    Semantics
    from text
    Semantic Metadata
    Guided
    Knowledge Explorations
    Semantic Metadata
    Guided
    Knowledge Discovery
    Triple-based
    Semantic
    Search
    Semantic
    browser
    Subgraph
    discovery
  • Information Extraction via Ontology assisted text mining – Relationship extraction
    4733
    documents
    9284
    documents
    5
    documents
    UMLS
    Semantic Network
    complicates
    Biologically
    active substance
    affects
    causes
    causes
    Disease or Syndrome
    Lipid
    affects
    instance_of
    instance_of
    ???????
    Fish Oils
    Raynaud’s Disease
    MeSH
    PubMed
  • Background knowledge and Data used
    UMLS – A high level schema of the biomedical domain
    136 classes and 49 relationships
    Synonyms of all relationships – using variant lookup (tools from NLM)
    49 relationships + their synonyms = ~350 verbs
    MeSH
    22,000+ topics organized as a forest of 16 trees
    Used to query PubMed
    PubMed
    Over 16 million abstracts
    Abstracts annotated with one or more MeSH terms
  • Method – Parse Sentences in PubMed
    SS-Tagger (University of Tokyo)
    SS-Parser (University of Tokyo)
    • Entities (MeSH terms) in sentences occur in modified forms
    • “adenomatous” modifies “hyperplasia”
    • “An excessive endogenous or exogenous stimulation” modifies “estrogen”
    • Entities can also occur as composites of 2 or more other entities
    • “adenomatous hyperplasia” and “endometrium” occur as “adenomatous hyperplasia of the endometrium”
    (TOP (S (NP (NP (DT An) (JJ excessive) (ADJP (JJ endogenous) (CC or) (JJ exogenous) ) (NN stimulation) ) (PP (IN by) (NP (NN estrogen) ) ) ) (VP (VBZ induces) (NP (NP (JJ adenomatous) (NN hyperplasia) ) (PP (IN of) (NP (DT the) (NN endometrium) ) ) ) ) ) )
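
A bracketed parse such as the one above can be read into a nested structure with a few lines of code. The sketch below is only an illustration, not the rule set used in this work: `parse_tree` and `leaves` are hypothetical helper names, and the subject/verb/object split simply relies on this particular sentence having the shape S -> NP VP.

```python
def parse_tree(text):
    """Parse a Penn-style bracketed string into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        assert tokens[pos] == "("
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())
            else:
                children.append(tokens[pos])  # a leaf word
                pos += 1
        pos += 1  # consume ")"
        return (label, children)

    return read()


def leaves(node):
    """Return the words under a node, left to right."""
    if isinstance(node, str):
        return [node]
    words = []
    for child in node[1]:
        words.extend(leaves(child))
    return words


SENTENCE = (
    "(TOP (S (NP (NP (DT An) (JJ excessive) (ADJP (JJ endogenous) (CC or) "
    "(JJ exogenous) ) (NN stimulation) ) (PP (IN by) (NP (NN estrogen) ) ) ) "
    "(VP (VBZ induces) (NP (NP (JJ adenomatous) (NN hyperplasia) ) "
    "(PP (IN of) (NP (DT the) (NN endometrium) ) ) ) ) ) )"
)

tree = parse_tree(SENTENCE)
s_node = tree[1][0]                    # the S node under TOP
subject_np, verb_phrase = s_node[1]    # S -> NP VP (true for this sentence)
subject = " ".join(leaves(subject_np))
verb = verb_phrase[1][0][1][0]         # the word under VBZ
obj = " ".join(leaves(verb_phrase[1][1]))
print(subject, "|", verb, "|", obj)
```

The modified and composite entities described on this slide correspond to the noun-phrase subtrees on either side of the verb.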
  • Method – Identify entities and relationships in Parse Tree
    [Figure: parse tree of the example sentence from the previous slide, with modifiers, modified entities and composite entities marked. Annotations in the figure: UMLS ID T147; MeSH IDs D004967, D006965 and D004717.]
  • Representation – Resulting RDF
    [Figure: resulting RDF graph, with modifiers, modified entities and composite entities marked.]
  • Preliminary Results
    Swanson’s discoveries – Associations between Migraine and Magnesium [Hearst99]
    • stress is associated with migraines
    • stress can lead to loss of magnesium
    • calcium channel blockers prevent some migraines
    • magnesium is a natural calcium channel blocker
    • spreading cortical depression (SCD) is implicated in some migraines
    • high levels of magnesium inhibit SCD
    • migraine patients have high platelet aggregability
    • magnesium can suppress platelet aggregability
    Data sets generated using these entities (highlighted in red on the slide) as Boolean keyword queries against PubMed
    Bidirectional breadth-first search used to find paths in resulting RDF
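
As a rough illustration of the path search mentioned above, here is a minimal bidirectional breadth-first search over a toy graph. The triples paraphrase the associations listed on this slide; the predicate names, `build_graph` and `bidirectional_bfs` are hypothetical, edge direction is ignored, and the function returns a connecting path without guaranteeing that it is the shortest one.

```python
from collections import deque

# Hypothetical triples paraphrasing the associations listed above.
TRIPLES = [
    ("stress", "associated_with", "migraine"),
    ("stress", "leads_to_loss_of", "magnesium"),
    ("calcium channel blockers", "prevent", "migraine"),
    ("magnesium", "is_a", "calcium channel blockers"),
    ("spreading cortical depression", "implicated_in", "migraine"),
    ("magnesium", "inhibits", "spreading cortical depression"),
]


def build_graph(triples):
    """Undirected adjacency map; edge direction is ignored for path finding."""
    graph = {}
    for subj, _pred, obj in triples:
        graph.setdefault(subj, set()).add(obj)
        graph.setdefault(obj, set()).add(subj)
    return graph


def bidirectional_bfs(graph, start, goal):
    """Return a path connecting start and goal, or None if there is none."""
    if start not in graph or goal not in graph:
        return None
    if start == goal:
        return [start]
    parents_fwd, parents_bwd = {start: None}, {goal: None}
    frontier_fwd, frontier_bwd = deque([start]), deque([goal])

    def expand(frontier, parents, other_parents):
        # Expand one level; return a node reached from both sides, if any.
        for _ in range(len(frontier)):
            node = frontier.popleft()
            for nxt in sorted(graph[node]):
                if nxt not in parents:
                    parents[nxt] = node
                    if nxt in other_parents:
                        return nxt
                    frontier.append(nxt)
        return None

    while frontier_fwd and frontier_bwd:
        meet = expand(frontier_fwd, parents_fwd, parents_bwd)
        if meet is None:
            meet = expand(frontier_bwd, parents_bwd, parents_fwd)
        if meet is not None:
            path, node = [], meet
            while node is not None:      # walk back to the start
                path.append(node)
                node = parents_fwd[node]
            path.reverse()
            node = parents_bwd[meet]
            while node is not None:      # walk forward to the goal
                path.append(node)
                node = parents_bwd[node]
            return path
    return None


graph = build_graph(TRIPLES)
path = bidirectional_bfs(graph, "migraine", "magnesium")
print(path)
```

Searching from both ends keeps each frontier small, which matters when the RDF graph is built from thousands of abstracts.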
  • Paths between Migraine and Magnesium
    Paths are considered interesting if they contain one or more named relationships
    other than hasPart or hasModifiers
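
The interestingness criterion above amounts to a one-line filter. In this sketch a path is represented only by its list of predicates; `STRUCTURAL` and `is_interesting` are hypothetical names, and "induces" is borrowed from the parse example earlier in the deck.

```python
# Predicates that only describe entity structure, per the slide.
STRUCTURAL = {"hasPart", "hasModifiers"}


def is_interesting(path_predicates):
    """True if at least one predicate on the path is a named relationship."""
    return any(pred not in STRUCTURAL for pred in path_predicates)


print(is_interesting(["hasPart", "induces", "hasModifiers"]))
print(is_interesting(["hasPart", "hasModifiers"]))
```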
  • An example of such a path
    CONCLUSION
    • Rules over parse trees are able to extract structure from sentences
    • Our definitions of compound and modified entities are critical for identifying both implicit and explicit relationships
    • Swanson’s discovery can be automated – if recall can be improved – what hurts recall?
  • Unsupervised Joint Extraction of Compound Entities and Relationship
    Cartic Ramakrishnan, Pablo N. Mendes, Shaojun Wang and Amit P. Sheth
    "Unsupervised Discovery of Compound Entities for Relationship Extraction"
    EKAW 2008 - 16th International Conference on Knowledge Engineering and Knowledge Management Knowledge Patterns
  • Joint Extraction approach
    governor
    dependent
    Dependency parse – Stanford Parser
    amod = adjectival modifier
    nsubjpass = nominal subject in passive voice
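
To make the two dependency labels concrete, the sketch below builds a compound entity and a triple from a handful of dependency edges. The edges are hand-written for the sentence "Adenomatous hyperplasia is induced by estrogen" and are an assumption for illustration, not actual parser output; `compound` and `passive_triples` are hypothetical helpers and do not reproduce the published algorithm.

```python
# Hand-written (relation, governor, dependent) edges; an assumption made
# for illustration, not the output of a real parser run.
EDGES = [
    ("amod", "hyperplasia", "adenomatous"),
    ("nsubjpass", "induced", "hyperplasia"),
    ("auxpass", "induced", "is"),
    ("agent", "induced", "estrogen"),
]


def compound(head, edges):
    """Attach adjectival modifiers (amod) to a head noun."""
    mods = [dep for rel, gov, dep in edges if rel == "amod" and gov == head]
    return " ".join(mods + [head])


def passive_triples(edges):
    """Pair each passive subject with the agent of the same verb.

    Returns (agent entity, relationship head, passive-subject entity).
    """
    triples = []
    for rel, verb, subj in edges:
        if rel != "nsubjpass":
            continue
        for rel2, verb2, agent in edges:
            if rel2 == "agent" and verb2 == verb:
                triples.append(
                    (compound(agent, edges), verb, compound(subj, edges))
                )
    return triples


triples = passive_triples(EDGES)
print(triples)
```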
  • Algorithm
    Relationship head
    Subject head
    Object head
    Object head
  • Preliminary results
  • Extracted Triples
  • Semantic Metadata Guided Knowledge Explorations and Discovery
  • Results
  • Hypothesis Driven retrieval of Scientific Literature
    affects
    Migraine
    Magnesium
    Stress
    isa
    inhibit
    Patient
    Calcium Channel
    Blockers
    Complex
    Query
    Supporting
    Document
    sets
    retrieved
    Keyword query: Migraine[MH] + Magnesium[MH]
    PubMed
  • Applications
    Triple-based semantic search
    Semantic Browser
  • Knowledge Discovery = Extraction + Heuristic Aggregation
    Undiscovered Public
    Knowledge
  • Understanding, Analyzing, Mining
    Social Media
    Meena Nagarajan, Karthik Gomadam
  • mumbai, india
  • november 26, 2008
  • another chapter in the war against civilization
  • and
  • the world saw it
    Through the eyes of the people
  • the world read it
    Through the words of the people
  • PEOPLE told their stories to PEOPLE
  • A powerful new era in
    Information dissemination had taken firm ground
  • Making it possible for us to
    create a global network of citizens
    Citizen Sensors –
    Citizens observing, processing, transmitting, reporting
  • Geocoder
    (Reverse Geo-coding)
    Address to location database
    18 Hormusji Street, Colaba
    Vasant Vihar
    Image Metadata
    latitude: 18° 54′ 59.46″ N,
    longitude: 72° 49′ 39.65″ E
    Structured Meta Extraction
    Nariman House
    Income Tax Office
    Identify and extract information from tweets
    Spatio-Temporal Analysis
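
Image metadata usually stores coordinates in degrees, minutes and seconds, while spatial queries work on decimal degrees. A minimal conversion sketch follows (`dms_to_decimal` is a hypothetical helper name); applied to the values above, it lands within rounding distance of the decimal coordinates used in the spatial query slides.

```python
def dms_to_decimal(degrees, minutes, seconds, hemisphere):
    """Convert degrees/minutes/seconds to signed decimal degrees."""
    value = degrees + minutes / 60 + seconds / 3600
    # South and West are negative by convention.
    return -value if hemisphere in ("S", "W") else value


lat = dms_to_decimal(18, 54, 59.46, "N")
lon = dms_to_decimal(72, 49, 39.65, "E")
print(round(lat, 6), round(lon, 6))
```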
  • Research Challenge #1
    Spatio Temporal and Thematic analysis
    What else happened “near” this event location?
    What events occurred “before” and “after” this event?
    Any message about “causes” for this event?
  • Spatial Analysis….
    Which tweets originated from an address near 18.916517°N 72.827682°E?
  • Which tweets originated during Nov 27th 2008, from 11 PM to 12 PM?
  • Giving us
    Tweets originated from an address near 18.916517°N, 72.827682°E during the time interval 27th Nov 2008 between 11 PM and 12 PM?
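
The combined spatial and temporal query can be sketched as a distance filter plus a time-window filter. Everything below other than the event coordinates is an assumption: the tweet records are made-up samples, the 1 km radius is arbitrary, and the one-hour window starting at 23:00 is just one reading of the interval on the slide.

```python
from datetime import datetime
from math import asin, cos, radians, sin, sqrt

EVENT = (18.916517, 72.827682)  # coordinates from the slide


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (a[0], a[1], b[0], b[1]))
    h = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * asin(sqrt(h))


def near_and_during(tweets, center, radius_km, start, end):
    """Keep tweets within radius_km of center and inside [start, end)."""
    return [
        t for t in tweets
        if haversine_km((t["lat"], t["lon"]), center) <= radius_km
        and start <= t["time"] < end
    ]


# Hypothetical sample records, for illustration only.
TWEETS = [
    {"id": 1, "lat": 18.9170, "lon": 72.8280,
     "time": datetime(2008, 11, 27, 23, 15)},
    {"id": 2, "lat": 19.0760, "lon": 72.8777,
     "time": datetime(2008, 11, 27, 23, 20)},   # too far away
    {"id": 3, "lat": 18.9165, "lon": 72.8277,
     "time": datetime(2008, 11, 27, 21, 0)},    # too early
]

hits = near_and_during(
    TWEETS, EVENT, radius_km=1.0,
    start=datetime(2008, 11, 27, 23, 0),
    end=datetime(2008, 11, 28, 0, 0),
)
print([t["id"] for t in hits])
```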
  • Research Challenge #2: Understanding and Analyzing Casual Text
    Casual text
    Microblogs are often written in SMS style language
    Slang, abbreviations
  • Understanding Casual Text
    Not the same as news articles or scientific literature
    Grammatical errors
    Implications on NL parser results
    Inconsistent writing style
    Implications on learning algorithms that generalize from corpus
  • Nature of Microblogs
    Additional constraint of limited context
    Max. of x chars in a microblog
    Context often provided by the discourse
    Entity identification and disambiguation
    Pre-requisite to other sophisticated information analytics
  • NL understanding is hard to begin with..
    Not so hard
    “commando raid appears to be nigh at Oberoi now”
    Oberoi = Oberoi Hotel, Nigh = high
    Challenging
    new wing, live fire @ taj 2nd floor on iDesi TV stream
    Fire on the second floor of the Taj hotel, not on iDesi TV
  • Research Opportunities
    NER, disambiguation in casual, informal text is a budding area of research
    Another important area of focus: Combining information of varied quality from a
    corpus (statistical NLP),
    domain knowledge (tags, folksonomies, taxonomies, ontologies),
    social context (explicit and implicit communities)
  • Social Context surrounding content
    Social context in which a message appears is also an added valuable resource
    Post 1:
    “Hareemane House hostages said by eyewitnesses to be Jews. 7 Gunshots heard by reporters at Taj”
    Follow up post
    that is Nariman House, not (Hareemane)
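
The correction in this example comes from a follow-up post, that is, from social context. As a simpler stand-in, and explicitly not the method described in the talk, the sketch below resolves the misspelled place name by plain string similarity against a small gazetteer, using Python's standard `difflib`. The gazetteer entries are place names mentioned elsewhere in this deck; `resolve_place` and the 0.6 cutoff are assumptions.

```python
import difflib

# Small illustrative gazetteer built from places named in this deck.
GAZETTEER = ["Nariman House", "Oberoi Hotel", "Taj Hotel", "Income Tax Office"]


def resolve_place(mention, gazetteer, cutoff=0.6):
    """Return the gazetteer entry most similar to the mention, or None."""
    lowered = {name.lower(): name for name in gazetteer}
    hits = difflib.get_close_matches(
        mention.lower(), list(lowered), n=1, cutoff=cutoff
    )
    return lowered[hits[0]] if hits else None


best = resolve_place("Hareemane House", GAZETTEER)
print(best)
```

String similarity alone fails when the misspelling is far from the real name, which is where a follow-up post from the community becomes the more reliable signal.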
  • Understanding content … informal text
    I say: “Your music is wicked”
    What I really mean: “Your music is good”
  • Urban Dictionary
    Sentiment expression: Rocks
    Transliterates to: cool, good
    Structured text (biomedical literature)
    Semantic Metadata: Smile is a Track
    Lil transliterates to Lilly Allen
    Lilly Allen is an Artist
    MusicBrainz Taxonomy
    Informal Text (Social Network chatter)
    Artist: Lilly Allen
    Track: Smile
    Your smile rocks Lil
    Multimedia Content and Web data
    Web Services
  • Example: Pulse of a Community
    Imagine millions of such informal opinions
    Individual expressions to mass opinions
    “Popular artists” lists from MySpace comments
    Lilly Allen
    Lady Sovereign
    Amy Winehouse
    Gorillaz
    Coldplay
    Placebo
    Sting
    Kean
    Joss Stone
  • What Drives the Spatio-Temporal-Thematic Analysis and Casual Text Understanding
    Semantics with the help of
    Domain Models
    Domain Models
    Domain Models(ontologies, folksonomies)
  • Domain Knowledge: A key driver
    Places that are nearby ‘Nariman house’
    Spatial query
    Messages originated around this place
    Temporal analysis
    Messages about related events / places
    Thematic analysis
  • Research Challenge #3: But where does the domain knowledge come from?
    Expert and committee based ontology creation … works in some domains (e.g., biomedicine, health care,…)
    Community driven knowledge extraction
    How to create models that are “socially scalable”?
    How to organically grow and maintain this model?
  • Building models…seed word to hierarchy creation using WIKIPEDIA
    Query: “cognition”
  • Identifying relationships: Hard, harder than many hard things
    But NOT that Hard, When WE do it
  • Games with a purpose
    Get humans to give their solitaire time
    Solve real hard computational problems
    Image tagging, Identifying part of an image
    Tag a tune, Squigl, Verbosity, and Matchin
    Pioneered by Luis von Ahn
  • OntoLablr
    Relationship Identification Game
    • leads to
    • causes
    Explosion
    Traffic congestion
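
The agreement rule behind the game (a relationship is accepted once two players independently choose it for the same pair of labels) can be sketched as a vote count. The vote records, user names and `agreed_relationships` helper below are hypothetical.

```python
from collections import Counter


def agreed_relationships(votes, min_agreement=2):
    """Return (subject, relationship, object) triples chosen by enough users.

    votes: iterable of (user, subject, relationship, object) tuples.
    """
    seen = set()
    counts = Counter()
    for user, subj, rel, obj in votes:
        key = (user, subj, rel, obj)
        if key in seen:
            continue  # count each user at most once per triple
        seen.add(key)
        counts[(subj, rel, obj)] += 1
    return sorted(t for t, n in counts.items() if n >= min_agreement)


# Hypothetical votes for the label pair shown on the slide.
VOTES = [
    ("user1", "Explosion", "causes", "Traffic congestion"),
    ("user2", "Explosion", "causes", "Traffic congestion"),
    ("user3", "Explosion", "leads to", "Traffic congestion"),
    ("user1", "Explosion", "causes", "Traffic congestion"),  # repeat vote
]

accepted = agreed_relationships(VOTES)
print(accepted)
```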
  • How do you get comprehensive situational awareness by merging “human sensing” and “machine sensing”?
  • Research Challenge #4: Semantic Sensor Web
  • Semantically Annotated O&M
    <swe:component name="time">
    <swe:Time definition="urn:ogc:def:phenomenon:time" uom="urn:ogc:def:unit:date-time">
    <sa:swe rdfa:about="?time" rdfa:instanceof="time:Instant">
    <sa:sml rdfa:property="xs:date-time"/>
    </sa:swe>
    </swe:Time>
    </swe:component>
    <swe:component name="measured_air_temperature">
    <swe:Quantity definition="urn:ogc:def:phenomenon:temperature" uom="urn:ogc:def:unit:fahrenheit">
    <sa:swe rdfa:about="?measured_air_temperature" rdfa:instanceof="senso:TemperatureObservation">
    <sa:swe rdfa:property="weather:fahrenheit"/>
    <sa:swe rdfa:rel="senso:occurred_when" resource="?time"/>
    <sa:swe rdfa:rel="senso:observed_by" resource="senso:buckeye_sensor"/>
    </sa:swe>
    </swe:Quantity>
    </swe:component>
    <swe:value name="weather-data">
    2008-03-08T05:00:00,29.1
    </swe:value>
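
The value block carries one record whose fields follow the order of the components declared above it: time first, then air temperature in degrees Fahrenheit. A minimal sketch of reading such a record is shown below; `parse_observation` is a hypothetical helper, and the Celsius conversion is added only for illustration.

```python
from datetime import datetime


def parse_observation(value_text):
    """Split a 'timestamp,temperature' record into typed fields."""
    timestamp_text, temperature_text = value_text.strip().split(",")
    return datetime.fromisoformat(timestamp_text), float(temperature_text)


when, temp_f = parse_observation("\n    2008-03-08T05:00:00,29.1\n")
temp_c = (temp_f - 32) * 5 / 9  # unit of measure is Fahrenheit per the uom
print(when.isoformat(), temp_f, round(temp_c, 1))
```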
  • Semantic Sensor ML – Adding Ontological Metadata
    Domain
    Ontology
    Person
    Company
    Spatial
    Ontology
    Coordinates
    Coordinate System
    Temporal
    Ontology
    Time Units
    Timezone
    Mike Botts, "SensorML and Sensor Web Enablement,"
    Earth System Science Center, UAB Huntsville
  • Semantic Query
    Semantic Temporal Query
    Model-references from SML to OWL-Time ontology concepts provide the ability to perform semantic temporal queries
    Supported semantic query operators include:
    contains: user-specified interval falls wholly within a sensor reading interval (also called inside)
    within: sensor reading interval falls wholly within the user-specified interval (inverse of contains or inside)
    overlaps: user-specified interval overlaps the sensor reading interval
    Example SPARQL query defining the temporal operator ‘within’
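
The SPARQL query itself is not part of this transcript. As a plain-Python illustration of the three operators as defined above (not the SPARQL from the slide), intervals can be modeled as (start, end) pairs. Treating the end points as inclusive is an assumption.

```python
from datetime import datetime


def contains(reading, query):
    """The user-specified interval falls wholly within the reading interval."""
    return reading[0] <= query[0] and query[1] <= reading[1]


def within(reading, query):
    """The reading interval falls wholly within the user-specified interval."""
    return query[0] <= reading[0] and reading[1] <= query[1]


def overlaps(reading, query):
    """The two intervals share at least one instant."""
    return reading[0] <= query[1] and query[0] <= reading[1]


reading = (datetime(2008, 3, 8, 5, 0), datetime(2008, 3, 8, 6, 0))
whole_day = (datetime(2008, 3, 8, 0, 0), datetime(2008, 3, 9, 0, 0))
short_slot = (datetime(2008, 3, 8, 5, 15), datetime(2008, 3, 8, 5, 45))

print(within(reading, whole_day))     # reading lies inside the day
print(contains(reading, short_slot))  # slot lies inside the reading
print(overlaps(reading, whole_day))
```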
  • Kno.e.sis’ Semantic Sensor Web
  • Semantic Sensor Web demo (online)
    Semantic Sensor Web demo (local)
  • Synthetic but realistic scenario
    an image taken from a raw satellite feed
  • an image taken by a camera phone with an associated label, “explosion.”
    Synthetic but realistic scenario
  • Textual messages (such as tweets) using STT analysis
    Synthetic but realistic scenario
  • Correlating to get
    Synthetic but realistic scenario
  • Create better views (smart mashups)
  • Extracting Social Signals
    what are the important topics of discussions and concerns in different parts of the world on a particular day
    how different cultures or countries are reacting to the same event or situation (e.g., the Mumbai attack)
    how a situation such as the financial crisis is evolving over a period of time in terms of key topics of discussion and issues of concern (e.g., subprime mortgages and foreclosures, followed by troubled banks and credit freeze, followed by massive government intervention and borrowing, and so on).
    Twitris Demo
  • A few more things
    Use of background knowledge
    Event extraction from text
    time and location extraction
    Such information may not be present
    Someone from Washington DC can tweet about Mumbai
    Scalable semantic analytics
    Subgraph and pattern discovery
    Meaningful subgraphs like relevant and interesting paths
    Ranking paths
  • The Sum of the Parts
    Spatio-Temporal analysis
    Find out where and when
    + Thematic
    What and how
    + Semantic Extraction from text, multimedia and sensor data
    - tags, time, location, concepts, events
    + Semantic models & background knowledge
    Making better sense of STT
    Integration
    + Semantic Sensor Web
    The platform
    = Situational Awareness
  • KNO.E.SIS as a case study of world class research based higher education environment
    http://knoesis.org
  • Kno.e.sis Center Labs (3rd Floor, Joshi)
    Amit Sheth
    Semantic Science Lab
    Semantic Web Lab
    Service Research Lab
    TK Prasad
    Metadata and Languages Lab
    Shaojun Wang
    Statistical Machine Learning
    Pascal Hitzler
    Formal Semantics & Reasoning lab
    Michael Raymer
    • Bioinformatics Lab
    Guozhu Dong
    • Data Mining Lab
    Keke Chen
    • Data Intensive Analysis and Computing Lab
  • Kno.e.sis Members – a subset
  • Exceptional students
    Six of the senior PhD students: 84 papers, 43 program committees, contributed to winning NIH and NSF grants.
    Successfully competed with two Stanford PhDs, 1000+ citations in 2 years of his graduation.
    “BTW, Meena is an absolute find.  If all of your other students are as talented, you are very lucky.  …  I’d definitely like to work with more interns of her caliber, ... ”[Dr. Kevin Haas, Director of Search at Yahoo!]
    “It has been a few years since I visited Dayton (Wright AFB). However, it is clear that Wright State has transformed itself. Congratulations on your success with the Kno.e.sis Center.” [Dr. Alpers Caglayan – looking to hire Kno.e.sis grads]
  • Funding, Collaboration, etc
    UGA, Stanford, CCHMC, SAIC, HP, IBM, Yahoo!
    NIH, NSF, AFRL-HE, AFRL-Sensor, HP, IBM, Microsoft, Google
    70% Federal, 19% State, 11% Industry
    Students intern at the best industry labs & national labs
    Graduates very successful
  • Interested in more background?
    Semantics-Empowered Social Computing
    Semantic Sensor Web
    Traveling the Semantic Web through Space, Theme and Time
    Relationship Web: Blazing Semantic Trails between Web Resources
    Text Mining, Workflow Management, Semantic Web Services, Cloud Computing with application to healthcare, biomedicine, defense/intelligence, energy
    Contact/more details: amit @ knoesis.org
    Special thanks: Karthik Gomadam, Meena Nagarajan, Christopher Thomas
    Partial Funding: NSF (Semantic Discovery: IIS: 071441, Spatio Temporal Thematic: IIS-0842129), AFRL and DAGSI (Semantic Sensor Web), Microsoft Research and IBM Research (Analysis of Social Media Content), and HP Research (Knowledge Extraction from Community-Generated Content).