Semantics-Empowered Understanding, Analysis and Mining of Nontraditional and Unstructured Data
Amit Sheth, "Semantics-Empowered Understanding, Analysis and Mining of Nontraditional and Unstructured Data,"
WSU & AFRL Window-on-Science Seminar on Data Mining, August 05, 2009.
http://wiki.knoesis.org/index.php/Seminar_on_Data_Mining#Semantics_empowered_Understanding.2C_Analysis_and_Mining_of_Nontraditional_and_Unstructured_Data

Published in: Technology, Education

Slide notes
  • Microblogs are one of the most powerful ways of talking about citizen sensor data (CSD)
  • Implicit social context is created by people responding to other messages. In this example we show how the system can identify that it is Nariman and not Hareemane.
  • In the scenario, what techniques and technologies are being brought together? Semantic + Social Computing + Mobile Web
  • Users are shown two images along with labels. The labels are obtained from GI or a similar data source. Users add relationships. When two users agree, the labels are tagged with this relationship. With multiple relationships, the system will learn using ML techniques.
  • Transcript

    • 1. 1
    • 2. Semantics-Empowered Understanding, Analysis and Mining of Nontraditional and Unstructured Data
      WSU & AFRL Window-on-Science Seminar on Data Mining
      Amit P. Sheth,
      LexisNexis Ohio Eminent Scholar
      Director, Kno.e.sis center, Wright State University
      knoesis.org
      Thanks: K. Gomadam, M. Nagarajan, C. Thomas, C. Henson, C. Ramakrishnan, P. Jain and Kno.e.sis Researchers
    • 3. Data & Knowledge Ecosystem
      Situational Awareness
      Decision Support
      Insight
      Knowledge Discovery
      Analysis (e.g., Patterns)
      Understanding & Perception
      Data Mining
      Integration
      Search
      Browsing
      Multimedia Data
      Structured, Semistructured, Unstructured Data
      Textual Data: Scientific Literature, Web Pages, News, Blogs, Reports, Wiki, Forums, Comments, Tweets
      Experimental Data
      Observational Data
      Transactional Data
    • 4. Some examples of R&D we have done
      Semantic Search & Ranking of Stories and Reports – connecting the dots applications (insider threat, financial risk analysis)
      Mining of biomedical (scientific) literature (extraction of entities and relationships) – discovering hidden public knowledge
      Semantic Integration, Analysis and Decision Support over Sensor Data
      Extracting taxonomy/domain model from Wikipedia
      Discovering Hidden Relationships (insights) in Community Created Content (Wikipedia)
    • 5. Understanding User Generated Content (on Social Networking Sites)*
      What are people talking about
      How people write
      Why people write
      With application to
      • Artist Popularity Ranking
      • Advertisement on Social Media
      • Identifying Social Signals – spatio-temporal-thematic analysis of Citizen Sensor Data
      * Meena Nagarajan
    • 8. Search
      Integration
      Analysis
      Discovery
      Question Answering
      Situational Awareness
      Domain Models
      Patterns / Inference / Reasoning
      RDB
      Relationship Web
      Metadata / Semantic Annotations
      Metadata Extraction
      Multimedia Content and Web data
      Text
      Sensor Data
      Structured and Semi-structured data
    • 9. Insider threat demo (semantic search/querying, ranking, …)
    • 10. Knowledge Discovery from Scientific Literature
      Cartic Ramakrishnan
    • 11. What Knowledge Discovery is NOT
      Search
      Keyword-in-document-out
      Keywords are fully specified features of expected outcome
      Searching for prospective mining sites
      Mining
      Know where to look
      Underspecified characteristics of what is sought are available
      Patterns
    • 12. What is knowledge discovery?
      “knowledge discovery is more like sifting through a warehouse filled with small gears, levers, etc., none of which is particularly valuable by itself. After appropriate assembly, however, a Rolex watch emerges from the disparate parts.” – James Caruther
      “discovery is often described as more opportunistic search in a less well-defined space, leading to a psychological element of surprise” – James Buchanan
      Opportunistic search over an ill-defined space leading to surprising but useful emergent knowledge
    • 13. Element of surprise – Swanson’s discoveries
      Stress
      ?
      Swanson’s Discoveries
      Magnesium
      Migraine
      Calcium Channel Blockers
      Spreading Cortical Depression
      11 possible associations found
      PubMed
      Associations discovered through keyword searches, followed by manual analysis of the text to establish possibly relevant relationships
    • 14. Knowledge Discovery over text
      Text
      Assigning interpretation to text
      Semantic metadata in the form of semi-structured data
      Extraction of Semantics from text
      Semantic Metadata Guided Knowledge Explorations
      Semantic Metadata Guided Knowledge Discovery
      Triple-based Semantic Search
      Semantic browser
      Subgraph discovery
    • 15. Information Extraction via Ontology assisted text mining – Relationship extraction
      4733 documents
      9284 documents
      5 documents
      UMLS Semantic Network
      complicates
      Biologically active substance
      affects
      causes
      causes
      Disease or Syndrome
      Lipid
      affects
      instance_of
      instance_of
      ???????
      Fish Oils
      Raynaud’s Disease
      MeSH
      PubMed
    • 16. Background knowledge and Data used
      UMLS – A high level schema of the biomedical domain
      136 classes and 49 relationships
      Synonyms of all relationships – using variant lookup (tools from NLM)
      49 relationships + their synonyms = ~350 verbs
      MeSH
      22,000+ topics organized as a forest of 16 trees
      Used to query PubMed
      PubMed
      Over 16 million abstracts
      Abstracts annotated with one or more MeSH terms
    • 17. Method – Parse Sentences in PubMed
      SS-Tagger (University of Tokyo)
      SS-Parser (University of Tokyo)
      • Entities (MeSH terms) in sentences occur in modified forms
        • “adenomatous” modifies “hyperplasia”
        • “An excessive endogenous or exogenous stimulation” modifies “estrogen”
      • Entities can also occur as composites of 2 or more other entities
        • “adenomatous hyperplasia” and “endometrium” occur as “adenomatous hyperplasia of the endometrium”
      (TOP (S (NP (NP (DT An) (JJ excessive) (ADJP (JJ endogenous) (CC or) (JJ exogenous) ) (NN stimulation) ) (PP (IN by) (NP (NN estrogen) ) ) ) (VP (VBZ induces) (NP (NP (JJ adenomatous) (NN hyperplasia) ) (PP (IN of) (NP (DT the) (NN endometrium) ) ) ) ) ) )
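The bracketed parse above can be read into a tree and searched for noun phrases, which is where the modified and composite entities appear. The following is a minimal sketch in plain Python, not the authors' rule set; the helper names `parse_tree`, `words` and `noun_phrases` are hypothetical.

```python
def parse_tree(text):
    """Parse a Penn-Treebank-style bracketed string into (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        assert tokens[pos] == "("
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())       # nested constituent
            else:
                children.append(tokens[pos])  # a word (leaf)
                pos += 1
        pos += 1  # consume the closing ")"
        return (label, children)

    return read()


def words(node):
    """Return the words under a node, left to right."""
    if isinstance(node, str):
        return [node]
    return [w for child in node[1] for w in words(child)]


def noun_phrases(node):
    """Yield the text of every NP subtree, outer phrases before inner ones."""
    if isinstance(node, str):
        return
    label, children = node
    if label == "NP":
        yield " ".join(words(node))
    for child in children:
        yield from noun_phrases(child)


SENTENCE = (
    "(TOP (S (NP (NP (DT An) (JJ excessive) (ADJP (JJ endogenous) (CC or) "
    "(JJ exogenous) ) (NN stimulation) ) (PP (IN by) (NP (NN estrogen) ) ) ) "
    "(VP (VBZ induces) (NP (NP (JJ adenomatous) (NN hyperplasia) ) "
    "(PP (IN of) (NP (DT the) (NN endometrium) ) ) ) ) ) )"
)

tree = parse_tree(SENTENCE)
nps = list(noun_phrases(tree))
for phrase in nps:
    print(phrase)
```

The composite entity "adenomatous hyperplasia of the endometrium" shows up as one NP, with its parts "adenomatous hyperplasia" and "the endometrium" nested inside it.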
    • 22. Method – Identify entities and relationships in Parse Tree
      [Figure: parse tree of the sentence “An excessive endogenous or exogenous stimulation by estrogen induces adenomatous hyperplasia of the endometrium”, with modifiers, modified entities and composite entities highlighted. Node annotations: UMLS ID T147; MeSH IDs D004967, D006965, D004717.]
    • 23. Representation – Resulting RDF
      Modifiers
      Modified entities
      Composite Entities
    • 24. Preliminary Results
      Swanson’s discoveries – Associations between Migraine and Magnesium [Hearst99]
      • stress is associated with migraines
      • stress can lead to loss of magnesium
      • calcium channel blockers prevent some migraines
      • magnesium is a natural calcium channel blocker
      • spreading cortical depression (SCD) is implicated in some migraines
      • high levels of magnesium inhibit SCD
      • migraine patients have high platelet aggregability
      • magnesium can suppress platelet aggregability
      Data sets generated using these entities (marked red above) as boolean keyword queries against PubMed
      Bidirectional breadth-first search used to find paths in resulting RDF
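Bidirectional breadth-first search grows one frontier from each endpoint and stops when the two meet. The following is a minimal sketch, assuming the RDF has been reduced to plain (subject, predicate, object) tuples and is treated as an undirected graph. The predicate names and the functions `build_graph` and `bidirectional_bfs` are illustrative, not the authors' implementation.

```python
from collections import defaultdict, deque

# Illustrative triples paraphrasing the associations listed on the slide.
TRIPLES = [
    ("stress", "associated_with", "migraine"),
    ("stress", "leads_to_loss_of", "magnesium"),
    ("calcium channel blocker", "prevents", "migraine"),
    ("magnesium", "is_a", "calcium channel blocker"),
    ("spreading cortical depression", "implicated_in", "migraine"),
    ("magnesium", "inhibits", "spreading cortical depression"),
    ("migraine", "associated_with", "platelet aggregability"),
    ("magnesium", "suppresses", "platelet aggregability"),
]


def build_graph(triples):
    """Build an undirected adjacency map, ignoring predicate labels."""
    graph = defaultdict(set)
    for subj, _pred, obj in triples:
        graph[subj].add(obj)
        graph[obj].add(subj)
    return graph


def bidirectional_bfs(graph, start, goal):
    """Return a path from start to goal as a list of nodes, or None."""
    if start == goal:
        return [start]
    parents_fwd, parents_bwd = {start: None}, {goal: None}
    frontier_fwd, frontier_bwd = deque([start]), deque([goal])

    def expand(frontier, parents, other_parents):
        # Expand one full BFS level; return the meeting node if one is found.
        for _ in range(len(frontier)):
            node = frontier.popleft()
            for neighbour in sorted(graph.get(node, ())):
                if neighbour not in parents:
                    parents[neighbour] = node
                    if neighbour in other_parents:
                        return neighbour
                    frontier.append(neighbour)
        return None

    while frontier_fwd and frontier_bwd:
        meet = expand(frontier_fwd, parents_fwd, parents_bwd)
        if meet is None:
            meet = expand(frontier_bwd, parents_bwd, parents_fwd)
        if meet is not None:
            # Walk back to start, then forward to goal.
            path, node = [], meet
            while node is not None:
                path.append(node)
                node = parents_fwd[node]
            path.reverse()
            node = parents_bwd[meet]
            while node is not None:
                path.append(node)
                node = parents_bwd[node]
            return path
    return None


graph = build_graph(TRIPLES)
path = bidirectional_bfs(graph, "migraine", "magnesium")
print(" -> ".join(path))
```

On this toy graph every connecting path runs through one intermediate concept, which mirrors the shape of the migraine and magnesium associations listed above.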
    • 32. Paths between Migraine and Magnesium
      Paths are considered interesting if they contain one or more named relationships other than hasPart or hasModifiers
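The interestingness rule can be written as a one-line filter. This is a sketch, assuming a path is a list of (subject, predicate, object) tuples; the example paths and the `induces` predicate are illustrative only.

```python
# Structural predicates that, on their own, do not make a path interesting.
STRUCTURAL_PREDICATES = {"hasPart", "hasModifiers"}


def is_interesting(path):
    """True if at least one edge carries a named (non-structural) relationship."""
    return any(pred not in STRUCTURAL_PREDICATES for _subj, pred, _obj in path)


def interesting_paths(paths):
    """Keep only the paths that pass the rule."""
    return [p for p in paths if is_interesting(p)]


structural_only = [
    ("adenomatous hyperplasia of the endometrium", "hasPart", "adenomatous hyperplasia"),
    ("adenomatous hyperplasia", "hasModifiers", "adenomatous"),
]
with_named_relation = [
    ("estrogen", "induces", "adenomatous hyperplasia of the endometrium"),
    ("adenomatous hyperplasia of the endometrium", "hasPart", "endometrium"),
]

kept = interesting_paths([structural_only, with_named_relation])
print(len(kept))
```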
    • 33. An example of such a path
      CONCLUSION
      • Rules over parse trees are able to extract structure from sentences
      • Our definitions of compound and modified entities are critical for identifying both implicit and explicit relationships
      • Swanson’s discovery can be automated – if recall can be improved – what hurts recall?
    • 36. Unsupervised Joint Extraction of Compound Entities and Relationship
      Cartic Ramakrishnan, Pablo N. Mendes, Shaojun Wang and Amit P. Sheth
      "Unsupervised Discovery of Compound Entities for Relationship Extraction"
      EKAW 2008 - 16th International Conference on Knowledge Engineering and Knowledge Management Knowledge Patterns
    • 37. Joint Extraction approach
      governor
      dependent
      Dependency parse – Stanford Parser
      amod = adjectival modifier
      nsubjpass = nominal subject in passive voice
    • 38. Algorithm
      Relationship head
      Subject head
      Object head
      Object head
    • 39. Preliminary results
    • 40. Extracted Triples
    • 41. Semantic Metadata Guided Knowledge Explorations and Discovery
    • 42. Results
    • 43. Hypothesis Driven retrieval of Scientific Literature
      affects
      Migraine
      Magnesium
      Stress
      isa
      inhibit
      Patient
      Calcium Channel Blockers
      Complex Query
      Supporting document sets retrieved
      Keyword query: Migraine[MH] + Magnesium[MH]
      PubMed
    • 44. Applications
      Triple-based semantic search
      Semantic Browser
    • 45. Knowledge Discovery = Extraction + Heuristic Aggregation
      Undiscovered Public Knowledge
    • 46. Understanding, Analyzing, Mining
      Social Media
      Meena Nagarajan, Karthik Gomadam
    • 47. mumbai, india
    • 48. november 26, 2008
    • 49. another chapter in the war against civilization
    • 50. and
    • 51.
    • 52.
    • 53. the world saw it
      Through the eyes of the people
    • 54. the world read it
      Through the words of the people
    • 55. PEOPLE told their stories to PEOPLE
    • 56. A powerful new era in
      Information dissemination had taken firm ground
    • 57. Making it possible for us to
      create a global network of citizens
      Citizen Sensors –
      Citizens observing, processing, transmitting, reporting
    • 58. Geocoder
      (Reverse Geo-coding)
      Address to location database
      18 Hormusji Street, Colaba
      Vasant Vihar
      Image Metadata
      latitude: 18° 54′ 59.46″ N,
      longitude: 72° 49′ 39.65″ E
      Structured Meta Extraction
      Nariman House
      Income Tax Office
      Identify and extract information from tweets
      Spatio-Temporal Analysis
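Image metadata usually stores coordinates in degrees, minutes and seconds, while spatial queries work on decimal degrees. A small worked conversion of the coordinates shown above follows; the function name `dms_to_decimal` is hypothetical.

```python
def dms_to_decimal(degrees, minutes, seconds, hemisphere):
    """Convert degrees/minutes/seconds to signed decimal degrees."""
    value = degrees + minutes / 60 + seconds / 3600
    # South and West are negative by convention.
    return -value if hemisphere in ("S", "W") else value


lat = dms_to_decimal(18, 54, 59.46, "N")
lon = dms_to_decimal(72, 49, 39.65, "E")
print(round(lat, 5), round(lon, 5))
```

To about five decimal places these values agree with the decimal coordinates used in the spatial analysis slide (18.916517°N, 72.827682°E).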
    • 59. Research Challenge #1
      Spatio Temporal and Thematic analysis
      What else happened “near” this event location?
      What events occurred “before” and “after” this event?
      Any message about “causes” for this event?
    • 60. Spatial Analysis….
      Which tweets originated from an address near 18.916517°N 72.827682°E?
    • 61. Which tweets originated during Nov 27th 2008, from 11PM to 12 PM
    • 62. Giving us
      Tweets that originated from an address near 18.916517°N, 72.827682°E during the time interval 27th Nov 2008, between 11PM and 12PM?
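The combined spatial and temporal question can be sketched as a filter over geotagged, timestamped messages. Everything below is an assumption made for illustration: the `Tweet` record, the sample messages, the 1 km radius, and the reading of the time window as 11PM to midnight. Distance is computed with the standard haversine formula.

```python
from dataclasses import dataclass
from datetime import datetime
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0


@dataclass
class Tweet:
    text: str
    lat: float
    lon: float
    posted_at: datetime


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def near_and_during(tweets, lat, lon, radius_km, start, end):
    """Tweets within radius_km of (lat, lon) posted in [start, end]."""
    return [
        t for t in tweets
        if haversine_km(t.lat, t.lon, lat, lon) <= radius_km
        and start <= t.posted_at <= end
    ]


# Made-up sample messages.
tweets = [
    Tweet("gunshots heard nearby", 18.9165, 72.8277, datetime(2008, 11, 27, 23, 30)),
    Tweet("watching the news", 28.6139, 77.2090, datetime(2008, 11, 27, 23, 30)),
    Tweet("quiet morning here", 18.9165, 72.8277, datetime(2008, 11, 27, 10, 0)),
]

hits = near_and_during(
    tweets, 18.916517, 72.827682, radius_km=1.0,
    start=datetime(2008, 11, 27, 23, 0), end=datetime(2008, 11, 28, 0, 0),
)
print([t.text for t in hits])
```

A production system would more likely push both predicates into a spatial index or database query; the list comprehension only shows the logic.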
    • 63. Research Challenge #2: Understanding and Analyzing Casual Text
      Casual text
      Microblogs are often written in SMS style language
      Slang, abbreviations
    • 64. Understanding Casual Text
      Not the same as news articles or scientific literature
      Grammatical errors
      Implications for NL parser results
      Inconsistent writing style
      Implications for learning algorithms that generalize from a corpus
    • 65. Nature of Microblogs
      Additional constraint of limited context
      Max. of x chars in a microblog
      Context often provided by the discourse
      Entity identification and disambiguation
      Pre-requisite to other sophisticated information analytics
    • 66. NL understanding is hard to begin with..
      Not so hard
      “commando raid appears to be nigh at Oberoi now”
      Oberoi = Oberoi Hotel, Nigh = high
      Challenging
      new wing, live fire @ taj 2nd floor on iDesi TV stream
      Fire on the second floor of the Taj hotel, not on iDesi TV
    • 67. Research Opportunities
      NER and disambiguation in casual, informal text are a budding area of research
      Another important area of focus: Combining information of varied quality from a
      corpus (statistical NLP),
      domain knowledge (tags, folksonomies, taxonomies, ontologies),
      social context (explicit and implicit communities)
    • 68. Social Context surrounding content
      Social context in which a message appears is also an added valuable resource
      Post 1:
      “Hareemane House hostages said by eyewitnesses to be Jews. 7 Gunshots heard by reporters at Taj”
      Follow up post
      that is Nariman House, not (Hareemane)
    • 69. Understanding content … informal text
      I say: “Your music is wicked”
      What I really mean: “Your music is good”
    • 70. Urban Dictionary
      Sentiment expression: Rocks
      Transliterates to: cool, good
      Structured text (biomedical literature)
      Semantic Metadata: Smile is a Track
      Lil transliterates to Lily Allen
      Lily Allen is an Artist
      MusicBrainz Taxonomy
      Informal Text (Social Network chatter)
      Artist: Lily Allen
      Track: Smile
      Your smile rocks Lil
      Multimedia Content and Web data
      Web Services
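The annotation of “Your smile rocks Lil” can be approximated with two lookups: a slang dictionary for sentiment and a music taxonomy for artists and tracks. The tables below are tiny hand-written stand-ins for Urban Dictionary and MusicBrainz, the mapping of slang to a "positive" label is an assumption, and `annotate` is a hypothetical helper.

```python
# Stand-in for a slang dictionary: slang term -> sentiment label (assumed).
SLANG_SENTIMENT = {"rocks": "positive", "wicked": "positive"}

# Stand-ins for a music taxonomy: alias -> artist, track -> artist.
ARTIST_ALIASES = {"lil": "Lily Allen"}
TRACKS = {"smile": "Lily Allen"}


def annotate(comment):
    """Return artist, track and sentiment found in an informal comment."""
    tokens = [tok.strip(".,!?").lower() for tok in comment.split()]
    artists = [ARTIST_ALIASES[t] for t in tokens if t in ARTIST_ALIASES]
    # A common word such as "smile" counts as a track only when its
    # artist is also mentioned, which serves as crude disambiguation.
    tracks = [t.capitalize() for t in tokens if TRACKS.get(t) in artists]
    sentiments = [SLANG_SENTIMENT[t] for t in tokens if t in SLANG_SENTIMENT]
    return {
        "artist": artists[0] if artists else None,
        "track": tracks[0] if tracks else None,
        "sentiment": sentiments[0] if sentiments else None,
    }


result = annotate("Your smile rocks Lil")
print(result)
```

Without the artist mention, "nice smile" yields no track, which is the point of using domain knowledge as context.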
    • 71. Example: Pulse of a Community
      Imagine millions of such informal opinions
      Individual expressions to mass opinions
      “Popular artists” lists from MySpace comments
      Lily Allen
      Lady Sovereign
      Amy Winehouse
      Gorillaz
      Coldplay
      Placebo
      Sting
      Keane
      Joss Stone
    • 72. What Drives the Spatio-Temporal-Thematic Analysis and Casual Text Understanding
      Semantics with the help of
      Domain Models (ontologies, folksonomies)
    • 73. Domain Knowledge: A key driver
      Places that are nearby ‘Nariman house’
      Spatial query
      Messages originated around this place
      Temporal analysis
      Messages about related events / places
      Thematic analysis
    • 74. Research Challenge #3: But where does the domain knowledge come from?
      Expert and committee based ontology creation … works in some domains (e.g., biomedicine, health care,…)
      Community driven knowledge extraction
      How to create models that are “socially scalable”?
      How to organically grow and maintain this model?
    • 75. Building models…seed word to hierarchy creation using WIKIPEDIA
      Query: “cognition”
    • 76. Identifying relationships: Hard, harder than many hard things
      But NOT that Hard, When WE do it
    • 77. Games with a purpose
      Get humans to give their solitaire time
      Solve real hard computational problems
      Image tagging, Identifying part of an image
      Tag a tune, Squigl, Verbosity, and Matchin
      Pioneered by Luis von Ahn
    • 78. OntoLablr
      Relationship Identification Game
      Explosion
      Traffic congestion
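The agreement rule behind such a game (a relationship is accepted once two players independently propose it for the same pair of labels) can be sketched as follows. The submission format, the relationship names and the function `agreed_relationships` are assumptions for illustration, not the game's actual design.

```python
def agreed_relationships(submissions, min_agreement=2):
    """Return the (label_a, label_b, relationship) triples proposed by at
    least min_agreement distinct users.

    submissions: iterable of (user, label_a, label_b, relationship).
    """
    voters = {}
    for user, label_a, label_b, relationship in submissions:
        voters.setdefault((label_a, label_b, relationship), set()).add(user)
    return {key for key, users in voters.items() if len(users) >= min_agreement}


submissions = [
    ("u1", "Explosion", "Traffic congestion", "causes"),
    ("u2", "Explosion", "Traffic congestion", "causes"),
    ("u3", "Explosion", "Traffic congestion", "near"),
    ("u3", "Explosion", "Traffic congestion", "near"),  # same user twice
]

accepted = agreed_relationships(submissions)
print(accepted)
```

Counting distinct users rather than raw submissions keeps one player from confirming their own answer.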
    • 80. How do you get comprehensive situational awareness by merging “human sensing” and “machine sensing”?
    • 81. Research Challenge #4: Semantic Sensor Web
    • 82. Semantically Annotated O&M
      <swe:component name="time">
        <swe:Time definition="urn:ogc:def:phenomenon:time" uom="urn:ogc:def:unit:date-time">
          <sa:swe rdfa:about="?time" rdfa:instanceof="time:Instant">
            <sa:sml rdfa:property="xs:date-time"/>
          </sa:swe>
        </swe:Time>
      </swe:component>
      <swe:component name="measured_air_temperature">
        <swe:Quantity definition="urn:ogc:def:phenomenon:temperature" uom="urn:ogc:def:unit:fahrenheit">
          <sa:swe rdfa:about="?measured_air_temperature" rdfa:instanceof="senso:TemperatureObservation">
            <sa:swe rdfa:property="weather:fahrenheit"/>
            <sa:swe rdfa:rel="senso:occurred_when" resource="?time"/>
            <sa:swe rdfa:rel="senso:observed_by" resource="senso:buckeye_sensor"/>
          </sa:swe>
        </swe:Quantity>
      </swe:component>
      <swe:value name="weather-data">
        2008-03-08T05:00:00,29.1
      </swe:value>
    • 83. Semantic Sensor ML – Adding Ontological Metadata
      Domain Ontology
      Person
      Company
      Spatial Ontology
      Coordinates
      Coordinate System
      Temporal Ontology
      Time Units
      Timezone
      Mike Botts, "SensorML and Sensor Web Enablement,"
      Earth System Science Center, UAB Huntsville
    • 84. Semantic Query
      Semantic Temporal Query
      Model references from SML to OWL-Time ontology concepts provide the ability to perform semantic temporal queries
      Supported semantic query operators include:
      contains: user-specified interval falls wholly within a sensor reading interval (also called inside)
      within: sensor reading interval falls wholly within the user-specified interval (inverse of contains or inside)
      overlaps: user-specified interval overlaps the sensor reading interval
      Example SPARQL query defining the temporal operator ‘within’
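The SPARQL query itself did not survive in the transcript. As a stand-in, the three operators can be sketched directly over closed intervals in Python, following the definitions given above. The `Interval` type and the choice to treat touching endpoints as overlapping are assumptions.

```python
from datetime import datetime
from typing import NamedTuple


class Interval(NamedTuple):
    start: datetime
    end: datetime


def contains(reading, query):
    """The user-specified interval falls wholly within the sensor reading interval."""
    return reading.start <= query.start and query.end <= reading.end


def within(reading, query):
    """The sensor reading interval falls wholly within the user-specified interval."""
    return query.start <= reading.start and reading.end <= query.end


def overlaps(reading, query):
    """The two intervals share at least one instant (endpoints included)."""
    return reading.start <= query.end and query.start <= reading.end


# A one-hour reading starting at the timestamp from the O&M example.
reading = Interval(datetime(2008, 3, 8, 5, 0), datetime(2008, 3, 8, 6, 0))
wide_query = Interval(datetime(2008, 3, 8, 4, 0), datetime(2008, 3, 8, 7, 0))
narrow_query = Interval(datetime(2008, 3, 8, 5, 15), datetime(2008, 3, 8, 5, 30))
late_query = Interval(datetime(2008, 3, 8, 7, 0), datetime(2008, 3, 8, 8, 0))

print(within(reading, wide_query), contains(reading, narrow_query),
      overlaps(reading, late_query))
```

Note that `within(reading, query)` is the same test as `contains(query, reading)`, matching the slide's description of `within` as the inverse of `contains`.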
    • 85. Kno.e.sis’ Semantic Sensor Web
    • 86. Semantic Sensor Web demo (online)
      Semantic Sensor Web demo (local)
    • 87. Synthetic but realistic scenario
      an image taken from a raw satellite feed
    • 88. an image taken by a camera phone with an associated label, “explosion.”
      Synthetic but realistic scenario
    • 89. Textual messages (such as tweets) using STT analysis
      Synthetic but realistic scenario
    • 90. Correlating to get
      Synthetic but realistic scenario
    • 91. Create better views (smart mashups)
    • 92. Extracting Social Signals
      what are the important topics of discussions and concerns in different parts of the world on a particular day
      how different cultures or countries are reacting to the same event or situation (e.g., the Mumbai attack)
      how a situation such as a financial crisis is evolving over a period of time in terms of key topics of discussion and issues of concern (e.g., subprime mortgages and foreclosures, followed by troubled banks and credit freeze, followed by massive government intervention and borrowing, and so on).
      Twitris Demo
    • 93. A few more things
      Use of background knowledge
      Event extraction from text
      time and location extraction
      Such information may not be present
      Someone from Washington DC can tweet about Mumbai
      Scalable semantic analytics
      Subgraph and pattern discovery
      Meaningful subgraphs like relevant and interesting paths
      Ranking paths
    • 94. The Sum of the Parts
      Spatio-Temporal analysis
      Find out where and when
      + Thematic
      What and how
      + Semantic Extraction from text, multimedia and sensor data
      - tags, time, location, concepts, events
      + Semantic models & background knowledge
      Making better sense of STT
      Integration
      + Semantic Sensor Web
      The platform
      = Situational Awareness
    • 95. KNO.E.SIS as a case study of world class research based higher education environment
      http://knoesis.org
    • 96. Kno.e.sis Center Labs (3rd Floor, Joshi)
      Amit Sheth
      Semantic Science Lab
      Semantic Web Lab
      Service Research Lab
      TK Prasad
      Metadata and Languages Lab
      Shaojun Wang
      Statistical Machine Learning
      Pascal Hitzler
      Formal Semantics & Reasoning lab
      Michael Raymer
      • Bioinformatics Lab
      Guozhu Dong
      • Data Mining Lab
      Keke Chen
      • Data Intensive Analysis and Computing Lab
    • Kno.e.sis Members – a subset
    • 97. Exceptional students
      Six of the senior PhD students: 84 papers, 43 program committees, contributed to winning NIH and NSF grants.
      Successfully competed with two Stanford PhDs, 1000+ citations within 2 years of his graduation.
      “BTW, Meena is an absolute find.  If all of your other students are as talented, you are very lucky.  …  I’d definitely like to work with more interns of her caliber, ... ”[Dr. Kevin Haas, Director of Search at Yahoo!]
      “It has been a few years since I visited Dayton (Wright AFB). However, it is clear that Wright State has transformed itself. Congratulations on your success with the Knoesis Center.” [Dr. Alpers Caglayan – looking to hire Kno.e.sis grads]
    • 98. Funding, Collaboration, etc
      UGA, Stanford, CCHMC, SAIC, HP, IBM, Yahoo!
      NIH, NSF, AFRL-HE, AFRL-Sensor, HP, IBM, Microsoft, Google
      70% Federal, 19% State, 11% Industry
      Students intern at the best industry labs & national labs
      Graduates very successful
    • 99. Interested in more background?
      Semantics-Empowered Social Computing
      Semantic Sensor Web
      Traveling the Semantic Web through Space, Theme and Time
      Relationship Web: Blazing Semantic Trails between Web Resources
      Text Mining, Workflow Management, Semantic Web Services, Cloud Computing with application to healthcare, biomedicine, defense/intelligence, energy
      Contact/more details: amit @ knoesis.org
      Special thanks: Karthik Gomadam, Meena Nagarajan, Christopher Thomas
      Partial Funding: NSF (Semantic Discovery: IIS: 071441, Spatio Temporal Thematic: IIS-0842129), AFRL and DAGSI (Semantic Sensor Web), Microsoft Research and IBM Research (Analysis of Social Media Content), and HP Research (Knowledge Extraction from Community-Generated Content).