Semantics-Empowered Understanding, Analysis and Mining of Nontraditional and Unstructured Data


Amit Sheth, "Semantics-Empowered Understanding, Analysis and Mining of Nontraditional and Unstructured Data,"
WSU & AFRL Window-on-Science Seminar on Data Mining, August 05, 2009.

Published in: Technology, Education
  • Microblogs are one of the most powerful ways of conveying citizen sensor data (CSD).
  • Implicit social context is created by people responding to other messages. In this example we show how the system can identify that it is Nariman and not Hareemane.
  • In the scenario, what techniques and technologies are being brought together? Semantic + Social Computing + Mobile Web
  • Users are shown two images along with labels. The labels are obtained from GI or a similar data source. Users add relationships. When two users agree, the labels are tagged with this relationship. Given multiple relationships, the system will learn using ML techniques.
  • Semantics-Empowered Understanding, Analysis and Mining of Nontraditional and Unstructured Data

    1. 1. 1<br />
    2. 2. Semantics-Empowered Understanding, Analysis and Mining of Nontraditional and Unstructured Data<br />WSU & AFRL Window-on-Science Seminar on Data Mining<br />Amit P. Sheth,<br />LexisNexis Ohio Eminent Scholar<br />Director, Kno.e.sis center, Wright State University<br /><br />Thanks: K. Gomadam, M. Nagarajan, C. Thomas, C. Henson, C. Ramakrishnan, P. Jain and Kno.e.sis Researchers<br />
    3. 3. Data & Knowledge Ecosystem<br />3<br />Situational Awareness<br />Decision Support<br />Insight<br />Knowledge Discovery<br />Analysis (eg Patterns)<br />Understanding & Perception<br />Data Mining<br />Integration<br />Search<br />Browsing<br />Multimedia Data<br />Structured,<br />Semistructured<br />Unstructured<br />Data<br />Textual Data: Scientific Literature, Web Pages, News, Blogs, <br /> Reports, Wiki, Forums, Comments, Tweets <br />Experimental Data<br />Observational Data<br />Transactional Data<br />
    4. 4. Some examples of R&D we have done<br />Semantic Search & Ranking of Stories and Reports – connecting the dots applications (insider threat, financial risk analysis)<br />Mining of biomedical (scientific) literature (extraction of entities and relationships) – discovering hidden public knowledge<br />Semantic Integration, Analysis and Decision Support over Sensor Data<br />Extracting taxonomy/domain model from Wikipedia<br />Discovering Hidden Relationships (insights) in Community Created Content (Wikipedia)<br />4<br />
    5. 5. Understanding User Generated Content (on Social Networking Sites)*<br />What are people talking about<br />How people write<br />Why people write<br />With application to <br /><ul><li>Artist Popularity Ranking
    6. 6. Advertisement on Social Media
    7. 7. Identifying Social Signals – spatio-temporal-thematic analysis of Citizen Sensor Data</li></ul>5<br />* Meena Nagarajan<br />
    8. 8. Search<br />Integration<br />Analysis<br />Discovery<br />Question <br /> Answering<br />Situational <br /> Awareness<br />Domain Models<br />Patterns / Inference / Reasoning<br />RDB<br />Relationship Web<br />Meta data / Semantic Annotations<br />Metadata Extraction<br />Multimedia Content and Web data<br />Text<br />Sensor Data<br />Structured and Semi-structured data<br />
    9. 9. Insider threat demo (semantic search/querying, ranking, …)<br />7<br />
    10. 10. Knowledge Discovery from Scientific Literature<br />Cartic Ramakrishnan<br />
    11. 11. 9<br />What Knowledge Discovery is NOT <br />Search<br />Keyword-in-document-out <br />Keywords are fully specified features of expected outcome<br />Searching for prospective mining sites<br />Mining <br />Know where to look<br />Underspecified characteristics of what is sought are available<br />Patterns<br />Cartic Ramakrishnan<br />
    12. 12. 10<br />What is knowledge discovery?<br />“knowledge discovery is more like sifting through a warehouse filled with small gears, levers, etc., none of which is particularly valuable by itself. After appropriate assembly, however, a Rolex watch emerges from the disparate parts.” – James Caruther <br />“discovery is often described as more opportunistic search in a less well-defined space, leading to a psychological element of surprise” – James Buchanan<br />Opportunistic search over an ill-defined space leading to surprising but useful emergent knowledge<br />CarticRamakrishnan<br />
    13. 13. Element of surprise – Swanson’s discoveries<br />Stress<br />?<br />Swanson’s <br />Discoveries<br />Magnesium<br />Migraine<br />Calcium <br />Channel <br />Blockers<br />Spreading Cortical Depression<br />11 possible associations found<br />PubMed<br />Associations discovered based on keyword searches <br />followed by manual analysis of text to establish possible relevant relationships<br />11<br />
    14. 14. Knowledge Discovery over text<br />Text<br />Assigning interpretation to text <br />Semantic metadata <br />in the form of<br />semi-structured data<br />Extraction of <br />Semantics <br />from text<br />Semantic Metadata <br />Guided <br />Knowledge Explorations <br />Semantic Metadata <br />Guided <br />Knowledge Discovery<br />Triple-based<br />Semantic <br />Search<br />Semantic<br />browser<br />Subgraph<br />discovery<br />12<br />CarticRamakrishnan<br />
    15. 15. Information Extraction via Ontology assisted text mining – Relationship extraction<br />4733 <br />documents<br />9284 <br />documents<br />5 <br />documents<br />UMLS <br />Semantic Network<br />complicates<br />Biologically <br />active substance<br />affects<br />causes<br />causes<br />Disease or Syndrome<br />Lipid<br />affects<br />instance_of<br />instance_of<br />???????<br />Fish Oils<br />Raynaud’s Disease<br />MeSH<br />PubMed<br />13<br />CarticRamakrishnan<br />
    16. 16. Background knowledge and Data used<br />UMLS – A high level schema of the biomedical domain<br />136 classes and 49 relationships<br />Synonyms of all relationships – using variant lookup (tools from NLM)<br />49 relationships + their synonyms = ~350 verbs<br />MeSH <br />22,000+ topics organized as a forest of 16 trees<br />Used to query PubMed<br />PubMed <br />Over 16 million abstracts<br />Abstracts annotated with one or more MeSH terms<br />14<br />
    17. 17. Method – Parse Sentences in PubMed<br />SS-Tagger (University of Tokyo)<br />SS-Parser (University of Tokyo)<br /><ul><li> Entities (MeSH terms) in sentences occur in modified forms
    18. 18. “adenomatous” modifies “hyperplasia”
    19. 19. “An excessive endogenous or exogenous stimulation” modifies “estrogen”
    20. 20. Entities can also occur as composites of 2 or more other entities
    21. 21. “adenomatous hyperplasia” and “endometrium” occur as “adenomatous hyperplasia of the endometrium”</li></ul>(TOP (S (NP (NP (DT An) (JJ excessive) (ADJP (JJ endogenous) (CC or) (JJ exogenous) ) (NN stimulation) ) (PP (IN by) (NP (NN estrogen) ) ) ) (VP (VBZ induces) (NP (NP (JJ adenomatous) (NN hyperplasia) ) (PP (IN of) (NP (DT the) (NN endometrium) ) ) ) ) ) ) <br />15<br />CarticRamakrishnan<br />
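The bracketed parse on this slide can be read into a tree and its innermost noun phrases listed with a few lines of Python. This is only a minimal sketch of the idea, not the SS-Parser pipeline or the authors' extraction rules; the helper names `parse_tree`, `leaves`, `contains_np`, and `base_noun_phrases` are hypothetical.

```python
import re

# The parse string is copied from the slide.
SENTENCE_PARSE = (
    "(TOP (S (NP (NP (DT An) (JJ excessive) (ADJP (JJ endogenous) (CC or) "
    "(JJ exogenous) ) (NN stimulation) ) (PP (IN by) (NP (NN estrogen) ) ) ) "
    "(VP (VBZ induces) (NP (NP (JJ adenomatous) (NN hyperplasia) ) "
    "(PP (IN of) (NP (DT the) (NN endometrium) ) ) ) ) ) )"
)


def parse_tree(text):
    """Turn a bracketed parse into nested lists: [label, child, child, ...]."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read():
        nonlocal pos
        pos += 1                  # skip "("
        node = [tokens[pos]]      # constituent or part-of-speech label
        pos += 1
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                node.append(read())
            else:
                node.append(tokens[pos])  # a word under a POS tag
                pos += 1
        pos += 1                  # skip ")"
        return node

    return read()


def leaves(node):
    """Collect the words below a node, left to right."""
    words = []
    for child in node[1:]:
        if isinstance(child, list):
            words.extend(leaves(child))
        else:
            words.append(child)
    return words


def contains_np(node):
    """True if any descendant of the node is labelled NP."""
    return any(
        isinstance(child, list) and (child[0] == "NP" or contains_np(child))
        for child in node[1:]
    )


def base_noun_phrases(node):
    """List NPs with no nested NP; these are candidate entity mentions."""
    found = []
    if node[0] == "NP" and not contains_np(node):
        found.append(" ".join(leaves(node)))
    for child in node[1:]:
        if isinstance(child, list):
            found.extend(base_noun_phrases(child))
    return found


tree = parse_tree(SENTENCE_PARSE)
for phrase in base_noun_phrases(tree):
    print(phrase)
```

The innermost noun phrases are where modifiers ("adenomatous") attach to entity heads ("hyperplasia"); composite entities such as "adenomatous hyperplasia of the endometrium" correspond to the enclosing NP nodes.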
    22. 22. Method – Identify entities and relationships in Parse Tree<br />Modifiers<br />TOP<br />Modified entities<br />Composite Entities<br />S<br />VP<br />UMLS ID<br /> T147<br />NP<br />VBZ<br />induces<br />NP<br />PP<br />NP<br />NP<br />NN<br />estrogen<br />IN<br />by<br />JJ<br />excessive<br />PP<br />DT<br />the<br />ADJP<br />NN<br />stimulation<br />MeSHID<br />D004967<br />IN<br />of<br />JJ<br />adenomatous<br />NN<br />hyperplasia<br />NP<br />JJ<br />endogenous<br />JJ<br />exogenous<br />CC<br />or<br />MeSHID<br />D006965<br />NN<br />endometrium<br />DT<br />the<br />MeSHID<br />D004717<br />16<br />
    23. 23. Representation – Resulting RDF<br />Modifiers<br />Modified entities<br />Composite Entities<br />17<br />
    24. 24. 18<br />Preliminary Results<br /> Swanson’s discoveries – Associations between Migraine and Magnesium [Hearst99]<br /><ul><li>stress is associated with migraines
    25. 25. stress can lead to loss of magnesium
    26. 26. calcium channel blockers prevent some migraines
    27. 27. magnesium is a natural calcium channel blocker
    28. 28. spreading cortical depression (SCD) is implicated in some migraines
    29. 29. high levels of magnesium inhibit SCD
    30. 30. migraine patients have high platelet aggregability
    31. 31. magnesium can suppress platelet aggregability</li></ul>Data sets generated using these entities (marked red above) as boolean keyword queries against PubMed<br />Bidirectional breadth-first search used to find paths in resulting RDF <br />
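A bidirectional breadth-first search of the kind mentioned on this slide can be sketched as follows. The triples are hand-written from the slide's bullet points purely for illustration; the predicate names, the undirected treatment of edges, and the function names `build_adjacency` and `bidirectional_bfs` are assumptions, not the authors' implementation or data.

```python
from collections import deque

# Illustrative triples paraphrasing the slide's bullets (predicate names assumed).
TRIPLES = [
    ("stress", "associated_with", "migraine"),
    ("stress", "leads_to_loss_of", "magnesium"),
    ("calcium channel blockers", "prevent", "migraine"),
    ("magnesium", "is_a", "calcium channel blockers"),
    ("spreading cortical depression", "implicated_in", "migraine"),
    ("magnesium", "inhibits", "spreading cortical depression"),
]


def build_adjacency(triples):
    """Treat each triple as an undirected edge between subject and object."""
    adjacency = {}
    for subject, _predicate, obj in triples:
        adjacency.setdefault(subject, set()).add(obj)
        adjacency.setdefault(obj, set()).add(subject)
    return adjacency


def bidirectional_bfs(adjacency, source, target):
    """Return a node path connecting source and target, or None if none exists."""
    if source not in adjacency or target not in adjacency:
        return None
    if source == target:
        return [source]
    parents_fwd = {source: None}
    parents_bwd = {target: None}
    frontier_fwd = deque([source])
    frontier_bwd = deque([target])

    def expand(frontier, parents, other_parents):
        # Expand one BFS level; return the node where the two searches meet.
        for _ in range(len(frontier)):
            node = frontier.popleft()
            for neighbor in adjacency[node]:
                if neighbor in parents:
                    continue
                parents[neighbor] = node
                if neighbor in other_parents:
                    return neighbor
                frontier.append(neighbor)
        return None

    def join(meet):
        # Walk back to the source, then forward to the target.
        path = []
        node = meet
        while node is not None:
            path.append(node)
            node = parents_fwd[node]
        path.reverse()
        node = parents_bwd[meet]
        while node is not None:
            path.append(node)
            node = parents_bwd[node]
        return path

    while frontier_fwd and frontier_bwd:
        # Grow the smaller frontier to keep the two searches balanced.
        if len(frontier_fwd) <= len(frontier_bwd):
            meet = expand(frontier_fwd, parents_fwd, parents_bwd)
        else:
            meet = expand(frontier_bwd, parents_bwd, parents_fwd)
        if meet is not None:
            return join(meet)
    return None


adjacency = build_adjacency(TRIPLES)
print(bidirectional_bfs(adjacency, "migraine", "magnesium"))
```

Searching from both ends keeps each frontier small, which matters when the RDF graph is built from thousands of abstracts rather than six hand-written triples.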
    32. 32. Paths between Migraine and Magnesium<br />Paths are considered interesting if they have one or more named relationships<br />Other than hasPart or hasModifiers in them<br />19<br />Cartic Ramakrishnan<br />
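The interestingness rule on this slide can be written as a one-line filter. The sketch below assumes a path is a list of (subject, predicate, object) triples; the example paths and the function name `is_interesting` are hypothetical.

```python
# Structural predicates named on the slide; a path made only of these is not interesting.
STRUCTURAL_PREDICATES = {"hasPart", "hasModifiers"}


def is_interesting(path):
    """True if at least one edge carries a named (non-structural) relationship."""
    return any(predicate not in STRUCTURAL_PREDICATES for _s, predicate, _o in path)


# Hypothetical paths for illustration only.
structural_only = [
    ("adenomatous hyperplasia of the endometrium", "hasPart", "adenomatous hyperplasia"),
    ("adenomatous hyperplasia", "hasModifiers", "adenomatous"),
]
with_named_relationship = [
    ("estrogen", "induces", "adenomatous hyperplasia of the endometrium"),
    ("adenomatous hyperplasia of the endometrium", "hasPart", "endometrium"),
]

print(is_interesting(structural_only))
print(is_interesting(with_named_relationship))
```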
    33. 33. An example of such a path<br />CONCLUSION<br /><ul><li>Rules over parse trees are able to extract structure from sentences
    34. 34. Our definitions of compound and modified entities are critical for identifying both implicit and explicit relationships
    35. 35. Swanson’s discovery can be automated – if recall can be improved – what hurts recall?</li></ul>20<br />
    36. 36. Unsupervised Joint Extraction of Compound Entities and Relationship<br />Cartic Ramakrishnan, Pablo N. Mendes, Shaojun Wang and Amit P. Sheth <br />&quot;Unsupervised Discovery of Compound Entities for Relationship Extraction&quot;<br />EKAW 2008 - 16th International Conference on Knowledge Engineering and Knowledge Management Knowledge Patterns <br />
    37. 37. Joint Extraction approach<br />governor<br />dependent<br />Dependency parse – Stanford Parser<br />amod = adjectival modifier<br />nsubjpass = nominal subject in passive voice<br />22<br />
    38. 38. Algorithm<br />Relationship head<br />Subject head<br />Object head<br />Object head<br />23<br />Cartic Ramakrishnan<br />
    39. 39. 24<br />Preliminary results<br />Cartic Ramakrishnan<br />
    40. 40. 25<br />Extracted Triples<br />
    41. 41. Semantic Metadata Guided Knowledge Explorations and Discovery<br />
    42. 42. 27<br />Results<br />Cartic Ramakrishnan<br />
    43. 43. Hypothesis Driven retrieval of Scientific Literature <br />affects<br />Migraine<br />Magnesium<br />Stress<br />isa<br />inhibit<br />Patient<br />Calcium Channel <br />Blockers<br />Complex <br />Query<br />Supporting<br />Document <br />sets<br />retrieved<br />Keyword query: Migraine[MH] + Magnesium[MH]<br />PubMed<br />28<br />
    44. 44. 29<br />Applications<br />Triple-based semantic search<br />Semantic Browser<br />
    45. 45. 30<br />Knowledge Discovery = Extraction + Heuristic Aggregation<br />Undiscovered Public <br />Knowledge<br />
    46. 46. Understanding, Analyzing, Mining <br />Social Media<br />Meena Nagarajan, Karthik Gomadam<br />
    47. 47. mumbai, india<br />
    48. 48. november 26, 2008<br />
    49. 49. another chapter in the war against civilization<br />
    50. 50. and<br />
    51. 51.
    52. 52.
    53. 53. the world saw it<br />Through the eyes of the people <br />
    54. 54. the world read it<br />Through the words of the people <br />
    55. 55. PEOPLE told their stories to PEOPLE<br />
    56. 56. A powerful new era in <br />Information dissemination had taken firm ground<br />
    57. 57. Making it possible for us to<br />create a global network of citizens<br />Citizen Sensors – <br />Citizens observing, processing, transmitting, reporting<br />
    58. 58. Geocoder<br />(Reverse Geo-coding)<br />Address to location database<br />18 Hormusji Street, Colaba<br />Vasant Vihar<br />Image Metadata<br />latitude: 18° 54′ 59.46″ N, <br />longitude: 72° 49′ 39.65″ E<br />Structured Meta Extraction<br />Nariman House<br />Income Tax Office<br />Identify and extract information from tweets<br />Spatio-Temporal Analysis<br />
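The image metadata on this slide gives the location in degrees, minutes, and seconds, while the spatial-analysis slide later in the deck uses decimal degrees. The conversion is simple arithmetic; the sketch below uses a hypothetical helper name, `dms_to_decimal`.

```python
def dms_to_decimal(degrees, minutes, seconds, hemisphere):
    """Convert degrees/minutes/seconds to signed decimal degrees."""
    value = degrees + minutes / 60 + seconds / 3600
    # South and west are negative by convention.
    return -value if hemisphere in ("S", "W") else value


latitude = dms_to_decimal(18, 54, 59.46, "N")
longitude = dms_to_decimal(72, 49, 39.65, "E")
print(round(latitude, 6), round(longitude, 6))
```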
    59. 59. Research Challenge #1<br />Spatio Temporal and Thematic analysis<br />What else happened “near” this event location?<br />What events occurred “before” and “after” this event?<br />Any message about “causes” for this event?<br />
    60. 60. Spatial Analysis….<br />Which tweets originated from an address near 18.916517°N 72.827682°E? <br />
    61. 61. Which tweets originated during Nov 27th 2008, from 11 PM to 12 PM?<br />
    62. 62. Giving us<br />Tweets that originated from an address near 18.916517°N, 72.827682°E during the time interval 27th Nov 2008, between 11 PM and 12 PM<br />
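The combined spatial and temporal query can be sketched as a distance filter plus a time-window filter. Everything below is illustrative: the `Tweet` record, the sample tweets, the 1 km radius, and the reading of the slide's time window as 23:00 to midnight are assumptions, and great-circle (haversine) distance stands in for whatever "near" means in the actual system.

```python
from dataclasses import dataclass
from datetime import datetime
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0


@dataclass
class Tweet:
    text: str
    lat: float
    lon: float
    posted_at: datetime


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points given in decimal degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def tweets_near(tweets, lat, lon, radius_km, start, end):
    """Keep tweets posted in [start, end) within radius_km of (lat, lon)."""
    return [
        t for t in tweets
        if start <= t.posted_at < end
        and haversine_km(t.lat, t.lon, lat, lon) <= radius_km
    ]


# Hypothetical sample data for illustration only.
SAMPLE_TWEETS = [
    Tweet("near and in window", 18.9170, 72.8280, datetime(2008, 11, 27, 23, 15)),
    Tweet("far away", 19.0760, 72.8777, datetime(2008, 11, 27, 23, 30)),
    Tweet("near but too early", 18.9170, 72.8280, datetime(2008, 11, 27, 10, 0)),
]

matches = tweets_near(
    SAMPLE_TWEETS,
    lat=18.916517,
    lon=72.827682,
    radius_km=1.0,                          # assumed radius
    start=datetime(2008, 11, 27, 23, 0),    # assumed window start
    end=datetime(2008, 11, 28, 0, 0),       # assumed window end
)
print([t.text for t in matches])
```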
    63. 63. Research Challenge #2: Understanding and Analyzing Casual Text<br />Casual text<br />Microblogs are often written in SMS style language<br />Slang, abbreviations<br />
    64. 64. Understanding Casual Text<br />Not the same as news articles or scientific literature<br />Grammatical errors<br />Implications on NL parser results<br />Inconsistent writing style<br />Implications on learning algorithms that generalize from corpus<br />
    65. 65. Nature of Microblogs<br />Additional constraint of limited context<br />Max. of x chars in a microblog<br />Context often provided by the discourse<br />Entity identification and disambiguation<br />Pre-requisite to other sophisticated information analytics<br />
    66. 66. NL understanding is hard to begin with..<br />Not so hard<br />“commando raid appears to be nigh at Oberoi now”<br />Oberoi = Oberoi Hotel, Nigh = high<br />Challenging<br />new wing, live fire @ taj 2nd floor on iDesi TV stream<br />Fire on the second floor of the Taj hotel, not on iDesi TV<br />
    67. 67. Research Opportunities<br />NER, disambiguation in casual, informal text is a budding area of research<br />Another important area of focus: Combining information of varied quality from a <br />corpus (statistical NLP), <br />domain knowledge (tags, folksonomies, taxonomies, ontologies), <br />social context (explicit and implicit communities)<br />
    68. 68. Social Context surrounding content<br />Social context in which a message appears is also an added valuable resource<br />Post 1: <br />“Hareemane House hostages said by eyewitnesses to be Jews. 7 Gunshots heard by reporters at Taj”<br />Follow up post<br />that is Nariman House, not (Hareemane)<br />
    69. 69. Understanding content … informal text<br />I say: “Your music is wicked” <br />What I really mean: “Your music is good” <br />54<br />
    70. 70. Urban Dictionary<br />Sentiment expression: Rocks <br />Transliterates to: cool, good<br />Structured text (biomedical literature)<br />Semantic Metadata: Smile is a Track<br />Lil transliterates to Lily Allen<br />Lily Allen is an Artist<br />MusicBrainz Taxonomy<br />Informal Text (Social Network chatter)<br />Artist: Lily Allen<br />Track: Smile<br /> Your smile rocks Lil<br />Multimedia Content and Web data<br />Web Services<br />
    71. 71. Example: Pulse of a Community<br />Imagine millions of such informal opinions<br />Individual expressions to mass opinions<br />“Popular artists” lists from MySpace comments<br />Lily Allen <br />Lady Sovereign <br />Amy Winehouse<br />Gorillaz<br />Coldplay<br />Placebo<br />Sting<br />Keane<br />Joss Stone <br />
    72. 72. What Drives the Spatio-Temporal-Thematic Analysis and Casual Text Understanding<br />Semantics with the help of<br />Domain Models<br />Domain Models<br />Domain Models(ontologies, folksonomies)<br />
    73. 73. Domain Knowledge: A key driver<br />Places that are nearby ‘Nariman house’<br />Spatial query<br />Messages originated around this place<br />Temporal analysis<br />Messages about related events / places<br />Thematic analysis<br />
    74. 74. Research Challenge #3: But where does the Domain Knowledge come from?<br />Expert and committee based ontology creation … works in some domains (e.g., biomedicine, health care,…)<br />Community driven knowledge extraction <br />How to create models that are “socially scalable”?<br />How to organically grow and maintain this model?<br />
    75. 75. Building models…seed word to hierarchy creation using WIKIPEDIA<br />Query: “cognition”<br />
    76. 76. Identifying relationships: Hard, harder than many hard things <br />But NOT that Hard, When WE do it<br />
    77. 77. Games with a purpose<br />Get humans to give their solitaire time <br />Solve really hard computational problems<br />Image tagging, Identifying part of an image<br /> Tag a tune, Squigl, Verbosity, and Matchin<br />Pioneered by Luis von Ahn<br />
    78. 78. OntoLablr<br />Relationship Identification Game<br /><ul><li>leads to
    79. 79. causes</li></ul>Explosion<br />Traffic congestion<br />
    80. 80. How do you get comprehensive situational awareness by merging “human sensing” and “machine sensing”?<br />64<br />
    81. 81. Research Challenge #4: Semantic Sensor Web<br />
    82. 82. Semantically Annotated O&M<br />&lt;swe:component name=&quot;time&quot;&gt;<br /> &lt;swe:Time definition=&quot;urn:ogc:def:phenomenon:time&quot; uom=&quot;urn:ogc:def:unit:date-time&quot;&gt;<br /> &lt;sa:swe rdfa:about=&quot;?time&quot; rdfa:instanceof=&quot;time:Instant&quot;&gt;<br /> &lt;sa:sml rdfa:property=&quot;xs:date-time&quot;/&gt;<br /> &lt;/sa:swe&gt;<br /> &lt;/swe:Time&gt;<br />&lt;/swe:component&gt;<br />&lt;swe:component name=&quot;measured_air_temperature&quot;&gt;<br /> &lt;swe:Quantity definition=&quot;urn:ogc:def:phenomenon:temperature&quot; uom=&quot;urn:ogc:def:unit:fahrenheit&quot;&gt;<br /> &lt;sa:swe rdfa:about=&quot;?measured_air_temperature&quot; rdfa:instanceof=&quot;senso:TemperatureObservation&quot;&gt;<br /> &lt;sa:swe rdfa:property=&quot;weather:fahrenheit&quot;/&gt;<br /> &lt;sa:swe rdfa:rel=&quot;senso:occurred_when&quot; resource=&quot;?time&quot;/&gt;<br /> &lt;sa:swe rdfa:rel=&quot;senso:observed_by&quot; resource=&quot;senso:buckeye_sensor&quot;/&gt;<br /> &lt;/sa:swe&gt;<br /> &lt;/swe:Quantity&gt;<br />&lt;/swe:component&gt;<br />&lt;swe:value name=&quot;weather-data&quot;&gt;<br /> 2008-03-08T05:00:00,29.1<br />&lt;/swe:value&gt;<br />
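The `weather-data` value on this slide is a comma-separated record whose fields appear to follow the order of the declared components (time, then measured air temperature in Fahrenheit). A minimal sketch of reading one such record is below; the field order is an assumption drawn from the slide, and the helper names `parse_weather_record` and `fahrenheit_to_celsius` are hypothetical.

```python
from datetime import datetime


def parse_weather_record(record):
    """Split 'timestamp,temperature' into a datetime and a float (Fahrenheit)."""
    timestamp_text, temperature_text = record.strip().split(",")
    return datetime.fromisoformat(timestamp_text), float(temperature_text)


def fahrenheit_to_celsius(value):
    """Convert a Fahrenheit reading to Celsius."""
    return (value - 32.0) * 5.0 / 9.0


observed_at, temperature_f = parse_weather_record(" 2008-03-08T05:00:00,29.1")
print(observed_at.isoformat(), temperature_f, round(fahrenheit_to_celsius(temperature_f), 2))
```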
    83. 83. Semantic Sensor ML – Adding Ontological Metadata<br />Domain<br />Ontology<br />Person<br />Company<br />Spatial<br />Ontology<br />Coordinates<br />Coordinate System<br />Temporal<br />Ontology<br />Time Units<br />Timezone<br />67<br />Mike Botts, &quot;SensorML and Sensor Web Enablement,&quot; <br />Earth System Science Center, UAB Huntsville<br />
    84. 84. 68<br />Semantic Query<br />Semantic Temporal Query<br />Model-references from SML to OWL-Time ontology concepts provide the ability to perform semantic temporal queries<br />Supported semantic query operators include:<br />contains: user-specified interval falls wholly within a sensor reading interval (also called inside)<br />within: sensor reading interval falls wholly within the user-specified interval (inverse of contains or inside)<br />overlaps: user-specified interval overlaps the sensor reading interval<br />Example SPARQL query defining the temporal operator ‘within’<br />
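The three temporal operators defined on this slide can be expressed as interval comparisons. The sketch below is plain Python rather than the SPARQL the slide refers to (the query itself is not in the transcript); the `Interval` type is hypothetical, and `overlaps` is taken here to mean any shared span of time, which is an assumption about the intended semantics.

```python
from datetime import datetime
from typing import NamedTuple


class Interval(NamedTuple):
    start: datetime
    end: datetime


def contains(reading, query):
    """The user-specified interval falls wholly within the sensor reading interval."""
    return reading.start <= query.start and query.end <= reading.end


def within(reading, query):
    """The sensor reading interval falls wholly within the user-specified interval."""
    return query.start <= reading.start and reading.end <= query.end


def overlaps(reading, query):
    """The two intervals share some span of time (assumed reading of 'overlaps')."""
    return reading.start < query.end and query.start < reading.end


reading = Interval(datetime(2008, 3, 8, 5, 0), datetime(2008, 3, 8, 6, 0))
narrow_query = Interval(datetime(2008, 3, 8, 5, 15), datetime(2008, 3, 8, 5, 45))
wide_query = Interval(datetime(2008, 3, 8, 4, 0), datetime(2008, 3, 8, 7, 0))
later_query = Interval(datetime(2008, 3, 8, 7, 0), datetime(2008, 3, 8, 8, 0))

print(contains(reading, narrow_query), within(reading, wide_query), overlaps(reading, later_query))
```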
    85. 85. Kno.e.sis’ Semantic Sensor Web<br />69<br />
    86. 86. Semantic Sensor Web demo (online)<br />Semantic Sensor Web demo (local)<br />70<br />
    87. 87. Synthetic but realistic scenario<br />an image taken from a raw satellite feed<br />71<br />
    88. 88. an image taken by a camera phone with an associated label, “explosion.” <br />Synthetic but realistic scenario<br />72<br />
    89. 89. Textual messages (such as tweets) using STT analysis<br />Synthetic but realistic scenario<br />73<br />
    90. 90. Correlating to get<br />Synthetic but realistic scenario<br />
    91. 91. Create better views (smart mashups)<br />
    92. 92. Extracting Social Signals<br />what are the important topics of discussions and concerns in different parts of the world on a particular day<br />how different cultures or countries are reacting to the same event or situation (eg Mumbai Attack)<br />how a situation such as financial crisis is evolving over a period of time in terms of key topics of discussion and issues of concern (eg subprime mortgages and foreclosures, followed by troubled banks and credit freeze, followed by massive government intervention and borrowing, and so on).<br />Twitris Demo<br />76<br />
    93. 93. A few more things<br />Use of background knowledge<br />Event extraction from text<br />time and location extraction <br />Such information may not be present<br />Someone from Washington DC can tweet about Mumbai<br />Scalable semantic analytics<br />Subgraph and pattern discovery<br />Meaningful subgraphs like relevant and interesting paths<br />Ranking paths <br />
    94. 94. The Sum of the Parts<br />Spatio-Temporal analysis<br />Find out where and when<br />+ Thematic <br />What and how<br />+ Semantic Extraction from text, multimedia and sensor data<br /> - tags, time, location, concepts, events<br />+ Semantic models & background knowledge<br />Making better sense of STT<br />Integration<br /> + Semantic Sensor Web<br />The platform<br /> = Situational Awareness<br />
    95. 95. KNO.E.SIS as a case study of world class research based higher education environment<br /><br />79<br />
    96. 96. Kno.e.sis Center Labs (3rd Floor, Joshi)<br />Amit Sheth<br />Semantic Science Lab<br />Semantic Web Lab<br />Service Research Lab<br />TK Prasad<br />Metadata and Languages Lab<br />Shaojun Wang<br />Statistical Machine Learning<br />Pascal Hitzler<br />Formal Semantics & Reasoning Lab<br />Michael Raymer<br />Bioinformatics Lab<br />Guozhu Dong<br />Data Mining Lab<br />Keke Chen<br />Data Intensive Analysis and Computing Lab<br />Kno.e.sis Members – a subset<br />
    97. 97. Exceptional students<br />Six of the senior PhD students: 84 papers, 43 program committees, contributed to winning NIH and NSF grants.<br />Successfully competed with two Stanford PhDs, 1000+ citations within 2 years of his graduation.<br />“BTW, Meena is an absolute find. If all of your other students are as talented, you are very lucky. … I’d definitely like to work with more interns of her caliber, ... ”[Dr. Kevin Haas, Director of Search at Yahoo!]<br />“It has been a few years since I visited Dayton (Wright AFB). However, it is clear that Wright State has transformed itself. Congratulations on your success with the Kno.e.sis Center.” [Dr. AlpersCaglayan – looking to hire Kno.e.sis grads]<br />
    98. 98. Funding, Collaboration, etc<br />UGA, Stanford, CCHMC, SAIC, HP, IBM, Yahoo!<br />NIH, NSF, AFRL-HE, AFRL-Sensor, HP, IBM, Microsoft, Google<br /> 70% Federal, 19% State, 11% Industry<br />Students intern at the best industry labs & national labs<br />Graduates very successful<br />83<br />
    99. 99. Interested in more background?<br />Semantics-Empowered Social Computing<br />Semantic Sensor Web <br />Traveling the Semantic Web through Space, Theme and Time <br />Relationship Web: Blazing Semantic Trails between Web Resources <br />Text Mining, Workflow Management, Semantic Web Services, Cloud Computing with application to healthcare, biomedicine, defense/intelligence, energy<br />Contact/more details: amit @<br />Special thanks: Karthik Gomadam, Meena Nagarajan, Christopher Thomas<br />Partial Funding: NSF (Semantic Discovery: IIS: 071441, Spatio Temporal Thematic: IIS-0842129), AFRL and DAGSI (Semantic Sensor Web), Microsoft Research and IBM Research (Analysis of Social Media Content), and HP Research (Knowledge Extraction from Community-Generated Content).<br />