Smart datamining semtechbiz 2013 report


A conference report on SemTechBiz 2013 in San Francisco, from a data-mining and knowledge-management point of view. It covers several companies that either run automatic algorithms to extract data from cleverly chosen crowd-curated data sources, or use UI tools that offer an existing utility to lure users into helping mark up the data...

Published in: Technology, Education

  1. 1. Utilizing Crowd-sourced Data for Knowledge Extraction: A Themed Report of SemTechBiz San Francisco, 2013.06
  2. 2. Summary: People found great use of external data to help extract knowledge and build models. These valuable data are generated by crowds but harvested by mining algorithms and/or UI tools. Examples: LOD to enrich attributes and synonyms (WalmartLabs), NLP on recipes to build deep models (Whisk.com), tools to mark up content (Google).
  3. 3. SemTechBiz SF 13: SemTechBiz 2013 in San Francisco is still the largest conference in the world on semantic-web-related technologies. With many newcomers from various industries. An indicator of the technologies entering prime time. Has up to 7 parallel talks: broad coverage and interests. Now a 2nd-tier conference, in my humble opinion. Diluted to 3 times/locations per year: US West + US East + EU. Attendees: 1200 in 2011, 800 in 2012, 600 in 2013. Now missing elite researchers and/or top executives. More practical, real-world, business, startups; less academic.
  4. 4. Context and Scope: This is a themed report on building knowledge bases and/or semantic models. The theme title was decided post-conference, due to the obvious similarity among all relevant presentations.
  5. 5. @WalmartLabs: Using heterogeneous data. Connect People and Products.
  6. 6. @WalmartLabs: • Color search and presentation: WordNet! (“Red Shirt”) • Intent? Linked Data can help, on related products too. (“Green Lantern”) • DVD or Halloween costume? Time/news is thy friend. (“Dark Knight”)
  7. 7. External Data by @WalmartLabs: Vast amount of external data sets: WordNet, DBpedia, LOD cloud, Twitter stream, third-party prices (crawled), product descriptions, user click streams (web logs)…
  8. 8. appcrawlr
  9. 9. TipSense Technologies: A platform for pulling statistically significant knowledge from unstructured semantic data sets. Transforming vast amounts of unstructured and semi-structured content into a fully annotated conceptual model. Conceptual entity recognition; contextualized content fingerprinting; concepts/topic model, sentiment analysis.
  10. 10. Whisk.com Keynote: Understanding Recipes. UK startup (@nickholzherr) on collecting recipe ingredients, enriching them with semantics, recommending dishes and helping order from stores. Wrapper induction and NLP for data collection. Coping with missing info, noise and vague data. Modeling flavor profiles and portion changing. Challenges and opportunities: leftovers, geo-data, local shopping, coupons…
  11. 11. BloomReach.Search
  12. 12. Understanding Intents: Entity, Relationship Mining. Built database of millions of concepts. Shallow ontology modeling via entity and attribute extraction/mining. Rich semantics (units, colors, patterns, cities…). Concept propagation (tagging by training on user weblogs).
  13. 13. Product Annotation
  14. 14. Network of Concepts
  15. 15. Google Webmaster Tools: Markup Structured Data
  16. 16. Structured Data Markup: Not something entirely new: Rich Snippets. We experimented with it 2 years ago (extension of the Semantic Job Search proposal). Supporting more types now. An ecosystem no one can afford to lose. Google leveraged the SEO utility to gain more structured data (free labor).
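To make the kind of markup webmasters are asked to supply more concrete, here is a minimal sketch. It assumes the schema.org vocabulary serialized as JSON-LD; the slides do not say which vocabulary or serialization was shown. The function name `product_markup` and the product values are hypothetical, chosen to echo the Title, Price and Rating fields listed on slide 24.

```python
import json


def product_markup(title, price, currency, rating, review_count):
    """Build a schema.org Product description as a JSON-LD string (illustrative only)."""
    data = {
        "@context": "https://schema.org",
        "@type": "Product",
        "name": title,
        "offers": {
            "@type": "Offer",
            "price": str(price),
            "priceCurrency": currency,
        },
        "aggregateRating": {
            "@type": "AggregateRating",
            "ratingValue": rating,
            "reviewCount": review_count,
        },
    }
    return json.dumps(data, indent=2)


# Invented example values; a site would embed the resulting string in its page.
print(product_markup("Red Shirt", 19.99, "USD", 4.5, 12))
```

The point of the sketch is only that each field the webmaster fills in for SEO purposes becomes a typed, machine-readable fact for the search engine.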
  17. 17. Others: Gannett (News): Use a combination of auto-tagging and rules to match news articles with an evolving taxonomy (low-tech, but works). ISS (Intelligent Software Solutions): Complex Event Processing (in “expressive” language); fuzzy matching with patterns with Bayesian Networks. Semantic Search and Automatic Question Answering: Google now answers (factoid questions), e.g. “What did Steve Jobs die?”, “What is the height of Mt. Everest”, “Who is the CEO of Apple?”
  18. 18. Closely Related to Knowledge Acquisition: Similar Underlying Use Cases, Datasets and Technologies
  19. 19. Query Interpretation. @SemTechBiz: “Red Shirt”: Shirt (Red); Red ~= crimson, scarlet, ruby, cherry, rose, …; T-shirt a Shirt? @ProjectHalo: “Dead Duck”: Bird (dead); Dead ~= not alive, gone, expired, killed, …; Beijing Duck a Duck? Build structured queries from natural language. Disambiguation, query expansion.
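The two steps on this slide, building a structured query and expanding it with near-synonyms, can be sketched roughly as follows. This is a toy illustration, not any presenter's implementation: the `SYNONYMS` and `TYPES` tables are hypothetical stand-ins for lookups against WordNet or Linked Data, and the function names are invented.

```python
# Toy lexicon: hypothetical stand-in for WordNet / Linked Data lookups.
SYNONYMS = {
    "red": ["crimson", "scarlet", "ruby", "cherry", "rose"],
    "dead": ["not alive", "gone", "expired", "killed"],
}
# Hypothetical head nouns the parser knows, mapped to a canonical type.
TYPES = {"shirt": "Shirt", "duck": "Bird"}


def interpret(query):
    """Split a short query into a typed entity plus its modifier words."""
    entity, modifiers = None, []
    for token in query.lower().split():
        if token in TYPES:
            entity = TYPES[token]
        else:
            modifiers.append(token)
    return {"type": entity, "modifiers": modifiers}


def expand(structured):
    """Attach the known near-synonyms of each modifier (query expansion)."""
    expanded = {m: [m] + SYNONYMS.get(m, []) for m in structured["modifiers"]}
    return {"type": structured["type"], "modifiers": expanded}


print(expand(interpret("Red Shirt")))
print(expand(interpret("Dead Duck")))
```

The sketch leaves out disambiguation entirely: questions such as "T-shirt a Shirt?" or "Beijing Duck a Duck?" need a type hierarchy and context, which a flat lookup table cannot provide.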
  20. 20. Intent & Process. @SemTechBiz: “Eco-friendly gift for dad”: need products as gifts; related to “dad”, “father”; expand “eco-friendly” to closely related concepts; weigh purchases/views during special events (Christmas, Father’s Day)*. @ProjectHalo: “How do we feel the sense of heat?”: need sentences on feeling; related to “heat/hot”; expand “heat”, “sense” to related concepts; weigh on signal transmission in neurons*. The process of getting something done. * Learned from past user activities.
  21. 21. Abstract Concept, Concrete Instances. @SemTechBiz: “Eco-friendly” (gift): mine related product review sites and blogs; ~= Organic, Recycled, Solar, Reclaimed, … @ProjectHalo: “Feeling” (heat): mine related biological sites, books, tutorials; ~= Sense, Experience, Feel, Temperature Sensation, … Build abstract concept, entity, instance networks/graphs.
  22. 22. Ranking Support. @SemTechBiz2013: products related to “Gift”; recipes for “Sweet Seafood”; apps that are “Free, Pretty and Fun”. @ProjectHalo: concepts related to “Feel”; sentences on “Red Producer”; creatures that can be “both a prey and a predator”. Scoring algorithm to return the most relevant results.
  23. 23. Modeling. @SemTechBiz2013: “Flavor” model (Whisk); “Special Occasion” learning (BloomSearch); “Cooking” process (ingredients, portion, left-over, purchase…). @ProjectHalo: “Function” model in AURA; “Neural signal transmission”; “Mitosis” event (steps, components, temporal process, result…). From Facts, Relations to Causal and Deep Models.
  24. 24. Crowd-sourcing. @SemTechBiz2013: use webmasters to generate structured markup (Author, Category, Title, Price, Rating, …). @ProjectHalo: use students to generate metadata for sentences, questions and answers (Relevance, UT, Type, Chapter, Exact/Various, …). Crowd-sourcing works if it has a limited quantity and can be done cheaply. Google provides other utility (incentives for SEO) to lure webmasters. Project Halo needs to figure out our game plan.
  25. 25. Summary of Use of (Big, Wild) Data. @SemTech: parse vague user queries into the best structured queries for databases; understand the user’s underlying intent; link concept entities to concrete entities; rank apps, products…; deep, contextual models (flavor, time and location…); use crowds directly for free. @ProjectHalo: translate Find-A-Value and other simple questions into complex IR queries; understand a sentence’s purpose; relate category/class to instances; rank answers, evidence…; deep contextual models (location, process, events…); need to leverage crowds cheaply.
  26. 26. Many Different Data Sources and Techniques
  27. 27. One Thing in Common
  28. 28. What Can We Learn?