
Manichean Progress: Positive and Negative States of the Art in Web-Scale Data. by Lewis Shepherd at AAAI OGK 2011

  1. Manichean Progress: Positive and Negative States of the Art in Web-Scale Data Lewis Shepherd Microsoft Institute for Advanced Technology in Government
  2. My cautionary personal note on Data “If all others accepted the lie which the Party imposed - if all records told the same tale - then the lie passed into history and became truth. 'Who controls the past' ran the Party slogan, 'controls the future: who controls the present controls the past.’” George Orwell, Nineteen Eighty-Four
  3. Murray Feshbach, Demographer & Revolutionary Spark • Following many years of continuous decline, infant mortality in the Soviet Union started inexplicably to rise in the early 1970s from 22.9 deaths per 1,000 live births in 1971 to 27.9 in 1974. The TsSU continued to print the infant mortality series for a few years after the alarming reversal of the long-term trend, but it stopped open publication of the data in 1975. • Christopher Davis and Murray Feshbach [Census Bureau] published a research report in 1980 depicting the deteriorating state of public health in the USSR and--with what later proved to be an accurate set of estimates for the missing years--suggesting that infant mortality in the Soviet Union was continuing to rise. • The Davis-Feshbach study was made available to high Soviet authorities who directed beneficial changes in public health policies. • [Full publication of] infant mortality rates were not resumed until twelve years later in Narodnoye Khozyaystvo, 1987 • The TsSU and the Ministry of Health of the USSR probably continued to collect statistics on infant mortality... The Soviet statistical system, however, was known for its reluctance to be the bearer of bad news. In the case of infant mortality, as in many similar cases, the data on adverse developments were simply deleted from the open literature. • It took an alarming and well-publicized American report to alert higher authorities to the critical situation and to introduce remedies. Vladimir G. Treml, Center for the Study of Intelligence, “Western Analysis and the Soviet Policymaking Process”, 2007
  4. Tim O’Reilly Government as a Platform Evangelist on “The World’s 7 Most Powerful Data Scientists” • Elizabeth Warren: The banking system excesses that led to the economic crash of 2008 are an example of big data gone wrong. As the provisional head of the Consumer Finance Protection Bureau, Elizabeth Warren began the job of building the algorithmic checks and balances needed to counter the sorcerer’s apprentices of Wall Street. In her campaign for the US Senate, she promises to continue that fight. • …when she was working on the Consumer Finance Protection Board, she was thinking hard about what role technology could play in building a truly 21st century regulatory agency, and in my books, that will have to mean what I've been calling "algorithmic regulation." Forbes.com / G+ / Nov. 3, 2011 (emphasis added) https://plus.google.com/u/0/107033731246200681024/posts/2NU9pZEZ5t1
  5. Tim O’Reilly Government as a Platform Evangelist on “The World’s 7 Most Powerful Data Scientists” • My feeling is that someone who is likely to have a major influence on regulating the data scientists on Wall Street is a good person to put on a list like this. Yes, I do want them regulated, and this was a way of giving Elizabeth Warren a push. I do think that if anyone will help stand up for the rest of us, she will. And I wanted a chance to plant a few ideas about how that regulation ought to happen (algorithmically, in the same way that Google manages search quality.) Blog Comment / Nov. 4, 2011 (emphasis added) http://ctovision.com/2011/11/the-worlds-7-most-powerful-data-scientists/#IDComment217149604
  6. Breaking down Data Barriers Semantic Knowledge for Commodity Computing Evelyne Viegas, Microsoft Research, USA Li Ding, Rensselaer Polytechnic Institute Natasa Milic-Frayling, Microsoft Research, UK Haixun Wang, Microsoft Research, Asia Kuansan Wang, Microsoft Research, USA
  7. Vision – Enable Next Generation Experiences by working with academia, stakeholders from industry, government, and consumers/innovators to make sense of data DATA > INFORMATION > KNOWLEDGE > INTELLIGENCE
  8. Data/Information • To help explore the data value chain, Microsoft’s collaborations provide access to data that enables: – Innovation – With access to real-world data, researchers can uncover new analyses or research directions based on shared assets and explore new questions – Science – Wider use of data lets experiments be repeated, so that data misrepresentations and faulty results can be avoided – Training – Real-world large-scale data is a powerful tool for training the next generation of data analysts and researchers • Cloud-based services: Web Language and Query Language Models – Used to research topics such as human speech, spelling, information extraction, learning, and machine translation.
  9. It’s a data-driven world – Spell Checking – Machine Translation – Search queries + click through – Online games skill matching – … Data logs capture behaviours more reliably than demographic studies or surveys for studying and predicting trends • (Banko and Brill, 2001): the effectiveness of statistical NLP techniques is highly susceptible to the size of the data used to develop them • (Norvig, 2008): it is the size of the data, not the sophistication of the algorithms, that ultimately plays the central role in modern NLP
  10. Data has become a first class citizen IT’S A DATA-DRIVEN WORLD
  11. Data for Open Innovation - Challenges With web users becoming producers of information, leaving the footprint of their lives in digital trails, it is becoming easier for “data snoopers” to reconstruct the identity of an individual or an organization by cross linking information from different sources
  12. A Face Is Exposed for Searcher No. 4417749 “Search query data can contain the sum total of our work, interests, associations, desires, dreams, fantasies, and even darkest fears,” said Lauren Weinstein, a privacy advocate. The New York Times, Aug 2006 Thelma Arnold's identity was betrayed by the records of her Web searches
  13. Web N-gram Services Access to up to petabytes of real world data http://research.microsoft.com/web-ngram Leading technology in Search, Machine Translation, Speech, Learning, …
  14. Web N-Gram in Public Beta Web data has structure… …and that counts (e.g. Body, Title, Anchor) Exploring Web Scale Language models for Search Query Processing, in WWW’2010
  15. Applications Examples using Web Ngram Services
  16. Word Breaking
  17. Multi-word Tag Cloud from Government Dataset Titles Ref: Dr. Li Ding, Rensselaer Polytechnic Institute
  18. Query Segmentation Body: Title: Anchor:
  19. Big Data and Machine Learning to the rescue of Machine Translation, Audio/Speech, Motion/Gestures
  20. Text: Paraphrasing in English http://labs.microsofttranslator.com/thesaurus/
  21. Sentence: “many are dismayed by his behaviour”
  22. Audio: Search Over Audio http://www.msravs.com/audiosearch_demo/ http://labs.microsofttranslator.com/thesaurus/
  23. Meaning of Utterances: Search Over Audio http://www.msravs.com/audiosearch_demo/
  24. Gestures: Kinect SDK http://research.microsoft.com/en-us/um/redmond/projects/kinectsdk
  25. It’s now a Knowledge World From Patterns to Meanings
  26. Semantics as the study of Meaning • Data semantics – extract and map from structured and semi-structured sources into ontologies • Lexical semantics – identify/learn concepts, roles from sentences (e.g. Powerset; MindNet) • Statistical semantics – discover meaning from patterns of use (e.g. concept similarity) • Computational semantics – automate the process of constructing and reasoning with meaning representations • Semantic web – linked data via URI, common graph structure with RDF, inferences via ontologies and OWL • Formal semantics – in linguistics? in logic?
  27. Probase: A Knowledge Base for Text Understanding http://research.microsoft.com/en-us/projects/probase/
      Comparison of entries in WordNet, Wikipedia, Freebase and Probase:
      • Cat
        – WordNet: Feline; Felid; Adult male; Man; Gossip; Gossiper; Gossipmonger; Rumormonger; Rumourmonger; Newsmonger; Woman; Adult female; Stimulant; Stimulant drug; Excitant; Tracked vehicle; ...
        – Wikipedia: Domesticated animals; Cats; Felines; Invasive animal species; Cosmopolitan species; Sequenced genomes; Animals described in 1758; ...
        – Freebase: TV episode; Creative work; Musical recording; Organism classification; Dated location; Musical release; Book; Musical album; Film character; Publication; Character species; Top level domain; Animal; Domesticated animal; ...
        – Probase: Animal; Pet; Species; Mammal; Small animal; Thing; Mammalian species; Small pet; Animal species; Carnivore; Domesticated animal; Companion animal; Exotic pet; Vertebrate; ...
      • IBM
        – WordNet: N/A
        – Wikipedia: Companies listed on the New York Stock Exchange; IBM; Cloud computing providers; Companies based in Westchester County, New York; Multinational companies; Software companies of the United States; Top 100 US Federal Contractors; ...
        – Freebase: Business operation; Issuer; Literature subject; Venture investor; Competitor; Software developer; Architectural structure owner; Website owner; Programming language designer; Computer manufacturer/brand; Customer; Operating system developer; Processor manufacturer; ...
        – Probase: Company; Vendor; Client; Corporation; Organization; Manufacturer; Industry leader; Firm; Brand; Partner; Large company; Fortune 500 company; Technology company; Supplier; Software vendor; Global company; Technology company; ...
      • Language
        – WordNet: Communication; Auditory communication; Word; Higher cognitive process; Faculty; Mental faculty; Module; Text; Textual matter
        – Wikipedia: Languages; Linguistics; Human communication; Human skills; Wikipedia articles with ASCII art
        – Freebase: Employer; Written work; Musical recording; Musical artist; Musical album; Literature subject; Query; Periodical; Type profile; Journal; Quotation subject; Type/domain equivalent topic; Broadcast genre; Periodical subject; Video game content descriptor; ...
        – Probase: Instance of: Cognitive function; Knowledge; Cultural factor; Cultural barrier; Cognitive process; Cognitive ability; Cultural difference; Ability; Characteristic; Attribute of: Film; Area; Book; Publication; Magazine; Country; Work; Program; Media; City; ...
  28. Probase has a big concept space • Probase: 2.7 M concepts, automatically harnessed from 1.68 billion pages • Freebase: 2 K concepts, built by community effort • Cyc: 120 K concepts, 25 years human labor
  29. Uncertainty: Probase vs. Freebase • Probase: Correctness is a probability. Live with dirty data. Dirty data is very useful. • Freebase: Knowledge is black and white. Clean up everything. Dirty data is unusable.
  30. What’s in your mind when you see the word ‘apple’ [Chart; axis: concepts, 0 to 6000]
  31. When the machine sees ‘apple’ and ‘pear’ together
  32. Probase Internals • artist / painter: Picasso – Born: 1881; Died: 1973; …; Movement: Cubism • art / painting: Guernica – Year: 1937; Type: Oil on Canvas; …
  33. Probase search
  34. Interim Product: Academic Search http://academic.research.microsoft.com/
  35. Zentity 2.0 – Research Output Platform A semantic computing platform to store and expose relationships between digital assets http://research.microsoft.com/zentity/ New Features: • Default web UI with CSS support • Pivot Viewer (de facto browser) and custom ASP.NET controls • Open Data Protocol • Flexible data model enables many scenarios and can be easily extended over time
  36. Pattern Discovery and Semantic Interpretation: Graph of Co-occurring Flickr Tags
  37. Pattern Discovery and Semantic Interpretation: Graph of Co-occurring Flickr Tags
  38. Pattern Discovery and Sociological Interpretation: ‘Commenting’ Activity on Flickr Flickr users who commented on Marc_Smith’s photos (more than 4 times)
  39. Pattern Discovery and Sociological Interpretation: ‘Commenting’ Activity on Flickr Flickr users who commented on Marc_Smith’s photos (more than 4 times)
  40. Semantics of Network Patterns: NodeXL http://nodexl.codeplex.com INTRODUCTION TECHNIQUES AND METRICS USER RESEARCH PRODUCT GROUP ENGAGEMENT FURTHER WORK TWITTER NodeXL Graph “Bing” at 2:30 AM Monday, July 12, 2010
  41. From Pattern to Meaning: Email • Validation of pattern analysis requires human input. • Meaning can be considered globally accepted or strictly contextual, generally understood or individually constructed.
  42. Summary • The challenge is not so much in the standards for representations (isn’t this just still syntax?) and pattern discovery but really in the interpretation and validation of that interpretation. • ‘Meaning’ has different connotations in different contexts. • The challenge is in determining and addressing the right level of granularity.
  43. Thank you • Evelyne Viegas, Microsoft Research, USA • Li Ding, Rensselaer Polytechnic Institute • Natasa Milic-Frayling, Microsoft Research, UK • Haixun Wang, Microsoft Research, Asia • Kuansan Wang, Microsoft Research, USA Lewis Shepherd lewiss@microsoft.com @lewisshepherd
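The word-breaking application on slide 16 can be illustrated with a small sketch. The slides do not show how the Web N-gram service is called or how its word breaker works, so the following is only an assumption-laden toy: it uses a tiny unigram table whose counts are invented for illustration (the names `COUNTS`, `log_prob` and `word_break` are hypothetical) and a standard dynamic-programming search for the most probable split. A real system would look probabilities up in a web-scale language model instead.

```python
import math

# Hypothetical unigram counts, invented for illustration only.
# A real system would query a web-scale language model instead.
COUNTS = {
    "critical": 50, "habitat": 40, "data": 200, "driven": 30,
    "world": 120, "a": 500, "it": 400, "is": 450, "word": 90,
    "breaking": 25, "break": 35, "in": 480, "g": 5,
}
TOTAL = sum(COUNTS.values())
MAX_WORD_LEN = max(len(w) for w in COUNTS)


def log_prob(word):
    """Log-probability of a word; unseen strings get a length-scaled penalty."""
    if word in COUNTS:
        return math.log(COUNTS[word] / TOTAL)
    return math.log(1.0 / (TOTAL * 10 ** len(word)))


def word_break(text):
    """Split text into the word sequence with the highest total log-probability."""
    n = len(text)
    # best[i] holds (score of the best split of text[:i], start of its last word)
    best = [(0.0, 0)] + [(-math.inf, 0)] * n
    for end in range(1, n + 1):
        for start in range(max(0, end - MAX_WORD_LEN), end):
            score = best[start][0] + log_prob(text[start:end])
            if score > best[end][0]:
                best[end] = (score, start)
    # Walk the back-pointers from the end of the string to recover the words.
    words = []
    end = n
    while end > 0:
        start = best[end][1]
        words.append(text[start:end])
        end = start
    return words[::-1]


print(word_break("criticalhabitat"))
print(word_break("wordbreaking"))
```

With this toy table, "wordbreaking" could also be split as "word break in g", but the single word "breaking" scores higher than the three shorter pieces combined. That is the point made on slide 9: the quality of the split depends on the counts behind the model more than on the search procedure.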

Editor's Notes

  1. [Dumais, UMAP 2009]
  2. Scaling to Very Very Large Corpora for Natural Language Disambiguation. Statistical learning as the ultimate agile development tool.
  3. Here you can see why making content types such as title and anchor text available to the research community is better than providing body text alone: they are more similar to users’ queries. Details can be seen in the WWW paper.
  4. The service is public and can be used for non-commercial purposes. It has now been extended to researchers worldwide as part of its public beta launch, which happened at WWW in Raleigh, NC. What you see here is an application developed at WWW within 8 hours of the public launch: Dr. Li Ding from Rensselaer Polytechnic Institute used the web n-gram service on a government dataset of titles to build a multi-word tag cloud, thus providing more relevant information. As an example, compare "critical" and "habitat" as separate tokens on the left with the multi-word tag "critical-habitat" on the right.
  5. At MSR Asia, the Speech Group is working to make the "speech chain" smooth and robust when there is a machine involved, working to develop spoken language technologies that enable human-computer voice interaction and enrich human-to-human voice communications. The group's current focus includes automatic speech recognition to enable computers to facilitate access to data, help create content, and perform tasks; speech synthesis to enable computers to speak with a human-sounding voice, to respond and provide information, and to read; spoken-document retrieval and processing to enrich communication between people, such as converting voice-mail into text; and signal processing to improve the conditioning of signals and to change speech signal parameters like pitch, speaking rates, and voice characteristics in a seamless way. Extensions of statistical learning algorithms developed for speech to other pattern recognition applications, such as hand-written math equations and East-Asian character recognition, are being pursued jointly with other groups.
  6. At MSR Asia, the Speech Group is working to make the "speech chain" smooth and robust when there is a machine involved, working to develop spoken language technologies that enable human-computer voice interaction and enrich human-to-human voice communications. The group's current focus includes automatic speech recognition to enable computers to facilitate access to data, help create content, and perform tasks; speech synthesis to enable computers to speak with a human-sounding voice, to respond and provide information, and to read; spoken-document retrieval and processing to enrich communication between people, such as converting voice-mail into text; and signal processing to improve the conditioning of signals and to change speech signal parameters like pitch, speaking rates, and voice characteristics in a seamless way. Extensions of statistical learning algorithms developed for speech to other pattern recognition applications, such as hand-written math equations and East-Asian character recognition, are being pursued jointly with other groups.
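
The multi-word tag cloud described in note 4 can be approximated in a few lines. The notes do not say how the tags were actually computed from the web n-gram service, so this sketch swaps in a simpler technique: it counts word pairs within the titles themselves and merges a pair into one tag when the two words occur together often enough. The titles, the function name `multiword_tags`, and the thresholds `min_count` and `min_ratio` are all hypothetical choices made for illustration.

```python
from collections import Counter


def multiword_tags(titles, min_count=2, min_ratio=0.6):
    """Count tags in titles, merging adjacent words that mostly occur together."""
    tokenized = [title.lower().split() for title in titles]
    unigrams = Counter(w for toks in tokenized for w in toks)
    bigrams = Counter(pair for toks in tokenized for pair in zip(toks, toks[1:]))

    def is_phrase(a, b):
        # A pair is a phrase if it is frequent and accounts for most
        # occurrences of both of its words (thresholds are assumptions).
        c = bigrams[(a, b)]
        return (c >= min_count
                and c / unigrams[a] >= min_ratio
                and c / unigrams[b] >= min_ratio)

    tags = Counter()
    for toks in tokenized:
        i = 0
        while i < len(toks):
            if i + 1 < len(toks) and is_phrase(toks[i], toks[i + 1]):
                tags[toks[i] + "-" + toks[i + 1]] += 1  # merged multi-word tag
                i += 2
            else:
                tags[toks[i]] += 1  # ordinary single-word tag
                i += 1
    return tags


# Hypothetical dataset titles, invented for this example.
titles = [
    "Critical Habitat Designations",
    "Critical Habitat Maps",
    "Habitat Survey Data",
    "Critical Habitat Survey",
]
tags = multiword_tags(titles)
print(tags.most_common())
```

On these toy titles, "critical" and "habitat" are merged into the single tag "critical-habitat", while "habitat survey" stays as two tags because "habitat" appears with other neighbours too often. Counts drawn from a web-scale n-gram model rather than from the dataset itself would presumably make this decision more robust on small collections.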