Manichean Progress: Positive and Negative States of the Art in Web-Scale Data. by Lewis Shepherd at AAAI OGK 2011

Discussion of current Microsoft Research projects and prospects which help drive open innovation and agile experimentation via cloud-based services; and projects which aim at advancing the state-of-the-art in knowledge representation and reasoning under uncertainty at web scale. I also begin by discussing potential malign implications of mass automated implementations of linked-data systems, as functions of what governments (and users of public data) can/should/shouldn’t do in promoting mass activity.

Published in: Technology, Education
Notes
  • [Dumais, UMAP 2009]
  • Scaling to Very Very Large Corpora for Natural Language Disambiguation. Statistical learning as the ultimate agile development tool.
  • Here you can see why making content types such as title and anchor text available to the research community is more useful than body text: they are more similar to users’ queries. Details can be found in the WWW paper.
  • The service is public and can be used for non-commercial purposes. It has now been extended to researchers worldwide as part of its public beta launch, which happened at WWW in Raleigh, NC. What you see here is an application developed at WWW within 8 hours of the public launch: Dr. Li Ding from Rensselaer Polytechnic Institute used the Web N-gram service on a government dataset of titles to build a multi-word tag cloud, thus providing more relevant information. As an example, compare "critical" and "habitat" as separate tokens on the left with the multi-word tag "critical-habitat" on the right.
  • At MSR Asia, the Speech Group is working to make the "speech chain" smooth and robust when a machine is involved, developing spoken language technologies that enable human-computer voice interaction and enrich human-to-human voice communication. The group's current focus includes: automatic speech recognition, to enable computers to facilitate access to data, help create content, and perform tasks; speech synthesis, to enable computers to speak with a human-sounding voice, to respond and provide information, and to read; spoken-document retrieval and processing, to enrich communication between people, such as converting voice mail into text; and signal processing, to improve the conditioning of signals and to change speech signal parameters such as pitch, speaking rate, and voice characteristics in a seamless way. Extensions of statistical learning algorithms developed for speech to other pattern recognition applications, such as handwritten math equations and East Asian character recognition, are being pursued jointly with other groups.

    1. Manichean Progress: Positive and Negative States of the Art in Web-Scale Data. Lewis Shepherd, Microsoft Institute for Advanced Technology in Government
    2. My cautionary personal note on Data: “If all others accepted the lie which the Party imposed - if all records told the same tale - then the lie passed into history and became truth. ‘Who controls the past,’ ran the Party slogan, ‘controls the future: who controls the present controls the past.’” George Orwell, Nineteen Eighty-Four
    3. Murray Feshbach, Demographer & Revolutionary Spark
        • Following many years of continuous decline, infant mortality in the Soviet Union started inexplicably to rise in the early 1970s, from 22.9 deaths per 1,000 live births in 1971 to 27.9 in 1974. The TsSU continued to print the infant mortality series for a few years after the alarming reversal of the long-term trend, but it stopped open publication of the data in 1975.
        • Christopher Davis and Murray Feshbach [Census Bureau] published a research report in 1980 depicting the deteriorating state of public health in the USSR and--with what later proved to be an accurate set of estimates for the missing years--suggesting that infant mortality in the Soviet Union was continuing to rise.
        • The Davis-Feshbach study was made available to high Soviet authorities who directed beneficial changes in public health policies.
        • [Full publication of] Infant mortality rates were not resumed until twelve years later in Narodnoye Khozyaystvo, 1987.
        • The TsSU and the Ministry of Health of the USSR probably continued to collect statistics on infant mortality... The Soviet statistical system, however, was known for its reluctance to be the bearer of bad news. In the case of infant mortality, as in many similar cases, the data on adverse developments were simply deleted from the open literature.
        • It took an alarming and well-publicized American report to alert higher authorities to the critical situation and to introduce remedies.
        Vladimir G. Treml, Center for the Study of Intelligence, “Western Analysis and the Soviet Policymaking Process”, 2007
    4. Tim O’Reilly, Government as a Platform Evangelist, on “The World’s 7 Most Powerful Data Scientists”
        • Elizabeth Warren: The banking system excesses that led to the economic crash of 2008 are an example of big data gone wrong. As the provisional head of the Consumer Finance Protection Bureau, Elizabeth Warren began the job of building the algorithmic checks and balances needed to counter the sorcerer’s apprentices of Wall Street. In her campaign for the US Senate, she promises to continue that fight.
        • …when she was working on the Consumer Finance Protection Board, she was thinking hard about what role technology could play in building a truly 21st century regulatory agency, and in my books, that will have to mean what I’ve been calling “algorithmic regulation.”
        Forbes.com / G+ / Nov. 3, 2011 (emphasis added) https://plus.google.com/u/0/107033731246200681024/posts/2NU9pZEZ5t1
    5. Tim O’Reilly, Government as a Platform Evangelist, on “The World’s 7 Most Powerful Data Scientists”
        • My feeling is that someone who is likely to have a major influence on regulating the data scientists on Wall Street is a good person to put on a list like this. Yes, I do want them regulated, and this was a way of giving Elizabeth Warren a push. I do think that if anyone will help stand up for the rest of us, she will. And I wanted a chance to plant a few ideas about how that regulation ought to happen (algorithmically, in the same way that Google manages search quality.)
        Blog Comment / Nov. 4, 2011 (emphasis added) http://ctovision.com/2011/11/the-worlds-7-most-powerful-data-scientists/#IDComment217149604
    6. Breaking down Data Barriers: Semantic Knowledge for Commodity Computing. Evelyne Viegas, Microsoft Research, USA; Li Ding, Rensselaer Polytechnic Institute; Natasa Milic-Frayling, Microsoft Research, UK; Haixun Wang, Microsoft Research, Asia; Kuansan Wang, Microsoft Research, USA
    7. Vision – Enable Next Generation Experiences by working with academia, stakeholders from industry, government, and consumers/innovators to make sense of data. DATA > INFORMATION > KNOWLEDGE > INTELLIGENCE
    8. Data/Information
        • To help explore the data value chain, Microsoft’s collaborations provide access to data that enables:
            – Innovation – By having access to real world data, researchers can unveil new analysis or research directions based on shared assets and explore new questions
            – Science – By allowing wider use of data, experiments can be repeated and data misrepresentations or faulty results avoided
            – Training – Real-world large-scale data is a powerful tool for training the next generation of data analysts and researchers
        • Cloud-based services: Web Language and Query Language Models
            – Used to research topics such as human speech, spelling, information extraction, learning, and machine translation.
    9. It’s a data-driven world
        – Spell Checking
        – Machine Translation
        – Search queries + click through
        – Online games skill matching
        – …
        Data logs behaviours in more reliable ways than demographic studies or surveys to study/predict trends
        (Banko and Brill, 2001) – effectiveness of statistical NLP techniques is highly susceptible to the data size used to develop them
        (Norvig, 2008) – it is the size of data, not the sophistication of the algorithms, that ultimately plays the central role in modern NLP
    10. Data has become a first class citizen. IT’S A DATA-DRIVEN WORLD
    11. Data for Open Innovation - Challenges. With web users becoming producers of information, leaving the footprint of their lives in digital trails, it is becoming easier for “data snoopers” to reconstruct the identity of an individual or an organization by cross-linking information from different sources
    12. A Face Is Exposed for Searcher No. 4417749. “Search query data can contain the sum total of our work, interests, associations, desires, dreams, fantasies, and even darkest fears,” said Lauren Weinstein, a privacy advocate. The New York Times, Aug 2006. Thelma Arnold’s identity was betrayed by the records of her Web searches
    13. Web N-gram Services. Access to up to petabytes of real world data. http://research.microsoft.com/web-ngram Leading technology in Search, Machine Translation, Speech, Learning, …
    14. Web N-Gram in Public Beta. Web data has structure… …and that counts (e.g. Body, Title, Anchor). Exploring Web Scale Language Models for Search Query Processing, in WWW 2010
    15. Applications Examples using Web Ngram Services
    16. Word Breaking
    17. Multi-word Tag Cloud from Government Dataset Titles. Ref: Dr. Li Ding, Rensselaer Polytechnic Institute
    18. Query Segmentation. Body: Title: Anchor:
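
Word breaking and query segmentation both come down to choosing the most probable split of a sequence under a language model. The following is a minimal sketch under stated assumptions: a toy unigram table stands in for the Web N-gram service, the counts and the unknown-word penalty are hypothetical, and no call to the real service is made.

```python
import math
from functools import lru_cache

# Hypothetical toy unigram counts; a real system would query a
# web-scale n-gram language model instead of this table.
COUNTS = {
    "critical": 50, "habitat": 40, "open": 120, "government": 80,
    "data": 200, "a": 300, "at": 150, "it": 250,
}
TOTAL = sum(COUNTS.values())
MAX_WORD_LEN = 20  # assumed upper bound on the length of one word


def log_prob(word):
    """Log-probability of a word under the toy unigram model."""
    if word in COUNTS:
        return math.log(COUNTS[word] / TOTAL)
    # Assumed smoothing: unseen strings are penalised more the longer they are.
    return math.log(1.0 / (TOTAL * 10 ** len(word)))


def word_break(text):
    """Return the most probable segmentation of text (dynamic programming)."""

    @lru_cache(maxsize=None)
    def best(start):
        # Best (score, words) for the suffix text[start:].
        if start == len(text):
            return 0.0, ()
        candidates = []
        for end in range(start + 1, min(len(text), start + MAX_WORD_LEN) + 1):
            word = text[start:end]
            rest_score, rest_words = best(end)
            candidates.append((log_prob(word) + rest_score, (word,) + rest_words))
        return max(candidates)

    return list(best(0)[1])


if __name__ == "__main__":
    print(word_break("criticalhabitat"))
    print(word_break("opengovernmentdata"))
```

Swapping the unigram table for bigram or higher-order probabilities, or for separate body, title, and anchor models, changes only `log_prob`; the search over split points stays the same.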
    19. Big Data and Machine Learning at the rescue of Machine Translation, Audio/Speech, Motion/Gestures
    20. Text: Paraphrasing in English. http://labs.microsofttranslator.com/thesaurus/
    21. Sentence: “many are dismayed by his behaviour”
    22. Audio: Search Over Audio. http://www.msravs.com/audiosearch_demo/ http://labs.microsofttranslator.com/thesaurus/
    23. Meaning of Utterances: Search Over Audio. http://www.msravs.com/audiosearch_demo/
    24. Gestures: Kinect SDK. http://research.microsoft.com/en-us/um/redmond/projects/kinectsdk
    25. It’s now a Knowledge World. From Patterns to Meanings
    26. Semantics as the study of Meaning
        • Data semantics – extract and map from structured and semi-structured sources into ontologies
        • Lexical semantics – identify/learn concepts, roles from sentences (e.g. Powerset; MindNet)
        • Statistical semantics – discover meaning from patterns of use (e.g. concept similarity)
        • Computational semantics – automate the process of constructing and reasoning with meaning representations
        • Semantic web – linked data via URI, common graph structure with RDF, inferences via ontologies and OWL
        • Formal semantics – in linguistics? in logic?
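
The "Semantic web" item (URIs as identifiers, a common graph of triples, inference via an ontology) can be illustrated in a few lines. This is a minimal sketch, not a description of any Microsoft system: the `http://example.org/` URIs are hypothetical, plain Python tuples stand in for an RDF store, and the only inference shown is the standard RDFS rule that an instance of a class is also an instance of its superclasses.

```python
# Standard vocabulary URIs from RDF and RDF Schema.
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
SUBCLASS_OF = "http://www.w3.org/2000/01/rdf-schema#subClassOf"

# Hypothetical namespace used only for this illustration.
EX = "http://example.org/"

# A graph is a set of (subject, predicate, object) triples.
triples = {
    (EX + "Picasso", RDF_TYPE, EX + "Painter"),
    (EX + "Painter", SUBCLASS_OF, EX + "Artist"),
    (EX + "Artist", SUBCLASS_OF, EX + "Person"),
}


def infer_types(graph):
    """Add (x type D) whenever (x type C) and (C subClassOf D) hold.

    The rule is applied repeatedly until no new triple appears.
    """
    inferred = set(graph)
    while True:
        new = set()
        for s, p, o in inferred:
            if p != RDF_TYPE:
                continue
            for s2, p2, o2 in inferred:
                if p2 == SUBCLASS_OF and s2 == o:
                    candidate = (s, RDF_TYPE, o2)
                    if candidate not in inferred:
                        new.add(candidate)
        if not new:
            return inferred
        inferred |= new


if __name__ == "__main__":
    for triple in sorted(infer_types(triples) - triples):
        print(triple)
```

The nested loop is quadratic in the number of triples, which is fine for an illustration; a production store would index triples by predicate and object.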
    27. Probase: A Knowledge Base for Text Understanding. http://research.microsoft.com/en-us/projects/probase/
        Comparison of WordNet, Wikipedia, Freebase, and Probase for the terms Cat, IBM, and Language:
        Cat
            • WordNet: Feline; Felid; Adult male; Man; Gossip; Gossiper; Gossipmonger; Rumormonger; Rumourmonger; Newsmonger; Woman; Adult female; Stimulant; Stimulant drug; Excitant; Tracked vehicle; ...
            • Wikipedia: Domesticated animals; Cats; Felines; Invasive animal species; Cosmopolitan species; Sequenced genomes; Animals described in 1758; ...
            • Freebase: TV episode; Creative work; Musical recording; Organism classification; Dated location; Musical release; Book; Musical album; Film character; Publication; Character species; Top level domain; Animal; Domesticated animal; ...
            • Probase: Animal; Pet; Species; Mammal; Small animal; Thing; Mammalian species; Small pet; Animal species; Carnivore; Domesticated animal; Companion animal; Exotic pet; Vertebrate; ...
        IBM
            • WordNet: N/A
            • Wikipedia: Companies listed on the New York Stock Exchange; IBM; Cloud computing providers; Companies based in Westchester County, New York; Multinational companies; Software companies of the United States; Top 100 US Federal Contractors; ...
            • Freebase: Business operation; Issuer; Literature subject; Venture investor; Competitor; Software developer; Architectural structure owner; Website owner; Programming language designer; Computer manufacturer/brand; Customer; Operating system developer; Processor manufacturer; ...
            • Probase: Company; Vendor; Client; Corporation; Organization; Manufacturer; Industry leader; Firm; Brand; Partner; Large company; Fortune 500 company; Technology company; Supplier; Software vendor; Global company; Technology company; ...
        Language
            • WordNet: Communication; Auditory communication; Word; Higher cognitive process; Faculty; Mental faculty; Module; Text; Textual matter
            • Wikipedia: Languages; Linguistics; Human communication; Human skills; Wikipedia articles with ASCII art
            • Freebase: Employer; Written work; Musical recording; Musical artist; Musical album; Literature subject; Query; Periodical; Type profile; Journal; Quotation subject; Type/domain equivalent topic; Broadcast genre; Periodical subject; Video game content descriptor; ...
            • Probase: Instance of: Cognitive function; Knowledge; Cultural factor; Cultural barrier; Cognitive process; Cognitive ability; Cultural difference; Ability; Characteristic. Attribute of: Film; Area; Book; Publication; Magazine; Country; Work; Program; Media; City; ...
    28. Probase has a big concept space
        • Probase: 2.7 M concepts, automatically harnessed from 1.68 billion pages
        • Freebase: 2 K concepts, built by community effort
        • Cyc: 120 K concepts, 25 years human labor
    29. Uncertainty: Probase vs. Freebase
        • Probase: Correctness is a probability. Live with dirty data. Dirty data is very useful.
        • Freebase: Knowledge is black and white. Clean up everything. Dirty data is unusable.
    30. What’s in your mind when you see the word ‘apple’? [Chart of concepts; vertical axis from 0 to 6000]
    31. When the machine sees ‘apple’ and ‘pear’ together
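
The contrast between seeing ‘apple’ alone and seeing ‘apple’ with ‘pear’ can be illustrated with a small probabilistic sketch. Everything here is an assumption made for illustration: the isA counts are invented, and the scoring is a plain naive Bayes estimate with add-one smoothing, which is not claimed to be the method Probase itself uses.

```python
from collections import defaultdict

# Hypothetical isA evidence: (concept, instance) -> number of supporting sentences.
ISA_COUNTS = {
    ("fruit", "apple"): 6000,
    ("fruit", "pear"): 2500,
    ("fruit", "banana"): 3000,
    ("company", "apple"): 4000,
    ("company", "microsoft"): 5000,
    ("tree", "apple"): 500,
    ("tree", "pear"): 400,
    ("tree", "oak"): 2000,
}


def conceptualize(instances, counts=ISA_COUNTS, alpha=1.0):
    """Return P(concept | instances) under a naive Bayes model.

    P(concept) is proportional to the concept's total count, and
    P(instance | concept) uses add-alpha smoothing over the instance vocabulary.
    """
    concept_totals = defaultdict(float)
    vocabulary = set()
    for (concept, instance), n in counts.items():
        concept_totals[concept] += n
        vocabulary.add(instance)
    grand_total = sum(concept_totals.values())

    scores = {}
    for concept, total in concept_totals.items():
        score = total / grand_total  # prior P(concept)
        for instance in instances:
            n = counts.get((concept, instance), 0)
            score *= (n + alpha) / (total + alpha * len(vocabulary))
        scores[concept] = score

    norm = sum(scores.values())
    return {concept: score / norm for concept, score in scores.items()}


if __name__ == "__main__":
    for query in (["apple"], ["apple", "pear"]):
        ranked = sorted(conceptualize(query).items(), key=lambda kv: -kv[1])
        print(query, [(c, round(p, 3)) for c, p in ranked])
```

With these toy counts, ‘apple’ alone leaves substantial probability on both "fruit" and "company", while adding ‘pear’ concentrates almost all of it on "fruit": knowledge is treated as a probability that context sharpens, not as a black-and-white fact.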
    32. Probase Internals
        • artist / painter: Picasso; Born: 1881; Died: 1973; …; Movement: Cubism
        • art / painting: Guernica; Year: 1937; Type: Oil on Canvas; …
    33. Probase search
    34. Interim Product: Academic Search. http://academic.research.microsoft.com/
    35. Zentity 2.0 – Research Output Platform. A semantic computing platform to store and expose relationships between digital assets. http://research.microsoft.com/zentity/
        New Features:
        • Default web UI with CSS support
        • Pivot Viewer (de facto browser) and custom ASP.Net controls
        • Open Data Protocol
        • Flexible data model enables many scenarios and can be easily extended over time
    36. Pattern Discovery and Semantic Interpretation: Graph of Co-occurring Flickr Tags
    37. Pattern Discovery and Semantic Interpretation: Graph of Co-occurring Flickr Tags
    38. Pattern Discovery and Sociological Interpretation: ‘Commenting’ Activity on Flickr. Flickr users who commented on Marc_Smith’s photos (more than 4 times)
    39. Pattern Discovery and Sociological Interpretation: ‘Commenting’ Activity on Flickr. Flickr users who commented on Marc_Smith’s photos (more than 4 times)
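
A co-occurrence graph of tags, with a minimum-count filter like the "more than 4 times" threshold used for comments, can be built from raw tag lists in a few lines. This is a minimal sketch with hypothetical sample data; it does not use NodeXL or the Flickr API, and the function name is an illustrative choice.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(tagged_items, min_weight=1):
    """Count how often each pair of tags appears on the same item.

    Returns a dict mapping (tag_a, tag_b), with tag_a < tag_b, to the
    co-occurrence count, keeping only pairs seen at least min_weight times.
    """
    counts = Counter()
    for tags in tagged_items:
        # sorted(set(...)) gives one canonical, duplicate-free ordering per item.
        for a, b in combinations(sorted(set(tags)), 2):
            counts[(a, b)] += 1
    return {edge: weight for edge, weight in counts.items() if weight >= min_weight}


# Hypothetical photo tag lists, for illustration only.
photos = [
    ["beach", "sunset", "ocean"],
    ["beach", "ocean"],
    ["sunset", "city"],
    ["beach", "sunset"],
]

if __name__ == "__main__":
    for (a, b), weight in sorted(cooccurrence_edges(photos, min_weight=2).items()):
        print(f"{a} -- {b} (weight {weight})")
```

The resulting weighted edge list is the kind of input a graph tool can lay out; the interpretation of the clusters that emerge still requires human judgment.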
    40. Semantics of Network Patterns: NodeXL. http://nodexl.codeplex.com INTRODUCTION; TECHNIQUES AND METRICS; USER RESEARCH; PRODUCT GROUP ENGAGEMENT; FURTHER WORK. Twitter NodeXL Graph “Bing” at 2:30 AM Monday, July 12, 2010
    41. From Pattern to Meaning: Email. Validation of pattern analysis requires human input. Meaning can be considered globally accepted or strictly contextual, generally understood or individually constructed.
    42. Summary
        • The challenge is not so much in the standards for representations (isn’t this just still syntax?) and pattern discovery but really in the interpretation and validation of that interpretation.
        • ‘Meaning’ has different connotations in different contexts.
        • The challenge is in determining and addressing the right level of granularity.
    43. Thank you
        • Evelyne Viegas, Microsoft Research, USA
        • Li Ding, Rensselaer Polytechnic Institute
        • Natasa Milic-Frayling, Microsoft Research, UK
        • Haixun Wang, Microsoft Research, Asia
        • Kuansan Wang, Microsoft Research, USA
        Lewis Shepherd, lewiss@microsoft.com, @lewisshepherd
