Data Cloud - Yury Lifshits - Yahoo! Research


In this talk we address two questions:

1) How to use structured data in web search?
2) How to gather structured data?

For the first question we identify valuable classes of data, present query classes that can benefit from structured data, and describe an architecture that combines keyword search with structured search.

For the second question we present Data Cloud: an ecosystem of data publishers, a search engine (the data cloud) and data consumers. We show a connection from the Data Cloud strategy to a classic notion in economics: the network effect in two-sided markets. At the end of the talk an early demo implementation will be presented.


    1. 1. Data Cloud Yury Lifshits Yahoo! Research
    2. 2. My Beliefs <ul><li>The key challenge in web search is structured search </li></ul><ul><ul><ul><li>Part 1: What is structured search? </li></ul></ul></ul><ul><li>The key challenge in structured search is collecting data </li></ul><ul><ul><ul><li>Part 2: Data distribution & idea of Data Cloud </li></ul></ul></ul><ul><ul><ul><li>Part 3: Demo: numeric data distribution </li></ul></ul></ul><ul><li>The key challenge in collecting data is incentive design </li></ul><ul><ul><ul><li>Part 4: Economics of data distribution </li></ul></ul></ul>
    3. 3. <ul><li>Structured </li></ul><ul><li>Search </li></ul>
    4. 11. Data <ul><li>Structured data </li></ul><ul><li>Entity unit: </li></ul><ul><li>Identifier </li></ul><ul><li>Metadata: </li></ul><ul><ul><li>Explicit key-value pairs </li></ul></ul><ul><ul><li>Relational properties </li></ul></ul><ul><ul><li>Evaluation </li></ul></ul><ul><li>Semi-structured data </li></ul><ul><li>Content unit: </li></ul><ul><li>Body: text, video, audio, or image </li></ul><ul><li>Metadata: </li></ul><ul><ul><li>Explicit key-value pairs </li></ul></ul><ul><ul><li>Relational properties </li></ul></ul><ul><ul><li>Evaluation </li></ul></ul><ul><li>Data = data of entities + data of content </li></ul>
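The two kinds of units on this slide can be written down as a small data model. The following is a minimal sketch, not the speaker's schema: the class names, field names, and sample values are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Metadata:
    # Explicit key-value pairs, e.g. {"city": "San Francisco"}.
    key_values: Dict[str, str] = field(default_factory=dict)
    # Relational properties as (relation, target identifier) pairs.
    relations: List[Tuple[str, str]] = field(default_factory=list)
    # Evaluation signals such as ratings or votes.
    evaluation: Dict[str, float] = field(default_factory=dict)


@dataclass
class EntityUnit:
    """Structured data: an identifier plus metadata."""
    identifier: str
    metadata: Metadata = field(default_factory=Metadata)


@dataclass
class ContentUnit:
    """Semi-structured data: a body (text, or a reference to media) plus metadata."""
    body: str
    metadata: Metadata = field(default_factory=Metadata)


# Hypothetical sample: a venue entity and a review that refers to it.
venue = EntityUnit("venue:example-hall", Metadata(key_values={"city": "San Francisco"}))
review = ContentUnit(
    body="Great show last night.",
    metadata=Metadata(relations=[("about", venue.identifier)], evaluation={"rating": 4.5}),
)
```

Sharing one `Metadata` type between entities and content mirrors the slide, where both units carry the same three kinds of metadata.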
    5. 12. Structured Search <ul><li>Factoid search </li></ul><ul><li>"what's the value of property X of object Y" </li></ul><ul><li>Entity hubs </li></ul><ul><ul><li>Domain hubs </li></ul></ul><ul><li>Structured object search </li></ul><ul><ul><li>"all concerts this weekend in SF under $20 sorted by popularity" </li></ul></ul><ul><ul><li>Time focus </li></ul></ul><ul><ul><li>Ranking focus </li></ul></ul><ul><ul><li>Relations focus </li></ul></ul><ul><li>Structured content search </li></ul><ul><ul><li>"all videos with Tom Brady" </li></ul></ul><ul><ul><li>"all comments and blog posts about Bing" </li></ul></ul>
    6. 13. Yury’s Wishlist <ul><li>Business-generated data </li></ul><ul><li>Products, services, news, wishlists, contact data </li></ul><ul><li>Reality stream, sensors </li></ul><ul><li>What has happened, and where </li></ul><ul><li>Expert knowledge </li></ul><ul><li>Glossary, issues, typical solutions, object databases, related objects graph </li></ul><ul><li>Events </li></ul><ul><li>Sport, concerts, education, corporate, community, private </li></ul><ul><li>Market graph & signals </li></ul><ul><li>Like, interested, use, following, want to buy; votes and ratings </li></ul>
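The structured object search example (concerts this weekend in SF, under a price cap, sorted by popularity) amounts to a filter followed by a sort over entity records. Here is a minimal sketch; the `Concert` record, its fields, the dates, and the sample data are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Concert:
    name: str
    city: str
    day: date
    price_usd: float
    popularity: int


def search_concerts(concerts, city, days, max_price):
    """Keep concerts in `city` on one of `days` priced under `max_price`,
    most popular first."""
    matches = [
        c for c in concerts
        if c.city == city and c.day in days and c.price_usd < max_price
    ]
    return sorted(matches, key=lambda c: c.popularity, reverse=True)


# Made-up sample data; the weekend dates are arbitrary.
weekend = {date(2009, 8, 1), date(2009, 8, 2)}
CONCERTS = [
    Concert("Jazz Trio", "SF", date(2009, 8, 1), 15.0, 40),
    Concert("Indie Night", "SF", date(2009, 8, 2), 18.0, 90),
    Concert("Arena Show", "SF", date(2009, 8, 1), 75.0, 500),
    Concert("Folk Evening", "Oakland", date(2009, 8, 1), 10.0, 60),
    Concert("Late Gig", "SF", date(2009, 8, 5), 12.0, 70),
]

results = search_concerts(CONCERTS, "SF", weekend, 20.0)
names = [c.name for c in results]
```

The query touches three of the focus areas listed on the slide: time (the weekend), ranking (popularity), and an explicit key-value constraint (price).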
    7. 14. Search as a Platform App 4 Classic search App 1 App 2 App 3 Structured Data Web index Post analysis Query analysis
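The diagram on this slide is flattened in the transcript, so the wiring between its boxes is not recoverable. One possible reading, sketched below, is a pipeline in which query analysis splits a query into keywords and structured constraints, the web index and the structured data store are queried separately, and post analysis combines the results. All function names, the price-constraint syntax, and the sample data are assumptions for illustration only.

```python
import re

# Toy stand-ins for the two back ends named on the slide (contents are made up).
WEB_INDEX = {
    "doc1": "best concerts in san francisco",
    "doc2": "all videos with tom brady",
}
STRUCTURED_DATA = [
    {"kind": "concerts", "name": "Jazz Trio", "price": 15},
    {"kind": "concerts", "name": "Arena Show", "price": 75},
]


def query_analysis(query):
    """Split a query into keywords and an optional 'under N' price constraint."""
    match = re.search(r"under \$?(\d+)", query)
    max_price = float(match.group(1)) if match else None
    keywords = re.sub(r"under \$?\d+", "", query).lower().split()
    return keywords, max_price


def keyword_search(keywords):
    """Classic search: documents containing every keyword."""
    return [doc for doc, text in WEB_INDEX.items()
            if all(word in text.split() for word in keywords)]


def structured_search(keywords, max_price):
    """Structured search: only runs when a constraint was recognized."""
    if max_price is None:
        return []
    return [r["name"] for r in STRUCTURED_DATA
            if r["kind"] in keywords and r["price"] < max_price]


def post_analysis(structured, web):
    """Combine both result sets; here they are simply kept side by side."""
    return {"structured": structured, "web": web}


def search(query):
    keywords, max_price = query_analysis(query)
    return post_analysis(structured_search(keywords, max_price),
                         keyword_search(keywords))
```

In this reading, an application on top of the platform would call `search` and decide for itself how to present the structured and web results.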
    8. 15. <ul><li>Data Cloud </li></ul>How to collect all structured data in one place?
    9. 16. Data Producers <ul><li>People: forums, wiki, mail groups, blogs, social networks </li></ul><ul><li>Enterprises: product profiles, corporate news, professional content </li></ul><ul><li>Sensors: GPS modules, web cameras, traffic sensors, RFID </li></ul><ul><li>Transactional data </li></ul>
    10. 17. Data Distributors <ul><li>A data distributor is any technical solution that accumulates, organizes and provides access to structured and semi-structured data </li></ul><ul><li>Data publisher: the original distributor of some data </li></ul><ul><li>Data retailer: a consumer-facing distributor of some data </li></ul>
    11. 18. Data Consumers <ul><li>Humans </li></ul><ul><ul><li>Email </li></ul></ul><ul><ul><li>Aggregators: news, friend feeds, RSS readers </li></ul></ul><ul><ul><li>Search </li></ul></ul><ul><ul><li>Browsing / random walks </li></ul></ul><ul><li>Intelligence projects </li></ul><ul><ul><li>Recommendation systems </li></ul></ul><ul><ul><li>Trend mining </li></ul></ul>
    12. 19. Data Cloud <ul><li>Data Cloud is a centralized fully-functional data distribution service </li></ul><ul><li>Success metric for data cloud strategy = the total “value” of data on the cloud </li></ul>
    13. 20. To-Cloud Solutions <ul><li>Extraction </li></ul><ul><ul><li>“web tables” </li></ul></ul><ul><li>Semantic markup, data APIs </li></ul><ul><ul><li>Yahoo! SearchMonkey </li></ul></ul><ul><li>Feeds </li></ul><ul><ul><li>Yahoo! Shopping </li></ul></ul><ul><ul><li>Facebook Connect </li></ul></ul><ul><li>Direct publishing </li></ul>
    14. 21. On-Cloud Solutions <ul><li>Ontology maintenance </li></ul><ul><ul><li>Freebase </li></ul></ul><ul><li>Normalization, de-duplication, antispam </li></ul><ul><li>Named entity recognition, metadata inference, ranking </li></ul><ul><li>Data recycling (cross-references) </li></ul><ul><ul><li>Amazon Public Data Sets </li></ul></ul><ul><ul><li>Viral license </li></ul></ul><ul><li>Hosted search </li></ul><ul><ul><li>Yahoo! BOSS </li></ul></ul>
    15. 22. From-Cloud Solutions <ul><li>Search, audience </li></ul><ul><ul><li>Y! SearchMonkey, Google Base </li></ul></ul><ul><li>Data API, dump access, update stream </li></ul><ul><li>Custom notifications </li></ul><ul><li>Data cloud as a primary backend </li></ul><ul><li>Access control </li></ul><ul><ul><li>Ad distribution (AT&T and Yahoo! Local deal) </li></ul></ul>
    16. 23. <ul><li>Demo: </li></ul>Joint work with Paul Tarjan
    17. 25. Import <ul><li>Crawl numbers from the web </li></ul><ul><ul><ul><li>URL + XPath + regex </li></ul></ul></ul><ul><li>Create “numbr pages” </li></ul><ul><li>Update their values every hour </li></ul><ul><li>Keep the history </li></ul><ul><li>Anyone can create a numbr </li></ul>
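The "URL + XPath + regex" recipe can be sketched with the Python standard library. This is an illustration under stated assumptions, not the demo's implementation: the `Numbr` class, its fields, the sample page, and the example URL are hypothetical; fetching is left out so the sketch runs offline; and `xml.etree.ElementTree` handles only well-formed markup and a subset of XPath, so a real crawler would likely need an HTML parser.

```python
import re
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Stand-in for a fetched page (well-formed XHTML, made up for this sketch).
PAGE = """<html><body>
<div id="stats"><span>Total users: 1,234,567 today</span></div>
</body></html>"""


@dataclass
class Numbr:
    url: str    # where the number lives
    xpath: str  # which element holds it
    regex: str  # which part of the element text is the number
    history: List[Tuple[str, float]] = field(default_factory=list)


def extract_number(page: str, xpath: str, regex: str) -> Optional[float]:
    """Locate the element by XPath, then pull the number out with the regex."""
    node = ET.fromstring(page).find(xpath)
    if node is None or node.text is None:
        return None
    match = re.search(regex, node.text)
    if match is None:
        return None
    return float(match.group(1).replace(",", ""))


def update(numbr: Numbr, page: str, timestamp: str) -> Optional[float]:
    """One refresh: extract the current value and append it to the history."""
    value = extract_number(page, numbr.xpath, numbr.regex)
    if value is not None:
        numbr.history.append((timestamp, value))
    return value


users = Numbr("http://example.com/stats", ".//div[@id='stats']/span", r"(\d[\d,]*)")
latest = update(users, PAGE, "2009-08-01T10:00")
```

An hourly scheduler would fetch `users.url` and call `update` with the fetched page; the `history` list is what graphs and feeds could later be built from.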
    18. 26. Export <ul><li>Embed code </li></ul><ul><li>Graphs </li></ul><ul><li>Search & browse </li></ul><ul><li>RSS </li></ul>
    19. 27. <ul><li>Economics of Data Distribution </li></ul>Joint work with Ravi Kumar and Andrew Tomkins
    20. 28. Network Effect in Two-Sided Markets <ul><li>Two-sided market = every product serves consumers of two types, A and B </li></ul><ul><li>Cross-side network effect: the more type-A users product X has, the more attractive it is for type-B consumers, and vice versa </li></ul><ul><li>Examples: operating systems, credit cards, e-commerce marketplaces </li></ul><ul><li>Two-sided network effects: A theory of information product design </li></ul><ul><li>G. Parker, M.W. Van Alstyne, N. Bulkley, M. Van Alstyne </li></ul>
    21. 29. Basic model <ul><li>Distributors D1, … Dk </li></ul><ul><li>Producer/consumer joins only one distributor </li></ul><ul><li>Initial shares (p1,c1) … (pk,ck) </li></ul><ul><li>New consumer selects a distributor with a probability proportional to pi </li></ul><ul><li>New producer selects a distributor with probability proportional to ci </li></ul>
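The basic model can be simulated directly. The sketch below is one way to do it and makes assumptions the slide does not state: newcomers arrive in alternating order (one consumer, then one producer), initial counts are positive, and the function name and parameters are invented for illustration.

```python
import random


def simulate(producers, consumers, steps, seed=0):
    """Simulate the basic model for `steps` rounds.

    `producers[i]` and `consumers[i]` are the initial counts (p_i, c_i)
    at distributor D_i. Each round adds one consumer and one producer.
    """
    rng = random.Random(seed)
    p, c = list(producers), list(consumers)
    k = len(p)
    for _ in range(steps):
        # A new consumer picks D_i with probability proportional to p_i.
        i = rng.choices(range(k), weights=p)[0]
        c[i] += 1
        # A new producer picks D_j with probability proportional to c_j.
        j = rng.choices(range(k), weights=c)[0]
        p[j] += 1
    return p, c


final_p, final_c = simulate([3, 1], [2, 2], steps=1000)
producer_shares = [x / sum(final_p) for x in final_p]
```

Running the simulation with different seeds gives different final shares, which is a quick way to explore how much the outcome depends on early arrivals.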
    22. 30. Basic model a1 a4 a2 a3 a1 a4 a3 a2
    23. 31. Market Shares Dynamics <ul><li>Theorem 1 </li></ul><ul><ul><li>Market shares will stabilize </li></ul></ul><ul><li>Theorem 2 </li></ul><ul><ul><li>With a super-linear preference rule </li></ul></ul><ul><ul><li>the market will tip to one of the distributors </li></ul></ul><ul><li>Theorem 3 </li></ul><ul><ul><li>With a sub-linear preference rule </li></ul></ul><ul><ul><li>market shares will flatten </li></ul></ul>
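The tipping and flattening behavior can be illustrated numerically. The slides do not define the preference rule, so this sketch assumes a newcomer picks distributor i with probability proportional to the opposite side's share raised to a power `alpha` (super-linear for `alpha > 1`, sub-linear for `alpha < 1`). It also replaces the random process with its expected-value (mean-field) update, so it shows the drift of the shares rather than proving the theorems. The function name and parameters are hypothetical.

```python
def mean_field_shares(shares, alpha, steps=100_000, initial_size=10):
    """Iterate the expected share update under a power-law preference rule.

    `shares` are the initial market shares, used for both producers and
    consumers. Returns the consumer shares after `steps` newcomers per side.
    """
    x = list(shares)  # consumer shares
    y = list(shares)  # producer shares
    n = initial_size  # current population on each side
    for _ in range(steps):
        wy = [v ** alpha for v in y]  # weights seen by a new consumer
        wx = [v ** alpha for v in x]  # weights seen by a new producer
        sy, sx = sum(wy), sum(wx)
        # Expected share after one more arrival: (n * share + prob) / (n + 1).
        new_x = [xi + (w / sy - xi) / (n + 1) for xi, w in zip(x, wy)]
        new_y = [yi + (w / sx - yi) / (n + 1) for yi, w in zip(y, wx)]
        x, y, n = new_x, new_y, n + 1
    return x


tipped = mean_field_shares([0.6, 0.4], alpha=2.0)     # leader's share grows toward 1
flattened = mean_field_shares([0.6, 0.4], alpha=0.5)  # shares drift toward 1/2 each
linear = mean_field_shares([0.6, 0.4], alpha=1.0)     # expected shares stay put
```

With `alpha = 1` the expected update leaves the shares unchanged; the actual random process still fluctuates before settling, which the mean-field version does not capture.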
    24. 32. External Factor <ul><li>Preference rule with external factor: </li></ul><ul><li>ei+ci/(c1+…+ck) </li></ul><ul><li>Theorem 4 </li></ul><ul><ul><ul><li>Market shares will stabilize on </li></ul></ul></ul><ul><ul><ul><li>e1 : e2 : … : ek </li></ul></ul></ul>
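A short fixed-point calculation suggests why the limit in Theorem 4 is plausible. It is a heuristic sketch, not the proof from the talk, and it assumes that a newcomer's choice probability is proportional to the rule above, that both sides use the same external factors, and that long-run shares exist. The symbols x_i, y_i and E are introduced here and do not appear on the slides.

```latex
% Assumptions (not from the slides): x_i and y_i are the long-run consumer and
% producer shares of distributor D_i, and E = e_1 + ... + e_k.
\begin{align*}
x_i &= \frac{e_i + y_i}{E + 1}, \qquad y_i = \frac{e_i + x_i}{E + 1}
  && \text{(stationarity; the weights sum to } E + 1\text{)} \\
x_i - y_i &= -\frac{x_i - y_i}{E + 1} \;\Longrightarrow\; x_i = y_i \\
x_i (E + 1) &= e_i + x_i \;\Longrightarrow\; x_i = \frac{e_i}{E}
\end{align*}
```

The stationary shares are therefore in the ratio e1 : e2 : … : ek, independent of the initial shares.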
    25. 33. Coalition Data Cloud
    26. 34. Coalitions <ul><li>Theorem 5 </li></ul><ul><ul><li>If all market shares are below 1/sqrt(k) </li></ul></ul><ul><ul><li>coalition (sharing data) is profitable for </li></ul></ul><ul><ul><li>all distributors </li></ul></ul><ul><li>Corollary </li></ul><ul><ul><li>Coalitions are not monotone </li></ul></ul><ul><ul><li>Example: 5 : 4 : 1 : 1 </li></ul></ul>
    27. 35. Model Variations <ul><li>Same-side network effect </li></ul><ul><li>Different p-to-c and c-to-p rules </li></ul><ul><li>Multi-homing (overlapping audiences) </li></ul><ul><li>n^2 vs. n log n revenue models </li></ul><ul><li>Mature market: newcomer rate = departing rate </li></ul><ul><li>Diverse market (many types of producers and consumers) </li></ul><ul><li>Newcoming and departing distributors </li></ul><ul><li>Directed coalitions </li></ul>
    28. 36. <ul><li>Challenges </li></ul>
    29. 37. Marketing <ul><li>Data demand? </li></ul><ul><li>Data offerings? </li></ul><ul><li>Requirements for distribution technology? </li></ul>
    30. 38. Incentive design <ul><li>Incentives for data sharing? </li></ul><ul><li>Centralized or distributed? </li></ul><ul><ul><ul><ul><li>For profit or non-profit? </li></ul></ul></ul></ul><ul><li>Data licensing and ownership? </li></ul><ul><li>Monetizing data cloud? </li></ul>
    31. 39. More Challenges <ul><li>Prototyping: </li></ul><ul><li>Data marketplace: open data & data demand </li></ul><ul><li>Search plugins: related objects, glossaries, object timelines </li></ul><ul><li>Publishing tools for structured data </li></ul><ul><li>Data client: structured news, bookmarking, notifications </li></ul><ul><li>Tech design: </li></ul><ul><li>Access management </li></ul><ul><li>Namespace design </li></ul><ul><li>User interface: </li></ul><ul><li>Structured search UI </li></ul><ul><li>Discovery UI </li></ul>
    32. 40. <ul><li>Thanks! </li></ul><ul><li>Follow my research: </li></ul>