Scale, Structure, and Semantics



Keynote at 2012 Semantic Technology and Business Conference

Daniel Tunkelang, LinkedIn

Science fiction has a mixed track record when it comes to anticipating technological innovations. While Jules Verne fared well with his predictions of submarine and space technology, artificial intelligence hasn't produced anything like Arthur C. Clarke's HAL 9000.

Instead, we've managed to elicit intelligence from machines through unexpected means. Search engines have achieved remarkable success in organizing the world's information by crawling the web, indexing documents, and exploiting link structure to establish authoritativeness. At LinkedIn, we apply large-scale analytics to terabytes of semistructured data to deliver products and insights that serve our 150M+ members. Semantics emerge when we apply the right analytical techniques to a sufficient quality and quantity of data.

In this talk, I will describe how LinkedIn's huge, rich graph of relationship data powers the products our users love. I believe that the lessons we have learned apply broadly to other semantic applications. Quantity and quality of data are the central challenges in delivering a semantically rich experience, and the solution is to create the right ecosystem: one that incents people to give you good data, which then forms the basis for great data products.

  • Two icons of artificial intelligence from science fiction: the HAL 9000 computer from 2001: A Space Odyssey and the android Data from Star Trek: The Next Generation. Both exceed human beings in their ability to assimilate knowledge and to reason using that knowledge. Both interact with human beings in natural language. Despite all of our technological advances, the closest we have come to this vision is talking to Siri. An improvement on the 1960s ELIZA program for sure, but still a baby step.
  • In 1945, Vannevar Bush put forth his vision of a memex (a portmanteau of "memory" and "index") as a device in which individuals would compress and store all of their books, records, and communications, "mechanized so that it may be consulted with exceeding speed and flexibility". The memex would provide an "enlarged intimate supplement to one's memory". The concept of the memex influenced the development of hypertext systems, eventually leading to the creation of the World Wide Web and personal knowledge base software.
  • A pure embodiment of the AI vision: Cyc was started in 1984 as an artificial intelligence project that attempts to assemble a comprehensive ontology and knowledge base of everyday common sense knowledge, with the goal of enabling AI applications to perform human-like reasoning. Typical pieces of knowledge represented in the database are "Every tree is a plant" and "Plants die eventually". When asked whether trees die, the inference engine can draw the obvious conclusion and answer the question correctly. The knowledge base contains over one million human-defined assertions, rules, or common sense ideas. These are formulated in a language based on predicate calculus.
  • Freebase is a large collaborative knowledge base consisting of metadata composed mainly by its community members. It is an online collection of structured data harvested from many sources, including individual 'wiki' contributions. Freebase aims to create a global resource which allows people (and machines) to access common information more effectively. Freebase is a wonderful resource, and search engines are starting to use it as a structured data source. But using Freebase for structured queries is a lot trickier than using Google for free-text queries, largely because Freebase is incomplete in unpredictable ways. In particular, Freebase has difficulty representing null, unknown, or N/A values. For example, in the results for "fires of unknown cause", there is no way to tell whether the cause of a fire is really unknown or the data is simply missing.
  • Wolfram Alpha is an answer engine developed by Wolfram Research. It is an online service that answers factual queries directly by computing the answer from structured data. Wolfram Alpha is impressive. It's no wonder that Wolfram Alpha serves as the back end for many Siri queries. Unfortunately, its natural-language interface is brittle. As we can see from these two queries, it can roughly report the number of software engineers in the San Francisco Bay Area, but not the number of software companies. Nobody is perfect. But what is disconcerting is that the system does nothing to suggest that the latter answer is less reliable than the former. Does the system know how to answer the second question? There is no way for the user to be sure, other than perhaps by trial and error eventually leading to resolution or frustrated resignation. This is a communication problem.
  • Deep Blue was a chess-playing computer developed by IBM. In 1997, the machine defeated world champion Garry Kasparov in a match. What was its secret sauce? Could it think? Did it learn to play chess and represent that wisdom in a knowledge base? Not really – to borrow a line from Toy Story, it won by using brute force with style. It was a massively parallel system (by 1997 standards) made with special-purpose chips.
  • A decade later, IBM did it again. IBM researchers decided to build a system to beat humans at a more modern game than chess – namely the Jeopardy! television quiz show featuring trivia in history, pop culture, sports, etc. Moreover, many Jeopardy questions (or "answers", since the gimmick of the game is that the question-answer process is inverted) involve word play, which would seem particularly challenging for a machine. Like Deep Blue, Watson is all about computation. Its knowledge base is mined from 200 million pages of structured and unstructured content, consuming four terabytes of disk storage and including the full text of Wikipedia. It uses a server cluster with 720 cores and relies on parallel processing to parse questions and search its knowledge base for candidate responses. In February 2011, Watson defeated former Jeopardy! champions Ken Jennings and Brad Rutter in a televised match.
  • Watson's achievement was impressive. But let's put things in perspective. Even plain old search engines do pretty well at Jeopardy. The comparison isn't entirely fair: in judging the search engines, we only require that they return pages on which the answer should appear, not that they give the actual answer. One can try various simple strategies for going further, like taking the answer from the title of the first hit – which with the top search engines actually does succeed about 20% of the time. Still, the point should be clear. None of these strategies use sophisticated semantic representations. Computation – brute force with style – is the big winner.
  • In 2009, Google researchers Alon Halevy, Peter Norvig, and Fernando Pereira wrote a popular article entitled "The Unreasonable Effectiveness of Data". It has often been paraphrased as "better data beats clever algorithms". But for our purposes, we can interpret it as celebrating the triumph of computation over knowledge representation as a means to produce semantic or intelligent behavior.
  • Let's take stock of what kind of data we have. Most of our data is semi-structured data -- the broad space that lives in between structured data (the rigid schemas we associate with relational database systems) and unstructured data (e.g., the free text indexed by search engines). The structure in semi-structured data takes the form of tags and structural elements without a rigid schema (e.g., XML).
  • LinkedIn has one of the largest and richest collections of semi-structured data on the consumer internet. Here you can see how our people data combines free text, a connection network, and a collection of structured tags. And these aren’t the only entities – we have companies, jobs, etc.
  • Here I’m searching for people I know in the Bay Area who have “data” anywhere in their profiles and currently work at Google, Yahoo!, or Twitter. Maybe I should look at my Facebook connections too. Did I mention that I’m hiring? The power of such a search is incredible, and the experience is highly intuitive even for a user who has no idea that either the data or the search query is “semi-structured”. The interaction revolves around facets that are well represented in both the data and the user’s mental model.
  • True story, redacted only to protect my friend's privacy.
  • Of course I know that the first place to research companies is LinkedIn. So I started with a generic company search for "mobile". The results are reasonable, given the query's lack of specificity. But clearly I needed to be more specific.
  • Here is my revised query: small mobile companies headquartered in the Bay Area in software-related industries. This may not have been exactly what my friend was looking for, but it was a great starting point. Specifically, the system helped me map his information need to a query that captured its spirit.
  • Computation is powerful – especially at our scale of data and users. Applying machine learning allows us to produce recommendations for job matching, content, community, etc. And of course it drives the feature LinkedIn is most famous for: People You May Know.
  • One of the steps in processing search queries is to parse them and establish query interpretations – in this case, that “linkedin” refers to a company and “ceo” refers to a job title. We do so using a hidden Markov model (HMM) trained on our corpus statistics and search logs. This allows us to handle word-sense ambiguity, e.g., “dell” as a first name, last name, or company name.
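The HMM tagging step described in this note can be sketched with a toy Viterbi decoder. This is an illustrative sketch, not LinkedIn's production tagger: the states and all of the transition and emission probabilities below are made-up stand-ins for the statistics that would be trained from corpus data and search logs.

```python
# Toy HMM query tagger: states are entity types; probabilities are
# hypothetical values standing in for corpus- and log-derived statistics.

STATES = ["COMPANY", "TITLE", "NAME"]

# P(state) for the first query token (hypothetical)
start = {"COMPANY": 0.4, "TITLE": 0.4, "NAME": 0.2}

# P(next_state | state) (hypothetical)
trans = {
    "COMPANY": {"COMPANY": 0.2, "TITLE": 0.6, "NAME": 0.2},
    "TITLE":   {"COMPANY": 0.5, "TITLE": 0.3, "NAME": 0.2},
    "NAME":    {"COMPANY": 0.3, "TITLE": 0.3, "NAME": 0.4},
}

# P(token | state); note "dell" is ambiguous between company and name
emit = {
    "COMPANY": {"linkedin": 0.5, "dell": 0.3, "ceo": 0.01},
    "TITLE":   {"linkedin": 0.01, "dell": 0.01, "ceo": 0.7},
    "NAME":    {"linkedin": 0.01, "dell": 0.4, "ceo": 0.01},
}

def viterbi(tokens):
    """Return the most likely tag sequence for the query tokens."""
    # best[s] = (prob of best path ending in state s, that path)
    best = {s: (start[s] * emit[s].get(tokens[0], 1e-6), [s]) for s in STATES}
    for tok in tokens[1:]:
        new_best = {}
        for s in STATES:
            # pick the best previous state to transition from
            p, prev = max((best[r][0] * trans[r][s], r) for r in STATES)
            new_best[s] = (p * emit[s].get(tok, 1e-6), best[prev][1] + [s])
        best = new_best
    return max(best.values())[1]

print(viterbi(["linkedin", "ceo"]))  # ['COMPANY', 'TITLE']
```

With real corpus statistics, the same machinery resolves word-sense ambiguity in context: "dell" followed by a last name decodes as a first name, while "dell" followed by a title decodes as a company.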
  • In order to evaluate a job-candidate pair, we first use common-sense filtering to determine if the candidate is even plausible, e.g., we don't need fancy algorithms to determine that a sales executive in Turkey isn't a good match for a software engineering job in Mountain View. After this filtering, we take the two bags of features and create a single set of features for the pair to represent the matching. The matching features can be binary (e.g., is the candidate in the same industry as the job?), softer (e.g., based on the transition probability between the candidate's current job and the potential new one), and textual (we can use standard information retrieval methods to compare documents). Combining all of these using weights learned through regression, we can assign scores to matches. Note again that scale matters -- our corpus statistics are essential to computing the above features without falling victim to sparsity.
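The pairwise-matching idea in this note can be sketched as follows. This is a minimal, hypothetical sketch: the feature names, the transition-probability table, and the hand-set weights are illustrative stand-ins; in practice the weights are learned by regression over labeled match data.

```python
# Sketch: turn a (candidate, job) pair into matching features, then
# combine them with (here, hand-set) weights via a logistic score.
import math

# Hypothetical corpus statistic: P(moving from title A to title B)
TRANSITION_PROB = {("software engineer", "senior software engineer"): 0.3}

# Hypothetical weights (learned by regression in practice)
WEIGHTS = {"same_industry": 1.5, "title_transition": 4.0, "text_overlap": 2.0}

def jaccard(a, b):
    """Simple textual-similarity feature between two term sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def match_features(candidate, job):
    """One feature vector representing the candidate-job pairing."""
    return {
        # binary feature
        "same_industry": 1.0 if candidate["industry"] == job["industry"] else 0.0,
        # softer, corpus-derived feature
        "title_transition": TRANSITION_PROB.get(
            (candidate["title"], job["title"]), 0.0),
        # textual feature
        "text_overlap": jaccard(candidate["summary_terms"], job["desc_terms"]),
    }

def score(features, weights, bias=-2.0):
    """Combine features with weights into a match score in (0, 1)."""
    z = bias + sum(weights[k] * v for k, v in features.items())
    return 1.0 / (1.0 + math.exp(-z))

candidate = {"industry": "internet", "title": "software engineer",
             "summary_terms": ["java", "search", "hadoop"]}
job = {"industry": "internet", "title": "senior software engineer",
       "desc_terms": ["search", "hadoop", "scala"]}

print(score(match_features(candidate, job), WEIGHTS))
```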
  • If the value of your network reflects the saying that "you are who you know", Skills offers the complementary "you are what you know". Skills are diverse -- ranging from Ballet to Hadoop. In order to identify the set of skills, we turn again to the unreasonable effectiveness of data. Many of our 160M+ users have a Specialties section where they list their skills as free text. By mining these sections and other profile elements, we generated a set of potential skills for our entire corpus. Bootstrapping on that list, we implemented a suggested skills feature that is leading to increasing adoption of our controlled vocabulary.
  • Skills is still in beta. But here you see how related skills – which are derived by mining our corpus – can increase recall on a search for people who have expertise in WordNet, a lexical database developed at Princeton. We can’t rely on people to mention WordNet in their profiles. But we can expand our search to include related skills like ontologies and semantic search. Of course it’s a precision / recall tradeoff – but one that is completely transparent to the user.
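The recall-oriented expansion described in this note can be sketched in a few lines. The relatedness scores below are hypothetical; in practice they would be mined from skill co-occurrence across member profiles.

```python
# Sketch: expand a skill query with strongly related skills
# (relatedness scores are hypothetical stand-ins for corpus-mined values).

RELATED = {  # skill -> {related skill: relatedness score}
    "wordnet": {"ontologies": 0.7, "semantic search": 0.6, "nlp": 0.5},
}

def expand(skill, threshold=0.55):
    """Return the query terms after expansion, strongest-related first.

    Raising the threshold trades recall back for precision."""
    terms = [skill]
    terms += [s for s, w in sorted(RELATED.get(skill, {}).items(),
                                   key=lambda kv: -kv[1]) if w >= threshold]
    return terms

print(expand("wordnet"))  # ['wordnet', 'ontologies', 'semantic search']
```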
  • The same technique can be used to disambiguate a query like [owl]. If you’re looking for OWL specialists rather than ornithologists, then it’s helpful to require some supporting evidence, such as expertise in the semantic web or RDF.
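The precision-oriented direction can be sketched the same way: score each profile by how much supporting evidence it has for the intended sense of an ambiguous query. The sense labels and evidence sets below are hypothetical.

```python
# Sketch: disambiguate an ambiguous query like [owl] by requiring
# supporting skills for the intended sense (all values hypothetical).

SENSE_EVIDENCE = {
    "owl (web ontology language)": {"semantic web", "rdf", "ontologies"},
    "owl (bird)": {"ornithology", "birding", "wildlife"},
}

def evidence(profile_skills, sense):
    """Count how many of a profile's skills support the given sense."""
    return len(SENSE_EVIDENCE[sense] & set(profile_skills))

profile = ["owl", "rdf", "semantic web", "java"]
print(evidence(profile, "owl (web ontology language)"))  # 2
print(evidence(profile, "owl (bird)"))                   # 0
```

A profile with more supporting evidence for the OWL-as-ontology-language sense ranks ahead of one whose other skills suggest ornithology.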
  • Knowledge representation isn’t the answer. Computation is great. But with semi-structured data and data-driven computation, we can get even further.
  • To achieve the best results, we have to exploit the strengths of both people and machines. That means using computation to support communication.
  • Web search is beginning to embrace semi-structured data – using the unreasonable effectiveness of data to exploit the structure it has and derive latent structure where possible. The result is more user control and a more intuitive communication between the user and the system. What was once exotic is rapidly becoming mainstream.
  • At this year's Strata conference, my colleague Monica Rogati one-upped Norvig et al.'s argument about the unreasonable effectiveness of data. Not all data is created equal, and quality trumps quantity. This is a teaser – I recommend you watch her talk on "The Model and the Train Wreck".
  • Scale, Structure, and Semantics

    1. Scale, Structure, and Semantics. Daniel Tunkelang, Principal Data Scientist at LinkedIn Recruiting Solutions
    2. Take-Aways: Communication trumps knowledge representation. Communication is the problem and the solution.
    3. Overview: 1. Knowledge representation is overrated. 2. Computation is underrated. 3. We have a communication problem.
    4. The Bad News: 1. Knowledge representation is overrated. 2. Computation is underrated. 3. We have a communication problem.
    5. AI: a dream deferred.
    6. Memex: the Computer Science Version
    7. Cyc
    8. Freebase
    9. Wolfram Alpha
    10. Knowledge representation is overrated. Today's knowledge repositories are incomplete, inconsistent, inscrutable, and not sustained by economic incentives. 1986 estimate of effort to complete Cyc: 250,000 rules and 350 person-years.
    11. The Good News: 1. Knowledge representation is overrated. 2. Computation is underrated. 3. We have a communication problem.
    12. Deep Blue vs.
    13. Watson
    14. Plain Old Search Engines are Pretty Good Too
    15. The Unreasonable Effectiveness of Data: simple models + lots of data >> elaborate models + less data; machine translation: parallel corpora >> elaborate rules for syntactic and semantic patterns; semantic web formalism just means semantic interpretation on shorter strings between angle brackets. Alon Halevy, Peter Norvig, and Fernando Pereira (2009)
    16. Today's Challenge: 1. Knowledge representation is overrated. 2. Computation is underrated. 3. We have a communication problem.
    17. Semi-structured Data (Michael K. Bergman)
    18. Semi-structured Data at LinkedIn. Summary (free text): "I lead a data science team at LinkedIn, which analyzes terabytes of data to produce products and insights that serve LinkedIn's members. Prior to LinkedIn, I led a local search quality team at Google and was a founding employee of faceted search pioneer Endeca (acquired by Oracle in 2010), where…" shown alongside structured markup: <person> <id> <first-name /> <last-name /> <location> <name> <country> <code> </country> </location> <industry> … </person>
    19. Semi-structured Search is a Killer App
    20. Another Example: Helping a Friend. "Dear Daniel, I'm attaching the resume of an old friend who just moved up to the Bay Area. He has a very strong background in: mobile / wireless applications; start-ups and new product launches; international expansion. Best regards, XXX"
    21. Company Search
    22. Semi-structured Data Empowers Users
    23. Data-Driven Recommendations
    24. Data-Driven Computation Serves Communication
        for i in [1..n]
            s ← w1 w2 … wi
            if Pc(s) > 0
                a ← new Segment()
                a.segs ← {s}
                a.prob ← Pc(s)
                B[i] ← {a}
            for j in [1..i-1]
                for b in B[j]
                    s ← wj wj+1 … wi
                    if Pc(s) > 0
                        a ← new Segment()
                        a.segs ← b.segs U {s}
                        a.prob ← b.prob * Pc(s)
                        B[i] ← B[i] U {a}
            sort B[i] by prob
            truncate B[i] to size k
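Slide 24's query-segmentation pseudocode can be made runnable. This is a minimal Python sketch: Pc, the corpus probability of a candidate segment, is a small hypothetical hard-coded table here, and each entry in B[i] is a (segments, probability) pair rather than a Segment object.

```python
# Runnable sketch of the segmentation dynamic program: B[i] holds the
# top-k segmentations of the first i query words, scored by the product
# of per-segment corpus probabilities.

PC = {  # hypothetical corpus probabilities of candidate segments
    "software": 0.3,
    "engineer": 0.2,
    "software engineer": 0.5,
    "senior": 0.1,
    "senior software engineer": 0.4,
}

def segment(words, k=10):
    """Return up to k segmentations of words, highest probability first."""
    n = len(words)
    B = [[] for _ in range(n + 1)]
    for i in range(1, n + 1):
        # the whole prefix as a single segment
        s = " ".join(words[:i])
        if PC.get(s, 0) > 0:
            B[i].append(([s], PC[s]))
        # extend each shorter segmentation with the segment ending at i
        for j in range(1, i):
            for segs, prob in B[j]:
                s = " ".join(words[j:i])
                if PC.get(s, 0) > 0:
                    B[i].append((segs + [s], prob * PC[s]))
        B[i].sort(key=lambda a: -a[1])
        del B[i][k:]  # beam: keep only the k most probable
    return B[n]

print(segment("senior software engineer".split()))
```

On this toy table, the single segment "senior software engineer" outscores the finer-grained splits, which is exactly the behavior the beam of candidates preserves for downstream interpretation.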
    25. Recommendations Leverage Semi-structured Data [diagram: corpus statistics (transition probabilities, connectivity, years of experience to reach a title, education needed for a title) combine candidate fields (title, industry, geo, specialties, summary, education, headline, experience) with job fields (title, description, company, functional area, geo) to derive matching features: exact matches, similarity between candidate expertise/specialties and job description, headline/title similarity, and title transition probabilities, each weighted to score the match]
    26. Skills: A Practical Knowledge Representation
    27. Data-Driven Query Expansion for Recall
    28. Data-Driven Query Refinement for Precision
    29. There is no perfect schema or vocabulary. And even if there were, not everyone would use it. Knowledge representation has only succeeded within narrow scope. Brute force is surprisingly effective but does not leverage the user as an intelligent partner.
    30. Communication is the problem and the solution. A rich communication channel fills gaps in the system's knowledge representation and in the user's knowledge. Use data science to make the system smart, but be humble and empower the human user. "You've got the brawn / I've got the brains / Let's make lots of money" (Pet Shop Boys, "Opportunities")
    31. The Future is Upon Us
    32. One More Thing: "More data beats clever algorithms but better data beats more data." Monica Rogati @ Strata 2012
    33. Thank You! Questions? Contact: We're Hiring!