Real-time Search at Yammer - By Aleksandrovsky Boris

See conference video - http://www.lucidimagination.com/devzone/events/conferences/revolution/2011

This talk focuses on the architecture, scalability concerns, performance bottlenecks,
operational characteristics and lessons learned while designing and implementing Yammer's
distributed real-time search system. Yammer is an enterprise social network SaaS offering with over
100,000 networks (including 85% of the Fortune 100) and nearly 2 million users. The search system
we developed scales well up to 1B messages and serves as a foundation for the knowledge base analysis
services Yammer is developing.

  • Similar to how a single malt is made, knowledge is distilled from information, facts and experience. The role of the search engine is to capture the process and make it readily available.
  • Private and secure enterprise social network for coworkers and colleagues to communicate, collaborate, and coordinate. An interactive online knowledge base that connects dispersed workers in ways that are easy, real-time, social, and searchable. A way to share what’s relevant to the right colleagues, by drawing attention to and discussing important issues. “The Social Glue” to an organization, driving better collaboration and process improvements while preserving institutional knowledge. Real-time communication, coordination. Business continuity and relevance. Global connectivity, accessible anywhere.
  • “Introducing Yammer: combining the new ways we communicate, with the consumerization of enterprise software to achieve faster communications, better collaboration, and more productivity.” Overview of the key features but emphasize this is a Knowledge Base: Search for answers and topics, identify collaborators and experts.
    • Messaging and Feeds: Ask questions, start discussions. Share news, links, opinions, and ideas. Streamline communication, understand context in threaded conversations.
    • My Feed, Company Feed, RSS Feeds: follow what and who are of most interest to you, stay on top of company news, add RSS to stay informed.
    • Direct Messaging: Send private direct messages to co-workers, reduce email volume, add others who can catch up by reading thread histories.
    • User Profiles: Each user creates a profile with their photo, title, and background. Easily connect with co-workers and expertise.
    • Company Directory: Upgrade to enterprise for additional security and admin features, including company directory integration. Help new employees quickly get up to speed.
    • Groups and Communities: build engagement by creating internal Groups around projects and topics, and external Communities with partners and customers.
    • Applications: Share files, enhance productivity, and increase collaboration through Yammer’s suite of core apps and a la carte Third Party Apps for document sharing, tracking, helpdesk ticketing, and more.
    • Integrations: SharePoint 2007 and 2010, Outlook, Salesforce, soon: Box
    • Access and Mobility: Access Yammer anywhere, through the web, Desktop client, IM, SMS, Microsoft Sharepoint, and mobile applications (iPhone, Blackberry, Android, Windows Mobile).
    • Translations: soon available in 100 languages
    • Network Consultation and Support: included with enterprise upgrade
    • OTHER stuff to talk about if you like:
      • @People and #Topics: Quickly loop co-workers into conversations and tag topics for further information discovery and sharing.
      • Connectivity and Crisis Communications: connect your dispersed workforce, crowdsource ideas, and broadcast company-wide in times of critical need.
  • “We know our product inside and out from our work with 100K+ company networks. From product iterations to customer use cases, to deployment and engagement services, we have a depth of expertise that has made us the market leader.”
  • Before getting into the product – let’s get at the problem(s) Yammer is attempting to address…
  • From the perspective of search, people use Yammer today in two modes. First, they want to simply capture the information which might have scrolled out of view in their Yammer feed. This is very similar to Twitter - I check it once in a while, but what have I missed since the last time? For this use-case we want to present search results in reverse chronological order and answer simple queries. The second mode is the knowledge exploration mode. Yammer is a knowledge base created by interactions between colleagues over time within a company. Yammer can help with the on-boarding process, faq's, tips, computer setup, company procedures and processes, practices and culture. For this, search is an entry point and quite possibly the most important interaction element. We need to answer complicated queries and present results based on textual similarity, popularity, engagement and social distance.
  • Wikis working in groups - people are creating some connections but they are not well organized.
  • The biggest challenges for search at Yammer are the real-time nature of the information and the complicated relevancy story. Information on Yammer should be indexed and available for users to search in real time, virtually in less than a second. This makes the Yammer indexing system similar to Twitter's, where tweets are indexed in real time. Search results likewise are available in reverse chronological order, which is based on the assumption that for certain types of events, timeliness is the most pertinent characteristic. This maps really well onto types of content like news, where relevancy declines fairly rapidly as time passes, or types of content which are more transient in nature, like events and meetings. There are other types of content where the relationship between the creator of the content and the searcher is important, and where the sheer popularity of the content is important as well. This is more of a Facebook newsfeed case, which tries to present content from people you value or interact with most. A good example would be communications from your boss, or an expert opinion you trust. Popular discussion threads which capture the attention of the company are important to find since they usually encompass the "company culture". There are, however, other types of content that are much more knowledge-heavy, for the retrieval of which textual similarity, reputation and potential for engagement are more important than timeliness. For instance, when a sales representative is searching for a relevant approach to a particular client industry, he would be interested in the experiences of all other sales people who tried to sell to that industry, and he would want to look back as far as the records go. This is a case where Yammer's search system is trying to act more like Google's search system.
  • Out-of-order delivery is the source of all (most) evil. Easily 50% of the complexity is there.
    • Solution: Guarantee in-order delivery
      • buffer and wait
      • degrades performance and availability, and only guarantees very eventual consistency
    • Minimize the probability and forget
      • timestamp precision
      • clock skew
    • Solution: Arbitrate
      • based on timestamps / vector clocks (timestamps + versions)
      • based on semantics
      • based on business cases
      • need to index tombstones (mark-for-delete)
  • Dual indexing
    • primary index for serving out
    • secondary index for reindexing
    • Verify secondary index consistency
    • foreach replica in turn
      • shutdown
      • mv secondary to primary
      • restart
    • Availability should not be affected except for slight chance of system failure on the serving replica.
  • Indexing problems
    • Detect
      • index integrity tool checks against the :source of truth:
      • identifies patches
    • Reindex
      • gaps
      • whole
      • reindex into secondary, swap with primary
    • Repair job
      • patch in place
  • Call all: more predictable latency profile, index warmup advantage. Round robin: when under load stress. Least busy: most complicated, requires metrics poll, prone to errors when activity is bursty.
  • Testing
    • Indexing
      • Idempotent
      • Out-of-order delivery
      • 10K docs delivered in random order with X% of dupes
    • Search
      • Build small manual index by recording events
      • Create unit-test style tests with Asserts
  • Production
    • Metrics
    • Alerts via Zabbix (Zabbix is awesome)
    • Puppet
    • Ganglia for machine level diagnostics
    • Have enough redundancy
  • Gauges are instantaneous readings of values (e.g., a queue depth). Counters are 64-bit integers which can be incremented or decremented. Meters are increment-only counters which keep track of the rate of events. They provide mean rates, plus exponentially-weighted moving averages which use the same formula that the UNIX 1-, 5-, and 15-minute load averages use. Histograms capture distribution measurements about a metric: the count, maximum, minimum, mean, standard deviation, median, 75th percentile, 95th percentile, 98th percentile, 99th percentile, and 99.9th percentile of the recorded values. (They do so using a method called reservoir sampling which allows them to efficiently keep a small, statistically representative sample of all the measurements.) Timers record the duration as well as the rate of events. In addition to the rate information that meters provide, timers also provide the same metrics as histograms about the recorded durations. (The samples that timers keep in order to calculate percentiles and such are biased towards more recent data, since you probably care more about how your application is doing now as opposed to how it's done historically.)

Transcript

  • 1. Realtime revolution at work. REAL-TIME SEARCH AT YAMMER. May 25, 2011. By Boris Aleksandrovsky, Yammer, Inc. http://www.linkedin.com/in/baleksan
  • 2.
    • Communication is hard, search is harder
      • What me grammar?
      • Private language
      • Conversational language
      • Time compressed
      • Transient
      • Poorly organized
      • Authority is suspect
      • Social pressures
  • 3.
  • 4. Challenges - From information to knowledge Information Facts Knowledge Attention Engagement Retention Messages Metadata Personalized Search
  • 5. Agenda
    • Background
    • Why search?
    • Indexing
    • Search
    • Tools and methodologies
    • Lessons learned
    • Future
    • Q&A
  • 6. Yammer: Putting Social Media to Work
    • Yammer makes work
    • Real-time, Social, Mobile
    • Collaborative, Contextual
    • More Human!
    • Similar to:
    • Facebook
    • Twitter
    • Wikis
    • Groups
    Knowledge Management: Document-oriented. Enterprise Collaboration: Outcome-focused. Social Media: People-centric.
  • 7. Yammer: The Enterprise Social Network
    • Messaging and Feeds
    • Direct Messaging
    • User Profiles
    • Company Directory
    • Groups (Internal)
    • Communities (External)
    • File Sharing
    • Applications
    • Integrations
    • Web, Desktop, Mobile, Tablet
    • Translations
    • Network Consultation and Support
    Easy. Shared. Searchable. Real-time. Where your company’s knowledge lives.
  • 8. 100,000+ companies, including 85% of the Fortune 500 – and growing.
  • 9. What do you discuss at work, and with whom?
    • Who do you need to communicate with, across the company?
    • How often are the same questions asked?
    • Who has the answers? Who has new ideas? Who can help?
    What do our employees think of our 401K program? Is everybody saving? What’s the latest with the XYZ account? What are our recommendations for financial and regulatory reform given the latest news about…? What will be discussed at our Quarterly Sales Kickoff? Where can I find out more about customer events here at the ABC conference? Who’s free to meet up? How can my team better prepare for our next product release? Who has any fresh ideas for… Who will I be working with on this new project?
  • 10.
  • 11. Search use case - Transient Awareness
    • Reverse-chronological
    • Simple queries
    • Facet
      • Date
      • Sender
      • Group
  • 12. Search use case - Knowledge Exploration
    • Complicated relevance story
      • tf/idf
      • popularity
      • engagement
      • social distance
    • Complicated queries
    • Facet
      • Date
      • Sender
      • Group
      • Object type
  • 13. Challenges for Yammer’s search engine
    • More knowledge is generated in realtime
      • Availability latency < 1 sec
      • Not always well formed
    • Complicated relevance story
      • experts and their reputation
      • popularity
      • social graph
      • tagging/topics
      • engagement signals
      • timeliness
      • location
  • 14. Team
    • 2 engineers
    • 8 man months
    • Lots of fun
  • 15. Indexing
    • DB to replica
  • 16.
  • 17. Replication
    • Independent near-replicas based on a single distributed source of truth
    • Can (will) get out of sync
    • Automatic monitoring of replication quality
      • Are replicas out of sync with other replicas?
        • number of docs
        • alert > X
      • Are replicas out of sync with the DB?
        • statistical sample of docs
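The two replication checks on this slide could be sketched roughly as follows. This is a minimal illustration, not Yammer's monitoring code: the function names, the dictionary of per-replica document counts and the in-memory id sets are all assumptions made for the example.

```python
import random


def count_drift(replica_counts):
    """Largest difference in document count between any two replicas."""
    counts = list(replica_counts.values())
    return max(counts) - min(counts)


def check_replicas(replica_counts, max_drift):
    """Replica-vs-replica check: alert when the count drift exceeds X (max_drift)."""
    drift = count_drift(replica_counts)
    return {"drift": drift, "alert": drift > max_drift}


def sample_missing(db_ids, replica_ids, sample_size, rng):
    """Replica-vs-DB check: fraction of a random sample of DB ids the replica lacks.

    db_ids and replica_ids are plain sets here (hypothetical stand-ins for
    queries against the database and the index).
    """
    population = sorted(db_ids)
    sample = rng.sample(population, min(sample_size, len(population)))
    if not sample:
        return 0.0
    missing = sum(1 for doc_id in sample if doc_id not in replica_ids)
    return missing / len(sample)
```

Sampling keeps the DB comparison cheap; comparing every document would be closer to the full integrity check described later in the deck.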
  • 18. Indexing
    • In-replica to index
  • 19. 30s
  • 20. Why is it hard?
    • No timeliness guarantee
    • Fragmentation
    • Out-of-order deliveries
    • Index dependencies
      • Need to denormalize the information
    • Need to build for network partition tolerance and redundancy
    • But
      • Eventual consistency
      • Eventual delivery
  • 21. How do we cope?
    • Out of order delivery source of (most) evil
    • ?
      • A) Assure in-order delivery
        • buffer and wait
          • degrades performance, availability and timeliness, and is only very eventually consistent
      • B) Minimize probability and ignore
        • timestamp precision
        • clock skew
      • C) Arbitrate
        • timestamp / vector clocks
        • semantics
        • need to index lifecycle events
    • Need to build for network partition tolerance and redundancy
    • But
      • Consistency guarantee
      • Eventual delivery
  • 22. Delete-update race
    • [create Message “hello” id=5 ts=12:34:39]
    • [delete Message “hello there” id=5 ts=12:45:01]
    • [modify Message “hello there” id=5 ts=12:45:01]
    id | timestamp | tombstone
    5  | 12:34:39  | no
    5  | 12:45:01  | yes
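One way to make the delete-update race harmless is to index the delete as a tombstone and let it win by semantics. The sketch below assumes that rule (a deleted message stays deleted) plus last-writer-wins by timestamp for everything else; the `Entry` type and `apply_event` function are hypothetical names, and timestamps are compared as `HH:MM:SS` strings.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    ts: str
    text: str
    tombstone: bool = False


def apply_event(index, kind, doc_id, ts, text=""):
    """Apply a create/modify/delete event to an in-memory index (a dict)."""
    current = index.get(doc_id)
    if current is not None and current.tombstone:
        # Assumed semantics: once deleted, later-arriving writes are ignored.
        return
    if kind == "delete":
        # Keep a tombstone instead of dropping the doc, so that events
        # delivered after the delete can still be recognised and discarded.
        index[doc_id] = Entry(ts=ts, text="", tombstone=True)
    elif current is None or ts > current.ts:
        # create / modify: last writer wins by timestamp.
        index[doc_id] = Entry(ts=ts, text=text)
```

With this rule the three events on the slide end in the same state (id 5 tombstoned at 12:45:01) regardless of the order in which they are delivered.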
  • 23. Multiple update race
    • [create Message “hello” id=5 ts=12:34:39]
    • [modify Message “hello there now” id=5 ts=12:45:01]
    • [modify Message “hello there” id=5 ts=12:45:01]
    id | timestamp | text
    5  | 12:34:39  | hello
    5  | 12:45:01  | hello there now
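When two updates carry the same timestamp, the timestamp alone cannot arbitrate. A possible sketch, assuming each event also carries a monotonically increasing `version` field (the slide does not show one; it is added here purely for illustration):

```python
def newer(current, incoming):
    """Return the record with the larger (ts, version) pair; ties keep current."""
    if current is None:
        return incoming
    if (incoming["ts"], incoming["version"]) > (current["ts"], current["version"]):
        return incoming
    return current


def replay(events):
    """Fold a stream of events for one document into its final state."""
    state = None
    for event in events:
        state = newer(state, event)
    return state
```

If "hello there now" is version 3 and "hello there" is version 2, the final text is "hello there now" even when the older modify is delivered last, which matches the table on the slide.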
  • 24. Dupes
    • [create Message “hello” id=5 ts=12:34:39]
    • [like Message id=5 userId=3 ts=12:45:01]
    • [like Message id=5 userId=3 ts=12:45:02]
    • [unlike Message id=5 userId=3 ts=12:45:04]
    id | timestamp | numLikes
    5  | 12:34:39  | 0
    5  | 12:45:01  | 1
    5  | 12:45:02  | 1
    5  | 12:45:04  | 0
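Duplicate deliveries stop mattering if the index stores who liked a message rather than a running counter. A minimal sketch, assuming a per-message set of user ids (the function name and the set representation are illustrative, not taken from the talk):

```python
def apply_like_event(likers, kind, user_id):
    """Apply a like/unlike event to the set of likers and return numLikes."""
    if kind == "like":
        likers.add(user_id)      # adding the same user twice is a no-op
    elif kind == "unlike":
        likers.discard(user_id)  # removing an absent user is a no-op
    return len(likers)
```

Replaying the slide's events (like, duplicate like, unlike by userId=3) yields counts of 1, 1 and 0. Note that this only covers duplicates; an unlike delivered before its like would still need timestamp arbitration.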
  • 25. Thread example
  • 26. Zoie
    • Realtime indexing system
    • Open sourced by LinkedIn
    • Used by LinkedIn in production for about 3 years
    • Deployed at a dozen or so locations
    • Thanks Xiaoyang Gu, Yasuhiro Matsuda, John Wang and Lei Wang
  • 27. Zoie
    • Push events into buffer and the transaction log
    • Push buffer into Zoie
    • When Zoie commits, transaction log is truncated.
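The buffer-plus-transaction-log flow can be sketched as below. This does not use Zoie's actual API: the index is a plain dict, the transaction log is an in-memory list standing in for a file on disk, and `BufferedIndexer` and its methods are hypothetical names.

```python
class BufferedIndexer:
    def __init__(self, index):
        self.index = index   # stand-in for the real-time index
        self.buffer = []     # events not yet handed to the index
        self.txlog = []      # stand-in for an on-disk transaction log
        self.flushed = 0     # logged events already pushed into the index

    def push(self, event):
        """Log first, then buffer, so a crash cannot lose the event."""
        self.txlog.append(event)
        self.buffer.append(event)

    def flush(self):
        """Push the buffer into the index."""
        for doc_id, text in self.buffer:
            self.index[doc_id] = text
        self.flushed += len(self.buffer)
        self.buffer.clear()

    def on_commit(self):
        """The index committed to disk: truncate the log up to that point."""
        del self.txlog[:self.flushed]
        self.flushed = 0

    def replay(self, logged_events):
        """After a restart, re-apply whatever survived in the log."""
        for event in logged_events:
            self.push(event)
        self.flush()
```

Only events that were already flushed are truncated on commit, so anything still sitting in the buffer stays recoverable from the log.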
  • 28. Indexing HA
    • Cluster queue systems
      • Round-robin of Rabbits introduces further out-of-order problems.
    • Transaction log
      • Between RabbitMQ dequeue and Zoie disk commit
  • 29. Dual indexing
    • Primary for serving out
    • Secondary for reindexing
    • Verify secondary index consistency
    • foreach replica do
      • shutdown
      • mv secondary to primary
      • restart
    • Availability should not be affected except for slight chance of system failure
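The rolling swap could look roughly like the sketch below. It assumes each replica keeps its indexes in `primary` and `secondary` subdirectories, and that `stop`, `start` and `verify` are callbacks supplied by the operator; none of these names come from the talk.

```python
import os
import shutil


def swap_indexes(replica_dir):
    """Replace the primary index directory with the freshly built secondary."""
    primary = os.path.join(replica_dir, "primary")
    secondary = os.path.join(replica_dir, "secondary")
    if not os.path.isdir(secondary):
        raise FileNotFoundError(secondary)
    shutil.rmtree(primary, ignore_errors=True)
    os.rename(secondary, primary)


def rolling_swap(replica_dirs, stop, start, verify):
    """Swap one replica at a time so the other replicas keep serving."""
    for replica_dir in replica_dirs:
        secondary = os.path.join(replica_dir, "secondary")
        if not verify(secondary):
            raise RuntimeError(f"secondary index failed verification: {secondary}")
        stop(replica_dir)
        swap_indexes(replica_dir)
        start(replica_dir)
```

Because only one replica is down at any moment, availability depends on the remaining replicas staying healthy during the swap, which is the "slight chance of system failure" the slide mentions.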
  • 30. Index consistency problems
    • Detect
      • integrity check against the :source of truth:
    • Reindex
      • gaps
      • whole
      • reindex into secondary, swap with primary
    • Repair
      • patch in place
      • run on restart
  • 31. Search
    • <insert animated architecture slide>
  • 32. Goal
    • 50/50-500/100 per partition
    • 50M docs
    • 50 msec P75 - 500 msec P99
    • 100 qps
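The latency part of the goal (50 msec at P75, 500 msec at P99) can be checked against recorded query times with a nearest-rank percentile. This is a sketch for illustration; the function names are assumptions and the thresholds are simply the figures from the slide.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def meets_latency_goal(latencies_ms, p75_limit=50, p99_limit=500):
    """True when the recorded latencies satisfy both percentile targets."""
    return (percentile(latencies_ms, 75) <= p75_limit
            and percentile(latencies_ms, 99) <= p99_limit)
```

Tracking two percentiles rather than a mean keeps a handful of slow queries from hiding behind many fast ones.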
  • 33. RESTful API over HTTP
    • http://search.yammer.com:8085/api/search/1/1?query=i&start=0&pageSize=5&f=date,05242001
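A client might assemble such a request as sketched below. The parameter names (`query`, `start`, `pageSize`, `f`) are taken from the example URL; the helper itself, and the reading of `f` as a `name,value` facet filter, are assumptions. The meaning of the two path segments after `/api/search/` is not stated on the slide, so the base URL is passed through unchanged.

```python
from urllib.parse import urlencode


def search_url(base, query, start=0, page_size=5, facet=None):
    """Build a search request URL in the shape shown on the slide."""
    params = {"query": query, "start": start, "pageSize": page_size}
    if facet is not None:
        name, value = facet
        params["f"] = f"{name},{value}"
    # Keep the comma in the facet value unescaped, as in the slide's example.
    return f"{base}?{urlencode(params, safe=',')}"
```

For example, `search_url("http://search.yammer.com:8085/api/search/1/1", "i", facet=("date", "05242001"))` reproduces the URL above.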
  • 34. Payload
    • Payload is usually a small JSON object
    • For security reasons only ids and scores are sent out
    • One page (usually 10 items) x 6 index types.
  • 35. Payload
  • 36. Web Server
    • Jersey over Jetty
    • http://jetty.codehaus.org/jetty/
      • Custom configuration
        • tuned to the required 100 qps
        • generally impeccable, occasional lock contention
    • http://jsr311.java.net/
      • Annotation driven
      • Much easier to test
  • 37. Search master
    • More like a router
    • Knows about partitioning scheme
    • Performs load normalization
      • Call all, take the first
        • Possible to use multicast
      • Round Robin
        • switch to this for scale
      • DLB (Least busy)
    • Maintains primary SLA metrics
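The three load-normalization strategies can be sketched as follows. Replicas are modelled as plain callables; `make_replica` is a demonstration stand-in that sleeps to simulate latency, and all names here are hypothetical rather than taken from Yammer's search master.

```python
import itertools
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def make_replica(name, delay_s):
    """Demo stand-in for a search replica that answers after delay_s seconds."""
    def search(query):
        time.sleep(delay_s)
        return (name, query)
    return search


def call_all_take_first(replicas, query):
    """Send the query to every replica and return whichever answers first."""
    pool = ThreadPoolExecutor(max_workers=len(replicas))
    futures = [pool.submit(replica, query) for replica in replicas]
    done, _ = wait(futures, return_when=FIRST_COMPLETED)
    pool.shutdown(wait=False)  # do not block on the slower replicas
    return next(iter(done)).result()


class RoundRobin:
    """Hand out replicas in a fixed rotation."""
    def __init__(self, replicas):
        self._cycle = itertools.cycle(replicas)

    def pick(self):
        return next(self._cycle)


def least_busy(in_flight):
    """Pick the replica with the fewest in-flight requests (needs polled metrics)."""
    return min(in_flight, key=in_flight.get)
```

Call-all multiplies the query load by the number of replicas, which is why the slide suggests switching to round robin for scale. This sketch also returns the first completed call even if it failed; a fuller version would skip errors and wait for the next answer.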
  • 38. Partitioning
    • Simple Jenkins 64-bit hash of networkId
    • 2 level hash to split large partitions
    • Exception list to split large partition
    • Limitation: Cannot partition inside a single network
    • Repartitioning story is expensive
    • Consistent hashing?
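A rough sketch of the lookup order: exception list first, then the first-level hash, then a second-level hash for partitions that have been split. The slide names a Jenkins 64-bit hash; this example swaps in an 8-byte BLAKE2b digest from Python's standard library as a stand-in, and the partition names, the `split` map and the `exceptions` map are all illustrative assumptions.

```python
import hashlib


def hash64(key, salt=b""):
    """64-bit hash of a key (BLAKE2b stand-in, NOT the Jenkins hash from the slide)."""
    digest = hashlib.blake2b(str(key).encode("utf-8"), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "big")


def partition_for(network_id, num_partitions, split=None, exceptions=None):
    """Map a network id to a partition name.

    exceptions: {network_id: partition} overrides for very large networks.
    split: {partition: [sub-partitions]} for partitions that were split in two levels.
    """
    exceptions = exceptions or {}
    split = split or {}
    if network_id in exceptions:
        return exceptions[network_id]
    first = f"p{hash64(network_id) % num_partitions}"
    if first in split:
        subs = split[first]
        # A differently salted hash spreads the networks over the sub-partitions.
        return subs[hash64(network_id, salt=b"level2") % len(subs)]
    return first
```

All documents of one network land on the same partition, which reflects the stated limitation: a single network cannot be partitioned internally. Changing `num_partitions` remaps most networks, which is one reason repartitioning is expensive.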
  • 39. Testing
    • Indexing
      • Idempotent
      • Out-of-order delivery
      • Duplicate and incomplete docs tolerance
      • 10K docs delivered in random order with X% of dupes and Y% incomplete records
    • Search
      • Small manual index by recording event
      • Unit style tests (testng) with Asserts
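The indexing test idea (shuffle the events, inject duplicates, assert the same final index) can be sketched as below. The talk used TestNG; this is a Python illustration with an assumed event shape of `(doc_id, version, text)` and a simple highest-version-wins rule.

```python
import random


def apply(index, event):
    """Idempotent, order-insensitive apply: the highest version of a doc wins."""
    doc_id, version, text = event
    current = index.get(doc_id)
    if current is None or version > current[0]:
        index[doc_id] = (version, text)


def build(events):
    """Build an index (a dict) from a stream of events."""
    index = {}
    for event in events:
        apply(index, event)
    return index


def shuffled_with_dupes(events, dupe_fraction, rng):
    """Return the events in random order with a fraction of them duplicated."""
    dupes = rng.sample(events, int(len(events) * dupe_fraction))
    noisy = events + dupes
    rng.shuffle(noisy)
    return noisy
```

A test then builds one index from the ordered stream and another from `shuffled_with_dupes(events, 0.1, random.Random(seed))` and asserts the two are equal for several seeds.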
  • 40. Production
    • Measure
    • Hardware is cheap, people are not
      • People require more maintenance
    • Have enough redundancy
  • 41. Metrics
    • JVM, Queue, Logging and Configuration
  • 42. Metrics
    • Gauges
  • 43. Metrics
    • Meters
  • 44. Metrics
    • Timers
  • 45. Metrics
    • https://github.com/codahale/metrics
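To illustrate two of the ideas behind these metric types, here is a small sketch of an exponentially weighted moving average (the kind of smoothing load averages use) and of uniform reservoir sampling (Algorithm R). It is not the API of the metrics library linked above; class names, the 5-second tick and the window parameter are assumptions.

```python
import math
import random


class Ewma:
    """Exponentially weighted moving average of an event rate (events/second)."""

    def __init__(self, window_s, tick_s=5.0):
        self.alpha = 1.0 - math.exp(-tick_s / window_s)
        self.tick_s = tick_s
        self.rate = None
        self.uncounted = 0

    def mark(self, n=1):
        """Record n events since the last tick."""
        self.uncounted += n

    def tick(self):
        """Call once per tick_s seconds to fold recent events into the average."""
        instant = self.uncounted / self.tick_s
        self.uncounted = 0
        if self.rate is None:
            self.rate = instant
        else:
            self.rate += self.alpha * (instant - self.rate)


class Reservoir:
    """Fixed-size uniform random sample of a stream (Algorithm R)."""

    def __init__(self, size, rng=None):
        self.size = size
        self.count = 0
        self.values = []
        self.rng = rng or random.Random()

    def update(self, value):
        self.count += 1
        if len(self.values) < self.size:
            self.values.append(value)
        else:
            slot = self.rng.randrange(self.count)
            if slot < self.size:
                self.values[slot] = value
```

A uniform reservoir weights old and new measurements equally; a sample biased toward recent data, as described for timers, needs a decaying variant that this sketch does not implement.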
  • 46. Lessons
    • Do not underestimate your data model
    • Tradeoff between consistency, RT availability and correctness
    • Measure
    • Flexible partitioning scheme
    • Data recovery plan
  • 47. Future
    • Dynamic routing
      • Zookeeper
    • Partition rebalancing
    • Multiple sub-partitions with different SLAs
    • Work on relevancy
    • Multiple languages
    • Document parsing
    • External data
    • Scala
  • 48. Q&A Session: What’s On Your Mind?