
System Revolution - How We Did It

My name is Victor Perepelitsky. I'm an R&D Technical Leader at LivePerson, leading the 'Real Time Event Processing Platform' team.

In this Meetup talk I covered the journey of creating the platform from scratch: challenges, design decisions, technology choices, and more.

During the last 3 years the team has built a Real Time Event Processing Platform which is currently running in production with thousands of new and migrated customers. It is built to handle hundreds of thousands of requests per second with low-latency response times (under 30 ms round trip).

I went through different topics and stages of this journey and shared the details that led to specific choices and results.

“Stateful or Stateless”, “CEP”, “Rules engine”, “Automated performance testing”, “Locking”, and “Timing” were all part of the menu.


  1. System revolution - How we did it. Victor Perepelitsky. Questions: www.meetup.com/ILTechTalks/events/226834931/ | Slides: www.slideshare.net/victorperepelitsky | Email: victor.prp@gmail.com
  2. LivePerson customer example (diagram: a visitor from the UK exchanges chat lines with a salesman; a sales manager invites UK visitors to chat and sees reports; session state and activity events flow between them)
  3. LivePerson at a glance
     ● Account (brand) - a LivePerson customer
     ● Visitor - an individual who interacts with the business owner’s brand
     ● Agent - an account representative who may interact with visitors (examples: technical support, sales)
     ● Admin - an account representative who defines the business goals and normally manages agents in order to effectively reach them
  4. LivePerson at a glance (diagram): agents and visitors exchange chat lines at chat scale (~2K req/sec); visitors generate session state and activity events at visitor scale (~100K req/sec); admins define business rules and see reports at admin scale (under 100 req/sec)
  5. Legacy (diagram): the same flows, served by two components - a Real Time Server and an Offline and Reporting system
  6. Legacy - stateful + account sticky (diagram): web servers route each session to the RT server that owns its account (e.g. RT servers holding accounts {A, C}, {B, D}, and {E, F, G})
  7. Legacy
     ● Works
     ● Fast
     ● Partially resilient
     ● Huge amount of features
  8. Legacy - pains
     ● Hard to scale
     ● Hard to add new features
     ● Poor resource utilization
     ● Poor manageability
     ● Poor QoS
     ● Huge friction with customers
  9. Let's go back (diagram: the scale picture from slide 4 again)
  10. Proper system architecture (diagram): the same flows, split into dedicated real time, offline, reporting, and config components
  11. The new dream (diagram): chat, offline, reporting, and config components plus a 'monitor and engage' platform hosting business apps/extensions; admin scale grows to under 1K req/sec
  12. Monitor and engage = shark. Shark manifesto:
     ● Collects and makes available data about individuals (visitors) as they interact with the business owner’s brand (account)
     ● Acts in real time to engage visitors (chat, ad, call, etc.)
     ● Is a platform for business logic modules (sharklets) which might be independently developed and deployed
  13. Fundamental decisions: Requirements?
  14. Platform requirements
     ● E2E latency within the DC < 30 ms
     ● Good resource utilization (CPU > 50%)
     ● Efficient - at least 500 req/sec per node
     ● Independent sharklet development lifecycle
     ● High availability: uptime > 99.99999%, data loss < 0.01%
     ● Resilient - no service downtime when an external resource is unavailable (minimal degradation is allowed)
     ● Business logic correctness - 99.9%
  15. Fundamental decisions: Requirements? -> defined. Stateful or stateless?
  16. Stateful (diagram): stickiness is required - each of sessions 1-4 is pinned to its server
  17. Stateless (diagram): sessions 1-4 hit any server; each request potentially requires access to a session data store
  18. Facts that helped us to decide
     1. Legacy works as “stateful without HA”
     2. A small data loss has a tiny customer impact (0.01% loss is good enough)
     3. Stateless requires much more resources and initial effort
     4. We can add an HA store in the future
  19. Stateful shark (diagram): as in legacy, web servers route each session to the RT server that owns its account
  20. Fundamental decisions: Requirements? -> defined. Stateful or stateless? What are the big parts?
  21. What are the big parts?
  22. Legacy - successful patterns
     1. Requests are processed in memory
     2. External resources are accessed asynchronously to visitor requests
     3. Customer rules and data (AccountConfig) are kept in memory and may be updated in the background (see the sketch after this list)
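     The third pattern in particular shaped what came next; here is a minimal sketch of it under invented names (AccountConfigCache and AccountConfigSource are illustrations, not the actual classes):

        import java.util.concurrent.Executors;
        import java.util.concurrent.ScheduledExecutorService;
        import java.util.concurrent.TimeUnit;
        import java.util.concurrent.atomic.AtomicReference;

        // Pattern 3: config kept in memory and refreshed in the background,
        // so request threads read it without ever blocking on I/O.
        final class AccountConfigCache {
            private final AtomicReference<AccountConfig> current = new AtomicReference<>();
            private final ScheduledExecutorService refresher =
                    Executors.newSingleThreadScheduledExecutor();

            AccountConfigCache(AccountConfigSource source) {
                current.set(source.load());                  // initial blocking load
                refresher.scheduleAtFixedRate(
                        () -> current.set(source.load()),    // periodic background refresh
                        30, 30, TimeUnit.SECONDS);
            }

            AccountConfig get() { return current.get(); }    // lock-free read on the hot path
        }

        interface AccountConfigSource { AccountConfig load(); }
        final class AccountConfig { /* customer rules and data */ }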
  23. Legacy - pains
     1. Order of calls (inside code + rules)
     2. Business logic is not built from pluggable components
     3. HTTP requests are tightly coupled with the logical layers (hard to move toward other protocols such as WebSockets)
  24. (diagram)
  25. (diagram): SYNC - fast CEP, engagements; ASYNC - slow actions, external resources access. Sharklet A has sync and async handlers; web visitor, mobile visitor, and agent traffic enters via adapters and a facade; Account Runtime Data and a Message BUS sit between the sides, with external resources behind the async side
  26. Shark - The Big Parts
     1. Facade - decouples real-world protocols from the logical layers
     2. CEP - avoids call order management
     3. Sync - very fast in-memory processing
     4. Async - allows slow actions and external resource access
     5. Account Runtime Store - allows in-memory access to customer configuration
  27. Fundamental decisions: Requirements? -> defined. Stateful or stateless? -> stateful. What are the big parts? -> we have it. Basic technology stack?
  28. Basic technology stack - ?
  29. We were practical. CEP technology?
  30. CEP (Complex Event Processing) - in a nutshell (diagram)
  31. Drools - in a nutshell (diagram)
  32. Drools - we tried to kill it. We had:
     ● played with it - :)
     ● integrated it into shark - :)
     ● made a POC using LivePerson logic - :)
     ● tested it for performance - :(
  33. We played with more technologies
  34. And finally chose the solution
  35.-41. Shark CEP - processing cycle (animation over seven slides: events a, b, and later c move through the Event Queue and handlers 1-3 across successive processing cycles until the queue is empty)
  42. Sharklet handler example (code image; a hedged sketch follows)
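     The handler code on that slide was an image; below is a hedged sketch (all names invented, not the real shark interfaces) of how a sync handler and the event-queue cycle from slides 35-41 could fit together:

        import java.util.ArrayDeque;
        import java.util.List;
        import java.util.Queue;

        // Hypothetical processing cycle: drain the queue, run every matching
        // handler, and keep going while handlers emit follow-up events.
        interface Event { String type(); }

        interface SyncHandler {
            boolean accepts(Event event);                    // which events trigger this handler
            void handle(Event event, Queue<Event> emitted);  // may emit follow-up events
        }

        final class ProcessingCycle {
            private final List<SyncHandler> handlers;
            private final Queue<Event> queue = new ArrayDeque<>();

            ProcessingCycle(List<SyncHandler> handlers) { this.handlers = handlers; }

            void submit(Event event) { queue.add(event); }

            void run() {
                while (!queue.isEmpty()) {
                    Event event = queue.poll();
                    for (SyncHandler handler : handlers) {
                        if (handler.accepts(event)) {
                            handler.handle(event, queue);    // emitted events join the same queue
                        }
                    }
                }
            }
        }

        // A tiny handler in that style: reacts to one event type, emits another.
        final class PageVisitHandler implements SyncHandler {
            public boolean accepts(Event event) { return "page-visit".equals(event.type()); }
            public void handle(Event event, Queue<Event> emitted) {
                emitted.add(() -> "visitor-activity-updated");   // Event is a one-method interface
            }
        }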
  43. Fundamental decisions: Stateful or stateless? -> stateful. What are the big parts? -> we have it. Basic technology stack -> chosen. CEP technology choice -> DIY (in-house).
  44. Fundamental decisions: Stateful or stateless? -> stateful. What are the big parts? -> we have it. Basic technology stack -> chosen. CEP technology choice -> DIY (in-house). Locking architecture?
  45. Locking - the model (diagram: the world contains accounts; account A contains sessions 1 to 4)
  46. Locking - legacy pains
     ● You must be aware of locking when writing business logic
     ● A write lock on an account freezes all account operations
     ● Locking became the bottleneck (not the CPU)
     ● Bugs
  47. Locking - shark solution
     ● Read/write lock per session
     ● Write business logic only - no locking awareness
     ● No write lock on account - copy on write
  48. (diagram): SYNC - a single processing cycle uses a consistent copy of the account data; ASYNC - updates account data using the copy-on-write pattern; same sharklet/facade/adapter layout as slide 25
  49. Sharklet example (no locks) (code image; a hedged sketch follows)
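     The sharklet code on that slide was also an image; here is a hedged sketch of the no-locks idea (invented names) using copy-on-write account data:

        import java.util.Collections;
        import java.util.HashMap;
        import java.util.Map;
        import java.util.concurrent.atomic.AtomicReference;

        // Copy-on-write account data: readers never lock, writers publish a new copy.
        final class AccountRuntimeData {
            private final Map<String, String> values;

            AccountRuntimeData(Map<String, String> values) {
                this.values = Collections.unmodifiableMap(new HashMap<>(values));
            }

            String get(String key) { return values.get(key); }

            AccountRuntimeData with(String key, String value) {   // copy on write
                Map<String, String> copy = new HashMap<>(values);
                copy.put(key, value);
                return new AccountRuntimeData(copy);
            }
        }

        final class AccountRuntimeStore {
            private final AtomicReference<AccountRuntimeData> current =
                    new AtomicReference<>(new AccountRuntimeData(new HashMap<>()));

            // SYNC side: one consistent snapshot for a whole processing cycle.
            AccountRuntimeData snapshot() { return current.get(); }

            // ASYNC side: publish an updated copy; running cycles keep their old snapshot.
            void update(String key, String value) {
                current.updateAndGet(data -> data.with(key, value));
            }
        }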
  50. Fundamental decisions: Stateful or stateless? -> stateful. What are the big parts? -> we have it. Basic technology stack -> chosen. CEP technology choice -> DIY (in-house). Locking architecture -> decided.
  51. We had a good start
  52. But! We were alone
  53. LiveEngage - the big decision
  54. Dream = LiveEngage platform (diagram: the 'new dream' architecture from slide 11)
  55. Rules - from definition to runtime (diagram): the admin defines business rules in config; 'monitor and engage' consumes visitor activity events; if the visitor meets the conditions -> invite to chat
  56. Rules in the LiveEngage dream (diagram)
  57. What is a rules engine? A rules engine serves as a pluggable software component which executes business rules. These rules are externalized, i.e. separated from the application code.
  58. Rules engine implementation: Boolean logic is the easy part
  59. Rules engine implementation: it is hard to detect which conditions must be evaluated when a new fact arrives
  60. Rules engine implementation: it is hard to implement a Drools-like DSL
  61. Rules engine - how to make it happen?
     ● Drools - eats memory
     ● Legacy rules engine - customer friction is too high; not efficient
  62. (diagram)
  63. (diagram)
  64. GRF - Generic Rules Framework. Conditions and outcomes are building blocks that can be used to create complex rules. Hard-coded building blocks: TimeOnPage, GeoLocation, InviteToChat. Example rule: if ( timeOnPage(5) and geoLocation(“US”) ) execute { inviteToChat() } (a sketch of these building blocks follows)
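     A plausible Java shape for these building blocks (a sketch under invented names, not the actual GRF API):

        import java.util.function.Predicate;

        // Conditions are predicates over visitor data, outcomes are actions,
        // and a rule composes them - mirroring the rule on the slide.
        final class VisitorData {
            final long secondsOnPage;
            final String geo;
            VisitorData(long secondsOnPage, String geo) {
                this.secondsOnPage = secondsOnPage;
                this.geo = geo;
            }
        }

        final class Conditions {
            static Predicate<VisitorData> timeOnPage(long minSeconds) {
                return v -> v.secondsOnPage >= minSeconds;
            }
            static Predicate<VisitorData> geoLocation(String geo) {
                return v -> geo.equals(v.geo);
            }
        }

        final class Rule {
            private final Predicate<VisitorData> condition;
            private final Runnable outcome;
            Rule(Predicate<VisitorData> condition, Runnable outcome) {
                this.condition = condition;
                this.outcome = outcome;
            }
            void evaluate(VisitorData visitor) {
                if (condition.test(visitor)) outcome.run();
            }
        }

        // Usage, mirroring the slide's rule:
        //   new Rule(Conditions.timeOnPage(5).and(Conditions.geoLocation("US")),
        //            () -> inviteToChat()).evaluate(visitor);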
  65. GRF + CEP = Rules Engine. Example GeoLocation condition: trigger when geo data is changed; evaluate(geo, accountConfig) { if (geo == accountConfig.geo) TRUE else FALSE }. The condition type implementor defines the evaluation trigger instead of relying on automatic detection.
  66. Shark Rules Engine (Condition) (code image; a hedged sketch follows)
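     The condition code on that slide was an image; here is a hedged sketch of the idea from slide 65 (names invented): each condition type declares its own evaluation trigger, so the engine never has to detect dependencies automatically:

        // The implementor names the event that should re-trigger evaluation.
        interface Condition {
            String triggerEventType();                        // e.g. "geo-changed"
            boolean evaluate(VisitorState visitor, AccountConfig config);
        }

        final class GeoLocationCondition implements Condition {
            public String triggerEventType() { return "geo-changed"; }

            public boolean evaluate(VisitorState visitor, AccountConfig config) {
                return visitor.geo().equals(config.targetGeo());
            }
        }

        // Minimal supporting types for the sketch.
        interface VisitorState { String geo(); }
        interface AccountConfig { String targetGeo(); }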
  67. GRF - giraffe
  68. (diagram): SYNC - detects which conditions should be evaluated and triggers GRF; ASYNC - loads rules into the shark rules engine; the Account Config and Rules Engine components join the slide-25 layout around the Message BUS
  69. We did a little more AND felt ready to go
  70. (diagram): SYNC - CEP, rules, report sharklet; ASYNC - integrated with account config via the Account Config Service; sharklets A and B run on both sides
  71. Feel the field (diagram): legacy still serves admins, agents, and visitors; shark receives the activities in silent mode
  72. The dream comes true (diagram: the target architecture from slide 11)
  73. Platform in action (diagram): legacy still handles chat; shark handles activities, engagements, Account Config, and reports for the first small customers
  74. Shark: we started with a small cluster and just added servers as the business grew
  75. We recognized major bottlenecks
  76. And easily fixed them
  77. Tools and techniques
     ● Statistics monitoring
     ● Testing methodology
     ● Java 8
     ● Notes about G1
  78. Statistics monitoring - graphite (dashboard screenshot)
  79. Statistics monitoring - graphite (dashboard screenshot)
  80. Statistics monitoring - metrics (https://github.com/dropwizard/metrics, http://metrics.dropwizard.io):

        // requires com.codahale.metrics.{MetricRegistry, Timer}
        // and a static import of MetricRegistry.name
        private final Timer responses = metrics.timer(name(RequestHandler.class, "responses"));

        public String handleRequest(Request request, Response response) {
            final Timer.Context context = responses.time();
            try {
                // handle the request here
                return "OK";
            } finally {
                context.stop();              // records the elapsed time in the timer
            }
        }
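     The graphite dashboards on slides 78-79 were screenshots; for reference, a minimal sketch of wiring this metrics registry to Graphite via the metrics-graphite module (the host and the "shark" prefix are placeholders):

        import java.net.InetSocketAddress;
        import java.util.concurrent.TimeUnit;

        import com.codahale.metrics.MetricRegistry;
        import com.codahale.metrics.graphite.Graphite;
        import com.codahale.metrics.graphite.GraphiteReporter;

        final class GraphiteSetup {
            public static void main(String[] args) {
                MetricRegistry registry = new MetricRegistry();
                Graphite graphite =
                        new Graphite(new InetSocketAddress("graphite.example.com", 2003));
                GraphiteReporter reporter = GraphiteReporter.forRegistry(registry)
                        .prefixedWith("shark")                       // placeholder prefix
                        .convertRatesTo(TimeUnit.SECONDS)
                        .convertDurationsTo(TimeUnit.MILLISECONDS)
                        .build(graphite);
                reporter.start(1, TimeUnit.MINUTES);                 // ship metrics every minute
            }
        }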
  81. Testing methodology
     ● Unit tests - use them
     ● Integration tests - invest here
     ● System tests - try to minimize effort
     ● Performance: integration - worth it; system - choose your tests
  82. Performance test logs (screenshot)
  83. Performance test validations (screenshot)
  84. Testing methodology: how did we test the platform? We had built the main code with tests in mind and mocked our clients.
  85. Java 8
     ● We moved to Java 8 one year ago
     ● It was easy :)
     ● Pushed us toward more expressive code, a functional style, and immutability (tiny illustration below)
     ● Search on YouTube: LivePerson Functional Java 8
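     For flavor, a tiny illustration (not from the talk) of the expressive, functional, immutable style the move enabled:

        import java.util.Arrays;
        import java.util.List;
        import java.util.stream.Collectors;

        final class Java8Style {
            public static void main(String[] args) {
                List<String> geos = Arrays.asList("US", "UK", "US", "DE");
                List<String> usVisits = geos.stream()
                        .filter("US"::equals)             // declarative filtering
                        .map(String::toLowerCase)         // the source list is never mutated
                        .collect(Collectors.toList());
                System.out.println(usVisits);             // [us, us]
            }
        }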
  86. Notes about G1
     ● Designed for big heaps; minimizes long pauses
     ● Expected to be the default GC in Java 9
     ● We tested our system with G1 on a 12 GB heap and received good results (no long GC pauses)
  87. (diagram)
  88. We are happy now
     ● Horizontal scalability
     ● Independent and safe business logic development
     ● Fast development cycles (platform, sharklets, data model)
     ● Efficient resource utilization
     ● Fewer bugs (and easier to fix)
     ● Better QoS
     ● Overall confidence
  89. Numbers (peak statistics)

         Peak statistics        Shark       Legacy
         Concurrent visitors    ~100K       ~1 million
         Requests/sec           ~11K        ~110K
         Machines               ~34         ~700
         Cores                  ~224        ~6,300
         Cost per visitor       ~0.001      ~0.006
  90. Future challenges and ideas
     ● Better high availability
     ● Deployment with no downtime
     ● Management tools
     ● 100K accounts
  91. Tips
     ● Define scope and requirements
     ● Company commitment is a must
     ● Work with your clients
     ● Treat test code as if it runs in production
     ● Automated perf tests - they help
     ● Sometimes DIY is the best solution
     ● Respect legacy - combine old ideas with new technologies
     ● Understand the complexity and find the simplest solution
  92. Never stop dreaming
  93. THANK YOU! We are hiring
