JBoss World 2011 Infinispan

  1. 2. Infinispan: Optimizing Performance & Consistency
      • Shane Johnson (Architect, Red Hat)
      • Craig Bomba (Senior Engineer, CBOE)
      • May 6, 2011
      • Chicago Board Options Exchange
  2. 3. Agenda
      • Discuss the approach, lessons, and optimizations made in this successful use of Infinispan
      • Business Case
      • Technical Requirements
      • Approach
      • Troubleshooting & Configuration
      • Asynchronous Performance & Consistency
      • Q & A
  3. 4. Business Case
      • Synchronization of Critical Data in a Multi Session Trading Application
      • All Java
      • Facilitate High Availability of a Business Cluster
          • Support HA for all JVM “partners” within the cluster
          • Specifically, reduce a multi-minute failure recovery event to less than a second
  4. 5. Technical Requirements
  5. 6. Technical Requirements
      • Synchronization of order details with little to no change on domain objects.
      • Limit performance impact on in-path order flow.
      • Limit performance impact on cache reads.
      • Limit impact of garbage generated on the active JVM.
      • Limit impact of GC pauses on the passive JVM such that it does not impact the active side.
      • Build in order error detection (counts) and recovery.
  6. 7. Technical Requirements (cont)
      • Sub-second failover of a business cluster.
      • Utilize a listener on the passive side to synchronize local data structures.
      • Work done by passive side listeners should not impact performance on the active side.
      • Ability to clear caches on the active side and propagate clear to the passive side.
      • Access to source for quick resolution to bugs.
          • Build it or go Open source
  7. 8. Approach
      • Technical requirements led us to Infinispan
          • Many VMs acting in node pairs in Replication mode
  8. 9. Approach
      • Did it go that easily out of the box?
          • Not exactly, but we were early adopters using an alpha release
          • Some performance and some consistency issues
      • Is it enough to look at logs? What lessons were learned?
          • Helpful, but not enough. You must look at your cache contents (dump keys/values and compare)
          • Build test apps that simulate your use cases
              • put, remove, get, multiple threads, throughput, response time
          • Useful tools
              • BTrace is a good choice – more on that in a bit
              • pstop/pstart – to “break” the cluster
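
The "dump keys/values and compare" step can be sketched in plain Java. This is a minimal illustration, assuming each node's cache contents have already been dumped into an ordinary `Map`; the class `CacheDumpDiff` and its messages are hypothetical and not part of Infinispan.

```java
import java.util.Map;
import java.util.Objects;
import java.util.TreeMap;

// Hypothetical helper, not part of Infinispan: compares key/value dumps
// taken from the active and passive nodes and reports every mismatch.
final class CacheDumpDiff {

    // Returns one entry per problematic key, sorted by key for readable output.
    static <K extends Comparable<K>, V> Map<K, String> diff(Map<K, V> active, Map<K, V> passive) {
        Map<K, String> problems = new TreeMap<>();
        for (Map.Entry<K, V> e : active.entrySet()) {
            K key = e.getKey();
            if (!passive.containsKey(key)) {
                problems.put(key, "missing on passive");
            } else if (!Objects.equals(e.getValue(), passive.get(key))) {
                problems.put(key, "value differs");
            }
        }
        // Keys that only exist on the passive side were never removed there.
        for (K key : passive.keySet()) {
            if (!active.containsKey(key)) {
                problems.put(key, "stale on passive");
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        Map<Long, String> active = Map.of(1L, "order-1", 2L, "order-2");
        Map<Long, String> passive = Map.of(1L, "order-1", 3L, "order-3");
        diff(active, passive).forEach((k, v) -> System.out.println(k + ": " + v));
    }
}
```

Running a comparison like this after each test scenario catches divergence that never shows up as an error in the logs.
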
  9. 10. Approach
  10. 11. Approach – Troubleshooting & Configuration
      • Issues
          • Functional
          • Performance
          • Detection
              • Statistics (Cache Counts)
              • Instrumentation (Queue Size)
              • Cache Coherence/Contents
      • Log Analysis
      • Configuration
  11. 12. Approach – Troubleshooting & Configuration
      • Tools/techniques used to detect issues:
          • Functional
              • BTrace
                  • Instrument a generic or specific Exception on construction (<init>) and print the stack trace to expose the point of origin
              • pstop/pstart to break the cluster
              • Metrics and cache dumps
          • Performance
              • BTrace
                  • Instrument enter/exit of the put method in class org.infinispan.AbstractDelegatingCache
  12. 13. Asynchronous Performance & Consistency
      • Infinispan / JGroups Architecture
      • Asynchronous Options & Performance
      • Balancing Performance & Consistency
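
The enter/exit timing idea can also be approximated inside a test harness without attaching BTrace. The sketch below is not BTrace and not an Infinispan class: `TimedPutMap` is a hypothetical delegating wrapper around a plain `Map`, used only to show where the "enter" and "exit" probes sit around `put`.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical test-harness wrapper (not BTrace, not an Infinispan class):
// records the call count and cumulative time spent inside put().
final class TimedPutMap<K, V> {
    private final Map<K, V> delegate;
    private final AtomicLong calls = new AtomicLong();
    private final AtomicLong totalNanos = new AtomicLong();

    TimedPutMap(Map<K, V> delegate) {
        this.delegate = delegate;
    }

    V put(K key, V value) {
        long start = System.nanoTime();                       // "enter" probe
        try {
            return delegate.put(key, value);
        } finally {
            totalNanos.addAndGet(System.nanoTime() - start);  // "exit" probe
            calls.incrementAndGet();
        }
    }

    long calls() {
        return calls.get();
    }

    // Average time per put() in microseconds; 0.0 before the first call.
    double averageMicros() {
        long n = calls.get();
        return n == 0 ? 0.0 : totalNanos.get() / 1_000.0 / n;
    }

    public static void main(String[] args) {
        TimedPutMap<Long, String> cache = new TimedPutMap<>(new HashMap<>());
        for (long i = 0; i < 10_000; i++) {
            cache.put(i, "order-" + i);
        }
        System.out.println(cache.calls() + " puts, avg " + cache.averageMicros() + " us");
    }
}
```

A wrapper only sees calls routed through it, whereas BTrace instruments the running class itself, so this is a complement for test apps rather than a replacement.
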
  13. 14. Infinispan / JGroups Architecture
  14. 15. Asynchronous Options
      • Asynchronous Communication (JGroups)
      • Asynchronous Marshalling
      • Replication Queue
      • Asynchronous API
  15. 16. Asynchronous Performance Async Options -> Documentation
  16. 17. Infinispan / JGroups Asynchronous Architecture
  17. 18. Cache ‘Tier’ – Performance & Consistency
  18. 19. Core ‘Tier’ – Performance & Consistency
  19. 20. Core ‘Tier’ – Issues
      • Issues
          • Synchronize Flush() in Replication Queue via ISPN-691
          • Consistent Ordering vs. Performance
          • Separate State Provider / Consumer via ISPN-610
          • Return # Items Flushed via ISPN-716
          • State Transfer / Out of Order -> ISPN-577
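
Two of the items above, a synchronized flush and a flush that returns the number of items sent, can be illustrated with a small batching queue. This is a sketch under assumptions, not the Infinispan `ReplicationQueue` implementation and not a description of what the listed JIRA issues changed: `BatchingQueue`, its constructor arguments, and the `sender` callback are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch, not the Infinispan ReplicationQueue implementation.
final class BatchingQueue<T> {
    private final int maxElements;
    private final Consumer<List<T>> sender;   // stands in for the transport layer
    private List<T> pending = new ArrayList<>();

    BatchingQueue(int maxElements, Consumer<List<T>> sender) {
        this.maxElements = maxElements;
        this.sender = sender;
    }

    // Queues a command and flushes once the batch reaches maxElements.
    // Returns the number of items flushed by this call (0 if none).
    synchronized int add(T command) {
        pending.add(command);
        return pending.size() >= maxElements ? flush() : 0;
    }

    // Synchronized so that a timer-driven flush and a size-driven flush
    // cannot interleave and hand batches to the sender out of order.
    synchronized int flush() {
        if (pending.isEmpty()) {
            return 0;
        }
        List<T> batch = pending;
        pending = new ArrayList<>();
        sender.accept(batch);
        return batch.size();
    }

    public static void main(String[] args) {
        BatchingQueue<String> queue =
                new BatchingQueue<>(3, batch -> System.out.println("sending " + batch));
        queue.add("put k1");
        queue.add("put k2");
        queue.add("remove k1");                       // third element triggers a flush
        queue.add("put k3");
        System.out.println("flushed " + queue.flush() + " remaining item(s)");
    }
}
```

Sending while holding the lock keeps batches in order but makes writers wait on the transport, which is one concrete form of the ordering-versus-performance trade-off named on the slide. The returned count gives instrumentation something to compare against cache counts on the other node.
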
  20. 21. Core ‘Tier’ – Configuration & Log Analysis
      • Log Analysis
      • // Put Key: 342197608235468 (OrderCache)
      • 2010-08-05 11:38:21,028 837785 TRACE [org.infinispan.remoting.ReplicationQueue] (Scheduled-ReplicationQueueThread-1:) flush(): flushing repl queue (num elements=6)
      • 2010-08-05 11:38:21,028 837785 TRACE [org.infinispan.remoting.ReplicationQueue] (Scheduled-ReplicationQueueThread-1:) Flushing 6 elements
      • 2010-08-05 11:38:21,028 837785 TRACE [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (Scheduled-ReplicationQueueThread-1:) dests=null, command=MultipleRpcCommand{commands=[
            PutKeyValueCommand{key=342197608235444, value=our.cache.ValueClass@3fdeec6c, putIfAbsent=false, lifespanMillis=-1, maxIdleTimeMillis=-1},
            PutKeyValueCommand{key=342197608235458, value=our.cache.ValueClass@4473c736, putIfAbsent=false, lifespanMillis=-1, maxIdleTimeMillis=-1},
            PutKeyValueCommand{key=342197608235468, value=our.cache.ValueClass@d1bcd56, putIfAbsent=false, lifespanMillis=-1, maxIdleTimeMillis=-1},
            PutKeyValueCommand{key=342197608235463, value=our.cache.ValueClass@1c45cfd3, putIfAbsent=false, lifespanMillis=-1, maxIdleTimeMillis=-1},
            PutKeyValueCommand{key=342197608235450, value=our.cache.ValueClass@2b4f3425, putIfAbsent=false, lifespanMillis=-1, maxIdleTimeMillis=-1},
            PutKeyValueCommand{key=342197608235445, value=our.cache.ValueClass@5ce87f59, putIfAbsent=false, lifespanMillis=-1, maxIdleTimeMillis=-1}],
            cacheName='OrderCache'}, mode=ASYNCHRONOUS, timeout=15000
      • // Remove Key: 342197608235468 (OrderCache)
      • 2010-08-05 11:38:21,230 837987 TRACE [org.infinispan.remoting.ReplicationQueue] (Scheduled-ReplicationQueueThread-1:) flush(): flushing repl queue (num elements=13)
      • 2010-08-05 11:38:21,230 837987 TRACE [org.infinispan.remoting.ReplicationQueue] (Scheduled-ReplicationQueueThread-1:) Flushing 13 elements
      • 2010-08-05 11:38:21,230 837987 TRACE [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (Scheduled-ReplicationQueueThread-1:) dests=null, command=MultipleRpcCommand{commands=[
            PutKeyValueCommand{key=342197608235476, value=our.cache.ValueClass@110e8e91, putIfAbsent=false, lifespanMillis=-1, maxIdleTimeMillis=-1},
            PutKeyValueCommand{key=342197608235532, value=our.cache.ValueClass@52b1da56, putIfAbsent=false, lifespanMillis=-1, maxIdleTimeMillis=-1},
            PutKeyValueCommand{key=342197608235518, value=our.cache.ValueClass@1a756e84, putIfAbsent=false, lifespanMillis=-1, maxIdleTimeMillis=-1},
            PutKeyValueCommand{key=342197608235579, value=our.cache.ValueClass@13aaa9ae, putIfAbsent=false, lifespanMillis=-1, maxIdleTimeMillis=-1},
            PutKeyValueCommand{key=342197608235537, value=our.cache.ValueClass@2969e898, putIfAbsent=false, lifespanMillis=-1, maxIdleTimeMillis=-1},
            PutKeyValueCommand{key=342197608235586, value=our.cache.ValueClass@68256865, putIfAbsent=false, lifespanMillis=-1, maxIdleTimeMillis=-1},
            PutKeyValueCommand{key=342197608235552, value=our.cache.ValueClass@6fe8f44d, putIfAbsent=false, lifespanMillis=-1, maxIdleTimeMillis=-1},
            PutKeyValueCommand{key=342197608235510, value=our.cache.ValueClass@26ff24a1, putIfAbsent=false, lifespanMillis=-1, maxIdleTimeMillis=-1},
            PutKeyValueCommand{key=342197608235601, value=our.cache.ValueClass@38bdda07, putIfAbsent=false, lifespanMillis=-1, maxIdleTimeMillis=-1},
            RemoveCommand{key=342197608235468, value=null},
            PutKeyValueCommand{key=342197608235589, value=our.cache.ValueClass@38351eab, putIfAbsent=false, lifespanMillis=-1, maxIdleTimeMillis=-1},
            PutKeyValueCommand{key=342197608235533, value=our.cache.ValueClass@27708961, putIfAbsent=false, lifespanMillis=-1, maxIdleTimeMillis=-1},
            PutKeyValueCommand{key=342197608235470, value=our.cache.ValueClass@4d4b5381, putIfAbsent=false, lifespanMillis=-1, maxIdleTimeMillis=-1}],
            cacheName='OrderCache'}, mode=ASYNCHRONOUS, timeout=15000
      • …
  21. 22. Core ‘Tier’ – Configuration & Log Analysis
      • Log Analysis
      • Come on…
  22. 23. Core ‘Tier’ – Configuration & Log Analysis
      • Log Analysis
          • Active VM
          • 2010-08-05 11:38:21,028 // Put Key: 342197608235468
          • 2010-08-05 11:38:21,230 // Remove Key: 342197608235468
          • 2010-08-05 11:38:23,784 // State Transfer Response
          • Passive VM
          • 2010-08-05 11:38:23,782 // State Transfer Request
          • 2010-08-05 11:38:34,529 // Remove Key: 342197608235468 *Request 693*
          • 2010-08-05 11:46:08,845 // Put Key: 342197608235468 *Request 692*
  23. 24. Core ‘Tier’ – Configuration & Log Analysis
      • Log Analysis
          • Notice anything?
  24. 25. Core ‘Tier’ – Configuration & Log Analysis
      • Log Analysis
          • Active VM
          • 2010-08-05 11:38:21,028 // Put Key: 342197608235468
          • 2010-08-05 11:38:21,230 // Remove Key: 342197608235468
          • 2010-08-05 11:38:23,784 // State Transfer Response
          • Passive VM
          • 2010-08-05 11:38:23,782 // State Transfer Request
          • 2010-08-05 11:38:34,529 // Remove Key: 342197608235468 *Request 693*
          • 2010-08-05 11:46:08,845 // Put Key: 342197608235468 *Request 692*
  25. 26. Core ‘Tier’ – Configuration & Log Analysis
      • Oops…
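
The log excerpt shows the passive VM applying the remove (Request 693) before the put (Request 692) for the same key. The sketch below replays that sequence against a plain `HashMap` to show why the order matters. `ReplayOrderDemo`, its `Command` type, and the placeholder values are hypothetical and only model the request numbers and key from the log.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model for illustration; not Infinispan's command classes.
final class ReplayOrderDemo {

    static final class Command {
        final long requestId;   // mirrors the *Request NNN* markers in the log
        final boolean isPut;    // true = put, false = remove
        final long key;

        Command(long requestId, boolean isPut, long key) {
            this.requestId = requestId;
            this.isPut = isPut;
            this.key = key;
        }
    }

    // Applies the commands to an empty cache in exactly the order given.
    static Map<Long, String> replay(List<Command> commands) {
        Map<Long, String> cache = new HashMap<>();
        for (Command c : commands) {
            if (c.isPut) {
                cache.put(c.key, "order-" + c.key);
            } else {
                cache.remove(c.key);
            }
        }
        return cache;
    }

    // Returns a copy of the commands sorted by request number.
    static List<Command> byRequestId(List<Command> arrived) {
        List<Command> sorted = new ArrayList<>(arrived);
        sorted.sort(Comparator.comparingLong(c -> c.requestId));
        return sorted;
    }

    public static void main(String[] args) {
        long key = 342197608235468L;
        List<Command> arrived = List.of(
                new Command(693, false, key),   // remove applied first on the passive VM
                new Command(692, true, key));   // put applied afterwards
        System.out.println("arrival order keeps key: " + replay(arrived).containsKey(key));
        System.out.println("request order keeps key: " + replay(byRequestId(arrived)).containsKey(key));
    }
}
```

Applied in arrival order, the key survives on the passive side although the active side removed it. This is the kind of divergence that only a comparison of cache contents reveals.
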
  26. 27. Core ‘Tier’ – Configuration & Log Analysis
      <asyncTransportExecutor factory="org.infinispan.executors.DefaultExecutorFactory">
        <properties>
          <property name="maxThreads" value="1"/>
          <property name="threadNamePrefix" value="AsyncSerializationThread"/>
        </properties>
      </asyncTransportExecutor>
      <clustering mode="replication">
        <async asyncMarshalling="true" useReplQueue="true" replQueueInterval="10" replQueueMaxElements="100"
               replQueueClass="com.cboe.infrastructureServices.cacheService.ReplicationQueueInstrumentedImpl"/>
        <stateRetrieval timeout="180000" fetchInMemoryState="false" alwaysProvideInMemoryState="true"/>
      </clustering>
      <locking useLockStriping="false"/>
      <lazyDeserialization enabled="false"/>
  27. 28. Choices
      • Overall, it’s all about…
  28. 29. Core ‘Tier’ – Configuration Override Example
      • Back to this one:
        <stateRetrieval timeout="180000" fetchInMemoryState="false" alwaysProvideInMemoryState="true"/>
      • if (needToRetrieveFromOtherNode) {
            Configuration overrideConfig = new Configuration();
            overrideConfig.setFetchInMemoryState(true);
            getCacheManager().defineConfiguration(cacheName, overrideConfig);
        }
  29. 30. Transport ‘Tier’ – Performance & Consistency
  30. 31. Transport ‘Tier’ – Issues
      • Issues
          • Asynchronous Merge Failure (JGRP-1061)
              • via Discussion Forum (Merge Failure)
          • Split Brain (GC, jmap, network, etc)
          • Report Merge Events (ISPN-609)
              • via Discussion Forum (Propagate Transport Events)
  31. 32. Transport Tier Log
      <!-- GC -->
      TRACE [FD]    [SUSPECT] suspect hdr is SUSPECT (suspected_mbrs=[nodeA], from=nodeB)
      WARN  [FD]    I was suspected by nodeB; ignoring the SUSPECT message and sending back a HEARTBEAT_ACK
      DEBUG [FLUSH] nodeA: received START_FLUSH but I am not flush participant, not responding
      DEBUG [GMS]   nodeA: view is [nodeB|2] [nodeB]
      WARN  [GMS]   nodeA: not member of view [nodeB|2] [nodeB]; discarding it
      <!-- DISCOVERY/MERGE #1 -->
      TRACE [TCPPING] discovery took 3000 ms: responses: 1 total (1 servers (1 coord), 0 clients)
      TRACE [MERGE2] Discovery results:
          [nodeB]: [nodeB|2] [nodeB]
          [nodeA]: [nodeA|1] [nodeA, nodeB]
      <!-- FLUSH #1 -->
      DEBUG [FLUSH]  nodeA: flush coordinator is starting FLUSH with participants [nodeA, nodeB]
      DEBUG [FLUSH]  nodeA: received START_FLUSH, responded with FLUSH_COMPLETED to nodeA
      DEBUG [FLUSH]  nodeA: FLUSH_COMPLETED from nodeA, completed false, flushMembers [nodeA, nodeB], flushCompleted [nodeA]
      WARN  [NAKACK] nodeB: dropped message from nodeA (not in xmit_table), keys are [nodeB], view=[nodeB]
      DEBUG [FLUSH]  nodeA: timed out waiting for flush responses after 2000 ms. Rejecting flush to participants [nodeA, nodeB]
      DEBUG [FLUSH]  nodeA: received ABORT_FLUSH from flush coordinator nodeA, am I flush participant=true
      <!-- DISCOVERY/MERGE #2 -->
      TRACE [TCPPING] discovery took 3002 ms: responses: 1 total (1 servers (1 coord), 0 clients)
      TRACE [MERGE2] Discovery results:
          [nodeB]: [nodeB|2] [nodeB]
          [nodeA]: [nodeA|1] [nodeA, nodeB]
  32. 33. Transport Tier Log – Summarized
      *** Initial Join ***
      17:32:06,794 JGroupsTransport : Starting JGroups Channel
      17:32:07,176 JGroupsDistSync : Closing joinInProgress gate
      17:32:07,282 JGroupsDistSync : Releasing ReclosableLatch [State = 1, empty queue] gate

      *** Received MultipleRpcCommand ***
      17:33:07,531 CommandAwareRpcDispatcher : Attempting to execute command: MultipleRpcCommand{cacheName='A'}
      17:33:07,531 CommandAwareRpcDispatcher : Enough waiting; replayIgnored = false, sr STATE_PREEXISTED

      *** The cache has not been properly started yet. ***
      17:33:37,607 InboundInvocationHandlerImpl : Cache named A does not exist on this cache manager!
      17:33:37,607 CommandAwareRpcDispatcher : Unable to execute command, got invalid response

      *** The cache is started and the state transfer is started/completed. ***
      17:35:48,502 StateTransferManagerImpl : Initiating state transfer process
      17:35:48,507 JGroupsTransport : Received state for cache named 'A'. Attempting to apply state.
  33. 34. Transport ‘Tier’ – Configuration & Log Analysis
      • Configuration
      <TCP bind_port="7020" bind_addr="192.168.1.1" loopback="true" port_range="1" ... </TCP>
      <TCPPING timeout="3000" initial_hosts="${jgroups.tcpping.initial_hosts:192.168.1.1[7020]}" port_range="1" num_initial_members="2"/>
      <!-- MPING bind_addr="192.168.1.1" send_interfaces="192.168.1.1" receive_interfaces="192.168.1.1" break_on_coord_rsp="true" ... num_initial_members="2" num_ping_requests="20" timeout="10000" /-->
      <MERGE2 max_interval="30000" min_interval="10000"/>
      <FD timeout="15000" max_tries="3"/>
      <!-- FC max_credits="2000000" min_threshold="0.10"/-->
  34. 35. Summary - Log Configuration
      • Enable Selective Detailed Logging (log4j sample)
      <category name="org.jboss.system">
      <category name="org.infinispan">
      <category name="org.jboss.serial">
      <category name="org.infinispan.statetransfer">
      <category name="org.infinispan.remoting">
      <category name="org.jgroups.protocols.MPING">
      <category name="org.jgroups.protocols.pbcast.GMS">
      <category name="org.jgroups.protocols.FD_SOCK">
      <category name="org.jgroups.protocols.MERGE2">
      <category name="org.jgroups.protocols.FD">
      <category name="org.jgroups.protocols.VERIFY_SUSPECT">
      <!-- set StateTransferManagerImpl to DEBUG level so we can see how long state transfers are taking -->
      <category name="org.infinispan.statetransfer.StateTransferManagerImpl">
      <!-- NOTE: only used INFO while debugging on org.jgroups.protocols.pbcast.FLUSH since it puts out lots of data -->
      <category name="org.jgroups.protocols.pbcast.FLUSH">
  35. 36. Summary
      • Performance vs. Consistency
          • It’s all about balance.
      • Messaging
          • A data grid is essentially a messaging system.
      • Testing
          • A test harness is crucial.
      • Log Analysis
          • Knowing how to interpret the logs is key.
  36. 37. Summary
      • OK, now that you have scared us… how are things looking now?
  37. 38. Q & A