
TIPC Overview


A "Cluster Domain Socket" IPC implementing datagram messaging, connection oriented messaging and a broker free message bus in a kernel driver.



  1. TIPC: Transparent Inter Process Communication. "Cluster Domain Sockets". By Jon Maloy
  2. TIPC FEATURES
     • Service Addressing
       - Some similarity to Unix Domain Sockets, but cluster wide and with more features
       - Service addresses are translated on-the-fly to system-internal port numbers and node addresses
       - A socket can be bound to multiple service addresses
     • UDP or L2 Based Messaging Service with Three Modes
       - Datagram mode with unicast, anycast, multicast
       - Connection mode with stream- or message-oriented transport
       - Message bus mode with unicast, anycast, multicast, broadcast
     • Service and Topology Tracking
       - Subscription/event functionality for node and service addresses
       - Using this, users can continuously track presence of nodes, sockets, addresses and connections
       - Feedback about service availability or cluster topology changes is immediate
       - Fully automatic neighbor discovery
     • Implemented as Linux Kernel Driver
       - Present in mainstream Linux (and major distros)
       - Name space / container support
       - Accessed via the regular socket API (see the sketch below)
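     A minimal sketch of the last point above, assuming a kernel built with TIPC support and the
     <linux/tipc.h> uapi header installed; TIPC sockets are opened through the ordinary BSD socket API:

        /* Sketch: opening TIPC sockets through the regular socket API. */
        #include <sys/socket.h>
        #include <linux/tipc.h>

        int main(void)
        {
                /* SOCK_RDM: reliable datagram mode
                 * SOCK_STREAM / SOCK_SEQPACKET: connection mode */
                int dgram_sd  = socket(AF_TIPC, SOCK_RDM, 0);
                int stream_sd = socket(AF_TIPC, SOCK_STREAM, 0);

                /* ... bind service addresses, send/receive, etc. ... */
                return (dgram_sd < 0 || stream_sd < 0) ? 1 : 0;
        }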
  3. TIPC == SIMPLICITY
     • No need to configure or look up addresses
       - Addresses refer to services, not locations
       - Service addresses are always valid and can be hard-coded
     • No need to configure node identities
       - But you may if you want to
       - You must tell each node which interfaces to use
     • No need to actively monitor processes or nodes
       - No need for users to do active heart-beating
       - Users learn about changes if they want to know
     • Easy synchronization when starting an application process
       - First, bind to your own service address, if any
       - Second, subscribe for the service addresses you want to track
       - Third, start communicating as services become available
  4. SERVICE ADDRESSING
     • A service address consists of two parts, assigned by the developer
       - A 32-bit service type number, typically hard-coded
       - A 32-bit service instance number, typically calculated by the user at run time
     • A service address is always qualified by a scope indicator
       - Indicating lookup scope on the calling side
         - node != 0 indicates that lookup should be performed only on that node
         - node == 0 indicates cluster-global lookup
       - Indicating visibility scope on the binding side
         - Dedicated values for node-local or cluster-global visibility
     • struct tipc_service_addr { uint32_t type; uint32_t instance; };
     • Illustration: one server process calls bind(type = 42, instance = 1, scope = cluster), another calls
       bind(type = 42, instance = 2, scope = cluster); a client process reaches the latter with
       sendto(type = 42, instance = 2, node = 0). A socket-level sketch follows below.
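     The following sketch (not from the slides) shows how the addresses above map onto the kernel's
     struct sockaddr_tipc from <linux/tipc.h>. The constant spellings TIPC_SERVICE_ADDR and
     TIPC_CLUSTER_SCOPE are from recent uapi headers; older headers spell them TIPC_ADDR_NAME etc.
     The type/instance values 42/2 mirror the illustration.

        #include <string.h>
        #include <sys/socket.h>
        #include <linux/tipc.h>

        /* Server side: bind service address {type = 42, instance = 2} with cluster scope */
        static int bind_service(int sd)
        {
                struct sockaddr_tipc addr;

                memset(&addr, 0, sizeof(addr));
                addr.family = AF_TIPC;
                addr.addrtype = TIPC_SERVICE_ADDR;      /* a.k.a. TIPC_ADDR_NAME */
                addr.scope = TIPC_CLUSTER_SCOPE;
                addr.addr.name.name.type = 42;
                addr.addr.name.name.instance = 2;
                return bind(sd, (struct sockaddr *)&addr, sizeof(addr));
        }

        /* Client side: send to the service; domain == 0 requests cluster-wide lookup */
        static int send_to_service(int sd, const void *buf, size_t len)
        {
                struct sockaddr_tipc dst;

                memset(&dst, 0, sizeof(dst));
                dst.family = AF_TIPC;
                dst.addrtype = TIPC_SERVICE_ADDR;
                dst.addr.name.name.type = 42;
                dst.addr.name.name.instance = 2;
                dst.addr.name.domain = 0;               /* 0 == cluster-global lookup */
                return sendto(sd, buf, len, 0, (struct sockaddr *)&dst, sizeof(dst));
        }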
  5. SERVICE BINDING
     • No restrictions on how to bind service addresses
       - Different service addresses can be bound to the same socket
       - The same service address can be bound to different sockets
         - "Anycast" lookup with round-robin selection
       - Service address ranges can be bound to a socket
       - Only one service address per socket in message bus mode
     • struct tipc_service_range { uint32_t type; uint32_t lower; uint32_t upper; };
     • Illustration: one server process binds the range (type = 42, lower = 2, upper = 20, scope = cluster),
       another binds (type = 42, instance = 2, scope = cluster) and (type = 666, instance = 17, scope = node);
       a client's sendto(type = 42, instance = 2, node = 0) can match either. A range-binding sketch follows below.
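     A corresponding hedged sketch of binding the range from the illustration (type 42, instances 2..20)
     to one socket; TIPC_SERVICE_RANGE is spelled TIPC_ADDR_NAMESEQ in older headers.

        #include <string.h>
        #include <sys/socket.h>
        #include <linux/tipc.h>

        static int bind_service_range(int sd)
        {
                struct sockaddr_tipc addr;

                memset(&addr, 0, sizeof(addr));
                addr.family = AF_TIPC;
                addr.addrtype = TIPC_SERVICE_RANGE;     /* a.k.a. TIPC_ADDR_NAMESEQ */
                addr.scope = TIPC_CLUSTER_SCOPE;
                addr.addr.nameseq.type = 42;
                addr.addr.nameseq.lower = 2;
                addr.addr.nameseq.upper = 20;
                return bind(sd, (struct sockaddr *)&addr, sizeof(addr));
        }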
  6. LOCATION TRANSPARENCY
     • The client never needs to know the location of the server
       - Translation from service address to socket address is performed on-the-fly at the source node
       - A replica of the global binding table is kept on each node for translation
       - The user can still indicate an explicit socket address if desired
     • struct tipc_socket_addr { uint32_t port; uint32_t node; };
     • Illustration: three nodes (#9a6004c1, #1a6b7ce0, #c1f10e72) host server sockets with ports 123456,
       98765 and 763456; a client's sendto(type = 42, instance = 2, node = 0) is resolved to the right
       <port:node> without the client knowing where the server runs.
  7. DATAGRAM MODE
     • Reliable transport, socket to socket
       - Receive buffer overload protection
       - No end-to-end flow control
       - Messages may still be rejected by the receiving socket
     • Rejected messages may be dropped or returned to the sender
       - Configurable in the sending socket (see the sketch below)
       - If returned, the message is truncated and equipped with an error code
     • Unicast, anycast or multicast
       - Depends on the indicated address type
     • Illustration: two server processes bind (type = 42, instance = 1) and (type = 42, instance = 2) with
       cluster scope; a client reaches one of them with sendto(type = 42, instance = 2, node = 0).
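     A hedged sketch of the "dropped or returned" choice above, using the TIPC_DEST_DROPPABLE socket
     option from <linux/tipc.h>; the SOL_TIPC fallback value is the one defined in <linux/socket.h>.

        #include <sys/socket.h>
        #include <linux/tipc.h>

        #ifndef SOL_TIPC
        #define SOL_TIPC 271            /* from <linux/socket.h> */
        #endif

        /* Ask TIPC to return (rather than silently drop) undeliverable messages;
         * returned messages come back truncated and carry an error code. */
        static int request_undeliverable_returns(int sd)
        {
                int droppable = 0;      /* 0 == return rejected messages to sender */

                return setsockopt(sd, SOL_TIPC, TIPC_DEST_DROPPABLE,
                                  &droppable, sizeof(droppable));
        }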
  8. CONNECTION MODE
     • Established by using a service address
       - One-way setup (a.k.a. "0-RTT") using data-carrying messages
       - Traditional TCP-style setup/shutdown is also available (see the sketch below)
     • Stream- or message-oriented
       - End-to-end flow control for buffer overflow protection
       - No socket-level sequence numbers, acknowledgments or retransmissions
         - The link layer takes care of that
     • The connection breaks immediately if the peer becomes unavailable
       - Leverages link-level heartbeats and kernel/socket cleanup functionality
       - No socket-level "keepalive" heartbeats needed
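     A hedged sketch of traditional connection setup towards a service address; the implicit "0-RTT"
     variant would simply send data to the service address from an unconnected socket, which is not
     shown here. Service type 42, instance 1 are illustration values.

        #include <string.h>
        #include <sys/socket.h>
        #include <linux/tipc.h>

        static int connect_to_service(void)
        {
                int sd = socket(AF_TIPC, SOCK_STREAM, 0);   /* or SOCK_SEQPACKET */
                struct sockaddr_tipc srv;

                if (sd < 0)
                        return -1;
                memset(&srv, 0, sizeof(srv));
                srv.family = AF_TIPC;
                srv.addrtype = TIPC_SERVICE_ADDR;
                srv.addr.name.name.type = 42;
                srv.addr.name.name.instance = 1;
                srv.addr.name.domain = 0;                   /* cluster-wide lookup */
                if (connect(sd, (struct sockaddr *)&srv, sizeof(srv)) < 0)
                        return -1;
                return sd;
        }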
  9. MESSAGE BUS MODE (available from Linux 4.14)
     • Communication groups: brokerless bus instances
       - User instantiated
       - Same addressing properties (service addressing) as datagram mode
       - Different traffic properties: no dropped or rejected messages
       - Four different message distribution methods
       - Delivery and sequence order guaranteed, even between different distribution methods
       - Leverages L2 broadcast / UDP multicast when possible and deemed favorable
     • End-to-end flow control
       - Messages are never dropped because of destination buffer overflow
       - The same mechanism covers all distribution methods
       - Point-to-multipoint: "sliding window" algorithm
       - Multipoint-to-point: "coordinated sliding window"
  10. GROUP MEMBERSHIP
     • Members are sockets
       - Groups are closed: members can only exchange messages with other sockets in the same group
       - Each socket has two addresses: a <port:node> tuple bound by the system and a <group:member> tuple bound by the user
         - <group:member> is a TIPC service address, i.e. the same as <type:instance>
       - Member sockets may optionally deliver join/leave events for other members in the group
         - Membership events are just empty messages delivered along with the source member's two addresses
       - The TIPC binding table serves as registry and distribution channel for member identities and events
     • Illustration: join(<group:member>) and leave() are propagated through the distributed binding table;
       peers see them as recvmsg(OOB, <group:member>, <port:node>) and recvmsg(OOB|EOR, <group:member>, <port:node>)
       respectively. A join/leave sketch follows below.
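     A hedged sketch of joining and leaving a group with the TIPC_GROUP_JOIN / TIPC_GROUP_LEAVE socket
     options and struct tipc_group_req (Linux 4.14+); the group/member numbers 4711/17 are made-up
     illustration values.

        #include <string.h>
        #include <sys/socket.h>
        #include <linux/tipc.h>

        #ifndef SOL_TIPC
        #define SOL_TIPC 271            /* from <linux/socket.h> */
        #endif

        static int join_group(int sd)
        {
                struct tipc_group_req req;

                memset(&req, 0, sizeof(req));
                req.type = 4711;                        /* group identity  */
                req.instance = 17;                      /* member identity */
                req.scope = TIPC_CLUSTER_SCOPE;
                req.flags = TIPC_GROUP_MEMBER_EVTS;     /* deliver join/leave events */
                return setsockopt(sd, SOL_TIPC, TIPC_GROUP_JOIN, &req, sizeof(req));
        }

        static int leave_group(int sd)
        {
                return setsockopt(sd, SOL_TIPC, TIPC_GROUP_LEAVE, NULL, 0);
        }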
  11. GROUP MESSAGING
     • Four distribution methods, selected by the destination address type
       - Unicast: sendto(SOCKET, <port:node>)
       - Anycast: sendto(SERVICE, <group:member>)
       - Multicast: sendto(MCAST, <group:member>)
       - Broadcast: send()
     • Received messages are delivered with both source addresses:
       recvmsg(<group:member>, <port:node>)
     • A sketch of the four send variants follows below.
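     A hedged sketch of the four send variants; the destination address type selects the distribution
     method. The helper below and its parameters are illustrative only.

        #include <stdint.h>
        #include <string.h>
        #include <sys/socket.h>
        #include <linux/tipc.h>

        static void group_send_examples(int sd, const void *buf, size_t len,
                                        uint32_t group, uint32_t member,
                                        struct tipc_socket_addr peer)
        {
                struct sockaddr_tipc dst;

                memset(&dst, 0, sizeof(dst));
                dst.family = AF_TIPC;

                /* Broadcast: no destination address at all */
                send(sd, buf, len, 0);

                /* Anycast: service address <group:member>, one matching member is chosen */
                dst.addrtype = TIPC_SERVICE_ADDR;
                dst.addr.name.name.type = group;
                dst.addr.name.name.instance = member;
                sendto(sd, buf, len, 0, (struct sockaddr *)&dst, sizeof(dst));

                /* Multicast: all members bound to <group:member> */
                dst.addrtype = TIPC_ADDR_MCAST;
                dst.addr.nameseq.type = group;
                dst.addr.nameseq.lower = member;
                dst.addr.nameseq.upper = member;
                sendto(sd, buf, len, 0, (struct sockaddr *)&dst, sizeof(dst));

                /* Unicast: explicit <port:node> socket address */
                dst.addrtype = TIPC_SOCKET_ADDR;
                dst.addr.id = peer;
                sendto(sd, buf, len, 0, (struct sockaddr *)&dst, sizeof(dst));
        }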
  12. SERVICE TRACKING
     • Users can subscribe for the contents of the global address binding table
       - An event is received at each change matching the range in the subscription
     • There is a match when
       - A bound/unbound instance or range overlaps with the range subscribed for
     • Received events contain the bound socket's service address and socket address
     • Illustration: a client issues subscribe(type = 42, lower = 0, upper = 10); when a server on node
       #1a6b7ce0 calls bind(type = 42, instance = 2, scope = cluster), the client receives an event with
       that service address and the server's <port:node>. A subscription sketch follows below.
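     A hedged sketch of a subscription towards the built-in topology service (service type TIPC_TOP_SRV):
     connect, send a struct tipc_subscr, then read struct tipc_event records. Depending on the TIPC
     version, the subscription fields may have to be given in network byte order.

        #include <stdint.h>
        #include <stdio.h>
        #include <string.h>
        #include <sys/socket.h>
        #include <linux/tipc.h>

        static int track_service(uint32_t type, uint32_t lower, uint32_t upper)
        {
                int sd = socket(AF_TIPC, SOCK_SEQPACKET, 0);
                struct sockaddr_tipc topsrv;
                struct tipc_subscr sub;
                struct tipc_event evt;

                memset(&topsrv, 0, sizeof(topsrv));
                topsrv.family = AF_TIPC;
                topsrv.addrtype = TIPC_SERVICE_ADDR;
                topsrv.addr.name.name.type = TIPC_TOP_SRV;
                topsrv.addr.name.name.instance = TIPC_TOP_SRV;
                if (sd < 0 || connect(sd, (struct sockaddr *)&topsrv, sizeof(topsrv)) < 0)
                        return -1;

                memset(&sub, 0, sizeof(sub));
                sub.seq.type = type;
                sub.seq.lower = lower;
                sub.seq.upper = upper;
                sub.timeout = TIPC_WAIT_FOREVER;
                sub.filter = TIPC_SUB_SERVICE;  /* or TIPC_SUB_PORTS for per-socket events */
                if (send(sd, &sub, sizeof(sub), 0) != sizeof(sub))
                        return -1;

                while (recv(sd, &evt, sizeof(evt), 0) == sizeof(evt)) {
                        if (evt.event == TIPC_PUBLISHED)
                                printf("service %u:%u up at <%u:%x>\n", type,
                                       evt.found_lower, evt.port.ref, evt.port.node);
                        else if (evt.event == TIPC_WITHDRAWN)
                                printf("service %u:%u down\n", type, evt.found_lower);
                }
                return 0;
        }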
  13. CLUSTER TOPOLOGY TRACKING
     • A special case of service tracking
       - Uses the same mechanism, based on the service binding table contents
       - Represented by the built-in service type zero (== "node availability")
       - It is also possible to subscribe for availability of individual links
     • Illustration: subscribe(type = 0, lower = 0, upper = ~0) yields events for all nodes in the cluster.
  14. NODE TO NODE LINKS
     • "L2.5" reliable link layer
       - Guarantees delivery and sequentiality for all packets
       - Acts as a trunk for multiple connections, and keeps track of those
       - Keeps track of the peer node's address bindings in a local replica of the binding table
     • Supervised by heartbeats at low traffic
       - Failure detection tolerance configurable from 50 ms to 10 s; default 1.5 s
       - "Lost service address" events are issued for bindings from the peer node at lost contact
       - All connections to the peer node are broken at lost contact
     • Several links per node pair
       - Load sharing or active-standby, but at most two active
       - Disturbance-free failover to the remaining link, if any
  15. NEIGHBOR DISCOVERY
     • Nodes have a 128-bit node identity
       - By default assigned by the system (from Linux 4.16)
       - Can also be set by the user, e.g. a host name or a UUID
       - The identity is internally hashed into a guaranteed unique 32-bit node address
       - This is the node address used by the protocol
     • Clusters have a 32-bit cluster identity
       - Can be assigned by the user if anything other than the default value is needed
       - All nodes using the same cluster identity will establish mutual links
       - One link per interface, at most two active links per node pair
     • The cluster identity determines the network
       - Neighbor discovery by UDP multicast or L2 broadcast
       - If there is no broadcast/multicast support, discovery can be performed via explicitly configured IP addresses
     • Illustration: one cluster (id 4711) uses human-readable node identities (goethe, schiller, heine, brandes,
       ibsen), another (id 110956) uses UUID identities; each identity is hashed to a 32-bit node number.
  16. SCALABILITY: Overlapping Ring Monitoring Algorithm
     Since Linux 4.7, TIPC comes with a unique auto-adaptive hierarchical neighbor monitoring algorithm.
     This makes it possible to establish full-mesh clusters of 1000 nodes with a failure discovery time of 1.5 s.
     • Sort all cluster nodes into a circular list
       - All nodes use the same algorithm and criteria
     • Select the next [√N] - 1 downstream nodes in the list as the "local domain" to be actively monitored
       - CPU load increases by ~√N
     • Distribute a record describing the local domain to all other nodes in the cluster
     • Select and monitor a set of "head" nodes outside the local domain so that no node is more than two
       active monitoring hops away
       - There will be [√N] - 1 such nodes
       - Guarantees failure discovery even at accidental network partitioning
     • Each node now monitors 2 x ([√N] - 1) neighbors
       - 6 neighbors in a 16-node cluster
       - 56 neighbors in an 800-node cluster
     • All nodes use this algorithm, so in total 2 x ([√N] - 1) x N links are actively monitored
       - 96 links in a 16-node cluster
       - 44,800 links in an 800-node cluster
     • A small arithmetic check follows below.
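     A small arithmetic check (not from the slides) of the neighbor and link counts quoted above,
     reading [√N] as the rounded-up square root:

        #include <math.h>
        #include <stdio.h>

        int main(void)
        {
                const unsigned n_values[] = { 16, 800 };

                for (unsigned i = 0; i < 2; i++) {
                        unsigned n = n_values[i];
                        /* each node monitors 2 x ([sqrt(N)] - 1) neighbors */
                        unsigned per_node = 2 * ((unsigned)ceil(sqrt(n)) - 1);

                        printf("N = %3u: %2u monitored neighbors per node, %u links in total\n",
                               n, per_node, per_node * n);
                }
                return 0;       /* prints 6/96 for N = 16 and 56/44800 for N = 800 */
        }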
  17. PERFORMANCE
     • Latency is better than TCP's
       - ~33% faster than TCP inter-node
       - 2 times faster than TCP intra-node for 64-byte messages
       - 7 times faster than TCP intra-node for 64 kB messages
       - TIPC transmits socket-to-socket instead of via the loopback interface
     • Throughput is still somewhat lower than TCP's
       - ~65-90% of maximum TCP throughput inter-node
       - Appears to be environment dependent
       - But 25-30% better than TCP intra-node
       - We are working on this...
  18. ARCHITECTURE
     • User land: user applications and language bindings (C library, Python, Go) on top of the socket interface
     • L4: sockets; connection handling and flow control
     • L3: destination lookup via the binding table, topology service and node table
     • L2/Internal: links; fragmentation/bundling, retransmission, congestion control, link aggregation,
       synchronization/failover, neighbor discovery and supervision
     • External: carrier media via media plugins (Ethernet, Infiniband, UDP, VxLAN)
  19. API
     • Socket API
       - The original TIPC API
     • TIPC C API
       - Simpler and more intuitive
       - Available as libtipc from the tipcutils package at SourceForge
     • Python, Perl, Ruby, D, Go
       - But not yet for Java
     • ZeroMQ
       - Not yet with full features
     • More to come...
  20. WHEN TO USE TIPC
     • TIPC does not replace IP-based transport protocols
       - It is a complement, to be used under certain conditions
       - It is an IPC
     • TIPC may be a good option if you
       - Need a high-performing, configuration-free, brokerless message bus
       - Want startup synchronization and service discovery for free
       - Have application components that need to keep continuous watch on each other
       - Need short latency times
       - Have traffic that is heavily intra-node or intra-subnet
       - Don't want to bother with cluster configuration
       - Are inside a security perimeter, or can use IPsec or MACsec
  21. WHO IS USING TIPC?
     • Ericsson mobile and fixed core network systems
       - IMS, PGW, SGW, HSS...
       - Routers/switches such as SSR, AXE
       - Hundreds of installed sites, tens of thousands of nodes, tens of millions of subscribers
     • Wind River
       - Mission-critical system for Sikorsky Aircraft's helicopters
     • Cisco
       - onePK, IOS-XE Software, NX-OS Software
     • Mirantis
       - OpenStack
     • Nokia, Huawei and numerous other companies and institutions
  22. MORE INFORMATION
     • TIPC home page
     • TIPC project page
     • TIPC Demo/Test/Utility programs
     • TIPC Communication Groups
     • TIPC Overlapping Ring Neighbor Monitoring
     • TIPC protocol specification (somewhat dated)
     • TIPC programmer's guide (somewhat dated)