TIPC Overview

TIPC
Transparent Inter Process Communication
“Cluster Domain Sockets”
by Jon Maloy

TIPC FEATURES
Service Addressing
 Some similarity to Unix Domain Sockets, but cluster wide and with more features
 Service addresses are translated on-the-fly to system internal port numbers and node addresses
 A socket can be bound to multiple service addresses
UDP or L2 Based Messaging Service with Three Modes
 Datagram mode with unicast, anycast, multicast
 Connection mode with stream or message oriented transport
 Message bus mode with unicast, anycast, multicast, broadcast
Service and Topology Tracking
 Subscription/event functionality for node and service addresses
 Using this, users can continuously track presence of nodes, sockets, addresses and connections
 Feedback about service availability or cluster topology changes is immediate
 Fully automatic neighbor discovery
Implemented as Linux Kernel Driver
 Present in mainstream Linux (kernel.org) and major distros
 Namespace / container support
 Accessed via regular socket API

TIPC == SIMPLICITY
No Need to Configure or Look Up Addresses
 Addresses refer to services - not locations
 Service addresses are always valid - can be hard-coded
No Need to Configure Node Identities
 But you may if you want to
 You must still tell each node which interfaces to use
No Need to Actively Monitor Processes or Nodes
 No need for users to do active heart-beating
 Users learn about changes, if they want to know
Easy Synchronization When Starting an Application Process
 First, bind to own service address, if any
 Second, subscribe for service addresses you want to track
 Third, start communicating as services become available

SERVICE ADDRESSING
A service address consists of two parts, assigned by the developer
 A 32-bit service type number, typically hard-coded
 A 32-bit service instance number, typically calculated by the user at run time
A service address is always qualified by a scope indicator
 Indicating lookup scope on the calling side
   node != 0 indicates that lookup should be performed only on that node
   node == 0 indicates cluster global lookup
 Indicating visibility scope on the binding side
   Dedicated values for node local or cluster global visibility
struct tipc_service_addr {
        uint32_t type;
        uint32_t instance;
};
[Figure: server processes call bind(type = 42, instance = 2, scope = cluster) and bind(type = 42, instance = 1, scope = cluster); a client process calls sendto(type = 42, instance = 2, node = 0).]
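
With the Linux socket API, the service address and its lookup scope travel together in a struct sockaddr_tipc. The sketch below fills one in for the client call on this slide. It assumes the definitions in <linux/tipc.h>, where the lookup node is carried in a field named domain; make_service_addr is a hypothetical helper, not part of any TIPC library.

```c
#include <string.h>
#include <stdint.h>
#include <sys/socket.h>
#include <linux/tipc.h>

/* Hypothetical helper: build the destination for
 * sendto(type, instance, node). node == 0 asks for cluster global lookup. */
static struct sockaddr_tipc make_service_addr(uint32_t type, uint32_t instance,
                                              uint32_t node)
{
        struct sockaddr_tipc sa;

        memset(&sa, 0, sizeof(sa));
        sa.family = AF_TIPC;
        sa.addrtype = TIPC_ADDR_NAME;       /* a service address, not a socket address */
        sa.addr.name.name.type = type;
        sa.addr.name.name.instance = instance;
        sa.addr.name.domain = node;         /* lookup scope on the calling side */
        return sa;
}
```

The result of make_service_addr(42, 2, 0) can be passed to sendto() on a TIPC socket as the destination address.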

SERVICE BINDING
No restrictions on how to bind service addresses
 Different service addresses can be bound to same socket
 Same service address can be bound to different sockets
 “Anycast” lookup with round-robin selection
 Service address ranges can be bound to a socket
 Only one service address per socket in message bus mode
struct tipc_service_range {
        uint32_t type;
        uint32_t lower;
        uint32_t upper;
};
[Figure: server processes call bind(type = 42, lower = 2, upper = 20, scope = cluster), bind(type = 42, instance = 2, scope = cluster) and bind(type = 666, instance = 17, scope = node); a client process calls sendto(type = 42, instance = 2, node = 0).]
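
A minimal sketch of how such a bind() call might look with the socket API, assuming the struct sockaddr_tipc layout and the TIPC_CLUSTER_SCOPE / TIPC_NODE_SCOPE constants from <linux/tipc.h>. The helper names are hypothetical. A single instance is treated here as a range whose lower and upper bounds are equal.

```c
#include <assert.h>
#include <string.h>
#include <stdint.h>
#include <sys/socket.h>
#include <linux/tipc.h>

/* Hypothetical helper: describe bind(type, lower, upper, scope). */
static struct sockaddr_tipc make_service_range(uint32_t type, uint32_t lower,
                                               uint32_t upper, int scope)
{
        struct sockaddr_tipc sa;

        assert(lower <= upper);
        memset(&sa, 0, sizeof(sa));
        sa.family = AF_TIPC;
        sa.addrtype = TIPC_ADDR_NAMESEQ;    /* a service range */
        sa.scope = scope;                   /* visibility on the binding side */
        sa.addr.nameseq.type = type;
        sa.addr.nameseq.lower = lower;
        sa.addr.nameseq.upper = upper;
        return sa;
}

/* Bind one more service range to an existing TIPC socket.
 * Returns 0 on success, -1 with errno set on failure. */
static int bind_service_range(int sd, uint32_t type, uint32_t lower,
                              uint32_t upper, int scope)
{
        struct sockaddr_tipc sa = make_service_range(type, lower, upper, scope);

        return bind(sd, (struct sockaddr *)&sa, sizeof(sa));
}
```

Calling bind_service_range() more than once on the same socket is how the "different service addresses on the same socket" case would be expressed in this sketch.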

LOCATION TRANSPARENCY
Client never needs to know location of server
 Translation from service address to socket address is performed on-the-fly at the source node
 A replica of the global binding table is kept on each node for this translation
 Users can still indicate an explicit socket address if they want to
struct tipc_socket_addr {
        uint32_t port;
        uint32_t node;
};
[Figure: nodes #9a6004c1, #1a6b7ce0 and #c1f10e72, with sockets labeled port=123456, port=98765 and port=763456. Server processes call bind(type = 42, lower = 2, upper = 20, scope = cluster), bind(type = 42, instance = 2, scope = cluster) and bind(type = 666, instance = 17, scope = node); a client process calls sendto(type = 42, instance = 2, node = 0).]
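
To make the translation step concrete, here is a toy model of a binding table lookup. It is an illustration only, not the kernel's data structure or algorithm: the struct binding type, the lookup function and the cursor-based rotation are all assumptions chosen to show how a service address could resolve to a socket address, with round-robin selection when several sockets match.

```c
#include <stddef.h>
#include <stdint.h>

/* One row of a toy binding table: a service range and the socket bound to it. */
struct binding {
        uint32_t type, lower, upper;   /* service range */
        uint32_t port, node;           /* socket address */
};

/* Translate a service address into a socket address.
 * lookup_node == 0 means cluster global lookup; otherwise only bindings on
 * that node qualify. *cursor remembers where the previous search stopped,
 * so repeated lookups rotate among the matches (round-robin "anycast").
 * Returns NULL if nothing matches. */
static const struct binding *lookup(const struct binding *tbl, size_t n,
                                    uint32_t type, uint32_t instance,
                                    uint32_t lookup_node, size_t *cursor)
{
        for (size_t i = 0; i < n; i++) {
                const struct binding *b = &tbl[(*cursor + i) % n];

                if (b->type != type || instance < b->lower || instance > b->upper)
                        continue;
                if (lookup_node != 0 && b->node != lookup_node)
                        continue;
                *cursor = (*cursor + i + 1) % n;
                return b;
        }
        return NULL;
}
```

With one socket bound to the range 42:2-20 and another to 42:2, successive lookups for {42, 2} alternate between the two sockets in this model.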

DATAGRAM MODE
Reliable transport socket to socket
 Receive buffer overload protection
 No end-to-end flow control
 Messages may still be rejected by receiving socket
Rejected messages may be dropped or returned to sender
 Configurable in sending socket
 If returned, message is truncated and equipped with an error code
Unicast, Anycast or Multicast
 Depends on indicated address type
[Figure: server processes call bind(type = 42, instance = 2, scope = cluster) and bind(type = 42, instance = 1, scope = cluster); a client process calls sendto(type = 42, instance = 2, node = 0).]
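
The drop-or-return choice is made per sending socket. A minimal sketch, assuming the TIPC_DEST_DROPPABLE socket option and the SOL_TIPC level from <linux/tipc.h> are the mechanism meant here; set_dest_droppable is a hypothetical wrapper.

```c
#include <assert.h>
#include <sys/socket.h>
#include <linux/tipc.h>

/* Choose what happens to messages from this socket that the receiver rejects.
 * droppable == 1: discard them silently.
 * droppable == 0: return them to the sender, truncated and with an error code.
 * Returns 0 on success, -1 with errno set on failure. */
static int set_dest_droppable(int sd, int droppable)
{
        assert(droppable == 0 || droppable == 1);
        return setsockopt(sd, SOL_TIPC, TIPC_DEST_DROPPABLE,
                          &droppable, sizeof(droppable));
}
```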

CONNECTION MODE
Established by using service address
 One-way setup (a.k.a. “0-RTT”) using data-carrying messages
 Traditional TCP-style setup/shutdown also available
Stream- or message-oriented
 End-to-end flow control for buffer overflow protection
 No socket level sequence numbers, acknowledgments or retransmissions
 Link layer takes care of that
Connection breaks immediately if peer becomes unavailable
 Leverages link level heartbeats and kernel/socket cleanup functionality
 No socket level “keepalive” heartbeats needed
[Figure: two nodes hosting four processes, each with its own socket.]

MESSAGE BUS MODE
Available from Linux 4.14
Communication Groups: brokerless bus instances
 User instantiated
 Same addressing properties (service addressing) as datagram mode
 Different traffic properties: no dropped or rejected messages
 Four different message distribution methods
 Delivery and sequence order guaranteed, even between different distribution methods
 Leveraging L2 broadcast / UDP multicast when possible and deemed favorable
End-to-end flow control
 Messages never dropped because of destination buffer overflow
 Same mechanism covers all distribution methods
 Point-to-multipoint: “sliding window” algorithm
 Multipoint-to-point: “coordinated sliding window”

GROUP MEMBERSHIP
Members are sockets
 Groups are closed: members can only exchange messages with other sockets in the same group
 Each socket has two addresses: a <port:node> tuple bound by the system and a <group:member> tuple bound by the user
 <group:member> is a TIPC service address, i.e., the same as <type:instance>
 Member sockets may optionally deliver join/leave events for other members in the group
 Membership events are just empty messages delivered along with the source member’s two addresses
 The TIPC binding table serves as registry and distribution channel for member identities and events
[Figure: a member calls join(<group:member>); through the TIPC distributed binding table, the other members receive recvmsg(OOB, <group:member>, <port:node>). A member calls leave(); the other members receive recvmsg(OOB|EOR, <group:member>, <port:node>).]
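
In the socket API, join() and leave() are expressed as socket options. The sketch below assumes struct tipc_group_req, TIPC_GROUP_JOIN, TIPC_GROUP_LEAVE and TIPC_GROUP_MEMBER_EVTS as declared in <linux/tipc.h> on kernels with group support; the helper names are hypothetical.

```c
#include <string.h>
#include <stdint.h>
#include <sys/socket.h>
#include <linux/tipc.h>

/* Hypothetical helper: describe join(<group:member>). If want_events is set,
 * the socket asks for join/leave events about the other members. */
static struct tipc_group_req make_group_req(uint32_t group, uint32_t member,
                                            int want_events)
{
        struct tipc_group_req req;

        memset(&req, 0, sizeof(req));
        req.type = group;                /* <group:member> is a service address */
        req.instance = member;
        req.scope = TIPC_CLUSTER_SCOPE;
        if (want_events)
                req.flags |= TIPC_GROUP_MEMBER_EVTS;
        return req;
}

/* Join a group. Returns 0 on success, -1 with errno set on failure. */
static int join_group(int sd, const struct tipc_group_req *req)
{
        return setsockopt(sd, SOL_TIPC, TIPC_GROUP_JOIN, req, sizeof(*req));
}

/* Leave the group this socket currently belongs to. */
static int leave_group(int sd)
{
        return setsockopt(sd, SOL_TIPC, TIPC_GROUP_LEAVE, NULL, 0);
}
```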

GROUP MESSAGING
Received messages are delivered with both source addresses
[Figure: four panels, each showing a group with members 28, 60, 34 and 7.
 Unicast: sendto(SOCKET, <port:node>);
 Anycast: sendto(SERVICE, <group:member>);
 Multicast: sendto(MCAST, <group:member>);
 Broadcast: send();
Receiving members call recvmsg(<group:member>, <port:node>);]
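
The distribution method follows from the kind of destination address the sender supplies. As a sketch, the mapping can be written as a small function. It assumes that SOCKET, SERVICE and MCAST correspond to the address type constants TIPC_ADDR_ID, TIPC_ADDR_NAME and TIPC_ADDR_MCAST in <linux/tipc.h>; the enum and the function are hypothetical.

```c
#include <linux/tipc.h>

#ifndef TIPC_ADDR_MCAST
#define TIPC_ADDR_MCAST 1   /* assumption: value used by recent kernel headers */
#endif

enum group_dist {
        GROUP_UNICAST,
        GROUP_ANYCAST,
        GROUP_MULTICAST,
        GROUP_BROADCAST
};

/* Address type to put in sockaddr_tipc.addrtype for each method.
 * Returns -1 for broadcast, which takes no destination address:
 * the sender calls send() instead of sendto(). */
static int group_addrtype(enum group_dist how)
{
        switch (how) {
        case GROUP_UNICAST:
                return TIPC_ADDR_ID;     /* <port:node> of one member */
        case GROUP_ANYCAST:
                return TIPC_ADDR_NAME;   /* <group:member>, one matching member */
        case GROUP_MULTICAST:
                return TIPC_ADDR_MCAST;  /* <group:member>, all matching members */
        case GROUP_BROADCAST:
        default:
                return -1;
        }
}
```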

SERVICE TRACKING
Users can subscribe for contents of the global address binding table
 Receive events at each change matching the range in the subscription
There is a match when
 Bound/unbound instance or range overlaps with range subscribed for
Received events contain the bound socket’s service address and socket address
[Figure: nodes #9a6004c1, #1a6b7ce0 and #c1f10e72, with sockets labeled port=123456, port=98765 and port=763456. Server processes call bind(type = 42, lower = 2, upper = 20, scope = cluster) and bind(type = 42, instance = 2, scope = cluster); a client process calls subscribe(type = 42, lower = 0, upper = 10).]
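
The match rule is a plain interval overlap test, and a subscription is a small record sent to the topology service. The sketch below assumes struct tipc_subscr, TIPC_SUB_PORTS and TIPC_WAIT_FOREVER from <linux/tipc.h>; ranges_overlap and make_subscription are hypothetical helpers, and the actual send over a connected socket is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>
#include <stdint.h>
#include <linux/tipc.h>

/* A binding matches when its range overlaps the subscribed range. */
static bool ranges_overlap(uint32_t lo1, uint32_t hi1,
                           uint32_t lo2, uint32_t hi2)
{
        return lo1 <= hi2 && lo2 <= hi1;
}

/* Hypothetical helper: fill in subscribe(type, lower, upper).
 * The record is meant to be sent on a SOCK_SEQPACKET connection to the
 * topology service, addressed as {TIPC_TOP_SRV, TIPC_TOP_SRV}. */
static struct tipc_subscr make_subscription(uint32_t type, uint32_t lower,
                                            uint32_t upper)
{
        struct tipc_subscr sub;

        assert(lower <= upper);
        memset(&sub, 0, sizeof(sub));
        sub.seq.type = type;
        sub.seq.lower = lower;
        sub.seq.upper = upper;
        sub.timeout = TIPC_WAIT_FOREVER;  /* the subscription never expires */
        sub.filter = TIPC_SUB_PORTS;      /* report every matching bind/unbind */
        return sub;
}
```

In the figure's example, both the range 42:2-20 and the instance 42:2 overlap the subscribed range 42:0-10, so both bindings would produce an event.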

CLUSTER TOPOLOGY TRACKING
Special case of service tracking
 Uses the same mechanism, based on service binding table contents
 Represented by the built-in service type zero (== “node availability”)
 It is also possible to subscribe for availability of individual links
[Figure: nodes #9a6004c1, #1a6b7ce0 and #c1f10e72; a client process calls subscribe(type = 0, lower = 0, upper = ~0).]
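
An event received for such a subscription can be turned into a "node up" or "node down" notification. This is a sketch under two assumptions: that events arrive as struct tipc_event from <linux/tipc.h>, and that the affected node's address is reported in the found_lower field. node_event is a hypothetical helper.

```c
#include <stdbool.h>
#include <stdint.h>
#include <linux/tipc.h>

/* Interpret an event received for a subscription on service type 0.
 * Returns false for anything that is not a node up/down event.
 * Assumption: the node address is carried in found_lower. */
static bool node_event(const struct tipc_event *evt, uint32_t *node, bool *up)
{
        if (evt->s.seq.type != 0)
                return false;
        if (evt->event != TIPC_PUBLISHED && evt->event != TIPC_WITHDRAWN)
                return false;
        *node = evt->found_lower;
        *up = (evt->event == TIPC_PUBLISHED);
        return true;
}
```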

NODE TO NODE LINKS
“L2.5” reliable link layer
 Guarantees delivery and sequentiality for all packets
 Acts as trunk for multiple connections, and keeps track of those
 Keeps track of peer node’s address bindings in local replica of the binding table
Supervised by heartbeats at low traffic
 Failure detection tolerance configurable from 50 ms to 10 s; default is 1.5 s
 “Lost service address” events issued for bindings from peer node when contact is lost
 Breaks all connections to peer node when contact is lost
Several links per node pair
 Load sharing or active-standby, but max two active
 Disturbance-free failover to remaining link, if any
[Figure: two nodes hosting six processes, each with its own socket.]

NEIGHBOR DISCOVERY
Nodes have a 128-bit node identity
 By default assigned by system (from Linux 4.16)
 Can also be set by user, e.g. a host name or a UUID
 The identity is internally hashed into a guaranteed unique 32-bit node address
 This is the node address used by the protocol
Clusters have a 32-bit cluster identity
 Can be assigned by user if anything different from default value is needed
 All nodes using the same cluster identity will establish mutual links
 One link per interface, maximum two active links per node pair
Cluster identity determines network
 Neighbor discovery by UDP multicast or L2 broadcast
 If no broadcast/multicast support, discovery can be performed by explicitly configured IP addresses
[Figure: two clusters on the same network.
Cluster id 4711:
  Node id goethe, node # 2f1c0ab4
  Node id schiller, node # 78fca34
  Node id heine, node # 8cfba40
  Node id brandes, node # c7f413cb
  Node id ibsen, node # f5430cba
Cluster id 110956:
  Node id 95719650-3c19-11e8-b467-0ed5f89f718b, node # 8fa4ab00
  Node id 6c5719a38-38a6-33b8-b467-0ed5f89f718b, node # 97df4a1b
  Node id 48719650-ba63-12c8-b467-0ed5f89f77f2, node # 6f774bc4
  Node id 83719650-4c7b-14b8-b467-0ed5f89f717a, node # 016a3f02]

SCALABILITY
Overlapping Ring Monitoring Algorithm
Since Linux 4.7, TIPC comes with a unique auto-adaptive hierarchical neighbor monitoring algorithm.
This makes it possible to establish full-mesh clusters of 1000 nodes with a failure discovery time of 1.5 s.
➢ Sort all cluster nodes into a circular list
▪ All nodes use same algorithm and criteria
➢ Select next [√N] - 1 downstream nodes in the list as “local domain” to be actively monitored
▪ CPU load increases by ~√N
➢ Distribute a record describing the local domain to all other nodes in the cluster
➢ Select and monitor a set of “head” nodes outside the local domain so that no node is more than two active monitoring hops away
▪ There will be [√N] - 1 such nodes
▪ Guarantees failure discovery even at accidental network partitioning
➢ Each node now monitors 2 x (√N – 1) neighbors
• 6 neighbors in a 16 node cluster
• 56 neighbors in an 800 node cluster
➢ All nodes use this algorithm
➢ In total 2 x (√N - 1) x N actively monitored links
• 96 links in a 16 node cluster
• 44,800 links in an 800 node cluster
((√N – 1) local domain destinations + (√N – 1) remote “head” destinations) x N = 2 x (√N – 1) x N actively monitored links
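
The neighbor and link counts quoted on this slide can be reproduced with a few lines of arithmetic. The sketch assumes that √N is rounded up to the next integer, which is the reading that yields 56 neighbors for 800 nodes; the function names are illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* Smallest r with r * r >= n, i.e. the square root of n rounded up. */
static uint64_t ceil_sqrt(uint64_t n)
{
        uint64_t r = 0;

        while (r * r < n)
                r++;
        return r;
}

/* Neighbors each node actively monitors: 2 x (sqrt(N) - 1). */
static uint64_t monitored_neighbors(uint64_t n)
{
        assert(n >= 1);
        return 2 * (ceil_sqrt(n) - 1);
}

/* Actively monitored links in the whole cluster: 2 x (sqrt(N) - 1) x N. */
static uint64_t monitored_links(uint64_t n)
{
        return monitored_neighbors(n) * n;
}
```

For N = 16 this gives 6 neighbors and 96 links, and for N = 800 it gives 56 neighbors and 44,800 links, matching the figures above.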

PERFORMANCE
Latency better than TCP
 ~33% faster than TCP inter-node
 2 times faster than TCP intra-node for 64 byte messages
 7 times faster than TCP intra-node for 64 kB messages
 TIPC transmits socket-to-socket instead of via the loopback interface
Throughput still somewhat lower than TCP
 ~65-90 % of max TCP throughput inter-node
 Seems to be environment dependent
 But 25-30% better than TCP intra-node
 We are working on this….

ARCHITECTURE
[Figure: layered architecture diagram.
Layers: L4: Connection Handling, Flow Control; L3: Destination Lookup; L2/Internal: Link Aggregation/Synchronization/Failover/Neighbor Discovery/Supervision; L2/Internal: Fragmentation/Bundling/Retransmission/Congestion Control; External: Carrier Media.
Components: User Land (User App, C Library, Python, Go), Socket, Binding Table, Topology Service, Node Table, Node, Link, Media Plugins (Ethernet, Infiniband, VxLAN, UDP).]

API
Socket API
 The original TIPC API
TIPC C API
 Simpler and more intuitive
 Available as libtipc from the tipcutils package at SourceForge
Python, Perl, Ruby, D, Go
 But not yet for Java
ZeroMQ
 Not yet with full features
More to come…

WHEN TO USE TIPC
TIPC does not replace IP based transport protocols
 It is a complement to be used under certain conditions
 It is an IPC mechanism
TIPC may be a good option if you
 Need a high-performing, configuration-free, brokerless message bus
 Want startup synchronization and service discovery for free
 Have application components that need to keep continuous watch on each other
 Need short latency times
 Have traffic that is heavily intra-node or intra-subnet
 Don’t want to bother with cluster configuration
 Are inside a security perimeter
 Or can use IPsec or MACsec

WHO IS USING TIPC?
Ericsson mobile and fixed core network systems
 IMS, PGW, SGW, HSS…
 Routers/switches such as SSR, AXE
 Hundreds of installed sites
 Tens of thousands of nodes
 Tens of millions of subscribers
WindRiver
 Mission critical system for Sikorsky Aircraft’s helicopters
Cisco
 onePK, IOS-XE Software, NX-OS Software
Mirantis
 OpenStack
Nokia, Huawei and numerous other companies and institutions

MORE INFORMATION
TIPC home page
http://tipc.sourceforge.net
TIPC project page
http://sourceforge.net/project/tipc
TIPC Demo/Test/Utility programs
http://sourceforge.net/project/tipc/files
TIPC Communication Groups
https://www.slideshare.net/JonMaloy/tipc-communication-groups
TIPC Overlapping Ring Neighbor Monitoring
https://www.youtube.com/watch?v=ni-iNJ-njPo
TIPC protocol specification (somewhat dated)
http://tipc.sourceforge.net/doc/draft-spec-tipc-10.html
TIPC programmer’s guide (somewhat dated)
http://tipc.sourceforge.net/doc/tipc_2.0_prog_guide.html
