Clique

Intro

For MUC communication salut uses its own reliable udp based multicast protocol, "Clique", optimized for use in mesh networks. This protocol is completely seperate from the mdns protocol used for discovery of presences on the local network. The same identifiers are used on both networks though.

Please note that this is still work in progress and details might change all the time. The current version of this page concentrates mostly on the methods/algorithms used in the protocol. Not on the exact timing value and packet layout and retries.

The protocol guarantees reliable message transport with CausalOrdering and virtual synchronisation.

Virtual synchronisation means that all machines in the current group membership agree on who are connected and that any two machines receive the same set of messages if they see the same configuration changes. For example if the the membership starts with {A,B,C} and C fails for some reason, thus chaging the membership to {A,B}. Then A and B will see the same messages between these changes.

Logically the protocol consists of a transport layer which provides the reliability, CausalOrdering and identification aspects and a Membership layers which ensures the virtual synchronisation.

transport layer

Common principles

There are various packets that trigger replies (especially REPAIR_REQUEST and WHOIS_REQUEST). When sending a reply to these a node should schedule a reply after a random timeout in a specified interval. And the node must cancel the sending when a reply is received from another node containing equivalent information.

Unique id generation and translation

In the mdns world the identifiers are free form strings. As various parts of the protocol send out packets with information about all members these are quite inefficient. Therefor the protocol uses 32 bit identifiers for each node.

To translate a 32 bit identifier to string form a node sends out an WHOIS_REQUEST packet. As a response to this a WHOIS_REPLY should be sent using the following information:

When joining the multicast group a node has to generate an unique identifier for itself using the following protocol:

  1. Generate a random identifier, i = 0
  2. Send out a WHOIS_REQUEST packet for this identifier
  3. Wait 500 ms
    1. If a WHOIS_REPLY for the generated identifier is received go back to step 1
    2. if i < 3 and no WHOIS_REPLY is received i = i + 1 and goto step 2

    3. if i >= 3 and no WHOIS_REPLY is received continue to step 4

  4. Send out a WHOIS_REPLY with the identifier

If a node encounters another node with the same unique identifier it must leave the properly leave the group and start the process described above again. (In case of collision both loose).

Transfer and repair protocol

The transports sends out both unreliable and reliable packets. Unreliable packets aren't guaranteed to arrive and are just uses for signalling (for example the packets of the identification protocol above).. Reliable packets are guaranteed to arrive at all nodes in the current set in the absence of node failures. reliable packets contain the sender, a sequential packet identifier, fragmentation information, generation and dependency information (Which other packets should be presented to the higher levels before this one).

When a node detects gaps in the data stream of other members it should schedule a REPAIR_REQUEST using the common principles (wait for a random amount of time, don't send the request if someone else already did). After sending out or discarding the REPAIR_REQUEST a node should reschedule REPAIR_REQUEST untill the relevant packet is received. Upon receiving a REPAIR_REQUEST each node with the needed information is allowed to and should sent the reliable packet using the common guidelines.

To ensure everyone keeps up-to-data once every so often one member sends a SESSION packet which contains the current next-packet of all members. Each node must schedule to sent out a SESSION packet in a certain time interval. If a SESSION packet is received with equivalent information the emitting of a SESSION packet should be rescheduled. (TODO give guidelines about good time intervals and how to gradually make them bigger when there is no data being sent)

The dependency information is for both acknowledgement and to guarantee causal ordering. Each reliable packet contains the sequence number of the packet from each node it needs to receive next. Which also implies it has received all previous messages from each node and that each node should deliver this message after all those previous messages.

If a node has no data to send, it should send regular empty messages as a way to acknowledge everything it has received and to indicate to the rest of the membership that it's still alive.

Implementation notes

All nodes should cache a reliable packet untill _all_ non-failed nodes have acknowledged that they received the packet. To prevent huge amounts of cache data the implementation should use a high water mark of maximum number of cache messages. After which it will no longer send out data (only pure acknowledgement packets) untill the cache size drops between a low water mark. If this doesn't happen within a certain time interval the node should assume that the nodes that haven't acknowledge the oldest packets in the cache are faulty. Also when there are no packets received from some node for a certain amount of time it should be assumed faulty.

When receiving reliable packets from nodes not in the current set, the transport layer notifies the membership layer so it can start the joining protocol.

Leaving

When a node wants to leave, it stops receiving messages and sends a reliable BYE message three times with 500ms in between (three times the same message, thus the same sequence number). Nodes picking up this message will declare the faulty and handle it using the normal fault handling protocol.

Membership layer

The membership layer exists of two seperate but cooperating protocols. One for joining and one for handling faulty nodes

Fault handle protocol

When the transport layer detects a faulty node the membership layer sends out a reliable FA (fault announce) packet, with the current set of nodes it thinks are faulty. All packets received from a faulty node after sending out the FA are delayed unless a packet from a non-faulty node is received which depends on them. When a node receives an FA messages from a non-faulty node containing a set of faulty nodes different from it's own, it merges both sets and submit an FA message.

When from all non-faulty node a FA messages is received containing node f then node f is removed from the membership. This guarantees that all non-faulty nodes see the same set of messages from F when it is removed (As FA messages are only delivered to the membership layer when all dependencies are furfilled and each FA depends on the last messages of F each node saw).

For a more complete prove see Membership algorithms for multicast communication groups by Y. Amier, D. Dolev, S. Kramer and D. Malki

Implementation note

It's good practise too keep the mapping of an identifier around for some time, to prevent new nodes from re-using it and is better for network performance when groups often split up and join again.

Joining protocol

The joining protocol consists of two stage. First the nodes are joined at the transport level, which results in a reliable transport between all potential nodes. After that consensus needs to be reached after which points the nodes are joined in the new membership.

After a foreign node B (in membership BM) is detected by A (in membership AM). A will send an reliable AJ (attempt join) message (AJ0) containing BM as set of nodes it wants to join with (note that each reliable packet contains the current group as dependency info) and adds BM as nodes that have to acknowledge all messages after (AJ1). When a node of BM receives this AJ it replies with an AJ (AJ1) containing no extra nodes and adds all nodes of AM as nodes that have to acknowledge all messages after this AJ. As soon as AJ 1 is received by a node of AM again it sends an empty AJ itself (AJ2).

This provides the following guarantees. After a node of BM receives (AJ0) it can safely receive all packets from AM after (AJ0), after AM recevies (AJ1) it can safely retreive all messages of BM since (AJ1). As there is nothing to garantee BM receives AJ0, the nodes in AM need to resend untill they either receive AJ1 or AJ2. And as there is no guarantee that AM will receive AJ1, the nodes in BM need to resent AJ1 untill they receive AJ2. As all nodes can retreive AJ2 via reliable message transport, there is no extra need to acknowledge that. After AJ2 each node in AM + BM is free to send a new AJ to start merging with a group CM, untill the stage 2 is started.

After two or more groups have joined at the transport level, consensus needs to be create among all nodes in the group when to shift to the new membership. To start this process nodes send a JOIN message containing the nodes current membership and the nodes considered faulty before this JOIN was send. When JOIN messages are received from all nodes with the same information, the new membership is instanted. With all messages send _before_ any of the JOIN's (thus a JOIN depends on those messages) belonging to the old membership and all others belonging to the new membership. When shifting to the new membership, first faulty nodes in the agreed JOIN should be signalled as leaving after which the new nodes are signaled as having joined

When a JOIN message is received from another node with different information then the nodes last or planned JOIN _and_ it contains either more nodes or more failures, the previous JOIN is invalidated and a new JOIN is send merging the new information. Also when a failure is detected after a JOIN has send, this failure is sent using an FA message with the same rules as the fault handling protocol. Except that a failure being received before sending a JOIN, this information is merged into the JOIN message. Handling faults after joining is complete follows the same procedure as the fault handling protocol.

Packet description/layout

TODO types:


CategorySalut