hkucuk

Chandy-Lamport Algorithm: How to View Global State in Distributed Systems?

February 23, 2025 • ☕️☕️ 11 min read • 🏷 computer, software, algorithm

Translated by author into: English


Distributed systems are systems where multiple independent processes or nodes cooperate for a common purpose. They are widely used in many areas from microservice architectures to cloud infrastructures in the modern technology world.

  • Global state is the combination of the current states of all processes that make up a distributed system and the communication channels between these processes.
  • Determining the global state in a distributed system plays a critical role in issues such as error detection, checkpointing, fault tolerance and process coordination.

But there is a big problem here: In distributed systems, since there is no single global clock or mechanism that “freezes” all nodes at the same time, the concept of “current moment” can be different for each node. This makes it difficult to capture a consistent global state. Chandy-Lamport Algorithm is a classic method developed to solve this problem.

Chandy-Lamport Algorithm


The Concept of Global State in Distributed Systems

In a distributed system:

  1. Processes: Each one running independently, usually a processor, a kernel, or a program fragment.
  2. Communication Channels: Message transmission paths between processes. These channels may have their own delays or ordering issues.

A process stores its own memory, saved variables, data structures, etc., called local state. The global state is the combination of the local states of all processes and the messages in the interprocess channels.

History and Importance of Chandy-Lamport Algorithm

The Chandy-Lamport Algorithm was introduced in the article titled “Distributed Snapshots: Determining Global States of Distributed Systems” published by K. Mani Chandy and Leslie Lamport in 1985. This article broke new ground in the literature with its solution to the problem of taking consistent snapshots in distributed systems.

Since then, the algorithm has provided a basic framework in many areas such as:

  • Checkpointing: Recording the current state of the distributed system and returning to this state after an error.
  • Error detection: Observing situations such as “which processes are running and which are not”.
  • Deadlock detection and analysis: When some processes are stuck waiting for resources, analysis of the global state is required.

Basic Principles of the Algorithm

The Chandy-Lamport Algorithm is based on the following simple but effective idea:

  1. Marker Usage: When a process decides to initiate a snapshot, it records its own state and sends a special marker message to other processes.
  2. Coordination: Processes that receive a marker message, if they see this marker for the first time, record their own state and send the marker to others.
  3. Channel State Recording: Each process determines the channel state consistently by tracking the messages that come before and after the marker message.

In this way, all processes can obtain a temporally distinct but logically consistent snapshot.

Steps of the Algorithm

The Chandy-Lamport Algorithm consists of three basic steps:

  1. Snapshot Starting

    • A process (e.g. P_1) records its local state.
    • It sends a marker message to all neighboring processes.
  2. Receiving the Marker Message

    • Another process (e.g. P_2), if it is starting this algorithm for the first time:

      • It records its local state.
      • It sends a marker message to all neighbors.
      • It records the messages received before the marker message arrives as part of the relevant channel.
    • If the process has already started a snapshot and sees the marker message for the second time, this marker is used only to complete the channel state.
  3. Recording the Channel State

    • The state of a channel is associated with the message traffic from the moment the marker message arrives or immediately after sending the marker.
    • In this way, the current content of the channel (messages that have arrived but have not yet been processed) is recorded.

Recording Channel State

The channel state is associated with the messages that pass through the channel after a process sends (or receives) a marker message. To be more precise:

  • A channel state is considered “incomplete” until a process records and sends a marker.
  • After a marker message is received, another process starts collecting messages from that channel. Those messages become part of the snapshot.

In particular, messages that arrive before receiving a marker message are not included in the recording of the channel state. Because those messages are already associated with the previous snapshot of that channel or included in the local state of the process.


A Detailed Example Scenario

Consider a distributed system consisting of three processes (P_1, P_2, P_3). Let’s say that each process interacts with each other as follows:

  • P_1P_2
  • P_2P_3
  • P_1P_3

So, we have a fully connected topology.

  1. Snapshot Start (P_1)
  2. P_1 records its own state. Let’s say that P_1’s “memory value: x=10”, “number of processes: 3”.
  3. P_1 sends marker messages to P_2 and P_3.
  4. Receives Marker Message P_2
  5. If P_2 has not received a marker message before (receiving it for the first time):
  6. P_2 records its own state. (For example: y=20, processes=5)
  7. P_2 sends marker messages to both P_1 and P_3. (Actually, it also sends to P_1, but since P_1 has the marker, it only completes the channel state.)
  8. Receives Marker Message P_3
  9. By the same logic, P_3 records its own state. (z=15, processes=7)
  10. P_3 sends markers to the other two processes.
  11. Recording Channel States
  12. P_2 may have sent some messages to P_3 before the marker arrived from P_3. If these messages were sent out before the marker and received after the marker, these messages may fall into a state of “outgoing but not yet processed”. Therefore, they are included in the channel state.

Once these steps are completed, the local states of all three processes and the messages “pending” on the channels between them are recorded, creating a complete, consistent snapshot.


Code Example with GoLang

Below we will try to code a simplified simulation of the Chandy-Lamport algorithm. The example shows the message exchange between three processes (goroutines) and how they process a marker message.

Please note: The code below is not a complete production code, but rather for educational purposes. In distributed systems, actual messaging usually happens over the network; here, we are simulating it with Go’s chan (channel) mechanism.

package main

import (
    "fmt"
    "sync"
    "time"
)

// We will define a simple structure to hold the state of each process
type Process struct {
    id            int
    localState    int
    hasRecorded   bool
    markerChannel chan bool
    messageChan   chan int
    wg            *sync.WaitGroup
}

// Global wait group for all processes
var wg sync.WaitGroup

func main() {
    // Create 3 processes
    process1 := Process{
        id:            1,
        localState:    10,
        hasRecorded:   false,
        markerChannel: make(chan bool, 1),
        messageChan:   make(chan int, 10),
        wg:            &wg,
    }
    process2 := Process{
        id:            2,
        localState:    20,
        hasRecorded:   false,
        markerChannel: make(chan bool, 1),
        messageChan:   make(chan int, 10),
        wg:            &wg,
    }
    process3 := Process{
        id:            3,
        localState:    30,
        hasRecorded:   false,
        markerChannel: make(chan bool, 1),
        messageChan:   make(chan int, 10),
        wg:            &wg,
    }

    // We will use one channel to simulate message passing between each pair
    // But for clarity, let's just consider processes sending messages to each other in a ring or fully connected scenario
    // Here, we simulate a fully connected pattern in a simplistic manner

    wg.Add(3)
    go process1.run(&process2, &process3)
    go process2.run(&process1, &process3)
    go process3.run(&process1, &process2)

    // Start the system for a bit, then initiate a snapshot from process1
    time.Sleep(2 * time.Second)

    fmt.Println("=== Starting Snapshot from Process 1 ===")
    process1.markerChannel <- true // send a marker to process1

    // Wait for all processes to finish
    wg.Wait()
}

// run is the main loop for each process
func (p *Process) run(p2, p3 *Process) {
    defer p.wg.Done()

    // Start a ticker that sends a message to p2 and p3 every 500ms
    ticker := time.NewTicker(500 * time.Millisecond)

    for {
        select {
        case <-ticker.C:
            // Send a random message (here, just localState) to p2 and p3
            p2.messageChan <- p.localState
            p3.messageChan <- p.localState
            fmt.Printf("Process %d sends message with content %d\n", p.id, p.localState)

        case marker := <-p.markerChannel:
            // If marker is received, handle the snapshot logic
            if marker {
                if !p.hasRecorded {
                    p.takeSnapshot()
                    // Send marker to other processes
                    p2.markerChannel <- true
                    p3.markerChannel <- true
                } else {
                    // If already recorded, do nothing special
                    // except acknowledging we received a marker
                    fmt.Printf("Process %d already recorded a snapshot\n", p.id)
                }
            }

        case msg := <-p.messageChan:
            // We receive a normal message
            fmt.Printf("Process %d receives message with content %d\n", p.id, msg)
            // If we haven't recorded yet, this message belongs to the channel state
            if !p.hasRecorded {
                fmt.Printf("Process %d: This message is stored as part of the channel state for snapshot.\n", p.id)
            }

        // We add a timeout or some condition to exit
        case <-time.After(5 * time.Second):
            fmt.Printf("Process %d shutting down.\n", p.id)
            return
        }
    }
}

// takeSnapshot simulates recording the local state
func (p *Process) takeSnapshot() {
    p.hasRecorded = true
    fmt.Printf("Process %d is recording local state: %d\n", p.id, p.localState)
    // In a real scenario, we would also store the channel states here
}

When the program is run, the output will be as follows.

Process 2 sends message with content 20
Process 1 receives message with content 20
Process 1: This message is stored as part of the channel state for snapshot.
Process 1 sends message with content 10
Process 3 receives message with content 20
Process 3: This message is stored as part of the channel state for snapshot.
Process 3 sends message with content 30
Process 3 receives message with content 10
Process 3: This message is stored as part of the channel state for snapshot.
Process 2 receives message with content 10
Process 2: This message is stored as part of the channel state for snapshot.
Process 2 receives message with content 30
Process 2: This message is stored as part of the channel state for snapshot.
Process 1 receives message with content 30
Process 1: This message is stored as part of the channel state for snapshot.
Process 1 sends message with content 10
Process 1 receives message with content 30
Process 1: This message is stored as part of the channel state for snapshot.
Process 1 receives message with content 20
Process 3 sends message with content 30
Process 3 receives message with content 10
Process 3: This message is stored as part of the channel state for snapshot.
Process 3 receives message with content 20
Process 3: This message is stored as part of the channel state for snapshot.
Process 1: This message is stored as part of the channel state for snapshot.
Process 2 sends message with content 20
Process 2 receives message with content 10
..........

The working version of the program is available here.

This example is a simplified version of the entire Chandy-Lamport process. Complex issues such as real-world network delays, retransmissions, error conditions are added and addressed.


Real World Applications

  1. Checkpointing in Distributed Databases

    • Important databases maintain data integrity by taking snapshots from time to time. This way, the system can revert to this record in case of an error.
  2. Error Detection and Recovery

    • Since distributed systems aim to be constantly online, snapshots taken at certain times can easily identify which processes are running and which channels messages are stuck in.
  3. Consistency Control in Distributed Systems

    • For example, replicated ledger structures such as blockchain require global state information periodically.
  4. IoT and Microservices Architecture

    • Snapshots are taken to analyze the state of the system at a specific moment during the collection of sensor data or communication of multiple microservices.

Frequently Asked Questions and Answers

Question 1: Does the algorithm work on synchronous or asynchronous networks? Answer: The Chandy-Lamport Algorithm can also work on asynchronous systems. In fact, the beauty of this algorithm is that it guarantees consistent snapshots even in environments where communication delays are uncertain.

Question 2: How does the marker message itself represent a state? Answer: The marker message means “Snapshot is starting on this channel”. When each process receives the marker message (if it sees it for the first time), it records its local state and starts the process.

Question 3: Do we need to stop the system until all processes complete the snapshot? Answer: No. The Chandy-Lamport Algorithm works in a distributed manner and does not prevent the normal operation of processes. Processes can continue to send and receive messages while taking a snapshot.

Question 4: How is it different from other snapshot algorithms? Answer: Chandy-Lamport takes simple, extensible and consistent snapshots using markers. There are other algorithms (e.g. Mattern’s colored marker algorithm) but most of them are inspired by the Chandy-Lamport idea.


The Chandy-Lamport algorithm is a marker-passing based method that skillfully solves the global state problem in distributed systems. Its main success is that it allows processes to create a consistent snapshot without using any central controller and without blocking each other.


Resources