Atomics: the beauty is in the eye of the holder

Back when I was working at Radisys, one of the first things I had to do was implement a queue. Per the VOLTHA architecture, the OLT must stream its events out over a gRPC interface.

Here's an excerpt from this schema:

service NativeEventsManagementService {
    // ...
    // Initiate the server streaming of the events
    rpc StreamEvents(google.protobuf.Empty) returns(stream Event);
}

The Radisys OLT needs to send a gRPC stream of errors/alerts about its health. Now, VOLTHA itself is written in Go, but the OLT firmware is written in C++.
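For context, the generated C++ server interface for that RPC looks roughly like this (a sketch based on the standard gRPC codegen for a server-streaming RPC; EventsServiceImpl and next_event are my hypothetical names, and the generated proto headers are elided):

#include <grpcpp/grpcpp.h>
// ... plus the headers generated from the service proto above

class EventsServiceImpl final : public NativeEventsManagementService::Service {
  grpc::Status StreamEvents(grpc::ServerContext* context,
                            const google::protobuf::Empty* request,
                            grpc::ServerWriter<Event>* writer) override {
    Event event;
    // drain events from an internal queue and write each onto the stream
    while (next_event(event)) {          // next_event is a stand-in for the queue's dequeue
      if (!writer->Write(event)) break;  // client disconnected
    }
    return grpc::Status::OK;
  }
};

The queue this post is about sits behind that next_event call: producers all over the firmware push alerts in, and the streaming handler pops them out.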

Disclaimer

An OLT is a pretty simple device. It doesn't really need a whole lot of performance engineering, and atomics were definitely overkill for this. I was young and dumb, and wanted to utilise modern C++ to its fullest.


So in the implementation of the queue, I did the following:

template <typename T>
class Queue {
  struct Node {
    T data;
    std::atomic<Node*> next{nullptr};
    // ...
  };

  std::atomic<Node*> head;
  std::atomic<Node*> tail;
};

The head and tail of the queue both hold the std::atomic<Node*> type, and both start out pointing at the same (dummy) node. Now, for a quick overview of memory_order: essentially, the sequence of instructions you write isn't always what's actually executed. After you've written your instructions, the compiler as well as the CPU hardware reorder them to optimise for performance. This is called Out-of-Order Execution (Wiki). This is usually good, but sometimes we want to be more explicit about how said instructions should be ordered. Per cppreference, we have the following paradigms:

- memory_order_relaxed: only the operation's atomicity is guaranteed; no ordering constraints on surrounding reads/writes.
- memory_order_consume: a weaker acquire that orders only dependent reads; rarely used in practice.
- memory_order_acquire: for loads; no reads or writes after the load can be reordered before it.
- memory_order_release: for stores; no reads or writes before the store can be reordered after it.
- memory_order_acq_rel: both of the above, for read-modify-write operations.
- memory_order_seq_cst: the default; acquire/release plus a single total order across all seq_cst operations.
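To make the acquire/release pairing concrete, here's a minimal standalone sketch (my example, not code from the queue):

#include <atomic>
#include <cassert>
#include <thread>

int payload = 0;                 // plain, non-atomic data
std::atomic<bool> ready{false};

void producer() {
  payload = 42;                                  // happens-before the release store below
  ready.store(true, std::memory_order_release);  // writes above can't sink past this
}

void consumer() {
  while (!ready.load(std::memory_order_acquire)) {}  // reads below can't hoist above this
  assert(payload == 42);  // guaranteed: the acquire load synchronises with the release store
}

int main() {
  std::thread t1(producer), t2(consumer);
  t1.join();
  t2.join();
}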

With this, our enqueue function becomes:

void enqueue(const T& data) {
  auto new_node = std::make_unique<Node>(data);
  Node* raw_node = new_node.release();  // * ownership transferred to queue

  Node* current_tail = tail.load(std::memory_order_relaxed);
  // * ┌------------------------------┘
  // * can be `relaxed` since `tail` is only modified by the producer
  current_tail->next.store(raw_node, std::memory_order_release);
  tail.store(raw_node, std::memory_order_release);
}

dequeue would have a similar implementation, but would use std::memory_order_acquire instead of std::memory_order_release. Furthermore, reading head would be a std::memory_order_relaxed operation, as it's only modified by the consumer.
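A minimal sketch of what that might look like (assuming the dummy-node layout above, where the live data starts at head->next):

bool dequeue(T& out) {
  Node* current_head = head.load(std::memory_order_relaxed);        // only the consumer writes head
  Node* next = current_head->next.load(std::memory_order_acquire);  // pairs with the producer's release store
  if (next == nullptr) {
    return false;  // queue is empty
  }
  out = std::move(next->data);
  head.store(next, std::memory_order_relaxed);  // `next` becomes the new dummy
  delete current_head;                          // retire the old dummy
  return true;
}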

Looks good, right? There's a problem. The tail and head pointers work well when we have a single producer and a single consumer. But in our case, we have events coming in from multiple producers, which means they'll now have to contend to write to this queue. So now we have to add a retry mechanism:

void enqueue(T data) {
  auto new_node = std::make_unique<Node>(std::move(data));
  Node* raw_node = new_node.release();

  while (true) {
    Node* current_tail = tail.load(std::memory_order_acquire);
    Node* next = current_tail->next.load(std::memory_order_acquire);
    if (next == nullptr) {
      // * tail has no successor yet: try to link `raw_node` as its next
      if (current_tail->next.compare_exchange_strong(
              next, raw_node,
              std::memory_order_release, std::memory_order_relaxed)) {
        // * linked successfully: now swing the tail to the new node
        tail.compare_exchange_strong(
            current_tail, raw_node,
            std::memory_order_release, std::memory_order_relaxed);
        break;
      }
    }
  }
}

But this would get stuck when a thread links its new node but, before it can swing tail forward, another thread comes in to enqueue: that thread sees tail->next != nullptr and spins forever in the nullptr check.

So we must handle the case where tail gets stuck pointing at a node whose next != nullptr. Our while loop now becomes:

while (true) {
  Node* current_tail = tail.load(std::memory_order_acquire);
  Node* next = current_tail->next.load(std::memory_order_acquire);

  if (next == nullptr) {
    if (current_tail->next.compare_exchange_strong(
            next, raw_node,
            std::memory_order_release, std::memory_order_relaxed)) {
      // success: now swing tail to the new node
      tail.compare_exchange_strong(
          current_tail, raw_node,
          std::memory_order_release, std::memory_order_relaxed);
      return;
    }
    // CAS failed: another producer linked first; loop and retry
  } else {
    // tail is lagging: help by advancing it past the already-linked node
    tail.compare_exchange_strong(
        current_tail, next,
        std::memory_order_release, std::memory_order_relaxed);
  }
}

I'm not even going to pretend that I fully understand what's happening anymore. After this, I was hesitant to use atomics. Another reason was their inconsistent support across hardware. This blog by ArangoDB delves into more detail, and does so quite well.

In cases where an atomic operation isn't natively supported by the hardware, the implementation silently falls back to a mutex lock. Damn.
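You can actually check this at runtime with is_lock_free. A quick sketch (the results depend on your target and standard library; with GCC the wide case may also need linking against -latomic):

#include <atomic>
#include <cstdio>

struct Wide {
  long a, b, c;  // too wide for a single atomic instruction on most targets
};

int main() {
  std::atomic<int> narrow{0};
  std::atomic<Wide> wide{};

  std::printf("atomic<int>  lock-free: %d\n", narrow.is_lock_free());  // typically 1
  std::printf("atomic<Wide> lock-free: %d\n", wide.is_lock_free());    // typically 0: lock fallback
}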

The other reason is what I like to call the illusion of performance. In theory, atomic operations are faster than mutex locks. In practice, an uncontended mutex is little more than a single atomic operation itself, and under heavy contention a CAS retry loop burns cycles much like a lock would.
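If you want to see this on your own machine, a toy benchmark along these lines works (entirely my construction; the numbers vary wildly with core count and contention, which is rather the point):

#include <atomic>
#include <chrono>
#include <cstdio>
#include <mutex>
#include <thread>
#include <vector>

constexpr int kThreads = 4;
constexpr int kIters = 1000000;

// Run `work` on kThreads threads and return the elapsed wall time in ms.
template <typename F>
long long run_ms(F work) {
  auto start = std::chrono::steady_clock::now();
  std::vector<std::thread> threads;
  for (int i = 0; i < kThreads; ++i) threads.emplace_back(work);
  for (auto& t : threads) t.join();
  auto end = std::chrono::steady_clock::now();
  return std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
}

int main() {
  std::atomic<long> atomic_counter{0};
  long plain_counter = 0;
  std::mutex m;

  auto atomic_ms = run_ms([&] {
    for (int i = 0; i < kIters; ++i)
      atomic_counter.fetch_add(1, std::memory_order_relaxed);
  });

  auto mutex_ms = run_ms([&] {
    for (int i = 0; i < kIters; ++i) {
      std::lock_guard<std::mutex> lock(m);
      ++plain_counter;
    }
  });

  std::printf("atomic: %lld ms, mutex: %lld ms\n", atomic_ms, mutex_ms);
}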

std::atomic was introduced to enable lock-free programming, not as a drop-in speed upgrade for code that a mutex already handles fine.

All of this charade to arrive at a rather predictable solution: just use std::mutex instead.
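For completeness, here's roughly what that ends up looking like (a sketch; the names are mine):

#include <mutex>
#include <optional>
#include <queue>

template <typename T>
class MutexQueue {
 public:
  void enqueue(T data) {
    std::lock_guard<std::mutex> lock(mutex_);
    queue_.push(std::move(data));
  }

  // Returns std::nullopt when the queue is empty.
  std::optional<T> dequeue() {
    std::lock_guard<std::mutex> lock(mutex_);
    if (queue_.empty()) return std::nullopt;
    T data = std::move(queue_.front());
    queue_.pop();
    return data;
  }

 private:
  std::mutex mutex_;
  std::queue<T> queue_;
};

Shorter, obviously correct, and multi-producer for free.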

But wait!

C++ is just so big that you can abandon a whole world of design, move on, and not think about it too much.

Since then I've learnt where I went wrong. My experience wasn't unique.

Reflection

In my opinion, the beauty of C++ is its flexibility in design paradigms - you can go functional, lean on classes, or work with numerous abstractions if you don't wanna get lost in implementation details.

If you assess the trends in what people expect from a programming language, what do you see - Python and Rust are booming. They coexist. There are instances when you'd reach for one, and instances when you'd reach for the other.

Gratitude

A big thanks to my colleagues from the Radisys OLT team, Praveen, Abhilash, Prateek, Tejal, Shiva and Thulasi.