Nathan Evans' Nemesis of the Moment

Simulating the P of CAP on a Riak cluster

Posted in .NET Framework, Automation, Unix Environment by Nathan B. Evans on February 10, 2013

When developing and testing a distributed system, one of the essential concerns you will deal with is eventual consistency (EC). There are plenty of articles covering EC so I’m not going to dwell on that much here. What I am going to talk about is testing, particularly of the automated kind.

Testing an eventually consistent system is difficult because everything is transient and unpredictable. Unless you tune your consistency values like N and DW then you’re offered no guarantees about the extent of propagation of your commit around the cluster. And whilst consistency tuning may be acceptable for some tests, it most definitely won’t be acceptable for tests that cover concurrent write concerns such as your sibling resolution protocols.

What is “partition tolerance”?

This is where a single node or a group of nodes in the cluster become segregated from the rest of the cluster. I liken it to a “net split” on an IRC network. The system continues operating but it has been split into two or more parts. When a node has become segregated from the rest of the cluster it does not necessarily mean that its clients can also not reach it. Therefore all nodes can continue to perform writes on the dataset.

Generally speaking, a partition event is a transient situation. They may last a few seconds, a few minutes, hours.., days.. or even weeks. But the expectation is that eventually the partition will be repaired and the cluster returned to full health.

State changes of a cluster during a partition event

Fig. 1: State changes of a cluster during a partition event

In the diagram (Fig. 1) there is a cluster comprised of three nodes, and this is the sequence of events:

  1. N2 becomes network partitioned from N1 and N3.
    N1 and N3 can continue replicating between one another.
  2. Client(s) connected to either N1 or N3 perform two writes (indicated by the green and orange).
  3. Client(s) connected to N2 perform three writes (purple, red, blue).
  4. When the network partition is resolved, the cluster begins to heal by merging the commits between the nodes.
  5. The yellow and green commits (that were already replicated between N1 and N3) are propagated onto N2.
  6. The purple, red and blue commits on N2 are propagated onto N1 and N3.
  7. The cluster is now fully healed.

Simulating a partition

I have a Riak cluster of three nodes running on Windows Azure and I needed some way to deterministically simulate a network partition scenario. The solution that I came up with was quite simple. I basically wrote some iptables scripts that temporarily firewalled certain IP traffic on the LAN in order to prevent the selected node from communicating with any other nodes.

To erect the partition:

# Simulates a network partition scenario by blocking all TCP LAN traffic for a node.
# This will prevent the node from talking to other nodes in the cluster.
# The rules are not persisted, so a restart of the iptables service (or indeed the whole box) will reset things to normal.

# First add special exemption to allow loopback traffic on the LAN interface.
# Without this, riak-admin gets confused and thinks the local node is down when it isn't.
sudo iptables -I OUTPUT -p tcp -d $(hostname -i) -j ACCEPT
sudo iptables -I INPUT -p tcp -s $(hostname -i) -j ACCEPT

# Now block all other LAN traffic.
sudo iptables -I OUTPUT 2 -p tcp -d -j REJECT
sudo iptables -I INPUT 2 -p tcp -s -j REJECT

To tear down the partition:

# Restarts the iptables service, thereby resetting any temporary rules that were applied to it.
sudo service iptables restart

Disclaimer: These are just rough shell scripts designed for use on a test bed development environment.

I then came up with a fairly neat little class that wraps up the concern of a network partition:

internal class NetworkPartition : IDisposable {
    private readonly string _nodeName;
    private bool _closed;

    private NetworkPartition(string nodeName) {
        _nodeName = nodeName;

    /// <summary>
    /// Creates a temporary network partition by segregating (at the IP firewall level) a particular node from all other nodes in the cluster.
    /// </summary>
    /// <param name="nodeName">The name of the node to be network partitioned. The name must exist as a PuTTY "Saved Session".</param>
    /// <returns>An object that can be disposed when the partition is to be removed.</returns>
    public static IDisposable Create(string nodeName) {
        var np = new NetworkPartition(nodeName);
        Plink.Execute("", nodeName);
        return np;

    private void Close() {
        if (_closed)

        Plink.Execute("", _nodeName);

        _closed = true;

    public void Dispose() {

Which allows me to write test cases in the following way:


using (NetworkPartition.Create("riak03")) {

    // TODO: Do other stuff here whilst riak03 is network partitioned from the rest of the cluster.


Yes, RingStatus and RingReady aren’t documented here. But they’re pretty simple.

Obviously as part of this work I had to write a quick and dirty wrapper around the PuTTY plink.exe tool. This tool is basically a CLI version of PuTTY and it is very good for scripting automated tasks.

My solution could be improved to support creating partitions that consist of more than just one node, but for me the added value of this would be very small at this stage. Maybe I will add support for this later when I get into stress testing territory!


You can view it over on GitHub; the namespace is tentatively called “ClusterTools”. Bear in mind it’s currently held inside another project but if there’s enough interest I will make it standalone. There has been talk on the CorrugatedIron project (which is a Riak client library for .NET) about starting up a Contrib library, so maybe this could be one its first contributions.

Riak timing out during startup

Posted in Databases, Unix Environment by Nathan B. Evans on February 6, 2013

Recently I’ve been working on some interesting projects involving the eventually consistent Riak key-value database. Today I encountered a puzzling issue with a fresh cluster I was deploying. I had seemingly done everything identically to in the past except something was causing it to fail to startup when in service mode.

[administrator@riak01 ~]$ sudo service riak start
Starting Riak: Riak failed to start within 15 seconds,
see the output of 'riak console' for more information.
If you want to wait longer, set the environment variable
WAIT_FOR_ERLANG to the number of seconds to wait.

Bizarrely, in console mode it would start up fine. This led to me to believe it was some sort of user or permissions issue but I wasn’t totally sure. Perhaps I had accidentally executed Riak as another user and some sort of locking file was created? Or was it perhaps a performance issue with the “Shared Core” Azure VM’s I was using.

First I followed Riak’s advice by increasing the WAIT_FOR_ERLANG environment variable, I tried first 30 and then 60 seconds. But this made no difference at all. I’m not even sure if Riak was even using my new value as it still kept on printing out “15 seconds” as its reason for failing to start.

I researched some more and many places on the interweb were suggesting to purge the /var/lib/riak/ring/ directory (don’t do this if you have valuable data stored on the Riak instance). I tried this, but it also had no effect.

But it turned out that the solution was incredibly simple. Riak had created some sort of temporary lock directory at /tmp/riak. All I had to do was delete this directory and, hey presto, Riak would now start perfectly fine as a service!

$ sudo rm -r /tmp/riak

There may be more posts on the subject of Riak soon. 🙂

PS: I am using Riak version 1.2.1, on CentOS 6.3.

Tagged with: ,