Paxos Made Wrong (part I)

In each blog post of this series, I will present a modification of Paxos, together with a sequence of protocol messages that leads to an invariant violation. Hopefully, this will build a better understanding of Paxos (mine and yours).

On a quest for understanding

Maybe it’s just me, but I often gain intuition about why a protocol works from situations where it doesn’t. One way to create such situations is to change a single step of the protocol (e.g. flip a comparison from $>$ to $\leq$, or reduce the size of a quorum by 1) and see what happens.
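To make this concrete, here is a minimal sketch (a toy example of my own, not taken from any real Paxos implementation) of what reducing the quorum size by 1 can do: with the usual majority rule any two quorums intersect, so at most one value can gather a quorum, while the weakened rule lets two disjoint sets of acceptors both count as quorums.

```python
# Toy model, for illustration only: a value counts as "chosen" once a
# quorum of the n acceptors has accepted it.

def is_chosen(acceptors: set[int], n: int) -> bool:
    # Original rule: a quorum is a strict majority, so any two quorums intersect.
    return len(acceptors) > n // 2

def is_chosen_mutated(acceptors: set[int], n: int) -> bool:
    # Mutated rule: the required quorum size is reduced by 1.
    return len(acceptors) > n // 2 - 1

n = 4
accepted_a = {0, 1}  # acceptors that accepted value A
accepted_b = {2, 3}  # acceptors that accepted value B

# Original rule: A and B cannot both be chosen.
assert not (is_chosen(accepted_a, n) and is_chosen(accepted_b, n))

# Mutated rule: both values look chosen, violating agreement.
assert is_chosen_mutated(accepted_a, n) and is_chosen_mutated(accepted_b, n)
```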

These protocol modifications typically lead to an invariant violation (e.g. processes disagreeing on the consensus value). By examining the misbehavior, we gain more intuition about why that protocol step had to be the way it was. The more we do this, the more understanding we build, and at some point we will have a pretty good grasp of what’s going on.

If the modification doesn’t lead to an invariant violation and the protocol remains correct¹, then maybe the protocol is unnecessarily strict and there is room for improvement. Flexible Paxos reminds me of exactly that: a quorum requirement in Paxos that turned out to be stricter than necessary (only phase-1 and phase-2 quorums need to intersect, not every pair of quorums), and relaxing it opened the door to new optimizations. (To be clear, I’m not saying this is how it happened.)

Mutation testing

In a way, this is similar to mutation testing, but with a different goal. In mutation testing, the program is mutated to assess the quality of its tests: if the tests still pass after a mutation, they are not good enough and should be improved.
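As a small, hypothetical example (the function and tests below are made up), consider a mutation that flips a `>=` into a `>`:

```python
# Original function and a mutant where ">=" was changed to ">".

def is_adult(age: int) -> bool:
    return age >= 18

def is_adult_mutant(age: int) -> bool:
    return age > 18  # the mutation

def run_tests(impl) -> None:
    # A weak test suite: it never exercises the boundary case age == 18.
    assert impl(30) is True
    assert impl(10) is False

run_tests(is_adult)
run_tests(is_adult_mutant)  # the mutant survives, so the tests should be improved
```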

In our case:

  • the mutated program is the protocol
  • the tests validate the protocol invariants
  • but the goal is not to improve the tests
  • instead, a failing test – in our case, an invariant that didn’t hold – will help us understand the protocol a bit better
  • with a passing test – in our case, an invariant that did hold – we may be on our way to an optimization – yay! (this is unlikely and won’t be the goal)

“The tests validate the protocol invariants”

To better understand this, let’s think about unit tests. Typically, a unit test sets up some state, runs some program logic, and finally checks that everything went well. Those final checks correspond to the protocol invariants. But we still need the setup and the program logic to be executed.

As an example of what I mean by setup and program logic in the context of distributed protocols, let’s think about Paxos (a small sketch follows the list):

  • the setup would be the number of processes $n$ and the number of allowed faults $f$
  • the program logic would be the set of messages that are exchanged between processes
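In code, a hand-written test case could then look something like this (a sketch with names and a message format I made up, not an actual Paxos implementation): the setup fixes $n$ and $f$, and the program logic is just the list of messages to be delivered.

```python
# Hypothetical representation of a protocol test case: the setup plus the
# messages exchanged between processes.

from dataclasses import dataclass, field

@dataclass(frozen=True)
class Setup:
    n: int  # number of processes
    f: int  # number of allowed faults

@dataclass
class TestCase:
    setup: Setup
    messages: list[tuple] = field(default_factory=list)

# A hand-crafted example: 3 processes, 1 allowed fault, and a few messages.
case = TestCase(
    setup=Setup(n=3, f=1),
    messages=[
        ("prepare", 0, 1),        # proposer 0 sends a prepare with ballot 1
        ("promise", 1, 1, None),  # acceptor 1 promises ballot 1, nothing accepted yet
        ("accept", 0, 1, "A"),    # proposer 0 asks acceptors to accept "A" at ballot 1
    ],
)
```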

With something like property testing, the testing framework can generate the setup and protocol messages for us. Then, for each test case generated, the framework checks some property. In our case, this property would be a protocol invariant.
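Here is a minimal sketch of that idea using Python’s hypothesis library (my choice of framework; the toy model below is not real Paxos). The framework generates the setup (the number of acceptors $n$) and the program logic (a random sequence of accept messages), and the property it checks is the agreement invariant.

```python
# Property-based testing of a protocol invariant with hypothesis.
# Toy model: each acceptor keeps the first value it accepts, and a value
# is chosen once a strict majority of the n acceptors hold it.

from hypothesis import given, strategies as st

def chosen_values(n: int, accepts: list[tuple[int, str]]) -> set[str]:
    accepted: dict[int, str] = {}
    for acceptor, value in accepts:
        accepted.setdefault(acceptor, value)  # keep the first accepted value
    counts: dict[str, int] = {}
    for value in accepted.values():
        counts[value] = counts.get(value, 0) + 1
    return {value for value, count in counts.items() if count > n // 2}

# The framework generates the setup (n) and the program logic (accept messages).
@given(
    st.integers(min_value=1, max_value=7).flatmap(
        lambda n: st.tuples(
            st.just(n),
            st.lists(st.tuples(st.integers(0, n - 1), st.sampled_from(["A", "B"]))),
        )
    )
)
def test_agreement(case):
    n, accepts = case
    # The invariant: at most one value is ever chosen.
    assert len(chosen_values(n, accepts)) <= 1

test_agreement()
```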

I’m not planning to use property testing, but I may resort to it if at some point I’m unable to come up with a failing example by hand. Nevertheless, it doesn’t matter where the example comes from, as long as it helps build intuition and ultimately leads to a full understanding.

What’s next

In the next blog post I will probably start with a specification of Paxos that we can modify in later posts.

Before any parting words, I would like to thank Cláudia Brito (@cbrito_18) and Rogério Pontes (@rogerio_acp) for their feedback on earlier versions of this post.

Until next time. Stay safe!


  1. Knowing whether some protocol change maintains correctness is a more complex topic. My go-to option would be adapting previous pen-and-paper proofs and existing TLA+ models.