The other key foundation I mentioned was "things fail in a large system".
If you're trying to build a large distributed system from cheap commodity hardware, this is a fact of life. Individual hard drives will fail, entire servers will fail, and you will most likely see an entire datacentre go dark at some point in your career (probably far more often than you'd like). And the more components in a system, the more points of failure.
So if we go back to the "3 x 6TiB HDD vs 1 x 18TiB HDD" example from the discussion of vertical vs. horizontal scaling, there are 3 hard drives that could fail vs. 1 that could fail. So (assuming the 6TiB and 18TiB drives have the same failure rate), using 3 hard drives roughly triples your probability of having to deal with an HDD failure. And if I were to run, say, 3 servers with 3 x 6TiB HDDs each, to reach 54TiB of total storage, compared to running 1 server with 3 x 18TiB HDDs, you can see that there are so many more potential points of failure, and not just at the disk level: more power supplies, more network interfaces, more sticks of RAM.
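To put rough numbers on that, here's a back-of-the-envelope sketch in Java. The 2% annual failure rate is purely an assumption for illustration, not a claim about any real drive:

```java
// Probability of at least one drive failing in a year, for n independent drives:
// P = 1 - (1 - AFR)^n. The 2% AFR below is an illustrative assumption.
public class FailureOdds {
    public static void main(String[] args) {
        double afr = 0.02; // assumed annual failure rate per drive

        System.out.printf("1 drive:  %.2f%%%n", atLeastOneFailure(afr, 1)); // ~2.00%
        System.out.printf("3 drives: %.2f%%%n", atLeastOneFailure(afr, 3)); // ~5.88%
        System.out.printf("9 drives: %.2f%%%n", atLeastOneFailure(afr, 9)); // ~16.63% (3 servers x 3 HDDs)
    }

    // Chance (as a percentage) that at least one of the drives fails in a year.
    static double atLeastOneFailure(double afr, int drives) {
        return (1 - Math.pow(1 - afr, drives)) * 100;
    }
}
```

The exact numbers don't matter; the point is that the probability of something failing climbs quickly as you add components.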
So the cheapness of scaling horizontally on commodity hardware brings with it the cost of surviving the inevitable failures.
So if you want to scale horizontally effectively, you have to accept that failures are normal, and build your system accordingly.
And Kafka does this in two ways - in the control plane, and in the data plane.
Firstly, the control plane. This is focused on cluster state (e.g. which brokers are in the cluster, which broker is the leader for a given partition, etc.). Production Apache Kafka clusters use ZooKeeper to maintain this state resiliently in the face of probable failures. (It's worth mentioning that a ZooKeeper replacement called KRaft is under active development in Kafka 3.0+, but still isn't production ready.)
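As a concrete illustration of what "cluster state in ZooKeeper" means: a ZK-backed Kafka cluster registers each live broker as an ephemeral znode under /brokers/ids, so you can peek at that state with the plain ZooKeeper client. This is just a sketch; the connection string is an assumption for illustration:

```java
import java.util.List;
import org.apache.zookeeper.ZooKeeper;

// Lists the live broker ids a ZK-backed Kafka cluster has registered in ZooKeeper.
// The ZooKeeper address (localhost:2181) is an illustrative assumption.
public class BrokerList {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 10_000, event -> { });
        try {
            // Each live broker creates an ephemeral znode under /brokers/ids;
            // if the broker dies, its znode disappears along with its ZK session.
            List<String> brokerIds = zk.getChildren("/brokers/ids", false);
            System.out.println("Live broker ids: " + brokerIds);
        } finally {
            zk.close();
        }
    }
}
```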
ZK is a very well put together piece of software; it implements ZAB (ZooKeeper Atomic Broadcast), a consensus protocol in the Paxos family. KRaft is a feature (still in development) that uses Kafka brokers running as electors to achieve consensus via the Raft algorithm.
And I'd like to quote the excellent Jepsen analysis of ZooKeeper here, because it's pretty unequivocal. Under "Recommendations", it simply says: "Use Zookeeper. It’s mature, well-designed, and battle-tested."
Paxos and Raft have many subtleties and differences, but from the point of view of someone running a Kafka cluster they're essentially the same, and the main point of both protocols is this: consensus about the cluster state is achievable as long as a majority of electors are still available. This can be expressed as Q = 2N + 1, where Q is the number of electors you need in order to survive N elector failures. (Or, the other way around, N = floor((Q - 1) / 2).)
In other words, for a ZK ensemble with 5 members, 2 can be unavailable but consensus can still be maintained, as a majority remains. 7 members means 3 can fail while retaining consensus, and so on and so forth. (These equations also express why running an even number of ZK nodes is a waste of money - a 3 ZK cluster can survive 1 failure, a 4 ZK cluster can survive 1 failure, a 5 ZK cluster can survive 2...)
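To make the quorum arithmetic concrete, here's a trivial sketch that just turns the two formulas above into code:

```java
// Majority-quorum arithmetic: Q = 2N + 1 and N = floor((Q - 1) / 2).
public class Quorum {
    // Minimum number of electors needed to survive N elector failures.
    static int electorsNeeded(int failures) {
        return 2 * failures + 1;
    }

    // Number of elector failures an ensemble of Q electors can survive.
    static int failuresTolerated(int electors) {
        return (electors - 1) / 2; // integer division gives us the floor
    }

    public static void main(String[] args) {
        for (int q = 3; q <= 7; q++) {
            System.out.printf("%d electors -> survives %d failure(s)%n", q, failuresTolerated(q));
        }
        // Prints: 3 -> 1, 4 -> 1, 5 -> 2, 6 -> 2, 7 -> 3
        // i.e. the even-sized ensembles buy you nothing extra.
    }
}
```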
The next plank of Kafka's resilience is in the data plane, which is where topic replication factor, minimum in-sync replicas, and producer acknowledgements come in, and I'll discuss those in my next comments.