Solr Recovery

At Honeybadger this morning we had a failure of our three-server SolrCloud cluster. Each of the three servers has a replica of each of the eight shards of our notices collection. Theoretically this means that two of the three servers can go away and we can still serve search traffic. Sadly, reality doesn’t always match the theory.

What happened to us this morning is that some of the shards became leaderless when one of the servers ran out of disk space and started throwing errors. Specifically, we kept seeing this error in the logs: No registered leader was found. As a result, the two remaining servers refused to update the index, which brought search-related operations to a halt. Since I’m relatively new to Solr, I had to bang my head against the wall for a bit before I stumbled upon the solution.
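
If you'd rather spot leaderless shards without digging through logs, something like the following might help. It's a minimal sketch, assuming a Solr version whose Collections API supports the CLUSTERSTATUS action (requested as JSON) and that a leader replica is flagged with a "leader" property. The sample data is hand-written for illustration and is not output from our cluster.

```ruby
require "json"

# Returns the names of shards in which no replica is flagged as leader.
# `status` is the parsed JSON body of a CLUSTERSTATUS response, e.g. from
#   /solr/admin/collections?action=CLUSTERSTATUS&collection=notices&wt=json
def leaderless_shards(status, collection)
  shards = status.dig("cluster", "collections", collection, "shards") || {}
  shards.select { |_name, shard|
    # The flag is compared as a string so that "true" and true both match.
    shard.fetch("replicas", {}).values.none? { |r| r["leader"].to_s == "true" }
  }.keys.sort
end

# Hand-written sample in the assumed response shape (illustrative only).
sample = JSON.parse(<<~JSON)
  {"cluster": {"collections": {"notices": {"shards": {
    "shard1": {"replicas": {
      "core_node1": {"state": "active", "leader": "true"},
      "core_node2": {"state": "active"}}},
    "shard2": {"replicas": {
      "core_node3": {"state": "down"},
      "core_node4": {"state": "active"}}}
  }}}}}
JSON

puts leaderless_shards(sample, "notices").inspect
```

With the sample above, only shard2 is reported, since neither of its replicas carries the leader flag.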

Simply bringing down one of the two remaining good servers didn’t solve the problem. The last remaining server refused to become the leader for four of the shards. To fix this, I had to unload each of the stubborn shards and load them again. This was accomplished easily enough via the admin UI, and, once completed, our search functionality was restored.
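
I did this through the admin UI, but the same unload and load can be requested over HTTP through Solr's CoreAdmin API (the UNLOAD and CREATE actions). Here's a rough sketch that only builds the URLs. The host and the core name (notices_shard1_replica1) are assumptions for illustration, so check the names your own cluster uses before trying anything like this.

```ruby
require "uri"

# Builds a CoreAdmin API URL for the given Solr base URL and parameters.
def core_admin_url(base, params)
  "#{base}/admin/cores?#{URI.encode_www_form(params)}"
end

# URL that asks Solr to unload a core (its index files stay on disk).
def unload_url(base, core)
  core_admin_url(base, action: "UNLOAD", core: core)
end

# URL that asks Solr to load the core again as a replica of the given shard.
def create_url(base, core, collection, shard)
  core_admin_url(base, action: "CREATE", name: core,
                       collection: collection, shard: shard)
end

base = "http://localhost:8983/solr"   # hypothetical host
core = "notices_shard1_replica1"      # hypothetical core name

puts unload_url(base, core)
puts create_url(base, core, "notices", "shard1")
```

Printing the URLs first makes it easy to eyeball them (or paste them into curl) before touching a live server.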

Once that was done, I brought the other good server back up, and it quickly caught up with the one server that was now the leader for all the shards. Easy-peasy. Sadly, bringing the last of the servers up – the culprit with the full disk – took a bit more work. Since its data directory was about twice the size of the directory on the other two servers, despite all three supposedly having the same documents, I decided to just blow away all the data on the third server and replace it with a copy of the snapshots from the leader. The process was basically this:

  1. Take a fresh snapshot of each shard from the leader
  2. Copy the eight snapshots from the leader to the damaged server
  3. Move the index data into place, renaming the shards
  4. Unload and load the cores on the damaged server

Here’s a Ruby script for step 1 and a shell script for the other steps.
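
As a rough illustration of step 1 only, here is a minimal sketch that builds one snapshot request per shard using the replication handler's backup command. This is not the exact script from the incident. The leader host, the shardN_replica1 core naming, and the optional backup location are all assumptions for illustration.

```ruby
require "uri"

# Core name for a shard, assuming the <collection>_shard<N>_replica<M> pattern.
def core_name(collection, shard, replica = 1)
  "#{collection}_shard#{shard}_replica#{replica}"
end

# URL that asks a core's replication handler to take a snapshot.
# `location` is optional; without it Solr writes to the core's data directory.
def backup_url(base, core, location = nil)
  params = { command: "backup" }
  params[:location] = location if location
  "#{base}/#{core}/replication?#{URI.encode_www_form(params)}"
end

leader      = "http://leader.example.com:8983/solr"  # hypothetical host
collection  = "notices"
shard_count = 8

urls = (1..shard_count).map { |n| backup_url(leader, core_name(collection, n)) }
puts urls

# To actually trigger the snapshots, each URL could be fetched with
# Net::HTTP.get(URI(url)) after a `require "net/http"`.
```

The sketch stops at printing the URLs so it can be checked safely. The copy, move, and reload steps are left out because they depend on each server's directory layout.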

With that, the damaged server quickly got each of the shards back in sync (since I had just taken snapshots on the leader), and everything was back to normal.