How We Mitigated a Critical Infra Issue

Ghostknife
Oct 29, 2020

In this story, I am going to explain how I accidentally caused a critical infra incident and how we handled it.

The Architecture

So we have a couple of core services and workers that form the core of our product. Without these services and workers, our product cannot run. And all of them critically depend on a single shared Redis cluster, both for caching and for fast access to intermediate data, with no failsafe in place.

The Incident

One fine evening I went ahead and ran the Redis KEYS command against one of the services. The request did not return a result, so I ran it again. This time the request timed out. A couple of minutes later, I realized things had gone badly wrong: customers had been seeing 5XX errors on our front end for about 3 minutes. We investigated for about half an hour and couldn’t find a root cause, so we contacted AWS support, and they found there had been a spike in CPU utilization caused by the Redis KEYS command. But since it had been run from the Rails console, we didn’t have any metrics for it on our side.

We classified it as a P0 (that’s what we call critical issues that need to be mitigated within the next 24 hours) and started working on it.

Cause Analysis

Redis KEYS is a command that returns all keys matching a given pattern. It runs in O(n), where n is the number of keys in the keyspace, so the more keys present in the cluster, the longer it takes. And our Redis cluster contains millions of keys at any point in time. So the command caused a sudden spike in CPU utilization in the Redis cluster, creating a bottleneck for requests from the core services and workers. And as there is no failsafe in place, other services were blocked as well.
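For anyone who wants to avoid the same mistake, here is a minimal sketch (using the redis-rb gem; the key pattern and batch size are made up) of the difference between KEYS and the cursor-based SCAN iterator:

```ruby
require "redis"

redis = Redis.new # connection details omitted

# KEYS walks the whole keyspace in a single blocking call -- with millions
# of keys this pins the Redis CPU and stalls every other client:
# redis.keys("session:*")

# SCAN iterates the keyspace in small cursor-based batches, so other
# commands get served in between iterations:
redis.scan_each(match: "session:*", count: 1000) do |key|
  # process each matching key here
end
```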

What We Can Learn from It

The more we use Redis for storing data used in intermediate calculations, the more tightly our services and workers are coupled to Redis, and the harder it becomes to make them fail-safe.

An alternative would be to send the intermediate data along with the input, but that makes the SQS messages larger.

So the better approach here would be to have separate Redis clusters for separate services, so that the next time one cluster goes down, only the services depending on that cluster suffer.
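As a rough illustration, splitting the shared cluster could be as simple as giving each service its own endpoint and connection (the environment variable names here are hypothetical):

```ruby
require "redis"

# Each service talks only to its own Redis endpoint, so a CPU spike on one
# cluster no longer takes everything else down with it.
CORE_REDIS      = Redis.new(url: ENV.fetch("REDIS_CORE_URL"))
WORKERS_REDIS   = Redis.new(url: ENV.fetch("REDIS_WORKERS_URL"))
REPORTING_REDIS = Redis.new(url: ENV.fetch("REDIS_REPORTING_URL"))
```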

How We Mitigated It

  1. Almost everyone had SSH access to the production EB environments. We revoked it and retained it only for a small set of people.
  2. There was a wrapper around the Ruby Redis client, which I had bypassed to run the command; we locked it down so unsafe commands can no longer be run directly (see the sketch after this list).
  3. We set up proper CloudWatch and PagerDuty alarms around Redis CPU usage, CPU utilization of EC2 instances, and so on (an example alarm is sketched after this list).
  4. We created New Relic dashboards for monitoring related metrics.
  5. We set up a TechOps (SRE) team to handle such critical infra incidents in the future.
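For point 2, here is a minimal sketch of what such a restricted wrapper could look like; CacheClient and the list of blocked commands are illustrative, not our actual implementation:

```ruby
require "redis"

# Thin wrapper that delegates to redis-rb but refuses commands that can
# block the whole cluster in production.
class CacheClient
  UNSAFE_COMMANDS = %i[keys flushall flushdb].freeze

  def initialize(redis = Redis.new)
    @redis = redis
  end

  def method_missing(command, *args, &block)
    if UNSAFE_COMMANDS.include?(command.downcase)
      raise ArgumentError, "#{command} is disabled in production"
    end
    @redis.public_send(command, *args, &block)
  end

  def respond_to_missing?(command, include_private = false)
    !UNSAFE_COMMANDS.include?(command.downcase) && @redis.respond_to?(command, include_private)
  end
end

# CacheClient.new.keys("*")       # => raises ArgumentError
# CacheClient.new.get("some:key") # => delegated to Redis as usual
```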
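And for point 3, a rough sketch of what one such alarm might look like using the aws-sdk-cloudwatch gem; the cluster ID, threshold, and SNS topic ARN are placeholders, not our real setup:

```ruby
require "aws-sdk-cloudwatch"

cloudwatch = Aws::CloudWatch::Client.new(region: "us-east-1")

# Page us (via an SNS topic wired to PagerDuty) when the Redis engine CPU
# stays above 80% for three consecutive minutes.
cloudwatch.put_metric_alarm(
  alarm_name:          "core-redis-high-engine-cpu",
  namespace:           "AWS/ElastiCache",
  metric_name:         "EngineCPUUtilization",
  dimensions:          [{ name: "CacheClusterId", value: "core-redis-001" }],
  statistic:           "Average",
  period:              60,
  evaluation_periods:  3,
  threshold:           80.0,
  comparison_operator: "GreaterThanThreshold",
  alarm_actions:       ["arn:aws:sns:us-east-1:123456789012:pagerduty-critical"]
)
```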
