(All times CET.)

Summary of the Incident

Dec 12, 1 pm: A customer starts syncing large amounts of data (for 5 connections) without immediate signs of system degradation.
Dec 13, 1 am: Slowdowns & errors become noticeable and are reported in the community. Both the Nango API and the management dashboard are affected. For short periods of time, many API requests fail, rendering Nango unavailable to most users. The system recovers between bursts of activity before deteriorating again.
Dec 13, 6 am: A Nango team member wakes up and starts investigating. Several mitigations are attempted to restore full availability, but with only limited effect.
Dec 13, 10:15 am: The 5 new high-traffic connections are paused. Metrics improve, but performance issues persist.
Dec 13, 3 pm: The Nango team suspects a hardware problem and contacts our cloud provider.
Dec 13, 5:45 pm: Our cloud provider swaps the database disks (end of incident).


Description of Root Cause

Nango uses Render.com for cloud hosting services, including our Postgres database.

For historical reasons, and without our knowledge, Render had provisioned our Postgres database instance on a legacy disk type that uses burst credits.

Leading up to the incident, a traffic increase on Nango Cloud had depleted the burst credits of our database instance. This severely degraded the performance of our Postgres database, which in turn affected all Nango Cloud workloads.
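To illustrate why the degradation appeared only after many hours of elevated traffic, here is a minimal sketch of how a burst-credit disk behaves. All numbers are hypothetical and do not reflect Render's actual disk parameters: credits accrue at a fixed baseline rate, sustained I/O above that baseline spends them, and once the balance reaches zero the disk is throttled down to its baseline throughput.

```python
# Hypothetical burst-credit disk model (illustrative numbers only).
BASELINE_IOPS = 300            # sustained rate the disk can always serve
BURST_IOPS = 3000              # peak rate available while credits remain
STARTING_CREDITS = 5_400_000   # credit balance, counted in I/O operations

def simulate(demand_iops: int, hours: int) -> None:
    """Print the effective throughput and credit balance hour by hour."""
    credits = STARTING_CREDITS
    for hour in range(1, hours + 1):
        # While credits remain, the disk can serve up to the burst rate;
        # once they are gone, it is throttled to the baseline.
        limit = BURST_IOPS if credits > 0 else BASELINE_IOPS
        served = min(demand_iops, limit)
        # Credits accrue at the baseline rate and are spent on I/O above it.
        credits = max(credits + (BASELINE_IOPS - served) * 3600, 0)
        print(f"hour {hour:2d}: serving {served:4d} IOPS, credits left: {credits:,}")

# A sustained sync load above the baseline drains the balance over a few
# hours, after which throughput collapses to the 300 IOPS baseline.
simulate(demand_iops=1500, hours=12)
```

This matches the pattern in the timeline above: the disk keeps up with the extra load for a while, then performance collapses once the credit balance runs out.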

Render no longer offers this burst-credit disk type for new databases, but Nango's database had been left on the legacy type. This limitation was never communicated to Nango; without it, the outage would not have happened. After the disk type was changed, the higher level of traffic was sustained without issue.

Fix

Our cloud provider updated our database's disk type. This change resolved the incident and provides a long-term fix, as the new disk type does not use a burst-credit system.
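For teams debugging a similar situation, here is a minimal diagnostic sketch (not Nango's actual tooling) that checks whether a Postgres instance is I/O-bound, rather than slowed down by application queries, by looking at wait events in pg_stat_activity. It assumes psycopg2 is installed and that a DATABASE_URL environment variable points at the database.

```python
import os
import psycopg2

# Count active backends by wait event. A pile-up of 'IO' wait events
# (e.g. DataFileRead) points at the disk rather than application logic.
QUERY = """
    SELECT wait_event_type, wait_event, count(*)
    FROM pg_stat_activity
    WHERE wait_event_type IS NOT NULL
    GROUP BY wait_event_type, wait_event
    ORDER BY count(*) DESC;
"""

with psycopg2.connect(os.environ["DATABASE_URL"]) as conn:
    with conn.cursor() as cur:
        cur.execute(QUERY)
        for wait_type, wait_event, count in cur.fetchall():
            print(f"{wait_type:>10} | {wait_event:<25} | {count} backends")
```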

Additional mitigation measures are below. 👇

Post-incident analysis & improvements

We are working hard to significantly reduce the chances of another incident of this length & severity. In particular, we are working on improving the following:

List of improvements & mitigation measures