Summary
On Thursday, April 30, 2026, our S3-compatible object storage service experienced intermittent disruptions over a roughly seven-hour window between 14:39 and 22:00 CEST. During this period, some object reads and writes returned errors or suffered elevated latency, and our public status page reflected a "degraded" state for the object storage component. Other DanubeData services (VPS instances, managed databases, cache, serverless containers, static sites) were not affected.
We have identified the root cause, replaced the failing hardware, and the service has returned to full operation. We are publishing this postmortem to explain what happened, why our initial diagnosis was wrong on multiple occasions, and what we are doing to prevent a recurrence.
Customer impact
| Aspect | Detail |
|---|---|
| Service affected | Object storage (S3 API) |
| Window | 2026-04-30 14:39 → 22:00 CEST (~7 h, intermittent) |
| Number of disruption events | 9 (each ~5–10 min) |
| Effective availability during the window | ~85–90% |
| Customer-facing symptoms | S3 read/write errors, slow uploads, occasional 503 Slow Down, RGW timeouts |
| Data integrity | No data loss. Replicas/EC chunks remained intact and recoverable throughout. |
| Other services | Unaffected |
Customers running automated backups or rsync-style copies generally saw their clients retry transparently. Customers with synchronous interactive S3 calls (e.g. dashboards loading thumbnails) saw intermittent loading errors during the disruption peaks.
Background
Earlier in the week we placed an order with our hardware provider for an additional Ceph storage node to expand cluster capacity. The new server, referred to internally as storage-03, was delivered and provisioned on April 30. After OS install, network configuration, and Kubernetes node integration, we added the new node to our Ceph cluster in the late morning. Customer data began migrating to balance load across the now-larger cluster (Ceph terms this "backfill"). At that point everything looked routine.
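For readers less familiar with Ceph, both the expansion and the backfill progress are observable from the cluster side. A minimal sketch of the kind of checks involved (exact output and option names vary by Ceph release):

```bash
# Overall cluster health, including backfill/recovery progress and throughput
ceph -s

# Per-OSD utilisation, to watch data rebalance onto the new node's drives
ceph osd df tree

# Backfill concurrency is throttled by settings such as this one
ceph config get osd osd_max_backfills
```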
Roughly two hours into the backfill, the new server became unresponsive. SSH, ping, and the Kubernetes control plane all simultaneously lost contact with the node. Approximately five minutes later, the node returned. This pattern then repeated, on average every 30 minutes, for the rest of the day.
Timeline
All times CEST (UTC+2). Times are approximate.
- ~12:30 — storage-03 added to the Ceph cluster, backfill begins.
- 14:39 — First unscheduled reboot of storage-03. Initially attributed to a single anomaly during the heavy initial backfill traffic.
- 14:52, 15:18, 16:00 — Three further unscheduled reboots over 90 minutes. Pattern recognized; active investigation begins.
- 15:30 — First diagnostic hypothesis: a Kubernetes networking component (Cilium) appears to be running out of memory. We increase its memory limit cluster-wide. This was a real misconfiguration that we fixed, but as we learned later, it was a symptom rather than a cause.
- 16:31, 17:08 — Two further reboots. Cilium is no longer crashing, but storage-03 still becomes unreachable under load. We rule out the Cilium-memory hypothesis.
- 17:30 — Persistent journal logging enabled on storage-03 so we can read kernel messages from the previous (crashed) boot. This is the diagnostic step that finally exposes the actual failure pattern.
- 17:45 — Smoking gun identified in the kernel logs. The 10 Gigabit network interface on storage-03 was hanging its transmit queues under sustained load; the driver-level reset was failing, eventually triggering the kernel watchdog and forcing a hard reboot:

      ixgbe 0000:01:00.0 enp1s0: Detected Tx Unit Hang
      ixgbe 0000:01:00.0 enp1s0: tx hang detected on queue 8, resetting adapter
      ixgbe 0000:01:00.0 enp1s0: initiating reset due to tx timeout
      ixgbe 0000:01:00.0: primary disable timed out

- 18:00 — Comparison with our other storage server (stable for 50+ days) reveals two differences: NIC firmware version and Linux kernel point release. We open a support ticket with our hardware provider asking for a firmware update, and in parallel run a controlled test by downgrading storage-03's kernel to match the stable server.
- 19:30 — Kernel downgrade test result: storage-03 still crashes on the older kernel. This rules out the kernel as the cause. Firmware (or a hardware fault on the NIC card itself) is the only remaining variable.
- 19:45 — We pause all data migration on the cluster (Ceph's nobackfill flag; a short sketch follows this timeline) to reduce load on storage-03 and minimize further disruption while we wait for the hardware provider to act.
- 20:15 — Hardware provider responds and offers to replace the 10 Gigabit NIC card as a goodwill gesture, with ~30 minutes of downtime.
- 21:00 — Customer-facing status banner posted, hardware provider given the go-ahead.
- 21:15 → 21:45 — storage-03 rebooted into the provider's rescue environment, NIC physically swapped, server returned to service.
- 21:50 — New NIC verified: same Intel 82599 chipset, but a different MAC address and updated firmware (1.3769.0 instead of 1.3429.0).
- 22:00 — Cluster maintenance flags removed, backfill resumes, storage-03 remains stable under load. No further crashes.
- 22:15 — Status banner updated to "operating normally."
Total customer-impacting window: approximately seven hours.
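For reference, pausing and resuming the data migration (19:45 and 22:00 in the timeline above) comes down to Ceph's cluster-wide maintenance flags. A minimal sketch; related flags such as norebalance are handled the same way:

```bash
# 19:45: pause backfill so the crashing node is not hammered by migration traffic
ceph osd set nobackfill

# 22:00: after the NIC swap, allow the data migration to resume
ceph osd unset nobackfill

# Confirm the flags and overall cluster state
ceph -s
```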
Root cause
The Intel 10 Gigabit NIC card shipped with storage-03 had a firmware version (1.3429.0) several revisions older than what is currently in production on our other identical-spec storage server (1.3769.0). Under sustained inter-server traffic, this older firmware would lose track of its transmit descriptor queues — a known class of bug in the Intel 82599 series referred to in the driver code as "Tx Unit Hang." The Linux ixgbe driver detects the hang and attempts a hardware reset, but the reset itself fails (the driver prints primary disable timed out), leaving the network interface in an unrecoverable state. The kernel software watchdog then triggers a full hardware reset of the server.
The trigger for the bug is sustained network throughput at 10 Gbps line rate combined with the specific traffic pattern of Ceph backfill, which produces many concurrent TCP flows of varying packet sizes. Our other storage server runs continuously without any failures of this kind under similar workloads, which, together with the kernel-downgrade test, is what allowed us to isolate the firmware version as the remaining variable.
The fix was a physical NIC card replacement, performed by our hardware provider as a goodwill gesture. The replacement card came with the working firmware revision and has been stable since.
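For reference, the firmware revision at the heart of this is readable from the operating system, as is the hang signature itself. A minimal sketch of the checks involved (enp1s0 is the interface name from the kernel log above; output format varies by driver):

```bash
# Driver and firmware revision of the 10G interface; this is the value we compared
# between the stable server and storage-03
ethtool -i enp1s0 | grep -E "^(driver|firmware-version)"

# Scan the current boot's kernel messages for the Tx Unit Hang signature
journalctl -k | grep -iE "tx unit hang|tx hang detected|primary disable"
```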
What our diagnosis got wrong, and what we learned
We took longer than we should have to identify the root cause, and we want to be candid about why, because each wrong turn taught us something useful.
1. We chased Cilium memory exhaustion as a cause when it was a symptom.
Early in the incident, our Kubernetes networking component (Cilium) was crashing with out-of-memory errors. The memory limit had been set conservatively at 1 GiB cluster-wide, and a busy storage node legitimately needed more. We increased it to 4 GiB. This was a real misconfiguration we should have fixed earlier regardless of this incident, but it did not stop the crashes — because the underlying cause was the NIC pulling the rug out from under everything (when the NIC hangs, every connection-tracking and network-policy data structure suddenly becomes invalid, and Cilium's working set spikes trying to handle the chaos).
Lesson: "the most visible failing component" is not necessarily the cause. We should have validated more rigorously that fixing Cilium fixed the problem (it did not), rather than assuming we had found the cause when we had not.
2. We committed to one of two candidate variables before testing either.
When we noticed the firmware version difference, we filed a support ticket asking for a firmware update. But there was a second difference between the two servers: the Linux kernel point release. The newer server was on a slightly newer kernel that had landed in Debian's security repository.
Both kernel and firmware were plausible suspects. We ran a controlled test by downgrading the kernel — a useful negative result that ruled out the kernel and concentrated suspicion on the firmware/hardware. But we initially put more weight on the firmware hypothesis than the evidence justified. A 50/50 prior would have been more honest.
Lesson: When two variables differ between a working system and a failing one, do not silently assume which is the culprit. Run a controlled test, or explicitly state the uncertainty in the support request.
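In practice, the comparison that framed the two hypotheses fits in a few lines. A sketch, where storage-02 is an illustrative name for the existing stable server and enp1s0 is the interface from the kernel log:

```bash
# Compare the two variables (kernel point release, NIC firmware) across both servers
for host in storage-02 storage-03; do
  echo "== $host =="
  ssh "$host" 'uname -r; ethtool -i enp1s0 | grep -E "^(driver|firmware-version)"'
done
```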
3. We did not have persistent kernel logging configured on a freshly provisioned server.
Each time storage-03 crashed, the kernel ring buffer was wiped on the subsequent boot — we lost the actual panic message until we manually enabled persistent journal storage three hours into the incident. That was the diagnostic step that immediately revealed the smoking gun.
Lesson: Persistent journal storage (/var/log/journal/) should be part of our standard server provisioning. Without it, we are blind to anything a crashed kernel logged in its dying breath. We are updating our storage-node runbook to enable this in the OS preparation phase.
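For reference, the provisioning change is small. A sketch (journald's default Storage=auto persists logs once /var/log/journal exists; Storage=persistent in /etc/systemd/journald.conf makes it explicit):

```bash
# Enable persistent journald storage so kernel messages survive a crash and reboot
mkdir -p /var/log/journal
systemd-tmpfiles --create --prefix /var/log/journal
systemctl restart systemd-journald

# After the next unscheduled reboot, read the kernel log of the previous boot
journalctl -k -b -1
```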
Customer-facing actions during the incident
Our largest active uploader on the affected cluster was a customer in the middle of a multi-day data migration. We coordinated directly with them throughout the day:
- Initially we asked them to slow their upload rate to give the cluster capacity headroom. They responded within minutes and adjusted their upload pace voluntarily.
- During the maintenance window itself we further reduced their rate temporarily to minimize stress on the surviving storage server.
- After resolution we restored their full bandwidth and they have continued the migration uneventfully.
We are grateful to that customer for their cooperation, which materially reduced cluster stress while we worked the issue. This kind of partnership during incidents is exactly what we hope for, and we want to acknowledge it.
Data integrity
No customer data was lost or corrupted. Our object storage uses erasure coding with redundancy across multiple drives, and our other storage server remained healthy throughout the incident. Reads of objects whose chunks happened to span both storage servers occasionally failed during a crash window, but those reads succeeded normally on retry as soon as the node returned. No object was ever in a state where its data could not be reconstructed.
What we are changing
The narrow fix, replacing the faulty hardware, is already done. We are also making several broader improvements:
- Persistent kernel logging is now part of our default storage-node provisioning. This was the single most important diagnostic shortcut we wish we had used in hour one.
- NIC firmware version is now a documented pre-deployment check. Before adding any new storage node to the cluster, we verify the Intel firmware matches our known-good version (or newer), and request a hardware swap if it does not. We have updated our internal runbook accordingly.
- NIC ring buffer and offload tuning is now applied automatically on storage nodes via systemd. The driver defaults are sized for 1 Gigabit workloads and are insufficient for sustained 10 Gigabit traffic; we had been applying the tuning manually after each install, but a runtime change does not survive a reboot. It is now reapplied on every boot (a sketch follows this list).
- Cilium memory limits are durably bumped in our GitOps configuration rather than via direct cluster patches. The increase was correct — it was not the root cause, but it was a real undersizing — and it now lives in our infrastructure repository where it cannot be silently reverted.
- We are evaluating per-node Kubernetes-level health checks for sudden reboot loops. The cluster-side observability for "this node has rebooted N times today" was indirect during the incident; we would like a direct signal feeding our internal alerting.
- We are updating our object storage capacity policy to add new storage nodes earlier in the fill curve, rather than at the moment of urgency. The pre-existing storage host had been close to capacity at the start of the day, and the urgency to expand was what made the failure of the new node particularly painful for customers. Decoupling the two reduces future risk.
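The ring buffer and offload item above comes down to a small oneshot unit that reapplies the tuning at boot. A sketch with illustrative values; the ring sizes, offload set, and unit ordering depend on the NIC and workload, and enp1s0 is the interface name from this incident:

```bash
# Install a oneshot unit that reapplies NIC tuning on every boot
cat > /etc/systemd/system/nic-tuning.service <<'EOF'
[Unit]
Description=Apply 10G NIC ring buffer and offload tuning
After=network.target

[Service]
Type=oneshot
# Enlarge RX/TX ring buffers from the conservative driver defaults
ExecStart=/usr/sbin/ethtool -G enp1s0 rx 4096 tx 4096
# Keep segmentation/receive offloads enabled for sustained 10G throughput
ExecStart=/usr/sbin/ethtool -K enp1s0 tso on gso on gro on

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl enable --now nic-tuning.service
```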
Apology
We are sorry. Object storage uptime is the kind of thing customers should not have to think about, and on April 30 we made a number of customers think about it. The pattern of intermittent failures is also worse for trust than a single clean outage, and we recognize that. The fix is in; the lessons are documented; we will do better.
If you experienced impact during this window and have questions or concerns, please reach out to us at support@danubedata.ro and we will respond personally.