May 31, 2026
Here are the five most common problems I encounter during infrastructure audits. They usually only reveal themselves under heavy load (like Black Friday or ad campaign launch days).
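The default open-files limit (often 1024) is a killer for web servers and databases, because every socket and every open file consumes a descriptor. Always check `ulimit -n`, then raise the limit in `/etc/security/limits.conf` for login sessions, or with `LimitNOFILE=...` in the unit file for services started by systemd (which does not go through `limits.conf`).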
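As a rough sketch of the check described above, the script below prints the soft and hard open-files limits of the current shell and warns when the soft limit looks low. The threshold `MIN_NOFILE=65536` is an assumption for illustration, not a recommendation from any particular vendor.

```shell
#!/usr/bin/env bash
# Report the soft and hard open-files limits of the current shell.
soft=$(ulimit -Sn)
hard=$(ulimit -Hn)
echo "open files: soft=$soft hard=$hard"

# Hypothetical threshold for this sketch; pick a value that fits your workload.
MIN_NOFILE=65536

# ulimit may print "unlimited", which cannot be compared numerically.
if [ "$soft" != "unlimited" ] && [ "$soft" -lt "$MIN_NOFILE" ]; then
    echo "WARNING: soft limit $soft is below $MIN_NOFILE"
fi
```

Keep in mind that this only describes the shell you run it in. For a process that is already running, `/proc/<pid>/limits` shows the limits it actually received.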
Database servers hate swapping, but turning swap off completely is an invitation for the OOM killer during a sudden memory spike. Leave a small swap area, but set `vm.swappiness=1` (or `10`) so the system only uses it as a last resort.
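A minimal sketch of how this could be checked and applied. The script itself only reads the current value, so it is safe to run anywhere; the commands that change the setting need root and are left as comments. The file name `99-swappiness.conf` is an assumption, any name under `/etc/sysctl.d/` ending in `.conf` should do.

```shell
#!/usr/bin/env bash
# Read the current swappiness; fall back to "unknown" where /proc is unavailable.
current=$(cat /proc/sys/vm/swappiness 2>/dev/null || echo unknown)
echo "vm.swappiness=$current"

# To change it (as root):
#   sysctl -w vm.swappiness=1
#       applies immediately, but is lost on reboot
#   echo 'vm.swappiness = 1' > /etc/sysctl.d/99-swappiness.conf
#       makes it persistent (file name is an assumption)
#   sysctl --system
#       reloads all sysctl configuration files
```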
The application writes to disk, the disk fills up to 100%, and the database stops. Cap Docker container logs (`max-size` under `log-opts` in `daemon.json`) and configure `logrotate` for application logs.
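The two settings mentioned above could look roughly like this. The sketch writes both files into a scratch directory so it is harmless to run; after review they would go to `/etc/docker/daemon.json` and `/etc/logrotate.d/`. The sizes, retention counts, and the `/var/log/myapp/` path are assumptions for illustration.

```shell
#!/usr/bin/env bash
set -eu

# Stage the files in a temporary directory instead of touching /etc directly.
out=$(mktemp -d)

# Docker: cap each container's json-file log at 10 MB, keep 3 rotated files.
# Intended target: /etc/docker/daemon.json
cat > "$out/daemon.json" <<'EOF'
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
  }
}
EOF

# logrotate: rotate application logs daily and keep one week, compressed.
# Intended target: /etc/logrotate.d/myapp (hypothetical service name)
cat > "$out/myapp.logrotate" <<'EOF'
/var/log/myapp/*.log {
    daily
    rotate 7
    compress
    missingok
    notifempty
    copytruncate
}
EOF

echo "Staged config files in $out"
```

Note that the Docker daemon has to be restarted to pick up `daemon.json`, and the new log options only apply to containers created afterwards.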
A script hanging on a network connection blocks subsequent executions. Use the `timeout` command to kill a hung process after a specified time.
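A small demonstration of the behavior, assuming GNU coreutils `timeout`. Here `sleep 5` stands in for the hung script and the one-second limit is chosen only to keep the demo short.

```shell
#!/usr/bin/env bash
# `sleep 5` plays the role of a job stuck on a network connection.
# timeout sends SIGTERM once the 1-second limit is reached.
status=0
timeout 1 sleep 5 || status=$?

# GNU timeout exits with 124 when the command was stopped by the time limit.
echo "exit status: $status"

if [ "$status" -eq 124 ]; then
    echo "job timed out and was terminated"
fi
```

In a scheduled job the same idea would be something like `timeout 240 /usr/local/bin/sync.sh` (a hypothetical script path). If the process might ignore SIGTERM, the `-k` option adds a follow-up SIGKILL after a grace period, for example `timeout -k 10 240 ...`.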
Running services as root is a classic, but still common. Always create a dedicated user for the service, e.g. `useradd -r -s /bin/false myapp`.
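One possible way to enforce this, sketched as a guard that a service start script could run before doing anything else. The helper name `is_root_uid` is made up for this example, and the sketch only warns rather than exits.

```shell
#!/usr/bin/env bash
# Return success (0) when the given numeric UID is root, failure otherwise.
is_root_uid() {
    [ "$1" -eq 0 ]
}

# id -u prints the effective UID of the current process.
if is_root_uid "$(id -u)"; then
    echo "WARNING: running as root; start this service as a dedicated user" >&2
fi
```

With systemd, setting `User=myapp` in the `[Service]` section of the unit makes the service run as the dedicated account without needing a wrapper script.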