Lately lots of new fascinating technologies are used to build even more fascinating new solutions, and solutions nowadays even run on distributed environments, not just on single server. These servers usually communicate using TCP – standard, that has been here long before gigabit (or ten megabit) ethernet or dawn of LAMP, and needs to be kicked a bit, to work properly. In our environment we have over hundred of application servers which handle quite a number of requests, and put quite demanding stress on database servers. For that we had to change some of default kernel settings – and it did improve situation a bit. Here I’ll try to overview some of them…
One of first evil limitations we faced was the connection backlog – at higher loads incoming rates of DB connections ( more than 2000/s ) didn’t like when accept()’ing thread was delayed – long queue formed for it. Even when system is fast enough to switch contexts and allows accepting new connections, resource thrashing happens, as it has to deal with smaller backlog too often.
The default backlog setting is just 50 connections (it can be changed by mysqld’s back_log parameter), so it is usually wise to raise it even to few hundred or a thousand.
One should not note, that Linux is sneaky and has ‘somaxconn’ limit, which is set to 128 (at least on my 2.6.11 kernel). Additionally, there’s a queue of connection attempts, that can be negotiated at kernel level.
These sysctl (or /proc/sys/) settings should help with such limits:
net.ipv4.tcp_max_syn_backlog = 4096
net.core.somaxconn = 1024
Additionally one may offload network buffers into kernel space, hense avoiding thread context switching again:
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
# minimum, default and maximum values
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
# Interface buffering
net.core.netdev_max_backlog = 2500
# Transmit queue length for gigabit interfaces
ifconfig eth0 txqueuelen 1000
And one of final touches is removing retries and increasing keepalive rates. Though MySQL’s virtual I/O (vio) layer does have settings on connection timeouts, sometimes having connects timeout immediately, without even retrying, may be good idea – at least in gigabit LANs.
net.ipv4.tcp_syn_retries = 0
net.ipv4.tcp_synack_retries = 0
net.ipv4.tcp_keepalive_time = 30
net.ipv4.tcp_keepalive_intvl = 1
net.ipv4.tcp_keepalive_probes = 2
These settings may be used on other internal servers too – though some of TCP/IP magic, such like synack retries should be kept at higher numbers.
Most of tuning guides or manuals on the net stress LFN – long distance, fat network optimizations. All internal databases need just as lean transport layer as possible. UDP based transport and attitude may add even more to efficiency here.
There’re few issues that could be used to increase network efficiency – like employing socket filters on FreeBSD and initiating connection handshake at kernel level. As well, subsecond timeouts would allow failing much more gracefully.
Usually it takes just few milliseconds to detect a host failure, but most of TCP/IP stack thinks about long distance connections, rather than most of it’s use is inside datacenters. Efficiency of internal dataflow may be much more important than any external network tuning – and it has to be worked on.
- TCP Tuning guide, by Berkeley Lab