Sometimes people are looking for cargo-cult sysctl values that bring high throughput and low latency, with no trade-off, and that work on every occasion. That's not realistic, although we can say that newer kernel versions are already well tuned by default. In fact, you might hurt performance if you mess with the defaults.
This brief tutorial shows where some of the most used and quoted sysctl/network parameters fit into the Linux network flow. It was heavily inspired by the illustrated guide to the Linux networking stack and many of Marek Majkowski's posts.
Feel free to send corrections and suggestions! :)
Ingress - they're coming:
1. Packets arrive at the NIC.
2. The NIC verifies MAC (if not on promiscuous mode) and FCS and decides to drop or to continue.
3. The NIC DMAs the packets into RAM, in a region previously prepared (mapped) by the driver.
4. The NIC enqueues references to the packets in the receive ring buffer queue rx until the rx-usecs timeout or rx-frames is reached.
5. The NIC raises a hard IRQ.
6. The CPU runs the IRQ handler that runs the driver's code.
7. The driver schedules a NAPI poll, clears the hard IRQ and returns.
8. The driver raises a soft IRQ (NET_RX_SOFTIRQ).
9. NAPI polls data from the receive ring buffer until the netdev_budget_usecs timeout, or until netdev_budget and dev_weight packets have been processed.
10. Linux allocates memory for the sk_buff and passes the skb to the kernel stack (netif_receive_skb), which clones the skb to taps (i.e. tcpdump) and passes it to tc ingress.
11. Packets are handed to a qdisc sized netdev_max_backlog with its algorithm defined by default_qdisc.
12. It calls ip_rcv and packets are handed to IP.
13. It calls netfilter (PREROUTING).
14. It looks at the routing table, to forward or deliver locally; if local, it calls netfilter (LOCAL_IN).
15. It calls the L4 protocol (for instance tcp_v4_rcv), which finds the right socket and enqueues the packet into the receive buffer, sized according to the tcp_rmem rules; if tcp_moderate_rcvbuf is enabled the kernel will auto-tune the receive buffer.

Egress - they're leaving:
1. The application sends a message (sendmsg or other).
2. The skb is enqueued into the socket write buffer of tcp_wmem size.
3. The L3 handler is called (in this case ipv4, on tcp_write_xmit and tcp_transmit_skb).
4. L3 (ip_queue_xmit) does its work: builds the IP header and calls netfilter (LOCAL_OUT).
5. It calls the output route action and netfilter (POST_ROUTING).
6. It fragments the packet if needed (ip_output).
7. It calls the L2 send function (dev_queue_xmit).
8. It feeds the output (qdisc) queue of txqueuelen length, with its algorithm defined by default_qdisc.
9. The driver code enqueues the packets in the ring buffer tx.
10. The driver raises a soft IRQ (NET_TX_SOFTIRQ) after the tx-usecs timeout or tx-frames.
11. The NIC fetches the packets (via DMA) from RAM and transmits them; once done, it raises a hard IRQ to signal its completion.
12. The driver handles this IRQ and schedules (via soft IRQ) the NAPI poll system.

If you want to see the network tracing within Linux you can use perf:
docker run -it --rm --cap-add SYS_ADMIN --entrypoint bash ljishen/perf
apt-get update
apt-get install iputils-ping
# this is going to trace all events (not syscalls) to the subsystem net:* while performing the ping
perf trace --no-syscalls --event 'net:*' ping globo.com -c1 > /dev/null
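If you prefer to narrow the trace, perf can list the available tracepoints and follow just the egress path; a minimal sketch, assuming the net:net_dev_queue and net:net_dev_xmit tracepoints are present on your kernel and reusing the same ping as the workload:
# list the tracepoints exposed under the net subsystem
perf list 'net:*'
# trace only the egress side (qdisc enqueue and driver transmit) during the ping
perf trace --no-syscalls --event 'net:net_dev_queue' --event 'net:net_dev_xmit' ping globo.com -c1 > /dev/null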
Ring buffer (rx, tx) - the driver's fixed-size receive/send queues, located in RAM, where the NIC places packet descriptors.
- Check command: ethtool -g ethX
- Change command: ethtool -G ethX rx value tx value
- How to monitor: ethtool -S ethX | grep -e "err" -e "drop" -e "over" -e "miss" -e "timeout" -e "reset" -e "restar" -e "collis" | grep -v "\: 0"

Interrupt coalescence (rx-usecs, tx-usecs, rx-frames, tx-frames) - how many microseconds/frames the NIC waits before raising a hard IRQ.
- Check command: ethtool -c ethX
- Change command: ethtool -C ethX rx-usecs value tx-usecs value
- How to monitor: cat /proc/interrupts
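As a rough sketch of how these knobs are used together, assuming an interface named eth0 and a driver that exposes these settings via ethtool (the values below are illustrative, not recommendations):
# show current and hardware-maximum ring sizes ("Pre-set maximums")
ethtool -g eth0
# example only: raise the rx/tx rings, bounded by the maximums reported above
ethtool -G eth0 rx 4096 tx 4096
# example only: wait up to 64 microseconds or 32 frames before raising a hard IRQ
ethtool -C eth0 rx-usecs 64 rx-frames 32
# afterwards, keep an eye on error/drop counters
ethtool -S eth0 | grep -i -e drop -e err | grep -v ": 0"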
netdev_budget_usecs - The maximum number of microseconds in one NAPI polling cycle. Polling exits when either netdev_budget_usecs have elapsed during the poll cycle or the number of packets processed reaches netdev_budget. In /proc/net/softnet_stat the interesting counters are dropped (# of packets that were dropped because netdev_max_backlog was exceeded) and squeezed (# of times ksoftirq ran out of netdev_budget or time slice with work remaining).
- Check command: sysctl net.core.netdev_budget_usecs
- Change command: sysctl -w net.core.netdev_budget_usecs value
- How to monitor: cat /proc/net/softnet_stat; or a better tool
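Since /proc/net/softnet_stat is also the monitor for netdev_budget, dev_weight and netdev_max_backlog below, a tiny parser helps; this sketch assumes the classic column layout (1st column processed, 2nd dropped, 3rd time_squeeze, all in hex), which may vary between kernel versions, and it needs gawk for strtonum:
# one row per CPU; dropped and squeezed are the counters that matter for tuning
awk '{ printf "cpu=%-3d processed=%-10d dropped=%-8d squeezed=%d\n", NR-1, strtonum("0x"$1), strtonum("0x"$2), strtonum("0x"$3) }' /proc/net/softnet_stat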
netdev_budget - The maximum number of packets taken from all interfaces in one polling cycle (NAPI poll). In one polling cycle interfaces which are registered to polling are probed in a round-robin manner. Also, a polling cycle may not exceed netdev_budget_usecs microseconds, even if netdev_budget has not been exhausted.
- Check command: sysctl net.core.netdev_budget
- Change command: sysctl -w net.core.netdev_budget value
- How to monitor: cat /proc/net/softnet_stat; or a better tool
dev_weight - The maximum number of packets that the kernel can handle on a NAPI interrupt; it is a per-CPU variable. For drivers that support LRO or GRO_HW, a hardware aggregated packet is counted as one packet in this context.
- Check command: sysctl net.core.dev_weight
- Change command: sysctl -w net.core.dev_weight value
- How to monitor: cat /proc/net/softnet_stat; or a better tool
netdev_max_backlog - The maximum number of packets queued on the INPUT side (the ingress qdisc) when the interface receives packets faster than the kernel can process them.
- Check command: sysctl net.core.netdev_max_backlog
- Change command: sysctl -w net.core.netdev_max_backlog value
- How to monitor: cat /proc/net/softnet_stat; or a better tool
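Values set with sysctl -w are lost on reboot; if, after watching softnet_stat, you decide to raise these limits, one way to persist them is a drop-in file under /etc/sysctl.d (the file name and the numbers below are placeholders to illustrate the mechanism, not suggested values):
# hypothetical example: persist larger polling budgets and a bigger ingress backlog
cat <<'EOF' > /etc/sysctl.d/90-network-tuning.conf
net.core.netdev_budget = 600
net.core.netdev_budget_usecs = 8000
net.core.netdev_max_backlog = 3000
EOF
# apply without rebooting
sysctl -p /etc/sysctl.d/90-network-tuning.conf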
txqueuelen - The maximum number of packets queued on the OUTPUT side.
- Check command: ip link show dev ethX
- Change command: ip link set dev ethX txqueuelen N
- How to monitor: ip -s link
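Since the qdisc sits on top of txqueuelen, its per-queue counters are often more telling than ip -s link alone; a quick sketch, again assuming eth0 and an example value:
# example only: enlarge the device transmit queue
ip link set dev eth0 txqueuelen 2000
# "dropped" and "overlimits" growing here usually means the queue is too small
# for the traffic bursts, or the link is simply saturated
tc -s qdisc show dev eth0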
default_qdisc - The default queuing discipline to use for network devices.
- Check command: sysctl net.core.default_qdisc
- Change command: sysctl -w net.core.default_qdisc value
- How to monitor: tc -s qdisc ls dev ethX
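Note that changing net.core.default_qdisc only affects qdiscs created after the change (for example when an interface comes up); for an interface that is already up you can replace the root qdisc directly. A sketch assuming eth0 and fq_codel (pick whatever discipline fits your workload):
# make fq_codel the default for newly attached qdiscs
sysctl -w net.core.default_qdisc=fq_codel
# apply it immediately to an interface that is already up
tc qdisc replace dev eth0 root fq_codel
# verify and watch its statistics
tc -s qdisc show dev eth0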
The policy that defines what counts as memory pressure is specified by tcp_mem and tcp_moderate_rcvbuf.
tcp_rmem - min (size used under memory pressure), default (initial size), max (maximum size) - size of the receive buffer used by TCP sockets.
- Check command: sysctl net.ipv4.tcp_rmem
- Change command: sysctl -w net.ipv4.tcp_rmem="min default max"; when changing the default value, remember to restart your user space app (i.e. your web server, nginx, etc)
- How to monitor: cat /proc/net/sockstat
tcp_wmem - min (size used under memory pressure), default (initial size), max (maximum size) - size of the send buffer used by TCP sockets.
- Check command: sysctl net.ipv4.tcp_wmem
- Change command: sysctl -w net.ipv4.tcp_wmem="min default max"; when changing the default value, remember to restart your user space app (i.e. your web server, nginx, etc)
- How to monitor: cat /proc/net/sockstat
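For example, to let TCP auto-tuning grow each socket buffer up to 16 MiB while keeping modest min/default values (the numbers are purely illustrative; also note that buffers set explicitly by applications via SO_RCVBUF/SO_SNDBUF are capped by net.core.rmem_max/wmem_max instead of these settings):
# min, default and max are in bytes; only max is raised aggressively here
sysctl -w net.ipv4.tcp_rmem="4096 131072 16777216"
sysctl -w net.ipv4.tcp_wmem="4096 16384 16777216"
# verify
sysctl net.ipv4.tcp_rmem net.ipv4.tcp_wmem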
tcp_moderate_rcvbuf - If set, TCP performs receive buffer auto-tuning, attempting to automatically size the buffer.
- Check command: sysctl net.ipv4.tcp_moderate_rcvbuf
- Change command: sysctl -w net.ipv4.tcp_moderate_rcvbuf value
- How to monitor: cat /proc/net/sockstat
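The mem field reported by /proc/net/sockstat is in pages, not bytes, so converting it makes it easier to relate to tcp_mem and to the buffer sizes above; a small sketch:
# the TCP line looks like "TCP: inuse ... mem N", where N is pages used by TCP buffers
PAGE_SIZE=$(getconf PAGESIZE)
awk -v page="$PAGE_SIZE" '/^TCP:/ { for (i=1; i<=NF; i++) if ($i == "mem") printf "tcp buffer memory: %d pages (%.1f MiB)\n", $(i+1), $(i+1) * page / 1048576 }' /proc/net/sockstat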
Accept and SYN Queues are governed by net.core.somaxconn and net.ipv4.tcp_max_syn_backlog. Nowadays net.core.somaxconn caps both queue sizes.
- sysctl net.core.somaxconn - provides an upper limit on the value of the backlog parameter passed to the listen() function, known in userspace as SOMAXCONN. If you change this value, you should also change your application to a compatible value (i.e. nginx backlog).
- cat /proc/sys/net/ipv4/tcp_fin_timeout - specifies the number of seconds to wait for a final FIN packet before the socket is forcibly closed. This is strictly a violation of the TCP specification but required to prevent denial-of-service attacks.
- cat /proc/sys/net/ipv4/tcp_available_congestion_control - shows the available congestion control choices that are registered.
- cat /proc/sys/net/ipv4/tcp_congestion_control - sets the congestion control algorithm to be used for new connections.
- cat /proc/sys/net/ipv4/tcp_max_syn_backlog - sets the maximum number of queued connection requests which have still not received an acknowledgment from the connecting client; if this number is exceeded, the kernel will begin dropping requests.
- cat /proc/sys/net/ipv4/tcp_syncookies - enables/disables SYN cookies, useful for protecting against SYN flood attacks.
- cat /proc/sys/net/ipv4/tcp_slow_start_after_idle - enables/disables TCP slow start after an idle period.

How to monitor:
- netstat -atn | awk '/tcp/ {print $6}' | sort | uniq -c - summary by state
- ss -neopt state time-wait | wc -l - counters for a specific state: established, syn-sent, syn-recv, fin-wait-1, fin-wait-2, time-wait, closed, close-wait, last-ack, listening, closing
- netstat -st - tcp stats summary
- nstat -a - human-friendly tcp stats summary
- cat /proc/net/sockstat - summarized socket stats
- cat /proc/net/tcp - detailed stats, see each field meaning at the kernel docs
- cat /proc/net/netstat - ListenOverflows and ListenDrops are important fields to keep an eye on (see the sketch after this list)
- cat /proc/net/netstat | awk '(f==0) { i=1; while ( i<=NF) {n[i] = $i; i++ }; f=1; next} (f==1){ i=2; while ( i<=NF){ printf "%s = %d\n", n[i], $i; i++}; f=0} ' | grep -v "= 0" - a human readable /proc/net/netstat
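To tie these counters back to the accept/SYN queue limits above, a short sketch (assuming a listener on port 80; ss and nstat come from iproute2): for LISTEN sockets ss reports the current accept queue length in Recv-Q and the configured backlog in Send-Q, while nstat prints deltas since its last run by default, so -a is used here for absolute values.
# accept queue usage (Recv-Q) vs configured backlog (Send-Q) for the listener
ss -ltn 'sport = :80'
# absolute values of the overflow/drop counters also found in /proc/net/netstat
nstat -az TcpExtListenOverflows TcpExtListenDrops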