This blog records some useful performance tools for Linux.
reference
- Understanding Linux processes, process groups, sessions, and zombies in one article (一文读懂Linux进程、进程组、会话、僵尸)
- Why does a parent process in Linux need to know the reason for its child's death? (Linux中父进程为何要苦苦地知道子进程的死亡原因?)
- A systematic analysis of high load on Linux (浅谈 Linux 高负载的系统化分析)
- Causes of high load with low CPU utilization (CPU 使用率低高负载的原因)
- Basics of the Linux file system and files (Linux系统文件系统及文件基础篇)
- How to use the awk tool in Linux (如何在Linux中使用awk工具详解)
- Master the lsof command on Linux through 10 examples (通过10个例子掌握Linux下lsof命令)
- Learning operating systems (学习操作系统)
Linux performance tools
CPU
What is the Linux system load? You can refer to this article or this one.
Linux load averages are "system load averages" that show the running thread (task) demand on the system as the average number of running plus waiting threads, i.e. tasks in the R (Running) and D (Uninterruptible Sleep) states.
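As a quick check, the load averages can be compared against the number of CPUs; a minimal sketch using standard procfs/coreutils commands (the "load ≈ CPU count" threshold is a rule of thumb, not a hard limit):

```
uptime              # 1-, 5-, 15-minute load averages
cat /proc/loadavg   # same numbers, plus runnable/total tasks and the last PID
nproc               # number of CPUs; load close to nproc means the CPUs are fully busy
```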
stress sysstat
We can run some tests with the stress command:
```
apt install stress sysstat

# stress 1 cpu
stress --cpu 1 --timeout 600

# check uptime load
watch -d uptime

# check status of ALL CPUs
mpstat -P ALL 5

# check per-process / per-thread CPU usage
pidstat -u 5 1
```
A thread is the basic unit of scheduling, while a process is the basic unit of resource ownership.
- Thread
  - the smallest sequence of programmed instructions that can be managed independently by a scheduler
  - has its own registers, e.g. PC (program counter) and SP (stack pointer)
- Process
  - an instance of a computer program that is being executed
  - a process can have one or multiple threads
  - most programs are single-threaded
- Parallel computing
  - run programs concurrently on one or more CPUs
  - multi-threading (shared memory)
  - multi-processing (independent memory)
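To see how threads and processes show up on a running system, a minimal sketch (ps and pidstat come from the procps and sysstat packages already used above):

```
# list tasks with their threads: the LWP column is the thread ID, PID the process
ps -eLf | head

# per-thread CPU usage (-t adds threads to pidstat's per-process view)
pidstat -u -t 5 1
```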
context switch
```
# interval 5 s, output 1 report
vmstat 5 1
```
vmstat only provides the system-wide context-switch count; pidstat -w shows per-process context switches.
```
# interval 5 s, output 1 report
pidstat -w 5 1
```
We use sysbench to simulate multi-threaded scheduling and context switching.
```
sysbench --threads=10 --max-time=300 threads run
```
cpu utilization
```
cat /proc/stat | grep ^cpu
```
Many processes in Z (zombie) state
```
top
```
```
# find the parent of process 3084
```
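A common way to locate the parent that is not reaping its children is pstree (from the psmisc package); the PID below is just the example from above:

```
# -a: show command lines, -p: show PIDs, -s: show the parents of process 3084
pstree -aps 3084
```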
soft interrupt
Linux divides interrupt handling into two phases, the top half and the bottom half. The top half handles the interrupt quickly: it runs with interrupts disabled and mainly deals with hardware-related or time-critical work. The bottom half defers the work the top half did not finish and usually runs as a kernel thread.
```
# /proc/softirqs shows how soft interrupts have run on each CPU
cat /proc/softirqs
```
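To see which soft-interrupt types are changing fastest, and which kernel threads service them, a minimal sketch (watch -d highlights the counters that change between refreshes):

```
# highlight changing soft-interrupt counters
watch -d cat /proc/softirqs

# the per-CPU kernel threads that run the bottom halves
ps aux | grep ksoftirqd
```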
summary
Memory
buffer & cache
Buffers are temporary storage of raw disk blocks, that is, used to cache data on the disk, usually not particularly large (about 20MB). In this way, the kernel can centralize scattered writes and optimize disk writes in a unified manner. For example, multiple small writes can be combined into a single large write.
Cached is a page cache that reads files from disk, that is, it is used to cache data read from files. This way, the next time you access these file data, you can quickly get it directly from memory without having to access the slow disk again.
SReclaimable is part of Slab. Slab consists of two parts, the recyclable part is recorded with SReclaimable; the non-recyclable part is recorded with SUnreclaim.
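All three values can be read from free or /proc/meminfo; a minimal sketch (on recent procps the buff/cache column of free combines Buffers, Cached, and SReclaimable):

```
free -m
grep -E '^(Buffers|Cached|SReclaimable|SUnreclaim):' /proc/meminfo
```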
```
# Before using cachestat and cachetop, we must first install the bcc package
```
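A sketch of installing and running them; package and tool names vary across distributions (on recent Ubuntu the package is bpfcc-tools and the tools carry a -bpfcc suffix), so treat the exact names below as assumptions:

```
# Ubuntu (package / tool names are assumptions and vary by release)
apt-get install -y bpfcc-tools linux-headers-$(uname -r)

# system-wide page cache hit ratio, refreshed every second
cachestat-bpfcc 1
# per-process page cache hits, top-like view
cachetop-bpfcc 1
```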
To check the cache for a specific file, we can use pcstat.
```
export GOPATH=~/go
export PATH=~/go/bin:$PATH
go get golang.org/x/sys/unix
go get github.com/tobert/pcstat/pcstat

# first try
pcstat /bin/ls
+---------+----------------+------------+-----------+---------+
| Name    | Size (bytes)   | Pages      | Cached    | Percent |
|---------+----------------+------------+-----------+---------|
| /bin/ls | 133792         | 33         | 0         | 000.000 |
+---------+----------------+------------+-----------+---------+

# second try
ls
pcstat /bin/ls
+---------+----------------+------------+-----------+---------+
| Name    | Size (bytes)   | Pages      | Cached    | Percent |
|---------+----------------+------------+-----------+---------|
| /bin/ls | 133792         | 33         | 33        | 100.000 |
+---------+----------------+------------+-----------+---------+
```
memory leak
Check overall memory usage with vmstat.
```
vmstat 3
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  0      0 6601824  97620 1098784    0    0     0     0   62  322  0  0 100  0  0
 0  0      0 6601700  97620 1098788    0    0     0     0   57  251  0  0 100  0  0
 0  0      0 6601320  97620 1098788    0    0     0     3   52  306  0  0 100  0  0
 0  0      0 6601452  97628 1098788    0    0     0    27   63  326  0  0 100  0  0
 2  0      0 6601328  97628 1098788    0    0     0    44   52  299  0  0 100  0  0
 0  0      0 6601080  97628 1098792    0    0     0     0   56  285  0  0 100  0  0
```
The free column is constantly changing and trending downward, while buff and cache remain basically unchanged.
memleak is a tool in the bcc package.
```
/usr/share/bcc/tools/memleak -a -p $(pidof app)
WARNING: Couldn't find .text section in /app
WARNING: BCC can't handle sym look ups for /app
        addr = 7f8f704732b0 size = 8192
        addr = 7f8f704772d0 size = 8192
        addr = 7f8f704712a0 size = 8192
        addr = 7f8f704752c0 size = 8192
        32768 bytes in 4 allocations from stack
                [unknown] [app]
                [unknown] [app]
                start_thread+0xdb [libpthread-2.27.so]

# [unknown] appears because app runs inside Docker
# copy the executable out of the container, then check again
docker cp app:/app /app
/usr/share/bcc/tools/memleak -p $(pidof app) -a
Attaching to pid 12512, Ctrl+C to quit.
[03:00:41] Top 10 stacks with outstanding allocations:
        addr = 7f8f70863220 size = 8192
        addr = 7f8f70861210 size = 8192
        addr = 7f8f7085b1e0 size = 8192
        addr = 7f8f7085f200 size = 8192
        addr = 7f8f7085d1f0 size = 8192
        40960 bytes in 5 allocations from stack
                fibonacci+0x1f [app]
                child+0x4f [app]
                start_thread+0xdb [libpthread-2.27.so]
```
swap
Data that has been modified by an application but not yet written back to disk (i.e. dirty pages) must be written to disk before that memory can be released. These dirty pages can generally be written to disk in two ways:
- You can use the system call fsync in your application to synchronize dirty pages to disk;
- It can also be left to the system, and the kernel thread pdflush is responsible for refreshing these dirty pages.
Swap writes this infrequently accessed memory to disk and then releases it for other processes that need it more. When that memory is accessed again, it is simply re-read from disk.
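Writeback and swap behaviour can be inspected through sysctl; a minimal sketch (the values you will see are distribution defaults, not recommendations):

```
# dirty-memory percentages at which background / blocking writeback kicks in
sysctl vm.dirty_background_ratio vm.dirty_ratio

# how aggressively the kernel swaps (0-100, higher = more willing to swap)
sysctl vm.swappiness

# current swap devices and overall usage
swapon --show
free -m
```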
In the NUMA architecture, multiple processors are divided into different nodes, and each node has its own local memory space.
```
numactl --hardware
```
summary
What’s the difference between a file system and a disk?
A disk is a storage device (to be exact, a block device) that can be divided into different disk partitions. On a disk or disk partition, you can also create a file system and mount it in a directory on the system. In this way, the system can read and write files through this mount directory.
In other words, a disk is a block device that stores data and is the carrier of a file system. Therefore, the file system still needs to ensure the persistent storage of data through disk.
You will see this sentence in many places: everything in Linux is a file. In other words, you can access disks and files through the same file interface (open, read, write, close, etc.).
What we usually call “files” are actually regular files.
The disk or partition refers to the block device file.
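A quick way to see the disk → partition → file system → mount point chain on a running machine (standard util-linux/coreutils commands; device names will differ per system):

```
lsblk -f               # block devices, their partitions, file system type and mount point
df -hT                 # mounted file systems with type and usage
mount | grep '^/dev'   # how each block device file is mounted
```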
I-O
The Linux file system allocates two data structures for each file, an index node (inode) and a directory entry (dentry). They are mainly used to record the file's metadata and the directory structure.
```
# slabtop - display kernel slab cache information in real time
slabtop
```
check io
```
iostat -d -x 1
```
- %util is the disk I/O utilization (the percentage of time the device was busy handling I/O)
- r/s + w/s is IOPS
- rkB/s + wkB/s is the throughput
- r_await + w_await is the response time
```
# check per-process I/O
pidstat -d 1
```
write issue
```
top
```
io_wait issue
```
top
```
sql slow
```
# use top to check overview
```
slow redis
```
top
```
summary
Network
- Bandwidth, which indicates the maximum transmission rate of the link. The unit is usually b/s (bits per second).
- Throughput, which indicates the amount of data successfully transmitted per unit of time. The unit is usually b/s (bits/second) or B/s (bytes/second). Throughput is limited by bandwidth, and throughput/bandwidth is the utilization of the network.
- Latency (delay) means the time from when a network request is sent until the remote response is received.
- PPS is the abbreviation of Packet Per Second (packet per second), which means the transmission rate in network packets.
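These metrics can be sampled with sar (from the sysstat package installed earlier); a minimal sketch (the interface name is just an example):

```
# per-interface statistics every second:
# rxpck/s and txpck/s are PPS, rxkB/s and txkB/s are throughput
sar -n DEV 1

# link speed (bandwidth) of an interface
ethtool eth0 | grep -i speed
```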
basic command
```
root@ubuntu:/home/feiyang# ifconfig
```
- errors indicates the number of packets with errors, such as checksum errors, frame synchronization errors, etc.;
- dropped indicates the number of dropped packets, i.e. packets that already reached the ring buffer but were dropped due to insufficient memory or other reasons;
- overruns indicates the number of overrun packets, i.e. the network I/O was so fast that packets in the ring buffer could not be processed in time (the queue was full) and were lost;
- carrier indicates the number of packets with carrier errors, e.g. a duplex-mode mismatch or problems with the physical cable;
- collisions indicates the number of collision packets.
```
root@ubuntu:/home/feiyang# netstat -nlp | head -n 3
```
When the socket is connected (Established):
- Recv-Q is the number of bytes in the socket buffer that the application has not yet read (the length of the receive queue).
- Send-Q is the number of bytes that have not yet been acknowledged by the remote host (the length of the send queue).

When the socket is listening (Listening):
- Recv-Q is the current value of the syn backlog.
- Send-Q is the maximum syn backlog value.

The syn backlog is the length of the semi-connected (SYN) queue in the TCP protocol stack; correspondingly there is also a fully-connected queue (accept queue).
```
# netstat
```
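A sketch of checking the queues and their limits with ss and sysctl (ss comes from iproute2; Recv-Q/Send-Q for listening sockets are interpreted as described above):

```
# -l listening, -t TCP, -n numeric: shows Recv-Q / Send-Q per listening socket
ss -ltn

# half-open (SYN) queue limit and accept-queue limit
sysctl net.ipv4.tcp_max_syn_backlog net.core.somaxconn
```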
C10K and C1000K
- select or poll Apache
- epoll Nginx
- Asynchronous I/O
- one master, multiple worker
- multiple processes listening on the same port (requires SO_REUSEPORT)
- C1000K’s solution is essentially built on epoll’s non-blocking I/O model.
- C10M: To solve this problem, the most important thing is to skip the lengthy path of the kernel protocol stack and send the network packets directly to the application to be processed. There are two common mechanisms here, DPDK and XDP.
- DPDK is the standard for user mode networks. It skips the kernel protocol stack and directly processes the network reception by the user mode process through polling.
- XDP (eXpress Data Path) is a high-performance network data path provided by the Linux kernel. It allows network packets to be processed before entering the kernel protocol stack, which can also bring higher performance. The bottom layer of XDP, like the bcc-tools we used before, is implemented based on the eBPF mechanism of the Linux kernel.
performance test
```
# enable pktgen
```
do some tests
```
# define function
```
TCP test
```
# Ubuntu
```
HTTP test
```
# Ubuntu
```
App test
```
git clone https://github.com/wg/wrk
cd wrk
apt-get install build-essential -y
make
sudo cp wrk /usr/local/bin/

wrk -c 1000 -t 2 http://192.168.0.30/
Running 10s test @ http://192.168.0.30/
  2 threads and 1000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    65.83ms  174.06ms   1.99s   95.85%
    Req/Sec     4.87k   628.73     6.78k    69.00%
  96954 requests in 10.06s, 78.59MB read
  Socket errors: connect 0, read 0, write 0, timeout 179
Requests/sec:   9641.31
Transfer/sec:      7.82MB

# set up additional parameters with a Lua script (auth.lua)
-- example script that demonstrates response handling and
-- retrieving an authentication token to set on all future
-- requests
token = nil
path  = "/authenticate"

request = function()
   return wrk.format("GET", path)
end

response = function(status, headers, body)
   if not token and status == 200 then
      token = headers["X-Token"]
      path  = "/resource"
      wrk.headers["X-Token"] = token
   end
end

wrk -c 1000 -t 2 -s auth.lua http://192.168.0.30/
```
DNS slow
- A record, used to translate the domain name into an IP address;
- CNAME record for creating aliases;
- The NS record indicates the name server address corresponding to the domain name.
```
root@ubuntu:/home/feiyang/wrk# dig +trace +nodnssec feiyang233.club
```
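Individual record types can also be queried directly; a minimal sketch (the domain and the resolver are just the examples used elsewhere in this post):

```
dig A  feiyang233.club          # address record
dig NS feiyang233.club          # authoritative name servers
dig @1.1.1.1 feiyang233.club    # query a specific DNS server
```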
- no /etc/resolv.conf
```
nslookup feiyang233.club
;; connection timed out; no servers could be reached

# add DNS
echo "nameserver 1.1.1.1" > /etc/resolv.conf
```
- DNS unstable
The DNS server itself has problems; the response is slow and unstable.
```
time nslookup time.geekbang.org
Server:		8.8.8.8
Address:	8.8.8.8#53

Non-authoritative answer:
Name:	time.geekbang.org
Address: 39.106.233.176

real	0m10.349s
user	0m0.004s
sys	0m0.0

time nslookup time.geekbang.org
;; connection timed out; no servers could be reached

real	0m15.011s
user	0m0.006s
sys	0m0.006s
```
The network delay from the client to the DNS server is relatively large;
DNS request or response packets are lost by network devices on the link in some cases.
```
ping -c3 8.8.8.8
```
dump traffic
The purpose of PTR reverse address resolution is to find out the domain name from the IP address, but in fact, not all IP addresses will define PTR records, so PTR queries are likely to fail.
```
# check PTR
```
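A reverse (PTR) lookup can be done with dig -x or nslookup; a minimal sketch (the IP is just the public resolver used above, and the query fails if no PTR record is defined):

```
dig -x 8.8.8.8 +short
nslookup 8.8.8.8
```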
```
# tcpdump output format
Timestamp  Protocol  Source-Address.Source-Port > Destination-Address.Destination-Port  Packet-Details
```
anti-DDoS
DDoS(Distributed Denial of Service)
- Running out of bandwidth
- Running out of operating system resources
- Running out of application resources
```
# -S set TCP SYN, -p port 80
```
```
# check SYN_RECV
```
In a Linux server, you can increase the anti-attack capability of the server and reduce the impact of DDoS on normal services through various methods such as kernel tuning, DPDK, and XDP. In the application, you can use various levels of caching, WAF, CDN and other methods to mitigate the impact of DDoS on the application.
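A minimal kernel-tuning sketch for the SYN-flood case (standard sysctls plus an iptables rate limit; the exact values are illustrative, not recommendations):

```
# SYN cookies let the server answer SYNs without reserving a half-open slot
sysctl -w net.ipv4.tcp_syncookies=1

# fewer SYN+ACK retries, larger half-open queue
sysctl -w net.ipv4.tcp_synack_retries=1
sysctl -w net.ipv4.tcp_max_syn_backlog=1024

# limit the overall rate of incoming SYNs (illustrative numbers)
iptables -A INPUT -p tcp --syn -m limit --limit 10/s --limit-burst 20 -j ACCEPT
iptables -A INPUT -p tcp --syn -j DROP
```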
network slow
```
traceroute --tcp -p 80 -n baidu.com
```
Open nginx.pcap in Wireshark, go to Statistics -> Flow Graph, select “Limit to display filter” and set Flow type to “TCP Flows”:
The blue area is very slow and costs 40 ms; 40 ms is the minimum timeout of TCP delayed acknowledgement (delayed ACK).
Delayed ACK is an optimization of TCP acknowledgements: instead of sending an ACK for every packet, the receiver waits for a while (e.g. 40 ms); if there are other packets to send during this period, the ACK is piggybacked on them, and only if nothing else is sent does the ACK go out on its own after the timeout.
```
# TCP_QUICKACK (since Linux 2.4.4)
```
The Nagle algorithm is an optimization used in the TCP protocol to reduce the number of small packets sent, in order to improve the utilization of the actual bandwidth. The Nagle algorithm specifies that there can be at most one unacknowledged outstanding packet on a TCP connection; no other small packets are sent until the ACK for this packet is received.
```
# TCP_NODELAY
```
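One way to verify whether a process actually sets these socket options is to trace its setsockopt calls; a sketch (strace is standard, the PID is a hypothetical example):

```
# watch socket-option calls of a running process
strace -f -e trace=setsockopt -p 12345
# look for TCP_NODELAY or TCP_QUICKACK in the traced arguments
```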
NAT
Network Address and Port Translation (NAPT)
```
# SNAT
# MASQUERADE: change the outgoing source IP to the Linux WAN IP
iptables -t nat -A POSTROUTING -s 192.168.0.0/16 -j MASQUERADE
iptables -t nat -A POSTROUTING -s 192.168.0.2 -j SNAT --to-source 100.100.100.100

# DNAT
iptables -t nat -A PREROUTING -d 100.100.100.100 -j DNAT --to-destination 192.168.0.2

iptables -t nat -A POSTROUTING -s 192.168.0.2 -j SNAT --to-source 100.100.100.100
iptables -t nat -A PREROUTING -d 100.100.100.100 -j DNAT --to-destination 192.168.0.2

# enable the forwarding function
# sysctl net.ipv4.ip_forward
# net.ipv4.ip_forward = 1
sysctl -w net.ipv4.ip_forward=1

# open files
ulimit -n
1024
# increase to 65536
ulimit -n 65536

ab -c 5000 -n 100000 -r -s 2 http://192.168.0.30/
...
Requests per second:    6576.21 [#/sec] (mean)
Time per request:       760.317 [ms] (mean)
Time per request:       0.152 [ms] (mean, across all concurrent requests)
Transfer rate:          5390.19 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0   177  714.3      9    7338
Processing:     0    27   39.8     19     961
Waiting:        0    23   39.5     16     951
Total:          1   204  716.3     28    7349

# run a new test container
docker run --name nginx --privileged -p 8080:8080 -itd feisky/nginx:nat

iptables -nL -t nat
Chain PREROUTING (policy ACCEPT)
target     prot opt source               destination
DOCKER     all  --  0.0.0.0/0            0.0.0.0/0            ADDRTYPE match dst-type LOCAL
...
Chain DOCKER (2 references)
target     prot opt source               destination
RETURN     all  --  0.0.0.0/0            0.0.0.0/0
DNAT       tcp  --  0.0.0.0/0            0.0.0.0/0            tcp dpt:8080 to:172.17.0.2:8080

# test again
ab -c 5000 -n 100000 -r -s 2 http://192.168.0.30:8080/
...
apr_pollset_poll: The timeout specified has expired (70007)
Total of 5602 requests completed

# set the timeout to 30 s
ab -c 5000 -n 10000 -r -s 30 http://192.168.0.30:8080/
...
Requests per second:    76.47 [#/sec] (mean)
Time per request:       65380.868 [ms] (mean)
Time per request:       13.076 [ms] (mean, across all concurrent requests)
Transfer rate:          44.79 [Kbytes/sec] received

Connection Times (ms)
              min   mean[+/-sd]  median    max
Connect:        0   1300  5578.0      1  65184
Processing:     0  37916 59283.2      1 130682
Waiting:        0      2     8.7      1    414
Total:          1  39216 58711.6   1021 130682
...
```
We create a SystemTap script to trace where packets are dropped in the kernel:
```
#! /usr/bin/env stap
############################################################
# Dropwatch.stp
# Author: Neil Horman <nhorman@redhat.com>
# An example script to mimic the behavior of the dropwatch utility
# http://fedorahosted.org/dropwatch
############################################################

# Array to hold the list of drop points we find
global locations

# Note when we turn the monitor on and off
probe begin { printf("Monitoring for dropped packets\n") }
probe end { printf("Stopping dropped packet monitor\n") }

# increment a drop counter for every location we drop at
probe kernel.trace("kfree_skb") { locations[$location] <<< 1 }

# Every 5 seconds report our drop locations
probe timer.sec(5)
{
  printf("\n")
  foreach (l in locations-) {
    printf("%d packets dropped at %s\n",
           @count(locations[l]), symname(l))
  }
  delete locations
}
```
Run this script:
```
stap --all-modules dropwatch.stp
Monitoring for dropped packets
10031 packets dropped at nf_hook_slow
676 packets dropped at tcp_v4_rcv
7284 packets dropped at nf_hook_slow
268 packets dropped at tcp_v4_rcv

# use perf to check
# record for 30 s, then Ctrl+C
$ perf record -a -g -- sleep 30
# print the report
$ perf report -g graph,0
```
The slowdown occurs at three points:
- ipv4_conntrack_in
- br_nf_pre_routing
- iptable_nat_ipv4_in
```
# check conntrack
```
important
```
# tcp time_wait settings check
```
summary
Setting TCP_NODELAY on a TCP connection disables the Nagle algorithm;
Enabling TCP_CORK on a TCP connection aggregates small packets into larger ones before sending (note that it delays the sending of small packets);
With SO_SNDBUF and SO_RCVBUF, you can adjust the size of the socket send buffer and receive buffer, respectively.
The three values of tcp_rmem and tcp_wmem are min, default, and max respectively. The system will automatically adjust the size of the TCP receive / send buffer according to these settings.
The three values of udp_mem are min, pressure, max. The system will automatically adjust the size of the UDP send buffer according to these settings.
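A sketch of checking these kernel settings (read-only here; the values you will see are distribution defaults, not recommendations):

```
# per-socket TCP receive/send buffer: min, default, max (bytes)
sysctl net.ipv4.tcp_rmem net.ipv4.tcp_wmem

# total UDP memory: min, pressure, max (pages)
sysctl net.ipv4.udp_mem
```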
Best Practice
docker
```
# check running docker
```
packet loss
```
# check max connection
```
During Linux startup, there are three special processes, that is, the three processes with the smallest PID numbers.
Process 0 is an idle process. This is also the first process created by the system. After initializing processes 1 and 2, it becomes an idle task. It runs when no other tasks are executing on the CPU.
Process 1 is the init process, which is usually the systemd process. It runs in user mode and is used to manage other user mode processes.
Process 2 is a kthreadd process, which runs in kernel mode and is used to manage kernel threads.
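A quick way to see PID 1 and PID 2, and the kernel threads parented by kthreadd, on a running system:

```
# PID 1 (init/systemd) and PID 2 (kthreadd); kernel threads appear in [brackets]
ps -ef | head -n 5

# children of kthreadd are kernel threads
ps --ppid 2 -o pid,comm | head
```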
Dynamic Tracing
```
cd /sys/kernel/debug/tracing
```
```
perf probe --add do_sys_open
```
```
# delete the probe before leaving
perf probe --del probe:do_sys_open
```
socket
```
netstat -s | grep socket
```