ck: udp: add uhash4 for established connection

fix #32811304

This patch introduces uhash4, a 4-tuple hash keyed on the local/remote
port and address. Normally the udp_table has two hash tables, the port
hash and the portaddr hash. But on a UDP server all sockets share the
same local port and address, so they all land in the same hash slot.
With thousands of connections, the softirq CPU usage can become very
high.
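The effect described above can be illustrated with a small userspace simulation. This is only a sketch: mix32(), hash2() and hash4() are hypothetical toy functions (a murmur3-style finalizer), not the kernel's udp_ehashfn() or ipv4_portaddr_hash(), and the slot and connection counts are assumptions chosen to resemble the test setup.

```c
#include <stdint.h>
#include <string.h>

#define SLOTS 1024u	/* power of two, like the kernel's mask + 1 */
#define CONNS 4000u	/* roughly the connection count in the test */

/* Toy 32-bit mixer (murmur3 finalizer); NOT the kernel's hash. */
static uint32_t mix32(uint32_t x)
{
	x ^= x >> 16;
	x *= 0x85ebca6bu;
	x ^= x >> 13;
	x *= 0xc2b2ae35u;
	x ^= x >> 16;
	return x;
}

/* Hash on the local <addr, port> only, like the portaddr table. */
static uint32_t hash2(uint32_t laddr, uint16_t lport)
{
	return mix32(laddr ^ mix32(lport));
}

/* Hash on the full 4-tuple, like the uhash4 table. */
static uint32_t hash4(uint32_t laddr, uint16_t lport,
		      uint32_t raddr, uint16_t rport)
{
	return mix32(hash2(laddr, lport) ^
		     mix32(raddr ^ ((uint32_t)rport << 16)));
}

/* Longest chain when CONNS clients talk to one local <addr, port>. */
static unsigned int longest_chain(int use_4tuple)
{
	static unsigned int count[SLOTS];
	const uint32_t laddr = 0xc0a80001u;	/* 192.168.0.1 */
	const uint16_t lport = 443;
	unsigned int i, max = 0;

	memset(count, 0, sizeof(count));
	for (i = 0; i < CONNS; i++) {
		uint32_t raddr = 0x0a000000u + i;	/* 10.0.0.0 + i */
		uint16_t rport = (uint16_t)(1024u + i % 60000u);
		uint32_t h = use_4tuple ?
			hash4(laddr, lport, raddr, rport) :
			hash2(laddr, lport);
		unsigned int n = ++count[h & (SLOTS - 1)];

		if (n > max)
			max = n;
	}
	return max;
}
```

With the 2-tuple hash every socket falls into one slot, so a lookup has to walk all CONNS entries; with the 4-tuple hash the chains stay short.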

This feature is disabled globally by default and can be enabled by
adding uhash4_enable=1 to the boot parameters. The current value can
be read from /sys/module/kernel/parameters/uhash4_enable.
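A program may want to check the parameter before relying on the feature. The following is a minimal sketch, assuming the usual "Y"/"N" text form that the kernel prints for bool parameters; parse_bool_param() and uhash4_enabled() are hypothetical helper names, not part of this patch.

```c
#include <stdio.h>

/* Interpret a bool parameter string: 1 = on, 0 = off, -1 = unknown. */
static int parse_bool_param(const char *s)
{
	if (!s)
		return -1;
	switch (s[0]) {
	case 'Y': case 'y': case '1':
		return 1;
	case 'N': case 'n': case '0':
		return 0;
	default:
		return -1;
	}
}

/* Read the parameter file; -1 if it is missing or unreadable. */
static int uhash4_enabled(const char *path)
{
	char buf[8];
	int ret = -1;
	FILE *f = fopen(path, "r");

	if (!f)
		return -1;
	if (fgets(buf, sizeof(buf), f))
		ret = parse_bool_param(buf);
	fclose(f);
	return ret;
}
```

Calling uhash4_enabled("/sys/module/kernel/parameters/uhash4_enable") returns -1 on a kernel without this patch, since the file does not exist there.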

Here are the test results of a QUIC server:

Server:
  - CPU: Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz
  - MEM: 125GB
  - NIC: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network
         Connection (rev 01)
  - NIC driver: ixgbe
  - Concurrent connections: 3000-4000

Without uhash4 (average over 10 secs):
CPU    %usr    %sys    %irq   %soft
all   35.53    7.89    0.00   26.31
  0    0.59    0.50    0.00   43.27
  1    0.64    0.75    0.00   43.24
  2    0.45    0.45    0.00   51.81
  3    0.34    0.45    0.00   52.78
  4    0.12    0.24    0.00   46.04
  5    0.24    0.35    0.00   64.34
  6    0.24    0.36    0.00   53.21
  7    0.48    0.48    0.00   46.42
  8    0.24    0.24    0.00   55.18
  9    0.00    0.24    0.00   57.84
 10    0.24    0.24    0.00   60.81
 11    0.12    0.46    0.00   53.88
 12    0.23    0.23    0.00   55.09
 13    0.34    0.11    0.00   59.14
 14    0.34    0.45    0.00   46.28
 15    0.22    0.22    0.00   53.77
 16    1.63    1.83    0.00    0.10
 17    1.83    1.83    0.00    0.10
 18   79.00   16.80    0.00    4.20
 19   81.22   15.38    0.00    3.40
 20   80.58   16.12    0.00    3.30
 21   80.88   15.42    0.00    3.70
 22   71.87   16.22    0.00    3.60
 23   63.45   18.17    0.00    3.82
 24   75.58   16.12    0.00    4.20
 25   74.25   16.03    0.00    4.41
 26   76.92   15.68    0.00    4.30
 27   71.97   16.82    0.00    4.40
 28   80.20   16.40    0.00    3.40
 29   74.97   16.62    0.00    4.10
 30   71.74   15.93    0.00    3.71
 31   72.65   15.53    0.00    3.41

With uhash4 (average over 10 secs):
CPU    %usr    %sys    %irq   %soft
all   32.61    8.50    0.00    3.60
  0    1.34    1.13    0.00    3.09
  1    1.12    0.82    0.00    2.45
  2    0.21    0.10    0.00    4.88
  3    0.32    0.21    0.00    1.83
  4    0.32    0.21    0.00    2.12
  5    0.31    0.21    0.00    3.04
  6    0.53    0.42    0.00    3.47
  7    0.21    0.32    0.00    1.38
  8    0.41    0.31    0.00    5.57
  9    0.21    0.21    0.00    4.47
 10    0.42    0.21    0.00    4.18
 11    0.73    0.42    0.00    4.57
 12    0.72    0.31    0.00    4.22
 13    0.83    0.41    0.00    4.34
 14    0.41    0.52    0.00    4.54
 15    1.03    0.41    0.00    5.03
 16    0.70    1.00    0.00    0.10
 17    0.91    0.91    0.00    0.00
 18   78.90   17.30    0.00    3.80
 19   76.60   19.30    0.00    4.10
 20   75.20   19.00    0.00    4.00
 21   61.85   18.27    0.00    4.02
 22   63.93   18.34    0.00    3.71
 23   77.20   18.90    0.00    3.90
 24   76.48   18.42    0.00    4.30
 25   75.68   18.42    0.00    4.30
 26   77.18   18.52    0.00    4.30
 27   77.20   18.70    0.00    4.10
 28   75.80   18.60    0.00    4.10
 29   64.96   18.88    0.00    3.71
 30   60.93   17.22    0.00    4.13
 31   71.44   18.94    0.00    3.41

This patch is useful for TCP-like UDP servers, but it can't go
upstream now for two reasons:

1) If a connection joins the uhash4 table, it is detached from its
   reuseport group unconditionally. I'm not sure whether this
   conflicts with something else or has side effects. Besides,
   the connection cannot be removed from the uhash4 table until it's
   closed, since it's unsafe to restore its reuseport group.
2) The rehash function is not implemented yet. If the client IP changes,
   the server may want to re-connect to the new IP with the same socket.
   This is useful for QUIC, since QUIC supports connection migration,
   but it is not supported now. Maybe I will implement it later.

The two restrictions are not a big deal as long as you remember that
the uhash4 is immutable and you don't use reuseport. Besides,
IPv4-mapped IPv6 is supported, but it's really not recommended.

**WARNING**
It's UNSAFE to implement a rehashable uhash4 based on the current
hlists, which is why the rehash function hasn't been implemented. If
you need the rehash function, be careful with the del/add operations on
two lists/hlists/... The current implementation doesn't guarantee that
the list-traversal primitive can safely run concurrently with moving a
socket. If rehash is called frequently, it may even lead to a soft
lockup.

For more information, please read the document:
Documentation/networking/udp-uhash4.txt

Signed-off-by: Cambda Zhu <cambda@linux.alibaba.com>
Reviewed-by: Tony Lu <tonylu@linux.alibaba.com>
Acked-by: Dust Li <dust.li@linux.alibaba.com>
Signed-off-by: Qiao Ma <mqaio@linux.alibaba.com>
Cambda Zhu 2020-02-17 18:12:58 +08:00 committed by Qiao Ma
parent 89c553ff34
commit d291d19e50
7 changed files with 273 additions and 5 deletions

@ -0,0 +1,42 @@
UDP 4-tuple Hash(uhash4)
========================
Normally the udp_table has two hash tables, one for port hash and one
for <port, addr> hash. Both of them are based on the local port/addr,
and the remote port/addr is not included. As a server, we often bind
on the same local <port, addr> and receive from thousands of clients
with different remote <port, addr>. If the server is running with a
TCP-like protocol such as QUIC, we need to distinguish connections
via <local port, local addr, remote port, remote addr>. With the
original hash table, the kernel has to compute the score for every
socket on the same hash slot, which leads to high softirq CPU usage.
To improve performance in certain cases, we introduce the uhash4,
4-tuple hash with <local port, local addr, remote port, remote addr>,
to the udp_table, and add a new UDP sockopt named UDP_HASH4 to enable
the function per socket. If it's enabled by a socket, the socket would
be added to a new hash table with 4-tuple hash when connecting to the
peer. When a packet is received by the kernel, the UDP lookup function
searches the uhash4 first and compares the 4-tuple information.
However, to keep the implementation simple and safe, there are some
restrictions on uhash4:
1) I strongly suggest the following steps to enable the uhash4:
   a. socket()
   b. bind()
   c. setsockopt(UDP_HASH4)
   d. connect()
   The setsockopt() may succeed while the uhash4 doesn't work. In that
   case, check the bind() parameters first, then report a bug to
   'Cambda Zhu <cambda@linux.alibaba.com>'.
2) Once it's enabled for a socket, it cannot be disabled. If you really
   need to disable it, close the socket and create a new one.
3) The 4-tuple info is fixed and cannot be changed even if you call
   connect() again or the socket is rehashed. The uhash4 entry may
   then be incorrect, and the user is responsible for closing the
   socket in this case.
This feature is disabled globally by default and can be enabled by
adding uhash4_enable=1 to the boot parameters. And you can check it
via /sys/module/kernel/parameters/uhash4_enable.
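
The recommended socket() / bind() / setsockopt(UDP_HASH4) / connect() sequence could look roughly like the userspace sketch below. It assumes the option is set at level IPPROTO_UDP and that UDP_HASH4 is 200 as defined by this patch; fill_addr() and open_connected_udp() are hypothetical helper names. It binds to a specific local address because udp4_hash4() in this patch returns early for INADDR_ANY.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#ifndef UDP_HASH4
#define UDP_HASH4 200	/* value this patch adds to the uapi udp.h */
#endif

/* Build an IPv4 sockaddr from dotted-quad text; 0 on success. */
static int fill_addr(struct sockaddr_in *sa, const char *ip,
		     unsigned short port)
{
	memset(sa, 0, sizeof(*sa));
	sa->sin_family = AF_INET;
	sa->sin_port = htons(port);
	return inet_pton(AF_INET, ip, &sa->sin_addr) == 1 ? 0 : -1;
}

/*
 * Returns a connected UDP socket or -1. *hash4_on is set to 1 if the
 * kernel accepted UDP_HASH4, or 0 if it refused (e.g. unpatched kernel
 * or uhash4_enable=0); the socket stays usable either way.
 */
static int open_connected_udp(const char *lip, unsigned short lport,
			      const char *rip, unsigned short rport,
			      int *hash4_on)
{
	struct sockaddr_in local, remote;
	int one = 1, fd;

	if (fill_addr(&local, lip, lport) || fill_addr(&remote, rip, rport))
		return -1;

	fd = socket(AF_INET, SOCK_DGRAM, 0);			/* a. */
	if (fd < 0)
		return -1;
	if (bind(fd, (struct sockaddr *)&local, sizeof(local)) < 0) /* b. */
		goto fail;
	*hash4_on = setsockopt(fd, IPPROTO_UDP, UDP_HASH4,	/* c. */
			       &one, sizeof(one)) == 0;
	if (connect(fd, (struct sockaddr *)&remote, sizeof(remote)) < 0) /* d. */
		goto fail;
	return fd;
fail:
	close(fd);
	return -1;
}
```

Falling back silently when setsockopt() fails is a design choice of this sketch; a server that depends on uhash4 for performance may prefer to log or abort instead.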

@ -41,6 +41,9 @@ struct udp_sock {
#define udp_port_hash inet.sk.__sk_common.skc_u16hashes[0]
#define udp_portaddr_hash inet.sk.__sk_common.skc_u16hashes[1]
#define udp_portaddr_node inet.sk.__sk_common.skc_portaddr_node
__u16 udp_lrpa_hash;
struct hlist_node udp_lrpa_node;
int pending; /* Any pending frames ? */
unsigned int corkflag; /* Cork is required */
__u8 encap_type; /* Is this an Encapsulation socket? */
@ -70,7 +73,8 @@ struct udp_sock {
#define UDPLITE_SEND_CC 0x2 /* set via udplite setsockopt */
#define UDPLITE_RECV_CC 0x4 /* set via udplite setsocktopt */
__u8 pcflag; /* marks socket as UDP-Lite if > 0 */
__u8 unused[3];
__u8 uhash4:1;
__u8 unused[2];
/*
* For encapsulation sockets.
*/
@ -151,6 +155,9 @@ static inline bool udp_unexpected_gso(struct sock *sk, struct sk_buff *skb)
#define udp_portaddr_for_each_entry_rcu(__sk, list) \
hlist_for_each_entry_rcu(__sk, list, __sk_common.skc_portaddr_node)
#define udp_lrpa_for_each_entry_rcu(__up, list) \
hlist_for_each_entry_rcu(__up, list, udp_lrpa_node)
#define IS_UDPLITE(__sk) (__sk->sk_protocol == IPPROTO_UDPLITE)
#endif /* _LINUX_UDP_H */

@ -31,6 +31,7 @@ void udplitev6_exit(void);
int tcpv6_init(void);
void tcpv6_exit(void);
void udp6_hash4(struct sock *sk);
int udpv6_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len);
/* this does all the common and the specific ctl work */

@ -66,16 +66,20 @@ struct udp_hslot {
*
* @hash: hash table, sockets are hashed on (local port)
* @hash2: hash table, sockets are hashed on (local port, local address)
* @hash4: hash table, sockets are hashed on
* (local port, local address, remote port, remote address)
* @mask: number of slots in hash tables, minus 1
* @log: log2(number of slots in hash table)
*/
struct udp_table {
struct udp_hslot *hash;
struct udp_hslot *hash2;
struct udp_hslot *hash4;
unsigned int mask;
unsigned int log;
};
extern struct udp_table udp_table;
extern bool uhash4_enable;
void udp_table_init(struct udp_table *, const char *);
static inline struct udp_hslot *udp_hashslot(struct udp_table *table,
struct net *net, unsigned int num)
@ -92,6 +96,22 @@ static inline struct udp_hslot *udp_hashslot2(struct udp_table *table,
return &table->hash2[hash & table->mask];
}
static inline struct udp_hslot *udp_hashslot4(struct udp_table *table,
					      unsigned int hash)
{
	return &table->hash4[hash & table->mask];
}

static inline bool udp_unhashed4(const struct sock *sk)
{
	return hlist_unhashed(&udp_sk(sk)->udp_lrpa_node);
}

static inline bool udp_hashed4(const struct sock *sk)
{
	return !udp_unhashed4(sk);
}
extern struct proto udp_prot;
extern atomic_long_t udp_memory_allocated;
@ -297,6 +317,8 @@ int udp_rcv(struct sk_buff *skb);
int udp_ioctl(struct sock *sk, int cmd, unsigned long arg);
int udp_init_sock(struct sock *sk);
int udp_pre_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len);
void udp4_hash4(struct sock *sk);
int udp_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len);
int __udp_disconnect(struct sock *sk, int flags);
int udp_disconnect(struct sock *sk, int flags);
__poll_t udp_poll(struct file *file, struct socket *sock, poll_table *wait);

@ -35,6 +35,8 @@ struct udphdr {
#define UDP_SEGMENT 103 /* Set GSO segmentation size */
#define UDP_GRO 104 /* This socket can receive UDP GRO packets */
#define UDP_HASH4 200 /* Enable UDP hash4 */
/* UDP encapsulation types */
#define UDP_ENCAP_ESPINUDP_NON_IKE 1 /* draft-ietf-ipsec-nat-t-ike-00/01 */
#define UDP_ENCAP_ESPINUDP 2 /* draft-ietf-ipsec-udp-encaps-06 */

@ -456,6 +456,29 @@ static struct sock *udp4_lib_lookup2(struct net *net,
return result;
}
static struct sock *udp4_lib_lookup4(struct net *net,
				     __be32 saddr, __be16 sport,
				     __be32 daddr, unsigned int hnum,
				     int dif, int sdif,
				     struct udp_table *udptable)
{
	struct sock *sk;
	struct udp_sock *up;
	unsigned int hash4 = udp_ehashfn(net, daddr, hnum, saddr, sport);
	struct udp_hslot *hslot4 = udp_hashslot4(udptable, hash4);
	INET_ADDR_COOKIE(acookie, saddr, daddr);
	const __portpair ports = INET_COMBINED_PORTS(sport, hnum);

	udp_lrpa_for_each_entry_rcu(up, &hslot4->head) {
		sk = (struct sock *)up;
		if (INET_MATCH(sk, net, acookie, saddr,
			       daddr, ports, dif, sdif))
			return sk;
	}

	return NULL;
}
static struct sock *udp4_lookup_run_bpf(struct net *net,
struct udp_table *udptable,
struct sk_buff *skb,
@ -491,6 +514,13 @@ struct sock *__udp4_lib_lookup(struct net *net, __be32 saddr,
struct udp_hslot *hslot2;
struct sock *result, *sk;
if (uhash4_enable) {
result = udp4_lib_lookup4(net, saddr, sport, daddr,
hnum, dif, sdif, udptable);
if (result)
return result;
}
hash2 = ipv4_portaddr_hash(net, daddr, hnum);
slot2 = hash2 & udptable->mask;
hslot2 = &udptable->hash2[slot2];
@ -1911,6 +1941,52 @@ int udp_pre_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
}
EXPORT_SYMBOL(udp_pre_connect);
/* call with sock lock */
void udp4_hash4(struct sock *sk)
{
	struct net *net = sock_net(sk);
	struct udp_table *udptable = sk->sk_prot->h.udp_table;
	struct udp_hslot *hslot, *hslot4;
	unsigned int hash;

	if (sk_unhashed(sk) || udp_hashed4(sk) ||
	    inet_sk(sk)->inet_rcv_saddr == htonl(INADDR_ANY))
		return;

	hash = udp_ehashfn(net,
			   inet_sk(sk)->inet_rcv_saddr,
			   inet_sk(sk)->inet_num,
			   inet_sk(sk)->inet_daddr,
			   inet_sk(sk)->inet_dport);
	hslot = udp_hashslot(udptable, net, udp_sk(sk)->udp_port_hash);
	hslot4 = udp_hashslot4(udptable, hash);
	udp_sk(sk)->udp_lrpa_hash = hash;

	spin_lock_bh(&hslot->lock);
	if (rcu_access_pointer(sk->sk_reuseport_cb))
		reuseport_detach_sock(sk);

	spin_lock(&hslot4->lock);
	hlist_add_head_rcu(&udp_sk(sk)->udp_lrpa_node, &hslot4->head);
	hslot4->count++;
	spin_unlock(&hslot4->lock);

	spin_unlock_bh(&hslot->lock);
}
EXPORT_SYMBOL(udp4_hash4);

int udp_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
{
	int res;

	lock_sock(sk);
	res = __ip4_datagram_connect(sk, uaddr, addr_len);
	if (uhash4_enable && udp_sk(sk)->uhash4 && !res)
		udp4_hash4(sk);
	release_sock(sk);
	return res;
}
EXPORT_SYMBOL(udp_connect);
int __udp_disconnect(struct sock *sk, int flags)
{
struct inet_sock *inet = inet_sk(sk);
@ -1952,7 +2028,7 @@ void udp_lib_unhash(struct sock *sk)
{
if (sk_hashed(sk)) {
struct udp_table *udptable = sk->sk_prot->h.udp_table;
struct udp_hslot *hslot, *hslot2;
struct udp_hslot *hslot, *hslot2, *hslot4;
hslot = udp_hashslot(udptable, sock_net(sk),
udp_sk(sk)->udp_port_hash);
@ -1970,6 +2046,14 @@ void udp_lib_unhash(struct sock *sk)
hlist_del_init_rcu(&udp_sk(sk)->udp_portaddr_node);
hslot2->count--;
spin_unlock(&hslot2->lock);
if (uhash4_enable && udp_hashed4(sk)) {
hslot4 = udp_hashslot4(udptable, udp_sk(sk)->udp_lrpa_hash);
spin_lock(&hslot4->lock);
hlist_del_init_rcu(&udp_sk(sk)->udp_lrpa_node);
hslot4->count--;
spin_unlock(&hslot4->lock);
}
}
spin_unlock_bh(&hslot->lock);
}
@ -2675,6 +2759,14 @@ int udp_lib_setsockopt(struct sock *sk, int level, int optname,
up->accept_udp_l4 = valbool;
release_sock(sk);
break;
case UDP_HASH4:
if (!uhash4_enable)
return -EOPNOTSUPP;
if (val == 0 && up->uhash4)
return -EPERM;
if (val != 0 && !up->uhash4)
up->uhash4 = 1;
break;
/*
* UDP-Lite's partial checksum coverage (RFC 3828).
@ -2764,6 +2856,12 @@ int udp_lib_getsockopt(struct sock *sk, int level, int optname,
val = up->gro_enabled;
break;
case UDP_HASH4:
if (!uhash4_enable)
return -EOPNOTSUPP;
val = up->uhash4;
break;
/* The following two cannot be changed on UDP sockets, the return is
* always 0 (which corresponds to the full checksum coverage of UDP). */
case UDPLITE_SEND_CSCOV:
@ -2851,7 +2949,7 @@ struct proto udp_prot = {
.owner = THIS_MODULE,
.close = udp_lib_close,
.pre_connect = udp_pre_connect,
.connect = ip4_datagram_connect,
.connect = udp_connect,
.disconnect = udp_disconnect,
.ioctl = udp_ioctl,
.init = udp_init_sock,
@ -3145,12 +3243,22 @@ static int __init set_uhash_entries(char *str)
}
__setup("uhash_entries=", set_uhash_entries);
bool uhash4_enable;
EXPORT_SYMBOL(uhash4_enable);
core_param(uhash4_enable, uhash4_enable, bool, 0444);
void __init udp_table_init(struct udp_table *table, const char *name)
{
unsigned int i;
unsigned long uhash_size = sizeof(struct udp_hslot);
if (uhash4_enable)
uhash_size *= 3;
else
uhash_size *= 2;
table->hash = alloc_large_system_hash(name,
2 * sizeof(struct udp_hslot),
uhash_size,
uhash_entries,
21, /* one slot per 2 MB */
0,
@ -3170,6 +3278,14 @@ void __init udp_table_init(struct udp_table *table, const char *name)
table->hash2[i].count = 0;
spin_lock_init(&table->hash2[i].lock);
}
if (uhash4_enable) {
table->hash4 = table->hash2 + (table->mask + 1);
for (i = 0; i <= table->mask; i++) {
INIT_HLIST_HEAD(&table->hash4[i].head);
table->hash4[i].count = 0;
spin_lock_init(&table->hash4[i].lock);
}
}
}
u32 udp_flow_hashrnd(void)
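
The udp_table_init() change above keeps all tables in one allocation: hash2 starts right after hash, and hash4 right after hash2. The layout and the `hash & mask` slot selection can be sketched in userspace as follows. This is an illustration only: struct slot, struct table, table_init() and slot4() are hypothetical stand-ins, and calloc() replaces alloc_large_system_hash().

```c
#include <stdlib.h>

struct slot {			/* stand-in for struct udp_hslot */
	unsigned int count;
};

struct table {			/* stand-in for struct udp_table */
	struct slot *hash;
	struct slot *hash2;
	struct slot *hash4;
	unsigned int mask;	/* number of slots per table, minus 1 */
};

/* Allocate 2 or 3 equally sized tables back to back; 0 on success. */
static int table_init(struct table *t, unsigned int log, int enable4)
{
	unsigned int slots = 1u << log;

	t->hash = calloc((size_t)slots * (enable4 ? 3 : 2),
			 sizeof(*t->hash));
	if (!t->hash)
		return -1;
	t->mask = slots - 1;
	t->hash2 = t->hash + slots;
	t->hash4 = enable4 ? t->hash2 + slots : NULL;
	return 0;
}

/* Same slot selection as udp_hashslot4(): mask the 4-tuple hash. */
static struct slot *slot4(struct table *t, unsigned int hash)
{
	return &t->hash4[hash & t->mask];
}
```

Because the third table is only allocated when the feature is on, slot4() must not be called on a table initialised with enable4 == 0, which mirrors the uhash4_enable checks in the patch.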

@ -189,6 +189,26 @@ static struct sock *udp6_lib_lookup2(struct net *net,
return result;
}
static struct sock *
udp6_lib_lookup4(struct net *net,
		 const struct in6_addr *saddr, __be16 sport,
		 const struct in6_addr *daddr, unsigned int hnum,
		 int dif, int sdif, struct udp_table *udptable)
{
	struct sock *sk;
	struct udp_sock *up;
	unsigned int hash4 = udp6_ehashfn(net, daddr, hnum, saddr, sport);
	struct udp_hslot *hslot4 = udp_hashslot4(udptable, hash4);
	const __portpair ports = INET_COMBINED_PORTS(sport, hnum);

	udp_lrpa_for_each_entry_rcu(up, &hslot4->head) {
		sk = (struct sock *)up;
		if (INET6_MATCH(sk, net, saddr, daddr, ports, dif, sdif))
			return sk;
	}

	return NULL;
}
static inline struct sock *udp6_lookup_run_bpf(struct net *net,
struct udp_table *udptable,
struct sk_buff *skb,
@ -226,6 +246,13 @@ struct sock *__udp6_lib_lookup(struct net *net,
struct udp_hslot *hslot2;
struct sock *result, *sk;
if (uhash4_enable) {
result = udp6_lib_lookup4(net, saddr, sport, daddr,
hnum, dif, sdif, udptable);
if (result)
return result;
}
hash2 = ipv6_portaddr_hash(net, daddr, hnum);
slot2 = hash2 & udptable->mask;
hslot2 = &udptable->hash2[slot2];
@ -1110,6 +1137,57 @@ static int udpv6_pre_connect(struct sock *sk, struct sockaddr *uaddr,
return BPF_CGROUP_RUN_PROG_INET6_CONNECT_LOCK(sk, uaddr);
}
/* call with sock lock */
void udp6_hash4(struct sock *sk)
{
	struct net *net = sock_net(sk);
	struct udp_table *udptable = sk->sk_prot->h.udp_table;
	struct udp_hslot *hslot, *hslot4;
	unsigned int hash;

	if (ipv6_addr_v4mapped(&sk->sk_v6_rcv_saddr)) {
		udp4_hash4(sk);
		return;
	}

	if (sk_unhashed(sk) || udp_hashed4(sk) ||
	    ipv6_addr_any(&sk->sk_v6_rcv_saddr))
		return;

	hash = udp6_ehashfn(net,
			    &sk->sk_v6_rcv_saddr,
			    inet_sk(sk)->inet_num,
			    &sk->sk_v6_daddr,
			    inet_sk(sk)->inet_dport);
	hslot = udp_hashslot(udptable, net, udp_sk(sk)->udp_port_hash);
	hslot4 = udp_hashslot4(udptable, hash);
	udp_sk(sk)->udp_lrpa_hash = hash;

	spin_lock_bh(&hslot->lock);
	if (rcu_access_pointer(sk->sk_reuseport_cb))
		reuseport_detach_sock(sk);

	spin_lock(&hslot4->lock);
	hlist_add_head_rcu(&udp_sk(sk)->udp_lrpa_node, &hslot4->head);
	hslot4->count++;
	spin_unlock(&hslot4->lock);

	spin_unlock_bh(&hslot->lock);
}
EXPORT_SYMBOL(udp6_hash4);

int udpv6_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
{
	int res;

	lock_sock(sk);
	res = __ip6_datagram_connect(sk, uaddr, addr_len);
	if (uhash4_enable && udp_sk(sk)->uhash4 && !res)
		udp6_hash4(sk);
	release_sock(sk);
	return res;
}
EXPORT_SYMBOL(udpv6_connect);
/**
* udp6_hwcsum_outgoing - handle outgoing HW checksumming
* @sk: socket we are sending on
@ -1702,7 +1780,7 @@ struct proto udpv6_prot = {
.owner = THIS_MODULE,
.close = udp_lib_close,
.pre_connect = udpv6_pre_connect,
.connect = ip6_datagram_connect,
.connect = udpv6_connect,
.disconnect = udp_disconnect,
.ioctl = udp_ioctl,
.init = udp_init_sock,