Monday, 6 October 2014

On iptables microbenchmarks, and performance gains with tc filtering

I am refining my techniques for facing all kinds of floods against game servers during a tournament. On the next occasion I'm going to deal with a 10 Gbit interface (I was surprised myself at how cheap they can be these days), so it starts to make sense to wonder whether your iptables setup is a performance bottleneck. In fact, even though I will be using external DDoS protection services, shit might just happen, and I don't want to have regrets in case a flood hits my server directly. And in any case, this was fun to learn.

I have used perf as a benchmarking tool, and trafgen (of the netsniff-ng suite) to create synthetic traffic over a veth interface at a high rate (figures around ~3M pps on my dual-core ThinkPad X230). These are Good™ because
  • perf is a sampling profiler with a humongous feature set (top-like interface, treats user and kernel space more or less as the same thing, has DWARF symbol support and therefore proper call graphs, tons of hardware-level counters...)
  • trafgen uses the packet mmap mechanism instead of raw sockets, which practically all other packet-forging software is based on (nmap, zmap, masscan...): this means zero copies between user space and kernel space. It also takes care of pinning processes to CPUs, bypasses the egress qdisc, and has a fairly good packet generation scripting system
  • a veth pair is just like a real interface, and packets sent from one peer are never dropped: this means that measuring how fast your ruleset is boils down to measuring how long it takes to send a fixed number of packets through the veth pair.

 

A benchmark of the benchmark

I'm not going to do a comprehensive benchmark of the packet path in the kernel. Instead, I'm going to show you how I found out that filtering at the tc ingress level can lead to a ~5% performance increase (initially measured as ~25%, which was wrong; see the important edit below) in dropping packets originated by DRDoS attacks and fragmented UDP packets.

Important edit
As pointed out by user fw in #netfilter @ freenode, packets sent with the mmap method on a veth interface need a copy when they are handed over to netfilter. When the packet is dropped at tc level, this copy doesn't happen, so there's an extra "bogus" performance gain. To avoid that, I repeated the test with trafgen's -t 0 option, which forces it to use sendto(), and there is still a performance gain, but a much smaller one. The rest of this post is left untouched, except for the benchmark results. Just note that memcpy doesn't appear anymore in the perf analysis.
It is still a good idea to use mmap transmission when you are not comparing an iptables-based setup against a tc-based one, because the raw speed of packet generation is higher than with sendto().

You need the kernel vmlinux image for perf to recognize debugging symbols (on Fedora: 'yum install kernel-debuginfo'). Some distros ship an outdated netsniff-ng package; you may need to compile from source to get the latest scripting system. Last, create the veth pair like this:
 ip link add A address 0:0:0:0:0:1 type veth peer name B address 0:0:0:0:0:2  
 ip link set A up  
 ip link set B up  
 #make sure packets are accepted no matter what source IP they have  
 echo 0 > /proc/sys/net/ipv4/conf/all/rp_filter   
 echo 0 > /proc/sys/net/ipv4/conf/A/rp_filter   
 echo 0 > /proc/sys/net/ipv4/conf/B/rp_filter   
 #blackhole everything that comes to interface B  
 ip rule add prio 1000 from all lookup local  
 ip rule del prio 0  
 ip rule add prio 0 iif B blackhole  
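Before flooding, it's worth double-checking that the policy routing actually blackholes traffic entering B. This inspection step is my addition, and the expected listing is a sketch based on the commands above:

```shell
# List policy routing rules; the blackhole for B should sit at priority 0
ip rule show
# Expected to look roughly like:
#   0:      from all iif B blackhole
#   1000:   from all lookup local
#   32766:  from all lookup main
#   32767:  from all lookup default
```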
Next, the actual filtering of DRDoS traffic. I'm going to use ipset, because ipsets can be matched from both iptables and tc (traffic control). Create a set of blacklisted source ports:
 ipset create drdos bitmap:port range 0-65535  
 ipset add drdos 123    #ntp monlist
 ipset add drdos 1900   #ssdp m-search
 ipset add drdos 53     #dns
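You can verify the set before pointing any rules at it; ipset test exits 0 when the entry is present. The extra ports below (chargen and SNMP, two other classic amplification vectors) are my additions, not part of the original setup:

```shell
ipset list drdos                          # show type, range and members
ipset test drdos 123 && echo "123 is in"  # membership check via exit status
# Other commonly abused reflection source ports, if you want them too:
ipset add drdos 19     # chargen
ipset add drdos 161    # snmp
```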
then you have two ways to block this traffic:
  • iptables level:
     # Note: -f matches only from the second fragment onward
     iptables -t raw -A PREROUTING -i B -f -j DROP
     iptables -t raw -A PREROUTING -i B -p udp -m set --match-set drdos src -j DROP
  • tc level:
     tc qdisc add dev B handle ffff: ingress
     tc filter add dev B parent ffff: protocol ip basic match not u32 \( u16 0x0 0x3FFF at 0x6 \) or ipset \( drdos src \) action drop
(It is perfectly legal to filter in the raw table.) To delete the tc filter, use
 tc qdisc del dev B handle ffff: ingress  
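For the record, here is my reading of the tc filter added above (the basic classifier combines ematches with boolean operators; this annotation is mine, so double-check it against man tc and the ematch sources):

```shell
# u16 0x0 0x3FFF at 0x6    -- take the 16-bit word at offset 6 of the IP header
#                             (flags + fragment offset), mask it with 0x3FFF
#                             (MF bit + 13-bit offset) and compare it with 0
# not u32 \( ... \)        -- so "not" matches fragments: MF set or offset != 0
# or ipset \( drdos src \) -- or the UDP source port is in the drdos set
# action drop              -- drop when the whole expression is true
tc filter add dev B parent ffff: protocol ip basic match not u32 \( u16 0x0 0x3FFF at 0x6 \) or ipset \( drdos src \) action drop

# Hit counters for the filter can then be read with:
tc -s filter show dev B parent ffff:
```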
To start bombarding the interface with a pseudo DNS DRDoS plus UDP-fragment attack, use trafgen in this way:
 #need to create a file for configuration or trafgen will use only one CPU  
 cat > /tmp/packet <<EOF  
# UDP DNS packet (truncated), no fragmentation
{
  0x00, 0x00, 0x00, 0x00, 0x00, 0x02, # MAC destination  
  0x00, 0x00, 0x00, 0x00, 0x00, 0x01, # MAC source  
  const16(0x0800),                    # protocol (IP)  
  0b01000101, 0,                      # IP version, TOS, etc.  
  const16(28),                        # Total length  
  drnd(2),                            # IP identification (random for each packet)
  0b00000000, 0,                      # No fragmentation
  64,                                 # ttl  
  17,                                 # proto udp  
  csumip(14, 33),                     # IP checksum  
  drnd(4),                            # source ip (random for each packet)
  10, 10, 10, 10,                     # dest ip  
  const16(53),                        # src port (DNS)  
  const16(43514),                     # dst port (attack port)  
  const16(8),                         # udp length  
  csumudp(14, 34),                    # udp checksum  
}
# IP fragment (smallest possible)
{
  0x00, 0x00, 0x00, 0x00, 0x00, 0x02, # MAC destination
  0x00, 0x00, 0x00, 0x00, 0x00, 0x01, # MAC source
  const16(0x0800),                    # protocol (IP)
  0b01000101, 0,                      # IP version, TOS, etc.
  const16(20),                        # Total length
  drnd(2),                            # IP identification (random for each packet)
  0b00100000, 42,                     # MF flag set, fragment offset 42 (x8 = 336 bytes)
  64,                                 # ttl
  17,                                 # proto udp
  csumip(14, 33),                     # IP checksum
  drnd(4),                            # source ip (random for each packet)
  10, 10, 10, 10,                     # dest ip
}  
EOF  
 trafgen -t 0 -c /tmp/packet -o A  
The packets are much smaller than the ones you would see in a real attack, because usually the bottleneck is not bandwidth but per-packet processing.
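The sizes in the configuration are just the bare headers. A quick arithmetic check (these are standard IPv4/UDP header sizes, nothing trafgen-specific):

```shell
# DNS reply: 20-byte IPv4 header + 8-byte UDP header, zero payload
echo $((20 + 8))    # total length field of the first packet: 28
# Fragment: a lone 20-byte IPv4 header, the smallest possible fragment
echo 20             # total length field of the second packet
# The fragment offset is stored in 8-byte units: offset 42 means the payload
# would start this many bytes into the reassembled datagram
echo $((42 * 8))    # 336
```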
Last, start perf:
 perf top -p `pidof trafgen | tr ' ' ,`  
 # or with call graphs  
 perf top --call-graph dwarf -p `pidof trafgen | tr ' ' ,`  
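perf top is interactive; if you prefer to keep the profile around for offline analysis, a perf record session works too (a sketch; the 10-second sampling window is an arbitrary choice of mine):

```shell
# Sample trafgen with DWARF call graphs for ~10 seconds, then browse offline
perf record --call-graph dwarf -p `pidof trafgen | tr ' ' ,` -- sleep 10
perf report
```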

Benchmark result

CPU: Intel(R) Core(TM) i5-3230M CPU @ 2.60GHz (dual core with HyperThreading)
RAM: 4GB DDR3 @ 1600 MHz
OS: Fedora x86_64 kernel 3.16.3-200
command line:
 trafgen -c /tmp/packet -t 0 -o A -P4 -n 50000000  
iptables:
real 0m22.540s
user 0m8.085s
sys 0m51.687s
tc:
real 0m23.645s
user 0m8.446s
sys 0m54.394s
That is a ~5% performance gain (system time is what counts here; user time accounts only for packet generation).
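For the record, the ~5% matches the relative difference between the two wall-clock times (my arithmetic; the post doesn't spell it out):

```shell
awk 'BEGIN { printf "%.1f%%\n", (23.645 - 22.540) / 22.540 * 100 }'   # 4.9%
```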

Analysis with perf


perf gives some insight into this. These two screenshots show the most important hogs with iptables and tc, respectively:


 
It is evident that ipt_do_table itself is a hog, and that it causes an extra memcpy that takes quite some time. The origin of the memcpy can be tracked with call graphs:

Basically, just entering ip_rcv is a performance cost.

On the other hand, what is that _raw_spin_lock that appears in the tc graph? Again, looking at the call graph helps:

I'm not a kernel hacker, but to me this is a hint that veth supports only one "hardware" queue. For this reason, pulling a packet from the device has to be serialized, and the more time you spend in the qdisc, the more likely you are to have lock contention; in fact, if you check the _raw_spin_lock usage in the iptables case, you still see it after __netif_receive_skb_core, but since no complicated tests are being done in the qdisc, you rarely get two threads pulling at the same time. After that, netfilter is fully parallelized.

2 comments:

  1. Can you explain what the following command is doing?

    tc filter add dev B parent ffff: protocol ip basic match not u32 \( u16 0x0 0x3FFF at 0x6 \) or ipset \( drdos src \) action drop

    ReplyDelete
  2. It is the (rather involved) way tc expresses rules, plus the action to take if the rule matches (drop). See man ematch (I admit sometimes I just look at the kernel sources, it's that badly documented).

    ReplyDelete