Monday, 6 October 2014

On iptables microbenchmarks, and performance gains with tc filtering

I am refining my techniques to face all kind of floods against game servers during a tournament. In the next occasion, I'm going to deal with a 10 Gbit interface (I was surprised myself how cheap they can be these days), so it starts to make sense wondering if your iptables setup is a performance bottleneck. In fact, even though I will be using external DDoS protection services, shit might just happen, and I don't want to have regrets in case that a flood hits directly my server. And in any case, this was fun to learn.

I have used perf as a benchmarking tool, and trafgen (of the netsniff-ng suite) to create synthetic traffic over a veth interface at a high rate (talking of figures around ~3M pps on my dual core Thinkpad x230). These are Good™ because
  • perf is a sampling profiler which has an humongous feature set (top-like interface, treats the user and kernel space more or less like the same thing, has DWARF symbols support and so proper callgraphs, tons of hardware level counters benchmarks...)
  • trafgen uses the packet mmap mechanism instead of being raw sockets based like practically all the other packet forging software is (nmap, zmap, masscan...): this means zero copy between user space and kernel space. It also takes care to pin different processes to CPUs, bypass the egress qdisc, and has a fairly good packet generation scripting system
  • a veth pair is just like a real interface, and packets sent from one peer are never dropped: this means that the measure of how fast your ruleset is equals measuring how long it takes to send a fixed number of packets through the veth pair.


 A benchmark of the benchmark

I'm not going through a comprehensive benchmarking of the packets path in the kernel. Instead, I'm going to show you how I found out that filtering at ingress traffic shaping level can lead to a ~25% (wrong) ~5% performance increase in dropping packets originated from DRDoS and UDP fragmented packets.

Important edit
As pointed out by user fw in #netfilter @ freenode, packets sent with the mmap method on a veth interface need a copy when they are hand over to netfilter. When the packet is dropped at tc level, this copy doesn't happen, so there's an extra "bogus" performance gain. To avoid that, I repeated the test with the option -t 0 of trafgen, which forces it to use sendto(), and there is still a performance gain but much smaller. The rest of this post is left untouched, except for the benchmark resulsts. Just note that memcpy doesn't appear anymore in the perf analysis.
It is still a good idea to use the mmap transmission if you are not testing a difference between an iptables-based and tc-based setup, because the raw speed of packet generation is higher than with sendto().

You need the kernel vmlinux image to have perf recognize debugging symbols (in fedora 'yum install kernel-debuginfo'). Some distros have an outdated netsniff-ng package, you may need to compile from sources to have the latest scripting system. Last, create the veth interface like this:
 ip link add A address 0:0:0:0:0:1 type veth peer name B address 0:0:0:0:0:2  
 ip link set A up  
 ip link set B up  
 #make sure packets are accepted no matter what source IP they have  
 echo 0 > /proc/sys/net/ipv4/conf/all/rp_filter   
 echo 0 > /proc/sys/net/ipv4/conf/A/rp_filter   
 echo 0 > /proc/sys/net/ipv4/conf/B/rp_filter   
 #blackhole everything that comes to interface B  
 ip rule add prio 1000 from all lookup local  
 ip rule del prio 0  
 ip rule add prio 0 iif B blackhole  
Next, the actual filtering of DRDoS. I'm going to use ipset, because ipsets can be used from both iptables and tc (traffic shaping). Create a set of blacklisted source ports:
 ipset create drdos bitmap:port range 0-65535  
 ipset add drdos 123    #ntp monlist
 ipset add drdos 1900   #ssdp m-search
 ipset add drdos 53     #dns
then you have two ways to block this traffic:
  • iptables level:
  •  # Note: -f matches only from the second fragment onward
     iptables -t raw -A PREROUTING -i B -f -j DROP
     iptables -t raw -A PREROUTING -i B -p udp -m set --match-set drdos src -j DROP  
  • tc level:
  •  tc qdisc add dev B handle ffff: ingress  
     tc filter add dev B parent ffff: protocol ip basic match not u32 \( u16 0x0 0x3FFF at 0x6 \) or ipset \( drdos src \) action drop
(it is perfectly legal to filter in the raw table). To delete the tc filter use
 tc qdisc del dev B handle ffff: ingress  
To start bombarding the interface with a pseudo DNS DRDoS and UDP fragments attack, use trafgen in this way:
 #need to create a file for configuration or trafgen will use only one CPU  
 cat > /tmp/packet <<EOF  
# UDP DNS packet (truncated), no fragmentation
  0x00, 0x00, 0x00, 0x00, 0x00, 0x02, # MAC destination  
  0x00, 0x00, 0x00, 0x00, 0x00, 0x01, # MAC source  
  const16(0x0800),                    # protocol (IP)  
  0b01000101, 0,                      # IP version, TOS, etc.  
  const16(28),                        # Total length  
  drnd(2),                            # IP identification (random for each packet)
  0b00000000, 0,                      # No fragmentation
  64,                                 # ttl  
  17,                                 # proto udp  
  csumip(14, 33),                     # IP checksum  
  drnd(4),                            # source ip (random for each packet)
  10, 10, 10, 10,                     # dest ip  
  const16(53),                        # src port (DNS)  
  const16(43514),                     # dst port (attack port)  
  const16(8),                         # udp length  
  csumudp(14, 34),                    # udp checksum  
# IP fragment (smallest possible)
  0x00, 0x00, 0x00, 0x00, 0x00, 0x02, # MAC destination
  0x00, 0x00, 0x00, 0x00, 0x00, 0x01, # MAC source
  const16(0x0800),                    # protocol (IP)
  0b01000101, 0,                      # IP version, TOS, etc.
  const16(20),                        # Total length
  drnd(2),                            # IP identification (random for each packet)
  0b00100000, 42,                     # Fragmentation stuff
  64,                                 # ttl
  17,                                 # proto udp
  csumip(14, 33),                     # IP checksum
  drnd(4),                            # source ip (random for each packet)
  10, 10, 10, 10,                     # dest ip
 trafgen -t 0 -c /tmp/packet -o A  
The packets are much smaller than the ones you would see in a real attack because usually the bottleneck is not bandwidth but processing per packet.
Last, start perf:
 perf top -p `pidof trafgen | tr ' ' ,`  
 # or with call graphs  
 perf top --call-graph dwarf -p `pidof trafgen | tr ' ' ,`  

Benchmark result

CPU: Intel(R) Core(TM) i5-3230M CPU @ 2.60GHz (dual core with HyperThreading)
RAM: 4GB DDR3 @ 1600 MHz
OS: Fedora x86_64 kernel 3.16.3-200
command line:
 trafgen -c /tmp/packet -t 0 -o A -P4 -n 50000000  
real 0m22.540s
user 0m8.085s
sys 0m51.687s
real 0m23.645s
user 0m8.446s
sys 0m54.394s
That is 5% performance gain (system time is counted, user time accounts only for packet generation).

Analysis with perf

perf gives some insight in this. These two screenshots show the most important hogs with iptables and tc respectively:

It is evident that ipt_do_table itself is a hog, and that it causes an extra memcpy to take quite some time. The origin of the memcpy can be tracked with callgrapsh:

Basically just entering ip_rcv is a performance cost.

On the other hand, what is that _raw_spin_lock that appears on the tc graph? Again, looking at the callgraph helps:

I'm not a kernel hacker but to me this is a hint that veth supports only one "hardware" queue. For this reason pulling one packet from the device has to be serialized, and the more time you spend in the qdisc the more likely you have lock contention; in fact, if you check the _raw_spin_lock usage in the iptables case, you still see it with after __netif_receive_skb_core, but since no complicate tests are being done in the qdisc, you rarely get two threads pulling at the same time. After that, netfilter is fully parallelized.

Friday, 3 October 2014

How to workaround missing multihoming support for UDP server applications using conntrack

It pretty common to design UDP server applications to have a single socket bound on<someport>, that receive datagrams and then replies with sendmsg/sendto. Unfortunately, this design cannot work with servers that have multiple interface, as the outgoing datagram will just pick up the default gateway. But there is a way to exploit conntrack, policy based routing and veth interfaces it to have your packets routed correctly. This solution was suggested by jkroon on Freenode (but I've modified some details).

The idea is to have conntrack remember which interface the first packet came in from, and force an automatic source address translation on the reply packets. Do this:
  • enable ip forwarding, disable return patch filtering by default:
  •  echo 1 > /proc/sys/net/ipv4/ip_forward  
     echo 0 > /proc/sys/net/ipv4/conf/all/rp_filter  
  • create a veth pair, assign to it a private addresses, disable rp_filter on one end:
  •  ip link add serverintf type veth peer name distributeintf  
     ip ad ad dev serverintf  
     ip ad ad dev distributeintf  
     ip link set serverintf up
     ip link set distributeintf up
     echo 0 > /proc/sys/net/ipv4/conf/serverintf/rp_filter  
     echo 0 > /proc/sys/net/ipv4/conf/distributeintf/rp_filter  
  • force your server application to bind  to (if that is not supported there are LD_PRELOAD tricks, or even network namespaces), and make packets coming out from that stick to serverintf even if they have a global destination:
  •  ip rule add from lookup 1010101  
     ip route add default dev serverintf via table 1010101  
  • use DNAT to route connections to the server:
  •  iptables -t nat -A PREROUTING -p udp -m multiport --dports $serverports -j DNAT --to-destination
     iptables -t raw -A PREROUTING -i distributeintf -j CT --notrack
     #for each interface eth$i with ip $ip and gateway $gateway
     echo 2 > /proc/sys/net/ipv4/conf/eth$i/rp_filter
     ip rule add from $ip lookup $((i + 1))
     ip route add default dev eth$i via $gateway table $((i + 1))  
The core of the trick here is to call conntrack in action with the DNAT target, so it reminds the connection tuple for us, then have the server send packets from and not, so the default gateway/interface is not picked. The roundtrip on the veth interfaces is necessary because in the second route lookup in the output path cannot change the interface chosen in the first lookup right after the application emits the packet.

For clarification, this is the path that packets travel.
  • incoming: packet comes from client to interface eth$i with destination ip$i, DNAT translates ip$i to and the packet is received. A NAT entry is created on the first packet, reminding that packets going from to client need to have the source translated to ip$i
  • outgoing: server sends response packet from on serverintf to client, conntrack changes the source address to ip$1 after POSTROUTING. The packet is looped to distributeintf, and forwarded to the correct eth$i thanks to policy routing. From now on, it's a safe journey.


When it comes to advanced routing there are some default sysctl settings that can stand in the way. I suggest you debug what's going on if something doesn't work by enabling martians logging (echo 1 > conf/*/log_martians).
If you don't have ip addresses assigned to your interface (in the case those are tunnels to which untouched packets are forwarded), you may need to adjust the logic that bumps a packet from distributeintf to the right interface. YMMV.

If you understand what's going on here you probably wonder why I use veth interfaces instead of lo: the reason is that a packet traveling in the PREROUTING chain on lo cannot be forwarded on other interfaces. It sticks to lo, and it is received locally no matter what.

Wednesday, 1 October 2014

Scan the internet for Autonomous Systems that can perform IP spoofing

I have always been interested in IP spoofing. I would say it's something "elegant", and it's a neat way to show how the Internet works, or rather how there are some inherent flaws with it. My greatest geek pride is a hack based on IP spoofing and source IP-port guessing, that allowed me to make players shit carrots while walking.

Unfortunately, IP spoofing enables shitkids to have in the virtual world the leverage they don't have in real life, and I'm talking about DRDoS. Lately I've been involved in protecting the very same game that I hacked against this kind of attacks. And of course, since I live in a constant moral greyzone, I couldn't miss experimenting in the other side of the front.

What and How

Performing IP spoofing requires a machine in an AS that allow such packets to be effectively sent out of its network. Finding one of these is no more a trivial task as it was years ago, and the knowledge of which providers allow that is usually not accessible for free. There are some places where you can find the worst scum of internet, and rent services from these fishy individuals, but it's usually a very unpleasing experience if you are there just for the sake of knowledge. So I started thinking how I could harvest for this kind of AS.

My idea is pretty simple:
  • pick a protocol that causes one query packet to be answered with an answer with controllable payload
  • force the above-mentioned payload to be the destination IP (the "visible global IP" of the host) of the query packet
  • inspect the response and check for weird mismatches between the payload and the source address.
It is true that the source address should match the payload even in AS that allows IP spoofing, but my worldwide scan shows that there is a lot of hosts that send out the weirdest shit, because connection tracking, NAT or routing altogether is not properly configured. And this post is exactly about these results.


First guess for the protocol? ICMP ping. Universal, reliable as the query payload is transmitted back as is, and generally not filtered.
I wrote my own simple C++ program that pings the internet in a pseudo random order, so that a single target network doesn't get a sudden spike of traffic (just like other tools do), and the result gathering was just tcpdump and some awkward bash scripts. I'm not going to share the code because I don't want lamers to gain from my ideas without effort, and also because I simply lost it. I decided to limit the results to host that sent non-globally routable source addresses, as there is no chance to incur in false positives: a simple mismatch of the payload and the source address is most probably caused by bad connection tracking and hosts with multiple network interfaces.
If you attempt to reproduce these results, be aware of two things. First you'll get abuse complaints, even for a ping scan, that has had no known vulnerabilities in the last 30 years I think. American university networks seem to be the most hysterical, and I would like to have a word about that in another future rant post. Second, I used a machine in a hosting provider which has different datacenters, and of the two that I tried only one was capable of receiving replies with weird invalid source addresses.


Here is a plot of the raw number of hits for reserved IP classes.

Judge yourself. I find this rather amusing.
One remarkable thing is the complete lack of, which every host should have attached to the loopback interface. In my opinion, this is due to the fact that a packet with this source address would need to be originated from the loopback interface, and at least the linux kernel seems to have hardcoded behaviors that make the packet "stick" to the interface (that is even policy based routing is ignored and the packet is directly looped to the host).

I'm not going  to provide the raw capture files, so don't bother asking.


Thanks to user dzrtguy on reddit that reminded me of the existence of bogons. I included them in the graph even though it doesn't strictly mean that these AS allow spoofing. Also beware that I ran this scan in May 2014, and I fetched the bogons list today October 1st, so there might be false positives or misses: the former is more likely as the bogon list should contain not yet assigned addresses.