This post builds on my previous post, where I talked about an interesting bug in my ping implementation. In case you haven’t read it, here’s a brief recap for coherence.

I was sending raw echo-request packets to a target host, but I wasn’t using the right tool for sending raw packets. As a result, the operating system’s network stack slapped a second IP header onto my packet, making it look like IP | IP | ICMP. The bytes were misaligned as a consequence, and the packet was dropped.

What I needed was a way to tell the OS not to attach its IP header, since the packet already had one, and the way to achieve that was to use raw sockets.

Understanding raw sockets

If you already know what raw sockets are, feel free to skip this part.

Beej’s Guide is hands down one of the best resources on networking, but since you’re here, here’s a rather simplistic explanation of raw sockets.

An operating system is typically divided into two parts: user space and kernel space. User space is where applications and user-level processes run, while kernel space handles low-level, critical operations such as networking, disk access, hardware management, etc.

When a network packet is sent from the system, it passes through multiple layers of the network stack. At each such layer, the kernel is responsible for creating and attaching the appropriate headers. For instance, at the network layer it attaches an IP header, while at the data-link layer, it attaches a link-layer header, such as an Ethernet header. The key takeaway here is that it’s the sole responsibility of the kernel to create and attach headers, and all of this happens in the kernel space.

Raw sockets come in handy when a user (who stays in userland, btw) wants to create the headers themselves. They let you craft packet headers in user space.

“raw socket” is a generic term

There are essentially two types of raw sockets, and I feel it’s necessary to know the distinction.

  • First, there are raw sockets created at the network layer. These are also called network sockets or L3 sockets.
  • Then there are raw sockets created at the data-link layer. These are called data-link sockets, L2 sockets, or even packet sockets.

When we create a raw socket at the network layer, we are responsible for constructing the packet’s IP header and the payload — which typically includes transport layer headers (TCP/UDP) and any data from higher layers.

Likewise, when creating a raw socket at the data link layer (L2), we are responsible for constructing the data link layer header (e.g., Ethernet header) as well as the entire packet, including the IP header, transport layer header, and payload.
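
To make the distinction concrete, here’s roughly how the two flavors are created with Go’s syscall package. This is a Linux-flavored sketch of my own, not from the original post: AF_PACKET is Linux-only, and on macOS, L2 access goes through BPF devices instead.

// L3 (network layer) raw socket: we supply everything from the IP header onward.
l3fd, _ := syscall.Socket(syscall.AF_INET, syscall.SOCK_RAW, syscall.IPPROTO_RAW)

// L2 (data-link layer) packet socket: we supply everything from the Ethernet
// header onward. The protocol number must be in network byte order.
htons := func(v uint16) uint16 { return v<<8 | v>>8 }
l2fd, _ := syscall.Socket(syscall.AF_PACKET, syscall.SOCK_RAW, int(htons(syscall.ETH_P_ALL)))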

Okay, so I understand that the user gets to create her own headers and data using raw sockets. What happens after that? How does it reach the network stack?

When we call the socket() system call, it returns a socket file descriptor, basically a file descriptor associated with a socket. A file descriptor is a number that the OS uses to identify an open file or another I/O resource such as a pipe, network connection, socket, etc. When a unix program does any sort of I/O, it does so by reading from or writing to a file descriptor. And since sockets are represented by file descriptors, network communication happens the same way as other I/O operations. There are functions, such as send() and recv(), for writing data to and reading data from a file descriptor.
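
Here’s a tiny sketch of that duality, purely illustrative and not from the original post (error handling omitted; the UDP socket and port are arbitrary examples):

package main

import "syscall"

func main() {
	// A regular file descriptor: open, write, close.
	ffd, _ := syscall.Open("/tmp/demo.txt", syscall.O_CREAT|syscall.O_WRONLY, 0644)
	syscall.Write(ffd, []byte("hello file\n"))
	syscall.Close(ffd)

	// A socket file descriptor: the same Write call, just a different resource.
	sfd, _ := syscall.Socket(syscall.AF_INET, syscall.SOCK_DGRAM, syscall.IPPROTO_UDP)
	syscall.Connect(sfd, &syscall.SockaddrInet4{Port: 9999, Addr: [4]byte{127, 0, 0, 1}})
	syscall.Write(sfd, []byte("hello socket\n"))
	syscall.Close(sfd)
}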

Upon writing data to the file descriptor, it’s my understanding that the kernel processes it and sends it down to the appropriate layer. Meaning, if we’ve created an L3 raw socket, the kernel passes it to the data-link layer, where the link-layer header gets attached. Eventually, it’s handed off to the NIC for transmission over the physical medium as electrical signals.

Here’s a doodle elaborating my thoughts. Please don’t take it too seriously; it’s highly abstracted and may even be technically inaccurate if put under scrutiny. It’s just that making these diagrams helps me quickly familiarize myself with unknown topics. At times, it also provides a level of clarity that’s hard to attain by reading through text alone.

explaining-raw-socket

Where were we?

With a decent understanding of raw sockets, I started searching for implementations in Go. That’s when I came across this blog by Graham King, which demonstrates exactly what I needed to do. It creates two binaries, a sender and a receiver, and runs each of them in a separate terminal. The sender program sends a raw echo-request packet to the loopback address, and when the system responds with an echo-reply, the receiver program receives it and prints the echo-reply bytes to stdout. We validate correctness by comparing the byte sequences in the two terminals, ensuring that the byte corresponding to the Type field is 08 in one (echo-request) and 00 in the other (echo-reply).

I was relieved: the article demonstrated exactly what I needed to upgrade my ping implementation using raw sockets. And yet, before refactoring, I copy-pasted the program from the blog and tried running it on my machine, just to be sure that it works!

Has it ever been that simple?

As you might’ve guessed, it didn’t. The kernel clearly didn’t get my message and kept attaching its own IP header ahead of mine. It was the same story of IP | IP | ICMP all over again. I couldn’t see any reason why things wouldn’t work as I had expected. I was fairly confident that I hadn’t missed any detail of the Linux manual on raw sockets. Yet, for peace of mind, I revisited the manual several times, hoping to spot some minor detail I may have overlooked. Despite my efforts, nothing surfaced.

Frustrated by the lack of progress, I posted a question on Stack Overflow. A few days later, someone responded. I’ve put together pieces from their answer below.

stranger-advice

All this time. I repeat, all this time, I had been poring over the Linux man pages while sitting comfortably on macOS.

steven-he-upset

How I’d been that oblivious, I still don’t know. It hit me that until that moment, I had never needed to write code with cross-system compatibility in mind. It’s a bit of a shame, I know — but it is what it is.

Comparing behavior of raw sockets across systems

The immediate step was to figure out how raw sockets behave differently on Linux and macOS. I learned that macOS has its roots in FreeBSD, so I started referring to the FreeBSD manual on raw sockets. After some digging, I found this page from the FreeBSD wiki and these notes, which summarise raw-socket-related bugs and peculiarities on FreeBSD.

I’ve copied its contents here for reference and completeness.

FreeBSD socket bugs and peculiarities

Documented by Barath Raghavan, 11/2003 on FreeBSD 4.8-RELEASE

Writing to RAW sockets
----------------------
- ip_len and ip_off must be in host byte order


Reading from RAW sockets
------------------------
- ip_len does not include the IP header's length.  recvfrom() however
returns the packet's true length.  To get the true ip_len field do:
iphdr->ip_len += iphdr->ip_hl << 2;

- You may only read from RAW sockets bound with a protocol other than
IPPROTO_RAW

- ip_len is in host byte order

- You may only read packets for protocols or subprotocols that the kernel
does not process.  This includes things such as ICMP_ECHOREPLY and
ICMP_TIMESTAMP as well as nonstandard protocol numbers.


DIVERT sockets
--------------
- These differ in behavior from RAW sockets, but I haven't gotten a chance
to document their weirdness.


General Thoughts
----------------
- Linux RAW sockets are much better documented in modern Linux
distributions (Gentoo) and have no bugs that I've noticed.  Avoid FreeBSD
for raw sockets unless you have no choice.  If you need BSD, I've read
that OpenBSD has fixed several of these bugs and provides a raw socket
implementation similar to that of Linux.

To put it briefly, raw sockets on FreeBSD exhibit certain peculiarities, observable when we read from or write to a raw socket. I want to elaborate on each of these, and here’s how I’m going to do it: first, I’ll show you the final, functional code that works on both systems; then, as we run the program, each of these peculiarities will become noticeable. We’ll then connect what we observe to the relevant section of the notes above.

Let’s get to that.

So this is how the files are structured in my project directory.

.
├── Makefile
├── cmd
│   ├── receiver
│   │   └── main.go
│   └── sender
│       └── main.go
├── go.mod
└── go.sum

This is the receiver program, copied nearly verbatim from the original article (the print statements and the package/import preamble aside).

package main

import (
	"fmt"
	"os"
	"syscall"
)

func main() {
	// Raw socket bound to the ICMP protocol: the kernel hands us incoming
	// ICMP packets, IP header included.
	fd, err := syscall.Socket(syscall.AF_INET, syscall.SOCK_RAW, syscall.IPPROTO_ICMP)
	if err != nil {
		fmt.Printf("Error creating socket: %s\n", err)
		os.Exit(1)
	}
	f := os.NewFile(uintptr(fd), fmt.Sprintf("fd %d", fd))

	for {
		buf := make([]byte, 1024)
		fmt.Println("waiting to receive")
		n, err := f.Read(buf)
		if err != nil {
			fmt.Printf("Error reading: %s\n", err)
			break
		}
		fmt.Printf("% X\n", buf[:n])
	}
}

It is system agnostic. We will compile this into an executable and run it in a dedicated terminal.

$ go build -o receiver ./cmd/receiver
$ sudo ./receiver
waiting to receive

While the receiver waits to receive an echo-reply, let’s shift our attention to the sender program, which, by the way, is not system agnostic. I slightly modified it to accommodate the behavior and expectations of Darwin-based systems. I’d encourage you to briefly skim through the code once, especially the if-block that handles the system-specific adjustments.

package main

import (
	"encoding/binary"
	"fmt"
	"os"
	"runtime"
	"syscall"
)

func main() {
	fd, err := syscall.Socket(syscall.AF_INET, syscall.SOCK_RAW, syscall.IPPROTO_RAW)
	if err != nil {
		fmt.Printf("Error creating socket: %s\n", err)
		os.Exit(1)
	}
	addr := syscall.SockaddrInet4{
		Addr: [4]byte{127, 0, 0, 1},
	}

	ipHeader := []byte{
		0x45,       // versionIHL
		0x00,       // tos
		0x00, 0x00, // len
		0x00, 0x00, // id
		0x00, 0x00, // ffo
		0x40,       // ttl
		0x01,       // protocol
		0x00, 0x00, // checksum

		0x00, 0x00, 0x00, 0x00, // src
		0x7f, 0x00, 0x00, 0x01, // dest
	}
	// ICMP echo-request: type, code, checksum, identifier, sequence number.
	data := []byte{0x08, 0x00, 0xf7, 0xff, 0x00, 0x00, 0x00, 0x00}

	if runtime.GOOS == "darwin" {
		// Setting the IP_HDRINCL socket option explicitly.
		_ = syscall.SetsockoptInt(fd, syscall.IPPROTO_IP, syscall.IP_HDRINCL, 1)

		// Populating Total Length in host byte order.
		binary.LittleEndian.PutUint16(ipHeader[2:4], 28)

		// Filling the fields that are auto-filled on Linux but not on mac.
		copy(ipHeader[12:16], []byte{127, 0, 0, 1})
		binary.BigEndian.PutUint16(ipHeader[4:6], 11)
		binary.BigEndian.PutUint16(ipHeader[10:12], calculateChecksum(ipHeader))
	}

	p := append(ipHeader, data...)

	fmt.Printf("Transmitting bytes:\n% x\n", p)

	if err := syscall.Sendto(fd, p, 0, &addr); err != nil {
		fmt.Printf("Error sending: %s\n", err)
		os.Exit(1)
	}

	fmt.Printf("Sent %d bytes\n", len(p))
}

// calculateChecksum computes the RFC 1071 Internet checksum over b
// (here, the 20-byte IPv4 header with its checksum field zeroed).
func calculateChecksum(b []byte) uint16 {
	var sum uint32
	for i := 0; i+1 < len(b); i += 2 {
		sum += uint32(b[i])<<8 | uint32(b[i+1])
	}
	if len(b)%2 == 1 {
		sum += uint32(b[len(b)-1]) << 8
	}
	for sum>>16 != 0 {
		sum = sum>>16 + sum&0xffff
	}
	return ^uint16(sum)
}

Let’s compile this. Ensure that you set the correct values for the GOOS and GOARCH environment variables; this is essential to generate the right build, otherwise you’ll get an exec format error. I’m running this on my MacBook Pro, which has an M1 chip.

$ GOOS=darwin GOARCH=arm64 go build -o sender-darwin ./cmd/sender

This creates a binary sender-darwin which we will be running in a separate terminal.

But before that, there’s one key difference that needs mentioning before we can even start sending out bytes. It lies in the socket creation step itself.

1. Creating a raw socket

So, socket creation is done via a syscall with the signature

sockfd = socket(domain, type, protocol)

What lets us supply our own IP header is a socket option called IP_HDRINCL. Normally, the IPv4 layer generates the IP header when sending a packet, but when the IP_HDRINCL option is enabled on the socket, the packet must already contain one. The way to set this option is by passing IPPROTO_RAW for the protocol argument in the above call: passing IPPROTO_RAW implies that IP_HDRINCL is enabled. At least, that’s what the Linux manual on raw sockets says.

Thus, to create an L3 (network layer) raw socket on Linux, the following should suffice.

fd, _ := syscall.Socket(syscall.AF_INET, syscall.SOCK_RAW, syscall.IPPROTO_RAW)

However, on macOS, the previous command alone isn’t sufficient. You also need to explicitly set the IP_HDRINCL option using the SetsockoptInt() function.

_ = syscall.SetsockoptInt(fd, syscall.IPPROTO_IP, syscall.IP_HDRINCL, 1)

This is the first thing I handled inside the if block.
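
By the way, both steps can be folded into a single helper that behaves the same on either system. This is my own sketch; the function name newRawIPv4Socket is hypothetical, not from the original code:

// newRawIPv4Socket creates an L3 raw socket with IP_HDRINCL enabled.
// On Linux, IPPROTO_RAW already implies IP_HDRINCL, so setting it again
// is harmless; on macOS, the explicit setsockopt is required.
func newRawIPv4Socket() (int, error) {
	fd, err := syscall.Socket(syscall.AF_INET, syscall.SOCK_RAW, syscall.IPPROTO_RAW)
	if err != nil {
		return -1, err
	}
	if err := syscall.SetsockoptInt(fd, syscall.IPPROTO_IP, syscall.IP_HDRINCL, 1); err != nil {
		syscall.Close(fd)
		return -1, err
	}
	return fd, nil
}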

2. Writing to a raw socket

Before we start, here’s a useful tip. If you’re planning to run this on a mac and observe packets using a tool like Wireshark or tcpdump, you might want to disable checksum offloading on your system. Otherwise, with checksum offloading enabled, what I’ve observed is that the checksum you calculate programmatically appears as zero in the Wireshark/tcpdump output — which can be confusing while debugging. Here’s how you disable it on macOS.

➜ ~ sudo sysctl net.link.generic.system.hwcksum_{tx,rx}=0
net.link.generic.system.hwcksum_tx: 0 -> 0
net.link.generic.system.hwcksum_rx: 0 -> 0

# to check status
➜ ~ sudo sysctl net.link.generic.system.hwcksum_{tx,rx}
net.link.generic.system.hwcksum_tx: 0
net.link.generic.system.hwcksum_rx: 0

Keeping that in mind, let’s continue.

We will run the sender program on both mac and linux machines and observe the bytes we send out in each of them. This is what we see on mac. sender-mac

And here’s what we see when running the same program on Ubuntu. sender-linux

For the record, this works: the receiver program on both machines successfully receives the echo-reply. We’ll see that in the next section.

For now, let’s focus on the bytes printed out by the sender in both macOS and Ubuntu. There are quite a few observations to make here.

  • First thing you’d notice is that there are a lot of 00, or null, bytes in the Ubuntu terminal. These are the bytes corresponding to the fields Total Length, Identification, Checksum, and Source address. I’ve color-coded them for clarity. sender-linux-copy I calculate these values only when the program is running on mac. On linux, I offload this responsibility to the kernel. The linux manual says that if the IP_HDRINCL option is set, certain header fields are auto-filled:

    • Checksum and Total Length fields are always filled in.
    • Source address and Identification fields are filled only when they’re left zero.

    I use this to my advantage and leave their values at zero. This doesn’t happen on mac though, which is why I have to calculate the values myself when the program is running on mac.
     
    If you look at the tcpdump output in Ubuntu, you will notice that the kernel has indeed filled the fields that I had left zero. tcpdump-linux-just-sender-bytes The bytes marked in cyan, corresponding to Total Length, became 00 1c, which is 28 in decimal. The Identification bytes, marked in pink, were set to fb 05. The Checksum bytes, marked in yellow, were set to 81 d9, and finally, the source address was set to the loopback address 7f 00 00 01, which is 127.0.0.1.

  • Moving on, while writing to raw sockets, FreeBSD requires certain fields of the IP header such as Total Length and Offset to be in host byte order. For context, the Total Length field gives the length of the entire IP datagram. So 20 bytes for the IP header and 8 bytes for the payload (ICMP header) gives 28 bytes in total.

    binary.LittleEndian.PutUint16(ipHeader[2:4], 28) // host byte order
    

    Now, 28 represented in two bytes is 00 1c, but as we’re using host byte order for this field, we reverse the sequence to 1c 00, and that is exactly what we sent out programmatically. See it again (in cyan). sender-mac-copy Naturally, this affects the checksum we compute. Bytes 1c 00 yield a checksum of 60 f0, while bytes 00 1c yield a checksum of 7c d4, which can be seen in the tcpdump output below. (You can reproduce both values with the standalone snippet after this list.)

    tcpdump-mac Note that this tcpdump output is from the mac terminal. It shows the Total Length bytes as 00 1c, not the 1c 00 we had sent in host byte order.
     
    What this observation indicates is that the point where tcpdump captures packets (which is right before they’re handed to the network adapter) lies beyond a certain boundary, outside which the peculiarities of FreeBSD raw sockets don’t surface. I suppose this might be expected behavior and not a peculiarity per se; it’s just that I’m unable to put a name to this so-called “boundary”. If you know what it is, kindly reach out.
     
    Lastly, Offset too is supposed to be in host byte order, but since it is 0 in this case, byte ordering becomes irrelevant. Hence, you don’t see me handling it.

  • Among other observations,

    • The Identification bytes in pink, 00 0b, which is 11 in decimal, are just a random value I set for the field. There is no other rationale behind this.

      binary.BigEndian.PutUint16(ipHeader[4:6], 11)
      
    • The checksum bytes 60 f0 were calculated manually

      binary.BigEndian.PutUint16(ipHeader[10:12], calculateChecksum(ipHeader))
      
    • and the bytes corresponding to source address were hardcoded.

      copy(ipHeader[12:16], []byte{127, 0, 0, 1})
      
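By the way, you can reproduce the two checksums from earlier with a small standalone program. This snippet is my own, not from the original article; calculateChecksum is the same RFC 1071 routine used in the sender:

package main

import (
	"encoding/binary"
	"fmt"
)

// calculateChecksum computes the RFC 1071 Internet checksum over b.
func calculateChecksum(b []byte) uint16 {
	var sum uint32
	for i := 0; i+1 < len(b); i += 2 {
		sum += uint32(b[i])<<8 | uint32(b[i+1])
	}
	if len(b)%2 == 1 {
		sum += uint32(b[len(b)-1]) << 8
	}
	for sum>>16 != 0 {
		sum = sum>>16 + sum&0xffff
	}
	return ^uint16(sum)
}

func main() {
	// The mac-side IP header just before checksumming: id = 11,
	// src = dest = 127.0.0.1, checksum field zeroed.
	hdr := []byte{
		0x45, 0x00, 0x00, 0x00, // versionIHL, tos, len (set below)
		0x00, 0x0b, 0x00, 0x00, // id, ffo
		0x40, 0x01, 0x00, 0x00, // ttl, protocol, checksum
		0x7f, 0x00, 0x00, 0x01, // src
		0x7f, 0x00, 0x00, 0x01, // dest
	}

	binary.LittleEndian.PutUint16(hdr[2:4], 28) // 1c 00, host byte order
	fmt.Printf("checksum over 1c 00: %x\n", calculateChecksum(hdr)) // 60f0

	binary.BigEndian.PutUint16(hdr[2:4], 28) // 00 1c, network byte order
	fmt.Printf("checksum over 00 1c: %x\n", calculateChecksum(hdr)) // 7cd4
}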

3. Reading from a raw socket

After sending the raw echo-request packet above, the receiver program running on each machine receives the echo-reply message. This can be validated by checking the byte corresponding to the ICMP Type field, which should be 00 for an echo-reply. receiver-combined

A sidenote. We see that the receiver running on Ubuntu is capturing the echo-request too … and I don’t know for certain why that’s happening. My expectation was that it would receive just the echo-reply, as happens on mac. My current guess: Linux hands a copy of every incoming ICMP packet to a raw IPPROTO_ICMP socket, and on loopback the request itself arrives as an incoming packet, whereas the FreeBSD notes above say you only receive packets the kernel doesn’t process itself (and the kernel does process echo-requests). If I ever confirm this, I will write an update here.
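
If the stray echo-request bothers you, one workaround (my addition, not part of the original receiver) is to inspect the ICMP Type byte and skip anything that isn’t a reply:

// isEchoReply reports whether a packet read from the raw socket is an
// ICMP echo-reply. pkt starts with the IPv4 header; the header's length,
// in 32-bit words, sits in the low nibble of its first byte.
func isEchoReply(pkt []byte) bool {
	if len(pkt) < 20 {
		return false
	}
	ihl := int(pkt[0]&0x0f) << 2 // IP header length in bytes
	return len(pkt) > ihl && pkt[ihl] == 0x00 // ICMP Type 0 = echo-reply
}

Dropping this check into the receiver’s read loop (if !isEchoReply(buf[:n]) { continue }) filters the output down to replies on both systems.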

There are a couple of quirks to note here as well.

  • In FreeBSD, when we read from raw sockets, the value of Total Length comes in host byte order, and it does not include the length of the IP header. In our case, the actual total length is 28; subtracting the 20-byte IP header leaves 8, which is 00 08 expressed in two bytes. That, in host byte order, becomes 08 00, which is exactly what the receiver program receives. receiver-mac-copy Even here, if you look at the tcpdump output for the incoming packet (yes, on mac of course), you’d see 00 1c instead of 08 00. receiver-tcpdump-mac This just reaffirms what I mentioned earlier: the point where tools like tcpdump and Wireshark capture packets doesn’t exhibit the peculiarities of FreeBSD raw sockets. (A small helper after this list shows one way to normalize the field portably.)
     
    Of course, you won’t see such a peculiarity on linux. The receiver program there receives a perfectly sensible 00 1c, i.e., 28. receiver-linux-copy
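
If you wanted to read the field portably, you could normalize it right after the read. Again, this helper is my own sketch, not from the original code; it assumes a little-endian host for the Darwin branch (true for both Intel and Apple silicon Macs) and needs the encoding/binary and runtime imports:

// totalLength extracts the full datagram length from an IPv4 header that
// was read off a raw socket, compensating for the BSD quirks above.
func totalLength(hdr []byte) uint16 {
	if runtime.GOOS == "darwin" {
		// BSD: host byte order, with the IP header length subtracted.
		ihl := uint16(hdr[0]&0x0f) << 2
		return binary.LittleEndian.Uint16(hdr[2:4]) + ihl
	}
	// Linux: untouched, network byte order, full datagram length.
	return binary.BigEndian.Uint16(hdr[2:4])
}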

Reflections

If you’ve read this far, I’m grateful. I’m sure that noticing the minute differences in the images required some cognitive effort and patience. For that, I thank you.

I can’t be sure whether all of this was actually helpful, because the gist of this blog can be summarised as: “Don’t use raw sockets on FreeBSD or mac; it’s not worth the effort”.

Among the good parts, of course, there’s a lot that I’ve learned in the process. Also, my doodling skills on Excalidraw have honestly improved. Isn’t that great?!

References