This post builds on my previous post, where I talked about an interesting bug in my ping implementation. In case you haven’t read it, here is a brief recap for coherence.
I was sending raw echo-request packets to a target host, but I wasn’t using the right tool for sending raw packets. As a result, the operating system’s network stack slapped a duplicate IP header on my packet, making it look like IP | IP | ICMP. This misaligned the bytes and the packet was dropped.
What I needed was a way to tell the OS not to attach its IP header, since the packet already had one, and the way to achieve that was raw sockets.
Understanding raw sockets
If you already know what raw sockets are, feel free to skip this part.
Beej’s guide is hands down one of the best resources on networking, but since you’re here, here’s a rather simplistic explanation of raw sockets.
An operating system is typically divided into two parts: user space and kernel space. User space is where applications and user-level processes run, while kernel space handles low-level, critical operations such as networking, disk access, hardware management, etc.
When a network packet is sent from the system, it passes through multiple layers of the network stack. At each such layer, the kernel is responsible for creating and attaching the appropriate headers. For instance, at the network layer it attaches an IP header, while at the data-link layer, it attaches a link-layer header, such as an Ethernet header. The key takeaway here is that it’s the sole responsibility of the kernel to create and attach headers, and all of this happens in the kernel space.
Raw sockets come in handy when a user (who stays in userspace, by the way) wants to create the headers themselves. They enable crafting packet headers in user space.
“raw socket” is a generic term
There are essentially two types of raw sockets, and it is important to know the distinction.
- First, there are raw sockets created at the network layer. These are also called network sockets or L3 sockets.
- Then there are raw sockets created at the data-link layer. These are called data-link sockets, L2 sockets, or even packet sockets.
When we create a raw socket at the network layer, we are responsible for constructing the packet’s IP header and the payload — which typically includes transport layer headers (TCP/UDP) and any data from higher layers.
Likewise, when creating a raw socket at the data link layer (L2), we are responsible for constructing the data link layer header (e.g., Ethernet header) as well as the entire packet, including the IP header, transport layer header, and payload.
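To make the distinction concrete, here’s a minimal sketch (my own illustration, not code from this project) of how each kind of raw socket is created on linux with Go’s syscall package. The htons helper is mine, and AF_PACKET sockets are linux-only.

package main

import "syscall"

// htons converts a 16-bit value to network byte order, which is how
// AF_PACKET expects its protocol argument. (Helper of my own; Go's
// syscall package doesn't provide one.)
func htons(v uint16) uint16 { return v<<8 | v>>8 }

func main() {
	// L3 raw socket: we craft the IP header and everything above it;
	// the kernel still attaches the link-layer header.
	l3, _ := syscall.Socket(syscall.AF_INET, syscall.SOCK_RAW, syscall.IPPROTO_RAW)

	// L2 packet socket (linux-only): we craft the Ethernet header too.
	l2, _ := syscall.Socket(syscall.AF_PACKET, syscall.SOCK_RAW, int(htons(syscall.ETH_P_ALL)))

	syscall.Close(l3)
	syscall.Close(l2)
}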
Okay, so I understand that the user gets to create her own header and data using raw sockets. What happens after that? How does it reach the network stack?
When we call the socket() system routine, it returns a socket file descriptor, basically a file descriptor associated with a socket. A file descriptor is a number that the OS uses to identify an open file or other I/O resource, such as a pipe, a network connection, or a socket. When a unix program does any sort of I/O, it does so by reading from or writing to a file descriptor. And since sockets are represented by file descriptors, network communication happens the same way as other I/O operations. There are methods, such as send() and recv(), that are used to write data to or read data from a file descriptor.
Upon writing data to the file descriptor, it’s my understanding that the kernel processes it and sends it down through the appropriate layers. Meaning, if we’ve created an L3 raw socket, the kernel passes the packet to the data-link layer, where the link-layer header gets attached. Eventually, it gets handed off to the NIC for transmission over the physical medium as electrical signals.
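To make the fd-as-I/O idea concrete, here’s a tiny sketch (mine, and essentially a preview of the receiver program further down): create a raw socket, wrap the descriptor in an *os.File, and read packets with plain file I/O.

package main

import (
	"fmt"
	"os"
	"syscall"
)

func main() {
	// socket() hands back a plain file descriptor.
	// Creating a raw socket needs elevated privileges (run as root).
	fd, err := syscall.Socket(syscall.AF_INET, syscall.SOCK_RAW, syscall.IPPROTO_ICMP)
	if err != nil {
		fmt.Printf("Error creating socket: %s\n", err)
		os.Exit(1)
	}
	// Wrap it like any other fd; Read() now yields whole IP packets.
	f := os.NewFile(uintptr(fd), "raw-icmp")
	buf := make([]byte, 1024)
	n, _ := f.Read(buf) // blocks until an ICMP packet arrives
	fmt.Printf("read %d bytes\n", n)
}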
Here’s a doodle elaborating my thoughts. Please don’t take it seriously; it’s highly, highly abstracted and may even be technically inaccurate if put under scrutiny. It’s just that making these diagrams helps me briefly familiarize myself with unknown topics. At times, it also provides a level of clarity that was hard to attain by reading through texts.
Where were we?
With some decent understanding of raw sockets, I started searching for implementations in Go. That’s when I came across this blog by Graham King that demonstrates exactly what I needed to do. It creates two binaries, a sender and a receiver, and runs each of them in a separate terminal. The sender program sends a raw echo-request packet to the loopback address, and when the system responds with an echo-reply, the receiver program receives it and prints the echo-reply bytes on stdout. We validate correctness by comparing the sequences in both terminals, ensuring that the byte corresponding to the Type field is 08 in one (echo-request) and 00 in the other (echo-reply).
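For reference, here’s the 8-byte ICMP echo header the sender program (shown later) puts on the wire, spelled out byte by byte; the Type byte is the one we compare across terminals.

// ICMP echo-request header, as built in the sender program below.
data := []byte{
	0x08,       // Type: 0x08 = echo-request; the reply carries 0x00
	0x00,       // Code
	0xf7, 0xff, // ICMP checksum (one's complement of 0x0800)
	0x00, 0x00, // Identifier
	0x00, 0x00, // Sequence number
}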
I was relieved, as the article demonstrated exactly what I needed to do to upgrade my ping implementation using raw sockets. And yet, before refactoring, I copy-pasted the program from the blog and tried running it on my machine, just to be sure that it worked!
Has it ever been that simple?
As you might’ve guessed, it didn’t. The kernel clearly didn’t get my message and kept attaching its own IP header ahead of mine. It was the same story of IP | IP | ICMP all over again. I couldn’t see any reason why things wouldn’t work as I had expected. I was fairly confident that I hadn’t missed any detail from the linux manual on raw sockets. Yet, for peace of mind, I revisited the manual several times, hoping to spot some minor detail I may have overlooked. Despite my efforts, nothing surfaced.
Frustrated by the lack of progress, I posted a question on Stack Overflow. A few days later, someone responded. I’ve put together pieces from their answer below.
All this time. I repeat, all this time, I had been poring over the linux man pages while sitting comfortably on macOS.
How I’d been that oblivious, I still don’t know. It hit me that until that moment, I had never needed to write code with cross-system compatibility in mind. It’s a bit of a shame, I know — but it is what it is.
Comparing behavior of raw sockets across systems
The immediate step was to figure out how raw sockets behave differently on Linux and macOS. I learned that macOS has its origins in FreeBSD, so I started referring to the FreeBSD manual on raw sockets. After some digging, I found this page from the FreeBSD wiki and these notes that summarise raw-socket related bugs and peculiarities on FreeBSD.
I’ve copied its contents here for reference and completeness.
FreeBSD socket bugs and peculiarities
Documented by Barath Raghavan, 11/2003 on FreeBSD 4.8-RELEASE
Writing to RAW sockets
----------------------
- ip_len and ip_off must be in host byte order
Reading from RAW sockets
------------------------
- ip_len does not include the IP header's length. recvfrom() however
returns the packet's true length. To get the true ip_len field do:
iphdr->ip_len += iphdr->ip_hl << 2;
- You may only read from RAW sockets bound with a protocol other than
IPPROTO_RAW
- ip_len is in host byte order
- You may only read packets for protocols or subprotocols that the kernel
does not process. This includes things such as ICMP_ECHOREPLY and
ICMP_TIMESTAMP as well as nonstandard protocol numbers.
DIVERT sockets
--------------
- These differ in behavior from RAW sockets, but I haven't gotten a chance
to document their weirdness.
General Thoughts
----------------
- Linux RAW sockets are much better documented in modern Linux
distributions (Gentoo) and have no bugs that I've noticed. Avoid FreeBSD
for raw sockets unless you have no choice. If you need BSD, I've read
that OpenBSD has fixed several of these bugs and provides a raw socket
implementation similar to that of Linux.
To put it briefly, raw sockets in FreeBSD exhibit certain peculiarities, observed when we read from or write to a raw socket. I want to elaborate on each of these, and here’s how I’m going to do it. First, I’ll show you the final, functional code that works on both systems. Then, as we run the program, each of these peculiarities will become noticeable, and we’ll connect what we observe to the relevant section of the text above.
Let’s get to that.
So this is how the files are structured in my project directory.
.
├── Makefile
├── cmd
│ ├── receiver
│ │ └── main.go
│ └── sender
│ └── main.go
├── go.mod
└── go.sum
This is the receiver program, copied from the original article essentially as it was (ignoring the print statements); I’ve added the package declaration and imports so it compiles as-is.
package main

import (
	"fmt"
	"os"
	"syscall"
)

func main() {
	// An L3 raw socket bound to ICMP: the kernel delivers matching
	// packets, IP header included, to this socket.
	fd, err := syscall.Socket(syscall.AF_INET, syscall.SOCK_RAW, syscall.IPPROTO_ICMP)
	if err != nil {
		fmt.Printf("Error creating socket: %s\n", err)
		os.Exit(1)
	}
	// Wrap the descriptor in an *os.File so we can Read() from it.
	f := os.NewFile(uintptr(fd), fmt.Sprintf("fd %d", fd))
	for {
		buf := make([]byte, 1024)
		fmt.Println("waiting to receive")
		n, err := f.Read(buf)
		if err != nil {
			fmt.Printf("Error reading: %s\n", err)
			break
		}
		fmt.Printf("% X\n", buf[:n])
	}
}
It is system-agnostic. We will compile this into an executable and run it in a dedicated terminal. Note the sudo: creating raw sockets requires elevated privileges.
$ go build -o receiver ./cmd/receiver
$ sudo ./receiver
waiting to receive
While the receiver waits for an echo-reply, let’s shift our attention to the sender program, which, by the way, is not system-agnostic. I slightly modified it to accommodate the behavior and expectations of Darwin-based systems. I would encourage you to briefly skim through the code once, especially the if block that handles the system-specific adjustments.
package main

import (
	"encoding/binary"
	"fmt"
	"os"
	"runtime"
	"syscall"
)

func main() {
	fd, err := syscall.Socket(syscall.AF_INET, syscall.SOCK_RAW, syscall.IPPROTO_RAW)
	if err != nil {
		fmt.Printf("Error creating socket: %s\n", err)
		os.Exit(1)
	}
	addr := syscall.SockaddrInet4{
		Addr: [4]byte{127, 0, 0, 1},
	}
	ipHeader := []byte{
		0x45,       // versionIHL
		0x00,       // tos
		0x00, 0x00, // len (Total Length)
		0x00, 0x00, // id
		0x00, 0x00, // ffo (flags + fragment offset)
		0x40,       // ttl
		0x01,       // protocol (ICMP)
		0x00, 0x00, // checksum
		0x00, 0x00, 0x00, 0x00, // src
		0x7f, 0x00, 0x00, 0x01, // dest
	}
	// ICMP echo-request header: type, code, checksum, id, seq.
	data := []byte{0x08, 0x00, 0xf7, 0xff, 0x00, 0x00, 0x00, 0x00}
	if runtime.GOOS == "darwin" {
		// Setting the IP_HDRINCL socket option explicitly.
		_ = syscall.SetsockoptInt(fd, syscall.IPPROTO_IP, syscall.IP_HDRINCL, 1)
		// Populating total length (20-byte header + 8-byte payload) in host byte order.
		binary.LittleEndian.PutUint16(ipHeader[2:4], 28)
		// Filling the fields that are auto-filled on linux but not on mac.
		copy(ipHeader[12:16], []byte{127, 0, 0, 1})
		binary.BigEndian.PutUint16(ipHeader[4:6], 11)
		binary.BigEndian.PutUint16(ipHeader[10:12], calculateChecksum(ipHeader))
	}
	p := append(ipHeader, data...)
	fmt.Printf("Transmitting bytes:\n% x\n", p)
	if err := syscall.Sendto(fd, p, 0, &addr); err != nil {
		fmt.Printf("Error sending: %s\n", err)
		os.Exit(1)
	}
	fmt.Printf("Sent %d bytes\n", len(p))
}
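The sender calls calculateChecksum(), which the snippet above doesn’t define. Here’s a sketch of what it needs to be, assuming the standard internet checksum (RFC 1071): the one’s complement of the one’s-complement sum of the header’s 16-bit words, computed while the checksum field itself is still zero. Add it to the sender’s file; plugging in our header bytes yields exactly the 60 f0 we’ll see shortly.

// calculateChecksum computes the IPv4 header checksum per RFC 1071.
// The checksum field (bytes 10-11) must still be zero when this runs.
func calculateChecksum(b []byte) uint16 {
	var sum uint32
	for i := 0; i+1 < len(b); i += 2 {
		sum += uint32(b[i])<<8 | uint32(b[i+1]) // fold in each 16-bit word
	}
	if len(b)%2 == 1 {
		sum += uint32(b[len(b)-1]) << 8 // pad a trailing odd byte with zero
	}
	for sum>>16 != 0 {
		sum = sum&0xffff + sum>>16 // fold the carries back in
	}
	return ^uint16(sum) // one's complement of the sum
}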
Let’s compile this. Ensure that you’re setting the correct values for the GOOS and GOARCH environment variables; this is essential to generate the correct build, otherwise you will get an exec format error. I am running this on my MacBook Pro, which has an M1 chip.
$ GOOS=darwin GOARCH=arm64 go build -o sender-darwin ./cmd/sender
This creates a binary, sender-darwin, which we will run in a separate terminal.
But before that, there is one key difference that needs mentioning before we can even start sending out bytes. It lies in the socket creation step itself.
1. Creation of raw socket
So, socket creation is done via a syscall that has the signature
sockfd = socket(domain, type, protocol)
What separates a raw socket from a regular one is an option called IP_HDRINCL. Normally, the IPv4 layer generates the IP header while sending a packet, but when the IP_HDRINCL option is enabled on the socket, the packet must already contain an IP header. The way to set this option is by passing IPPROTO_RAW for the protocol argument in the above call: passing IPPROTO_RAW implies that the IP_HDRINCL option is enabled. This is what the linux manual says on raw sockets.
Thus, to create an L3 (network layer) raw socket in linux, the following should suffice.
fd, _ := syscall.Socket(syscall.AF_INET, syscall.SOCK_RAW, syscall.IPPROTO_RAW)
However, on macOS, the previous command alone isn’t sufficient. You also need to explicitly set the IP_HDRINCL option using the SetsockoptInt() function.
_ = syscall.SetsockoptInt(fd, syscall.IPPROTO_IP, syscall.IP_HDRINCL, 1)
This is the first thing I handled inside the if block.
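Putting both steps together, a hypothetical helper (my own wrapper, assuming the runtime and syscall imports) that yields an L3 raw socket with IP_HDRINCL in effect on either system could look like this:

// newRawSocket returns an L3 raw socket on which we supply the IP header.
// On linux, IPPROTO_RAW implies IP_HDRINCL; on macOS we set it explicitly.
func newRawSocket() (int, error) {
	fd, err := syscall.Socket(syscall.AF_INET, syscall.SOCK_RAW, syscall.IPPROTO_RAW)
	if err != nil {
		return -1, err
	}
	if runtime.GOOS == "darwin" {
		if err := syscall.SetsockoptInt(fd, syscall.IPPROTO_IP, syscall.IP_HDRINCL, 1); err != nil {
			syscall.Close(fd)
			return -1, err
		}
	}
	return fd, nil
}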
2. Writing to a raw socket
Before we start, here’s a useful tip. If you’re planning to run this on a mac and observe packets using a tool like Wireshark or tcpdump, you might want to disable checksum offloading on your system. Otherwise, with checksum offloading enabled, the checksum you calculate programmatically appears as zero in the Wireshark/tcpdump output, which can be confusing while debugging. Here’s how you disable it on macOS.
➜ ~ sudo sysctl net.link.generic.system.hwcksum_{tx,rx}=0
net.link.generic.system.hwcksum_tx: 0 -> 0
net.link.generic.system.hwcksum_rx: 0 -> 0
# to check status
➜ ~ sudo sysctl net.link.generic.system.hwcksum_{tx,rx}
net.link.generic.system.hwcksum_tx: 0
net.link.generic.system.hwcksum_rx: 0
Keeping that in mind, let’s continue.
We will run the sender program on both mac and linux machines and observe the bytes we send out in each of them. This is what we see on mac.
And here’s what we see when running the same program on Ubuntu.
For the record, this works: the receiver program on both machines successfully receives the echo-reply. We will see that in the next section.
For now, let’s focus on the bytes printed out by the sender in both macOS and Ubuntu. There are quite a few observations to make here.
- The first thing you’ll notice is that there are a lot of 00, or null, bytes in the Ubuntu terminal. These are the bytes corresponding to the fields Total Length, Identification, Checksum, and Source address. I’ve color-coded them for clarity. I calculate these values only when the program is running on mac; for linux, I offload this responsibility to the kernel. The linux manual says that if IP_HDRINCL is set, certain header fields are auto-filled: Checksum and Total Length are always filled in, while Source address and Identification are filled only when they’re left zero. I use this to my advantage and leave their values at zero. This doesn’t happen on mac though, which is why I have to calculate these values myself when the program runs on mac.
  If you look at the tcpdump output in Ubuntu, you will notice that the kernel has indeed filled the fields that I had set to zero. The bytes marked in cyan, corresponding to Total Length, became 00 1c, which is 28 in decimal. The Identification bytes, marked in pink, were set to fb 05. The Checksum bytes, marked in yellow, were set to 81 d9. And finally, the source address was set to the loopback address 7f 00 00 01, which is 127.0.0.1.
- Moving on: while writing to raw sockets, FreeBSD requires certain fields of the IP header, such as Total Length and Offset, to be in host byte order. For context, the Total Length field gives the length of the entire IP datagram. So 20 bytes for the IP header and 8 bytes for the payload (ICMP header) give 28 bytes in total.
  binary.LittleEndian.PutUint16(ipHeader[2:4], 28) // host byte order
  Now, 28 represented in two bytes is 00 1c, but as we’re using host byte order for this field (my M1 mac is little-endian), we reverse the sequence to 1c 00, and that is exactly what we sent out programmatically. See it again (in cyan); there’s also a short sketch of this byte-order flip right after this list. Naturally, this affects the checksum we compute: bytes 1c 00 yield a checksum of 60 f0, while bytes 00 1c yield a checksum of 7c d4, which can be seen in the tcpdump output below.
  Note that this tcpdump output is from the mac terminal. It shows the Total Length bytes as 00 1c, not the 1c 00 we had sent in host byte order.
  What this observation indicates is that the point where tcpdump captures packets (right before they are handed to the network adapter) lies beyond a certain boundary, outside of which the peculiarities of FreeBSD raw sockets don’t surface. I suppose this might be expected behavior and not a peculiarity per se; it’s just that I’m unable to put the exact term on this so-called “boundary”. If you know what it is, kindly reach out.
  Lastly, Offset too is supposed to be in host byte order, but since it is 0 in this case, byte ordering becomes irrelevant; hence you don’t see me handling it.
- Among other observations:
  - The Identification bytes in pink, 00 0b, which is 11 in decimal: it’s just a random value I set for the field. There is no other rationale behind it.
    binary.BigEndian.PutUint16(ipHeader[4:6], 11)
  - The Checksum bytes 60 f0 were calculated manually.
    binary.BigEndian.PutUint16(ipHeader[10:12], calculateChecksum(ipHeader))
  - And the bytes corresponding to the source address were hardcoded.
    copy(ipHeader[12:16], []byte{127, 0, 0, 1})
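As promised in the list above, here’s a short, runnable sketch (my own illustration) of the byte-order flip: the same value, 28, serializes as 00 1c under network (big-endian) order and as 1c 00 under the host order of a little-endian machine like an M1 mac.

package main

import (
	"encoding/binary"
	"fmt"
)

func main() {
	b := make([]byte, 2)

	binary.BigEndian.PutUint16(b, 28) // network byte order
	fmt.Printf("% x\n", b)            // prints: 00 1c

	binary.LittleEndian.PutUint16(b, 28) // host byte order on x86/ARM
	fmt.Printf("% x\n", b)               // prints: 1c 00
}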
3. Reading from a raw socket
After sending the raw echo-request packet above, the receiver program running on both machines receives the echo-reply message. This can be validated by checking the byte corresponding to the ICMP Type field, which should be 00 for an echo-reply.
A sidenote. We see that the receiver running on Ubuntu captures the echo-request too, and I don’t know for certain why. My expectation was that it would receive just the echo-reply, as happens on mac. My best guess, going by the FreeBSD notes above, is that this is the flip side of the BSD restriction: FreeBSD-derived kernels pass raw sockets only the packets the kernel doesn’t process itself (echo-replies, for instance), whereas linux hands a raw ICMP socket every matching ICMP packet, the looped-back echo-request included. If I ever confirm this, I will write an update here.
There are a couple of quirks to note here as well.
- In FreeBSD, when we read from raw sockets, the value of Total Length comes in host byte order, and it does not include the length of the IP header. In our case, the actual total length is 28; if we subtract the 20-byte IP header from it, we are left with 8, which is 00 08 expressed in two bytes. In host byte order, that becomes 08 00, which is exactly what the receiver program receives. Even here, if you check the tcpdump output for the incoming packet (yes, on mac of course), you will see 00 1c instead of 08 00. This just reaffirms what I mentioned earlier: the point where tools like tcpdump and Wireshark capture packets doesn’t exhibit the peculiarities of FreeBSD raw sockets. There’s a sketch of the corresponding fix-up right after this list.
  Of course, you won’t see any such peculiarity on linux. The receiver program there receives a perfectly sensible 00 1c, i.e. 28.
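As an aside, if you wanted to normalize such a packet in code, the FreeBSD notes’ C one-liner (iphdr->ip_len += iphdr->ip_hl << 2;) translates to something like this hypothetical Go helper, which assumes a little-endian host and the encoding/binary import:

// fixTotalLength repairs the Total Length field of an IP packet read from
// a macOS/FreeBSD raw socket: the kernel delivers it in host byte order
// with the header length already subtracted. (Hypothetical helper.)
func fixTotalLength(pkt []byte) {
	ihl := uint16(pkt[0]&0x0f) * 4              // header length in bytes, from IHL
	l := binary.LittleEndian.Uint16(pkt[2:4])   // host byte order (little-endian here)
	binary.BigEndian.PutUint16(pkt[2:4], l+ihl) // add the header back, restore network order
}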
Reflections
If you’ve read this far, I’m grateful. I’m sure that noticing the minute differences in the images required some cognitive effort and patience. For that, I thank you.
I hope this has been helpful, and if I had to summarise this post in one line, it would be: “Don’t use raw sockets on FreeBSD or mac; it’s not worth the effort”.
Among the good parts, of course, there’s a lot that I’ve learned in the process. Also, my doodling skills on excalidraw have honestly improved. Isn’t that great?!