10G Ethernet XGMII

Posted by fmadio | 100G Ethernet

We previously discussed how packets are framed in hardware, specifically how you know when a packet starts and ends using application level signaling. Now we enter the world of the XGMII protocol. This is a (relatively) sane protocol which you can decode by eyeballing the data, e.g. 0x1234 on the wire usually means the packet contains data 0x1234. XGMII stands for X (Roman numeral 10) Gigabit Media Independent Interface, an interface ratified in IEEE 802.3 Clause 46 that enables a variety of PHY and MAC chips from different vendors to talk the exact same protocol. In the FPGA world, where the MAC and PHY are implemented in the same chip, this interface layer is technically not required, as the MAC and PHY will be grouped into a single IP library with a single public interface. However, in our own 10G, 40G, 100G Ethernet capture system we did separate these layers because it's a clear and obvious way to decompose the complexity of the problem.

The XGMII protocol is a formalized way for two hardware blocks (typically the MAC & PHY) to communicate when a packet starts/ends and whether any errors were detected on the line. The full spec is defined in IEEE 802.3 Clause 46, but we will save you the time parsing the legalese and explain it in plain English.

Simply put, it uses 8 bits of control for every 64 bits of data, which equates to 1 bit of control for every 1 byte of data. This 1 control bit changes the interpretation of its 8 bits (1 byte) of data. As you can imagine the interpretation is simple, and is shown in the table below.

| control bit | 8bit data interpretation |
|-------------|--------------------------|
| 0           | payload data             |
| 1           | control code             |
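To make the "1 control bit per byte" idea concrete, here is a minimal Python sketch that splits one 64bit word plus its 8 control bits into eight byte lanes. The function name `split_lanes`, the example word, and the convention that lane 0 is the least significant byte (with control bit 0 belonging to it) are assumptions made for illustration.

```python
def split_lanes(data: int, ctrl: int):
    """Split one 64bit word into eight (control_bit, byte) lanes.

    Assumption: lane 0 is the least significant byte of `data`,
    and bit 0 of `ctrl` is the control bit for that lane.
    """
    lanes = []
    for lane in range(8):
        byte = (data >> (8 * lane)) & 0xFF
        control_bit = (ctrl >> lane) & 1
        lanes.append((control_bit, byte))
    return lanes


def interpret(control_bit: int) -> str:
    """Apply the two-row table: 0 is payload data, 1 is a control code."""
    return "control code" if control_bit else "payload data"


if __name__ == "__main__":
    # Hypothetical word: one control byte in lane 0, seven payload bytes.
    for lane, (c, b) in enumerate(split_lanes(0x1122334455667707, 0b00000001)):
        print(f"lane {lane}: ctrl={c} data=0x{b:02x} -> {interpret(c)}")
```

Hardware does all eight lanes in parallel in a single clock, of course; the loop is only there to spell it out.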

Seems simple? It is. One thing you learn working with hardware is that everything is simple because ... it has to be, otherwise the verification time skyrockets out and beyond the stratosphere (and your budget too!). When the control bit is 0 the data payload is ... the data payload, meaning the same bits you will see in the PCAP. When the control bit is 1 things get a little more complicated, as each control byte (a data byte whose control bit is 1) carries a specifically crafted 8bit code that says what it actually means. So let's expand our interpretation table a little to include the major control codes.

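| control bit | 8bit data | interpretation               |
|-------------|-----------|------------------------------|
| 0           | 0x??      | payload data (anything goes) |
| 1           | 0x07      | idle (no data payload)       |
| 1           | 0xFB      | start of a packet            |
| 1           | 0xFD      | end of a packet              |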

It adds a little more complexity but still keeps it simple and elegant. There are half a dozen additional control codes for error detection and side band channels, which you can read about in the specification. The above covers 99% of what's on the wire for a typical healthy network. Let's dig a little deeper and go over a concrete example. In the "waves" screenshot below there is an example of a single packet using the XGMII protocol, where the packet starts at the red line and finishes at the dotted yellow line.

... with the raw data transcribed into textual form below (control bytes are the ones whose control bit is 1).

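| cycle | control bits (binary) | 64bit data          | marker             |
|-------|-----------------------|---------------------|--------------------|
| 1     | 00011111              | 0x555555fb_07070707 | Red line           |
| 11    | 11111111              | 0x07070707_070707fd | Yellow dotted line |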

Remember that this interface is pumping out data every single cycle the device is powered on. It does not care if there is valid packet data or if it's sitting idle, it always outputs something. In this case we see it's idle in cycle 0 and in the low 4 bytes of cycle 1, as the control bit is 1 and the data is control code 0x07, corresponding to Idle. Then, still in cycle 1, we see control bit = 1, data = 0xfb, followed by a string of 0x55 data payloads. This signals the Start of a packet (control code 0xFB), which is followed by what's known as the "Ethernet Preamble", aka 8 bytes of header before the Ethernet MAC address starts. The preamble continues into cycle 2 of the data and is terminated by a 0xd5 data payload. The full 8 bytes of preamble is 0xd5555555_555555fb.
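The lane-by-lane reading described above can be sketched in a few lines of Python. The two (control bits, data) pairs are the ones from the table; the names `CONTROL_CODES` and `decode_cycle`, and the assumption that lane 0 is the least significant byte and control bit, are ours for illustration.

```python
# The three major control codes from the table above.
CONTROL_CODES = {0x07: "idle", 0xFB: "start", 0xFD: "end"}


def decode_cycle(ctrl: int, data: int):
    """Return a human-readable label for each of the 8 lanes of one cycle.

    Assumption: lane 0 is the least significant byte of `data` and
    bit 0 of `ctrl` is its control bit.
    """
    labels = []
    for lane in range(8):
        byte = (data >> (8 * lane)) & 0xFF
        if (ctrl >> lane) & 1:
            # Control byte: look up the code, fall back to raw hex for
            # the codes this sketch does not cover.
            labels.append(CONTROL_CODES.get(byte, f"control 0x{byte:02x}"))
        else:
            labels.append(f"data 0x{byte:02x}")
    return labels


if __name__ == "__main__":
    print("cycle 1 :", decode_cycle(0b00011111, 0x555555FB_07070707))
    print("cycle 11:", decode_cycle(0b11111111, 0x07070707_070707FD))
```

Cycle 1 decodes as four idle lanes, the start code, then three 0x55 preamble bytes. Cycle 11 decodes as the end code in lane 0 followed by seven idles.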

An interesting side question is: why did the protocol designers choose the number 0x55 for the preamble? Why not choose 0x01020304050607 or some other magic number? The answer is clear when you interpret the bits as binary: 0x55 == 0101_0101 (binary), so the preamble is literally 1 0 1 0 1 0 1 ... 0 1 0 1, which I guess at some point in the past made it efficient to parse in hardware, or more friendly on DC balance/electrical cross talk. Note that 0xD5 == 1101_0101 in binary, which also makes detecting the end of the preamble very simple. These days we have wide buses, and our own 100G Ethernet capture hardware uses a massive 512b internal data bus, which makes the simplicity of an alternating series of 1's and 0's very much a thing of the past. For those heckling in the back saying it's good for high speed serial transceivers because they are by definition serial one bit outputs: that's not the case, as this data gets mutated a few more times and looks completely different by the time it hits the wire.
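A quick way to see the bit pattern for yourself is the sketch below. It assumes the classic Ethernet convention of sending each byte least significant bit first, and uses the traditional preamble of seven 0x55 bytes followed by 0xD5 (on XGMII the first of those bytes is replaced by the 0xFB start code). The helper name `bits_lsb_first` is made up for this example.

```python
def bits_lsb_first(byte: int) -> str:
    """Render one byte as a bit string, least significant bit first."""
    return "".join(str((byte >> i) & 1) for i in range(8))


# Classic preamble: seven 0x55 bytes, then the 0xD5 that ends it.
preamble = bytes([0x55] * 7 + [0xD5])
serial = "".join(bits_lsb_first(b) for b in preamble)

if __name__ == "__main__":
    print(f"0x55 = {0x55:08b}, 0xD5 = {0xD5:08b}")
    print(serial)
```

The serialized string is a strict alternation of 1 and 0 until the final two bits, which are both 1. Spotting that pair of 1's is all it takes to find the end of the preamble in a bit-serial design.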

Back to our example: if we decode the XGMII output into its payload, it looks like the following.

| cycle | 64bit payload data  | description                            |
|-------|---------------------|----------------------------------------|
| 0     | 0xd5555555_555555fb | Ethernet Preamble                      |
| 1     | 0x00000000_00000000 | cycle 0 of 8 bytes of ethernet payload |
| 2     | 0x00000000_00000001 | cycle 1 of 8 bytes of ethernet payload |
| 3     | 0x00000000_00000002 | cycle 2 of 8 bytes of ethernet payload |
| 4     | 0x00000000_00000003 | cycle 3 of 8 bytes of ethernet payload |
| 5     | 0x00000000_00000004 | cycle 4 of 8 bytes of ethernet payload |
| 6     | 0x00000000_00000005 | cycle 5 of 8 bytes of ethernet payload |
| 7     | 0x00000000_00000006 | cycle 6 of 8 bytes of ethernet payload |
| 8     | 0x00000000_00000007 | cycle 7 of 8 bytes of ethernet payload |
| 9     | 0x3dc288af          | ethernet frame check sequence          |
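The decode step from raw XGMII cycles to packet bytes can be sketched roughly as follows. This is an illustration under stated assumptions, not how our capture hardware does it: the function name `extract_frames` is made up, lane 0 is taken to be the least significant byte, the 0xFB start code is kept as the first byte (matching the preamble row in the table above), and error codes are ignored entirely.

```python
IDLE, START, END = 0x07, 0xFB, 0xFD


def extract_frames(cycles):
    """Collect the bytes of each packet from a stream of XGMII cycles.

    `cycles` is an iterable of (ctrl, data) pairs, one per clock.
    Each returned frame starts with the 0xFB start code and stops
    just before the 0xFD end code.
    """
    frames = []
    current = None  # None means we are between packets (idle)
    for ctrl, data in cycles:
        for lane in range(8):
            byte = (data >> (8 * lane)) & 0xFF
            is_ctrl = (ctrl >> lane) & 1
            if is_ctrl and byte == START:
                current = bytearray([byte])
            elif is_ctrl and byte == END:
                if current is not None:
                    frames.append(bytes(current))
                current = None
            elif not is_ctrl and current is not None:
                current.append(byte)
            # Idle and any other control codes are skipped in this sketch.
    return frames


if __name__ == "__main__":
    # Hypothetical stream: idle, start + preamble, 5 payload bytes, end.
    stream = [
        (0b11111111, 0x07070707_07070707),
        (0b00011111, 0x555555FB_07070707),
        (0b00000000, 0xAABBCCDD_D5555555),
        (0b11111110, 0x07070707_0707FDEE),
    ]
    for frame in extract_frames(stream):
        print(frame.hex(" "))
```

The first 8 bytes of the extracted frame, read as a little-endian 64bit word, give the same 0xd5555555_555555fb preamble word shown in the table.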

.. or in a more software friendly way (the Ethernet header is the 14 bytes that follow the 8 byte preamble)

00000000 fb 55 55 55 55 55 55 d5 00 00 00 00 00 00 00 00 |.UUUUUU.........|
00000010 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 02 |................|
00000020 00 00 00 00 00 00 00 03 00 00 00 00 00 00 00 04 |................|
00000030 00 00 00 00 00 00 00 05 00 00 00 00 00 00 00 06 |................|
00000040 00 00 00 00 00 00 00 07 3d c2 88 af |........=...|
0000004c

This is finally starting to look like an Ethernet packet. In this case the payload contents are a bit wonky (zeros as MAC addresses/protocol is not so kosher) because we wanted to keep the hardware explanation as simple as possible. For completeness' sake here is the raw Ethernet frame interpretation:

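| field             | value                         |
|-------------------|-------------------------------|
| Ethernet Dst MAC  | 0x00:0x00:0x00:0x00:0x00:0x00 |
| Ethernet Src MAC  | 0x00:0x00:0x00:0x00:0x00:0x00 |
| Ethernet Protocol | 0x0000                        |
| Ethernet Payload  | 0x00010000 0x00020000, ...    |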
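The same interpretation can be reproduced in software with a small sketch like the one below. The bytes are copied from the hexdump above; the function name `parse_frame` and the fixed offsets (8 bytes of preamble at the front, 4 bytes of frame check sequence at the back) are assumptions that fit this particular dump.

```python
import struct

# Bytes copied from the hexdump above: preamble, frame, then FCS.
DUMP_HEX = (
    "fb 55 55 55 55 55 55 d5 00 00 00 00 00 00 00 00 "
    "00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 02 "
    "00 00 00 00 00 00 00 03 00 00 00 00 00 00 00 04 "
    "00 00 00 00 00 00 00 05 00 00 00 00 00 00 00 06 "
    "00 00 00 00 00 00 00 07 3d c2 88 af"
)


def format_mac(raw: bytes) -> str:
    return ":".join(f"{b:02x}" for b in raw)


def parse_frame(raw: bytes) -> dict:
    """Split a preamble + frame + FCS byte string into Ethernet fields."""
    frame = raw[8:-4]  # assumption: 8 byte preamble, 4 byte FCS
    dst, src, protocol = struct.unpack("!6s6sH", frame[:14])
    return {
        "dst": format_mac(dst),
        "src": format_mac(src),
        "protocol": protocol,
        "payload": frame[14:],
        "fcs": raw[-4:],
    }


if __name__ == "__main__":
    info = parse_frame(bytes.fromhex(DUMP_HEX))
    print("dst     ", info["dst"])
    print("src     ", info["src"])
    print("protocol", f"0x{info['protocol']:04x}")
    print("payload ", info["payload"][:12].hex(" "), "...")
```

The payload starts 14 bytes into the frame, which is why its first 4 bytes read 00 01 00 00 rather than lining up with the 8 byte words of the XGMII bus.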

Finally, this is what the tcpdump output looks like when running our fmadio 40G Ethernet capture system in 4 ports @ 10G mode.

13:59:57.397981 00:00:00:00:00:00 > 00:00:00:00:00:00 Null Information, send seq 0, rcv seq 0, Flags [Response], length 54
    0x0000: 0000 0000 0000 0000 0000 0000 0000 0001 ................
    0x0010: 0000 0000 0000 0002 0000 0000 0000 0003 ................
    0x0020: 0000 0000 0000 0004 0000 0000 0000 0005 ................
    0x0030: 0000 0000 0000 0006 0000 0000 0000 0007 ................

This isn't exactly a model Ethernet packet citizen, but we have walked through how an XGMII bus ends up as a PCAP. Converting XGMII into PCAP is certainly not this simple in practice; in fact, if you want 100% capture at line rate to disk it's pretty damn hard. For now let's keep it focused on the Layer 1 dungeon. Next up we will walk through what this crazy idea of 64b/66b encoding is all about.