Post

Trademaxxing part1 : Parsing MoldUDP64

Trademaxxing part1 : Parsing MoldUDP64

Introduction

I’d say I’m a normal guy, but I’ve always liked very stupid projects.

You may not know it but I have a decent interest in finance. So today we are exploring how FPGAs are used to process market data in order to make decisions as fast as possible.

This will in turn allow us to implement shitty financial strats as HDL without any overhead, allowing us to dilapidate our money at lightning speeds and become poor ASAP.

WHY FPGAs ?

CPUs are stupid and full of BLOAT.

Unless you are using a baremetal program on the HOLY CORE (the best CPU ever), your delays are awful and undeterministic, making C++ developers and their fancy 1ms execution times cry as they see literal logic gates crush their useless and bloated perfs, which could have been made better by claude.ai anyway. GG for them.

Us hardware devs still have a couple of years to go before AI replaces us, this leaves us some time to mock software devs before we get replaced too, so let me enjoy it while it lasts

Anyway, when you wanna capture a spread and/or execute arbitrage strategies, the first to pass orders is the one who wins. These strategies are often dumb simple and to get the bags, it doesn’t come down to who has the best quant but rather who can buy/sell the fastest as the opportunities are only present for milliseconds.

This is why FPGAs are appreciated. The time it takes to RX an ethernet packet and TX an order can be very small, like under 100ns but that’s the whole point, you have to be first and no matter if it’s 1-2 nanoseconds before the guy next door, if you are first it’s GG and you get the bags.

MoldUDP64

Okay so in this post, we’ll try to figure out how to parse incoming ethernet frames. We won’t process any data yet nor keep track of the market state. No. The objective here is simply to understand how one can get market data, what it is and how to parse the incoming data to get to the PAYLOAD i.e. market data.

MoldUDP64 & ITCH.. What are those ?

ITCH is the Nasdaq protocol where they’ll broadcast market events. It’s received in FPGAs via ethernet and the frames follow the MoldUDP standard that looks like this:

1
Ethernet Frame -> IPv4 headers -> UDP Headers -> MoldUDP64 -> ITCH Messages

Or even better with an image:

MoldUDP64 Packet view

I won’t get into the specifics of IPv4 and UDP.

MoldUDP64 is simple and only contains a few fields:

Field NameLengthValueNotes
Session010Session ID, don’t care much
Sequence Number108First message’s sequence number. 1 per message. Can use that and compare with last seq message and check if we missed something. Each message has one so if we get seq = 10 and message count = 5 then the next frame should have number 16 as seq number.
Message Count182The number of messages in the packet

Our goal here will be to filter incoming packets, we only want:

  • IPv4 UDP packets
  • That contain MoldUDP64

So it’s mostly just a big forwarding funnel / pipeline (whatever you wanna call it).

The only thing “out of the ordinary” this pipeline can and will do is detect missing packets using the Sequence Number field, which contains the first message’s number.

E.g. if you miss a frame and seq number does not match the previous message you handled… it means you missed some orders.

Now Nasdaq being very smart gives you another server you can request missing packets from. Which we won’t do in the context of our demo, instead, when a missing packet is detected, we raise a sequential packet_gap_error we can later use as a reset signal or an interrupt if I decide to add a monitoring CPU to the system.

Anyway, that’s the idea of what we need to do to access the ITCH messages, again, we don’t wanna process them yet.

FPGA design

MoldUDP64 Parsing FPGA design

AXI stream is used as a standard connection between parsers.

So the design is pretty straightforward. Because this project is more of a demo, the ethernet speed is limited to 1Gbps as I plan on using a KC705 board:

KC705 image

Each parser revolves around a simple FSM and a byte counter, then looks at the upcoming data, parses the respective fields by transitioning states.

If a frame is not valid (has to be filtered out), it simply goes back to IDLE and waits for the data stream to end by monitoring the AXIS’ tlast.

If a frame passes the couple of checks, the parser transitions into FORWARDING state and will open the gate to the next parser, but it only passes its payload meaning the “bloat” headers are dropped at each step.

Here is an example, the simplest (UDP parsing) where we literally have no filter implemented:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
// udp_parser.sv

/* UDP Parser
*
* This parser is pretty much just a forwarder that cuts out the UDP headers
* As the UDP port, just like some IPv4 params, depend on the day-to-day instructions from Nasdaq,
* We'll just forward stuff for the most part, very simple design.
*
* BRH 03/2026
*/

module udp_parser (
    input logic clk,
    input logic rst_n,
    AXI_STREAM_BUS.Rx axis_in,
    AXI_STREAM_BUS.Tx axis_out
);

// these states represent 
typedef enum logic [3:0] {
    IDLE,
    SRC_PORT,
    DST_PORT,
    LENGTH,
    HEADER_CHECKSUM,
    FORWARDING
} state_t;
    
state_t state, next_state;
// we track ongoing frame
logic ongoing_frame;
// Global bytes counter
logic [7:0] bytes_counter;

always_ff @(posedge clk) begin
    if(~rst_n) begin
        state <= IDLE;
        ongoing_frame <= 0;
        bytes_counter <= 0;
    end else begin
        state <= next_state;

        // track ongoing frame
        ongoing_frame <= ongoing_frame;
        if(axis_in.tlast) begin
            ongoing_frame <= 0;
        end else if(state == IDLE && axis_in.tvalid) begin
            ongoing_frame <= 1;
        end

        // update byte counter
        if(axis_in.tvalid && (state != IDLE)) begin
            bytes_counter <= axis_in.tlast ? 0 : bytes_counter + 1;
        end else begin
            bytes_counter <= bytes_counter;
        end
    end
end

always_comb begin
    // default assignments
    next_state = state;
    axis_out.tvalid = 0;
    axis_out.tdata = 0;
    axis_out.tlast = 0;
    axis_in.tready = 1;

    case (state)
        IDLE : begin
            if(axis_in.tvalid && ~ongoing_frame) next_state = SRC_PORT;
            axis_in.tready = 0;
        end

        SRC_PORT : begin
            if(bytes_counter == 1 && axis_in.tvalid) next_state = DST_PORT;
        end

        DST_PORT : begin
            if(bytes_counter == 3 && axis_in.tvalid) next_state = LENGTH;
        end

        LENGTH : begin
            if(bytes_counter == 5 && axis_in.tvalid) next_state = HEADER_CHECKSUM;
        end

        HEADER_CHECKSUM : begin
            if(bytes_counter == 7 && axis_in.tvalid) next_state = FORWARDING;
        end

        FORWARDING : begin
            // as long as TLAST does not hit, marking the end of the ethernet payload
            // (RX parser only gives us the raw payload, nothing else)
            // we keep on forwarding the selected data to next parser
            axis_out.tvalid = axis_in.tvalid;
            axis_out.tdata = axis_in.tdata;
            axis_in.tready = axis_out.tready;
            axis_out.tlast = axis_in.tlast;
        end

        default: ;
    endcase

    if(axis_in.tlast && axis_in.tvalid) next_state = IDLE;
end
    
endmodule

In fact, the checks are kept minimal for the demo, lust like what most of my parsers do: they just count bytes until they can start forwarding, effectively just stripping the bloated protocol header.

Verification

In summary, the design is very simple and it’s just a matter of stripping off headers from a stream, nothing fancy.

You can see this simplicity really translate in simulation where we just see streams getting progressively stripped of their headers.

KC705 image

This simulation is based on a cocotb/verilator testbench. The RGMII signals are known to be valid & standard because I used the cocotbext-eth extension.

The frame data itself is a fake fixed example that contains a single ITCH R message at the end. It serves as a template to design the initial parsers, I asked claude to generate it from the various IPv4/UDP/MoldUDP64 specs:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
# test_trademaxxer.py 

# ...

@cocotb.test()
async def test_tardemaxxer_basic(dut):

    # ...

    rgmii_source = RgmiiSource(dut.rgmii_rxd, dut.rgmii_rx_ctl, dut.rgmii_rxc, dut.rst)

    raw_data = bytes([
        # -------------------------
        # Ethernet Header (14B)
        # -------------------------
        0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF,  # dst MAC (broadcast)
        0xDE, 0xAD, 0xBE, 0xEF, 0x00, 0x01,  # src MAC
        0x08, 0x00,                           # ethertype IPv4

        # -------------------------
        # IP Header (20B)
        # -------------------------
        0x45,                   # version=4, IHL=5 (20B, no options)
        0x00,                   # DSCP/ECN
        0x00, 0x4B,             # total length = 75B (20 IP + 8 UDP + 20 MoldUDP64 + 2 msg len + 25 ITCH R)
        0x00, 0x00,             # identification
        0x00, 0x00,             # flags + fragment offset
        0x40,                   # TTL = 64
        0x11,                   # protocol = UDP
        0x00, 0x00,             # checksum (0 = disabled)
        0xC0, 0xA8, 0x01, 0x01, # src IP 192.168.1.1
        0xE9, 0x36, 0x0C, 0x6F, # dst IP 233.54.12.111 (multicast)

        # -------------------------
        # UDP Header (8B)
        # -------------------------
        0x30, 0x39,             # src port 12345
        0x67, 0x48,             # dst port 26456 (ITCH)
        0x00, 0x37,             # length = 55B (8 UDP + 20 MoldUDP64 + 2 msg len + 25 ITCH R)
        0x00, 0x00,             # checksum disabled

        # -------------------------
        # MoldUDP64 Header (20B)
        # -------------------------
        0x53, 0x32, 0x30, 0x31, 0x39, 0x30, 0x31, 0x33, 0x30, 0x20,  # session "S20190130 "
        0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x01,               # sequence number = 1
        0x00, 0x01,                                                     # message count = 1

        # -------------------------
        # MoldUDP64 message envelope
        # -------------------------
        0x00, 0x19,             # message length = 25B (ITCH Stock Directory R)

        # -------------------------
        # ITCH Stock Directory (type R) - 25B
        # -------------------------
        0x52,                   # message type 'R'
        0x00, 0x2A,             # stock locate = 42  (AAPL = 42 today)
        0x00, 0x01,             # tracking number (bloat)
        0x00, 0x00, 0x00, 0x00, 0x00, 0x00,  # timestamp (nanoseconds since midnight)
        0x41, 0x41, 0x50, 0x4C, 0x20, 0x20, 0x20, 0x20,  # symbol "AAPL    " (8B space padded)
        0x4E,                   # market category 'N' (Nasdaq)
        0x00,                   # financial status indicator
        0x00, 0x00, 0x00, 0x01, # round lot size = 1
        0x4E,                   # round lots only 'N'
    ])

    await rgmii_source.send(GmiiFrame.from_payload(raw_data))

    # ...
    

In the future, we’ll use real Nasdaq data as they give binary outputs that we can use for this purpose.

Upcoming Work

Now we have to design an ITCH bookkeeper, that will decode the nature of the incoming messages and act upon an order list + “price ladder”. The goal will be to only cover a single stock like AAPL to keep things simple as that is just a demo.

But that… is for another post !

Thank you for reading to this point. You can write a comment below if you have any questions.

Godspeed

-BRH

This post is licensed under CC BY 4.0 by the author.