External Events Processor Module Family#

Overview#

The External Events Processor family manages input spike events (axons) in the neuromorphic FPGA system. These modules maintain two Block RAMs in a double-buffering scheme: one for the “present” time step (currently being processed) and one for the “future” time step (accumulating new events). This architecture allows continuous operation without dropping input events.

Three variants exist:

external_events_processor.v - Base version with full pipeline hazard handling
external_events_processor_simple.v - Simplified single-core version with wider data paths
external_events_processor_v2.v - Enhanced version with debugging capabilities

Role in the Software/Hardware Stack#

Host Application (Python/C++)
         |
    [hs_bridge]
         |
  [PCIe Interface]
         |
  [Command Interpreter] -----> [External Events Processor] <---- External spike events
         |                              |
         |                     Present BRAM (8 or 16 axons/row)
         |                     Future BRAM (8 or 16 axons/row)
         |                              |
    [HBM Processor] <------ exec_bram_spiked (spike mask)
         |                              |
  [Internal Events] <-------- exec_bram_phase1_done
    Processor

Function:

Receive input spike events from external sources or command interpreter
Store events in double-buffered BRAMs (present/future)
Synchronize event delivery with HBM read operations
Clear processed events after reading
Handle pipeline hazards for concurrent writes during multi-core operation

Key Innovation: Double-buffering allows new events to accumulate in the “future” BRAM while the “present” BRAM is being read and cleared, ensuring no event loss during processing.
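
The double-buffering scheme can be illustrated with a small behavioral model in Python. This is a software sketch, not the RTL: the class name DoubleBufferModel and its method names are hypothetical, and the OR-merge on write is an assumption. Only bram_select, the present/future roles, and the read-then-clear behavior come from the description above.

```python
class DoubleBufferModel:
    """Behavioral sketch of the present/future BRAM pair (8-bit rows)."""

    def __init__(self, rows=16384):
        self.brams = [[0] * rows, [0] * rows]  # BRAM0, BRAM1
        self.bram_select = 0                   # 0: BRAM0 is present, BRAM1 is future

    @property
    def present(self):
        return self.brams[self.bram_select]

    @property
    def future(self):
        return self.brams[1 - self.bram_select]

    def set_array(self, addr, data):
        """Accumulate a spike mask into the future buffer (OR-merge assumed)."""
        self.future[addr] |= data & 0xFF

    def exec_run(self):
        """Start a new time step: swap the roles of the two BRAMs."""
        self.bram_select ^= 1

    def read_and_clear(self, addr):
        """Read one row of the present buffer, then write zero back to it."""
        mask = self.present[addr]
        self.present[addr] = 0
        return mask


eep = DoubleBufferModel(rows=4)
eep.set_array(2, 0x05)           # event arrives during time step N-1
early = eep.read_and_clear(2)    # still 0: the event sits in the future buffer
eep.exec_run()                   # time step N begins, buffers swap
eep.set_array(2, 0x80)           # new event lands in the other BRAM
first = eep.read_and_clear(2)    # mask accumulated during the previous step
second = eep.read_and_clear(2)   # 0, because the row was cleared after reading
```

The event written after the swap stays in the future buffer until the next exec_run, which is why reads of the present buffer never race with incoming events in this model.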

Variant Comparison#

Feature	Base Version	Simple Version	V2 Version
File	external_events_processor.v	external_events_processor_simple.v	external_events_processor_v2.v
Axons per row	8	16	8
Address width	14 bits	13 bits	14 bits
Data width	8 bits	16 bits	8 bits
Target	Multi-core	Single-core	Debug/verification
Future pipeline	3-stage hazard handling	Direct write (no hazards)	Simplified RMW
State machine	5 states	4 states	5 states + debug FSM
Debug features	None	None	CI interface, debug ports
Complexity	High	Low	High

Module Architecture (Base Version)#

                                    ┌─────────────────────────────────┐
                                    │  External Events Processor      │
                                    │                                 │
   setArray_go ──────────┐          │  ┌──────────────────────────┐  │
   setArray_addr[13:0] ──┼──────────┼─>│  Future BRAM Control     │  │
   setArray_data[7:0] ───┘          │  │  - 3-stage pipeline      │  │
                                    │  │  - Hazard detection      │  │
   exec_run ────────────────────────┼─>│  - waddr/wdata/wren[2:0] │  │
                                    │  └────────┬─────────────────┘  │
                                    │           │                     │
                                    │           v                     │
                                    │  ┌──────────────────────────┐  │
                                    │  │   BRAM Multiplexer       │  │
                                    │  │   bram_select toggle     │  │
    ┌───────────────────────────────┼─>│   - BRAM0 ←→ Present    │  │
    │                               │  │   - BRAM1 ←→ Future      │  │
    │  ┌────────────────────────────┼─<│                          │  │
    │  │                            │  └───────┬──────────────────┘  │
    │  │                            │          │                     │
    │  │                            │          v                     │
    │  │                            │  ┌──────────────────────────┐  │
    │  │   exec_hbm_rvalidready ────┼─>│ Present BRAM Control     │  │
    │  │                            │  │ - State machine (5)      │  │
    │  │                            │  │ - Pipeline fill          │  │
    │  │                            │  │ - Read & clear           │  │
    │  │                            │  │ - raddr/waddr tracking   │  │
    │  │                            │  └────────┬─────────────────┘  │
    │  │                            │           │                     │
    │  └───────── exec_bram_spiked[7:0] <───────┘                    │
    │                               │                                 │
    └───────── exec_bram_phase1_done ────────────────────────────────┤
                                    │                                 │
                                    └─────────────────────────────────┘

    BRAM0 (128 Kb)                  BRAM1 (128 Kb)
    ┌────────────┐                  ┌────────────┐
    │ 16384 × 8b │                  │ 16384 × 8b │
    │            │                  │            │
    │ Toggles:   │                  │ Toggles:   │
    │ Present ←→ │                  │ Future ←→  │
    │ Future     │                  │ Present    │
    └────────────┘                  └────────────┘

Data Flow (Two-Phase Operation)#

Phase 0: Setup (between time steps)

1. exec_run pulse triggers:
   - bram_select toggles (swaps present ←→ future)
   - State machine resets to IDLE

2. Present BRAM now contains accumulated events from previous "future"
3. Future BRAM ready to accumulate new events

Phase 1: Event Processing (during time step)

STATE_FILL_PIPE (cycles 0-2):
   ├─> Read BRAM addresses 0, 1, 2
   └─> Fill 3-stage pipeline (no writes yet)

STATE_READ_INPUTS (cycles 3 to completion):
   ├─> Wait for exec_hbm_rvalidready
   ├─> Read next BRAM address (bramPresent_raddr++)
   ├─> Write 0 to lagging address (bramPresent_waddr++)
   ├─> Output exec_bram_spiked[7:0] to downstream
   └─> Loop until bramPresent_waddr == BRAM_ADDR_LIMIT

STATE_PHASE1_DONE:
   └─> Assert exec_bram_phase1_done

Concurrent Future Writes (throughout processing):

setArray_go pulse:
   ├─> Check for pipeline hazards (same address in stages 0, 1, 2)
   ├─> Merge with in-flight data if hazard detected
   ├─> Propagate through 3-stage pipeline
   └─> Write to Future BRAM after 3 cycles

Interface Specification#

Base Version (external_events_processor.v)#

Parameters#

Parameter	Default	Description
`PIPE_DEPTH`	3	BRAM read pipeline depth (matches BRAM latency)

Clock and Reset#

Port	Direction	Width	Description
`clk`	Input	1	System clock (225 MHz)
`resetn`	Input	1	Active-low reset (the code excerpts in this document sample it on posedge clk, i.e. synchronously)

Configuration#

Port	Direction	Width	Description
`num_inputs`	Input	17	Total number of input axons (17-bit value, so at most 131,071)

External Event Input Interface#

Port	Direction	Width	Description
`setArray_go`	Input	1	Write pulse for new axon event
`setArray_addr`	Input	14	BRAM row address (8 axons per row)
`setArray_data`	Input	8	Bit mask (1=spike, 0=no spike)

Execution Control Interface#

Port	Direction	Width	Description
`exec_run`	Input	1	Start new time step (toggles BRAMs)
`exec_bram_phase1_ready`	Output	1	Pipeline filled, ready for reads
`exec_hbm_rvalidready`	Input	1	HBM data valid & ready (advance BRAM)
`exec_bram_spiked`	Output	8	Current spike mask (8 axons)
`exec_bram_phase1_done`	Output	1	All inputs read, phase 1 complete

BRAM0 Interface#

Port	Direction	Width	Description
`bram0_waddr`	Output	14	Write address
`bram0_wdata`	Output	8	Write data
`bram0_wren`	Output	1	Write enable
`bram0_raddr`	Output	14	Read address
`bram0_rden`	Output	1	Read enable
`bram0_rdata`	Input	8	Read data (3-cycle latency)

BRAM1 Interface#

Port	Direction	Width	Description
`bram1_waddr`	Output	14	Write address
`bram1_wdata`	Output	8	Write data
`bram1_wren`	Output	1	Write enable
`bram1_raddr`	Output	14	Read address
`bram1_rden`	Output	1	Read enable
`bram1_rdata`	Input	8	Read data (3-cycle latency)

Simple Version (external_events_processor_simple.v)#

Key differences from base version:

13-bit addresses: axonEvent_addr[12:0], bram0/1_*addr[12:0]
16-bit data: axonEvent_data[15:0], bram0/1_*data[15:0], exec_eep_spiked[15:0]
16 axons per row: axon_addr_limit = num_inputs[16:4] (not [16:3])
Additional output: hbm2eep_rden (HBM FIFO read enable)
Debug outputs: eep_curr_state[1:0], curr_bram_waddr[12:0]
Renamed ports: exec_eep_* instead of exec_bram_*

V2 Version (external_events_processor_v2.v)#

Additional interfaces (beyond base version):

Command Interpreter Debug Interface#

Port	Direction	Width	Description
`ci2eep_empty`	Input	1	Debug command FIFO empty flag
`ci2eep_dout`	Input	14	Debug read address from CI
`ci2eep_rden`	Output	1	Debug command FIFO read enable
`eep2ci_full`	Input	1	Debug response FIFO full flag
`eep2ci_din`	Output	22	Debug response data (addr + data)
`eep2ci_wren`	Output	1	Debug response FIFO write enable

Debug BRAM Read Ports#

Port	Direction	Width	Description
`bram0_raddr_dbg`	Output	14	Debug read address for BRAM0
`bram0_rdata_dbg`	Input	8	Debug read data from BRAM0
`bram1_raddr_dbg`	Output	14	Debug read address for BRAM1
`bram1_rdata_dbg`	Input	8	Debug read data from BRAM1

Detailed Logic Description#

Base Version State Machine#

Present BRAM Control FSM#

States:

STATE_RESET        = 3'd0  // Reset addresses and flags
STATE_IDLE         = 3'd1  // Wait for exec_run
STATE_FILL_PIPE    = 3'd2  // Fill 3-stage BRAM read pipeline
STATE_READ_INPUTS  = 3'd3  // Read inputs, clear memory, sync with HBM
STATE_PHASE1_DONE  = 3'd4  // Signal completion

State Transitions:

    RESET
      |
      v
    IDLE <────────────────┐
      |                   │
      | exec_run          │
      v                   │
   FILL_PIPE              │
      |                   │
      | raddr >= 3        │
      v                   │
  READ_INPUTS             │
      |                   │
      | waddr == limit    │
      v                   │
  PHASE1_DONE ────────────┘

State Behaviors:

STATE_RESET:
    bramPresent_addr_rst = 1'b1        // Reset raddr and waddr to 0
    next_state = STATE_IDLE

STATE_IDLE:
    if (exec_run)
        bramPresent_addr_rst = 1'b1    // Reset for new time step
        next_state = STATE_FILL_PIPE

STATE_FILL_PIPE:
    if (bramPresent_raddr < PIPE_DEPTH)
        bramPresent_rden = 1'b1        // Issue read
        bramPresent_addr_inc = 1'b1    // Increment raddr
    else
        next_state = STATE_READ_INPUTS // Pipeline full

STATE_READ_INPUTS:
    if (exec_hbm_rvalidready)          // HBM ready for next data
        bramPresent_rden = 1'b1        // Read next address
        bramPresent_addr_inc = 1'b1    // Increment both raddr and waddr
        if (bramPresent_waddr == BRAM_ADDR_LIMIT)
            next_state = STATE_PHASE1_DONE

STATE_PHASE1_DONE:
    next_state = STATE_IDLE            // Return to idle
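
The state behaviors above can be summarized as a next-state function. The following Python sketch is a behavioral model, not the RTL: the function name next_state and its argument list are hypothetical, while the state names, encodings, and transition conditions are taken from the listing above.

```python
STATE_RESET, STATE_IDLE, STATE_FILL_PIPE, STATE_READ_INPUTS, STATE_PHASE1_DONE = range(5)
PIPE_DEPTH = 3


def next_state(state, exec_run, hbm_rvalidready, raddr, waddr, addr_limit):
    """Return the next Present BRAM Control state for one clock cycle."""
    if state == STATE_RESET:
        return STATE_IDLE
    if state == STATE_IDLE:
        return STATE_FILL_PIPE if exec_run else STATE_IDLE
    if state == STATE_FILL_PIPE:
        # Stay here until PIPE_DEPTH reads have been issued.
        return STATE_FILL_PIPE if raddr < PIPE_DEPTH else STATE_READ_INPUTS
    if state == STATE_READ_INPUTS:
        # Completion is only checked on cycles where the HBM side is ready.
        if hbm_rvalidready and waddr == addr_limit:
            return STATE_PHASE1_DONE
        return STATE_READ_INPUTS
    if state == STATE_PHASE1_DONE:
        return STATE_IDLE
    raise ValueError(f"unknown state {state}")


# With the limit reached but the HBM side not ready, the FSM holds its state.
held = next_state(STATE_READ_INPUTS, False, False, 13, 10, 10)
done = next_state(STATE_READ_INPUTS, False, True, 13, 10, 10)
```

Note that the completion check sits inside the exec_hbm_rvalidready branch, so the FSM waits in STATE_READ_INPUTS for as long as the HBM side stalls.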

Address Management (Present BRAM)#

The module maintains two addresses with different roles:

Read Address (raddr) - Leading edge:

// Advances PIPE_DEPTH cycles ahead of write address
// Points to data that will be available after pipeline latency
always @(posedge clk) begin
    if (~resetn | exec_run | bramPresent_addr_rst)
        bramPresent_raddr <= 14'd0;
    else if (bramPresent_addr_inc)
        bramPresent_raddr <= bramPresent_raddr + 1'b1;
end

Write Address (waddr) - Lagging edge:

// Trails read address by PIPE_DEPTH cycles
// Points to data currently emerging from pipeline
always @(posedge clk) begin
    if (~resetn | exec_run | bramPresent_addr_rst)
        bramPresent_waddr <= 14'd0;
    else if (bramPresent_addr_inc && exec_bram_phase1_ready)
        bramPresent_waddr <= bramPresent_waddr + 1'b1;
end

Address Relationship:

Cycle 0-2 (FILL_PIPE):
   raddr: 0→1→2→3
   waddr: 0→0→0→0  (not advancing until exec_bram_phase1_ready)

Cycle 3+ (READ_INPUTS):
   raddr: 3→4→5→6→...
   waddr: 0→1→2→3→...  (maintaining 3-cycle lag)

Why Two Addresses?

BRAM has 3-cycle read latency
raddr issues read requests
waddr writes zeros to addresses whose data has emerged from pipeline
This implements read-then-clear: each row is read once, then zeroed after its data has emerged from the pipeline
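
The relationship between the two addresses can be checked with a short cycle-by-cycle model. This Python sketch is simplified: the function name simulate_addresses and the hbm_ready callback are hypothetical, and it assumes exec_bram_phase1_ready is equivalent to raddr having reached PIPE_DEPTH, which the source does not state explicitly.

```python
PIPE_DEPTH = 3


def simulate_addresses(cycles, hbm_ready=lambda cycle: True):
    """Return a list of (cycle, raddr, waddr) tuples, sampled before each clock edge."""
    raddr = waddr = 0
    trace = []
    for cycle in range(cycles):
        trace.append((cycle, raddr, waddr))
        filling = raddr < PIPE_DEPTH            # FILL_PIPE behavior
        addr_inc = filling or hbm_ready(cycle)  # READ_INPUTS waits for the HBM side
        phase1_ready = not filling              # assumption, see lead-in
        if addr_inc:
            if phase1_ready:
                waddr += 1                      # clear address trails the read address
            raddr += 1
    return trace


trace = simulate_addresses(7)
# Stall the HBM side on odd cycles: both counters pause together.
stalled = simulate_addresses(10, hbm_ready=lambda cycle: cycle % 2 == 0)
```

In both runs the difference raddr - waddr stays at PIPE_DEPTH once the pipeline is full, because a stall freezes both counters at the same time.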

Future BRAM Pipeline Hazard Handling#

The base version uses a 3-stage write pipeline to handle setArray_go pulses that target the same future BRAM address within PIPE_DEPTH (3) cycles of each other.

Problem: If two setArray_go pulses target the same address within 3 cycles, data could be lost.

Solution: Three-stage pipeline with hazard detection and data merging.

// Pipeline registers
reg [13:0] bramFuture_waddr [2:0];  // Stages 2→1→0
reg        bramFuture_wren  [2:0];
reg  [7:0] bramFuture_wdata [2:0];

// Stage assignments (stage 2 is newest, stage 0 is oldest)
always @(posedge clk) begin
    if (~resetn) begin
        // Initialize all stages
        bramFuture_wdata[2] <= 8'd0;
        bramFuture_wdata[1] <= 8'd0;
        bramFuture_wdata[0] <= 8'd0;
        // ... (similar for waddr, wren)
    end else if (setArray_go) begin
        // Check for hazards at each stage
        if (setArray_addr == bramFuture_waddr[2]) begin
            // Hazard in stage 2: merge immediately
            bramFuture_wdata[2] <= 8'd0;
            bramFuture_wdata[1] <= bramFuture_wdata[2] | setArray_data;
            bramFuture_wdata[0] <= bramFuture_wdata[1];
            bramFuture_wren[2]  <= 1'b0;
        end else if (setArray_addr == bramFuture_waddr[1]) begin
            // Hazard in stage 1: merge with stage 1 data
            bramFuture_wdata[2] <= 8'd0;
            bramFuture_wdata[1] <= bramFuture_wdata[2];
            bramFuture_wdata[0] <= bramFuture_wdata[1] | setArray_data;
            bramFuture_wren[2]  <= 1'b0;
        end else if (setArray_addr == bramFuture_waddr[0]) begin
            // Hazard in stage 0: data will merge at BRAM (commented out)
            // Current code doesn't merge (see lines 95-96, 103-104)
            bramFuture_wdata[2] <= 8'd0;
            bramFuture_wdata[1] <= bramFuture_wdata[2];
            bramFuture_wdata[0] <= bramFuture_wdata[1];
            bramFuture_wren[2]  <= 1'b0;
        end else begin
            // No hazard: normal pipeline operation
            bramFuture_wdata[2] <= setArray_data;
            bramFuture_wdata[1] <= bramFuture_wdata[2];
            bramFuture_wdata[0] <= bramFuture_wdata[1];
            bramFuture_wren[2]  <= 1'b1;
        end

        // Always propagate addresses and enables
        bramFuture_waddr[2] <= setArray_addr;
        bramFuture_waddr[1] <= bramFuture_waddr[2];
        bramFuture_waddr[0] <= bramFuture_waddr[1];
        bramFuture_wren[1]  <= bramFuture_wren[2];
        bramFuture_wren[0]  <= bramFuture_wren[1];
    end else begin
        // No new write: propagate with zeros
        bramFuture_wdata[2] <= 8'd0;
        bramFuture_wdata[1] <= bramFuture_wdata[2];
        bramFuture_wdata[0] <= bramFuture_wdata[1];
        // ... (propagate addresses/enables)
    end
end

Hazard Example:

Cycle | setArray_go | addr | data | Stage2    | Stage1    | Stage0    | Action
------|-------------|------|------|-----------|-----------|-----------|------------------
  0   |      1      | 100  | 0x01 | 100/0x01  |    -/-    |    -/-    | New write
  1   |      1      | 100  | 0x02 | 100/0x00  | 100/0x03  |    -/-    | Hazard! Merge 0x01|0x02=0x03
  2   |      1      | 200  | 0x04 | 200/0x04  | 100/0x00  | 100/0x03  | No hazard
  3   |      0      |  -   |  -   |    -/0x00 | 200/0x04  | 100/0x00  | Propagate
  4   |      0      |  -   |  -   |    -/0x00 |    -/0x00 | 200/0x04  | Write 100(0x03)
  5   |      0      |  -   |  -   |    -/0x00 |    -/0x00 |    -/0x00 | Write 200(0x04)
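
The hazard example can be reproduced with a behavioral model of the three pipeline stages. This Python sketch mirrors the branch structure of the Verilog excerpt above; the function name step and the dictionary layout are hypothetical. Two details the excerpt elides are assumptions here: an idle cycle shifts in an entry with no address and wren = 0, and stage 0 commits to the future BRAM whenever its wren is set.

```python
EMPTY = {"addr": None, "data": 0, "wren": False}


def step(pipe, go=False, addr=None, data=0):
    """Advance the pipeline one clock. pipe = [stage0, stage1, stage2].

    Returns (new_pipe, bram_write) where bram_write is (addr, data) or None.
    """
    s0, s1, s2 = pipe
    # Stage 0 (oldest) drives the BRAM write port.
    bram_write = (s0["addr"], s0["data"]) if s0["wren"] else None
    n0, n1 = dict(s1), dict(s2)  # default: entries move one stage older
    n2 = {"addr": addr if go else None, "data": 0, "wren": False}
    if go:
        if addr == s2["addr"]:
            n1["data"] |= data   # hazard in stage 2: merge as it moves to stage 1
        elif addr == s1["addr"]:
            n0["data"] |= data   # hazard in stage 1: merge as it moves to stage 0
        elif addr == s0["addr"]:
            pass                 # hazard in stage 0: no merge in the excerpt shown
        else:
            n2["data"], n2["wren"] = data, True  # no hazard: new entry
    return [n0, n1, n2], bram_write


pipe = [dict(EMPTY) for _ in range(3)]
events = [(100, 0x01), (100, 0x02), (200, 0x04)]
writes = []
for cycle in range(8):
    if cycle < len(events):
        pipe, committed = step(pipe, True, *events[cycle])
    else:
        pipe, committed = step(pipe)
    if committed is not None:
        writes.append(committed)
```

Under these assumptions the three events produce two BRAM writes: the two writes to address 100 collapse into one merged mask, and the entry created by the second write carries wren = 0, so it never reaches the BRAM.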

Note: Lines 95-96 and 103-104 show debugging modifications that bypass the final merge operation:

// Original (with full hazard handling):
// assign bram0_wdata = ~bram_select ? bramPresent_wdata : bramFuture_wdata[0] | bramFuture_rdata | setArray_data_pipe;

// Debug version (simpler, may lose events):
assign bram0_wdata = ~bram_select ? bramPresent_wdata : bramFuture_wdata[0];

Simple Version Logic#

The simple version removes complex hazard handling for single-core operation:

Key Simplifications:

Direct Future Write (no pipeline):

// No pipeline registers - direct assignment
assign bramFuture_waddr = axonEvent_addr_reg;
assign bramFuture_wdata = axonEvent_data_reg;
assign bramFuture_wren  = axonEvent_set_reg;

No Future BRAM Read (unless debugging):

assign bramFuture_raddr = 13'd0;
assign bramFuture_rden  = 1'b0;  // Disabled to avoid pipeline issues

Simplified State Machine (4 states instead of 5):

// Removed STATE_PHASE1_DONE, completion detected in STATE_READ_INPUTS
STATE_READ_INPUTS: begin
    if (exec_hbm_rvalidready) begin
        bramPresent_rden = 1'b1;
        bramPresent_wren = 1'b1;
        if (bramPresent_waddr == axon_addr_limit) begin
            phase1_done_set = 1'b1;
            next_state = STATE_IDLE;  // Direct transition
        end
    end
end

16 Axons Per Row:

// Base version: 8 axons per row
// BRAM_ADDR_LIMIT = num_inputs[16:3]  // Divide by 8

// Simple version: 16 axons per row
// axon_addr_limit = num_inputs[16:4]  // Divide by 16

Address Calculation Example:

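num_inputs = 17'd131071 (largest 17-bit value; 131,072 = 2^17 does not fit in 17 bits)
Base: BRAM_ADDR_LIMIT = 131071 >> 3 = 16383 (0x3FFF, the last of 16,384 rows)
Simple: axon_addr_limit = 131071 >> 4 = 8191 (0x1FFF, the last of 8,192 rows)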

Registered Input Events:

// Better place-and-route by registering inputs
always @(posedge clk) begin
    if (~resetn) begin
        axonEvent_set_reg  <= 1'b0;
        axonEvent_addr_reg <= 13'd0;
        axonEvent_data_reg <= 16'd0;
    end else begin
        axonEvent_set_reg  <= axonEvent_set;
        axonEvent_addr_reg <= axonEvent_addr;
        axonEvent_data_reg <= axonEvent_data;
    end
end

V2 Version Enhancements#

The V2 version adds debug capabilities while simplifying the future write logic:

Simplified Future Write (Read-Modify-Write)#

Instead of complex pipeline hazard detection, V2 uses RMW:

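// Read the current value
// Note: the [16:3] slice treats setArray_addr as a 17-bit axon ID, while the base
// interface table lists it as a 14-bit row address; check the port width in the V2 source
assign bramFuture_raddr = setArray_addr[16:3];  // Only uses upper bits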
assign bramFuture_rden  = ci2eep_rden | setArray_go | bramFuture_wren[2] | bramFuture_wren[1] | bramFuture_wren[0];
assign bramFuture_rdata = bram_select ? bram0_rdata : bram1_rdata;

// Merge with new data via OR operation
assign bramFuture_wdata = bramFuture_rdata | setArray_data;

// Propagate through 3-stage pipeline (addresses and enables only)
always @(posedge clk) begin
    if (~resetn) begin
        bramFuture_waddr[2] <= 14'd0;
        bramFuture_waddr[1] <= 14'd0;
        bramFuture_waddr[0] <= 14'd0;
        bramFuture_wren[2]  <= 1'b0;
        bramFuture_wren[1]  <= 1'b0;
        bramFuture_wren[0]  <= 1'b0;
    end else begin
        bramFuture_waddr[2] <= setArray_addr;
        bramFuture_waddr[1] <= bramFuture_waddr[2];
        bramFuture_waddr[0] <= bramFuture_waddr[1];
        bramFuture_wren[2]  <= setArray_go;
        bramFuture_wren[1]  <= bramFuture_wren[2];
        bramFuture_wren[0]  <= bramFuture_wren[1];
    end
end

Why This Works:

Always read before write (RMW pattern)
OR operation merges new spikes with existing ones
Simpler than explicit hazard detection
Relies on BRAM “read first” mode
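
The effect of the OR-based read-modify-write can be contrasted with a plain overwrite in a few lines of Python. This sketch models only the final memory contents and ignores the 3-cycle read latency; the function names rmw_write and direct_write are hypothetical.

```python
def rmw_write(bram, addr, data):
    """Read-modify-write: OR the new spike mask into the stored row."""
    bram[addr] = bram[addr] | data


def direct_write(bram, addr, data):
    """Plain write: the new mask replaces whatever was stored."""
    bram[addr] = data


merged = [0] * 4
overwritten = [0] * 4
for mask in (0x01, 0x02):        # two events for the same row in one time step
    rmw_write(merged, 3, mask)
    direct_write(overwritten, 3, mask)
```

With the OR-merge both spikes survive in the row, while the plain write keeps only the last mask, which is the trade-off between the V2 and direct-write approaches.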

Debug State Machine#

V2 adds a separate FSM for debug access:

Debug States:

DBG_STATE_RESET   = 3'd0  // Reset debug logic
DBG_STATE_IDLE    = 3'd1  // Wait for debug command
DBG_STATE_WAIT_0  = 3'd2  // Issue first read
DBG_STATE_WAIT_1  = 3'd3  // Wait cycle 2
DBG_STATE_WAIT_2  = 3'd4  // Wait cycle 3
DBG_STATE_WAIT_3  = 3'd5  // Wait cycle 4
DBG_STATE_DONE    = 3'd6  // Send response, pop command

Debug Flow:

Command Interpreter writes address to ci2eep FIFO
Debug FSM detects ~ci2eep_empty
Issue BRAM read (bram0/1_raddr_dbg = ci2eep_dout)
Wait 4 cycles for pipeline (WAIT_0→1→2→3)
Write response to eep2ci FIFO (address + data)
Pop command from ci2eep (ci2eep_rden = 1)
Return to IDLE

Debug Interface Behavior:

always @(*) begin
    ci2eep_rden = 1'b0;
    eep2ci_wren = 1'b0;
    dbg_next_state = dbg_curr_state;

    case (dbg_curr_state)
        DBG_STATE_IDLE: begin
            if (~ci2eep_empty)
                dbg_next_state = DBG_STATE_WAIT_0;
        end
        DBG_STATE_WAIT_0: begin
            if (~eep2ci_full)
                eep2ci_wren = 1'b1;  // Write response
            dbg_next_state = DBG_STATE_WAIT_1;
        end
        // ... (similar for WAIT_1, WAIT_2, WAIT_3)
        DBG_STATE_DONE: begin
            ci2eep_rden = 1'b1;      // Pop command
            dbg_next_state = DBG_STATE_IDLE;
        end
    endcase
end

// Response format: {address, data}
assign eep2ci_din = bram_select ? {bram0_raddr_dbg, bram0_rdata_dbg}
                                : {bram1_raddr_dbg, bram1_rdata_dbg};

// Debug read addresses (always driven)
assign bram0_raddr_dbg = ci2eep_dout;
assign bram1_raddr_dbg = ci2eep_dout;
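
On the host side, the 22-bit response word can be split back into its fields. The Python sketch below assumes the layout implied by the concatenation above (14-bit address in the upper bits, 8-bit data in the lower bits); the helper names pack_response and unpack_response are hypothetical and are not part of hs_bridge.

```python
ADDR_BITS = 14
DATA_BITS = 8
ADDR_MASK = (1 << ADDR_BITS) - 1
DATA_MASK = (1 << DATA_BITS) - 1


def pack_response(addr, data):
    """Build the word as {addr[13:0], data[7:0]}, matching the Verilog concatenation."""
    return ((addr & ADDR_MASK) << DATA_BITS) | (data & DATA_MASK)


def unpack_response(word):
    """Split a 22-bit eep2ci_din word into (row address, spike mask)."""
    return (word >> DATA_BITS) & ADDR_MASK, word & DATA_MASK


word = pack_response(0x1234, 0xA5)
addr, data = unpack_response(word)
```

Packing and unpacking are exact inverses for any address below 2^14 and any 8-bit mask, which makes the pair convenient for round-trip checks in a test bench.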

Memory Map#

Base and V2 Versions (8 axons/row)#

BRAM Organization:

Depth: 16,384 rows (14-bit address)
Width: 8 bits (1 bit per axon)
Total Capacity: 131,072 axons
Dual BRAMs: BRAM0 and BRAM1 (double buffering)

Address Mapping:

Axon ID Range     | BRAM Address | Bit Position
------------------|--------------|-------------
0 - 7             | 0x0000       | [0] to [7]
8 - 15            | 0x0001       | [0] to [7]
16 - 23           | 0x0002       | [0] to [7]
...               | ...          | ...
131064 - 131071   | 0x3FFF       | [0] to [7]

Bit Encoding:

Bit = 1: Axon spiked
Bit = 0: No spike

Address Calculation:

bram_addr = axon_id[16:3];      // Upper 14 bits
bit_pos   = axon_id[2:0];       // Lower 3 bits

Simple Version (16 axons/row)#

BRAM Organization:

Depth: 8,192 rows (13-bit address)
Width: 16 bits (1 bit per axon)
Total Capacity: 131,072 axons
Dual BRAMs: BRAM0 and BRAM1

Address Mapping:

Axon ID Range     | BRAM Address | Bit Position
------------------|--------------|-------------
0 - 15            | 0x0000       | [0] to [15]
16 - 31           | 0x0001       | [0] to [15]
32 - 47           | 0x0002       | [0] to [15]
...               | ...          | ...
131056 - 131071   | 0x1FFF       | [0] to [15]

Address Calculation:

bram_addr = axon_id[16:4];      // Upper 13 bits
bit_pos   = axon_id[3:0];       // Lower 4 bits
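
Both address mappings can be expressed as one host-side helper. This Python sketch assumes that bit position i of a row corresponds to the mask 1 << i, as the mapping tables suggest; the function name axon_to_bram is hypothetical.

```python
def axon_to_bram(axon_id, axons_per_row=8):
    """Return (bram_addr, bit_mask) for an axon ID.

    axons_per_row = 8 models the base and V2 layout, 16 the simple layout.
    """
    if axons_per_row not in (8, 16):
        raise ValueError("axons_per_row must be 8 or 16")
    if not 0 <= axon_id < 131072:
        raise ValueError("axon_id must fit in 17 bits")
    shift = 3 if axons_per_row == 8 else 4
    bram_addr = axon_id >> shift            # upper bits select the row
    bit_pos = axon_id & (axons_per_row - 1) # lower bits select the bit
    return bram_addr, 1 << bit_pos


base_last = axon_to_bram(131071)        # last axon, 8 per row
simple_last = axon_to_bram(131071, 16)  # last axon, 16 per row
```

The returned mask can be ORed with the masks of other axons in the same row before the row is written.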

Memory Utilization:

Base/V2: 16,384 × 8 = 128 Kb per BRAM → 256 Kb total
Simple: 8,192 × 16 = 128 Kb per BRAM → 256 Kb total
Same total capacity, different organization

Timing Diagrams#

Time Step Transition (Double Buffer Swap)#

         ┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────
clk      ┘     └─────┘     └─────┘     └─────┘     └─────┘

exec_run ──────┐     ┌─────────────────────────────────────────
               └─────┘

         Time Step N-1    │ Time Step N
         ─────────────────┼─────────────────────────────────────
                          │
bram_select = 0           │ Toggle → 1
BRAM0 = Present           │ BRAM0 → Future
BRAM1 = Future            │ BRAM1 → Present
                          │
State:   READ_INPUTS      │ IDLE → FILL_PIPE → READ_INPUTS

bramPresent = BRAM0       │ bramPresent = BRAM1
bramFuture  = BRAM1       │ bramFuture  = BRAM0

Present BRAM Read Sequence (Base & V2)#

Cycle    0    1    2    3    4    5    6    7    8    9
         ─────┬────┬────┬────┬────┬────┬────┬────┬────┬────
State    FILL │FILL│FILL│READ│READ│READ│READ│READ│READ│...
         ────┬┴────┴────┴────┴────┴────┴────┴────┴────┴────

rden     ────┐    ┌───┐    ┌───┐    ┌───┐    ┌───┐    ┌───
         ────└────┘   └────┘   └────┘   └────┘   └────┘

raddr        0    1    2    3    4    5    6    7    8    9

wren     ───────────────────┐    ┌───┐    ┌───┐    ┌───┐
         ───────────────────└────┘   └────┘   └────┘   └────

waddr        0    0    0    0    1    2    3    4    5    6

                     ┌─────────┐
phase1_ready ────────┘         └──────────────────────────────
                          (asserted in READ_INPUTS state)

hbm_rvalidready ────────────┐    ┌───┐    ┌───┐    ┌───┐
                ────────────└────┘   └────┘   └────┘   └────

exec_bram_spiked    X    X    X    D0   D1   D2   D3   D4   D5
                    (pipeline latency = 3 cycles)

BRAM Data Flow:
  Cycle 0: Issue read addr 0  →  (3-cycle latency)  →  Cycle 3: Data 0 emerges
  Cycle 1: Issue read addr 1  →  (3-cycle latency)  →  Cycle 4: Data 1 emerges
  Cycle 2: Issue read addr 2  →  (3-cycle latency)  →  Cycle 5: Data 2 emerges
  Cycle 3: Issue read addr 3, Write 0 to addr 0
  Cycle 4: Issue read addr 4, Write 0 to addr 1
  ...

Future BRAM Write with Hazard (Base Version)#

Cycle        0    1    2    3    4    5    6    7
             ────┬────┬────┬────┬────┬────┬────┬────
setArray_go  ───┐    ┌───┐    ┌───────────────────
             ───└────┘   └────┘

setArray_addr    100  200  100  -    -    -    -    -
setArray_data    0x01 0x04 0x02 -    -    -    -    -

Pipeline Stage 2:
  waddr          100  200  100  -    -    -    -    -
  wdata          0x01 0x04 0x00 -    -    -    -    -
  wren           1    1    0    -    -    -    -    -

Pipeline Stage 1:
  waddr          -    100  200  100  -    -    -    -
  wdata          -    0x01 0x04 0x00 -    -    -    -
  wren           -    1    1    0    -    -    -    -

Pipeline Stage 0:
  waddr          -    -    100  200  100  -    -    -
  wdata          -    -    0x03 0x04 0x00 -    -    - ← Merged in cycle 2
  wren           -    -    1    1    0    -    -    -

BRAM Write:
  Addr 100: Data = 0x03 (merged 0x01|0x02, stage 0 in cycle 2)
  Addr 200: Data = 0x04 (stage 0 in cycle 3)
  Addr 100: no write (wren = 0 in cycle 4; the event was already merged)

Note: Following the hazard logic listed earlier, the cycle 2 write to addr 100 matches the cycle 0 entry, which is in stage 1 at that point. The new data is ORed into that entry as it advances to stage 0, and the new entry's write enable is cleared.

Simple Version: No Pipeline (Direct Write)#

Cycle        0    1    2    3    4    5
             ────┬────┬────┬────┬────┬────
axonEvent_set ──┐    ┌───┐    ┌──────────
             ───└────┘   └────┘

axonEvent_addr   100  200  300  -    -    -
axonEvent_data   0x0001 0x0004 0x0008 -    -    -

(Register inputs for better timing)
             ────┬────┬────┬────┬────┬────
*_set_reg    ───────┐    ┌───┐    ┌──────
                ───└────┘   └────┘

*_addr_reg       -    100  200  300  -    -
*_data_reg       -    0x0001 0x0004 0x0008 -    -

bramFuture_waddr -    100  200  300  -    -
bramFuture_wdata -    0x0001 0x0004 0x0008 -    -
bramFuture_wren  -    1    1    1    -    -

BRAM Write:      -    100  200  300  -    -
                      0x0001 0x0004 0x0008

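No hazard handling! Single-core only.
Based on the assignments shown for this variant (direct write, future BRAM read disabled), there is no read-modify-write: if the same row is written twice in one time step, the later write overwrites the earlier one.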

V2 Debug Read Sequence#

Cycle        0    1    2    3    4    5    6    7
             ────┬────┬────┬────┬────┬────┬────┬────
DBG_State    IDLE│WAIT│WAIT│WAIT│WAIT│DONE│IDLE│...
                 │  0 │  1 │  2 │  3 │    │    │

ci2eep_empty ────┐                                  ┌───
             ────└──────────────────────────────────┘

ci2eep_dout      0x1234 (stays valid until rden)

ci2eep_rden  ────────────────────────────────────┐  ┌───
             ────────────────────────────────────└──┘

bram*_raddr_dbg  0x1234 (always driven)

eep2ci_wren  ───────┐    ┌───┐    ┌───┐    ┌───┐  ┌───
             ───────└────┘   └────┘   └────┘   └──┘
                 (writes in WAIT_0 through WAIT_3 if not full)

eep2ci_din       X    {0x1234,D}  (data D emerges after latency)

Debug transaction:
  Cycle 0: Detect command available
  Cycle 1-4: Issue reads, wait for pipeline, write response
  Cycle 5: Pop command FIFO
  Cycle 6: Return to IDLE

Cross-References#

Upstream Modules#

command_interpreter.v (command_interpreter.md):
- Generates setArray_go, setArray_addr, setArray_data signals
- V2: Provides debug FIFO interfaces (ci2eep_*, eep2ci_*)
- Controls when external events are injected
pcie2fifos.v (pcie2fifos.md):
- Ultimate source of external events from host
- Events flow: PCIe → Command Interpreter → External Events Processor

Downstream Modules#

hbm_processor.v (hbm_processor.md):
- Receives exec_bram_spiked (spike mask)
- Provides exec_hbm_rvalidready (synchronization signal)
- Uses spike masks to fetch pointer chains from HBM
internal_events_processor.v (internal_events_processor.md):
- Receives exec_bram_phase1_done (completion signal)
- Coordinates two-phase execution (external then internal events)

Peer Modules#

pointer_fifo_controller.v (pointer_fifo_controller.md):
- Works with spike masks from this module
- Controls flow of pointer data to HBM processor

Module Comparison: When to Use Each Variant#

Use Base Version When:#

Multi-core architecture with multiple cores writing to same future BRAM
Concurrent writes to the same BRAM address are expected
Data integrity is critical and no events can be lost
Pipeline hazards need explicit detection and merging
8 axons per row organization preferred

Trade-offs:

✅ Full hazard handling
✅ No data loss in multi-core scenarios
❌ More complex logic
❌ Higher resource usage (pipeline registers)
⚠️ Debugging modifications present (lines 95-96, 103-104)

Use Simple Version When:#

Single-core architecture with only one writer to future BRAM
Lower resource usage is priority
Wider data paths (16-bit) preferred for bandwidth
No concurrent writes to same address expected
Simpler logic easier to verify and debug

Trade-offs:

✅ Minimal resource usage
✅ 2× data width (16 vs 8 bits)
✅ Simpler state machine (4 vs 5 states)
✅ Better timing due to registered inputs
❌ No hazard protection
❌ Data loss if concurrent writes occur
❌ Single-core only

Use V2 Version When:#

Debug and verification required
BRAM inspection needed during runtime
Command interpreter interface for test patterns
Read-modify-write approach acceptable
Production debugging of neuromorphic algorithms

Trade-offs:

✅ Debug capabilities (FIFO interface)
✅ Simplified future write logic (RMW vs explicit hazards)
✅ Direct BRAM inspection via debug ports
❌ Additional debug FSM (more resources)
❌ Extra FIFO interfaces
❌ Not optimized for performance

Performance Characteristics#

Base and V2 Versions#

Throughput:

Read Rate: 1 BRAM address per exec_hbm_rvalidready cycle
Effective Rate: Limited by HBM bandwidth (~450 MHz possible, typically 225 MHz)
Pipeline Fill: 3 cycles (one-time cost per time step)
Total Time: 3 + num_inputs[16:3] cycles per time step (best case, with exec_hbm_rvalidready asserted every cycle)

Example (131,072 input axons):

BRAM addresses = 131072 / 8 = 16384
Pipeline fill  = 3 cycles
Total cycles   = 3 + 16384 = 16387 cycles
At 225 MHz     = 16387 / 225e6 = 72.8 µs

Future Write Latency:

Base: 3 cycles (pipeline depth) from setArray_go to BRAM write
V2: 3 cycles (pipeline depth) from setArray_go to BRAM write
Hazard Penalty: 0 cycles (merged in pipeline)

Simple Version#

Throughput:

Read Rate: 1 BRAM address per exec_hbm_rvalidready cycle
Effective Rate: 225 MHz typical
Pipeline Fill: 3 cycles
Total Time: 3 + num_inputs[16:4] cycles per time step (best case, with exec_hbm_rvalidready asserted every cycle)

Example (131,072 input axons):

BRAM addresses = 131072 / 16 = 8192
Pipeline fill  = 3 cycles
Total cycles   = 3 + 8192 = 8195 cycles
At 225 MHz     = 8195 / 225e6 = 36.4 µs  (2× faster than base!)
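
The cycle counts in these examples follow from a short calculation. The Python sketch below reproduces them under the best-case assumption that exec_hbm_rvalidready is asserted on every cycle; the function name phase1_time_us is hypothetical, and the 225 MHz default comes from the clock table.

```python
def phase1_time_us(num_inputs, axons_per_row, clk_hz=225e6, pipe_fill=3):
    """Return (cycles, microseconds) for one phase-1 pass with no HBM stalls."""
    rows = num_inputs // axons_per_row  # one BRAM row per read
    cycles = pipe_fill + rows
    return cycles, cycles / clk_hz * 1e6


base_cycles, base_us = phase1_time_us(131072, 8)       # base and V2 layout
simple_cycles, simple_us = phase1_time_us(131072, 16)  # simple layout
```

Any stall on the HBM side adds cycles on top of these figures, so they are lower bounds rather than typical values.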

Future Write Latency:

Direct: 1 cycle from axonEvent_set to registered write
Total: 2 cycles (register + BRAM write)

Resource Usage Comparison:

Resource	Base	Simple	V2
LUTs (approx.)	500	250	600
Flip-Flops	200	120	280
BRAM buffers (128 Kb each)	2	2	2
Pipeline Regs	3×(14+8+1)	0	3×(14+1)

Common Issues and Debugging#

Issue 1: Events Lost During Time Step Transition#

Symptoms:

External events written near exec_run pulse disappear
Inconsistent spike counts between time steps

Root Cause:

Writing to future BRAM while bram_select is toggling
Race condition between write and buffer swap

Debug:

// Check timing of setArray_go relative to exec_run
// Add ILA probe:
ila_0 your_ila (
    .clk(clk),
    .probe0(exec_run),
    .probe1(setArray_go),
    .probe2(setArray_addr),
    .probe3(bram_select)
);

Solution:

Ensure setArray_go never occurs within 3 cycles of exec_run
Add FIFO between command interpreter and external events processor
Stall writes during buffer swap

Issue 2: Pipeline Hazards Not Detected (Base Version)#

Symptoms:

Expected spike data doesn’t appear
OR of multiple writes shows only one bit set

Root Cause:

Hazard detection logic not functioning
Debugging modifications (lines 95-96, 103-104) bypass merging

Debug:

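// Monitor pipeline stages (one debug copy per stage)
(* mark_debug = "true" *) reg [13:0] bramFuture_waddr_dbg [2:0];
(* mark_debug = "true" *) reg  [7:0] bramFuture_wdata_dbg [2:0];
(* mark_debug = "true" *) reg        bramFuture_wren_dbg  [2:0];

integer i;
always @(posedge clk) begin
    // Verilog-2001 cannot assign a whole unpacked array at once,
    // so copy each pipeline stage individually
    for (i = 0; i < 3; i = i + 1) begin
        bramFuture_waddr_dbg[i] <= bramFuture_waddr[i];
        bramFuture_wdata_dbg[i] <= bramFuture_wdata[i];
        bramFuture_wren_dbg[i]  <= bramFuture_wren[i];
    end
end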

Solution:

Restore original BRAM write logic:

// Change:
assign bram0_wdata = ~bram_select ? bramPresent_wdata : bramFuture_wdata[0];

// To:
assign bram0_wdata = ~bram_select ? bramPresent_wdata :
                     bramFuture_wdata[0] | bramFuture_rdata | setArray_data_pipe;
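
The effect of dropping the merge terms can be illustrated with a small behavioural model. This is a simplified sketch, not a cycle-accurate copy of the RTL: `rmw_pipeline` is a hypothetical function, it assumes at most one write request per cycle, and it models each request as reading the BRAM when issued and committing `depth` cycles later.

```python
def rmw_pipeline(writes, depth=3, merge=True):
    """Model read-modify-write requests passing through a `depth`-cycle pipeline.

    writes: list of (cycle, addr, data) tuples, at most one per cycle.
    merge:  when True, a new request also ORs in the data of in-flight
            requests to the same address (hazard merging).
    Returns the final memory contents as a dict {addr: value}.
    """
    mem = {}
    in_flight = []  # (commit_cycle, addr, value)
    by_cycle = {cycle: (addr, data) for cycle, addr, data in writes}
    last_cycle = max(by_cycle) + depth

    for cycle in range(last_cycle + 1):
        # Commit requests that have reached the end of the pipeline
        due = [e for e in in_flight if e[0] == cycle]
        in_flight = [e for e in in_flight if e[0] != cycle]
        for _, addr, value in due:
            mem[addr] = value

        # Issue a new request: read the current BRAM value and OR in the data
        if cycle in by_cycle:
            addr, data = by_cycle[cycle]
            value = mem.get(addr, 0) | data
            if merge:
                for _, other_addr, other_value in in_flight:
                    if other_addr == addr:
                        value |= other_value
            in_flight.append((cycle + depth, addr, value))
    return mem


# Two back-to-back writes to row 5, each setting a different axon bit
requests = [(0, 5, 0b0001), (1, 5, 0b0100)]
with_merge = rmw_pipeline(requests, merge=True)
without_merge = rmw_pipeline(requests, merge=False)
```

With merging, row 5 ends up with both bits set; without it, the second request overwrites the first because it read the row before the first write had committed. That matches the "only one bit set" symptom listed above.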

Issue 3: Address Limit Calculation Wrong#

Symptoms:

Phase 1 completes too early or too late
Not all neurons receive input events

Root Cause:

Incorrect calculation of BRAM address limit
Mismatch between num_inputs and actual neuron count

Debug:

// Check address limit
// Base:   BRAM_ADDR_LIMIT = num_inputs[16:3]  (divide by 8)
// Simple: axon_addr_limit = num_inputs[16:4]  (divide by 16)

// Add assertion:
assert property (@(posedge clk) disable iff (~resetn)
    (curr_state == STATE_READ_INPUTS && bramPresent_waddr == BRAM_ADDR_LIMIT)
    |=> (curr_state == STATE_PHASE1_DONE)
);

Solution:

Verify num_inputs matches neuron configuration
Base: Ensure multiple of 8
Simple: Ensure multiple of 16
Add +1 if rounding needed:

// If num_inputs is not an exact multiple of 8, add one row for the remainder
assign BRAM_ADDR_LIMIT = num_inputs[16:3] + (|num_inputs[2:0]);  // Round up

Issue 4: BRAM Read Latency Mismatch#

Symptoms:

Data appears corrupted or delayed
exec_bram_spiked shows wrong values

Root Cause:

BRAM configured with latency ≠ PIPE_DEPTH (3)
Pipeline depth parameter doesn’t match actual BRAM

Debug:

// Verify BRAM configuration in IP customization:
// - Read Latency: should be 3
// - Primitive Type: should match PIPE_DEPTH

// Check if raddr/waddr maintain proper offset:
assert property (@(posedge clk) disable iff (~resetn)
    (curr_state == STATE_READ_INPUTS && exec_bram_phase1_ready)
    |-> (bramPresent_raddr == bramPresent_waddr + PIPE_DEPTH)
);

Solution:

Reconfigure BRAM IP for 3-cycle latency
Or update PIPE_DEPTH parameter to match BRAM:

external_events_processor #(
    .PIPE_DEPTH(2)  // If BRAM has 2-cycle latency
) eep_inst (
    // ...
);
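
The relationship between BRAM latency and PIPE_DEPTH can be sanity-checked with a few lines of host-side code. The sketch below is an idealised model under stated assumptions: `read_clear_alignment` is a hypothetical helper, one address is issued per cycle with no stalls, and the read address leads the write address by PIPE_DEPTH as in the assertion above.

```python
def read_clear_alignment(bram_latency, pipe_depth, num_rows):
    """For each write (clear) address, report which row's data is on the
    BRAM output in the same cycle.

    Returns a list of (row_on_output, row_being_cleared) pairs. The two
    values match only when bram_latency equals pipe_depth.
    """
    pairs = []
    for waddr in range(num_rows):
        raddr = waddr + pipe_depth             # leading read address
        row_on_output = raddr - bram_latency   # issued bram_latency cycles ago
        pairs.append((row_on_output, waddr))
    return pairs


matched = read_clear_alignment(bram_latency=3, pipe_depth=3, num_rows=8)
mismatched = read_clear_alignment(bram_latency=2, pipe_depth=3, num_rows=8)
```

In the mismatched case the data on the output belongs to the row after the one being cleared, so `exec_bram_spiked` would be attributed to the wrong row.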

Issue 5: V2 Debug Reads Return Stale Data#

Symptoms:

Debug responses show old/incorrect BRAM data
Debug state machine stuck in WAIT states

Root Cause:

Insufficient wait cycles for BRAM read latency
Debug FSM transitions too quickly

Debug:

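// Monitor debug state progression (8-entry history, newest at index 0)
(* mark_debug = "true" *) reg [2:0] dbg_state_history [7:0];

integer k;
always @(posedge clk) begin
    // Shift element by element; array slices cannot be assigned in Verilog-2001
    for (k = 7; k > 0; k = k - 1)
        dbg_state_history[k] <= dbg_state_history[k-1];
    dbg_state_history[0] <= dbg_curr_state;
end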

Solution:

Ensure 4 WAIT states (WAIT_0→WAIT_1→WAIT_2→WAIT_3)
Add extra wait state if needed:

localparam [2:0] DBG_STATE_WAIT_4 = 3'd7;

// In state machine:
DBG_STATE_WAIT_3: begin
    dbg_next_state = DBG_STATE_WAIT_4;  // Extra cycle for BRAM read data
end
DBG_STATE_WAIT_4: begin
    // Issue the response only after the extra wait, so the data is valid
    if (~eep2ci_full)
        eep2ci_wren = 1'b1;
    dbg_next_state = DBG_STATE_DONE;
end

Safety and Edge Cases#

Edge Case 1: num_inputs = 0#

Behavior:

BRAM_ADDR_LIMIT = 0
State machine immediately transitions FILL_PIPE → READ_INPUTS → PHASE1_DONE
No BRAM accesses occur

Safety:

✅ No undefined behavior
✅ Module functions correctly (zero inputs processed)
⚠️ Wastes cycles (should be caught at system level)

Edge Case 2: num_inputs Not Multiple of 8 (or 16)#

Example: num_inputs = 17'd100

Base Version:

BRAM_ADDR_LIMIT = 100 >> 3 = 12
Actual coverage   = 12 * 8 = 96 axons
Missing           = 4 axons (96-99 not processed)

Fix:

// Round up to nearest multiple
assign BRAM_ADDR_LIMIT = (num_inputs + 7) >> 3;  // Ceiling division
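
The truncating and ceiling forms can be compared on the host before changing the RTL. This is a minimal sketch with hypothetical function names; it also checks that the ceiling form gives the same result as the "add one if there is a remainder" form shown under Issue 3.

```python
def addr_limit_truncating(num_inputs, axons_per_row=8):
    """Row count as computed by a plain right shift (num_inputs >> 3 for 8)."""
    return num_inputs // axons_per_row


def addr_limit_ceiling(num_inputs, axons_per_row=8):
    """Row count rounded up so a partial last row is still processed."""
    return (num_inputs + axons_per_row - 1) // axons_per_row


def uncovered_axons(num_inputs, axons_per_row=8):
    """Number of trailing axons that the truncating form never reads."""
    return num_inputs - addr_limit_truncating(num_inputs, axons_per_row) * axons_per_row


rows_truncated = addr_limit_truncating(100)   # 12 rows, covers axons 0-95
rows_rounded_up = addr_limit_ceiling(100)     # 13 rows, covers axons 0-103
missing = uncovered_axons(100)                # 4 axons (96-99)
```

Rounding up means the last row contains unused bit positions (axons 100-103 in this example); whether those bits are guaranteed to stay zero depends on the writer and is not established here.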

Edge Case 3: Concurrent setArray_go and exec_run#

Scenario:

Cycle N:   exec_run = 1 (toggle bram_select)
Cycle N:   setArray_go = 1 (write to future BRAM)

Problem:

bram_select toggles in the same cycle as the write, so the event may land in the wrong BRAM

Current Design:

bram_select registered on exec_run edge
setArray_go writes on same edge
Race condition! Indeterminate which BRAM receives write

Solution:

Pipeline exec_run by 1 cycle:

reg exec_run_pipe;

always @(posedge clk) begin
    if (~resetn)
        exec_run_pipe <= 1'b0;
    else
        exec_run_pipe <= exec_run;
end

// Use exec_run_pipe for bram_select toggle
always @(posedge clk) begin
    if (~resetn)
        bram_select <= 1'b0;
    else if (exec_run_pipe)  // Changed from exec_run
        bram_select <= ~bram_select;
end

Edge Case 4: BRAM Write During Pipeline Fill#

Scenario:

STATE_FILL_PIPE: bramPresent_wren = 0 (not asserted yet)
Future writes:    bramFuture_wren[0] = 1 (trying to write)

Problem (Multi-core):

If multiple cores write to future BRAM during present BRAM pipeline fill
Potential for lost writes if exceeding BRAM write bandwidth

Current Design:

Single write port per BRAM
Future writes serialized through pipeline
Safe as long as write rate ≤ 1 per 3 cycles

Solution (if needed):

Use dual-port BRAM (separate read/write ports)
Or implement write FIFO to buffer concurrent writes

Safety Check: Phase 1 Completion Detection#

Assertion:

// Ensure phase1_done only asserted when all addresses processed
property phase1_done_check;
    @(posedge clk) disable iff (~resetn)
    (exec_bram_phase1_done) |-> (bramPresent_waddr == BRAM_ADDR_LIMIT);
endproperty
assert_phase1: assert property (phase1_done_check);

Safety Check: No Writes During Buffer Swap#

Assertion:

// Ensure no future writes during exec_run
property no_write_during_swap;
    @(posedge clk) disable iff (~resetn)
    (exec_run) |-> (bramFuture_wren[0] == 1'b0);
endproperty
assert_no_write: assert property (no_write_during_swap);

Future Enhancement Opportunities#

1. Configurable Data Width#

Allow parameterization of axons per row:

module external_events_processor #(
    parameter PIPE_DEPTH    = 3,
    parameter AXONS_PER_ROW = 8,  // 8, 16, 32, etc.
    // Derived address and data widths (not intended to be overridden);
    // declared here because localparam is not allowed inside a port list
    parameter ADDR_BITS = 17 - $clog2(AXONS_PER_ROW),
    parameter DATA_BITS = AXONS_PER_ROW
)(
    input [ADDR_BITS-1:0] setArray_addr,
    input [DATA_BITS-1:0] setArray_data,
    // ...
);

2. Burst Mode for Faster Pipeline Fill#

Current: Fill pipeline sequentially (3 cycles)
Enhancement: Issue all 3 reads in 1 cycle (if BRAM supports)

STATE_FILL_PIPE: begin
    if (bramPresent_raddr == 0) begin
        // Issue all 3 reads at once
        bram_raddr[0] = 14'd0;
        bram_raddr[1] = 14'd1;
        bram_raddr[2] = 14'd2;
        bram_rden[0]  = 1'b1;
        bram_rden[1]  = 1'b1;
        bram_rden[2]  = 1'b1;
        next_state = STATE_READ_INPUTS;
    end
end

3. Event Timestamping#

Add timestamp to each event for precise temporal resolution:

// Expand data width: [7:0] data + [15:0] timestamp
input [23:0] setArray_data,  // {timestamp, spike_mask}

// BRAM organization: 24 bits per row

4. Event Compression#

Sparse events (few spikes per row) waste bandwidth:

// Instead of full bit mask, store indices
// Example: Spikes at axons 5, 17, 42
// Compressed: {3'b011, 6'd42, 6'd17, 6'd5}  // Count + indices
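
The count-plus-indices layout can be prototyped on the host to check the packing. This is a sketch of one possible encoding, not an existing format in the design: the function names, the fixed capacity of three index slots, and the choice to place the lowest index in the least significant slot are all assumptions made to match the example above.

```python
def compress_row(spike_indices, slots=3, index_bits=6, count_bits=3):
    """Pack spike indices as {count, idx[slots-1], ..., idx[0]}.

    The lowest index goes in the least significant slot; unused slots are 0.
    """
    if len(spike_indices) > slots or len(spike_indices) >= (1 << count_bits):
        raise ValueError("too many spikes for this word layout")
    word = len(spike_indices) << (slots * index_bits)
    for slot, index in enumerate(sorted(spike_indices)):
        if not 0 <= index < (1 << index_bits):
            raise ValueError("index does not fit in index_bits")
        word |= index << (slot * index_bits)
    return word


def decompress_row(word, slots=3, index_bits=6):
    """Recover the sorted list of spike indices from a packed word."""
    count = word >> (slots * index_bits)
    mask = (1 << index_bits) - 1
    return [(word >> (slot * index_bits)) & mask for slot in range(count)]


packed = compress_row([5, 17, 42])
restored = decompress_row(packed)
```

With these defaults the packed word is 21 bits wide. It only saves bandwidth when a 64-axon window contains three spikes or fewer, so a real design would need a fallback to the full bit mask for dense rows.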

5. Multi-Buffer (>2 BRAMs)#

Allow more than 2 time steps in flight:

parameter NUM_BUFFERS = 4;  // Quad buffering

reg [1:0] bram_select;      // 2-bit select (4 buffers)

always @(posedge clk) begin
    if (exec_run)
        bram_select <= (bram_select + 1) & 2'b11;  // Circular
end

6. AXI4-Stream Interface#

Replace custom interface with standard AXI4-Stream:

// Input events
input         s_axis_tvalid,
output        s_axis_tready,
input  [31:0] s_axis_tdata,  // {addr, data}
input         s_axis_tlast,

// Output spikes
output        m_axis_tvalid,
input         m_axis_tready,
output [31:0] m_axis_tdata,  // Spike mask + metadata

7. Configurable Pipeline Depth#

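Tie PIPE_DEPTH to the BRAM read latency through a single shared constant so the two cannot drift apart:

// Hierarchical references (e.g. bram0.READ_LATENCY_A) are not legal in
// constant expressions, so define the latency once and use it for both
// the BRAM IP configuration and the processor
localparam BRAM_LATENCY = 3;  // Keep equal to the BRAM IP read latency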

external_events_processor #(
    .PIPE_DEPTH(BRAM_LATENCY)  // Match automatically
) eep (
    // ...
);

Key Terms and Definitions#

Term	Definition
Axon	Input neuron connection; source of spike events
Double Buffering	Two-buffer scheme (present/future) allowing simultaneous read and write
Present BRAM	BRAM being read during current time step (then cleared)
Future BRAM	BRAM accumulating events for next time step
bram_select	Toggle bit selecting which physical BRAM is present vs. future
Pipeline Depth	Number of cycles between BRAM read request and data availability (typically 3)
Pipeline Fill	Initial phase where read pipeline is populated before writes begin
Leading Address	Read address (raddr) - advances pipeline depth ahead of write address
Lagging Address	Write address (waddr) - clears data after it emerges from pipeline
Spike Mask	Bit vector where each bit represents spike (1) or no-spike (0) for an axon
Phase 1	External event processing (vs. Phase 2: internal/synaptic events)
exec_run	Control pulse starting new time step, toggling present ←→ future BRAMs
exec_hbm_rvalidready	Synchronization signal from HBM indicating data consumed, advance BRAM
setArray_go	Write pulse for external event (from command interpreter or other source)
Pipeline Hazard	Conflict when concurrent writes target same BRAM address within pipeline depth
RMW (Read-Modify-Write)	Pattern of reading current value, modifying, then writing back
Hazard Detection	Logic identifying when new write conflicts with in-flight writes
Data Merging	Combining multiple writes to same address via OR operation
Time Step	Discrete computation cycle in neuromorphic algorithm (milliseconds typically)
Axon Event	External spike arriving at input neuron
Axons Per Row	Number of axons packed into single BRAM address (8 or 16 bits)
Address Limit	Maximum BRAM address to read/write (depends on num_inputs)

Conclusion#

The External Events Processor family provides flexible solutions for managing input spike events in neuromorphic systems:

Base version: Full-featured with pipeline hazard handling for multi-core
Simple version: Streamlined single-core variant with lower resource usage
V2 version: Debug-enhanced variant for verification and development

Key Design Principles:

Double buffering prevents event loss during time step transitions
Pipeline management ensures correct synchronization with BRAM latency
Hazard detection/merging (base) or simplified RMW (V2) prevents data corruption
State machine coordinates read-clear cycles with downstream modules
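
The first principle can be summarised in a small behavioural model. This is an illustrative sketch only: the `DoubleBuffer` class and its method names are hypothetical, timing and pipeline latency are ignored, and bank 0 is assumed to be the present buffer when the select bit is 0.

```python
class DoubleBuffer:
    """Two banks of spike masks: one read-and-cleared, one accumulating."""

    def __init__(self, rows):
        self.banks = [[0] * rows, [0] * rows]
        self.select = 0  # index of the bank currently acting as "present"

    def write_future(self, addr, mask):
        """OR a spike mask into the future bank (models a setArray_go write)."""
        self.banks[1 - self.select][addr] |= mask

    def swap(self):
        """Exchange present and future roles (models the exec_run pulse)."""
        self.select ^= 1

    def read_and_clear(self):
        """Return the present bank's contents and zero it (models Phase 1)."""
        present = self.banks[self.select]
        snapshot = list(present)
        for addr in range(len(present)):
            present[addr] = 0
        return snapshot


db = DoubleBuffer(rows=4)
db.write_future(1, 0b10)           # event for the next time step
db.swap()                          # next time step begins
db.write_future(1, 0b01)           # arrives while the present bank is in use
step_a = db.read_and_clear()
db.swap()
step_b = db.read_and_clear()
```

The event written during the first read-out is not lost; it appears in the following time step because it was accumulated in the other bank.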

Selection Guide:

Multi-core system with concurrent writes → Base version
Single-core system, resource-constrained → Simple version
Debug/verification needed → V2 version

For questions or issues, cross-reference with command_interpreter.md (upstream) and hbm_processor.md (downstream) for complete system understanding.
