External Events Processor Module Family#

Overview#

The External Events Processor family manages input spike events (axons) in the neuromorphic FPGA system. These modules maintain two Block RAMs in a double-buffering scheme: one for the “present” time step (currently being processed) and one for the “future” time step (accumulating new events). This architecture allows continuous operation without dropping input events.

Three variants exist:

  1. external_events_processor.v - Base version with full pipeline hazard handling

  2. external_events_processor_simple.v - Simplified single-core version with wider data paths

  3. external_events_processor_v2.v - Enhanced version with debugging capabilities

Role in the Software/Hardware Stack#

Host Application (Python/C++)
         |
    [hs_bridge]
         |
  [PCIe Interface]
         |
  [Command Interpreter] -----> [External Events Processor] <---- External spike events
         |                              |
         |                     Present BRAM (8 or 16 axons/row)
         |                     Future BRAM (8 or 16 axons/row)
         |                              |
    [HBM Processor] <------ exec_bram_spiked (spike mask)
         |                              |
  [Internal Events] <-------- exec_bram_phase1_done
    Processor

Function:

  • Receive input spike events from external sources or command interpreter

  • Store events in double-buffered BRAMs (present/future)

  • Synchronize event delivery with HBM read operations

  • Clear processed events after reading

  • Handle pipeline hazards for concurrent writes during multi-core operation

Key Innovation: Double-buffering allows new events to accumulate in the “future” BRAM while the “present” BRAM is being read and cleared, ensuring no event loss during processing.
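
The swap described above can be sketched as a small host-side behavioral model. This is an illustrative Python model, not the RTL: the class and method names are hypothetical, and it assumes `bram_select = 0` means BRAM0 is the present buffer, as the timing diagram later in this document shows.

```python
class DoubleBufferedEvents:
    """Behavioral model of the present/future buffer swap (not the RTL)."""

    def __init__(self, rows=16384):
        self.brams = [[0] * rows, [0] * rows]
        self.bram_select = 0  # assumption: 0 means BRAM0 is "present"

    @property
    def present(self):
        return self.brams[self.bram_select]

    @property
    def future(self):
        return self.brams[1 - self.bram_select]

    def set_array(self, addr, data):
        """Accumulate a spike mask into the future buffer (OR-merge)."""
        self.future[addr] |= data

    def exec_run(self):
        """Start a time step: swap roles so accumulated events become present."""
        self.bram_select ^= 1

    def read_and_clear(self, addr):
        """Return the present mask for one row and clear it, as phase 1 does."""
        mask = self.present[addr]
        self.present[addr] = 0
        return mask
```

Events written with `set_array` stay invisible to `read_and_clear` until the next `exec_run`, which is the property that lets writes continue while a time step is being processed.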


Variant Comparison#

Feature         | Base Version                | Simple Version                     | V2 Version
----------------|-----------------------------|------------------------------------|-------------------------------
File            | external_events_processor.v | external_events_processor_simple.v | external_events_processor_v2.v
Axons per row   | 8                           | 16                                 | 8
Address width   | 14 bits                     | 13 bits                            | 14 bits
Data width      | 8 bits                      | 16 bits                            | 8 bits
Target          | Multi-core                  | Single-core                        | Debug/verification
Future pipeline | 3-stage hazard handling     | Direct write (no hazards)          | Simplified RMW
State machine   | 5 states                    | 4 states                           | 5 states + debug FSM
Debug features  | None                        | None                               | CI interface, debug ports
Complexity      | High                        | Low                                | High


Module Architecture (Base Version)#

                                    ┌─────────────────────────────────┐
                                    │  External Events Processor      │
                                    │                                 │
   setArray_go ──────────┐          │  ┌──────────────────────────┐  │
   setArray_addr[13:0] ──┼──────────┼─>│  Future BRAM Control     │  │
   setArray_data[7:0] ───┘          │  │  - 3-stage pipeline      │  │
                                    │  │  - Hazard detection      │  │
   exec_run ────────────────────────┼─>│  - waddr/wdata/wren[2:0] │  │
                                    │  └────────┬─────────────────┘  │
                                    │           │                     │
                                    │           v                     │
                                    │  ┌──────────────────────────┐  │
                                    │  │   BRAM Multiplexer       │  │
                                    │  │   bram_select toggle     │  │
    ┌───────────────────────────────┼─>│   - BRAM0 ←→ Present    │  │
    │                               │  │   - BRAM1 ←→ Future      │  │
    │  ┌────────────────────────────┼─<│                          │  │
    │  │                            │  └───────┬──────────────────┘  │
    │  │                            │          │                     │
    │  │                            │          v                     │
    │  │                            │  ┌──────────────────────────┐  │
    │  │   exec_hbm_rvalidready ────┼─>│ Present BRAM Control     │  │
    │  │                            │  │ - State machine (5)      │  │
    │  │                            │  │ - Pipeline fill          │  │
    │  │                            │  │ - Read & clear           │  │
    │  │                            │  │ - raddr/waddr tracking   │  │
    │  │                            │  └────────┬─────────────────┘  │
    │  │                            │           │                     │
    │  └───────── exec_bram_spiked[7:0] <───────┘                    │
    │                               │                                 │
    └───────── exec_bram_phase1_done ────────────────────────────────┤
                                    │                                 │
                                    └─────────────────────────────────┘

    BRAM0 (128 Kb)                  BRAM1 (128 Kb)
    ┌────────────┐                  ┌────────────┐
    │ 16384 × 8b │                  │ 16384 × 8b │
    │            │                  │            │
    │ Toggles:   │                  │ Toggles:   │
    │ Present ←→ │                  │ Future ←→  │
    │ Future     │                  │ Present    │
    └────────────┘                  └────────────┘

Data Flow (Two-Phase Operation)#

Phase 0: Setup (between time steps)

1. exec_run pulse triggers:
   - bram_select toggles (swaps present ←→ future)
   - State machine resets to IDLE

2. Present BRAM now contains accumulated events from previous "future"
3. Future BRAM ready to accumulate new events

Phase 1: Event Processing (during time step)

STATE_FILL_PIPE (cycles 0-2):
   ├─> Read BRAM addresses 0, 1, 2
   └─> Fill 3-stage pipeline (no writes yet)

STATE_READ_INPUTS (cycles 3 to completion):
   ├─> Wait for exec_hbm_rvalidready
   ├─> Read next BRAM address (bramPresent_raddr++)
   ├─> Write 0 to lagging address (bramPresent_waddr++)
   ├─> Output exec_bram_spiked[7:0] to downstream
   └─> Loop until bramPresent_waddr == BRAM_ADDR_LIMIT

STATE_PHASE1_DONE:
   └─> Assert exec_bram_phase1_done

Concurrent Future Writes (throughout processing):

setArray_go pulse:
   ├─> Check for pipeline hazards (same address in stages 0, 1, 2)
   ├─> Merge with in-flight data if hazard detected
   ├─> Propagate through 3-stage pipeline
   └─> Write to Future BRAM after 3 cycles

Interface Specification#

Base Version (external_events_processor.v)#

Parameters#

Parameter  | Default | Description
-----------|---------|------------------------------------------------
PIPE_DEPTH | 3       | BRAM read pipeline depth (matches BRAM latency)

Clock and Reset#

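Port   | Direction | Width | Description
-------|-----------|-------|------------------------------------------------
clk    | Input     | 1     | System clock (225 MHz)
resetn | Input     | 1     | Active-low reset (the code excerpts in this document sample it synchronously, inside always @(posedge clk))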

Configuration#

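Port       | Direction | Width | Description
-----------|-----------|-------|------------------------------------------------
num_inputs | Input     | 17    | Total number of input axons (17-bit value, so at most 131,071; the BRAMs hold 131,072 axon bits)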

External Event Input Interface#

Port          | Direction | Width | Description
--------------|-----------|-------|-----------------------------------
setArray_go   | Input     | 1     | Write pulse for new axon event
setArray_addr | Input     | 14    | BRAM row address (8 axons per row)
setArray_data | Input     | 8     | Bit mask (1=spike, 0=no spike)

Execution Control Interface#

Port                   | Direction | Width | Description
-----------------------|-----------|-------|--------------------------------------
exec_run               | Input     | 1     | Start new time step (toggles BRAMs)
exec_bram_phase1_ready | Output    | 1     | Pipeline filled, ready for reads
exec_hbm_rvalidready   | Input     | 1     | HBM data valid & ready (advance BRAM)
exec_bram_spiked       | Output    | 8     | Current spike mask (8 axons)
exec_bram_phase1_done  | Output    | 1     | All inputs read, phase 1 complete

BRAM0 Interface#

Port        | Direction | Width | Description
------------|-----------|-------|----------------------------
bram0_waddr | Output    | 14    | Write address
bram0_wdata | Output    | 8     | Write data
bram0_wren  | Output    | 1     | Write enable
bram0_raddr | Output    | 14    | Read address
bram0_rden  | Output    | 1     | Read enable
bram0_rdata | Input     | 8     | Read data (3-cycle latency)

BRAM1 Interface#

Port        | Direction | Width | Description
------------|-----------|-------|----------------------------
bram1_waddr | Output    | 14    | Write address
bram1_wdata | Output    | 8     | Write data
bram1_wren  | Output    | 1     | Write enable
bram1_raddr | Output    | 14    | Read address
bram1_rden  | Output    | 1     | Read enable
bram1_rdata | Input     | 8     | Read data (3-cycle latency)

Simple Version (external_events_processor_simple.v)#

Key differences from base version:

  • 13-bit addresses: axonEvent_addr[12:0], bram0/1_*addr[12:0]

  • 16-bit data: axonEvent_data[15:0], bram0/1_*data[15:0], exec_eep_spiked[15:0]

  • 16 axons per row: axon_addr_limit = num_inputs[16:4] (not [16:3])

  • Additional output: hbm2eep_rden (HBM FIFO read enable)

  • Debug outputs: eep_curr_state[1:0], curr_bram_waddr[12:0]

  • Renamed ports: exec_eep_* instead of exec_bram_*

V2 Version (external_events_processor_v2.v)#

Additional interfaces (beyond base version):

Command Interpreter Debug Interface#

Port         | Direction | Width | Description
-------------|-----------|-------|----------------------------------
ci2eep_empty | Input     | 1     | Debug command FIFO empty flag
ci2eep_dout  | Input     | 14    | Debug read address from CI
ci2eep_rden  | Output    | 1     | Debug command FIFO read enable
eep2ci_full  | Input     | 1     | Debug response FIFO full flag
eep2ci_din   | Output    | 22    | Debug response data (addr + data)
eep2ci_wren  | Output    | 1     | Debug response FIFO write enable

Debug BRAM Read Ports#

Port            | Direction | Width | Description
----------------|-----------|-------|-----------------------------
bram0_raddr_dbg | Output    | 14    | Debug read address for BRAM0
bram0_rdata_dbg | Input     | 8     | Debug read data from BRAM0
bram1_raddr_dbg | Output    | 14    | Debug read address for BRAM1
bram1_rdata_dbg | Input     | 8     | Debug read data from BRAM1


Detailed Logic Description#

Base Version State Machine#

Present BRAM Control FSM#

States:

STATE_RESET        = 3'd0  // Reset addresses and flags
STATE_IDLE         = 3'd1  // Wait for exec_run
STATE_FILL_PIPE    = 3'd2  // Fill 3-stage BRAM read pipeline
STATE_READ_INPUTS  = 3'd3  // Read inputs, clear memory, sync with HBM
STATE_PHASE1_DONE  = 3'd4  // Signal completion

State Transitions:

    RESET
      |
      v
    IDLE <────────────────┐
      |                   │
      | exec_run          │
      v                   │
   FILL_PIPE              │
      |                   │
      | raddr >= 3        │
      v                   │
  READ_INPUTS             │
      |                   │
      | waddr == limit    │
      v                   │
  PHASE1_DONE ────────────┘

State Behaviors:

STATE_RESET:
    bramPresent_addr_rst = 1'b1        // Reset raddr and waddr to 0
    next_state = STATE_IDLE

STATE_IDLE:
    if (exec_run)
        bramPresent_addr_rst = 1'b1    // Reset for new time step
        next_state = STATE_FILL_PIPE

STATE_FILL_PIPE:
    if (bramPresent_raddr < PIPE_DEPTH)
        bramPresent_rden = 1'b1        // Issue read
        bramPresent_addr_inc = 1'b1    // Increment raddr
    else
        next_state = STATE_READ_INPUTS // Pipeline full

STATE_READ_INPUTS:
    if (exec_hbm_rvalidready)          // HBM ready for next data
        bramPresent_rden = 1'b1        // Read next address
        bramPresent_addr_inc = 1'b1    // Increment both raddr and waddr
        if (bramPresent_waddr == BRAM_ADDR_LIMIT)
            next_state = STATE_PHASE1_DONE

STATE_PHASE1_DONE:
    next_state = STATE_IDLE            // Return to idle
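
To make the state behaviors above easier to trace, here is a minimal Python sketch that steps the FSM for one time step. It follows the pseudo-code literally, so cycle counts may differ from the RTL by a cycle. Two things are assumptions rather than facts from the source: `exec_run` is taken to be asserted on the first cycle, and `exec_bram_phase1_ready` is treated as "state is READ_INPUTS". The function name and the trace format are hypothetical.

```python
RESET, IDLE, FILL_PIPE, READ_INPUTS, PHASE1_DONE = range(5)
PIPE_DEPTH = 3


def run_present_fsm(addr_limit, hbm_ready=()):
    """Return a per-cycle trace of (state, raddr, waddr) for one time step.

    hbm_ready supplies exec_hbm_rvalidready for successive READ_INPUTS
    cycles; once it is exhausted the signal is treated as always asserted.
    """
    ready_iter = iter(hbm_ready)
    state, raddr, waddr = IDLE, 0, 0
    trace = []
    while True:
        trace.append((state, raddr, waddr))
        if state == IDLE:
            # Assumption: exec_run is asserted on this first cycle.
            raddr = waddr = 0
            state = FILL_PIPE
        elif state == FILL_PIPE:
            if raddr < PIPE_DEPTH:
                raddr += 1               # issue a read, advance raddr only
            else:
                state = READ_INPUTS      # pipeline full
        elif state == READ_INPUTS:
            if next(ready_iter, True):
                if waddr == addr_limit:
                    state = PHASE1_DONE
                raddr += 1               # read the next row
                waddr += 1               # clear the row that just emerged
        else:                            # PHASE1_DONE
            return trace


for state, raddr, waddr in run_present_fsm(addr_limit=4):
    print(state, raddr, waddr)
```

Passing a list such as `[False, True, False]` as `hbm_ready` models HBM stalls: both addresses hold their values on a stalled cycle, so the gap between them stays at `PIPE_DEPTH`.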

Address Management (Present BRAM)#

The module maintains two addresses with different roles:

Read Address (raddr) - Leading edge:

// Advances PIPE_DEPTH cycles ahead of write address
// Points to data that will be available after pipeline latency
always @(posedge clk) begin
    if (~resetn | exec_run | bramPresent_addr_rst)
        bramPresent_raddr <= 14'd0;
    else if (bramPresent_addr_inc)
        bramPresent_raddr <= bramPresent_raddr + 1'b1;
end

Write Address (waddr) - Lagging edge:

// Trails read address by PIPE_DEPTH cycles
// Points to data currently emerging from pipeline
always @(posedge clk) begin
    if (~resetn | exec_run | bramPresent_addr_rst)
        bramPresent_waddr <= 14'd0;
    else if (bramPresent_addr_inc && exec_bram_phase1_ready)
        bramPresent_waddr <= bramPresent_waddr + 1'b1;
end

Address Relationship:

Cycle 0-2 (FILL_PIPE):
   raddr: 0→1→2→3
   waddr: 0→0→0→0  (not advancing until exec_bram_phase1_ready)

Cycle 3+ (READ_INPUTS):
   raddr: 3→4→5→6→...
   waddr: 0→1→2→3→...  (maintaining 3-cycle lag)

Why Two Addresses?

  • BRAM has 3-cycle read latency

  • raddr issues read requests

  • waddr writes zeros to addresses whose data has emerged from pipeline

  • This implements “read first” behavior: read data, then clear it
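
The read-then-clear data path can be modeled in a few lines. This is a hedged sketch, not the RTL: it represents the 3-cycle read latency as a queue of in-flight reads and clears each row when its data emerges, which is the job the lagging write address does in hardware. The function name is hypothetical.

```python
from collections import deque


def read_and_clear(bram, latency=3):
    """Stream every row out of bram in order, zeroing each row afterwards."""
    in_flight = deque()   # reads issued but not yet returned: (addr, data)
    spiked = []           # masks in the order they would reach exec_bram_spiked
    raddr = 0
    while len(spiked) < len(bram):
        # Data emerges once the pipeline is full, or while it drains at the end.
        if len(in_flight) == latency or raddr == len(bram):
            waddr, data = in_flight.popleft()
            spiked.append(data)
            bram[waddr] = 0              # lagging write clears the row
        if raddr < len(bram):
            in_flight.append((raddr, bram[raddr]))  # leading read
            raddr += 1
    return spiked


rows = [0x01, 0x02, 0x04, 0x08, 0x10]
print(read_and_clear(rows), rows)
```

After the call, every row has been delivered once and the buffer is empty, ready to serve as the future buffer on the next swap.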

Future BRAM Pipeline Hazard Handling#

The base version implements a 3-stage write pipeline (PIPE_DEPTH stages deep) so that back-to-back writes to the same Future BRAM address are merged instead of overwriting each other.

Problem: If two setArray_go pulses target the same address within 3 cycles, data could be lost.

Solution: Three-stage pipeline with hazard detection and data merging.

// Pipeline registers
reg [13:0] bramFuture_waddr [2:0];  // Stages 2→1→0
reg        bramFuture_wren  [2:0];
reg  [7:0] bramFuture_wdata [2:0];

// Stage assignments (stage 2 is newest, stage 0 is oldest)
always @(posedge clk) begin
    if (~resetn) begin
        // Initialize all stages
        bramFuture_wdata[2] <= 8'd0;
        bramFuture_wdata[1] <= 8'd0;
        bramFuture_wdata[0] <= 8'd0;
        // ... (similar for waddr, wren)
    end else if (setArray_go) begin
        // Check for hazards at each stage
        if (setArray_addr == bramFuture_waddr[2]) begin
            // Hazard in stage 2: merge immediately
            bramFuture_wdata[2] <= 8'd0;
            bramFuture_wdata[1] <= bramFuture_wdata[2] | setArray_data;
            bramFuture_wdata[0] <= bramFuture_wdata[1];
            bramFuture_wren[2]  <= 1'b0;
        end else if (setArray_addr == bramFuture_waddr[1]) begin
            // Hazard in stage 1: merge with stage 1 data
            bramFuture_wdata[2] <= 8'd0;
            bramFuture_wdata[1] <= bramFuture_wdata[2];
            bramFuture_wdata[0] <= bramFuture_wdata[1] | setArray_data;
            bramFuture_wren[2]  <= 1'b0;
        end else if (setArray_addr == bramFuture_waddr[0]) begin
            // Hazard in stage 0: the new data was meant to be OR-ed in at the BRAM
            // write port, but that merge is commented out (see lines 95-96, 103-104),
            // so setArray_data is not merged in this branch
            bramFuture_wdata[2] <= 8'd0;
            bramFuture_wdata[1] <= bramFuture_wdata[2];
            bramFuture_wdata[0] <= bramFuture_wdata[1];
            bramFuture_wren[2]  <= 1'b0;
        end else begin
            // No hazard: normal pipeline operation
            bramFuture_wdata[2] <= setArray_data;
            bramFuture_wdata[1] <= bramFuture_wdata[2];
            bramFuture_wdata[0] <= bramFuture_wdata[1];
            bramFuture_wren[2]  <= 1'b1;
        end

        // Always propagate addresses and enables
        bramFuture_waddr[2] <= setArray_addr;
        bramFuture_waddr[1] <= bramFuture_waddr[2];
        bramFuture_waddr[0] <= bramFuture_waddr[1];
        bramFuture_wren[1]  <= bramFuture_wren[2];
        bramFuture_wren[0]  <= bramFuture_wren[1];
    end else begin
        // No new write: propagate with zeros
        bramFuture_wdata[2] <= 8'd0;
        bramFuture_wdata[1] <= bramFuture_wdata[2];
        bramFuture_wdata[0] <= bramFuture_wdata[1];
        // ... (propagate addresses/enables)
    end
end

Hazard Example:

Cycle | setArray_go | addr | data | Stage2    | Stage1    | Stage0    | Action
------|-------------|------|------|-----------|-----------|-----------|------------------
  0   |      1      | 100  | 0x01 | 100/0x01  |    -/-    |    -/-    | New write
  1   |      1      | 100  | 0x02 | 100/0x00  | 100/0x03  |    -/-    | Hazard! Merge 0x01|0x02=0x03
  2   |      1      | 200  | 0x04 | 200/0x04  | 100/0x00  | 100/0x03  | No hazard; stage 0 writes 100(0x03)
  3   |      0      |  -   |  -   |    -/0x00 | 200/0x04  | 100/0x00  | Propagate (stage 0 wren=0, no write)
  4   |      0      |  -   |  -   |    -/0x00 |    -/0x00 | 200/0x04  | Stage 0 writes 200(0x04)
  5   |      0      |  -   |  -   |    -/0x00 |    -/0x00 |    -/0x00 | Pipeline empty
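
The stage columns of the example can be reproduced with a short Python model of the pipeline registers. This is a sketch under stated assumptions: it mirrors the branch structure of the Verilog excerpt above, uses `None` for an address the table shows as `-` (the excerpt elides what the RTL holds there), and counts a BRAM write in the cycle where stage 0 holds an entry with `wren = 1`. The class name is hypothetical.

```python
class FuturePipeline:
    """Model of the 3-stage future-write pipeline (index 2 newest, 0 oldest)."""

    def __init__(self):
        self.waddr = [None, None, None]
        self.wdata = [0, 0, 0]
        self.wren = [0, 0, 0]

    def clock(self, go=False, addr=None, data=0):
        """Apply one clock edge; return (addr, data) if stage 0 writes, else None."""
        a, d, e = self.waddr, self.wdata, self.wren   # values before the edge
        if go and addr == a[2]:
            new_d, new_e2 = [d[1], d[2] | data, 0], 0   # merge into stage 2 entry
        elif go and addr == a[1]:
            new_d, new_e2 = [d[1] | data, d[2], 0], 0   # merge into stage 1 entry
        elif go and addr == a[0]:
            new_d, new_e2 = [d[1], d[2], 0], 0          # no merge in this branch
        elif go:
            new_d, new_e2 = [d[1], d[2], data], 1       # no hazard
        else:
            new_d, new_e2 = [d[1], d[2], 0], 0          # idle: shift in zeros
        self.wdata = new_d
        self.waddr = [a[1], a[2], addr if go else None]
        self.wren = [e[1], e[2], new_e2]
        return (self.waddr[0], self.wdata[0]) if self.wren[0] else None


pipe = FuturePipeline()
stimulus = [(True, 100, 0x01), (True, 100, 0x02), (True, 200, 0x04),
            (False, None, 0), (False, None, 0), (False, None, 0)]
for cycle, (go, addr, data) in enumerate(stimulus):
    print(cycle, pipe.clock(go, addr, data))
```

With the stimulus from the table, the model reports one write of 0x03 to address 100 and one write of 0x04 to address 200, so both events for address 100 arrive in a single BRAM write.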

Note: Lines 95-96 and 103-104 show debugging modifications that bypass the final merge operation:

// Original (with full hazard handling):
// assign bram0_wdata = ~bram_select ? bramPresent_wdata : bramFuture_wdata[0] | bramFuture_rdata | setArray_data_pipe;

// Debug version (simpler, may lose events):
assign bram0_wdata = ~bram_select ? bramPresent_wdata : bramFuture_wdata[0];

Simple Version Logic#

The simple version removes complex hazard handling for single-core operation:

Key Simplifications:

  1. Direct Future Write (no pipeline):

// No pipeline registers - direct assignment
assign bramFuture_waddr = axonEvent_addr_reg;
assign bramFuture_wdata = axonEvent_data_reg;
assign bramFuture_wren  = axonEvent_set_reg;

  2. No Future BRAM Read (unless debugging):

assign bramFuture_raddr = 13'd0;
assign bramFuture_rden  = 1'b0;  // Disabled to avoid pipeline issues

  3. Simplified State Machine (4 states instead of 5):

// Removed STATE_PHASE1_DONE, completion detected in STATE_READ_INPUTS
STATE_READ_INPUTS: begin
    if (exec_hbm_rvalidready) begin
        bramPresent_rden = 1'b1;
        bramPresent_wren = 1'b1;
        if (bramPresent_waddr == axon_addr_limit) begin
            phase1_done_set = 1'b1;
            next_state = STATE_IDLE;  // Direct transition
        end
    end
end

  4. 16 Axons Per Row:

// Base version: 8 axons per row
// BRAM_ADDR_LIMIT = num_inputs[16:3]  // Divide by 8

// Simple version: 16 axons per row
// axon_addr_limit = num_inputs[16:4]  // Divide by 16

Address Calculation Example:

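  • num_inputs = 17'd131071 (the largest value a 17-bit port can hold; 131,072 needs 18 bits and would truncate to 0)

  • Base: BRAM_ADDR_LIMIT = 131071 >> 3 = 16383 (index of the last of 16384 rows)

  • Simple: axon_addr_limit = 131071 >> 4 = 8191 (index of the last of 8192 rows)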

  5. Registered Input Events:

// Better place-and-route by registering inputs
always @(posedge clk) begin
    if (~resetn) begin
        axonEvent_set_reg  <= 1'b0;
        axonEvent_addr_reg <= 13'd0;
        axonEvent_data_reg <= 16'd0;
    end else begin
        axonEvent_set_reg  <= axonEvent_set;
        axonEvent_addr_reg <= axonEvent_addr;
        axonEvent_data_reg <= axonEvent_data;
    end
end
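
The row-limit slices described above (`num_inputs[16:3]` and `num_inputs[16:4]`) can be checked on the host before a configuration is sent to the FPGA. The sketch below is a hypothetical helper, not part of hs_bridge. It assumes the limit is simply the bit slice of a 17-bit `num_inputs`, and it rejects values that would not fit in the port.

```python
NUM_INPUTS_BITS = 17  # width of the num_inputs port


def row_limit(num_inputs, axons_per_row):
    """Return the BRAM row limit the hardware derives from num_inputs."""
    if not 0 <= num_inputs < (1 << NUM_INPUTS_BITS):
        raise ValueError("num_inputs does not fit in 17 bits")
    shift = {8: 3, 16: 4}[axons_per_row]   # divide by 8 or by 16
    return num_inputs >> shift


print(row_limit(131071, 8))    # base / V2 organization
print(row_limit(131071, 16))   # simple organization
print((1 << 17) & ((1 << NUM_INPUTS_BITS) - 1))  # what 131072 truncates to
```

The last line shows why 131,072 is not a usable `num_inputs` value on a 17-bit port: the truncated value is 0, which would make the limit 0 as well.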

V2 Version Enhancements#

The V2 version adds debug capabilities while simplifying the future write logic:

Simplified Future Write (Read-Modify-Write)#

Instead of complex pipeline hazard detection, V2 uses RMW:

// Read the current value
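// Note: the [16:3] slice implies a 17-bit axon address on setArray_addr in V2,
// unlike the 14-bit row address documented for the base interface
assign bramFuture_raddr = setArray_addr[16:3];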
assign bramFuture_rden  = ci2eep_rden | setArray_go | bramFuture_wren[2] | bramFuture_wren[1] | bramFuture_wren[0];
assign bramFuture_rdata = bram_select ? bram0_rdata : bram1_rdata;

// Merge with new data via OR operation
assign bramFuture_wdata = bramFuture_rdata | setArray_data;

// Propagate through 3-stage pipeline (addresses and enables only)
always @(posedge clk) begin
    if (~resetn) begin
        bramFuture_waddr[2] <= 14'd0;
        bramFuture_waddr[1] <= 14'd0;
        bramFuture_waddr[0] <= 14'd0;
        bramFuture_wren[2]  <= 1'b0;
        bramFuture_wren[1]  <= 1'b0;
        bramFuture_wren[0]  <= 1'b0;
    end else begin
        bramFuture_waddr[2] <= setArray_addr;
        bramFuture_waddr[1] <= bramFuture_waddr[2];
        bramFuture_waddr[0] <= bramFuture_waddr[1];
        bramFuture_wren[2]  <= setArray_go;
        bramFuture_wren[1]  <= bramFuture_wren[2];
        bramFuture_wren[0]  <= bramFuture_wren[1];
    end
end

Why This Works:

  • Always read before write (RMW pattern)

  • OR operation merges new spikes with existing ones

  • Simpler than explicit hazard detection

  • Relies on BRAM “read first” mode
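
The effect of the OR-merge is easiest to see next to a plain overwrite. The following Python sketch is a functional illustration only: it ignores the 3-cycle read latency and models the BRAM as a list, and both function names are hypothetical.

```python
def rmw_write(bram, addr, data):
    """Read-modify-write: OR the new mask into whatever the row already holds."""
    bram[addr] = bram[addr] | data


def direct_write(bram, addr, data):
    """Plain write: the new mask replaces the row contents."""
    bram[addr] = data


merged = [0] * 4
rmw_write(merged, 2, 0x01)
rmw_write(merged, 2, 0x80)      # second event for the same row

overwritten = [0] * 4
direct_write(overwritten, 2, 0x01)
direct_write(overwritten, 2, 0x80)

print(hex(merged[2]), hex(overwritten[2]))
```

With read-modify-write both spikes survive in row 2, while the plain write keeps only the most recent mask.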

Debug State Machine#

V2 adds a separate FSM for debug access:

Debug States:

DBG_STATE_RESET   = 3'd0  // Reset debug logic
DBG_STATE_IDLE    = 3'd1  // Wait for debug command
DBG_STATE_WAIT_0  = 3'd2  // Issue first read
DBG_STATE_WAIT_1  = 3'd3  // Wait cycle 2
DBG_STATE_WAIT_2  = 3'd4  // Wait cycle 3
DBG_STATE_WAIT_3  = 3'd5  // Wait cycle 4
DBG_STATE_DONE    = 3'd6  // Send response, pop command

Debug Flow:

1. Command Interpreter writes address to ci2eep FIFO
2. Debug FSM detects ~ci2eep_empty
3. Issue BRAM read (bram0/1_raddr_dbg = ci2eep_dout)
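4. Step through WAIT_0→1→2→3 (4 cycles) to cover the BRAM read pipeline
5. Write response to eep2ci FIFO (address + data); in the excerpt below, eep2ci_wren is asserted in the WAIT states while eep2ci_full is low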
6. Pop command from ci2eep (ci2eep_rden = 1)
7. Return to IDLE

Debug Interface Behavior:

always @(*) begin
    ci2eep_rden = 1'b0;
    eep2ci_wren = 1'b0;
    dbg_next_state = dbg_curr_state;

    case (dbg_curr_state)
        DBG_STATE_IDLE: begin
            if (~ci2eep_empty)
                dbg_next_state = DBG_STATE_WAIT_0;
        end
        DBG_STATE_WAIT_0: begin
            if (~eep2ci_full)
                eep2ci_wren = 1'b1;  // Write response
            dbg_next_state = DBG_STATE_WAIT_1;
        end
        // ... (similar for WAIT_1, WAIT_2, WAIT_3)
        DBG_STATE_DONE: begin
            ci2eep_rden = 1'b1;      // Pop command
            dbg_next_state = DBG_STATE_IDLE;
        end
    endcase
end

// Response format: {address, data}
assign eep2ci_din = bram_select ? {bram0_raddr_dbg, bram0_rdata_dbg}
                                : {bram1_raddr_dbg, bram1_rdata_dbg};

// Debug read addresses (always driven)
assign bram0_raddr_dbg = ci2eep_dout;
assign bram1_raddr_dbg = ci2eep_dout;
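
On the host side, a 22-bit response word has to be split back into its address and data fields. The sketch below assumes the layout implied by the `{address, data}` concatenation above: the 14-bit address in the upper bits and the 8-bit row data in the lower bits. The function names are hypothetical and not part of hs_bridge.

```python
ADDR_BITS = 14
DATA_BITS = 8


def pack_response(addr, data):
    """Build the 22-bit word the hardware forms as {addr[13:0], data[7:0]}."""
    addr &= (1 << ADDR_BITS) - 1
    data &= (1 << DATA_BITS) - 1
    return (addr << DATA_BITS) | data


def unpack_response(word):
    """Split a 22-bit debug response into (row address, spike mask)."""
    addr = (word >> DATA_BITS) & ((1 << ADDR_BITS) - 1)
    data = word & ((1 << DATA_BITS) - 1)
    return addr, data


word = pack_response(0x1234, 0xAB)
print(hex(word), unpack_response(word))
```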

Memory Map#

Base and V2 Versions (8 axons/row)#

BRAM Organization:

  • Depth: 16,384 rows (14-bit address)

  • Width: 8 bits (1 bit per axon)

  • Total Capacity: 131,072 axons

  • Dual BRAMs: BRAM0 and BRAM1 (double buffering)

Address Mapping:

Axon ID Range     | BRAM Address | Bit Position
------------------|--------------|-------------
0 - 7             | 0x0000       | [0] to [7]
8 - 15            | 0x0001       | [0] to [7]
16 - 23           | 0x0002       | [0] to [7]
...               | ...          | ...
131064 - 131071   | 0x3FFF       | [0] to [7]

Bit Encoding:

  • Bit = 1: Axon spiked

  • Bit = 0: No spike

Address Calculation:

bram_addr = axon_id[16:3];      // Upper 14 bits
bit_pos   = axon_id[2:0];       // Lower 3 bits

Simple Version (16 axons/row)#

BRAM Organization:

  • Depth: 8,192 rows (13-bit address)

  • Width: 16 bits (1 bit per axon)

  • Total Capacity: 131,072 axons

  • Dual BRAMs: BRAM0 and BRAM1

Address Mapping:

Axon ID Range     | BRAM Address | Bit Position
------------------|--------------|-------------
0 - 15            | 0x0000       | [0] to [15]
16 - 31           | 0x0001       | [0] to [15]
32 - 47           | 0x0002       | [0] to [15]
...               | ...          | ...
131056 - 131071   | 0x1FFF       | [0] to [15]

Address Calculation:

bram_addr = axon_id[16:4];      // Upper 13 bits
bit_pos   = axon_id[3:0];       // Lower 4 bits

Memory Utilization:

  • Base/V2: 16,384 × 8 = 128 Kb per BRAM → 256 Kb total

  • Simple: 8,192 × 16 = 128 Kb per BRAM → 256 Kb total

  • Same total capacity, different organization
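
Both organizations use the same mapping with a different row width, so one helper covers them. This is a minimal host-side sketch with hypothetical function names. It assumes axon IDs in the range 0 to 131,071 and the bit ordering shown in the address mapping tables above.

```python
MAX_AXONS = 131072


def axon_to_row_bit(axon_id, axons_per_row=8):
    """Map an axon ID to (BRAM row address, bit position within the row)."""
    if not 0 <= axon_id < MAX_AXONS:
        raise ValueError("axon_id out of range")
    return divmod(axon_id, axons_per_row)


def spike_writes(axon_ids, axons_per_row=8):
    """Group spiking axons into {row address: bit mask}, one entry per row."""
    rows = {}
    for axon_id in axon_ids:
        row, bit = axon_to_row_bit(axon_id, axons_per_row)
        rows[row] = rows.get(row, 0) | (1 << bit)
    return rows


print(axon_to_row_bit(131071, 8), axon_to_row_bit(131071, 16))
print(spike_writes([0, 7, 8]))
```

Grouping events by row on the host before sending them reduces the number of writes that target the same row, which matters for variants without hazard handling.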


Timing Diagrams#

Time Step Transition (Double Buffer Swap)#

         ┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────
clk      ┘     └─────┘     └─────┘     └─────┘     └─────┘

exec_run ──────┐     ┌─────────────────────────────────────────
               └─────┘

         Time Step N-1    │ Time Step N
         ─────────────────┼─────────────────────────────────────
                          │
bram_select = 0           │ Toggle → 1
BRAM0 = Present           │ BRAM0 → Future
BRAM1 = Future            │ BRAM1 → Present
                          │
State:   READ_INPUTS      │ IDLE → FILL_PIPE → READ_INPUTS

bramPresent = BRAM0       │ bramPresent = BRAM1
bramFuture  = BRAM1       │ bramFuture  = BRAM0

Present BRAM Read Sequence (Base & V2)#

Cycle    0    1    2    3    4    5    6    7    8    9
         ─────┬────┬────┬────┬────┬────┬────┬────┬────┬────
State    FILL │FILL│FILL│READ│READ│READ│READ│READ│READ│...
         ────┬┴────┴────┴────┴────┴────┴────┴────┴────┴────

rden     ────┐    ┌───┐    ┌───┐    ┌───┐    ┌───┐    ┌───
         ────└────┘   └────┘   └────┘   └────┘   └────┘

raddr        0    1    2    3    4    5    6    7    8    9

wren     ───────────────────┐    ┌───┐    ┌───┐    ┌───┐
         ───────────────────└────┘   └────┘   └────┘   └────

waddr        0    0    0    0    1    2    3    4    5    6

                     ┌─────────┐
phase1_ready ────────┘         └──────────────────────────────
                          (asserted in READ_INPUTS state)

hbm_rvalidready ────────────┐    ┌───┐    ┌───┐    ┌───┐
                ────────────└────┘   └────┘   └────┘   └────

exec_bram_spiked    X    X    X    D0   D1   D2   D3   D4   D5
                    (pipeline latency = 3 cycles)

BRAM Data Flow:
  Cycle 0: Issue read addr 0  →  (3-cycle latency)  →  Cycle 3: Data 0 emerges
  Cycle 1: Issue read addr 1  →  (3-cycle latency)  →  Cycle 4: Data 1 emerges
  Cycle 2: Issue read addr 2  →  (3-cycle latency)  →  Cycle 5: Data 2 emerges
  Cycle 3: Issue read addr 3, Write 0 to addr 0
  Cycle 4: Issue read addr 4, Write 0 to addr 1
  ...

Future BRAM Write with Hazard (Base Version)#

Cycle        0    1    2    3    4    5    6    7
             ────┬────┬────┬────┬────┬────┬────┬────
setArray_go  ───┐    ┌───┐    ┌───────────────────
             ───└────┘   └────┘

setArray_addr    100  200  100  -    -    -    -    -
setArray_data    0x01 0x04 0x02 -    -    -    -    -

Pipeline Stage 2:
  waddr          100  200  100  -    -    -    -    -
  wdata          0x01 0x04 0x00 -    -    -    -    -
  wren           1    1    0    -    -    -    -    -

Pipeline Stage 1:
  waddr          -    100  200  100  -    -    -    -
  wdata          -    0x01 0x04 0x00 -    -    -    -
  wren           -    1    1    0    -    -    -    -

Pipeline Stage 0:
  waddr          -    -    100  200  100  -    -    -
  wdata          -    -    0x03 0x04 0x00 -    -    - ← Merged! (0x01|0x02)
  wren           -    -    1    1    0    -    -    -

BRAM Write (stage 0 with wren = 1):
  Cycle 2: Addr 100, Data = 0x03 (0x01 from cycle 0 merged with 0x02 from cycle 2)
  Cycle 3: Addr 200, Data = 0x04
  Cycle 4: No write (stage 0 holds addr 100 with wren = 0)

Note: The cycle 2 write to addr 100 hit the cycle 0 write while it sat in stage 1. The two masks were OR-ed as that entry moved into stage 0, so a single BRAM write carries both events.

Simple Version: No Pipeline (Direct Write)#

Cycle        0    1    2    3    4    5
             ────┬────┬────┬────┬────┬────
axonEvent_set ──┐    ┌───┐    ┌──────────
             ───└────┘   └────┘

axonEvent_addr   100  200  300  -    -    -
axonEvent_data   0x0001 0x0004 0x0008 -    -    -

(Register inputs for better timing)
             ────┬────┬────┬────┬────┬────
*_set_reg    ───────┐    ┌───┐    ┌──────
                ───└────┘   └────┘

*_addr_reg       -    100  200  300  -    -
*_data_reg       -    0x0001 0x0004 0x0008 -    -

bramFuture_waddr -    100  200  300  -    -
bramFuture_wdata -    0x0001 0x0004 0x0008 -    -
bramFuture_wren  -    1    1    1    -    -

BRAM Write:      -    100  200  300  -    -
                      0x0001 0x0004 0x0008

No hazard handling! Single-core only.
If same address written twice within pipeline depth, later write overwrites earlier.

V2 Debug Read Sequence#

Cycle        0    1    2    3    4    5    6    7
             ────┬────┬────┬────┬────┬────┬────┬────
DBG_State    IDLE│WAIT│WAIT│WAIT│WAIT│DONE│IDLE│...
                 │  0 │  1 │  2 │  3 │    │    │

ci2eep_empty ────┐                                  ┌───
             ────└──────────────────────────────────┘

ci2eep_dout      0x1234 (stays valid until rden)

ci2eep_rden  ────────────────────────────────────┐  ┌───
             ────────────────────────────────────└──┘

bram*_raddr_dbg  0x1234 (always driven)

eep2ci_wren  ───────┐    ┌───┐    ┌───┐    ┌───┐  ┌───
             ───────└────┘   └────┘   └────┘   └──┘
                 (writes in WAIT_0 through WAIT_3 if not full)

eep2ci_din       X    {0x1234,D}  (data D emerges after latency)

Debug transaction:
  Cycle 0: Detect command available
  Cycle 1-4: Issue reads, wait for pipeline, write response
  Cycle 5: Pop command FIFO
  Cycle 6: Return to IDLE

Cross-References#

Upstream Modules#

  • command_interpreter.v (command_interpreter.md):

    • Generates setArray_go, setArray_addr, setArray_data signals

    • V2: Provides debug FIFO interfaces (ci2eep_*, eep2ci_*)

    • Controls when external events are injected

  • pcie2fifos.v (pcie2fifos.md):

    • Ultimate source of external events from host

    • Events flow: PCIe → Command Interpreter → External Events Processor

Downstream Modules#

  • hbm_processor.v (hbm_processor.md):

    • Receives exec_bram_spiked (spike mask)

    • Provides exec_hbm_rvalidready (synchronization signal)

    • Uses spike masks to fetch pointer chains from HBM

  • internal_events_processor.v (internal_events_processor.md):

    • Receives exec_bram_phase1_done (completion signal)

    • Coordinates two-phase execution (external then internal events)

Peer Modules#

  • pointer_fifo_controller.v (pointer_fifo_controller.md):

    • Works with spike masks from this module

    • Controls flow of pointer data to HBM processor


Module Comparison: When to Use Each Variant#

Use Base Version When:#

  • Multi-core architecture with multiple cores writing to same future BRAM

  • Concurrent writes to the same BRAM address are expected

  • Data integrity is critical and no events can be lost

  • Pipeline hazards need explicit detection and merging

  • 8 axons per row organization preferred

Trade-offs:

  • ✅ Full hazard handling

  • ✅ No data loss in multi-core scenarios

  • ❌ More complex logic

  • ❌ Higher resource usage (pipeline registers)

  • ⚠️ Debugging modifications present (lines 95-96, 103-104)

Use Simple Version When:#

  • Single-core architecture with only one writer to future BRAM

  • Lower resource usage is priority

  • Wider data paths (16-bit) preferred for bandwidth

  • No concurrent writes to same address expected

  • Simpler logic easier to verify and debug

Trade-offs:

  • ✅ Minimal resource usage

  • ✅ 2× data width (16 vs 8 bits)

  • ✅ Simpler state machine (4 vs 5 states)

  • ✅ Better timing due to registered inputs

  • ❌ No hazard protection

  • ❌ Data loss if concurrent writes occur

  • ❌ Single-core only

Use V2 Version When:#

  • Debug and verification required

  • BRAM inspection needed during runtime

  • Command interpreter interface for test patterns

  • Read-modify-write approach acceptable

  • Production debugging of neuromorphic algorithms

Trade-offs:

  • ✅ Debug capabilities (FIFO interface)

  • ✅ Simplified future write logic (RMW vs explicit hazards)

  • ✅ Direct BRAM inspection via debug ports

  • ❌ Additional debug FSM (more resources)

  • ❌ Extra FIFO interfaces

  • ❌ Not optimized for performance


Performance Characteristics#

Base and V2 Versions#

Throughput:

  • Read Rate: 1 BRAM address per exec_hbm_rvalidready cycle

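  • Effective Rate: Paced by how often the HBM side asserts exec_hbm_rvalidready (~450 MHz possible, typically 225 MHz)

  • Pipeline Fill: 3 cycles (one-time cost per time step)

  • Total Time: At least 3 + num_inputs[16:3] cycles per time step (more when exec_hbm_rvalidready stalls)

Example (131,072 input axons, exec_hbm_rvalidready asserted every cycle):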

BRAM addresses = 131072 / 8 = 16384
Pipeline fill  = 3 cycles
Total cycles   = 3 + 16384 = 16387 cycles
At 225 MHz     = 16387 / 225e6 = 72.8 µs
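
The same arithmetic applies to any row count and clock, so it is convenient as a small helper. This is a sketch with a hypothetical function name. It gives the best case only, assuming `exec_hbm_rvalidready` is asserted on every cycle.

```python
def phase1_time_us(num_rows, clk_hz=225e6, pipe_fill=3):
    """Best-case phase 1 duration in microseconds for num_rows BRAM rows."""
    return (pipe_fill + num_rows) / clk_hz * 1e6


print(round(phase1_time_us(131072 // 8), 1))    # 8 axons per row
print(round(phase1_time_us(131072 // 16), 1))   # 16 axons per row
```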

Future Write Latency:

  • Base: 3 cycles (pipeline depth) from setArray_go to BRAM write

  • V2: 3 cycles (pipeline depth) from setArray_go to BRAM write

  • Hazard Penalty: 0 cycles (merged in pipeline)

Simple Version#

Throughput:

  • Read Rate: 1 BRAM address per exec_hbm_rvalidready cycle

  • Effective Rate: 225 MHz typical

  • Pipeline Fill: 3 cycles

  • Total Time: At least 3 + num_inputs[16:4] cycles per time step (more when exec_hbm_rvalidready stalls)

Example (131,072 input axons, exec_hbm_rvalidready asserted every cycle):

BRAM addresses = 131072 / 16 = 8192
Pipeline fill  = 3 cycles
Total cycles   = 3 + 8192 = 8195 cycles
At 225 MHz     = 8195 / 225e6 = 36.4 µs  (2× faster than base!)

Future Write Latency:

  • Direct: 1 cycle from axonEvent_set to registered write

  • Total: 2 cycles (register + BRAM write)

Resource Usage Comparison:

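Resource       | Base       | Simple | V2
---------------|------------|--------|---------
LUTs (approx.) | 500        | 250    | 600
Flip-Flops     | 200        | 120    | 280
BRAM18K        | 2          | 2      | 2
Pipeline Regs  | 3×(14+8+1) | 0      | 3×(14+1)

Note: The BRAM18K row appears to count the two logical buffers. At 128 Kb each (see Memory Utilization), a buffer cannot fit in a single 18 Kb primitive, so the physical primitive count is higher.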


Common Issues and Debugging#

Issue 1: Events Lost During Time Step Transition#

Symptoms:

  • External events written near exec_run pulse disappear

  • Inconsistent spike counts between time steps

Root Cause:

  • Writing to future BRAM while bram_select is toggling

  • Race condition between write and buffer swap

Debug:

// Check timing of setArray_go relative to exec_run
// Add ILA probe:
ila_0 your_ila (
    .clk(clk),
    .probe0(exec_run),
    .probe1(setArray_go),
    .probe2(setArray_addr),
    .probe3(bram_select)
);

Solution:

  • Ensure setArray_go never occurs within 3 cycles of exec_run

  • Add FIFO between command interpreter and external events processor

  • Stall writes during buffer swap
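
One way to implement the "stall writes during buffer swap" option is a small guard that deasserts a write-enable qualifier around exec_run. This is a minimal sketch, not code from the design: the module name swap_write_guard, the write_allowed output, and the GUARD_CYCLES parameter are all assumptions, and it only covers the cycles at and after exec_run (writes issued shortly before the pulse still need upstream scheduling or buffering).

```verilog
// Hypothetical guard: blocks new future-BRAM writes during and just after a swap.
module swap_write_guard #(
    parameter GUARD_CYCLES = 3  // cycles to hold off after exec_run (assumed = pipeline depth)
)(
    input  wire clk,
    input  wire resetn,        // active-low synchronous reset, as in the snippets above
    input  wire exec_run,      // one-cycle pulse that toggles bram_select
    output wire write_allowed  // gate setArray_go with this signal
);
    localparam CNT_W = $clog2(GUARD_CYCLES + 1);

    reg [CNT_W-1:0] guard_cnt;

    always @(posedge clk) begin
        if (~resetn)
            guard_cnt <= 0;
        else if (exec_run)
            guard_cnt <= GUARD_CYCLES;       // restart the hold-off window
        else if (guard_cnt != 0)
            guard_cnt <= guard_cnt - 1'b1;   // count down to zero
    end

    // Blocked in the exec_run cycle itself and for GUARD_CYCLES cycles after it.
    assign write_allowed = ~exec_run & (guard_cnt == 0);
endmodule
```

The source of setArray_go would then hold its event until write_allowed is high, which is where the FIFO option in the list above becomes useful.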

Issue 2: Pipeline Hazards Not Detected (Base Version)#

Symptoms:

  • Expected spike data doesn’t appear

  • OR of multiple writes shows only one bit set

Root Cause:

  • Hazard detection logic not functioning

  • Debugging modifications (lines 95-96, 103-104) bypass merging

Debug:

// Monitor pipeline stages
(* mark_debug = "true" *) reg [13:0] bramFuture_waddr_dbg [2:0];
(* mark_debug = "true" *) reg  [7:0] bramFuture_wdata_dbg [2:0];
(* mark_debug = "true" *) reg        bramFuture_wren_dbg  [2:0];

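// Copy stage by stage: whole-array assignment is not legal in Verilog-2001
integer k;
always @(posedge clk) begin
    for (k = 0; k < 3; k = k + 1) begin
        bramFuture_waddr_dbg[k] <= bramFuture_waddr[k];
        bramFuture_wdata_dbg[k] <= bramFuture_wdata[k];
        bramFuture_wren_dbg[k]  <= bramFuture_wren[k];
    end
end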

Solution:

  • Restore original BRAM write logic:

// Change:
assign bram0_wdata = ~bram_select ? bramPresent_wdata : bramFuture_wdata[0];

// To:
assign bram0_wdata = ~bram_select ? bramPresent_wdata :
                     bramFuture_wdata[0] | bramFuture_rdata | setArray_data_pipe;
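
The idea behind the merge can be shown in isolation. The sketch below is an illustration of OR-merging with hazard forwarding in general, not the base version's actual logic: the module name, port names, and the packed-vector layout are assumptions. A new write is ORed with the value read back from the BRAM and with every in-flight write that targets the same address, so bits set by writes still in the pipeline are not lost.

```verilog
// Hypothetical stand-alone model of hazard detection plus OR-merge.
module future_write_merge #(
    parameter ADDR_W     = 14,
    parameter DATA_W     = 8,
    parameter PIPE_DEPTH = 3
)(
    input  wire [ADDR_W-1:0]            new_addr,
    input  wire [DATA_W-1:0]            new_data,
    input  wire [DATA_W-1:0]            bram_rdata,     // value read back for new_addr
    input  wire [PIPE_DEPTH*ADDR_W-1:0] inflight_addr,  // packed, stage 0 in the low bits
    input  wire [PIPE_DEPTH*DATA_W-1:0] inflight_data,  // packed, same ordering
    input  wire [PIPE_DEPTH-1:0]        inflight_wren,  // one valid bit per stage
    output reg  [DATA_W-1:0]            merged_data
);
    integer s;

    always @* begin
        // Start from the new event bits plus whatever is already stored.
        merged_data = new_data | bram_rdata;
        // Hazard detection: a valid in-flight write to the same address
        // has not reached the BRAM yet, so forward its bits as well.
        for (s = 0; s < PIPE_DEPTH; s = s + 1) begin
            if (inflight_wren[s] &&
                (inflight_addr[s*ADDR_W +: ADDR_W] == new_addr))
                merged_data = merged_data | inflight_data[s*DATA_W +: DATA_W];
        end
    end
endmodule
```

If a simulation shows only one bit set after two back-to-back writes to the same row, comparing the design's write data against a model like this one can help locate the stage where the merge is being bypassed.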

Issue 3: Address Limit Calculation Wrong#

Symptoms:

  • Phase 1 completes too early or too late

  • Not all neurons receive input events

Root Cause:

  • Incorrect calculation of BRAM address limit

  • Mismatch between num_inputs and actual neuron count

Debug:

// Check address limit
// Base:   BRAM_ADDR_LIMIT = num_inputs[16:3]  (divide by 8)
// Simple: axon_addr_limit = num_inputs[16:4]  (divide by 16)

// Add assertion:
assert property (@(posedge clk) disable iff (~resetn)
    (curr_state == STATE_READ_INPUTS && bramPresent_waddr == BRAM_ADDR_LIMIT)
    |=> (curr_state == STATE_PHASE1_DONE)
);

Solution:

  • Verify num_inputs matches neuron configuration

  • Base: Ensure multiple of 8

  • Simple: Ensure multiple of 16

  • Add +1 if rounding needed:

// If num_inputs not exact multiple
assign BRAM_ADDR_LIMIT = num_inputs[16:3] + (|num_inputs[2:0]);  // Round up when any low bit is set

Issue 4: BRAM Read Latency Mismatch#

Symptoms:

  • Data appears corrupted or delayed

  • exec_bram_spiked shows wrong values

Root Cause:

  • BRAM configured with latency ≠ PIPE_DEPTH (3)

  • Pipeline depth parameter doesn’t match actual BRAM

Debug:

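// Verify BRAM configuration in IP customization:
// - Read Latency: should be 3 (equal to PIPE_DEPTH)
// - Output registers: each enabled output register stage adds a cycle of
//   read latency, so the total must still equal PIPE_DEPTH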

// Check if raddr/waddr maintain proper offset:
assert property (@(posedge clk) disable iff (~resetn)
    (curr_state == STATE_READ_INPUTS && exec_bram_phase1_ready)
    |-> (bramPresent_raddr == bramPresent_waddr + PIPE_DEPTH)
);

Solution:

  • Reconfigure BRAM IP for 3-cycle latency

  • Or update PIPE_DEPTH parameter to match BRAM:

external_events_processor #(
    .PIPE_DEPTH(2)  // If BRAM has 2-cycle latency
) eep_inst (
    // ...
);

Issue 5: V2 Debug Reads Return Stale Data#

Symptoms:

  • Debug responses show old/incorrect BRAM data

  • Debug state machine stuck in WAIT states

Root Cause:

  • Insufficient wait cycles for BRAM read latency

  • Debug FSM transitions too quickly

Debug:

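// Monitor debug state progression (index 0 holds the most recent state)
(* mark_debug = "true" *) reg [2:0] dbg_state_history [7:0];

// Shift element by element: slicing an unpacked array is not legal in Verilog-2001
integer h;
always @(posedge clk) begin
    for (h = 7; h > 0; h = h - 1)
        dbg_state_history[h] <= dbg_state_history[h-1];
    dbg_state_history[0] <= dbg_curr_state;
end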

Solution:

  • Ensure 4 WAIT states (WAIT_0→WAIT_1→WAIT_2→WAIT_3)

  • If reads are still stale, add an extra wait state before the response is written:

localparam [2:0] DBG_STATE_WAIT_4 = 3'd7;

// In state machine: WAIT_3 only waits now; the response write moves to WAIT_4
DBG_STATE_WAIT_3: begin
    dbg_next_state = DBG_STATE_WAIT_4;  // Extra cycle for BRAM read data
end
DBG_STATE_WAIT_4: begin
    if (~eep2ci_full)
        eep2ci_wren = 1'b1;
    dbg_next_state = DBG_STATE_DONE;
end

Safety and Edge Cases#

Edge Case 1: num_inputs = 0#

Behavior:

  • BRAM_ADDR_LIMIT = 0

  • State machine immediately transitions FILL_PIPE → READ_INPUTS → PHASE1_DONE

  • No BRAM accesses occur

Safety:

  • ✅ No undefined behavior

  • ✅ Module functions correctly (zero inputs processed)

  • ⚠️ Wastes cycles (should be caught at system level)

Edge Case 2: num_inputs Not Multiple of 8 (or 16)#

Example: num_inputs = 17'd100

Base Version:

BRAM_ADDR_LIMIT = 100 >> 3 = 12
Actual coverage   = 12 * 8 = 96 axons
Missing           = 4 axons (96-99 not processed)

Fix:

// Round up to nearest multiple
assign BRAM_ADDR_LIMIT = (num_inputs + 7) >> 3;  // Ceiling division

Edge Case 3: Concurrent setArray_go and exec_run#

Scenario:

Cycle N:   exec_run = 1 (toggle bram_select)
Cycle N:   setArray_go = 1 (write to future BRAM)

Problem:

  • bram_select changes, may write to wrong BRAM

Current Design:

  • bram_select registered on exec_run edge

  • setArray_go writes on same edge

  • Race condition! Indeterminate which BRAM receives write

Solution:

  • Pipeline exec_run by 1 cycle:

reg exec_run_pipe;

always @(posedge clk) begin
    if (~resetn)
        exec_run_pipe <= 1'b0;
    else
        exec_run_pipe <= exec_run;
end

// Use exec_run_pipe for bram_select toggle
always @(posedge clk) begin
    if (~resetn)
        bram_select <= 1'b0;
    else if (exec_run_pipe)  // Changed from exec_run
        bram_select <= ~bram_select;
end

Edge Case 4: BRAM Write During Pipeline Fill#

Scenario:

STATE_FILL_PIPE: bramPresent_wren = 0 (not asserted yet)
Future writes:    bramFuture_wren[0] = 1 (trying to write)

Problem (Multi-core):

  • If multiple cores write to future BRAM during present BRAM pipeline fill

  • Potential for lost writes if exceeding BRAM write bandwidth

Current Design:

  • Single write port per BRAM

  • Future writes serialized through pipeline

  • Safe as long as write rate ≤ 1 per 3 cycles

Solution (if needed):

  • Use dual-port BRAM (separate read/write ports)

  • Or implement write FIFO to buffer concurrent writes
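
The write-FIFO option could look like the following synchronous FIFO. This is a generic sketch rather than part of the existing design: the module name event_write_fifo, its port names, and the default width (14-bit address plus 8-bit data) are assumptions.

```verilog
// Hypothetical buffer for {addr, data} write requests ahead of the future BRAM.
module event_write_fifo #(
    parameter WIDTH      = 22,  // assumed: 14-bit row address + 8-bit spike mask
    parameter DEPTH_LOG2 = 4    // 2**4 = 16 entries
)(
    input  wire             clk,
    input  wire             resetn,  // active-low synchronous reset
    input  wire             push,    // request to enqueue din
    input  wire [WIDTH-1:0] din,
    input  wire             pop,     // request to dequeue the head entry
    output wire [WIDTH-1:0] dout,    // head entry, valid while !empty
    output wire             full,
    output wire             empty
);
    reg [WIDTH-1:0]    mem [0:(1<<DEPTH_LOG2)-1];
    reg [DEPTH_LOG2:0] wptr;  // extra MSB distinguishes full from empty
    reg [DEPTH_LOG2:0] rptr;

    assign empty = (wptr == rptr);
    assign full  = (wptr[DEPTH_LOG2] != rptr[DEPTH_LOG2]) &&
                   (wptr[DEPTH_LOG2-1:0] == rptr[DEPTH_LOG2-1:0]);
    assign dout  = mem[rptr[DEPTH_LOG2-1:0]];

    always @(posedge clk) begin
        if (~resetn) begin
            wptr <= 0;
            rptr <= 0;
        end else begin
            if (push && !full) begin
                mem[wptr[DEPTH_LOG2-1:0]] <= din;
                wptr <= wptr + 1'b1;
            end
            if (pop && !empty)
                rptr <= rptr + 1'b1;
        end
    end
endmodule
```

Each core would push its write request, and the processor would pop one entry whenever the future-write pipeline can accept it. A push while full is silently ignored in this sketch, so the producer has to observe full.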

Safety Check: Phase 1 Completion Detection#

Assertion:

// Ensure phase1_done only asserted when all addresses processed
property phase1_done_check;
    @(posedge clk) disable iff (~resetn)
    (exec_bram_phase1_done) |-> (bramPresent_waddr == BRAM_ADDR_LIMIT);
endproperty
assert_phase1: assert property (phase1_done_check);

Safety Check: No Writes During Buffer Swap#

Assertion:

// Ensure no future writes during exec_run
property no_write_during_swap;
    @(posedge clk) disable iff (~resetn)
    (exec_run) |-> (bramFuture_wren[0] == 1'b0);
endproperty
assert_no_write: assert property (no_write_during_swap);

Future Enhancement Opportunities#

1. Configurable Data Width#

Allow parameterization of axons per row:

module external_events_processor #(
    parameter PIPE_DEPTH    = 3,
    parameter AXONS_PER_ROW = 8,  // 8, 16, 32, etc. (power of two)
    // Derived address and data widths (do not override)
    parameter ADDR_BITS = 17 - $clog2(AXONS_PER_ROW),
    parameter DATA_BITS = AXONS_PER_ROW
)(
    input [ADDR_BITS-1:0] setArray_addr,
    input [DATA_BITS-1:0] setArray_data,
    // ...
);

2. Burst Mode for Faster Pipeline Fill#

Current: Fill pipeline sequentially (3 cycles)

Enhancement: Issue all 3 reads in 1 cycle (if BRAM supports)

STATE_FILL_PIPE: begin
    if (bramPresent_raddr == 0) begin
        // Issue all 3 reads at once
        bram_raddr[0] = 14'd0;
        bram_raddr[1] = 14'd1;
        bram_raddr[2] = 14'd2;
        bram_rden[0]  = 1'b1;
        bram_rden[1]  = 1'b1;
        bram_rden[2]  = 1'b1;
        next_state = STATE_READ_INPUTS;
    end
end

3. Event Timestamping#

Add timestamp to each event for precise temporal resolution:

// Expand data width: [7:0] data + [15:0] timestamp
input [23:0] setArray_data,  // {timestamp, spike_mask}

// BRAM organization: 24 bits per row

4. Event Compression#

Sparse events (few spikes per row) waste bandwidth:

// Instead of full bit mask, store indices
// Example: Spikes at axons 5, 17, 42
// Compressed: {3'b011, 6'd42, 6'd17, 6'd5}  // Count + indices
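
A combinational encoder for this count-plus-indices format might look like the sketch below. It is only an illustration under stated assumptions: the module name, the 64-axon row (implied by the 6-bit indices in the example, but wider than the 8- or 16-axon rows of the current variants), the three index slots, and the overflow flag are all hypothetical.

```verilog
// Hypothetical encoder: spike mask -> {count, idx2, idx1, idx0}.
module spike_mask_compress #(
    parameter ROW_W      = 64,  // assumed row width (axons per row)
    parameter IDX_W      = 6,   // log2(ROW_W)
    parameter MAX_SPIKES = 3,   // index slots in one compressed word
    parameter CNT_W      = 3    // width of the count field
)(
    input  wire [ROW_W-1:0]                  spike_mask,
    output reg  [CNT_W+MAX_SPIKES*IDX_W-1:0] compressed,
    output reg                               overflow    // more spikes than slots
);
    integer i;
    reg [CNT_W-1:0] count;

    always @* begin
        count      = 0;
        overflow   = 1'b0;
        compressed = 0;
        // Scan from the lowest axon index upwards, filling slots in order.
        for (i = 0; i < ROW_W; i = i + 1) begin
            if (spike_mask[i]) begin
                if (count < MAX_SPIKES) begin
                    compressed[count*IDX_W +: IDX_W] = i[IDX_W-1:0];
                    count = count + 1'b1;
                end else begin
                    overflow = 1'b1;  // row is too dense for this format
                end
            end
        end
        // The count field sits above the index slots.
        compressed[MAX_SPIKES*IDX_W +: CNT_W] = count;
    end
endmodule
```

With spikes at axons 5, 17, and 42 the output word is {3'b011, 6'd42, 6'd17, 6'd5}, matching the example. Rows that raise overflow would have to fall back to the uncompressed bit mask, and for rows as narrow as 8 axons the index format is larger than the mask itself.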

5. Multi-Buffer (>2 BRAMs)#

Allow more than 2 time steps in flight:

parameter NUM_BUFFERS = 4;  // Quad buffering

reg [1:0] bram_select;      // 2-bit select (4 buffers)

always @(posedge clk) begin
    if (exec_run)
        bram_select <= (bram_select + 1) & 2'b11;  // Circular
end

6. AXI4-Stream Interface#

Replace custom interface with standard AXI4-Stream:

// Input events
input         s_axis_tvalid,
output        s_axis_tready,
input  [31:0] s_axis_tdata,  // {addr, data}
input         s_axis_tlast,

// Output spikes
output        m_axis_tvalid,
input         m_axis_tready,
output [31:0] m_axis_tdata,  // Spike mask + metadata
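
On the input side, a thin adapter could turn each accepted AXI4-Stream beat into a setArray_go pulse. The sketch below assumes a tdata layout of address in bits [21:8] and spike mask in bits [7:0]; that layout, the module name, and the can_accept input are assumptions, and tlast is ignored.

```verilog
// Hypothetical adapter: one AXI4-Stream beat becomes one setArray_go pulse.
module axis_to_setarray (
    input  wire        clk,
    input  wire        resetn,         // active-low synchronous reset
    input  wire        s_axis_tvalid,
    output wire        s_axis_tready,
    input  wire [31:0] s_axis_tdata,   // assumed layout: [21:8] addr, [7:0] data
    input  wire        can_accept,     // e.g. held low around a buffer swap
    output reg         setArray_go,
    output reg  [13:0] setArray_addr,
    output reg  [7:0]  setArray_data
);
    // A beat transfers only when both valid and ready are high.
    wire beat = s_axis_tvalid & s_axis_tready;

    // Back-pressure the stream whenever the processor cannot take a write.
    assign s_axis_tready = can_accept;

    always @(posedge clk) begin
        if (~resetn) begin
            setArray_go   <= 1'b0;
            setArray_addr <= 14'd0;
            setArray_data <= 8'd0;
        end else begin
            setArray_go <= beat;  // pulse for one cycle per accepted beat
            if (beat) begin
                setArray_addr <= s_axis_tdata[21:8];
                setArray_data <= s_axis_tdata[7:0];
            end
        end
    end
endmodule
```

Because setArray_go is registered, it appears one cycle after the beat is accepted, so whatever drives can_accept has to allow for that extra cycle.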

7. Configurable Pipeline Depth#

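Derive PIPE_DEPTH and the BRAM read latency from one shared constant so they cannot drift apart:

// A hierarchical reference such as bram0.READ_LATENCY_A is not a legal
// constant expression, so define the latency once and pass it to both.
localparam BRAM_LATENCY = 3;  // Also used when configuring the BRAM IP

external_events_processor #(
    .PIPE_DEPTH(BRAM_LATENCY)  // Matches the BRAM by construction
) eep (
    // ...
);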

Key Terms and Definitions#

| Term | Definition |
|------|------------|
| Axon | Input neuron connection; source of spike events |
| Double Buffering | Two-buffer scheme (present/future) allowing simultaneous read and write |
| Present BRAM | BRAM being read during current time step (then cleared) |
| Future BRAM | BRAM accumulating events for next time step |
| bram_select | Toggle bit selecting which physical BRAM is present vs. future |
| Pipeline Depth | Number of cycles between BRAM read request and data availability (typically 3) |
| Pipeline Fill | Initial phase where read pipeline is populated before writes begin |
| Leading Address | Read address (raddr) - advances pipeline depth ahead of write address |
| Lagging Address | Write address (waddr) - clears data after it emerges from pipeline |
| Spike Mask | Bit vector where each bit represents spike (1) or no-spike (0) for an axon |
| Phase 1 | External event processing (vs. Phase 2: internal/synaptic events) |
| exec_run | Control pulse starting new time step, toggling present ←→ future BRAMs |
| exec_hbm_rvalidready | Synchronization signal from HBM indicating data consumed, advance BRAM |
| setArray_go | Write pulse for external event (from command interpreter or other source) |
| Pipeline Hazard | Conflict when concurrent writes target same BRAM address within pipeline depth |
| RMW (Read-Modify-Write) | Pattern of reading current value, modifying, then writing back |
| Hazard Detection | Logic identifying when new write conflicts with in-flight writes |
| Data Merging | Combining multiple writes to same address via OR operation |
| Time Step | Discrete computation cycle in neuromorphic algorithm (milliseconds typically) |
| Axon Event | External spike arriving at input neuron |
| Axons Per Row | Number of axons packed into single BRAM address (8 or 16 bits) |
| Address Limit | Maximum BRAM address to read/write (depends on num_inputs) |


Conclusion#

The External Events Processor family provides flexible solutions for managing input spike events in neuromorphic systems:

  • Base version: Full-featured with pipeline hazard handling for multi-core

  • Simple version: Streamlined single-core variant with lower resource usage

  • V2 version: Debug-enhanced variant for verification and development

Key Design Principles:

  1. Double buffering prevents event loss during time step transitions

  2. Pipeline management ensures correct synchronization with BRAM latency

  3. Hazard detection/merging (base) or simplified RMW (V2) prevents data corruption

  4. State machine coordinates read-clear cycles with downstream modules

Selection Guide:

  • Multi-core system with concurrent writes → Base version

  • Single-core system, resource-constrained → Simple version

  • Debug/verification needed → V2 version

For questions or issues, cross-reference with command_interpreter.md (upstream) and hbm_processor.md (downstream) for complete system understanding.