External Events Processor Module Family#
Overview#
The External Events Processor family manages input spike events (axons) in the neuromorphic FPGA system. These modules maintain two Block RAMs in a double-buffering scheme: one for the “present” time step (currently being processed) and one for the “future” time step (accumulating new events). This architecture allows continuous operation without dropping input events.
Three variants exist:
external_events_processor.v - Base version with full pipeline hazard handling
external_events_processor_simple.v - Simplified single-core version with wider data paths
external_events_processor_v2.v - Enhanced version with debugging capabilities
Role in the Software/Hardware Stack#
Host Application (Python/C++)
        |
   [hs_bridge]
        |
 [PCIe Interface]
        |
[Command Interpreter] -----> [External Events Processor] <---- External spike events
        |                              |
        |                    Present BRAM (8 or 16 axons/row)
        |                    Future BRAM (8 or 16 axons/row)
        |                              |
 [HBM Processor] <------ exec_bram_spiked (spike mask)
        |                              |
[Internal Events] <-------- exec_bram_phase1_done
    Processor
Function:
Receive input spike events from external sources or command interpreter
Store events in double-buffered BRAMs (present/future)
Synchronize event delivery with HBM read operations
Clear processed events after reading
Handle pipeline hazards for concurrent writes during multi-core operation
Key Innovation: Double-buffering allows new events to accumulate in the “future” BRAM while the “present” BRAM is being read and cleared, ensuring no event loss during processing.
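The double-buffering idea can be illustrated with a small behavioral model. This is a sketch in Python, not the RTL: the class name DoubleBufferedEvents and its methods are hypothetical, and OR-merging on write is an assumption that stands in for the hazard handling described later.

```python
class DoubleBufferedEvents:
    """Behavioral model of the present/future buffer pair (not the RTL)."""

    def __init__(self, rows=16384):
        self.brams = [[0] * rows, [0] * rows]
        self.select = 0  # index of the buffer currently acting as "present"

    def write_future(self, addr, mask):
        # New events are OR-ed into the buffer that is NOT being read.
        self.brams[1 - self.select][addr] |= mask

    def swap(self):
        # Models the exec_run pulse: the two buffers exchange roles.
        self.select = 1 - self.select

    def read_and_clear_present(self):
        # Drain the present buffer, clearing every row that held spikes.
        present = self.brams[self.select]
        spikes = {a: m for a, m in enumerate(present) if m}
        for a in spikes:
            present[a] = 0
        return spikes


buf = DoubleBufferedEvents(rows=16)
buf.write_future(3, 0b00000101)   # arrives during time step N-1
buf.swap()                        # exec_run: time step N begins
buf.write_future(3, 0b00010000)   # arrives while step N is being processed
step_n = buf.read_and_clear_present()
buf.swap()                        # exec_run: time step N+1 begins
step_n1 = buf.read_and_clear_present()
```

In this model the event written during step N is not visible until step N+1, which is the property the two-BRAM scheme is meant to provide.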
Variant Comparison#
| Feature | Base Version | Simple Version | V2 Version |
|---|---|---|---|
| File | external_events_processor.v | external_events_processor_simple.v | external_events_processor_v2.v |
| Axons per row | 8 | 16 | 8 |
| Address width | 14 bits | 13 bits | 14 bits |
| Data width | 8 bits | 16 bits | 8 bits |
| Target | Multi-core | Single-core | Debug/verification |
| Future pipeline | 3-stage hazard handling | Direct write (no hazards) | Simplified RMW |
| State machine | 5 states | 4 states | 5 states + debug FSM |
| Debug features | None | None | CI interface, debug ports |
| Complexity | High | Low | High |
Module Architecture (Base Version)#
┌─────────────────────────────────┐
│ External Events Processor │
│ │
setArray_go ──────────┐ │ ┌──────────────────────────┐ │
setArray_addr[13:0] ──┼──────────┼─>│ Future BRAM Control │ │
setArray_data[7:0] ───┘ │ │ - 3-stage pipeline │ │
│ │ - Hazard detection │ │
exec_run ────────────────────────┼─>│ - waddr/wdata/wren[2:0] │ │
│ └────────┬─────────────────┘ │
│ │ │
│ v │
│ ┌──────────────────────────┐ │
│ │ BRAM Multiplexer │ │
│ │ bram_select toggle │ │
┌───────────────────────────────┼─>│ - BRAM0 ←→ Present │ │
│ │ │ - BRAM1 ←→ Future │ │
│ ┌────────────────────────────┼─<│ │ │
│ │ │ └───────┬──────────────────┘ │
│ │ │ │ │
│ │ │ v │
│ │ │ ┌──────────────────────────┐ │
│ │ exec_hbm_rvalidready ────┼─>│ Present BRAM Control │ │
│ │ │ │ - State machine (5) │ │
│ │ │ │ - Pipeline fill │ │
│ │ │ │ - Read & clear │ │
│ │ │ │ - raddr/waddr tracking │ │
│ │ │ └────────┬─────────────────┘ │
│ │ │ │ │
│ └───────── exec_bram_spiked[7:0] <───────┘ │
│ │ │
└───────── exec_bram_phase1_done ────────────────────────────────┤
│ │
└─────────────────────────────────┘
BRAM0 (128 Kb)   BRAM1 (128 Kb)
┌────────────┐   ┌────────────┐
│ 16384 × 8b │   │ 16384 × 8b │
│            │   │            │
│ Toggles:   │   │ Toggles:   │
│ Present ←→ │   │ Future ←→  │
│ Future     │   │ Present    │
└────────────┘   └────────────┘
Data Flow (Two-Phase Operation)#
Phase 0: Setup (between time steps)
1. exec_run pulse triggers:
- bram_select toggles (swaps present ←→ future)
- State machine resets to IDLE
2. Present BRAM now contains accumulated events from previous "future"
3. Future BRAM ready to accumulate new events
Phase 1: Event Processing (during time step)
STATE_FILL_PIPE (cycles 0-2):
├─> Read BRAM addresses 0, 1, 2
└─> Fill 3-stage pipeline (no writes yet)
STATE_READ_INPUTS (cycles 3 to completion):
├─> Wait for exec_hbm_rvalidready
├─> Read next BRAM address (bramPresent_raddr++)
├─> Write 0 to lagging address (bramPresent_waddr++)
├─> Output exec_bram_spiked[7:0] to downstream
└─> Loop until bramPresent_waddr == BRAM_ADDR_LIMIT
STATE_PHASE1_DONE:
└─> Assert exec_bram_phase1_done
Concurrent Future Writes (throughout processing):
setArray_go pulse:
├─> Check for pipeline hazards (same address in stages 0, 1, 2)
├─> Merge with in-flight data if hazard detected
├─> Propagate through 3-stage pipeline
└─> Write to Future BRAM after 3 cycles
Interface Specification#
Base Version (external_events_processor.v)#
Parameters#
| Parameter | Default | Description |
|---|---|---|
| PIPE_DEPTH | 3 | BRAM read pipeline depth (matches BRAM latency) |
Clock and Reset#
| Port | Direction | Width | Description |
|---|---|---|---|
| clk | Input | 1 | System clock (225 MHz) |
| resetn | Input | 1 | Active-low reset (applied synchronously on posedge clk in the excerpts below) |
Configuration#
| Port | Direction | Width | Description |
|---|---|---|---|
| num_inputs | Input | 17 | Total number of input axons (a 17-bit value, so at most 131,071) |
External Event Input Interface#
| Port | Direction | Width | Description |
|---|---|---|---|
| setArray_go | Input | 1 | Write pulse for new axon event |
| setArray_addr | Input | 14 | BRAM row address (8 axons per row) |
| setArray_data | Input | 8 | Bit mask (1=spike, 0=no spike) |
Execution Control Interface#
| Port | Direction | Width | Description |
|---|---|---|---|
| exec_run | Input | 1 | Start new time step (toggles BRAMs) |
| exec_bram_phase1_ready | Output | 1 | Pipeline filled, ready for reads |
| exec_hbm_rvalidready | Input | 1 | HBM data valid & ready (advance BRAM) |
| exec_bram_spiked | Output | 8 | Current spike mask (8 axons) |
| exec_bram_phase1_done | Output | 1 | All inputs read, phase 1 complete |
BRAM0 Interface#
| Port | Direction | Width | Description |
|---|---|---|---|
| bram0_waddr | Output | 14 | Write address |
| bram0_wdata | Output | 8 | Write data |
| bram0_wren | Output | 1 | Write enable |
| bram0_raddr | Output | 14 | Read address |
| bram0_rden | Output | 1 | Read enable |
| bram0_rdata | Input | 8 | Read data (3-cycle latency) |
BRAM1 Interface#
| Port | Direction | Width | Description |
|---|---|---|---|
| bram1_waddr | Output | 14 | Write address |
| bram1_wdata | Output | 8 | Write data |
| bram1_wren | Output | 1 | Write enable |
| bram1_raddr | Output | 14 | Read address |
| bram1_rden | Output | 1 | Read enable |
| bram1_rdata | Input | 8 | Read data (3-cycle latency) |
Simple Version (external_events_processor_simple.v)#
Key differences from base version:
13-bit addresses: axonEvent_addr[12:0], bram0/1_*addr[12:0]
16-bit data: axonEvent_data[15:0], bram0/1_*data[15:0], exec_eep_spiked[15:0]
16 axons per row: axon_addr_limit = num_inputs[16:4] (not [16:3])
Additional output: hbm2eep_rden (HBM FIFO read enable)
Debug outputs: eep_curr_state[1:0], curr_bram_waddr[12:0]
Renamed ports: exec_eep_* instead of exec_bram_*
V2 Version (external_events_processor_v2.v)#
Additional interfaces (beyond base version):
Command Interpreter Debug Interface#
| Port | Direction | Width | Description |
|---|---|---|---|
| ci2eep_empty | Input | 1 | Debug command FIFO empty flag |
| ci2eep_dout | Input | 14 | Debug read address from CI |
| ci2eep_rden | Output | 1 | Debug command FIFO read enable |
| eep2ci_full | Input | 1 | Debug response FIFO full flag |
| eep2ci_din | Output | 22 | Debug response data (addr + data) |
| eep2ci_wren | Output | 1 | Debug response FIFO write enable |
Debug BRAM Read Ports#
| Port | Direction | Width | Description |
|---|---|---|---|
| bram0_raddr_dbg | Output | 14 | Debug read address for BRAM0 |
| bram0_rdata_dbg | Input | 8 | Debug read data from BRAM0 |
| bram1_raddr_dbg | Output | 14 | Debug read address for BRAM1 |
| bram1_rdata_dbg | Input | 8 | Debug read data from BRAM1 |
Detailed Logic Description#
Base Version State Machine#
Present BRAM Control FSM#
States:
STATE_RESET = 3'd0 // Reset addresses and flags
STATE_IDLE = 3'd1 // Wait for exec_run
STATE_FILL_PIPE = 3'd2 // Fill 3-stage BRAM read pipeline
STATE_READ_INPUTS = 3'd3 // Read inputs, clear memory, sync with HBM
STATE_PHASE1_DONE = 3'd4 // Signal completion
State Transitions:
RESET
|
v
IDLE <────────────────┐
| │
| exec_run │
v │
FILL_PIPE │
| │
| raddr >= 3 │
v │
READ_INPUTS │
| │
| waddr == limit │
v │
PHASE1_DONE ────────────┘
State Behaviors:
STATE_RESET:
bramPresent_addr_rst = 1'b1 // Reset raddr and waddr to 0
next_state = STATE_IDLE
STATE_IDLE:
if (exec_run)
bramPresent_addr_rst = 1'b1 // Reset for new time step
next_state = STATE_FILL_PIPE
STATE_FILL_PIPE:
if (bramPresent_raddr < PIPE_DEPTH)
bramPresent_rden = 1'b1 // Issue read
bramPresent_addr_inc = 1'b1 // Increment raddr
else
next_state = STATE_READ_INPUTS // Pipeline full
STATE_READ_INPUTS:
if (exec_hbm_rvalidready) // HBM ready for next data
bramPresent_rden = 1'b1 // Read next address
bramPresent_addr_inc = 1'b1 // Increment both raddr and waddr
if (bramPresent_waddr == BRAM_ADDR_LIMIT)
next_state = STATE_PHASE1_DONE
STATE_PHASE1_DONE:
next_state = STATE_IDLE // Return to idle
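The transitions above can be written as a pure next-state function, which is convenient for checking a testbench trace on the host. This is a sketch that assumes the completion check in READ_INPUTS does not depend on exec_hbm_rvalidready; the function name next_state and the integer state encoding are illustrative only.

```python
RESET, IDLE, FILL_PIPE, READ_INPUTS, PHASE1_DONE = range(5)
PIPE_DEPTH = 3


def next_state(state, exec_run, raddr, waddr, limit):
    """Next-state function mirroring the documented Present BRAM FSM."""
    if state == RESET:
        return IDLE
    if state == IDLE:
        return FILL_PIPE if exec_run else IDLE
    if state == FILL_PIPE:
        # Stay here until PIPE_DEPTH reads have been issued.
        return FILL_PIPE if raddr < PIPE_DEPTH else READ_INPUTS
    if state == READ_INPUTS:
        # Assumption: completion is detected on waddr alone.
        return PHASE1_DONE if waddr == limit else READ_INPUTS
    return IDLE  # PHASE1_DONE always returns to IDLE
```

Feeding this function the raddr/waddr values captured from simulation gives a quick cross-check that the observed state sequence matches the diagram.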
Address Management (Present BRAM)#
The module maintains two addresses with different roles:
Read Address (raddr) - Leading edge:
// Advances PIPE_DEPTH cycles ahead of write address
// Points to data that will be available after pipeline latency
always @(posedge clk) begin
if (~resetn | exec_run | bramPresent_addr_rst)
bramPresent_raddr <= 14'd0;
else if (bramPresent_addr_inc)
bramPresent_raddr <= bramPresent_raddr + 1'b1;
end
Write Address (waddr) - Lagging edge:
// Trails read address by PIPE_DEPTH cycles
// Points to data currently emerging from pipeline
always @(posedge clk) begin
if (~resetn | exec_run | bramPresent_addr_rst)
bramPresent_waddr <= 14'd0;
else if (bramPresent_addr_inc && exec_bram_phase1_ready)
bramPresent_waddr <= bramPresent_waddr + 1'b1;
end
Address Relationship:
Cycle 0-2 (FILL_PIPE):
raddr: 0→1→2→3
waddr: 0→0→0→0 (not advancing until exec_bram_phase1_ready)
Cycle 3+ (READ_INPUTS):
raddr: 3→4→5→6→...
waddr: 0→1→2→3→... (maintaining 3-cycle lag)
Why Two Addresses?
BRAM has 3-cycle read latency
raddr issues read requests
waddr writes zeros to addresses whose data has emerged from pipeline
This implements “read first” behavior: read data, then clear it
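A cycle-level sketch of the read-then-clear scheme may help make the lag between the two addresses concrete. It assumes exec_hbm_rvalidready is asserted every cycle and that read addresses wrap at the buffer depth; the function name read_and_clear is hypothetical.

```python
from collections import deque


def read_and_clear(bram, limit, pipe_depth=3):
    """Stream rows 0..limit-1 out of bram, zeroing each row once its data emerges."""
    pipeline = deque()   # data words still inside the BRAM read pipeline
    raddr = waddr = 0
    spiked = []
    for cycle in range(pipe_depth + limit):
        if cycle >= pipe_depth:
            # Data for waddr has emerged: forward it, then clear that row.
            spiked.append(pipeline.popleft())
            bram[waddr] = 0
            waddr += 1
        # Issue the next read; raddr stays pipe_depth ahead of waddr.
        pipeline.append(bram[raddr % len(bram)])
        raddr += 1
    return spiked, pipe_depth + limit


bram = [0x01, 0x00, 0x04, 0x00, 0x00, 0x08, 0x00, 0x00]
spiked, cycles = read_and_clear(bram, limit=8)
```

The returned cycle count is the fill depth plus the number of rows, which is where the "3 + rows" figure used elsewhere in this document comes from under the no-stall assumption.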
Future BRAM Pipeline Hazard Handling#
The base version implements a 3-stage write pipeline so that back-to-back writes to the same Future BRAM row, arriving within PIPE_DEPTH (3) cycles of one another, are merged rather than overwriting each other.
Problem: If two setArray_go pulses target the same address within 3 cycles, data could be lost.
Solution: Three-stage pipeline with hazard detection and data merging.
// Pipeline registers
reg [13:0] bramFuture_waddr [2:0]; // Stages 2→1→0
reg bramFuture_wren [2:0];
reg [7:0] bramFuture_wdata [2:0];
// Stage assignments (stage 2 is newest, stage 0 is oldest)
always @(posedge clk) begin
if (~resetn) begin
// Initialize all stages
bramFuture_wdata[2] <= 8'd0;
bramFuture_wdata[1] <= 8'd0;
bramFuture_wdata[0] <= 8'd0;
// ... (similar for waddr, wren)
end else if (setArray_go) begin
// Check for hazards at each stage
if (setArray_addr == bramFuture_waddr[2]) begin
// Hazard in stage 2: merge immediately
bramFuture_wdata[2] <= 8'd0;
bramFuture_wdata[1] <= bramFuture_wdata[2] | setArray_data;
bramFuture_wdata[0] <= bramFuture_wdata[1];
bramFuture_wren[2] <= 1'b0;
end else if (setArray_addr == bramFuture_waddr[1]) begin
// Hazard in stage 1: merge with stage 1 data
bramFuture_wdata[2] <= 8'd0;
bramFuture_wdata[1] <= bramFuture_wdata[2];
bramFuture_wdata[0] <= bramFuture_wdata[1] | setArray_data;
bramFuture_wren[2] <= 1'b0;
end else if (setArray_addr == bramFuture_waddr[0]) begin
// Hazard in stage 0: data will merge at BRAM (commented out)
// Current code doesn't merge (see lines 95-96, 103-104)
bramFuture_wdata[2] <= 8'd0;
bramFuture_wdata[1] <= bramFuture_wdata[2];
bramFuture_wdata[0] <= bramFuture_wdata[1];
bramFuture_wren[2] <= 1'b0;
end else begin
// No hazard: normal pipeline operation
bramFuture_wdata[2] <= setArray_data;
bramFuture_wdata[1] <= bramFuture_wdata[2];
bramFuture_wdata[0] <= bramFuture_wdata[1];
bramFuture_wren[2] <= 1'b1;
end
// Always propagate addresses and enables
bramFuture_waddr[2] <= setArray_addr;
bramFuture_waddr[1] <= bramFuture_waddr[2];
bramFuture_waddr[0] <= bramFuture_waddr[1];
bramFuture_wren[1] <= bramFuture_wren[2];
bramFuture_wren[0] <= bramFuture_wren[1];
end else begin
// No new write: propagate with zeros
bramFuture_wdata[2] <= 8'd0;
bramFuture_wdata[1] <= bramFuture_wdata[2];
bramFuture_wdata[0] <= bramFuture_wdata[1];
// ... (propagate addresses/enables)
end
end
Hazard Example:
Cycle | setArray_go | addr | data | Stage2    | Stage1    | Stage0    | Action
------|-------------|------|------|-----------|-----------|-----------|------------------
0     | 1           | 100  | 0x01 | 100/0x01  | -/-       | -/-       | New write
1     | 1           | 100  | 0x02 | 100/0x00  | 100/0x03  | -/-       | Hazard! Merge 0x01|0x02=0x03
2     | 1           | 200  | 0x04 | 200/0x04  | 100/0x00  | 100/0x03  | No hazard
3     | 0           | -    | -    | -/0x00    | 200/0x04  | 100/0x00  | Write 100(0x03)
4     | 0           | -    | -    | -/0x00    | -/0x00    | 200/0x04  | Squashed entry (wren=0), no write
5     | 0           | -    | -    | -/0x00    | -/0x00    | -/0x00    | Write 200(0x04)
Stage columns show register contents after the clock edge that ends each cycle; the Action column shows what stage 0 drives onto the BRAM write port during that cycle.
Note: Lines 95-96 and 103-104 show debugging modifications that bypass the final merge operation:
// Original (with full hazard handling):
// assign bram0_wdata = ~bram_select ? bramPresent_wdata : bramFuture_wdata[0] | bramFuture_rdata | setArray_data_pipe;
// Debug version (simpler, may lose events):
assign bram0_wdata = ~bram_select ? bramPresent_wdata : bramFuture_wdata[0];
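The pipeline listing above can be transcribed into a short behavioral model to check how a given write sequence ends up in the Future BRAM. This is a sketch under two assumptions that the excerpt does not confirm: idle cycles insert an invalid entry into stage 2, and the BRAM write is the plain overwrite of the debug build. The names step and run are hypothetical.

```python
def step(stages, go, addr=None, data=0):
    """Advance the pipeline one clock. stages = [s0, s1, s2], each a
    (waddr, wdata, wren) tuple; s2 is the newest stage, s0 the oldest."""
    s0, s1, s2 = stages
    write = (s0[0], s0[1]) if s0[2] else None   # stage 0 drives the BRAM write port
    if not go:
        new2 = (None, 0, 0)                     # assumption: idle cycles insert a bubble
    elif addr == s2[0]:
        s2 = (s2[0], s2[1] | data, s2[2])       # merge into the entry leaving stage 2
        new2 = (addr, 0, 0)
    elif addr == s1[0]:
        s1 = (s1[0], s1[1] | data, s1[2])       # merge into the entry leaving stage 1
        new2 = (addr, 0, 0)
    elif addr == s0[0]:
        new2 = (addr, 0, 0)                     # debug build: data is not merged here
    else:
        new2 = (addr, data, 1)
    return [s1, s2, new2], write


def run(writes, drain=3):
    """Apply (go, addr, data) tuples, then idle long enough to drain the pipeline."""
    stages = [(None, 0, 0)] * 3
    bram = {}
    for go, addr, data in list(writes) + [(0, None, 0)] * drain:
        stages, write = step(stages, go, addr, data)
        if write:
            bram[write[0]] = write[1]           # debug-build write: plain overwrite
    return bram


back_to_back = run([(1, 100, 0x01), (1, 100, 0x02), (1, 200, 0x04)])
interleaved = run([(1, 100, 0x01), (1, 200, 0x04), (1, 100, 0x02)])
```

Both sequences leave row 100 holding the OR of its two writes, which is the behavior the stage-2 and stage-1 hazard branches are there to guarantee.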
Simple Version Logic#
The simple version removes complex hazard handling for single-core operation:
Key Simplifications:
Direct Future Write (no pipeline):
// No pipeline registers - direct assignment
assign bramFuture_waddr = axonEvent_addr_reg;
assign bramFuture_wdata = axonEvent_data_reg;
assign bramFuture_wren = axonEvent_set_reg;
No Future BRAM Read (unless debugging):
assign bramFuture_raddr = 13'd0;
assign bramFuture_rden = 1'b0; // Disabled to avoid pipeline issues
Simplified State Machine (4 states instead of 5):
// Removed STATE_PHASE1_DONE, completion detected in STATE_READ_INPUTS
STATE_READ_INPUTS: begin
if (exec_hbm_rvalidready) begin
bramPresent_rden = 1'b1;
bramPresent_wren = 1'b1;
if (bramPresent_waddr == axon_addr_limit) begin
phase1_done_set = 1'b1;
next_state = STATE_IDLE; // Direct transition
end
end
end
16 Axons Per Row:
// Base version: 8 axons per row
// BRAM_ADDR_LIMIT = num_inputs[16:3] // Divide by 8
// Simple version: 16 axons per row
// axon_addr_limit = num_inputs[16:4] // Divide by 16
Address Calculation Example:
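num_inputs = 131072 (full BRAM capacity; this value is 2^17 and needs 18 bits, so a 17-bit num_inputs port can express at most 131,071)
Base: BRAM_ADDR_LIMIT = 131072 >> 3 = 16384 rows
Simple: axon_addr_limit = 131072 >> 4 = 8192 rows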
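The row-limit arithmetic is easy to check on the host before programming num_inputs. The sketch below also shows what a 17-bit port would do with the value 131072, which is a property of the bit width rather than something taken from the RTL; the helper names are hypothetical.

```python
NUM_INPUTS_BITS = 17  # documented width of the num_inputs port


def as_port_value(n):
    """Value actually seen by a NUM_INPUTS_BITS-wide port (upper bits truncated)."""
    return n & ((1 << NUM_INPUTS_BITS) - 1)


def row_limit(num_inputs, axons_per_row):
    """Number of BRAM rows covered: num_inputs divided by a power-of-two row width."""
    shift = axons_per_row.bit_length() - 1   # 8 -> 3, 16 -> 4
    return num_inputs >> shift


base_rows = row_limit(131072, 8)
simple_rows = row_limit(131072, 16)
truncated = as_port_value(131072)
```

Because 131072 truncates to 0 in 17 bits, a host driver would need to either cap the configured count at 131,071 or treat the full-capacity case specially.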
Registered Input Events:
// Better place-and-route by registering inputs
always @(posedge clk) begin
if (~resetn) begin
axonEvent_set_reg <= 1'b0;
axonEvent_addr_reg <= 13'd0;
axonEvent_data_reg <= 16'd0;
end else begin
axonEvent_set_reg <= axonEvent_set;
axonEvent_addr_reg <= axonEvent_addr;
axonEvent_data_reg <= axonEvent_data;
end
end
V2 Version Enhancements#
The V2 version adds debug capabilities while simplifying the future write logic:
Simplified Future Write (Read-Modify-Write)#
Instead of complex pipeline hazard detection, V2 uses RMW:
// Read the current value
assign bramFuture_raddr = setArray_addr[16:3]; // Note: only uses upper bits
assign bramFuture_rden = ci2eep_rden | setArray_go | bramFuture_wren[2] | bramFuture_wren[1] | bramFuture_wren[0];
assign bramFuture_rdata = bram_select ? bram0_rdata : bram1_rdata;
// Merge with new data via OR operation
assign bramFuture_wdata = bramFuture_rdata | setArray_data;
// Propagate through 3-stage pipeline (addresses and enables only)
always @(posedge clk) begin
if (~resetn) begin
bramFuture_waddr[2] <= 14'd0;
bramFuture_waddr[1] <= 14'd0;
bramFuture_waddr[0] <= 14'd0;
bramFuture_wren[2] <= 1'b0;
bramFuture_wren[1] <= 1'b0;
bramFuture_wren[0] <= 1'b0;
end else begin
bramFuture_waddr[2] <= setArray_addr;
bramFuture_waddr[1] <= bramFuture_waddr[2];
bramFuture_waddr[0] <= bramFuture_waddr[1];
bramFuture_wren[2] <= setArray_go;
bramFuture_wren[1] <= bramFuture_wren[2];
bramFuture_wren[0] <= bramFuture_wren[1];
end
end
Why This Works:
Always read before write (RMW pattern)
OR operation merges new spikes with existing ones
Simpler than explicit hazard detection
Relies on BRAM “read first” mode
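The difference between a direct write and the read-modify-write pattern can be shown in a few lines. This sketch ignores the 3-cycle read latency, so it illustrates only the merge semantics, not the timing; both function names are hypothetical.

```python
def overwrite(bram, addr, mask):
    """Direct write: the new mask replaces whatever the row held."""
    bram[addr] = mask


def read_modify_write(bram, addr, mask):
    """RMW: read the row, OR in the new spikes, write the result back."""
    bram[addr] = bram.get(addr, 0) | mask


direct, rmw = {}, {}
for addr, mask in [(100, 0x01), (100, 0x02)]:
    overwrite(direct, addr, mask)
    read_modify_write(rmw, addr, mask)
```

With the direct write the first spike is lost, while the RMW version keeps both bits set.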
Debug State Machine#
V2 adds a separate FSM for debug access:
Debug States:
DBG_STATE_RESET = 3'd0 // Reset debug logic
DBG_STATE_IDLE = 3'd1 // Wait for debug command
DBG_STATE_WAIT_0 = 3'd2 // Issue first read
DBG_STATE_WAIT_1 = 3'd3 // Wait cycle 2
DBG_STATE_WAIT_2 = 3'd4 // Wait cycle 3
DBG_STATE_WAIT_3 = 3'd5 // Wait cycle 4
DBG_STATE_DONE = 3'd6 // Send response, pop command
Debug Flow:
1. Command Interpreter writes address to ci2eep FIFO
2. Debug FSM detects ~ci2eep_empty
3. Issue BRAM read (bram0/1_raddr_dbg = ci2eep_dout)
4. Wait 4 cycles for pipeline (WAIT_0→1→2→3)
5. Write response to eep2ci FIFO (address + data)
6. Pop command from ci2eep (ci2eep_rden = 1)
7. Return to IDLE
Debug Interface Behavior:
always @(*) begin
ci2eep_rden = 1'b0;
eep2ci_wren = 1'b0;
dbg_next_state = dbg_curr_state;
case (dbg_curr_state)
DBG_STATE_IDLE: begin
if (~ci2eep_empty)
dbg_next_state = DBG_STATE_WAIT_0;
end
DBG_STATE_WAIT_0: begin
if (~eep2ci_full)
eep2ci_wren = 1'b1; // Write response
dbg_next_state = DBG_STATE_WAIT_1;
end
// ... (similar for WAIT_1, WAIT_2, WAIT_3)
DBG_STATE_DONE: begin
ci2eep_rden = 1'b1; // Pop command
dbg_next_state = DBG_STATE_IDLE;
end
endcase
end
// Response format: {address, data}
assign eep2ci_din = bram_select ? {bram0_raddr_dbg, bram0_rdata_dbg}
: {bram1_raddr_dbg, bram1_rdata_dbg};
// Debug read addresses (always driven)
assign bram0_raddr_dbg = ci2eep_dout;
assign bram1_raddr_dbg = ci2eep_dout;
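On the host side, the 22-bit debug response has to be split back into its address and data fields. The following sketch assumes the layout given by the concatenation above (14-bit address in the upper bits, 8-bit data in the lower bits); the helper names are hypothetical.

```python
ADDR_BITS, DATA_BITS = 14, 8


def pack_response(addr, data):
    """Build the {address, data} word as the concatenation would."""
    assert 0 <= addr < (1 << ADDR_BITS) and 0 <= data < (1 << DATA_BITS)
    return (addr << DATA_BITS) | data


def unpack_response(word):
    """Split a 22-bit response word into (address, data)."""
    return word >> DATA_BITS, word & ((1 << DATA_BITS) - 1)


word = pack_response(0x1234, 0xAB)
addr, data = unpack_response(word)
```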
Memory Map#
Base and V2 Versions (8 axons/row)#
BRAM Organization:
Depth: 16,384 rows (14-bit address)
Width: 8 bits (1 bit per axon)
Total Capacity: 131,072 axons
Dual BRAMs: BRAM0 and BRAM1 (double buffering)
Address Mapping:
Axon ID Range | BRAM Address | Bit Position
------------------|--------------|-------------
0 - 7 | 0x0000 | [0] to [7]
8 - 15 | 0x0001 | [0] to [7]
16 - 23 | 0x0002 | [0] to [7]
... | ... | ...
131064 - 131071 | 0x3FFF | [0] to [7]
Bit Encoding:
Bit = 1: Axon spiked
Bit = 0: No spike
Address Calculation:
bram_addr = axon_id[16:3]; // Upper 14 bits
bit_pos = axon_id[2:0]; // Lower 3 bits
Simple Version (16 axons/row)#
BRAM Organization:
Depth: 8,192 rows (13-bit address)
Width: 16 bits (1 bit per axon)
Total Capacity: 131,072 axons
Dual BRAMs: BRAM0 and BRAM1
Address Mapping:
Axon ID Range | BRAM Address | Bit Position
------------------|--------------|-------------
0 - 15 | 0x0000 | [0] to [15]
16 - 31 | 0x0001 | [0] to [15]
32 - 47 | 0x0002 | [0] to [15]
... | ... | ...
131056 - 131071 | 0x1FFF | [0] to [15]
Address Calculation:
bram_addr = axon_id[16:4]; // Upper 13 bits
bit_pos = axon_id[3:0]; // Lower 4 bits
Memory Utilization:
Base/V2: 16,384 × 8 = 128 Kb per BRAM → 256 Kb total
Simple: 8,192 × 16 = 128 Kb per BRAM → 256 Kb total
Same total capacity, different organization
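Both address mappings follow the same pattern, so a host-side helper can group a list of spiking axon IDs into per-row bit masks for either variant. This is a sketch; pack_events is a hypothetical name and is not part of hs_bridge.

```python
def pack_events(axon_ids, axons_per_row=8):
    """Group axon IDs into {row_address: bit_mask} for a power-of-two row width."""
    shift = axons_per_row.bit_length() - 1        # 8 -> 3, 16 -> 4
    rows = {}
    for axon in axon_ids:
        row = axon >> shift                       # upper bits select the row
        bit = axon & (axons_per_row - 1)          # lower bits select the bit
        rows[row] = rows.get(row, 0) | (1 << bit)
    return rows


base_rows = pack_events([0, 3, 9], axons_per_row=8)
simple_rows = pack_events([0, 3, 9], axons_per_row=16)
last_row = pack_events([131071], axons_per_row=8)
```

Each resulting (row, mask) pair corresponds to one write on the event input interface of the matching variant.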
Timing Diagrams#
Time Step Transition (Double Buffer Swap)#
┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────
clk ┘ └─────┘ └─────┘ └─────┘ └─────┘
exec_run ──────┐ ┌─────────────────────────────────────────
└─────┘
Time Step N-1 │ Time Step N
─────────────────┼─────────────────────────────────────
│
bram_select = 0 │ Toggle → 1
BRAM0 = Present │ BRAM0 → Future
BRAM1 = Future │ BRAM1 → Present
│
State: READ_INPUTS │ IDLE → FILL_PIPE → READ_INPUTS
bramPresent = BRAM0 │ bramPresent = BRAM1
bramFuture = BRAM1 │ bramFuture = BRAM0
Present BRAM Read Sequence (Base & V2)#
Cycle 0 1 2 3 4 5 6 7 8 9
─────┬────┬────┬────┬────┬────┬────┬────┬────┬────
State FILL │FILL│FILL│READ│READ│READ│READ│READ│READ│...
────┬┴────┴────┴────┴────┴────┴────┴────┴────┴────
rden ────┐ ┌───┐ ┌───┐ ┌───┐ ┌───┐ ┌───
────└────┘ └────┘ └────┘ └────┘ └────┘
raddr 0 1 2 3 4 5 6 7 8 9
wren ───────────────────┐ ┌───┐ ┌───┐ ┌───┐
───────────────────└────┘ └────┘ └────┘ └────
waddr 0 0 0 0 1 2 3 4 5 6
┌─────────┐
phase1_ready ────────┘ └──────────────────────────────
(asserted in READ_INPUTS state)
hbm_rvalidready ────────────┐ ┌───┐ ┌───┐ ┌───┐
────────────└────┘ └────┘ └────┘ └────
exec_bram_spiked X X X D0 D1 D2 D3 D4 D5
(pipeline latency = 3 cycles)
BRAM Data Flow:
Cycle 0: Issue read addr 0 → (3-cycle latency) → Cycle 3: Data 0 emerges
Cycle 1: Issue read addr 1 → (3-cycle latency) → Cycle 4: Data 1 emerges
Cycle 2: Issue read addr 2 → (3-cycle latency) → Cycle 5: Data 2 emerges
Cycle 3: Issue read addr 3, Write 0 to addr 0
Cycle 4: Issue read addr 4, Write 0 to addr 1
...
Future BRAM Write with Hazard (Base Version)#
Cycle            0     1     2     3     4     5     6     7
setArray_go      1     1     1     0     0     0     0     0
setArray_addr    100   200   100   -     -     -     -     -
setArray_data    0x01  0x04  0x02  -     -     -     -     -
Pipeline Stage 2 (newest):
waddr            100   200   100   -     -     -     -     -
wdata            0x01  0x04  0x00  0x00  0x00  0x00  0x00  0x00
wren             1     1     0     -     -     -     -     -
Pipeline Stage 1:
waddr            -     100   200   100   -     -     -     -
wdata            -     0x01  0x04  0x00  0x00  0x00  0x00  0x00
wren             -     1     1     0     -     -     -     -
Pipeline Stage 0 (oldest, drives the BRAM write port):
waddr            -     -     100   200   100   -     -     -
wdata            -     -     0x03  0x04  0x00  0x00  0x00  0x00   ← 0x01|0x02 merged
wren             -     -     1     1     0     -     -     -
BRAM Write:
Addr 100, data 0x03 (merged 0x01|0x02, held in stage 0 after cycle 2)
Addr 200, data 0x04 (held in stage 0 after cycle 3)
Addr 100, wren = 0 (squashed duplicate entry, no write)
Note: The cycle 2 write to addr 100 hits a stage-1 hazard with the cycle 0 write, so its data is OR-ed into the older entry as that entry moves into stage 0. Values follow the pipeline listing in the Future BRAM Pipeline Hazard Handling section and are shown after the clock edge that ends each cycle.
Simple Version: No Pipeline (Direct Write)#
Cycle 0 1 2 3 4 5
────┬────┬────┬────┬────┬────
axonEvent_set ──┐ ┌───┐ ┌──────────
───└────┘ └────┘
axonEvent_addr 100 200 300 - - -
axonEvent_data 0x0001 0x0004 0x0008 - - -
(Register inputs for better timing)
────┬────┬────┬────┬────┬────
*_set_reg ───────┐ ┌───┐ ┌──────
───└────┘ └────┘
*_addr_reg - 100 200 300 - -
*_data_reg - 0x0001 0x0004 0x0008 - -
bramFuture_waddr - 100 200 300 - -
bramFuture_wdata - 0x0001 0x0004 0x0008 - -
bramFuture_wren - 1 1 1 - -
BRAM Write: - 100 200 300 - -
0x0001 0x0004 0x0008
No hazard handling! Single-core only.
If same address written twice within pipeline depth, later write overwrites earlier.
V2 Debug Read Sequence#
Cycle 0 1 2 3 4 5 6 7
────┬────┬────┬────┬────┬────┬────┬────
DBG_State IDLE│WAIT│WAIT│WAIT│WAIT│DONE│IDLE│...
│ 0 │ 1 │ 2 │ 3 │ │ │
ci2eep_empty ────┐ ┌───
────└──────────────────────────────────┘
ci2eep_dout 0x1234 (stays valid until rden)
ci2eep_rden ────────────────────────────────────┐ ┌───
────────────────────────────────────└──┘
bram*_raddr_dbg 0x1234 (always driven)
eep2ci_wren ───────┐ ┌───┐ ┌───┐ ┌───┐ ┌───
───────└────┘ └────┘ └────┘ └──┘
(writes in WAIT_0 through WAIT_3 if not full)
eep2ci_din X {0x1234,D} (data D emerges after latency)
Debug transaction:
Cycle 0: Detect command available
Cycle 1-4: Issue reads, wait for pipeline, write response
Cycle 5: Pop command FIFO
Cycle 6: Return to IDLE
Cross-References#
Upstream Modules#
command_interpreter.v (command_interpreter.md):
Generates setArray_go, setArray_addr, setArray_data signals
V2: Provides debug FIFO interfaces (ci2eep_*, eep2ci_*)
Controls when external events are injected
pcie2fifos.v (pcie2fifos.md):
Ultimate source of external events from host
Events flow: PCIe → Command Interpreter → External Events Processor
Downstream Modules#
hbm_processor.v (hbm_processor.md):
Receives exec_bram_spiked (spike mask)
Provides exec_hbm_rvalidready (synchronization signal)
Uses spike masks to fetch pointer chains from HBM
internal_events_processor.v (internal_events_processor.md):
Receives exec_bram_phase1_done (completion signal)
Coordinates two-phase execution (external then internal events)
Peer Modules#
pointer_fifo_controller.v (pointer_fifo_controller.md):
Works with spike masks from this module
Controls flow of pointer data to HBM processor
Module Comparison: When to Use Each Variant#
Use Base Version When:#
Multi-core architecture with multiple cores writing to same future BRAM
Concurrent writes to the same BRAM address are expected
Data integrity is critical and no events can be lost
Pipeline hazards need explicit detection and merging
8 axons per row organization preferred
Trade-offs:
✅ Full hazard handling
✅ No data loss in multi-core scenarios
❌ More complex logic
❌ Higher resource usage (pipeline registers)
⚠️ Debugging modifications present (lines 95-96, 103-104)
Use Simple Version When:#
Single-core architecture with only one writer to future BRAM
Lower resource usage is priority
Wider data paths (16-bit) preferred for bandwidth
No concurrent writes to same address expected
Simpler logic easier to verify and debug
Trade-offs:
✅ Minimal resource usage
✅ 2× data width (16 vs 8 bits)
✅ Simpler state machine (4 vs 5 states)
✅ Better timing due to registered inputs
❌ No hazard protection
❌ Data loss if concurrent writes occur
❌ Single-core only
Use V2 Version When:#
Debug and verification required
BRAM inspection needed during runtime
Command interpreter interface for test patterns
Read-modify-write approach acceptable
Production debugging of neuromorphic algorithms
Trade-offs:
✅ Debug capabilities (FIFO interface)
✅ Simplified future write logic (RMW vs explicit hazards)
✅ Direct BRAM inspection via debug ports
❌ Additional debug FSM (more resources)
❌ Extra FIFO interfaces
❌ Not optimized for performance
Performance Characteristics#
Base and V2 Versions#
Throughput:
Read Rate: 1 BRAM address per exec_hbm_rvalidready cycle
Effective Rate: Limited by HBM bandwidth (~450 MHz possible, typically 225 MHz)
Pipeline Fill: 3 cycles (one-time cost per time step)
Total Time: 3 + num_inputs[16:3] cycles per time step
Example (131,072 axons):
BRAM addresses = 131072 / 8 = 16384
Pipeline fill = 3 cycles
Total cycles = 3 + 16384 = 16387 cycles
At 225 MHz = 16387 / 225e6 = 72.8 µs
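The same estimate can be computed for any input count and row width. This sketch assumes exec_hbm_rvalidready never stalls, so it gives a lower bound on phase 1 time rather than a measured figure; phase1_time is a hypothetical helper.

```python
def phase1_time(num_inputs, axons_per_row, clk_hz=225e6, pipe_fill=3):
    """Return (cycles, microseconds) for one pass over the present BRAM."""
    rows = num_inputs // axons_per_row
    cycles = pipe_fill + rows
    return cycles, cycles / clk_hz * 1e6


base_cycles, base_us = phase1_time(131072, 8)
simple_cycles, simple_us = phase1_time(131072, 16)
```

Halving the row count by widening the row from 8 to 16 bits roughly halves the best-case phase 1 time, since the 3-cycle fill is negligible at this size.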
Future Write Latency:
Base: 3 cycles (pipeline depth) from setArray_go to BRAM write
V2: 3 cycles (pipeline depth) from setArray_go to BRAM write
Hazard Penalty: 0 cycles (merged in pipeline)
Simple Version#
Throughput:
Read Rate: 1 BRAM address per exec_hbm_rvalidready cycle
Effective Rate: 225 MHz typical
Pipeline Fill: 3 cycles
Total Time: 3 + num_inputs[16:4] cycles per time step
Example (131,072 axons):
BRAM addresses = 131072 / 16 = 8192
Pipeline fill = 3 cycles
Total cycles = 3 + 8192 = 8195 cycles
At 225 MHz = 8195 / 225e6 = 36.4 µs (2× faster than base!)
Future Write Latency:
Direct: 1 cycle from axonEvent_set to registered write
Total: 2 cycles (register + BRAM write)
Resource Usage Comparison:
| Resource | Base | Simple | V2 |
|---|---|---|---|
| LUTs (approx.) | 500 | 250 | 600 |
| Flip-Flops | 200 | 120 | 280 |
| BRAM buffers (128 Kb each) | 2 | 2 | 2 |
| Pipeline Regs | 3×(14+8+1) | 0 | 3×(14+1) |
Common Issues and Debugging#
Issue 1: Events Lost During Time Step Transition#
Symptoms:
External events written near exec_run pulse disappear
Inconsistent spike counts between time steps
Root Cause:
Writing to future BRAM while bram_select is toggling
Race condition between write and buffer swap
Debug:
// Check timing of setArray_go relative to exec_run
// Add ILA probe:
ila_0 your_ila (
.clk(clk),
.probe0(exec_run),
.probe1(setArray_go),
.probe2(setArray_addr),
.probe3(bram_select)
);
Solution:
Ensure setArray_go never occurs within 3 cycles of exec_run
Add FIFO between command interpreter and external events processor
Stall writes during buffer swap
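When an ILA capture or simulation log is available, the first rule can be checked offline. The sketch below assumes the capture has been reduced to two lists of cycle numbers and treats "within 3 cycles" as an absolute distance of 3 or less; writes_near_swap is a hypothetical helper.

```python
def writes_near_swap(exec_run_cycles, set_array_cycles, guard=3):
    """Return the setArray_go cycles that fall within guard cycles of any exec_run."""
    return [
        c for c in set_array_cycles
        if any(abs(c - r) <= guard for r in exec_run_cycles)
    ]


risky = writes_near_swap(exec_run_cycles=[100], set_array_cycles=[96, 98, 100, 103, 104])
```

An empty result means no captured write was close enough to a buffer swap to be affected by this issue.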
Issue 2: Pipeline Hazards Not Detected (Base Version)#
Symptoms:
Expected spike data doesn’t appear
OR of multiple writes shows only one bit set
Root Cause:
Hazard detection logic not functioning
Debugging modifications (lines 95-96, 103-104) bypass merging
Debug:
// Monitor pipeline stages.
// Copy element by element: Verilog-2001 cannot assign whole unpacked arrays.
(* mark_debug = "true" *) reg [13:0] bramFuture_waddr_dbg [2:0];
(* mark_debug = "true" *) reg [7:0] bramFuture_wdata_dbg [2:0];
(* mark_debug = "true" *) reg bramFuture_wren_dbg [2:0];
integer i;
always @(posedge clk) begin
    for (i = 0; i < 3; i = i + 1) begin
        bramFuture_waddr_dbg[i] <= bramFuture_waddr[i];
        bramFuture_wdata_dbg[i] <= bramFuture_wdata[i];
        bramFuture_wren_dbg[i] <= bramFuture_wren[i];
    end
end
Solution:
Restore original BRAM write logic:
// Change:
assign bram0_wdata = ~bram_select ? bramPresent_wdata : bramFuture_wdata[0];
// To:
assign bram0_wdata = ~bram_select ? bramPresent_wdata :
bramFuture_wdata[0] | bramFuture_rdata | setArray_data_pipe;
Issue 3: Address Limit Calculation Wrong#
Symptoms:
Phase 1 completes too early or too late
Not all neurons receive input events
Root Cause:
Incorrect calculation of BRAM address limit
Mismatch between num_inputs and actual neuron count
Debug:
// Check address limit
// Base: BRAM_ADDR_LIMIT = num_inputs[16:3] (divide by 8)
// Simple: axon_addr_limit = num_inputs[16:4] (divide by 16)
// Add assertion:
assert property (@(posedge clk) disable iff (~resetn)
(curr_state == STATE_READ_INPUTS && bramPresent_waddr == BRAM_ADDR_LIMIT)
|=> (curr_state == STATE_PHASE1_DONE)
);
Solution:
Verify num_inputs matches neuron configuration
Base: Ensure multiple of 8
Simple: Ensure multiple of 16
Add +1 if rounding needed:
// If num_inputs is not an exact multiple of 8, round up
assign BRAM_ADDR_LIMIT = num_inputs[16:3] + (|num_inputs[2:0]);
Issue 4: BRAM Read Latency Mismatch#
Symptoms:
Data appears corrupted or delayed
exec_bram_spiked shows wrong values
Root Cause:
BRAM configured with latency ≠ PIPE_DEPTH (3)
Pipeline depth parameter doesn’t match actual BRAM
Debug:
// Verify BRAM configuration in IP customization:
// - Read Latency: should be 3
// - Primitive Type: should match PIPE_DEPTH
// Check if raddr/waddr maintain proper offset:
assert property (@(posedge clk) disable iff (~resetn)
(curr_state == STATE_READ_INPUTS && exec_bram_phase1_ready)
|-> (bramPresent_raddr == bramPresent_waddr + PIPE_DEPTH)
);
Solution:
Reconfigure BRAM IP for 3-cycle latency
Or update PIPE_DEPTH parameter to match BRAM:
external_events_processor #(
.PIPE_DEPTH(2) // If BRAM has 2-cycle latency
) eep_inst (
// ...
);
Issue 5: V2 Debug Reads Return Stale Data#
Symptoms:
Debug responses show old/incorrect BRAM data
Debug state machine stuck in WAIT states
Root Cause:
Insufficient wait cycles for BRAM read latency
Debug FSM transitions too quickly
Debug:
// Monitor debug state progression: last 8 states in a packed shift register
// (newest state in bits [2:0]); slices of unpacked arrays cannot be assigned in Verilog-2001
(* mark_debug = "true" *) reg [23:0] dbg_state_history;
always @(posedge clk) begin
    dbg_state_history <= {dbg_state_history[20:0], dbg_curr_state};
end
Solution:
Ensure 4 WAIT states (WAIT_0→WAIT_1→WAIT_2→WAIT_3)
Add extra wait state if needed:
localparam [2:0] DBG_STATE_WAIT_4 = 3'd7;
// In state machine:
DBG_STATE_WAIT_3: begin
if (~eep2ci_full)
eep2ci_wren = 1'b1;
dbg_next_state = DBG_STATE_WAIT_4; // Extra cycle
end
DBG_STATE_WAIT_4: begin
dbg_next_state = DBG_STATE_DONE;
end
Safety and Edge Cases#
Edge Case 1: num_inputs = 0#
Behavior:
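BRAM_ADDR_LIMIT = 0
State machine immediately transitions FILL_PIPE → READ_INPUTS → PHASE1_DONE
No rows are streamed out or cleared (FILL_PIPE, as described in the state behaviors, still issues its pipeline-fill reads)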
Safety:
✅ No undefined behavior
✅ Module functions correctly (zero inputs processed)
⚠️ Wastes cycles (should be caught at system level)
Edge Case 2: num_inputs Not Multiple of 8 (or 16)#
Example: num_inputs = 17'd100
Base Version:
BRAM_ADDR_LIMIT = 100 >> 3 = 12
Actual coverage = 12 * 8 = 96 axons
Missing = 4 axons (96-99 not processed)
Fix:
// Round up to nearest multiple
assign BRAM_ADDR_LIMIT = (num_inputs + 7) >> 3; // Ceiling division
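The effect of truncating versus rounding up is easy to verify exhaustively over the 17-bit input range. This sketch also checks that the two round-up forms used in this document (adding 7 before the shift, or adding the OR-reduction of the low bits after it) agree; the function names are hypothetical.

```python
def rows_floor(n):
    """Documented base-version limit: num_inputs[16:3]."""
    return n >> 3


def rows_ceil(n):
    """Ceiling division by 8: (num_inputs + 7) >> 3."""
    return (n + 7) >> 3


def rows_ceil_alt(n):
    """Equivalent form: num_inputs[16:3] + (|num_inputs[2:0])."""
    return (n >> 3) + (1 if n & 0x7 else 0)


missing = 100 - rows_floor(100) * 8          # axons left uncovered by truncation
forms_agree = all(rows_ceil(n) == rows_ceil_alt(n) for n in range(1 << 17))
```

With rounding up, num_inputs = 100 maps to 13 rows, so axons 96 to 99 are covered and the last 4 bits of the final row are padding.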
Edge Case 3: Concurrent setArray_go and exec_run#
Scenario:
Cycle N: exec_run = 1 (toggle bram_select)
Cycle N: setArray_go = 1 (write to future BRAM)
Problem:
bram_select changes, may write to wrong BRAM
Current Design:
bram_select registered on exec_run edge
setArray_go writes on same edge
Race condition! Indeterminate which BRAM receives write
Solution:
Pipeline exec_run by 1 cycle:
reg exec_run_pipe;
always @(posedge clk) begin
if (~resetn)
exec_run_pipe <= 1'b0;
else
exec_run_pipe <= exec_run;
end
// Use exec_run_pipe for bram_select toggle
always @(posedge clk) begin
if (~resetn)
bram_select <= 1'b0;
else if (exec_run_pipe) // Changed from exec_run
bram_select <= ~bram_select;
end
Edge Case 4: BRAM Write During Pipeline Fill#
Scenario:
STATE_FILL_PIPE: bramPresent_wren = 0 (not asserted yet)
Future writes: bramFuture_wren[0] = 1 (trying to write)
Problem (Multi-core):
If multiple cores write to future BRAM during present BRAM pipeline fill
Potential for lost writes if exceeding BRAM write bandwidth
Current Design:
Single write port per BRAM
Future writes serialized through pipeline
Safe as long as write rate ≤ 1 per 3 cycles
Solution (if needed):
Use dual-port BRAM (separate read/write ports)
Or implement write FIFO to buffer concurrent writes
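If a write FIFO is considered, its required depth depends on how bursty the writes are relative to the one-write-per-3-cycles limit stated above. The Python sketch below estimates worst-case occupancy for a given arrival pattern; `max_fifo_depth` is a hypothetical helper, and the model assumes the write port accepts one entry and is then busy for `service_interval` cycles.

```python
from collections import deque


def max_fifo_depth(arrival_cycles, service_interval=3):
    """Peak number of queued writes for non-negative integer arrival cycles."""
    arrivals = deque(sorted(arrival_cycles))
    pending = 0      # writes waiting in the FIFO
    max_depth = 0
    next_free = 0    # first cycle at which the write port is free again
    cycle = 0
    while arrivals or pending:
        # Enqueue everything that arrives in this cycle.
        while arrivals and arrivals[0] <= cycle:
            arrivals.popleft()
            pending += 1
        # Occupancy is sampled before the port drains an entry.
        max_depth = max(max_depth, pending)
        if pending and cycle >= next_free:
            pending -= 1
            next_free = cycle + service_interval
        cycle += 1
    return max_depth


print(max_fifo_depth([0, 3, 6, 9]))  # writes spaced at the sustainable rate
print(max_fifo_depth([0, 1, 2, 3]))  # back-to-back burst
```

A pattern that respects the rate limit never queues more than one entry, while a back-to-back burst needs proportionally more storage.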
Safety Check: Phase 1 Completion Detection#
Assertion:
// Ensure phase1_done only asserted when all addresses processed
property phase1_done_check;
@(posedge clk) disable iff (~resetn)
(exec_bram_phase1_done) |-> (bramPresent_waddr == BRAM_ADDR_LIMIT);
endproperty
assert_phase1: assert property (phase1_done_check);
Safety Check: No Writes During Buffer Swap#
Assertion:
// Ensure no future writes during exec_run
property no_write_during_swap;
@(posedge clk) disable iff (~resetn)
(exec_run) |-> (bramFuture_wren[0] == 1'b0);
endproperty
assert_no_write: assert property (no_write_during_swap);
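When a simulator without SVA support is used, the same two properties can be checked offline against a recorded signal trace. The Python sketch below assumes a hypothetical trace format (one dict of signal values per clock cycle, keyed by signal name); neither the format nor `check_trace` exists in the project.

```python
def check_trace(trace, bram_addr_limit):
    """Return (cycle, message) tuples for every cycle that violates a property."""
    violations = []
    for cycle, signals in enumerate(trace):
        if not signals.get("resetn", 1):
            continue  # mirrors `disable iff (~resetn)`
        # Property 1: phase1_done implies the write address reached the limit.
        if signals.get("exec_bram_phase1_done", 0):
            if signals.get("bramPresent_waddr") != bram_addr_limit:
                violations.append((cycle, "phase1_done before last address"))
        # Property 2: no future-BRAM write (bit 0 of wren) while exec_run is high.
        if signals.get("exec_run", 0) and signals.get("bramFuture_wren", 0) & 1:
            violations.append((cycle, "future write during buffer swap"))
    return violations


trace = [
    {"resetn": 0, "exec_run": 1, "bramFuture_wren": 1},  # ignored: in reset
    {"resetn": 1, "exec_run": 1, "bramFuture_wren": 1},  # violates property 2
    {"resetn": 1, "exec_bram_phase1_done": 1, "bramPresent_waddr": 12},
]
print(check_trace(trace, bram_addr_limit=12))
```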
Future Enhancement Opportunities#
1. Configurable Data Width#
Allow parameterization of axons per row:
module external_events_processor #(
parameter PIPE_DEPTH = 3,
parameter AXONS_PER_ROW = 8, // 8, 16, 32, etc.
// Derived address and data widths (not intended to be overridden).
// Declared in the parameter list because declarations are not allowed
// between the parameter list and the port list.
parameter ADDR_BITS = 17 - $clog2(AXONS_PER_ROW),
parameter DATA_BITS = AXONS_PER_ROW
)(
input [ADDR_BITS-1:0] setArray_addr,
input [DATA_BITS-1:0] setArray_data,
// ...
);
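The width derivation can be sanity-checked against the variant comparison table (14-bit addresses for 8 axons per row, 13-bit for 16). This Python sketch mirrors the arithmetic; `clog2` and `derived_widths` are hypothetical helper names, and the 17-bit axon index width is taken from the `17 - $clog2(...)` expression.

```python
def clog2(value: int) -> int:
    """Ceiling of log2, matching Verilog's $clog2 for value >= 1."""
    if value < 1:
        raise ValueError("value must be >= 1")
    return (value - 1).bit_length()


def derived_widths(axons_per_row: int, axon_index_bits: int = 17):
    """Return (ADDR_BITS, DATA_BITS) for a given row packing."""
    return axon_index_bits - clog2(axons_per_row), axons_per_row


for axons in (8, 16, 32):
    addr_bits, data_bits = derived_widths(axons)
    print(f"AXONS_PER_ROW={axons}: ADDR_BITS={addr_bits}, DATA_BITS={data_bits}")
```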
2. Burst Mode for Faster Pipeline Fill#
Current: Fill pipeline sequentially (3 cycles)
Enhancement: Issue all 3 reads in 1 cycle (if BRAM supports)
STATE_FILL_PIPE: begin
if (bramPresent_raddr == 0) begin
// Issue all 3 reads at once
bram_raddr[0] = 14'd0;
bram_raddr[1] = 14'd1;
bram_raddr[2] = 14'd2;
bram_rden[0] = 1'b1;
bram_rden[1] = 1'b1;
bram_rden[2] = 1'b1;
next_state = STATE_READ_INPUTS;
end
end
3. Event Timestamping#
Add timestamp to each event for precise temporal resolution:
// Expand data width: [7:0] data + [15:0] timestamp
input [23:0] setArray_data, // {timestamp, spike_mask}
// BRAM organization: 24 bits per row
4. Event Compression#
Sparse events (few spikes per row) waste bandwidth:
// Instead of full bit mask, store indices
// Example: Spikes at axons 5, 17, 42
// Compressed: {3'b011, 6'd42, 6'd17, 6'd5} // Count + indices
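The count-plus-indices layout in the example can be prototyped on the host before committing to an RTL encoding. The Python sketch below packs and unpacks that format; the function names are hypothetical, and the 3-bit count and 6-bit index widths are taken from the example's literals (which implies at most 7 spikes within a 64-axon window per word).

```python
COUNT_BITS = 3  # from 3'b011 in the example
INDEX_BITS = 6  # from 6'd42 in the example


def compress(indices):
    """Pack indices as {count, idx[n-1], ..., idx[0]}; returns (word, width)."""
    indices = sorted(set(indices))
    if len(indices) >= 1 << COUNT_BITS:
        raise ValueError("too many spikes for the count field")
    if indices and (indices[0] < 0 or indices[-1] >= 1 << INDEX_BITS):
        raise ValueError("index does not fit in the index field")
    word = len(indices)  # count occupies the most significant bits
    for idx in reversed(indices):
        word = (word << INDEX_BITS) | idx
    return word, COUNT_BITS + INDEX_BITS * len(indices)


def decompress(word, width):
    """Inverse of compress(); returns the sorted list of spike indices."""
    count = (width - COUNT_BITS) // INDEX_BITS
    if word >> (width - COUNT_BITS) != count:
        raise ValueError("count field does not match word width")
    mask = (1 << INDEX_BITS) - 1
    return [(word >> (INDEX_BITS * i)) & mask for i in range(count)]


word, width = compress([5, 17, 42])
print(f"{width} bits: {word:0{width}b}")
print(decompress(word, width))
```

For three spikes the packed word is 21 bits wide, compared with 64 bits for a full mask over the same window; the advantage shrinks as rows become denser.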
5. Multi-Buffer (>2 BRAMs)#
Allow more than 2 time steps in flight:
parameter NUM_BUFFERS = 4; // Quad buffering
reg [1:0] bram_select; // 2-bit select (4 buffers)
always @(posedge clk) begin
if (exec_run)
bram_select <= (bram_select + 1) & 2'b11; // Circular
end
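With more than two buffers, writers also need to know which physical buffer corresponds to a given number of time steps ahead. The Python sketch below models the circular select; it assumes `bram_select` indexes the present buffer and later time steps occupy the following indices, which the snippet above does not specify.

```python
NUM_BUFFERS = 4  # must be a power of two for the mask-based wrap


def advance(bram_select: int) -> int:
    """Next select value on exec_run, same as (bram_select + 1) & 2'b11."""
    return (bram_select + 1) & (NUM_BUFFERS - 1)


def buffer_for_step(bram_select: int, steps_ahead: int) -> int:
    """Physical buffer holding events due steps_ahead time steps from now."""
    if not 0 <= steps_ahead < NUM_BUFFERS:
        raise ValueError("event is too far in the future for the buffer count")
    return (bram_select + steps_ahead) & (NUM_BUFFERS - 1)


select = 3
target = buffer_for_step(select, 2)  # event scheduled two steps ahead
select = advance(advance(select))    # two exec_run pulses later
print(target, select)                # the targeted buffer is now present
```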
6. AXI4-Stream Interface#
Replace custom interface with standard AXI4-Stream:
// Input events
input s_axis_tvalid,
output s_axis_tready,
input [31:0] s_axis_tdata, // {addr, data}
input s_axis_tlast,
// Output spikes
output m_axis_tvalid,
input m_axis_tready,
output [31:0] m_axis_tdata, // Spike mask + metadata
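On the host side, each event would have to be packed into the 32-bit `s_axis_tdata` word. The Python sketch below shows one possible layout for the base version's 14-bit address and 8-bit data; the bit positions (data in the low byte, address directly above, upper bits zero) are an assumption, since the interface above only states `{addr, data}`.

```python
ADDR_BITS = 14  # base version address width
DATA_BITS = 8   # base version data width (axons per row)


def pack_tdata(addr: int, data: int) -> int:
    """Build a 32-bit tdata word as {zero padding, addr, data}."""
    if not 0 <= addr < 1 << ADDR_BITS:
        raise ValueError("addr out of range")
    if not 0 <= data < 1 << DATA_BITS:
        raise ValueError("data out of range")
    return (addr << DATA_BITS) | data


def unpack_tdata(word: int):
    """Split a tdata word back into (addr, data)."""
    addr = (word >> DATA_BITS) & ((1 << ADDR_BITS) - 1)
    data = word & ((1 << DATA_BITS) - 1)
    return addr, data


word = pack_tdata(addr=12, data=0b0001_0000)
print(f"0x{word:08X}", unpack_tdata(word))
```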
7. Configurable Pipeline Depth#
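Keep the pipeline depth matched to the BRAM read latency at elaboration time:
// A hierarchical reference such as bram0.READ_LATENCY_A is not legal in a
// constant expression, so define the latency once and pass it to both the
// BRAM IP instance and the processor.
localparam BRAM_LATENCY = 3; // Must equal the read latency configured in the BRAM IP
external_events_processor #(
.PIPE_DEPTH(BRAM_LATENCY) // Stays matched to the BRAM by construction
) eep (
// ...
);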
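Why the depth must match the latency is easier to see from the read/clear address schedule: the read address leads, and the write (clear) address lags by exactly the pipeline depth. The Python sketch below generates that schedule under the simplifying assumption that the pipeline never stalls (that is, exec_hbm_rvalidready is asserted every cycle); `read_clear_schedule` is a hypothetical name.

```python
def read_clear_schedule(num_rows: int, pipe_depth: int = 3):
    """Return (cycle, raddr, waddr) tuples; None marks an idle port."""
    schedule = []
    for cycle in range(num_rows + pipe_depth):
        # Leading address: one new read per cycle until all rows are requested.
        raddr = cycle if cycle < num_rows else None
        # Lagging address: clear a row once its data has emerged from the pipeline.
        waddr = cycle - pipe_depth if cycle >= pipe_depth else None
        schedule.append((cycle, raddr, waddr))
    return schedule


for cycle, raddr, waddr in read_clear_schedule(num_rows=4):
    print(f"cycle {cycle}: raddr={raddr}, waddr={waddr}")
```

The first `pipe_depth` cycles correspond to the pipeline fill (reads only), and the last `pipe_depth` cycles drain the pipeline (clears only). If PIPE_DEPTH were smaller than the real latency, a row would be cleared before its data had been captured.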
Key Terms and Definitions#
| Term | Definition |
|---|---|
| Axon | Input neuron connection; source of spike events |
| Double Buffering | Two-buffer scheme (present/future) allowing simultaneous read and write |
| Present BRAM | BRAM being read during current time step (then cleared) |
| Future BRAM | BRAM accumulating events for next time step |
| bram_select | Toggle bit selecting which physical BRAM is present vs. future |
| Pipeline Depth | Number of cycles between BRAM read request and data availability (typically 3) |
| Pipeline Fill | Initial phase where read pipeline is populated before writes begin |
| Leading Address | Read address (raddr) - advances pipeline depth ahead of write address |
| Lagging Address | Write address (waddr) - clears data after it emerges from pipeline |
| Spike Mask | Bit vector where each bit represents spike (1) or no-spike (0) for an axon |
| Phase 1 | External event processing (vs. Phase 2: internal/synaptic events) |
| exec_run | Control pulse starting new time step, toggling present ←→ future BRAMs |
| exec_hbm_rvalidready | Synchronization signal from HBM indicating data consumed, advance BRAM |
| setArray_go | Write pulse for external event (from command interpreter or other source) |
| Pipeline Hazard | Conflict when concurrent writes target same BRAM address within pipeline depth |
| RMW (Read-Modify-Write) | Pattern of reading current value, modifying, then writing back |
| Hazard Detection | Logic identifying when new write conflicts with in-flight writes |
| Data Merging | Combining multiple writes to same address via OR operation |
| Time Step | Discrete computation cycle in neuromorphic algorithm (milliseconds typically) |
| Axon Event | External spike arriving at input neuron |
| Axons Per Row | Number of axons packed into single BRAM address (8 or 16 bits) |
| Address Limit | Maximum BRAM address to read/write (depends on num_inputs) |
Conclusion#
The External Events Processor family provides flexible solutions for managing input spike events in neuromorphic systems:
Base version: Full-featured with pipeline hazard handling for multi-core
Simple version: Streamlined single-core variant with lower resource usage
V2 version: Debug-enhanced variant for verification and development
Key Design Principles:
Double buffering prevents event loss during time step transitions
Pipeline management ensures correct synchronization with BRAM latency
Hazard detection/merging (base) or simplified RMW (V2) prevents data corruption
State machine coordinates read-clear cycles with downstream modules
Selection Guide:
Multi-core system with concurrent writes → Base version
Single-core system, resource-constrained → Simple version
Debug/verification needed → V2 version
For questions or issues, cross-reference with command_interpreter.md (upstream) and hbm_processor.md (downstream) for complete system understanding.