# External Events Processor Module Family ## Overview The **External Events Processor** family manages input spike events (axons) in the neuromorphic FPGA system. These modules maintain two Block RAMs in a double-buffering scheme: one for the "present" time step (currently being processed) and one for the "future" time step (accumulating new events). This architecture allows continuous operation without dropping input events. Three variants exist: 1. **external_events_processor.v** - Base version with full pipeline hazard handling 2. **external_events_processor_simple.v** - Simplified single-core version with wider data paths 3. **external_events_processor_v2.v** - Enhanced version with debugging capabilities ### Role in the Software/Hardware Stack ``` Host Application (Python/C++) | [hs_bridge] | [PCIe Interface] | [Command Interpreter] -----> [External Events Processor] <---- External spike events | | | Present BRAM (8 or 16 axons/row) | Future BRAM (8 or 16 axons/row) | | [HBM Processor] <------ exec_bram_spiked (spike mask) | | [Internal Events] <-------- exec_bram_phase1_done Processor ``` **Function**: - Receive input spike events from external sources or command interpreter - Store events in double-buffered BRAMs (present/future) - Synchronize event delivery with HBM read operations - Clear processed events after reading - Handle pipeline hazards for concurrent writes during multi-core operation **Key Innovation**: Double-buffering allows new events to accumulate in the "future" BRAM while the "present" BRAM is being read and cleared, ensuring no event loss during processing. 
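The double-buffering idea described above can be illustrated with a small behavioral model. This is only a sketch, not the RTL: the class name `DoubleBufferedEvents` and its methods are hypothetical, BRAM read latency and pipeline hazards are ignored, and future writes are assumed to OR into the stored row.

```python
class DoubleBufferedEvents:
    """Behavioral sketch of the present/future buffer pair (hypothetical model)."""

    def __init__(self, rows, axons_per_row=8):
        self.rows = rows
        self.axons_per_row = axons_per_row
        self.bram = [[0] * rows, [0] * rows]  # BRAM0 and BRAM1
        self.select = 0  # index of the buffer currently acting as "present"

    @property
    def present(self):
        return self.bram[self.select]

    @property
    def future(self):
        return self.bram[1 - self.select]

    def set_array(self, addr, mask):
        """Accumulate a spike mask into the future buffer (assumed OR-merge)."""
        self.future[addr] |= mask

    def exec_run(self):
        """Start a new time step: swap the present and future roles."""
        self.select ^= 1

    def read_and_clear(self):
        """Read every present row in address order and clear it afterwards."""
        masks = []
        for addr in range(self.rows):
            masks.append(self.present[addr])
            self.present[addr] = 0
        return masks


if __name__ == "__main__":
    eep = DoubleBufferedEvents(rows=4)
    eep.set_array(1, 0b101)      # lands in the future buffer
    eep.exec_run()               # future becomes present
    eep.set_array(1, 0b010)      # accumulates for the next time step
    print(eep.read_and_clear())  # [0, 5, 0, 0]
    eep.exec_run()
    print(eep.read_and_clear())  # [0, 2, 0, 0]
```

Events written while a time step is being read end up in the other buffer, so they surface one `exec_run` later instead of being dropped.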
--- ## Variant Comparison | Feature | Base Version | Simple Version | V2 Version | |---------|-------------|----------------|------------| | **File** | external_events_processor.v | external_events_processor_simple.v | external_events_processor_v2.v | | **Axons per row** | 8 | 16 | 8 | | **Address width** | 14 bits | 13 bits | 14 bits | | **Data width** | 8 bits | 16 bits | 8 bits | | **Target** | Multi-core | Single-core | Debug/verification | | **Future pipeline** | 3-stage hazard handling | Direct write (no hazards) | Simplified RMW | | **State machine** | 5 states | 4 states | 5 states + debug FSM | | **Debug features** | None | None | CI interface, debug ports | | **Complexity** | High | Low | High | --- ## Module Architecture (Base Version) ``` ┌─────────────────────────────────┐ │ External Events Processor │ │ │ setArray_go ──────────┐ │ ┌──────────────────────────┐ │ setArray_addr[13:0] ──┼──────────┼─>│ Future BRAM Control │ │ setArray_data[7:0] ───┘ │ │ - 3-stage pipeline │ │ │ │ - Hazard detection │ │ exec_run ────────────────────────┼─>│ - waddr/wdata/wren[2:0] │ │ │ └────────┬─────────────────┘ │ │ │ │ │ v │ │ ┌──────────────────────────┐ │ │ │ BRAM Multiplexer │ │ │ │ bram_select toggle │ │ ┌───────────────────────────────┼─>│ - BRAM0 ←→ Present │ │ │ │ │ - BRAM1 ←→ Future │ │ │ ┌────────────────────────────┼─<│ │ │ │ │ │ └───────┬──────────────────┘ │ │ │ │ │ │ │ │ │ v │ │ │ │ ┌──────────────────────────┐ │ │ │ exec_hbm_rvalidready ────┼─>│ Present BRAM Control │ │ │ │ │ │ - State machine (5) │ │ │ │ │ │ - Pipeline fill │ │ │ │ │ │ - Read & clear │ │ │ │ │ │ - raddr/waddr tracking │ │ │ │ │ └────────┬─────────────────┘ │ │ │ │ │ │ │ └───────── exec_bram_spiked[7:0] <───────┘ │ │ │ │ └───────── exec_bram_phase1_done ────────────────────────────────┤ │ │ └─────────────────────────────────┘ BRAM0 (18Kb) BRAM1 (18Kb) ┌────────────┐ ┌────────────┐ │ 16384 × 8b │ │ 16384 × 8b │ │ │ │ │ │ Toggles: │ │ Toggles: │ │ Present ←→ │ │ Future ←→ │ │ Future │ │ 
Present │
└────────────┘ └────────────┘
```

### Data Flow (Two-Phase Operation)

**Phase 0: Setup (between time steps)**
```
1. exec_run pulse triggers:
   - bram_select toggles (swaps present ←→ future)
   - State machine resets to IDLE
2. Present BRAM now contains accumulated events from previous "future"
3. Future BRAM ready to accumulate new events
```

**Phase 1: Event Processing (during time step)**
```
STATE_FILL_PIPE (cycles 0-2):
  ├─> Read BRAM addresses 0, 1, 2
  └─> Fill 3-stage pipeline (no writes yet)

STATE_READ_INPUTS (cycles 3 to completion):
  ├─> Wait for exec_hbm_rvalidready
  ├─> Read next BRAM address (bramPresent_raddr++)
  ├─> Write 0 to lagging address (bramPresent_waddr++)
  ├─> Output exec_bram_spiked[7:0] to downstream
  └─> Loop until bramPresent_waddr == BRAM_ADDR_LIMIT

STATE_PHASE1_DONE:
  └─> Assert exec_bram_phase1_done
```

**Concurrent Future Writes** (throughout processing):
```
setArray_go pulse:
  ├─> Check for pipeline hazards (same address in stages 0, 1, 2)
  ├─> Merge with in-flight data if hazard detected
  ├─> Propagate through 3-stage pipeline
  └─> Write to Future BRAM after 3 cycles
```

---

## Interface Specification

### Base Version (external_events_processor.v)

#### Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| `PIPE_DEPTH` | 3 | BRAM read pipeline depth (matches BRAM latency) |

#### Clock and Reset

| Port | Direction | Width | Description |
|------|-----------|-------|-------------|
| `clk` | Input | 1 | System clock (225 MHz) |
| `resetn` | Input | 1 | Active-low reset; listed as asynchronous, though the code excerpts below sample it on `clk` |

#### Configuration

| Port | Direction | Width | Description |
|------|-----------|-------|-------------|
| `num_inputs` | Input | 17 | Total number of input axons (a 17-bit value tops out at 131,071) |

#### External Event Input Interface

| Port | Direction | Width | Description |
|------|-----------|-------|-------------|
| `setArray_go` | Input | 1 | Write pulse for new axon event |
| `setArray_addr` | Input | 14 | BRAM row address (8 axons per
row) | | `setArray_data` | Input | 8 | Bit mask (1=spike, 0=no spike) | #### Execution Control Interface | Port | Direction | Width | Description | |------|-----------|-------|-------------| | `exec_run` | Input | 1 | Start new time step (toggles BRAMs) | | `exec_bram_phase1_ready` | Output | 1 | Pipeline filled, ready for reads | | `exec_hbm_rvalidready` | Input | 1 | HBM data valid & ready (advance BRAM) | | `exec_bram_spiked` | Output | 8 | Current spike mask (8 axons) | | `exec_bram_phase1_done` | Output | 1 | All inputs read, phase 1 complete | #### BRAM0 Interface | Port | Direction | Width | Description | |------|-----------|-------|-------------| | `bram0_waddr` | Output | 14 | Write address | | `bram0_wdata` | Output | 8 | Write data | | `bram0_wren` | Output | 1 | Write enable | | `bram0_raddr` | Output | 14 | Read address | | `bram0_rden` | Output | 1 | Read enable | | `bram0_rdata` | Input | 8 | Read data (3-cycle latency) | #### BRAM1 Interface | Port | Direction | Width | Description | |------|-----------|-------|-------------| | `bram1_waddr` | Output | 14 | Write address | | `bram1_wdata` | Output | 8 | Write data | | `bram1_wren` | Output | 1 | Write enable | | `bram1_raddr` | Output | 14 | Read address | | `bram1_rden` | Output | 1 | Read enable | | `bram1_rdata` | Input | 8 | Read data (3-cycle latency) | ### Simple Version (external_events_processor_simple.v) Key differences from base version: - **13-bit addresses**: `axonEvent_addr[12:0]`, `bram0/1_*addr[12:0]` - **16-bit data**: `axonEvent_data[15:0]`, `bram0/1_*data[15:0]`, `exec_eep_spiked[15:0]` - **16 axons per row**: `axon_addr_limit = num_inputs[16:4]` (not `[16:3]`) - **Additional output**: `hbm2eep_rden` (HBM FIFO read enable) - **Debug outputs**: `eep_curr_state[1:0]`, `curr_bram_waddr[12:0]` - **Renamed ports**: `exec_eep_*` instead of `exec_bram_*` ### V2 Version (external_events_processor_v2.v) Additional interfaces (beyond base version): #### Command Interpreter Debug Interface | 
Port | Direction | Width | Description | |------|-----------|-------|-------------| | `ci2eep_empty` | Input | 1 | Debug command FIFO empty flag | | `ci2eep_dout` | Input | 14 | Debug read address from CI | | `ci2eep_rden` | Output | 1 | Debug command FIFO read enable | | `eep2ci_full` | Input | 1 | Debug response FIFO full flag | | `eep2ci_din` | Output | 22 | Debug response data (addr + data) | | `eep2ci_wren` | Output | 1 | Debug response FIFO write enable | #### Debug BRAM Read Ports | Port | Direction | Width | Description | |------|-----------|-------|-------------| | `bram0_raddr_dbg` | Output | 14 | Debug read address for BRAM0 | | `bram0_rdata_dbg` | Input | 8 | Debug read data from BRAM0 | | `bram1_raddr_dbg` | Output | 14 | Debug read address for BRAM1 | | `bram1_rdata_dbg` | Input | 8 | Debug read data from BRAM1 | --- ## Detailed Logic Description ### Base Version State Machine #### Present BRAM Control FSM **States:** ```verilog STATE_RESET = 3'd0 // Reset addresses and flags STATE_IDLE = 3'd1 // Wait for exec_run STATE_FILL_PIPE = 3'd2 // Fill 3-stage BRAM read pipeline STATE_READ_INPUTS = 3'd3 // Read inputs, clear memory, sync with HBM STATE_PHASE1_DONE = 3'd4 // Signal completion ``` **State Transitions:** ``` RESET | v IDLE <────────────────┐ | │ | exec_run │ v │ FILL_PIPE │ | │ | raddr >= 3 │ v │ READ_INPUTS │ | │ | waddr == limit │ v │ PHASE1_DONE ────────────┘ ``` **State Behaviors:** ```verilog STATE_RESET: bramPresent_addr_rst = 1'b1 // Reset raddr and waddr to 0 next_state = STATE_IDLE STATE_IDLE: if (exec_run) bramPresent_addr_rst = 1'b1 // Reset for new time step next_state = STATE_FILL_PIPE STATE_FILL_PIPE: if (bramPresent_raddr < PIPE_DEPTH) bramPresent_rden = 1'b1 // Issue read bramPresent_addr_inc = 1'b1 // Increment raddr else next_state = STATE_READ_INPUTS // Pipeline full STATE_READ_INPUTS: if (exec_hbm_rvalidready) // HBM ready for next data bramPresent_rden = 1'b1 // Read next address bramPresent_addr_inc = 1'b1 // Increment both 
                                        // raddr and waddr
    if (bramPresent_waddr == BRAM_ADDR_LIMIT)
        next_state = STATE_PHASE1_DONE

STATE_PHASE1_DONE:
    next_state = STATE_IDLE             // Return to idle
```

#### Address Management (Present BRAM)

The module maintains two addresses with different roles:

**Read Address (raddr)** - Leading edge:
```verilog
// Advances PIPE_DEPTH cycles ahead of write address
// Points to data that will be available after pipeline latency
always @(posedge clk) begin
    if (~resetn | exec_run | bramPresent_addr_rst)
        bramPresent_raddr <= 14'd0;
    else if (bramPresent_addr_inc)
        bramPresent_raddr <= bramPresent_raddr + 1'b1;
end
```

**Write Address (waddr)** - Lagging edge:
```verilog
// Trails read address by PIPE_DEPTH cycles
// Points to data currently emerging from pipeline
always @(posedge clk) begin
    if (~resetn | exec_run | bramPresent_addr_rst)
        bramPresent_waddr <= 14'd0;
    else if (bramPresent_addr_inc && exec_bram_phase1_ready)
        bramPresent_waddr <= bramPresent_waddr + 1'b1;
end
```

**Address Relationship:**
```
Cycle 0-2 (FILL_PIPE):
  raddr: 0→1→2→3
  waddr: 0→0→0→0  (not advancing until exec_bram_phase1_ready)

Cycle 3+ (READ_INPUTS):
  raddr: 3→4→5→6→...
  waddr: 0→1→2→3→...  (maintaining 3-cycle lag)
```

**Why Two Addresses?**
- BRAM has 3-cycle read latency
- raddr issues read requests
- waddr writes zeros to addresses whose data has emerged from pipeline
- This implements "read first" behavior: read data, then clear it

#### Future BRAM Pipeline Hazard Handling

The base version uses a 3-stage write pipeline to handle back-to-back writes to the same future-BRAM address while an earlier write to that address is still in flight, i.e. within `PIPE_DEPTH` cycles.

**Problem**: If two `setArray_go` pulses target the same address within 3 cycles, data could be lost.

**Solution**: Three-stage pipeline with hazard detection and data merging.
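Before the RTL, the failure mode in the Problem statement can be reproduced with a small behavioral model. This is a hedged sketch under stated assumptions: `run_rmw` and `LATENCY` are hypothetical names, each write is modeled as a read-modify-write that commits `LATENCY` cycles after it is issued, and merging is modeled by OR-ing into the pending entry rather than by the stage-by-stage register shifting the RTL uses.

```python
from collections import deque

LATENCY = 3  # assumed issue-to-commit delay in cycles, mirroring PIPE_DEPTH


def run_rmw(events, merge_in_flight):
    """Simulate OR-ing spike masks into memory through a delayed read-modify-write.

    events: list of (cycle, addr, mask) tuples, at most one event per cycle.
    merge_in_flight: when True, an event whose address is still pending is
    OR-ed into that pending write instead of issuing a write of its own.
    """
    mem = {}
    pending = deque()  # entries: [commit_cycle, addr, value_to_write]
    by_cycle = {cycle: (addr, mask) for cycle, addr, mask in events}
    last_cycle = max(by_cycle) + LATENCY + 1

    for cycle in range(last_cycle + 1):
        # Commit every write whose latency has elapsed.
        while pending and pending[0][0] == cycle:
            _, addr, value = pending.popleft()
            mem[addr] = value

        if cycle in by_cycle:
            addr, mask = by_cycle[cycle]
            hit = None
            if merge_in_flight:
                hit = next((p for p in pending if p[1] == addr), None)
            if hit is not None:
                hit[2] |= mask  # hazard: merge with the in-flight data
            else:
                # The row is read now, so it may be stale by commit time.
                pending.append([cycle + LATENCY, addr, mem.get(addr, 0) | mask])
    return mem


if __name__ == "__main__":
    events = [(0, 100, 0x01), (1, 100, 0x02), (2, 200, 0x04)]
    print(run_rmw(events, merge_in_flight=False))  # {100: 2, 200: 4}
    print(run_rmw(events, merge_in_flight=True))   # {100: 3, 200: 4}
```

Without merging, the second write to address 100 reads the row before the first write has landed, so bit 0 is overwritten; with merging, both spikes survive. The RTL that follows achieves the same effect with explicit stage registers.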
```verilog // Pipeline registers reg [13:0] bramFuture_waddr [2:0]; // Stages 2→1→0 reg bramFuture_wren [2:0]; reg [7:0] bramFuture_wdata [2:0]; // Stage assignments (stage 2 is newest, stage 0 is oldest) always @(posedge clk) begin if (~resetn) begin // Initialize all stages bramFuture_wdata[2] <= 8'd0; bramFuture_wdata[1] <= 8'd0; bramFuture_wdata[0] <= 8'd0; // ... (similar for waddr, wren) end else if (setArray_go) begin // Check for hazards at each stage if (setArray_addr == bramFuture_waddr[2]) begin // Hazard in stage 2: merge immediately bramFuture_wdata[2] <= 8'd0; bramFuture_wdata[1] <= bramFuture_wdata[2] | setArray_data; bramFuture_wdata[0] <= bramFuture_wdata[1]; bramFuture_wren[2] <= 1'b0; end else if (setArray_addr == bramFuture_waddr[1]) begin // Hazard in stage 1: merge with stage 1 data bramFuture_wdata[2] <= 8'd0; bramFuture_wdata[1] <= bramFuture_wdata[2]; bramFuture_wdata[0] <= bramFuture_wdata[1] | setArray_data; bramFuture_wren[2] <= 1'b0; end else if (setArray_addr == bramFuture_waddr[0]) begin // Hazard in stage 0: data will merge at BRAM (commented out) // Current code doesn't merge (see lines 95-96, 103-104) bramFuture_wdata[2] <= 8'd0; bramFuture_wdata[1] <= bramFuture_wdata[2]; bramFuture_wdata[0] <= bramFuture_wdata[1]; bramFuture_wren[2] <= 1'b0; end else begin // No hazard: normal pipeline operation bramFuture_wdata[2] <= setArray_data; bramFuture_wdata[1] <= bramFuture_wdata[2]; bramFuture_wdata[0] <= bramFuture_wdata[1]; bramFuture_wren[2] <= 1'b1; end // Always propagate addresses and enables bramFuture_waddr[2] <= setArray_addr; bramFuture_waddr[1] <= bramFuture_waddr[2]; bramFuture_waddr[0] <= bramFuture_waddr[1]; bramFuture_wren[1] <= bramFuture_wren[2]; bramFuture_wren[0] <= bramFuture_wren[1]; end else begin // No new write: propagate with zeros bramFuture_wdata[2] <= 8'd0; bramFuture_wdata[1] <= bramFuture_wdata[2]; bramFuture_wdata[0] <= bramFuture_wdata[1]; // ... 
(propagate addresses/enables) end end ``` **Hazard Example:** ``` Cycle | setArray_go | addr | data | Stage2 | Stage1 | Stage0 | Action ------|-------------|------|------|-----------|-----------|-----------|------------------ 0 | 1 | 100 | 0x01 | 100/0x01 | -/- | -/- | New write 1 | 1 | 100 | 0x02 | 100/0x00 | 100/0x03 | -/- | Hazard! Merge 0x01|0x02=0x03 2 | 1 | 200 | 0x04 | 200/0x04 | 100/0x00 | 100/0x03 | No hazard 3 | 0 | - | - | -/0x00 | 200/0x04 | 100/0x00 | Propagate 4 | 0 | - | - | -/0x00 | -/0x00 | 200/0x04 | Write 100(0x03) 5 | 0 | - | - | -/0x00 | -/0x00 | -/0x00 | Write 200(0x04) ``` **Note**: Lines 95-96 and 103-104 show debugging modifications that bypass the final merge operation: ```verilog // Original (with full hazard handling): // assign bram0_wdata = ~bram_select ? bramPresent_wdata : bramFuture_wdata[0] | bramFuture_rdata | setArray_data_pipe; // Debug version (simpler, may lose events): assign bram0_wdata = ~bram_select ? bramPresent_wdata : bramFuture_wdata[0]; ``` ### Simple Version Logic The simple version removes complex hazard handling for single-core operation: **Key Simplifications:** 1. **Direct Future Write** (no pipeline): ```verilog // No pipeline registers - direct assignment assign bramFuture_waddr = axonEvent_addr_reg; assign bramFuture_wdata = axonEvent_data_reg; assign bramFuture_wren = axonEvent_set_reg; ``` 2. **No Future BRAM Read** (unless debugging): ```verilog assign bramFuture_raddr = 13'd0; assign bramFuture_rden = 1'b0; // Disabled to avoid pipeline issues ``` 3. **Simplified State Machine** (4 states instead of 5): ```verilog // Removed STATE_PHASE1_DONE, completion detected in STATE_READ_INPUTS STATE_READ_INPUTS: begin if (exec_hbm_rvalidready) begin bramPresent_rden = 1'b1; bramPresent_wren = 1'b1; if (bramPresent_waddr == axon_addr_limit) begin phase1_done_set = 1'b1; next_state = STATE_IDLE; // Direct transition end end end ``` 4. 
**16 Axons Per Row**:
```verilog
// Base version: 8 axons per row
// BRAM_ADDR_LIMIT = num_inputs[16:3]  // Divide by 8

// Simple version: 16 axons per row
// axon_addr_limit = num_inputs[16:4]  // Divide by 16
```

**Address Calculation Example:**
- `num_inputs = 131072` (the full BRAM capacity; this value needs 18 bits, so a 17-bit `num_inputs` can express at most 131,071)
- Base: `BRAM_ADDR_LIMIT = 131072 >> 3 = 16384` rows
- Simple: `axon_addr_limit = 131072 >> 4 = 8192` rows

5. **Registered Input Events**:
```verilog
// Better place-and-route by registering inputs
always @(posedge clk) begin
    if (~resetn) begin
        axonEvent_set_reg  <= 1'b0;
        axonEvent_addr_reg <= 13'd0;
        axonEvent_data_reg <= 16'd0;
    end else begin
        axonEvent_set_reg  <= axonEvent_set;
        axonEvent_addr_reg <= axonEvent_addr;
        axonEvent_data_reg <= axonEvent_data;
    end
end
```

### V2 Version Enhancements

The V2 version adds debug capabilities while simplifying the future write logic:

#### Simplified Future Write (Read-Modify-Write)

Instead of complex pipeline hazard detection, V2 uses RMW:

```verilog
// Read the current value
assign bramFuture_raddr = setArray_addr[16:3];  // Note: only uses upper bits
// (slice as given; the port table lists setArray_addr as 14 bits, so check the RTL)
assign bramFuture_rden = ci2eep_rden | setArray_go |
                         bramFuture_wren[2] | bramFuture_wren[1] | bramFuture_wren[0];
assign bramFuture_rdata = bram_select ?
bram0_rdata : bram1_rdata; // Merge with new data via OR operation assign bramFuture_wdata = bramFuture_rdata | setArray_data; // Propagate through 3-stage pipeline (addresses and enables only) always @(posedge clk) begin if (~resetn) begin bramFuture_waddr[2] <= 14'd0; bramFuture_waddr[1] <= 14'd0; bramFuture_waddr[0] <= 14'd0; bramFuture_wren[2] <= 1'b0; bramFuture_wren[1] <= 1'b0; bramFuture_wren[0] <= 1'b0; end else begin bramFuture_waddr[2] <= setArray_addr; bramFuture_waddr[1] <= bramFuture_waddr[2]; bramFuture_waddr[0] <= bramFuture_waddr[1]; bramFuture_wren[2] <= setArray_go; bramFuture_wren[1] <= bramFuture_wren[2]; bramFuture_wren[0] <= bramFuture_wren[1]; end end ``` **Why This Works:** - Always read before write (RMW pattern) - OR operation merges new spikes with existing ones - Simpler than explicit hazard detection - Relies on BRAM "read first" mode #### Debug State Machine V2 adds a separate FSM for debug access: **Debug States:** ```verilog DBG_STATE_RESET = 3'd0 // Reset debug logic DBG_STATE_IDLE = 3'd1 // Wait for debug command DBG_STATE_WAIT_0 = 3'd2 // Issue first read DBG_STATE_WAIT_1 = 3'd3 // Wait cycle 2 DBG_STATE_WAIT_2 = 3'd4 // Wait cycle 3 DBG_STATE_WAIT_3 = 3'd5 // Wait cycle 4 DBG_STATE_DONE = 3'd6 // Send response, pop command ``` **Debug Flow:** ``` 1. Command Interpreter writes address to ci2eep FIFO 2. Debug FSM detects ~ci2eep_empty 3. Issue BRAM read (bram0/1_raddr_dbg = ci2eep_dout) 4. Wait 4 cycles for pipeline (WAIT_0→1→2→3) 5. Write response to eep2ci FIFO (address + data) 6. Pop command from ci2eep (ci2eep_rden = 1) 7. Return to IDLE ``` **Debug Interface Behavior:** ```verilog always @(*) begin ci2eep_rden = 1'b0; eep2ci_wren = 1'b0; dbg_next_state = dbg_curr_state; case (dbg_curr_state) DBG_STATE_IDLE: begin if (~ci2eep_empty) dbg_next_state = DBG_STATE_WAIT_0; end DBG_STATE_WAIT_0: begin if (~eep2ci_full) eep2ci_wren = 1'b1; // Write response dbg_next_state = DBG_STATE_WAIT_1; end // ... 
(similar for WAIT_1, WAIT_2, WAIT_3) DBG_STATE_DONE: begin ci2eep_rden = 1'b1; // Pop command dbg_next_state = DBG_STATE_IDLE; end endcase end // Response format: {address, data} assign eep2ci_din = bram_select ? {bram0_raddr_dbg, bram0_rdata_dbg} : {bram1_raddr_dbg, bram1_rdata_dbg}; // Debug read addresses (always driven) assign bram0_raddr_dbg = ci2eep_dout; assign bram1_raddr_dbg = ci2eep_dout; ``` --- ## Memory Map ### Base and V2 Versions (8 axons/row) **BRAM Organization:** - **Depth**: 16,384 rows (14-bit address) - **Width**: 8 bits (1 bit per axon) - **Total Capacity**: 131,072 axons - **Dual BRAMs**: BRAM0 and BRAM1 (double buffering) **Address Mapping:** ``` Axon ID Range | BRAM Address | Bit Position ------------------|--------------|------------- 0 - 7 | 0x0000 | [0] to [7] 8 - 15 | 0x0001 | [0] to [7] 16 - 23 | 0x0002 | [0] to [7] ... | ... | ... 131064 - 131071 | 0x3FFF | [0] to [7] ``` **Bit Encoding:** - Bit = 1: Axon spiked - Bit = 0: No spike **Address Calculation:** ```verilog bram_addr = axon_id[16:3]; // Upper 14 bits bit_pos = axon_id[2:0]; // Lower 3 bits ``` ### Simple Version (16 axons/row) **BRAM Organization:** - **Depth**: 8,192 rows (13-bit address) - **Width**: 16 bits (1 bit per axon) - **Total Capacity**: 131,072 axons - **Dual BRAMs**: BRAM0 and BRAM1 **Address Mapping:** ``` Axon ID Range | BRAM Address | Bit Position ------------------|--------------|------------- 0 - 15 | 0x0000 | [0] to [15] 16 - 31 | 0x0001 | [0] to [15] 32 - 47 | 0x0002 | [0] to [15] ... | ... | ... 
131056 - 131071 | 0x1FFF | [0] to [15] ``` **Address Calculation:** ```verilog bram_addr = axon_id[16:4]; // Upper 13 bits bit_pos = axon_id[3:0]; // Lower 4 bits ``` **Memory Utilization:** - Base/V2: 16,384 × 8 = 128 Kb per BRAM → 256 Kb total - Simple: 8,192 × 16 = 128 Kb per BRAM → 256 Kb total - Same total capacity, different organization --- ## Timing Diagrams ### Time Step Transition (Double Buffer Swap) ``` ┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬───── clk ┘ └─────┘ └─────┘ └─────┘ └─────┘ exec_run ──────┐ ┌───────────────────────────────────────── └─────┘ Time Step N-1 │ Time Step N ─────────────────┼───────────────────────────────────── │ bram_select = 0 │ Toggle → 1 BRAM0 = Present │ BRAM0 → Future BRAM1 = Future │ BRAM1 → Present │ State: READ_INPUTS │ IDLE → FILL_PIPE → READ_INPUTS bramPresent = BRAM0 │ bramPresent = BRAM1 bramFuture = BRAM1 │ bramFuture = BRAM0 ``` ### Present BRAM Read Sequence (Base & V2) ``` Cycle 0 1 2 3 4 5 6 7 8 9 ─────┬────┬────┬────┬────┬────┬────┬────┬────┬──── State FILL │FILL│FILL│READ│READ│READ│READ│READ│READ│... ────┬┴────┴────┴────┴────┴────┴────┴────┴────┴──── rden ────┐ ┌───┐ ┌───┐ ┌───┐ ┌───┐ ┌─── ────└────┘ └────┘ └────┘ └────┘ └────┘ raddr 0 1 2 3 4 5 6 7 8 9 wren ───────────────────┐ ┌───┐ ┌───┐ ┌───┐ ───────────────────└────┘ └────┘ └────┘ └──── waddr 0 0 0 0 1 2 3 4 5 6 ┌─────────┐ phase1_ready ────────┘ └────────────────────────────── (asserted in READ_INPUTS state) hbm_rvalidready ────────────┐ ┌───┐ ┌───┐ ┌───┐ ────────────└────┘ └────┘ └────┘ └──── exec_bram_spiked X X X D0 D1 D2 D3 D4 D5 (pipeline latency = 3 cycles) BRAM Data Flow: Cycle 0: Issue read addr 0 → (3-cycle latency) → Cycle 3: Data 0 emerges Cycle 1: Issue read addr 1 → (3-cycle latency) → Cycle 4: Data 1 emerges Cycle 2: Issue read addr 2 → (3-cycle latency) → Cycle 5: Data 2 emerges Cycle 3: Issue read addr 3, Write 0 to addr 0 Cycle 4: Issue read addr 4, Write 0 to addr 1 ... 
``` ### Future BRAM Write with Hazard (Base Version) ``` Cycle 0 1 2 3 4 5 6 7 ────┬────┬────┬────┬────┬────┬────┬──── setArray_go ───┐ ┌───┐ ┌─────────────────── ───└────┘ └────┘ setArray_addr 100 200 100 - - - - - setArray_data 0x01 0x04 0x02 - - - - - Pipeline Stage 2: waddr 100 200 100 - - - - - wdata 0x01 0x04 0x00 - - - - - wren 1 1 0 - - - - - Pipeline Stage 1: waddr - 100 200 100 - - - - wdata - 0x01 0x04 0x03 - - - - ← Merged! wren - 1 1 0 - - - - Pipeline Stage 0: waddr - - 100 200 100 - - - wdata - - 0x01 0x04 0x00 - - - wren - - 1 1 0 - - - BRAM Write: Addr 100 ──────────────────────────────────────┐ Data = 0x01 (from cycle 2) ───────────────────┘ Addr 200 ─────────────────────────────────────────┐ Data = 0x04 (from cycle 3) ───────────────────┘ Addr 100 ────────────────────────────────────────────┐ Data = 0x03 (merged 0x01|0x02 from cycle 4) ───────┘ Note: Cycle 2 write to addr 100 detected hazard with cycle 0, merged data in stage 1. ``` ### Simple Version: No Pipeline (Direct Write) ``` Cycle 0 1 2 3 4 5 ────┬────┬────┬────┬────┬──── axonEvent_set ──┐ ┌───┐ ┌────────── ───└────┘ └────┘ axonEvent_addr 100 200 300 - - - axonEvent_data 0x0001 0x0004 0x0008 - - - (Register inputs for better timing) ────┬────┬────┬────┬────┬──── *_set_reg ───────┐ ┌───┐ ┌────── ───└────┘ └────┘ *_addr_reg - 100 200 300 - - *_data_reg - 0x0001 0x0004 0x0008 - - bramFuture_waddr - 100 200 300 - - bramFuture_wdata - 0x0001 0x0004 0x0008 - - bramFuture_wren - 1 1 1 - - BRAM Write: - 100 200 300 - - 0x0001 0x0004 0x0008 No hazard handling! Single-core only. If same address written twice within pipeline depth, later write overwrites earlier. ``` ### V2 Debug Read Sequence ``` Cycle 0 1 2 3 4 5 6 7 ────┬────┬────┬────┬────┬────┬────┬──── DBG_State IDLE│WAIT│WAIT│WAIT│WAIT│DONE│IDLE│... 
│ 0 │ 1 │ 2 │ 3 │ │ │ ci2eep_empty ────┐ ┌─── ────└──────────────────────────────────┘ ci2eep_dout 0x1234 (stays valid until rden) ci2eep_rden ────────────────────────────────────┐ ┌─── ────────────────────────────────────└──┘ bram*_raddr_dbg 0x1234 (always driven) eep2ci_wren ───────┐ ┌───┐ ┌───┐ ┌───┐ ┌─── ───────└────┘ └────┘ └────┘ └──┘ (writes in WAIT_0 through WAIT_3 if not full) eep2ci_din X {0x1234,D} (data D emerges after latency) Debug transaction: Cycle 0: Detect command available Cycle 1-4: Issue reads, wait for pipeline, write response Cycle 5: Pop command FIFO Cycle 6: Return to IDLE ``` --- ## Cross-References ### Upstream Modules - **command_interpreter.v** (`command_interpreter.md`): - Generates `setArray_go`, `setArray_addr`, `setArray_data` signals - V2: Provides debug FIFO interfaces (`ci2eep_*`, `eep2ci_*`) - Controls when external events are injected - **pcie2fifos.v** (`pcie2fifos.md`): - Ultimate source of external events from host - Events flow: PCIe → Command Interpreter → External Events Processor ### Downstream Modules - **hbm_processor.v** (`hbm_processor.md`): - Receives `exec_bram_spiked` (spike mask) - Provides `exec_hbm_rvalidready` (synchronization signal) - Uses spike masks to fetch pointer chains from HBM - **internal_events_processor.v** (`internal_events_processor.md`): - Receives `exec_bram_phase1_done` (completion signal) - Coordinates two-phase execution (external then internal events) ### Peer Modules - **pointer_fifo_controller.v** (`pointer_fifo_controller.md`): - Works with spike masks from this module - Controls flow of pointer data to HBM processor --- ## Module Comparison: When to Use Each Variant ### Use Base Version When: - **Multi-core architecture** with multiple cores writing to same future BRAM - **Concurrent writes** to the same BRAM address are expected - **Data integrity** is critical and no events can be lost - **Pipeline hazards** need explicit detection and merging - **8 axons per row** organization 
preferred

**Trade-offs:**
- ✅ Full hazard handling
- ✅ No data loss in multi-core scenarios
- ❌ More complex logic
- ❌ Higher resource usage (pipeline registers)
- ⚠️ Debugging modifications present (lines 95-96, 103-104)

### Use Simple Version When:

- **Single-core architecture** with only one writer to future BRAM
- **Lower resource usage** is priority
- **Wider data paths** (16-bit) preferred for bandwidth
- **No concurrent writes** to same address expected
- **Simpler logic** easier to verify and debug

**Trade-offs:**
- ✅ Minimal resource usage
- ✅ 2× data width (16 vs 8 bits)
- ✅ Simpler state machine (4 vs 5 states)
- ✅ Better timing due to registered inputs
- ❌ No hazard protection
- ❌ Data loss if concurrent writes occur
- ❌ Single-core only

### Use V2 Version When:

- **Debug and verification** required
- **BRAM inspection** needed during runtime
- **Command interpreter interface** for test patterns
- **Read-modify-write** approach acceptable
- **Production debugging** of neuromorphic algorithms

**Trade-offs:**
- ✅ Debug capabilities (FIFO interface)
- ✅ Simplified future write logic (RMW vs explicit hazards)
- ✅ Direct BRAM inspection via debug ports
- ❌ Additional debug FSM (more resources)
- ❌ Extra FIFO interfaces
- ❌ Not optimized for performance

---

## Performance Characteristics

### Base and V2 Versions

**Throughput:**
- **Read Rate**: 1 BRAM address per `exec_hbm_rvalidready` cycle
- **Effective Rate**: Bounded by how often the HBM side asserts `exec_hbm_rvalidready` (~450 MHz possible, typically 225 MHz)
- **Pipeline Fill**: 3 cycles (one-time cost per time step)
- **Total Time**: `3 + num_inputs[16:3]` cycles per time step (best case, with `exec_hbm_rvalidready` asserted every cycle)

**Example** (131,072 input axons):
```
BRAM addresses = 131072 / 8 = 16384
Pipeline fill = 3 cycles
Total cycles = 3 + 16384 = 16387 cycles
At 225 MHz = 16387 / 225e6 = 72.8 µs
```

**Future Write Latency:**
- **Base**: 3 cycles (pipeline depth) from `setArray_go` to BRAM write
- **V2**: 3 cycles (pipeline depth) from `setArray_go` to BRAM write
- **Hazard Penalty**: 0
cycles (merged in pipeline) ### Simple Version **Throughput:** - **Read Rate**: 1 BRAM address per `exec_hbm_rvalidready` cycle - **Effective Rate**: 225 MHz typical - **Pipeline Fill**: 3 cycles - **Total Time**: `3 + num_inputs[16:4]` cycles per time step **Example** (131,072 neurons): ``` BRAM addresses = 131072 / 16 = 8192 Pipeline fill = 3 cycles Total cycles = 3 + 8192 = 8195 cycles At 225 MHz = 8195 / 225e6 = 36.4 µs (2× faster than base!) ``` **Future Write Latency:** - **Direct**: 1 cycle from `axonEvent_set` to registered write - **Total**: 2 cycles (register + BRAM write) **Resource Usage Comparison:** | Resource | Base | Simple | V2 | |----------|------|--------|-----| | LUTs (approx.) | 500 | 250 | 600 | | Flip-Flops | 200 | 120 | 280 | | BRAM18K | 2 | 2 | 2 | | Pipeline Regs | 3×(14+8+1) | 0 | 3×(14+1) | --- ## Common Issues and Debugging ### Issue 1: Events Lost During Time Step Transition **Symptoms:** - External events written near `exec_run` pulse disappear - Inconsistent spike counts between time steps **Root Cause:** - Writing to future BRAM while `bram_select` is toggling - Race condition between write and buffer swap **Debug:** ```verilog // Check timing of setArray_go relative to exec_run // Add ILA probe: ila_0 your_ila ( .clk(clk), .probe0(exec_run), .probe1(setArray_go), .probe2(setArray_addr), .probe3(bram_select) ); ``` **Solution:** - Ensure `setArray_go` never occurs within 3 cycles of `exec_run` - Add FIFO between command interpreter and external events processor - Stall writes during buffer swap ### Issue 2: Pipeline Hazards Not Detected (Base Version) **Symptoms:** - Expected spike data doesn't appear - OR of multiple writes shows only one bit set **Root Cause:** - Hazard detection logic not functioning - Debugging modifications (lines 95-96, 103-104) bypass merging **Debug:** ```verilog // Monitor pipeline stages (* mark_debug = "true" *) reg [13:0] bramFuture_waddr_dbg [2:0]; (* mark_debug = "true" *) reg [7:0] 
bramFuture_wdata_dbg [2:0];
(* mark_debug = "true" *) reg bramFuture_wren_dbg [2:0];

// Copy element by element: Verilog-2001 does not allow whole-array assignment
integer i;
always @(posedge clk) begin
    for (i = 0; i < 3; i = i + 1) begin
        bramFuture_waddr_dbg[i] <= bramFuture_waddr[i];
        bramFuture_wdata_dbg[i] <= bramFuture_wdata[i];
        bramFuture_wren_dbg[i]  <= bramFuture_wren[i];
    end
end
```

**Solution:**
- Restore original BRAM write logic:
```verilog
// Change:
assign bram0_wdata = ~bram_select ? bramPresent_wdata : bramFuture_wdata[0];
// To:
assign bram0_wdata = ~bram_select ? bramPresent_wdata :
                     bramFuture_wdata[0] | bramFuture_rdata | setArray_data_pipe;
```

### Issue 3: Address Limit Calculation Wrong

**Symptoms:**
- Phase 1 completes too early or too late
- Not all axons receive input events

**Root Cause:**
- Incorrect calculation of BRAM address limit
- Mismatch between `num_inputs` and the actual number of input axons

**Debug:**
```verilog
// Check address limit
// Base: BRAM_ADDR_LIMIT = num_inputs[16:3] (divide by 8)
// Simple: axon_addr_limit = num_inputs[16:4] (divide by 16)

// Add assertion (SystemVerilog):
assert property (@(posedge clk) disable iff (~resetn)
    (curr_state == STATE_READ_INPUTS && bramPresent_waddr == BRAM_ADDR_LIMIT)
    |=> (curr_state == STATE_PHASE1_DONE)
);
```

**Solution:**
- Verify `num_inputs` matches the configured number of input axons
- Base: Ensure multiple of 8
- Simple: Ensure multiple of 16
- Add `+1` if rounding needed:
```verilog
// If num_inputs not exact multiple
assign BRAM_ADDR_LIMIT = (num_inputs[16:3]) + |num_inputs[2:0];  // Round up
```

### Issue 4: BRAM Read Latency Mismatch

**Symptoms:**
- Data appears corrupted or delayed
- `exec_bram_spiked` shows wrong values

**Root Cause:**
- BRAM configured with latency ≠ PIPE_DEPTH (3)
- Pipeline depth parameter doesn't match actual BRAM

**Debug:**
```verilog
// Verify BRAM configuration in IP customization:
// - Read Latency: should be 3
// - Optional output registers: set so that the total read latency equals PIPE_DEPTH

// Check if raddr/waddr maintain proper offset:
assert property (@(posedge clk) disable iff (~resetn)
    (curr_state == STATE_READ_INPUTS && exec_bram_phase1_ready)
    |-> (bramPresent_raddr
        == bramPresent_waddr + PIPE_DEPTH)
);
```

**Solution:**
- Reconfigure BRAM IP for 3-cycle latency
- Or update PIPE_DEPTH parameter to match BRAM:
```verilog
external_events_processor #(
    .PIPE_DEPTH(2)  // If BRAM has 2-cycle latency
) eep_inst (
    // ...
);
```

### Issue 5: V2 Debug Reads Return Stale Data

**Symptoms:**
- Debug responses show old/incorrect BRAM data
- Debug state machine stuck in WAIT states

**Root Cause:**
- Insufficient wait cycles for BRAM read latency
- Debug FSM transitions too quickly

**Debug:**
```verilog
// Monitor debug state progression
(* mark_debug = "true" *) reg [2:0] dbg_state_history [7:0];
integer k;
always @(posedge clk) begin
    // Shift element by element (array slices are not assignable in Verilog-2001)
    for (k = 7; k > 0; k = k - 1)
        dbg_state_history[k] <= dbg_state_history[k-1];
    dbg_state_history[0] <= dbg_curr_state;
end
```

**Solution:**
- Ensure 4 WAIT states (WAIT_0→WAIT_1→WAIT_2→WAIT_3)
- Add extra wait state if needed:
```verilog
localparam [2:0] DBG_STATE_WAIT_4 = 3'd7;

// In state machine:
DBG_STATE_WAIT_3: begin
    if (~eep2ci_full) eep2ci_wren = 1'b1;
    dbg_next_state = DBG_STATE_WAIT_4;  // Extra cycle
end
DBG_STATE_WAIT_4: begin
    dbg_next_state = DBG_STATE_DONE;
end
```

---

## Safety and Edge Cases

### Edge Case 1: num_inputs = 0

**Behavior:**
- `BRAM_ADDR_LIMIT = 0`
- State machine immediately transitions FILL_PIPE → READ_INPUTS → PHASE1_DONE
- No rows are consumed or cleared in READ_INPUTS (per the FSM described above, FILL_PIPE still issues its `PIPE_DEPTH` priming reads)

**Safety:**
- ✅ No undefined behavior
- ✅ Module functions correctly (zero inputs processed)
- ⚠️ Wastes cycles (should be caught at system level)

### Edge Case 2: num_inputs Not Multiple of 8 (or 16)

**Example:** `num_inputs = 17'd100`

**Base Version:**
```verilog
BRAM_ADDR_LIMIT = 100 >> 3 = 12
Actual coverage = 12 * 8 = 96 axons
Missing = 4 axons (96-99 not processed)
```

**Fix:**
```verilog
// Round up to nearest multiple
assign BRAM_ADDR_LIMIT = (num_inputs + 7) >> 3;  // Ceiling division
```

### Edge Case 3: Concurrent setArray_go and exec_run

**Scenario:**
```
Cycle N: exec_run = 1 (toggle bram_select)
Cycle N: setArray_go = 1 (write to future BRAM)
```

**Problem:**
- 
`bram_select` changes, may write to wrong BRAM **Current Design:** - `bram_select` registered on `exec_run` edge - `setArray_go` writes on same edge - **Race condition!** Indeterminate which BRAM receives write **Solution:** - Pipeline `exec_run` by 1 cycle: ```verilog reg exec_run_pipe; always @(posedge clk) begin if (~resetn) exec_run_pipe <= 1'b0; else exec_run_pipe <= exec_run; end // Use exec_run_pipe for bram_select toggle always @(posedge clk) begin if (~resetn) bram_select <= 1'b0; else if (exec_run_pipe) // Changed from exec_run bram_select <= ~bram_select; end ``` ### Edge Case 4: BRAM Write During Pipeline Fill **Scenario:** ``` STATE_FILL_PIPE: bramPresent_wren = 0 (not asserted yet) Future writes: bramFuture_wren[0] = 1 (trying to write) ``` **Problem (Multi-core):** - If multiple cores write to future BRAM during present BRAM pipeline fill - Potential for lost writes if exceeding BRAM write bandwidth **Current Design:** - Single write port per BRAM - Future writes serialized through pipeline - **Safe** as long as write rate ≤ 1 per 3 cycles **Solution (if needed):** - Use dual-port BRAM (separate read/write ports) - Or implement write FIFO to buffer concurrent writes ### Safety Check: Phase 1 Completion Detection **Assertion:** ```verilog // Ensure phase1_done only asserted when all addresses processed property phase1_done_check; @(posedge clk) disable iff (~resetn) (exec_bram_phase1_done) |-> (bramPresent_waddr == BRAM_ADDR_LIMIT); endproperty assert_phase1: assert property (phase1_done_check); ``` ### Safety Check: No Writes During Buffer Swap **Assertion:** ```verilog // Ensure no future writes during exec_run property no_write_during_swap; @(posedge clk) disable iff (~resetn) (exec_run) |-> (bramFuture_wren[0] == 1'b0); endproperty assert_no_write: assert property (no_write_during_swap); ``` --- ## Future Enhancement Opportunities ### 1. 
Configurable Data Width Allow parameterization of axons per row: ```verilog module external_events_processor #( parameter PIPE_DEPTH = 3, parameter AXONS_PER_ROW = 8 // 8, 16, 32, etc. )( // Derive address and data widths localparam ADDR_BITS = 17 - $clog2(AXONS_PER_ROW); localparam DATA_BITS = AXONS_PER_ROW; input [ADDR_BITS-1:0] setArray_addr, input [DATA_BITS-1:0] setArray_data, // ... ); ``` ### 2. Burst Mode for Faster Pipeline Fill Current: Fill pipeline sequentially (3 cycles) Enhancement: Issue all 3 reads in 1 cycle (if BRAM supports) ```verilog STATE_FILL_PIPE: begin if (bramPresent_raddr == 0) begin // Issue all 3 reads at once bram_raddr[0] = 14'd0; bram_raddr[1] = 14'd1; bram_raddr[2] = 14'd2; bram_rden[0] = 1'b1; bram_rden[1] = 1'b1; bram_rden[2] = 1'b1; next_state = STATE_READ_INPUTS; end end ``` ### 3. Event Timestamping Add timestamp to each event for precise temporal resolution: ```verilog // Expand data width: [7:0] data + [15:0] timestamp input [23:0] setArray_data, // {timestamp, spike_mask} // BRAM organization: 24 bits per row ``` ### 4. Event Compression Sparse events (few spikes per row) waste bandwidth: ```verilog // Instead of full bit mask, store indices // Example: Spikes at axons 5, 17, 42 // Compressed: {3'b011, 6'd42, 6'd17, 6'd5} // Count + indices ``` ### 5. Multi-Buffer (>2 BRAMs) Allow more than 2 time steps in flight: ```verilog parameter NUM_BUFFERS = 4; // Quad buffering reg [1:0] bram_select; // 2-bit select (4 buffers) always @(posedge clk) begin if (exec_run) bram_select <= (bram_select + 1) & 2'b11; // Circular end ``` ### 6. AXI4-Stream Interface Replace custom interface with standard AXI4-Stream: ```verilog // Input events input s_axis_tvalid, output s_axis_tready, input [31:0] s_axis_tdata, // {addr, data} input s_axis_tlast, // Output spikes output m_axis_tvalid, input m_axis_tready, output [31:0] m_axis_tdata, // Spike mask + metadata ``` ### 7. 
Configurable Pipeline Depth Auto-detect BRAM latency at synthesis: ```verilog // Query BRAM IP for latency localparam BRAM_LATENCY = bram0.READ_LATENCY_A; // From BRAM IP external_events_processor #( .PIPE_DEPTH(BRAM_LATENCY) // Match automatically ) eep ( // ... ); ``` --- ## Key Terms and Definitions | Term | Definition | |------|------------| | **Axon** | Input neuron connection; source of spike events | | **Double Buffering** | Two-buffer scheme (present/future) allowing simultaneous read and write | | **Present BRAM** | BRAM being read during current time step (then cleared) | | **Future BRAM** | BRAM accumulating events for next time step | | **bram_select** | Toggle bit selecting which physical BRAM is present vs. future | | **Pipeline Depth** | Number of cycles between BRAM read request and data availability (typically 3) | | **Pipeline Fill** | Initial phase where read pipeline is populated before writes begin | | **Leading Address** | Read address (raddr) - advances pipeline depth ahead of write address | | **Lagging Address** | Write address (waddr) - clears data after it emerges from pipeline | | **Spike Mask** | Bit vector where each bit represents spike (1) or no-spike (0) for an axon | | **Phase 1** | External event processing (vs. 
Phase 2: internal/synaptic events) | | **exec_run** | Control pulse starting new time step, toggling present ←→ future BRAMs | | **exec_hbm_rvalidready** | Synchronization signal from HBM indicating data consumed, advance BRAM | | **setArray_go** | Write pulse for external event (from command interpreter or other source) | | **Pipeline Hazard** | Conflict when concurrent writes target same BRAM address within pipeline depth | | **RMW (Read-Modify-Write)** | Pattern of reading current value, modifying, then writing back | | **Hazard Detection** | Logic identifying when new write conflicts with in-flight writes | | **Data Merging** | Combining multiple writes to same address via OR operation | | **Time Step** | Discrete computation cycle in neuromorphic algorithm (milliseconds typically) | | **Axon Event** | External spike arriving at input neuron | | **Axons Per Row** | Number of axons packed into single BRAM address (8 or 16 bits) | | **Address Limit** | Maximum BRAM address to read/write (depends on num_inputs) | --- ## Conclusion The **External Events Processor** family provides flexible solutions for managing input spike events in neuromorphic systems: - **Base version**: Full-featured with pipeline hazard handling for multi-core - **Simple version**: Streamlined single-core variant with lower resource usage - **V2 version**: Debug-enhanced variant for verification and development **Key Design Principles:** 1. Double buffering prevents event loss during time step transitions 2. Pipeline management ensures correct synchronization with BRAM latency 3. Hazard detection/merging (base) or simplified RMW (V2) prevents data corruption 4. 
State machine coordinates read-clear cycles with downstream modules **Selection Guide:** - Multi-core system with concurrent writes → **Base version** - Single-core system, resource-constrained → **Simple version** - Debug/verification needed → **V2 version** For questions or issues, cross-reference with `command_interpreter.md` (upstream) and `hbm_processor.md` (downstream) for complete system understanding.
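
As a companion to the selection guide, the row-count arithmetic behind the address limit can be checked in simulation before committing to a variant. The following is a minimal sketch, not part of the shipped modules: the testbench name `tb_addr_limit`, the function `rows_needed`, and the task `check` are hypothetical. It assumes the ceiling-division fix from Edge Case 2, with 8 axons per row for the base and V2 versions and 16 for the simple version.

```verilog
`timescale 1ns/1ps

// Hypothetical self-checking testbench (illustrative names only).
module tb_addr_limit;

    // Ceiling division: number of BRAM rows needed to cover num_inputs axons.
    // Assumption: 8 axons per row (base, V2) or 16 axons per row (simple).
    function [16:0] rows_needed;
        input [16:0]  num_inputs;
        input integer axons_per_row;
        begin
            rows_needed = (num_inputs + axons_per_row - 1) / axons_per_row;
        end
    endfunction

    integer errors;

    // Compare a computed value against the expected one and count mismatches.
    task check;
        input [16:0] got;
        input [16:0] expected;
        begin
            if (got !== expected) begin
                errors = errors + 1;
                $display("MISMATCH: got %0d, expected %0d", got, expected);
            end
        end
    endtask

    initial begin
        errors = 0;
        check(rows_needed(17'd100, 8),  17'd13); // 100 axons, 8 per row
        check(rows_needed(17'd100, 16), 17'd7);  // 100 axons, 16 per row
        check(rows_needed(17'd96,  8),  17'd12); // exact multiple of 8
        check(rows_needed(17'd96,  16), 17'd6);  // exact multiple of 16
        check(rows_needed(17'd0,   8),  17'd0);  // num_inputs = 0
        if (errors == 0)
            $display("all address-limit checks passed");
        $finish;
    end

endmodule
```

The truncating shift described in Edge Case 2 would give 12 rows for 100 inputs in the base version, which is the mismatch the first check is meant to expose. Whether the actual modules treat the limit as a row count or as a last valid address should be confirmed against the RTL.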