Pointer FIFO Controller Module#
Overview#
The Pointer FIFO Controller is a critical datapath component that manages the flow of synaptic pointer data during the two-phase neuromorphic execution cycle. It demultiplexes 512-bit HBM pointer data into 16 parallel FIFOs (one per neuron group), then arbitrates between these FIFOs to feed the HBM processor during Phase 2 (synaptic weight fetch).
Role in the Software/Hardware Stack#
Phase 1: External/Internal Events
(Spike Detection)
|
┌─────────────────────────────┼─────────────────────────────┐
| v |
| [External Events Processor] |
| | |
| exec_bram_spiked[15:0] |
| | |
| [Internal Events Processor] |
| | |
| exec_uram_spiked[15:0] |
| | |
| v |
| ┌────────────────────────────┐ |
| │ Pointer FIFO Controller │ |
| │ │ |
| HBM ───>│ 512b → 16×32b demux │ |
| Data │ 16 Pointer FIFOs (ptr0-15)│ |
| │ Round-robin arbiter │ |
| └────────────┬───────────────┘ |
| | |
| ptrFIFO (32b) |
| | |
| v |
| [HBM Processor] |
| | |
| Synaptic Weights |
| | |
| v |
| [Spike FIFOs] ──> Phase 2 Synaptic Updates |
└───────────────────────────────────────────────────────────┘
Function:
Demultiplex HBM Pointer Data: Split 512-bit HBM read into 16×32-bit pointer records
Spike-Gated Buffering: Only store pointers for neurons that spiked (sparse event handling)
Fair Arbitration: Round-robin scheduler ensures all neuron groups get equal service
Phase Coordination: Handle both external (BRAM) and internal (URAM) spike events
Key Innovation: By buffering pointers in 16 parallel FIFOs, the module decouples HBM read bandwidth from pointer processing, allowing efficient handling of sparse neural activity.
Module Architecture#
HBM Data Path (512 bits)
|
v
┌──────────────────────────────┐
│ exec_hbm_rdata[511:0] │
│ exec_hbm_rvalidready │
└──────────────┬───────────────┘
|
┌─────────────────────────┼─────────────────────────┐
| v |
| Demux to 16 Groups |
| [31:0] [63:32] [95:64] ... [511:480] |
| | | | | |
| v v v v |
| ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐ |
| │FIFO0│ │FIFO1│ │FIFO2│ ... │FIFO15│ |
| │32b │ │32b │ │32b │ │32b │ |
| │FWFT │ │FWFT │ │FWFT │ │FWFT │ |
| └──┬──┘ └──┬──┘ └──┬──┘ └──┬───┘ |
| ^ ^ ^ ^ |
| | | | | |
| wren0 wren1 wren2 wren15 |
| | | | | |
| └────────┴────────┴─────────────┘ |
| | |
| Spike-Gated Write Enable Logic |
| (bram_spiked[i] | uram_spiked[i]) & !full |
| ^ |
| ┌───────────────┴───────────────┐ |
| | | |
| exec_bram_spiked[15:0] exec_uram_spiked[15:0] |
| | | |
| ┌───┴────┐ ┌───────┴──────┐ |
| │External│ │ Internal │ |
| │Events │ │ Events │ |
| │Proc. │ │ Proc. │ |
| └────────┘ └──────────────┘ |
└───────────────────────────────────────────────────┘
Round-Robin Arbiter
|
addr[3:0] counter (0→15)
|
┌───────────────────┼───────────────────┐
| v |
| 16:1 Multiplexer |
| (Select ptr_dout[addr]) |
| | |
| v |
| ┌─────────────┐ |
| │ ptrFIFO │ |
| │ (32-bit) │ |
| │ To HBM Proc│ |
| └─────────────┘ |
└───────────────────────────────────────┘
Two-Phase Operation#
Phase 1a: External Events (BRAM Reading)
1. bram_reading = 1 (set on exec_run)
2. For each HBM read (exec_hbm_rvalidready):
- Split 512b data into ptr0_din...ptr15_din
- Write to FIFO[i] if exec_bram_spiked[i]==1 and !ptr_full[i]
3. Continue until exec_bram_phase1_done
4. Transition to Phase 1b
Phase 1b: Internal Events (URAM Reading)
1. uram_reading = 1 (set on exec_bram_phase1_done)
2. For each HBM read (exec_hbm_rvalidready):
- Split 512b data into ptr0_din...ptr15_din
- Write to FIFO[i] if exec_uram_spiked[i]==1 and !ptr_full[i]
3. Continue until exec_uram_phase1_done
4. End of Phase 1
Phase 2: Pointer Drain (Concurrent with Phase 1)
1. Round-robin arbiter cycles addr 0→1→2→...→15→0
2. Every cycle:
- If ptr[addr]_empty==0 and ptrFIFO_full==0:
* Read from ptr[addr] (rden=1)
* Write to ptrFIFO (wren=1, din=ptr_dout)
3. HBM processor consumes pointers, fetches synapses
4. Continues until all FIFOs empty
Interface Specification#
Clock and Reset#
| Port | Direction | Width | Description |
|---|---|---|---|
| clk | Input | 1 | System clock (225 MHz typical) |
| resetn | Input | 1 | Active-low reset (sampled on the rising edge of clk in the logic shown in this document) |
Execution Control#
| Port | Direction | Width | Description |
|---|---|---|---|
| exec_run | Input | 1 | Start new time step (sets bram_reading=1) |
External Events Processor Interface#
| Port | Direction | Width | Description |
|---|---|---|---|
| exec_bram_spiked | Input | 16 | Spike mask from external events (16 neuron groups) |
| exec_bram_phase1_done | Input | 1 | External events complete, transition to internal |
Internal Events Processor Interface#
| Port | Direction | Width | Description |
|---|---|---|---|
| exec_uram_spiked | Input | 16 | Spike mask from internal events (16 neuron groups) |
| exec_uram_phase1_done | Input | 1 | Internal events complete, end Phase 1 |
HBM Processor Input Interface#
| Port | Direction | Width | Description |
|---|---|---|---|
| exec_hbm_rvalidready | Input | 1 | HBM read data valid and ready |
| exec_hbm_rdata | Input | 512 | HBM read data (16 pointers × 32 bits) |
| hbm2pfc_rden | Output | 1 | FIFO read enable (FWFT mode) |
Note: Comments indicate hbm2pfc_dout and hbm2pfc_empty are wired at top wrapper level.
Pointer FIFO Interfaces (16 instances: ptr0-ptr15)#
Each pointer FIFO has an identical interface (example for ptr0):
| Port | Direction | Width | Description |
|---|---|---|---|
| ptr0_full | Input | 1 | FIFO full flag (backpressure) |
| ptr0_din | Output | 32 | Data input to FIFO (pointer record) |
| ptr0_wren | Output | 1 | Write enable (gated by spike and full) |
| ptr0_empty | Input | 1 | FIFO empty flag |
| ptr0_dout | Input | 32 | Data output from FIFO |
| ptr0_rden | Output | 1 | Read enable (from arbiter) |
Pointer FIFOs: ptr1, ptr2, …, ptr15 (identical interfaces)
HBM Processor Output Interface (Aggregated Pointer FIFO)#
| Port | Direction | Width | Description |
|---|---|---|---|
| ptrFIFO_full | Input | 1 | Aggregated FIFO full flag |
| ptrFIFO_din | Output | 32 | Pointer data to HBM processor |
| ptrFIFO_wren | Output | 1 | Write enable (from arbiter) |
Detailed Logic Description#
Phase Tracking State Machine#
The module uses two registers to track execution phase:
reg bram_reading; // Phase 1a: External events
reg uram_reading; // Phase 1b: Internal events
always @(posedge clk) begin
if (!resetn) begin
bram_reading <= 1'b0;
uram_reading <= 1'b0;
end else if (exec_run) begin
// Start of new time step: begin external event processing
bram_reading <= 1'b1;
end else if (exec_bram_phase1_done & !uram_reading) begin
// Transition from external to internal event processing
bram_reading <= 1'b0;
uram_reading <= 1'b1;
end else if (exec_uram_phase1_done) begin
// End of Phase 1
uram_reading <= 1'b0;
end
end
State Transitions:
IDLE (both=0)
|
| exec_run
v
BRAM_READING (bram=1, uram=0)
|
| exec_bram_phase1_done
v
URAM_READING (bram=0, uram=1)
|
| exec_uram_phase1_done
v
IDLE (both=0)
Note: During idle, the round-robin arbiter continues draining pointer FIFOs (Phase 2).
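The phase-tracking logic above can be checked against a small software reference model. The following is a minimal Python sketch, not part of the RTL: the function name step_phase and the stimulus list are hypothetical, and the model only mirrors the priority order of the if/else chain shown above (reset, then exec_run, then the two done signals).

```python
def step_phase(bram_reading, uram_reading, resetn, exec_run,
               bram_done, uram_done):
    """Return the next (bram_reading, uram_reading) pair for one clock edge.

    Mirrors the priority of the if/else chain in the Verilog above.
    """
    if not resetn:
        return (False, False)
    if exec_run:
        # Only bram_reading is assigned in this branch; uram_reading holds.
        return (True, uram_reading)
    if bram_done and not uram_reading:
        return (False, True)
    if uram_done:
        return (bram_reading, False)
    return (bram_reading, uram_reading)


# Hypothetical stimulus: one time step walking IDLE -> BRAM -> URAM -> IDLE.
stimuli = [
    dict(exec_run=True,  bram_done=False, uram_done=False),
    dict(exec_run=False, bram_done=False, uram_done=False),
    dict(exec_run=False, bram_done=True,  uram_done=False),
    dict(exec_run=False, bram_done=False, uram_done=True),
]

state = (False, False)
trace = []
for s in stimuli:
    state = step_phase(*state, resetn=True, **s)
    trace.append(state)

for cycle, (bram, uram) in enumerate(trace):
    print(cycle, "bram_reading =", int(bram), "uram_reading =", int(uram))
```

Because exec_run has priority over the done signals in the chain, a model like this is a convenient place to try corner cases such as exec_run arriving while uram_reading is still set.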
HBM Data Demultiplexing#
The 512-bit HBM data is split into 16 groups of 32 bits:
// Direct bit-slice assignments
assign ptr0_din = exec_hbm_rdata[031:000]; // Bits 0-31
assign ptr1_din = exec_hbm_rdata[063:032]; // Bits 32-63
assign ptr2_din = exec_hbm_rdata[095:064]; // Bits 64-95
assign ptr3_din = exec_hbm_rdata[127:096]; // Bits 96-127
// ... (pattern continues)
assign ptr15_din = exec_hbm_rdata[511:480]; // Bits 480-511
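The bit slicing above can be mirrored in software when building test vectors. This is a minimal Python sketch under the assumption that the 512-bit word is held as a Python integer; demux_512 and the test word are hypothetical names and values, not part of the design.

```python
def demux_512(rdata):
    """Split a 512-bit integer into 16 lanes; lane i is bits [32*i+31 : 32*i]."""
    return [(rdata >> (32 * i)) & 0xFFFFFFFF for i in range(16)]


# Build a recognizable test word: lane i holds 0x1000_0000 + i.
word = 0
for i in range(16):
    word |= (0x1000_0000 + i) << (32 * i)

lanes = demux_512(word)
for i in (0, 1, 15):
    print(f"ptr{i}_din = 0x{lanes[i]:08X}")
```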
Data Layout (each 32-bit pointer):
Bits [31:23] = Length (9 bits, max 511 synapses)
Bits [22:0] = Start address in HBM (23 bits, byte address)
Example:
exec_hbm_rdata = 512'h...AB12_3456_CD78_9ABC
ptr0_din = 32'hCD78_9ABC → Length=0x19A (410), Addr=0x789ABC
ptr1_din = 32'hAB12_3456 → Length=0x156 (342), Addr=0x123456
...
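The field split of a pointer record is easy to get wrong by hand, so a small decoder helps when checking values. The following Python sketch assumes only the layout stated above (length in bits [31:23], start address in bits [22:0]); the helper names decode_pointer and encode_pointer are hypothetical.

```python
LEN_BITS = 9     # bits [31:23]
ADDR_BITS = 23   # bits [22:0]


def decode_pointer(ptr):
    """Return (length, start_address) from a 32-bit pointer record."""
    length = (ptr >> ADDR_BITS) & ((1 << LEN_BITS) - 1)
    addr = ptr & ((1 << ADDR_BITS) - 1)
    return length, addr


def encode_pointer(length, addr):
    """Pack (length, start_address) back into a 32-bit pointer record."""
    assert 0 <= length < (1 << LEN_BITS)
    assert 0 <= addr < (1 << ADDR_BITS)
    return (length << ADDR_BITS) | addr


for raw in (0xCD789ABC, 0xAB123456):
    length, addr = decode_pointer(raw)
    print(f"0x{raw:08X}: length=0x{length:03X} ({length}), addr=0x{addr:06X}")
```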
Spike-Gated Write Enable Logic#
Each pointer FIFO write is conditional on:
HBM data valid and ready
Corresponding spike bit asserted
FIFO not full
assign ptr0_wren = !ptr0_full & exec_hbm_rvalidready &
((bram_reading & exec_bram_spiked[0]) |
(uram_reading & exec_uram_spiked[0]));
assign ptr1_wren = !ptr1_full & exec_hbm_rvalidready &
((bram_reading & exec_bram_spiked[1]) |
(uram_reading & exec_uram_spiked[1]));
// ... (pattern repeats for ptr2-ptr15)
Logic Breakdown:
ptr_wren[i] = !ptr_full[i] // FIFO has space
& exec_hbm_rvalidready // HBM data available
& (
(bram_reading & exec_bram_spiked[i]) // External spike
|
(uram_reading & exec_uram_spiked[i]) // Internal spike
)
Example Scenarios:
Scenario 1: External spike on neuron group 5
Cycle N:
bram_reading = 1
exec_bram_spiked = 16'b0000_0000_0010_0000 (bit 5 set)
exec_hbm_rvalidready = 1
ptr5_full = 0
Result:
ptr5_wren = 1 → Write exec_hbm_rdata[191:160] to ptr5 FIFO
ptr0-4,6-15_wren = 0 → No write to other FIFOs
Scenario 2: Multiple spikes (groups 0, 3, 7)
Cycle N:
uram_reading = 1
exec_uram_spiked = 16'b0000_0000_1000_1001 (bits 0,3,7 set)
exec_hbm_rvalidready = 1
ptr0_full = 0, ptr3_full = 0, ptr7_full = 1 (ptr7 full!)
Result:
ptr0_wren = 1 → Write to ptr0
ptr3_wren = 1 → Write to ptr3
ptr7_wren = 0 → Blocked by full (data lost!)
Others = 0
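Backpressure Handling: The full flag does not stall the HBM side. ptr_full[i] only masks the write enable, while the HBM word is still consumed (hbm2pfc_rden follows exec_hbm_rvalidready). If a FIFO is full when its spike arrives, that pointer is lost, so the system must ensure the FIFOs drain fast enough.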
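The two scenarios above can be reproduced with a bit-mask model of the write-enable equation. This is a minimal Python sketch, not the RTL: the function names are hypothetical, and the 16 per-FIFO signals are packed into 16-bit integers (bit i corresponds to ptr i) purely for convenience.

```python
MASK16 = 0xFFFF


def write_requests(hbm_valid, bram_reading, uram_reading,
                   bram_spiked, uram_spiked):
    """16-bit mask of FIFOs that want a write this cycle (before the full check)."""
    if not hbm_valid:
        return 0
    req = 0
    if bram_reading:
        req |= bram_spiked
    if uram_reading:
        req |= uram_spiked
    return req & MASK16


def write_enables(req, full_mask):
    """Requests that actually write: the FIFO must not be full."""
    return req & ~full_mask & MASK16


def dropped(req, full_mask):
    """Requests blocked by a full FIFO, i.e. pointers that are lost."""
    return req & full_mask & MASK16


# Scenario 1: external spike on group 5, nothing full.
req1 = write_requests(True, True, False, 0b0000_0000_0010_0000, 0)
wren1 = write_enables(req1, full_mask=0)

# Scenario 2: internal spikes on groups 0, 3, 7 with ptr7 full.
req2 = write_requests(True, False, True, 0, 0b0000_0000_1000_1001)
wren2 = write_enables(req2, full_mask=1 << 7)
lost2 = dropped(req2, full_mask=1 << 7)

print(f"scenario 1: wren=0x{wren1:04X}")
print(f"scenario 2: wren=0x{wren2:04X} lost=0x{lost2:04X}")
```

Separating the request mask from the full check makes the lost pointers visible as their own mask, which the hardware write enable alone does not expose.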
Round-Robin Arbiter#
A 4-bit counter cycles through FIFOs 0-15, giving exactly one FIFO a service slot each cycle:
reg [3:0] addr; // 4 bits for 16 FIFOs (0-15)
always @(posedge clk) begin
if (~resetn)
addr <= 4'd0;
else
addr <= addr + 1'b1; // Wraps 15→0 automatically
end
Arbitration Cycle:
Cycle 0: addr=0 → Check ptr0
Cycle 1: addr=1 → Check ptr1
Cycle 2: addr=2 → Check ptr2
...
Cycle 15: addr=15 → Check ptr15
Cycle 16: addr=0 → Back to ptr0
...
Arbitration Logic (combinational):
always @(*) begin
// Default: No reads, no writes
ptr0_rden = 1'b0;
ptr1_rden = 1'b0;
// ... (all ptr*_rden = 0)
ptrFIFO_din = 32'dX;
ptrFIFO_wren = 1'b0;
case (addr)
4'd0: begin
if (~ptr0_empty & ~ptrFIFO_full) begin
ptr0_rden = 1'b1;
ptrFIFO_din = ptr0_dout;
ptrFIFO_wren = 1'b1;
end
end
4'd1: begin
if (~ptr1_empty & ~ptrFIFO_full) begin
ptr1_rden = 1'b1;
ptrFIFO_din = ptr1_dout;
ptrFIFO_wren = 1'b1;
end
end
// ... (pattern repeats for 4'd2 through 4'd15)
default: begin
// All outputs stay at default (0 or X)
end
endcase
end
Arbitration Example:
Cycle | addr | ptr0_empty | ptr1_empty | ptr5_empty | ptrFIFO_full | Action
-------|------|------------|------------|------------|--------------|------------------
0 | 0 | 0 | 0 | 0 | 0 | Read ptr0
1 | 1 | 0 | 0 | 0 | 0 | Read ptr1
2 | 2 | 1 | 0 | 0 | 0 | Skip (empty)
3 | 3 | 1 | 1 | 0 | 0 | Skip (empty)
4 | 4 | 1 | 1 | 0 | 0 | Skip (empty)
5 | 5 | 1 | 1 | 0 | 0 | Read ptr5
6 | 6 | 1 | 1 | 1 | 0 | Skip (empty)
7 | 7 | 1 | 1 | 1 | 0 | Skip (empty)
... | ... | ... | ... | ... | ... | ...
15 | 15 | 1 | 1 | 1 | 0 | Skip (empty)
16 | 0 | 0 | 1 | 1 | 0 | Read ptr0 again
Fairness: Each FIFO gets equal opportunity (once per 16 cycles), regardless of occupancy.
Starvation: Because the slots are fixed, a FIFO that always has data cannot monopolize the arbiter; the other FIFOs still receive their slots. No single FIFO can block the others.
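The arbitration behavior described above can be replayed with a short behavioral model. This Python sketch is an illustration under simplifying assumptions: the pointer FIFOs and ptrFIFO are modeled as deques, arbiter_step and the sample contents are hypothetical, and FIFO flag latency is ignored.

```python
from collections import deque


def arbiter_step(addr, fifos, ptrfifo, ptrfifo_depth):
    """One clock: service FIFO 'addr' if it has data and ptrFIFO has room.

    Returns the next address; the counter advances unconditionally.
    """
    if fifos[addr] and len(ptrfifo) < ptrfifo_depth:
        ptrfifo.append(fifos[addr].popleft())
    return (addr + 1) & 0xF


fifos = [deque() for _ in range(16)]
fifos[0].extend(["a0", "a1"])   # ptr0 holds two pointers
fifos[1].append("b0")
fifos[5].append("f0")

ptrfifo = deque()
addr = 0
for _ in range(17):             # one full rotation plus one cycle
    addr = arbiter_step(addr, fifos, ptrfifo, ptrfifo_depth=64)

print(list(ptrfifo))
```

The second pointer of ptr0 is only forwarded on the second visit to address 0, which is the fixed-slot behavior that the fairness note describes.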
FWFT (First-Word Fall-Through) Mode#
The FIFOs operate in FWFT mode, meaning data appears on dout immediately when empty deasserts:
Traditional FIFO:
Cycle N: rden=1 (issue read)
Cycle N+1: dout valid (1 cycle latency)
FWFT FIFO:
Cycle N: empty=0, dout already valid
Cycle N: rden=1 (consume word, advance to next)
Cycle N+1: dout shows next word (if available)
Why FWFT?: Reduces latency - arbiter can read and forward pointer in single cycle.
HBM FIFO Read Enable:
assign hbm2pfc_rden = exec_hbm_rvalidready;
Every time HBM data is consumed (exec_hbm_rvalidready=1), the FIFO is advanced to present next 512-bit word. This assumes FWFT mode on the HBM data FIFO.
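The difference between FWFT and a standard FIFO is easiest to see in a model where dout is visible without a read. The class below is a minimal Python sketch; FwftFifo is a hypothetical name, and the model deliberately ignores the extra cycles a real FIFO IP core may need before empty deasserts after a write.

```python
from collections import deque


class FwftFifo:
    """Behavioral first-word fall-through FIFO (zero-latency simplification)."""

    def __init__(self, depth):
        self.depth = depth
        self._q = deque()

    @property
    def empty(self):
        return not self._q

    @property
    def full(self):
        return len(self._q) >= self.depth

    @property
    def dout(self):
        # The head word is visible whenever the FIFO is not empty.
        return self._q[0] if self._q else None

    def clock(self, wren=False, din=None, rden=False):
        """Apply one clock edge; flags are sampled before the edge."""
        do_read = rden and not self.empty
        do_write = wren and not self.full
        if do_read:
            self._q.popleft()
        if do_write:
            self._q.append(din)


f = FwftFifo(depth=4)
f.clock(wren=True, din=0x11)
f.clock(wren=True, din=0x22)
first = f.dout            # visible without asserting rden
f.clock(rden=True)        # consume the head word
second = f.dout           # next word falls through
print(hex(first), hex(second))
```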
Timing Diagrams#
Phase Transition: BRAM → URAM#
Cycle 0 1 2 3 4 5 6 7 8 9
────┬────┬────┬────┬────┬────┬────┬────┬────┬────
exec_run ───┐ ┌─────────────────────────────────────────
───└────┘
bram_reading ────┐ ┌────────────────
────────└────────────────────────┘
uram_reading ─────────────────────────┐ ┌─────
─────────────────────────────└──────────────┘
exec_bram_phase1_done ────────────┐ ┌─────────────────
────────────└────┘
exec_uram_phase1_done ─────────────────────────┐ ┌─────
─────────────────────────└────┘
Phase IDLE BRAM BRAM BRAM BRAM URAM URAM URAM IDLE
Pointer FIFO Write (Spike-Gated)#
Cycle 0 1 2 3 4 5 6
────┬────┬────┬────┬────┬────┬────
bram_reading ───────────────────────────────────
───┐
└───────────────────────────────
exec_hbm_rvalidready ──┐ ┌───┐ ┌───┐ ┌
────└────┘ └────┘ └────┘
exec_bram_spiked 0x0005 0x0003 0x0000
(bits 0,2)(bits 0,1) (none)
ptr0_full ─────────────────────────────── (always room)
ptr0_wren ───┐ ┌───┐
───└─────────┘ └───────────── (spike bit 0)
ptr1_wren ────────────────┐
────────────────└─────────────── (spike bit 1)
ptr2_wren ───┐
───└─────────────────────────── (spike bit 2)
ptr0_din P0 P0'
↓ ↓
ptr0 FIFO [empty] → [P0] → [P0,P0']
Explanation:
Cycle 1: exec_bram_spiked=0x0005 (bits 0 and 2)
→ ptr0_wren=1, ptr2_wren=1
→ Write to ptr0 and ptr2 FIFOs
Cycle 3: exec_bram_spiked=0x0003 (bits 0 and 1)
→ ptr0_wren=1, ptr1_wren=1
→ Write to ptr0 (again) and ptr1 FIFOs
Cycle 5: exec_bram_spiked=0x0000 (no spikes)
→ All ptr*_wren=0
→ No writes (HBM data ignored)
Round-Robin Arbiter Operation#
Cycle 0 1 2 3 4 5 6 7 8
────┬────┬────┬────┬────┬────┬────┬────┬────
addr 0 1 2 3 4 5 6 7 8
ptr0_empty ────┐ ┌─────
────└─────────────────────────────┘
(has data cycles 0-7, empty at 8)
ptr1_empty ───────────────────────────────────────
(empty throughout)
ptr2_empty ──────────┐ ┌────────
──────────└───────────────────┘
(has data cycles 2-6)
ptrFIFO_full ───────────────────────────────────────
(never full)
ptr0_rden ───┐ ┌─────
───└─────────────────────────────┘
ptr2_rden ──────────┐
──────────└───────────────────────────
ptrFIFO_wren ───┐ ┌─────────────────────┐
───└───────┘ └─────
ptrFIFO_din D0 D2 X
Explanation:
Cycle 0 (addr=0): ptr0 not empty → read ptr0, write ptrFIFO
Cycle 1 (addr=1): ptr1 empty → skip
Cycle 2 (addr=2): ptr2 not empty → read ptr2, write ptrFIFO
Cycles 3-7 (addr=3-7): ptr3-ptr7 empty → skip
Cycle 8 (addr=8): Continue round-robin (wraps at 15)
FIFO Full Backpressure#
Cycle 0 1 2 3 4 5
────┬────┬────┬────┬────┬────
exec_hbm_rvalidready ┐ ┌───┐ ┌───┐
────└────┘ └────┘ └
exec_bram_spiked 0x0001 0x0001 0x0001
(bit 0)(bit 0)(bit 0)
ptr0_full ────────────┐ ┌────
────────────└─────────┘
(becomes full at cycle 2)
ptr0_wren ───┐ ┌───┐ ┌────
───└────┘ └─────────┘
ptr0_din D0 D1 (dropped) D2
ptr0 contents [..., D0] (full) → [..., D0] (D1 not stored) → [..., D2] (after one entry drains)
Explanation:
Cycle 1: Write D0 to ptr0 (wren=1)
Cycle 2: ptr0 becomes full
Cycle 3: D1 presented, but ptr0_full=1 → wren=0 → D1 LOST!
Cycle 4: ptr0 not full again
Cycle 5: D2 written (wren=1)
Result: D1 was lost due to FIFO full condition!
Prevention: Ensure arbiter drains FIFOs faster than they fill, or increase FIFO depth.
Memory and Resource Usage#
FIFO Depth Considerations#
Minimum FIFO Depth (to avoid loss):
Assume:
Max neurons per group: 8192 (131,072 / 16)
Worst case: All neurons in one group spike
Arbiter services each FIFO once per 16 cycles
Fill Rate (during bram_reading or uram_reading):
1 pointer per HBM read (exec_hbm_rvalidready)
Max rate: 1 per cycle (if HBM always ready)
Drain Rate:
1 pointer per 16 cycles (round-robin)
Net Accumulation:
Fill: +1 per cycle (worst case)
Drain: -1 per 16 cycles
Net: +15 pointers per 16 cycles
Depth Calculation:
Time to process 8192 neurons @ 225 MHz:
8192 / 16 (axons per HBM read) = 512 HBM reads
512 cycles @ 225 MHz = 2.27 µs
Pointers accumulated in one FIFO (worst case):
All 8192 neurons in one group spike
= 8192 / 16 = 512 pointers
(Each HBM read provides 1 pointer for that group)
Pointers drained during 512 cycles:
512 / 16 = 32 pointers
Net FIFO occupancy:
512 - 32 = 480 pointers
Required FIFO depth: ~512 (power of 2 for FPGA FIFOs)
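The depth estimate above can be cross-checked with a cycle-level model of a single FIFO. This Python sketch rests on stated assumptions: one write attempt per cycle, one read slot every 16 cycles, and flags sampled before each clock edge. The function names and the slot parameter are hypothetical.

```python
def closed_form_occupancy(n_writes, period=16):
    """Occupancy estimate used above: writes minus one drain per period."""
    return n_writes - n_writes // period


def simulate_single_fifo(depth, n_writes, period=16, slot=0):
    """Return (peak_occupancy, dropped_writes) for one FIFO written every cycle."""
    occupancy = 0
    dropped = 0
    peak = 0
    for cycle in range(n_writes):
        # Both decisions use the state before the clock edge.
        full = occupancy >= depth
        read = (cycle % period == slot) and occupancy > 0
        if full:
            dropped += 1          # write enable is masked: pointer lost
        else:
            occupancy += 1
        if read:
            occupancy -= 1
        peak = max(peak, occupancy)
    return peak, dropped


estimate = closed_form_occupancy(512)
peak_512, dropped_512 = simulate_single_fifo(depth=512, n_writes=512)
peak_64, dropped_64 = simulate_single_fifo(depth=64, n_writes=512)

print("closed form:", estimate)
print("depth 512 :", peak_512, "peak,", dropped_512, "dropped")
print("depth 64  :", peak_64, "peak,", dropped_64, "dropped")
```

The simulated peak differs from the closed form by one entry because the first read slot finds the FIFO empty. The shallow configuration shows how quickly pointers are lost once the FIFO saturates.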
Typical FIFO Configuration:
Depth: 512 or 1024 entries
Width: 32 bits
Type: Distributed RAM (for small depth) or Block RAM
Mode: FWFT (First-Word Fall-Through)
Resource Estimates#
Per Pointer FIFO (16 instances):
Depth 512 × 32b = 16 Kb = 0.89 BRAM18K (use 1 BRAM18K)
FWFT logic: ~50 LUTs, ~30 FFs
Total for 16 FIFOs:
BRAM18K: 16 (one per FIFO)
LUTs: ~800 (FIFOs) + ~200 (arbiter) = ~1000
FFs: ~500 (FIFOs) + ~50 (arbiter/control) = ~550
Controller Logic:
Demux: 16 × 32-bit slices (wiring only, ~0 LUTs)
Write Enable: 16 × (4-input AND + OR) = ~96 LUTs
Arbiter: 16-way mux + control = ~150 LUTs
Phase Control: ~20 LUTs, ~3 FFs
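The BRAM figure above follows from simple bit counting, sketched here in Python. The helper name fifo_bram18k is hypothetical, and the calculation is a capacity-only estimate: it assumes a BRAM18K holds 18,432 bits and ignores the aspect-ratio limits of real block RAM primitives.

```python
BRAM18K_BITS = 18 * 1024   # 18,432 bits per BRAM18K (capacity only)


def fifo_bram18k(depth, width):
    """Return (bits, fraction_of_one_bram18k, blocks_needed) for one FIFO."""
    bits = depth * width
    fraction = bits / BRAM18K_BITS
    blocks = -(-bits // BRAM18K_BITS)   # ceiling division
    return bits, fraction, blocks


bits, fraction, blocks = fifo_bram18k(depth=512, width=32)
total_blocks = 16 * blocks

print(f"{bits} bits = {fraction:.2f} BRAM18K -> {blocks} block per FIFO")
print(f"16 FIFOs -> {total_blocks} BRAM18K")
```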
Cross-References#
Upstream Modules#
external_events_processor.v (external_events_processor.md):
Provides exec_bram_spiked[15:0] (external spike mask)
Asserts exec_bram_phase1_done to signal phase transition
internal_events_processor.v (internal_events_processor.md):
Provides exec_uram_spiked[15:0] (internal spike mask)
Asserts exec_uram_phase1_done to signal phase 1 complete
hbm_processor.v (hbm_processor.md):
Provides exec_hbm_rdata[511:0] (pointer data from HBM)
Provides exec_hbm_rvalidready (data valid signal)
Receives ptrFIFO_din, ptrFIFO_wren (aggregated pointers for Phase 2)
Downstream Modules#
hbm_processor.v (hbm_processor.md):
Consumes pointers from ptrFIFO
Uses pointers to fetch synaptic weights during Phase 2
Sends fetched synapses to spike FIFOs
Peer Modules#
spike_fifo_controller.v (spike_fifo_controller.md):
Similar architecture (demux + arbiter)
Handles synaptic weight data instead of pointers
Works in Phase 2 alongside this module’s pointer drain
Common Issues and Debugging#
Issue 1: Pointers Lost (FIFO Overflow)#
Symptoms:
Neurons don’t receive expected synaptic updates
FIFO full flags assert frequently
Spike counts don’t match expected connectivity
Root Cause:
Arbiter can’t drain FIFOs fast enough
FIFO depth too small for burst activity
Debug:
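// Add probes for FIFO occupancy
(* mark_debug = "true" *) wire [9:0] ptr0_count; // Assuming 512-deep FIFO
(* mark_debug = "true" *) reg ptr0_overflow; // Sticky flag, so it must be a reg
// A write request blocked by full. ptr0_wren is already gated by !ptr0_full,
// so (ptr0_full & ptr0_wren) can never be true; test the ungated request instead.
wire ptr0_wr_req = exec_hbm_rvalidready &
    ((bram_reading & exec_bram_spiked[0]) |
     (uram_reading & exec_uram_spiked[0]));
// Monitor overflow events
always @(posedge clk) begin
    if (!resetn)
        ptr0_overflow <= 1'b0;
    else if (ptr0_full & ptr0_wr_req)
        ptr0_overflow <= 1'b1; // Overflow detected!
end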
Solution:
Increase FIFO depth (512 → 1024 or 2048)
Optimize arbiter (see Enhancement #1 below)
Add priority arbitration for fuller FIFOs
Issue 2: Unfair Arbitration (Starvation)#
Symptoms:
Some neuron groups process much slower than others
Uneven latency across different spike patterns
Root Cause:
Round-robin gives equal slots, but some FIFOs have more data
FIFO[0] with 100 entries gets same service as FIFO[15] with 1 entry
Debug:
// Track arbitration wins per FIFO
(* mark_debug = "true" *) reg [15:0] arb_wins [15:0];
always @(posedge clk) begin
if (ptr0_rden) arb_wins[0] <= arb_wins[0] + 1;
if (ptr1_rden) arb_wins[1] <= arb_wins[1] + 1;
// ... (repeat for all FIFOs)
end
Solution:
Implement weighted round-robin (award more slots to fuller FIFOs)
Use priority encoder favoring non-empty FIFOs
Skip empty FIFOs faster (see Enhancement #2)
Issue 3: Phase Transition Glitch#
Symptoms:
Pointers written with wrong spike mask during phase boundary
Corruption at transition from BRAM to URAM reading
Root Cause:
Race condition between exec_bram_phase1_done and last HBM read
Write enable uses old phase flags
Debug:
// Monitor phase transition timing
(* mark_debug = "true" *) reg phase_transition;
always @(posedge clk) begin
if (exec_bram_phase1_done & !uram_reading)
phase_transition <= 1'b1;
else
phase_transition <= 1'b0;
end
// Check if any writes occur during transition
assert property (@(posedge clk)
phase_transition |-> (|{ptr0_wren, ptr1_wren, ..., ptr15_wren} == 0)
);
Solution:
Pipeline phase flags by one cycle
Add guard time between phases (no writes for 1 cycle)
Use registered versions of bram_reading/uram_reading for write enables
Issue 4: HBM FIFO Not Advancing#
Symptoms:
Same HBM data appears multiple times
Pointer FIFOs fill with duplicate entries
Root Cause:
hbm2pfc_rden not properly connected or not asserting
FWFT mode misconfigured on HBM FIFO
Debug:
// Verify read enable toggles
(* mark_debug = "true" *) wire hbm2pfc_rden;
(* mark_debug = "true" *) wire exec_hbm_rvalidready;
(* mark_debug = "true" *) wire [511:0] exec_hbm_rdata;
// Check for stuck data
reg [511:0] prev_hbm_rdata;
always @(posedge clk) begin
if (exec_hbm_rvalidready)
prev_hbm_rdata <= exec_hbm_rdata;
end
// Assert: consecutive reads should have different data (usually)
// (unless network connectivity happens to repeat, rare)
Solution:
Verify FWFT mode enabled on HBM FIFO IP
Check that hbm2pfc_rden is wired to FIFO’s read enable
Confirm FIFO has data (not empty)
Issue 5: Address Counter Wrapping Incorrectly#
Symptoms:
Some FIFOs never serviced
Arbiter stuck on certain addresses
Root Cause:
4-bit counter not wrapping correctly (should wrap 15→0)
Synthesis optimization error
Debug:
// Monitor counter progression
(* mark_debug = "true" *) reg [3:0] addr;
(* mark_debug = "true" *) reg [3:0] prev_addr;
always @(posedge clk) begin
prev_addr <= addr;
// Check for proper increment (with wrap)
assert ((addr == (prev_addr + 1'b1)) || (!resetn));
end
Solution:
Explicitly handle wrap:
always @(posedge clk) begin
if (~resetn)
addr <= 4'd0;
else if (addr == 4'd15)
addr <= 4'd0; // Explicit wrap
else
addr <= addr + 1'b1;
end
Performance Characteristics#
Throughput Analysis#
HBM Read Bandwidth:
Peak: 512 bits per cycle @ 225 MHz = 14.4 GB/s
Typical: Limited by HBM latency and contention (~50% efficiency) = 7.2 GB/s
Pointers per Second: (7.2 GB/s) / (4 bytes per pointer) = 1.8 billion pointers/s
Arbiter Throughput:
Max: 1 pointer per cycle @ 225 MHz = 225 million pointers/s
Typical (50% FIFO occupancy): ~112 million pointers/s
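Bottleneck: The arbiter drain is the limiting stage under dense activity. The HBM side can deliver up to 1.8 billion pointers/s before spike gating, while the arbiter forwards at most 225 million pointers/s, so the pointer FIFOs must absorb the difference during Phase 1. With sparse activity, spike gating discards most pointers and the arbiter keeps up.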
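The throughput figures above can be reproduced with a few lines of arithmetic. The Python sketch below uses only numbers quoted in this section (225 MHz clock, 512-bit bus, 32-bit pointers, ~50% HBM efficiency); the variable names are illustrative.

```python
CLK_HZ = 225e6
HBM_BUS_BITS = 512
POINTER_BYTES = 4
HBM_EFFICIENCY = 0.5          # the "~50% efficiency" figure quoted above

peak_bytes_per_s = HBM_BUS_BITS / 8 * CLK_HZ
typical_bytes_per_s = peak_bytes_per_s * HBM_EFFICIENCY
hbm_pointers_per_s = typical_bytes_per_s / POINTER_BYTES

arbiter_pointers_per_s = CLK_HZ           # at most one pointer per cycle
fill_to_drain_ratio = hbm_pointers_per_s / arbiter_pointers_per_s

print(f"HBM peak:     {peak_bytes_per_s / 1e9:.1f} GB/s")
print(f"HBM typical:  {typical_bytes_per_s / 1e9:.1f} GB/s")
print(f"HBM pointers: {hbm_pointers_per_s / 1e9:.1f} G/s (before spike gating)")
print(f"Arbiter max:  {arbiter_pointers_per_s / 1e6:.0f} M/s")
print(f"Fill/drain:   {fill_to_drain_ratio:.1f}x")
```

The ratio is computed before spike gating, so it is an upper bound on how much faster the FIFOs can fill than drain.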
Phase 1 Duration (example: 131,072 neurons):
External Events:
Input axons: 16,384 (assuming 16 per HBM read)
HBM reads: 16,384 / 16 = 1,024 reads
Time @ 225 MHz: 1,024 cycles = 4.55 µs
Internal Events:
URAM neurons: 131,072
URAM rows: 131,072 / 2 = 65,536 (2 neurons per row)
URAM banks: 16
Rows per bank: 65,536 / 16 = 4,096
HBM reads per bank: 4,096 / 16 = 256 (if 16 neurons spike per read)
Total HBM reads: ~16,384 (worst case, all banks active)
Time @ 225 MHz: 16,384 cycles = 72.8 µs
Total Phase 1: ~77 µs
Phase 2 Duration (pointer drain):
Assume 10% neurons spike (13,107 neurons):
Pointers to process: 13,107
Arbiter rate: 1 per 16 cycles per FIFO (worst case: all pointers queued in a single FIFO)
Effective drain: 225 MHz / 16 = 14.06 million pointers/s
Time: 13,107 pointers / 14.06M/s = 0.93 ms
But Phase 2 overlaps with next Phase 1!
Phase 1 and 2 pipeline, so overall latency = max(Phase1, Phase2)
Typical: Phase 2 >> Phase 1, so Phase 2 dominates
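The Phase 2 estimate above depends strongly on how the pointers are spread across the 16 FIFOs. The Python sketch below compares the single-FIFO worst case with an evenly balanced load. It is an approximation under stated assumptions: all pointers are already queued, ptrFIFO never stalls, and each FIFO gets exactly one slot per 16 cycles. The function name drain_time_s and the balanced split are hypothetical.

```python
CLK_HZ = 225e6


def drain_time_s(pointers_per_fifo, clk_hz=CLK_HZ, period=16):
    """Approximate time until the fullest FIFO is empty.

    Each FIFO receives one read slot per 'period' cycles, so the fullest
    FIFO determines the total drain time.
    """
    cycles = max(pointers_per_fifo) * period
    return cycles / clk_hz


total = 13_107                                  # 10% of 131,072 neurons

worst_case = [total] + [0] * 15                 # everything in one FIFO
balanced = [820] * 3 + [819] * 13               # same total, spread evenly
assert sum(balanced) == total

t_worst = drain_time_s(worst_case)
t_balanced = drain_time_s(balanced)

print(f"worst case: {t_worst * 1e3:.2f} ms")
print(f"balanced:   {t_balanced * 1e6:.1f} us")
```

The worst case reproduces the 0.93 ms figure, while a balanced load drains roughly 16 times faster because every arbiter slot is used.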
Latency (pointer from HBM to ptrFIFO):
Best Case (FIFO empty, arbiter on correct address):
FWFT mode: 0 cycles (immediate)
Write to ptrFIFO: 1 cycle
Total: 1 cycle @ 225 MHz = 4.4 ns
Worst Case (FIFO full, arbiter just passed):
Wait for FIFO space: N cycles (depends on drain rate)
Wait for arbiter: 15 cycles (worst case, just missed)
Total: ~16 cycles @ 225 MHz = 71 ns (ignoring FIFO drain time)
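The arbiter wait in the latency figures above is a modulo-16 distance between the counter and the FIFO index. The Python sketch below is illustrative only: the function names are hypothetical, and FIFO drain time is ignored as in the estimate above.

```python
CLK_HZ = 225e6


def arbiter_wait_cycles(addr_now, fifo_index):
    """Cycles until the round-robin counter reaches fifo_index."""
    return (fifo_index - addr_now) % 16


def latency_ns(wait_cycles, clk_hz=CLK_HZ):
    """Wait for the slot plus one cycle to write into ptrFIFO."""
    return (wait_cycles + 1) * 1e9 / clk_hz


best = arbiter_wait_cycles(addr_now=3, fifo_index=3)    # counter is on the FIFO
worst = arbiter_wait_cycles(addr_now=4, fifo_index=3)   # counter just passed it

print(f"best:  {best} wait cycles, {latency_ns(best):.1f} ns")
print(f"worst: {worst} wait cycles, {latency_ns(worst):.1f} ns")
```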
Resource Utilization Summary#
| Resource | Usage | Notes |
|---|---|---|
| LUTs | ~1,200 | Demux, arbiter, control, FIFO logic |
| FFs | ~550 | Phase control, arbiter, FIFO pointers |
| BRAM18K | 16 | One per pointer FIFO (512×32b each) |
| DSPs | 0 | No arithmetic operations |
Percentage of Typical FPGA (e.g., Xilinx UltraScale+ VU9P):
LUTs: 1,200 / 1,182,240 = 0.1%
FFs: 550 / 2,364,480 = 0.02%
BRAM18K: 16 / 4,320 = 0.37% (the VU9P has 2,160 36 Kb block RAMs, each usable as two BRAM18K)
Conclusion: Very lightweight module, dominated by FIFO storage.
Safety and Edge Cases#
Edge Case 1: All Neurons Spike Simultaneously#
Scenario: Every neuron in every group spikes in same cycle.
Behavior:
exec_bram_spiked = 16'hFFFF (all bits set)
All 16 pointer FIFOs receive write:
ptr0_wren = 1, ptr1_wren = 1, ..., ptr15_wren = 1
Each FIFO receives 1 pointer per HBM read.
Safety:
✅ All writes occur in parallel (16 separate FIFOs)
✅ No conflicts (each FIFO independent)
⚠️ FIFO depth must handle burst (512+ pointers)
⚠️ Arbiter drain rate becomes critical (1 per 16 cycles)
Result: System handles correctly if FIFO depth adequate.
Edge Case 2: No Neurons Spike (Quiescent Network)#
Scenario: No spikes in entire time step.
Behavior:
exec_bram_spiked = 16'h0000 (all bits clear)
exec_uram_spiked = 16'h0000
All ptr*_wren = 0 (no writes to any FIFO)
HBM reads still occur, but data discarded.
Safety:
✅ No FIFO writes (correct behavior)
✅ Arbiter continues cycling (no-op, all FIFOs empty)
✅ Phase transitions occur normally
⚠️ HBM bandwidth wasted (reading data that’s discarded)
Optimization Opportunity: Gate HBM reads based on spike mask (see Enhancements).
Edge Case 3: Single Bit Spike (Minimal Activity)#
Scenario: Only one neuron in one group spikes.
Behavior:
exec_bram_spiked = 16'h0001 (only bit 0 set)
Only ptr0_wren = 1 (one FIFO active)
Other 15 FIFOs idle.
Safety:
✅ Correct - only relevant FIFO updated
✅ Arbiter cycles through all, only reads from ptr0
✅ Minimal resource usage
Result: Efficient sparse event handling.
Edge Case 4: ptrFIFO Full (Downstream Backpressure)#
Scenario: HBM processor can’t consume pointers fast enough.
Behavior:
ptrFIFO_full = 1
Arbiter logic:
if (~ptr[addr]_empty & ~ptrFIFO_full) → Condition false!
ptr[addr]_rden = 0 (no read)
ptrFIFO_wren = 0 (no write)
Safety:
✅ Arbiter stalls (doesn’t read from any pointer FIFO)
✅ Upstream pointer FIFOs continue to fill
⚠️ If pointer FIFOs also fill, writes are lost (see Issue 1)
Required: System must ensure ptrFIFO drains faster than it fills.
Safety Check: Write Enable Conflicts#
Assertion: Verify that at most one arbiter read enable is asserted per cycle (which also guarantees a single ptrFIFO write source)
wire [15:0] rdens = {ptr15_rden, ptr14_rden, ..., ptr0_rden};
property one_hot_rdens;
@(posedge clk) disable iff (~resetn)
$onehot0(rdens); // At most one bit set
endproperty
assert_rdens: assert property (one_hot_rdens);
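The same property can be checked in a software model before running simulation. The Python sketch below implements the semantics of $onehot0 (at most one bit set) and applies it to a model of the arbiter read enables. The names onehot0 and rden_vector are hypothetical, and the 16 signals are packed into an integer with bit i for ptr i.

```python
def onehot0(mask):
    """True if at most one bit of mask is set (same meaning as $onehot0)."""
    return (mask & (mask - 1)) == 0


def rden_vector(addr, empty_mask, ptrfifo_full):
    """Model of the arbiter case statement: only FIFO 'addr' may be read."""
    selected_empty = (empty_mask >> addr) & 1
    if selected_empty or ptrfifo_full:
        return 0
    return 1 << addr


violations = [
    (addr, empty_mask, full)
    for addr in range(16)
    for empty_mask in (0x0000, 0xFFFF, 0xA5A5, 0x0001)
    for full in (False, True)
    if not onehot0(rden_vector(addr, empty_mask, full))
]

print("violations:", len(violations))
```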
Safety Check: Phase Mutual Exclusion#
Assertion: Ensure bram_reading and uram_reading never both asserted
property phases_mutex;
@(posedge clk) disable iff (~resetn)
!(bram_reading & uram_reading);
endproperty
assert_phases: assert property (phases_mutex);
Future Enhancement Opportunities#
1. Priority Arbiter#
Replace round-robin with priority-based arbitration:
// Calculate occupancy for each FIFO (requires rd_data_count from FIFO IP)
wire [9:0] ptr0_count, ptr1_count, ..., ptr15_count;
// Pick the lowest-numbered FIFO whose count exceeds the threshold
// (fixed-priority encoder; this is not a true fullest-first search)
reg [3:0] priority_addr;
always @(*) begin
if (ptr0_count > threshold) priority_addr = 4'd0;
else if (ptr1_count > threshold) priority_addr = 4'd1;
// ... (priority order 0→1→2→...→15)
else priority_addr = addr; // Fall back to round-robin
end
// Use priority_addr instead of addr in arbiter mux
Benefit: Reduces the risk of FIFO overflow by draining FIFOs above the threshold first.
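The selection policy is worth prototyping in software before committing to RTL, because a fixed-priority encoder and a fullest-first search behave differently. The Python sketch below compares the two; both function names, the threshold values, and the sample counts are hypothetical.

```python
def first_over_threshold(counts, threshold, rr_addr):
    """Fixed-priority policy: lowest-numbered FIFO above the threshold."""
    for i, count in enumerate(counts):
        if count > threshold:
            return i
    return rr_addr                      # fall back to round-robin


def fullest_over_threshold(counts, threshold, rr_addr):
    """Fullest-first policy: the FIFO with the highest count, if above threshold."""
    fullest = max(range(len(counts)), key=lambda i: counts[i])
    if counts[fullest] > threshold:
        return fullest
    return rr_addr                      # fall back to round-robin


counts = [0] * 16
counts[2] = 300
counts[9] = 400

print(first_over_threshold(counts, threshold=256, rr_addr=4))
print(fullest_over_threshold(counts, threshold=256, rr_addr=4))
print(fullest_over_threshold(counts, threshold=450, rr_addr=4))
```

With both ptr2 and ptr9 above the threshold, the fixed-priority policy keeps choosing ptr2 even though ptr9 is closer to overflowing, which is the trade-off between encoder simplicity and a 16-way comparison.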
2. Skip-Empty Optimization#
Current arbiter wastes cycles checking empty FIFOs:
// Add empty flag aggregation
wire [15:0] ptrs_empty = {ptr15_empty, ..., ptr0_empty};
// Fast-forward to next non-empty FIFO
reg [3:0] next_addr;
always @(*) begin
next_addr = addr;
for (int i = 1; i <= 16; i++) begin
if (!ptrs_empty[(addr + i) & 4'hF]) begin
next_addr = (addr + i) & 4'hF;
break;
end
end
end
always @(posedge clk) begin
if (~resetn)
addr <= 4'd0;
else
addr <= next_addr; // Jump to next non-empty
end
Benefit: Reduces latency by ~50% when many FIFOs empty.
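The fast-forward search can be validated in software first. The Python sketch below mirrors the loop in the SystemVerilog above: it scans the 16 addresses after addr in cyclic order and returns the first non-empty one. The function name next_nonempty is hypothetical, and bit i of empty_mask stands for ptr i being empty.

```python
def next_nonempty(addr, empty_mask):
    """Return the next address after 'addr' (cyclic) whose FIFO is not empty.

    If every FIFO is empty, the address is left unchanged.
    """
    for i in range(1, 17):
        candidate = (addr + i) & 0xF
        if not ((empty_mask >> candidate) & 1):
            return candidate
    return addr


# Only ptr0 and ptr5 hold data.
empty_mask = 0xFFFF & ~((1 << 0) | (1 << 5))

print(next_nonempty(0, empty_mask))   # jumps from 0 to 5
print(next_nonempty(5, empty_mask))   # wraps from 5 back to 0
```

The loop runs to i = 16 so that a FIFO which is the only non-empty one is selected again on the following cycle.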
3. Gated HBM Reads#
Don’t read HBM when no spikes:
// Compute OR of spike mask
wire any_spikes = |(exec_bram_spiked | exec_uram_spiked);
// Gate HBM read enable
assign hbm2pfc_rden = exec_hbm_rvalidready & any_spikes;
Benefit: Saves HBM bandwidth during quiescent periods.
4. Configurable FIFO Count#
Parameterize number of FIFOs:
module pointer_fifo_controller #(
parameter NUM_FIFOS = 16,
parameter FIFO_DEPTH = 512
)(
input [NUM_FIFOS-1:0] exec_bram_spiked,
// ... (generate FIFO instances and arbiter)
);
// Use generate blocks for FIFO instantiation
genvar i;
generate
for (i = 0; i < NUM_FIFOS; i++) begin : fifo_gen
fifo_32x512 ptr_fifo (
.din(exec_hbm_rdata[(i+1)*32-1 : i*32]),
.wr_en(ptr_wren[i]),
// ...
);
end
endgenerate
Benefit: Flexible configuration for different neuron group sizes.
5. Multi-Port Arbiter#
Read from multiple FIFOs per cycle:
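// Dual-port arbiter (2 pointers per cycle)
reg [3:0] addr_a, addr_b;
always @(posedge clk) begin
    if (~resetn) begin
        addr_a <= 4'd0; // Even addresses: 0, 2, ..., 14
        addr_b <= 4'd1; // Odd addresses: 1, 3, ..., 15
    end else begin
        addr_a <= addr_a + 4'd2;
        addr_b <= addr_b + 4'd2;
    end
end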
// Mux for addr_a and addr_b, write to ptrFIFO twice per cycle
Benefit: 2× drain rate (each FIFO is serviced every 8 cycles instead of 16), which lowers worst-case FIFO occupancy.
Trade-off: Requires wider ptrFIFO or double-pumped downstream.
6. Adaptive FIFO Depth#
Dynamically adjust FIFO depth based on activity:
// Use distributed RAM for shallow portion, spill to BRAM when full
// Requires custom FIFO controller with dual-tier storage
Benefit: Saves BRAM when network activity is sparse.
7. Burst Write to ptrFIFO#
Instead of one pointer per cycle, burst multiple:
// If ptrFIFO has depth, write up to 4 pointers per cycle
// Requires ptrFIFO to accept burst writes (wider interface)
assign ptrFIFO_din[127:0] = {ptr[addr+3]_dout, ptr[addr+2]_dout,
ptr[addr+1]_dout, ptr[addr]_dout};
assign ptrFIFO_wren = burst_valid;
Benefit: 4× drain rate (if downstream supports).
Key Terms and Definitions#
| Term | Definition |
|---|---|
| Pointer FIFO | Buffer storing 32-bit pointer records (length + address) for synaptic lists |
| Round-Robin | Arbitration scheme giving equal service time to each FIFO in cyclic order |
| Spike-Gated | Write enable conditional on neuron spike (sparse event handling) |
| Demultiplexing | Splitting wide HBM data (512b) into narrow pointer streams (16×32b) |
| FWFT (First-Word Fall-Through) | FIFO mode where data appears immediately on dout when empty deasserts |
| Phase 1a | External event processing (BRAM reading, external axon spikes) |
| Phase 1b | Internal event processing (URAM reading, neuron-to-neuron spikes) |
| Phase 2 | Synaptic weight fetch (pointer drain, HBM synaptic reads) |
| Neuron Group | One of 16 sets of neurons (up to 8,192 each in the 131,072-neuron configuration), each mapped to one pointer FIFO |
| Backpressure | Flow control mechanism where full FIFO blocks upstream writes |
| Arbiter | Logic deciding which FIFO gets access to shared resource (ptrFIFO) |
| ptrFIFO | Aggregated pointer FIFO feeding HBM processor for Phase 2 |
| Starvation | Condition where some FIFOs never serviced (not possible in round-robin) |
| Overflow | Condition where pointer write lost due to FIFO full |
| Pointer Record | 32-bit datum: [31:23]=length (9b), [22:0]=start address (23b) |
| HBM rvalidready | Signal indicating HBM read data valid and consumer ready |
| exec_run | Control pulse starting new time step, initiating Phase 1a |
Conclusion#
The Pointer FIFO Controller is a well-designed datapath component that efficiently manages sparse neural spike events through:
Parallel Buffering: 16 independent FIFOs decouple HBM read from pointer consumption
Spike-Gated Writes: Only buffer pointers for neurons that actually spiked (sparse efficiency)
Fair Arbitration: Round-robin ensures no FIFO monopolizes downstream bandwidth
Two-Phase Coordination: Seamlessly handles both external and internal event sources
Design Strengths:
Simple, proven architecture (demux + FIFOs + arbiter)
Minimal logic (mostly wiring and control)
FWFT mode reduces latency
Phase control cleanly separates external and internal events
Potential Improvements:
Priority arbitration to prevent overflow
Skip-empty optimization to reduce latency
Gated HBM reads to save bandwidth
Multi-port arbiter for higher drain rate
Critical Parameters:
FIFO depth must accommodate worst-case burst (512-1024 entries)
Arbiter must drain faster than fill rate (or FIFOs overflow)
Round-robin period (16 cycles) limits drain rate
For complete understanding, see cross-referenced modules: external_events_processor.md, internal_events_processor.md, hbm_processor.md, and spike_fifo_controller.md.