3 Verilog Files Review#
Overview#
The Verilog File directory contains the FPGA hardware implementation for a neuromorphic computing system designed to simulate large-scale spiking neural networks. This implementation targets Xilinx XCVU37p FPGAs with High Bandwidth Memory (HBM) and is part of the CRI (Cognitive Research Infrastructure) neuromorphic computing cluster at San Diego Supercomputer Center.
System Scale#
40 FPGA boards across 5 compute servers
32 cores per FPGA (each core can process 128K neurons)
4M neurons per FPGA, 160M neurons total system capacity
400+ GBps HBM bandwidth per board
Key Features#
Supports 16 neuron groups per core (8192 neurons each)
Multi-clock domain design: 225 MHz (aclk) and 450 MHz (aclk450)
Three-tier memory hierarchy:
BRAM for axon/external event data
URAM for neuron state data
HBM for synaptic connectivity data
PCIe Gen3 x16 interface for host communication via DMA
Two-phase execution model: External events (Phase 1) → Internal updates (Phase 2)
High-Level Architecture#
┌─────────────────────────────────────────────┐
│ Host Computer (Python) │
│ hs_bridge: Network definition & control │
└──────────────────┬──────────────────────────┘
│
PCIe Gen3 x16
│
┌──────────────────▼──────────────────────────┐
│ FPGA (Xilinx XCVU37p) │
│ │
│ ┌────────────────────────────────────────┐ │
│ │ pcie2fifos (AXI4 to FIFO) │ │
│ │ 512-bit data path @ 225 MHz │ │
│ └──────┬─────────────────────┬───────────┘ │
│ │ │ │
│ Input FIFO Output FIFO │
│ │ │ │
│ ┌──────▼─────────────────────▼───────────┐ │
│ │ command_interpreter │ │
│ │ Parses commands, routes to modules │ │
│ └──┬─────────┬──────────┬─────────┬──────┘ │
│ │ │ │ │ │
│ │ │ │ │ │
┌───────────┼─────▼─────────┴──────────┴─────────┴─────┐ │
│ BRAM │ input_data_handler (BRAM Arbiter) │ │
│ 2^15 x │ - Arbitrates CI vs EEP access │ │
│ 256-bit │ - 3-cycle read latency │ │
└───────────┼──────┬──────────────────────────┬─────────┘ │
│ │ │ │
│ ┌───▼──────────────┐ ┌───────▼──────────┐ │
│ │ external_events_ │ │ spike_fifo_ │ │
│ │ processor │ │ controller │ │
│ │ - Reads axon/ │ │ - Collects spikes│ │
│ │ BRAM data │ │ from 8 FIFOs │ │
│ │ - Generates HBM │ │ - Round-robin │ │
│ │ read requests │ │ arbitration │ │
│ └───┬──────────────┘ └──────────────────┘ │
│ │ │
│ ┌───▼────────────────────────────────────┐ │
│ │ hbm_processor │ │
│ │ - Manages HBM access (AXI4) │ │
│ │ - Reads synaptic pointers │ │
│ │ - Prefetches synapse data │ │
│ │ - Handles pointer chains │ │
│ └───┬────────────────────────────────┬───┘ │
│ │ │ │
┌───────────┼──────▼────────────────────┐ ┌──────▼─────┐ │
│ HBM │ pointer_fifo_controller │ │ hbm_ │ │
│ 33-bit │ - 16 pointer FIFOs │ │ register_ │ │
│ address │ - Round-robin dispatch │ │ slice │ │
│ 256-bit │ - Manages BRAM/URAM │ │ (timing) │ │
│ data │ spike flags │ └────────────┘ │
│ 400+ └───┬───────────────────────┘ │
│ GBps │ │
└───────────┼──────────▼────────────────────────────────┐ │
│ internal_events_processor │ │
│ - Processes 16 URAM banks │ │
│ - Updates neuron states │ │
│ - Detects spikes │ │
│ - Read-modify-write hazard resolution │ │
┌───────────┼──────────────────────────────────────────┐│ │
│ URAM │ 16 banks x 4096 rows x 72 bits ││ │
│ 12-bit │ (2 neurons per 72-bit word) ││ │
│ address │ @450 MHz for high throughput ││ │
│ 72-bit └──────────┬────────────────────────────────┘│ │
│ data │ │ │
└──────────────────────┼──────────────────────────────────┘ │
│ │ │
│ Spike outputs → spike_fifo_controller │
│ (17-bit neuron addresses) │
└────────────────────────────────────────────────┘
Directory Structure#
CRI_proj/ - Single-Core Verilog Implementation#
Original single-core implementation using Verilog. Contains the core processing modules:
Module |
Purpose |
Key Features |
|---|---|---|
command_interpreter.v |
PCIe command interface |
Parses commands, routes to processors |
pcie2fifos.v |
PCIe to FIFO bridge |
AXI4 512-bit interface, handles DMA |
input_data_handler.v |
BRAM arbiter |
Arbitrates between command interpreter and external events processor |
external_events_processor.v |
Axon event processing |
Reads BRAM, generates HBM requests for synapses |
hbm_processor.v |
HBM memory controller |
Manages synapse data access, pointer chaining |
pointer_fifo_controller.v |
Pointer distribution |
Distributes pointers to 16 neuron groups |
internal_events_processor.v |
Neuron state updates |
Processes 16 URAM banks, updates neurons, detects spikes |
spike_fifo_controller.v |
Spike collection |
Collects spikes from 8 FIFOs, round-robin arbitration |
Variants:
external_events_processor_simple.v- Simplified version with fixed pipeline depthexternal_events_processor_v2.v- Enhanced version with improved timing
N_cores/ - Multi-Core SystemVerilog Implementation#
Multi-core implementation using SystemVerilog with improved modularity and timing closure:
Module |
Purpose |
Key Features |
|---|---|---|
single_core.sv |
Top-level core integration |
Instantiates all processors, 16 URAMs, FIFOs |
core_wrapper.sv |
Core wrapper with reset sync |
Adds HBM register slice for timing closure |
types.sv |
Interface definitions |
AXI4, AXILite, AXIStream, FIFO, RAM interfaces |
Xilinx_IP_wrappers.sv |
Xilinx IP macros |
FIFO and URAM wrapper macros |
FIFO_AXI_Converters.sv |
Protocol converters |
FIFO ↔ AXI Stream conversion modules |
reset_synchronizer.sv |
Reset synchronization |
2-FF synchronizer for clock domain crossing |
Memory Organization#
Memory Hierarchy#
┌─────────────────────────────────────────────────────────────┐
│ Memory Type │ Size/Dimensions │ Purpose │
├───────────────┼───────────────────┼────────────────────────┤
│ BRAM │ 2^15 x 256-bit │ Axon/External Events │
│ (Block RAM) │ = 1 MB │ - Row addresses │
│ │ │ - Spike masks │
├───────────────┼───────────────────┼────────────────────────┤
│ URAM │ 16 banks │ Neuron States │
│ (UltraRAM) │ 4096 x 72-bit │ - Membrane potential │
│ │ = 294 Kb/bank │ - Threshold │
│ │ @450 MHz │ - Refractory state │
│ │ │ (2 neurons/word) │
├───────────────┼───────────────────┼────────────────────────┤
│ HBM │ 33-bit address │ Synaptic Connectivity │
│ (High │ 256-bit data │ - Pointer chains │
│ Bandwidth │ 400+ GBps │ - Synapse weights │
│ Memory) │ 8 GB total │ - Target neuron IDs │
└───────────────┴───────────────────┴────────────────────────┘
Address Space Layout#
BRAM Address (15-bit):
[14:0] Row address → 32,768 rows
Each row: 256 bits = 16 x 16-bit masks (one per neuron group)
URAM Address (12-bit per bank):
[11:0] Row address → 4,096 rows per bank
16 banks total
72 bits per row = 2 neurons x 36 bits each
Total: 8,192 neurons per bank x 16 banks = 131,072 neurons
HBM Address (33-bit):
[32:0] Byte address → 8 GB addressable space
Stores synapse data in pointer-chain format:
[31:0] Next pointer (32-bit HBM address)
[47:32] Synapse weight (16-bit)
[63:48] Target neuron ID (16-bit)
Data Flow and Execution Phases#
Phase 1: External Event Processing#
1. Host → PCIe → pcie2fifos → Input FIFO
2. command_interpreter extracts axon spike events
3. external_events_processor:
- Reads BRAM (axon data) via input_data_handler
- Generates HBM read requests for synapse pointers
4. hbm_processor:
- Fetches synapse pointer chains from HBM
- Prefetches synapse data
5. pointer_fifo_controller:
- Distributes pointers to 16 neuron groups
- Sets spike flags in BRAM/URAM
Phase 2: Internal Event Processing#
1. internal_events_processor:
- Reads 16 URAM banks (neuron states) @450 MHz
- Applies synaptic inputs from pointer FIFOs
- Updates membrane potentials
- Detects threshold crossings (spikes)
- Writes back updated states
2. Spike outputs → spike_fifo_controller
3. spike_fifo_controller:
- Collects from 8 spike FIFOs (round-robin)
- Sends spike events back to external_events_processor
OR sends to host via Output FIFO
4. Output FIFO → pcie2fifos → PCIe → Host
Clock Domains#
The design uses two primary clock domains:
Clock Domain |
Frequency |
Usage |
|---|---|---|
aclk |
225 MHz |
PCIe, command interpreter, BRAM, most processing logic |
aclk450 |
450 MHz |
URAM access for high-throughput neuron updates |
Clock Domain Crossing:
Reset signals synchronized with
reset_synchronizer.sv(2-FF synchronizer)Data crossing handled by async FIFOs with independent read/write clocks
Critical path timing improved with
hbm_register_slicein AXI4 HBM interface
Module Hierarchy#
single_core (top-level)
├── core_wrapper
│ ├── reset_synchronizer (aclk domain)
│ ├── reset_synchronizer (aclk450 domain)
│ ├── hbm_register_slice (AXI4 pipeline)
│ └── core
│ ├── pcie2fifos (PCIe ↔ FIFO bridge)
│ ├── command_interpreter
│ ├── input_data_handler (BRAM arbiter)
│ │ └── BRAM (2^15 x 256)
│ ├── external_events_processor
│ ├── hbm_processor
│ │ └── AXI4 Master (HBM interface)
│ ├── pointer_fifo_controller
│ │ └── 16 x pointer_FIFO (32-bit)
│ ├── internal_events_processor
│ │ └── 16 x URAM (4096 x 72-bit) @450MHz
│ └── spike_fifo_controller
│ └── 8 x spike_FIFO (17-bit)
Key Terms and Definitions#
Term |
Definition |
|---|---|
Axon |
Presynaptic neuron output; external events stored in BRAM |
BRAM |
Block RAM - On-chip memory for axon/external event data |
HBM |
High Bandwidth Memory - Off-chip DRAM for synaptic connectivity |
URAM |
UltraRAM - High-density on-chip memory for neuron states |
Neuron Group |
8,192 neurons sharing a URAM bank and processing pipeline |
Pointer Chain |
Linked-list structure in HBM storing synapses for each axon |
Spike |
Action potential - neuron firing event when threshold is crossed |
Synapse |
Connection between neurons with associated weight |
AXI4 |
ARM Advanced eXtensible Interface - High-performance memory protocol |
DMA |
Direct Memory Access - Host-FPGA data transfer without CPU involvement |
FIFO |
First-In-First-Out buffer for asynchronous data transfer |
Round-Robin |
Fair scheduling algorithm cycling through multiple requesters |
Register Slice |
Pipeline stage for timing closure in high-speed interfaces |
FWFT |
First-Word Fall-Through - FIFO mode with zero-latency reads |
Interface Specifications#
PCIe Interface (pcie2fifos.v)#
Protocol: AXI4 (512-bit data width)
Clock: 225 MHz (aclk)
Bandwidth: ~14 GB/s theoretical (512 bits × 225 MHz / 8)
Latency: Command-to-response ~10-20 cycles typical
HBM Interface (hbm_processor.v)#
Protocol: AXI4 (256-bit data width)
Address Width: 33 bits (8 GB addressable)
Clock: 225 MHz (aclk)
Bandwidth: 400+ GB/s (multiple HBM channels)
Latency: ~100-200 ns typical read latency
BRAM Interface (input_data_handler.v)#
Width: 256 bits
Depth: 32,768 rows (15-bit address)
Read Latency: 3 cycles
Arbitration: Command interpreter has priority over external events processor
URAM Interface (internal_events_processor.v)#
Width: 72 bits (2 neurons × 36 bits)
Depth: 4,096 rows per bank (12-bit address)
Banks: 16 independent banks
Clock: 450 MHz (aclk450)
Read Latency: 1 cycle
Special Feature: Read-modify-write hazard detection and resolution
Cross-References#
Software Stack Integration#
Python API:
hs_bridge/directory contains the host-side softwarenetwork.py: High-level network definitioncompile_network.py: Converts network to HBM memory layoutFPGA_Execution/fpga_controller.py: Sends commands to command_interpreterwrapped_dmadump/: DMA library for PCIe communication
Hardware Configuration#
System Information:
CRI_stack_information- Details on 40-board clusterConfiguration Files:
hs_bridge/config.yaml- FPGA/network parameters
Documentation Flow#
Start with this README for overall architecture
Read
CRI_proj/orN_cores/module documentation for implementation detailsRefer to interface specifications for integration
Cross-reference with Python code for command protocols
Implementation Notes#
Design Evolution#
The codebase shows evidence of scaling from 8 to 16 neuron groups:
Commented code in
spike_fifo_controller.vshows previous 16-FIFO support (now 8)pointer_fifo_controller.vhas full 16-FIFO implementationAddress widths expanded: 12-bit → 13-bit for row addresses
Performance Optimization#
Pipeline Depth: External events processor uses 3-stage pipeline
Register Slicing: HBM interface includes register slice for timing closure
Dual Clock: URAM runs at 450 MHz (2x) for doubled throughput
Round-Robin: Fair arbitration prevents starvation in multi-FIFO controllers
Safety Features#
Reset Synchronization: Prevents metastability across clock domains
Hazard Resolution: Read-modify-write conflicts detected in internal_events_processor
Flow Control: FIFO full/empty signals prevent data loss
Default States: All state machines include default cases returning to reset
Getting Started#
Prerequisites#
Xilinx Vivado for FPGA synthesis (version compatible with XCVU37p)
Python 3.x with hs_bridge package for host control
PCIe Gen3 x16 connection to FPGA board
Access to CRI cluster or compatible hardware setup
Build Flow#
Synthesize modules in
CRI_proj/orN_cores/Integrate with Xilinx HBM and PCIe IP cores
Place and route with timing constraints for 225/450 MHz
Generate bitstream and program FPGA
Use
hs_bridgePython library to configure and run networks
Testing#
Unit tests: Simulate individual modules with test vectors
Integration tests: Full-core simulation with spike input/output
Hardware validation: Use simple network patterns to verify connectivity
Future Development#
Potential areas for enhancement:
Scaling: Support for 32+ neuron groups per core
Precision: Configurable bit widths for weights and potentials
Learning: Hardware support for on-chip STDP or other plasticity rules
Multi-Core: Inter-core communication for distributed networks
Monitoring: Built-in performance counters and debug interfaces
Contact and Support#
For questions about this hardware implementation:
Refer to
hs_bridgedocumentation for software interfaceCheck individual module documentation for detailed logic descriptions
See
CRI_stack_informationfor system architecture and configuration
Last Updated: December 2025 (Generated Documentation) Hardware Version: 16 neuron groups, 225/450 MHz dual-clock design Target Device: Xilinx XCVU37p with HBM