3 Verilog Files Review#

Overview#

The Verilog files directory contains the FPGA hardware implementation of a neuromorphic computing system designed to simulate large-scale spiking neural networks. The implementation targets Xilinx XCVU37p FPGAs with High Bandwidth Memory (HBM) and is part of the CRI (Cognitive Research Infrastructure) neuromorphic computing cluster at the San Diego Supercomputer Center.

System Scale#

40 FPGA boards across 5 compute servers
32 cores per FPGA (each core can process 128K neurons)
4M neurons per FPGA, 160M neurons total system capacity
400+ GBps HBM bandwidth per board
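
The capacity figures above follow from simple multiplication. The short check below reproduces them, assuming "128K" and "M" are binary multiples (2^17 and 2^20); the constant names are illustrative and do not come from the codebase.

```python
# Figures taken from the System Scale list; names are illustrative only.
BOARDS = 40
CORES_PER_FPGA = 32
NEURONS_PER_CORE = 128 * 1024  # "128K" read as 2**17 (assumption)

neurons_per_fpga = CORES_PER_FPGA * NEURONS_PER_CORE
neurons_total = BOARDS * neurons_per_fpga

print(neurons_per_fpga)           # 4194304   (4 * 2**20, i.e. "4M")
print(neurons_total)              # 167772160 (160 * 2**20, i.e. "160M")
print(neurons_total // 2**20)     # 160
```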

Key Features#

Supports 16 neuron groups per core (8192 neurons each)
Multi-clock domain design: 225 MHz (aclk) and 450 MHz (aclk450)
Three-tier memory hierarchy:
- BRAM for axon/external event data
- URAM for neuron state data
- HBM for synaptic connectivity data
PCIe Gen3 x16 interface for host communication via DMA
Two-phase execution model: External events (Phase 1) → Internal updates (Phase 2)

High-Level Architecture#

                    ┌─────────────────────────────────────────────┐
                    │         Host Computer (Python)              │
                    │    hs_bridge: Network definition & control  │
                    └──────────────────┬──────────────────────────┘
                                       │
                                  PCIe Gen3 x16
                                       │
                    ┌──────────────────▼──────────────────────────┐
                    │           FPGA (Xilinx XCVU37p)             │
                    │                                              │
                    │  ┌────────────────────────────────────────┐ │
                    │  │      pcie2fifos (AXI4 to FIFO)         │ │
                    │  │  512-bit data path @ 225 MHz           │ │
                    │  └──────┬─────────────────────┬───────────┘ │
                    │         │                     │              │
                    │    Input FIFO            Output FIFO         │
                    │         │                     │              │
                    │  ┌──────▼─────────────────────▼───────────┐ │
                    │  │      command_interpreter                │ │
                    │  │  Parses commands, routes to modules     │ │
                    │  └──┬─────────┬──────────┬─────────┬──────┘ │
                    │     │         │          │         │         │
                    │     │         │          │         │         │
        ┌───────────┼─────▼─────────┴──────────┴─────────┴─────┐  │
        │  BRAM     │  input_data_handler (BRAM Arbiter)        │  │
        │  2^15 x   │  - Arbitrates CI vs EEP access            │  │
        │  256-bit  │  - 3-cycle read latency                   │  │
        └───────────┼──────┬──────────────────────────┬─────────┘  │
                    │      │                          │             │
                    │  ┌───▼──────────────┐   ┌───────▼──────────┐ │
                    │  │ external_events_ │   │  spike_fifo_     │ │
                    │  │    processor     │   │   controller     │ │
                    │  │ - Reads axon/    │   │ - Collects spikes│ │
                    │  │   BRAM data      │   │   from 8 FIFOs   │ │
                    │  │ - Generates HBM  │   │ - Round-robin    │ │
                    │  │   read requests  │   │   arbitration    │ │
                    │  └───┬──────────────┘   └──────────────────┘ │
                    │      │                                        │
                    │  ┌───▼────────────────────────────────────┐  │
                    │  │       hbm_processor                     │  │
                    │  │  - Manages HBM access (AXI4)            │  │
                    │  │  - Reads synaptic pointers              │  │
                    │  │  - Prefetches synapse data              │  │
                    │  │  - Handles pointer chains               │  │
                    │  └───┬────────────────────────────────┬───┘  │
                    │      │                                │       │
        ┌───────────┼──────▼────────────────────┐   ┌──────▼─────┐ │
        │  HBM      │  pointer_fifo_controller  │   │ hbm_       │ │
        │  33-bit   │  - 16 pointer FIFOs       │   │ register_  │ │
        │  address  │  - Round-robin dispatch   │   │ slice      │ │
        │  256-bit  │  - Manages BRAM/URAM      │   │ (timing)   │ │
        │  data     │    spike flags            │   └────────────┘ │
        │  400+     └───┬───────────────────────┘                  │
        │  GBps                │                                    │
        └───────────┼──────────▼────────────────────────────────┐  │
                    │  internal_events_processor                │  │
                    │  - Processes 16 URAM banks                │  │
                    │  - Updates neuron states                  │  │
                    │  - Detects spikes                         │  │
                    │  - Read-modify-write hazard resolution    │  │
        ┌───────────┼──────────────────────────────────────────┐│  │
        │  URAM     │  16 banks x 4096 rows x 72 bits          ││  │
        │  12-bit   │  (2 neurons per 72-bit word)             ││  │
        │  address  │  @450 MHz for high throughput            ││  │
        │  72-bit   └──────────┬────────────────────────────────┘│  │
        │  data                │                                  │  │
        └──────────────────────┼──────────────────────────────────┘  │
                    │          │                                     │
                    │     Spike outputs → spike_fifo_controller      │
                    │          (17-bit neuron addresses)             │
                    └────────────────────────────────────────────────┘

Directory Structure#

`CRI_proj/` - Single-Core Verilog Implementation#

Original single-core implementation using Verilog. Contains the core processing modules:

Module	Purpose	Key Features
command_interpreter.v	PCIe command interface	Parses commands, routes to processors
pcie2fifos.v	PCIe to FIFO bridge	AXI4 512-bit interface, handles DMA
input_data_handler.v	BRAM arbiter	Arbitrates between command interpreter and external events processor
external_events_processor.v	Axon event processing	Reads BRAM, generates HBM requests for synapses
hbm_processor.v	HBM memory controller	Manages synapse data access, pointer chaining
pointer_fifo_controller.v	Pointer distribution	Distributes pointers to 16 neuron groups
internal_events_processor.v	Neuron state updates	Processes 16 URAM banks, updates neurons, detects spikes
spike_fifo_controller.v	Spike collection	Collects spikes from 8 FIFOs, round-robin arbitration

Variants:

external_events_processor_simple.v - Simplified version with fixed pipeline depth
external_events_processor_v2.v - Enhanced version with improved timing

`N_cores/` - Multi-Core SystemVerilog Implementation#

Multi-core implementation using SystemVerilog with improved modularity and timing closure:

Module	Purpose	Key Features
single_core.sv	Top-level core integration	Instantiates all processors, 16 URAMs, FIFOs
core_wrapper.sv	Core wrapper with reset sync	Adds HBM register slice for timing closure
types.sv	Interface definitions	AXI4, AXILite, AXIStream, FIFO, RAM interfaces
Xilinx_IP_wrappers.sv	Xilinx IP macros	FIFO and URAM wrapper macros
FIFO_AXI_Converters.sv	Protocol converters	FIFO ↔ AXI Stream conversion modules
reset_synchronizer.sv	Reset synchronization	2-FF synchronizer for clock domain crossing

Memory Organization#

Memory Hierarchy#

┌─────────────────────────────────────────────────────────────┐
│  Memory Type  │  Size/Dimensions  │  Purpose               │
├───────────────┼───────────────────┼────────────────────────┤
│  BRAM         │  2^15 x 256-bit   │  Axon/External Events  │
│  (Block RAM)  │  = 1 MB           │  - Row addresses       │
│               │                   │  - Spike masks         │
├───────────────┼───────────────────┼────────────────────────┤
│  URAM         │  16 banks         │  Neuron States         │
│  (UltraRAM)   │  4096 x 72-bit    │  - Membrane potential  │
│               │  = 294 Kb/bank    │  - Threshold           │
│               │  @450 MHz         │  - Refractory state    │
│               │                   │  (2 neurons/word)      │
├───────────────┼───────────────────┼────────────────────────┤
│  HBM          │  33-bit address   │  Synaptic Connectivity │
│  (High        │  256-bit data     │  - Pointer chains      │
│   Bandwidth   │  400+ GBps        │  - Synapse weights     │
│   Memory)     │  8 GB total       │  - Target neuron IDs   │
└───────────────┴───────────────────┴────────────────────────┘
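
The sizes in the table can be cross-checked from the stated dimensions. This is a small arithmetic sketch; the constant names are made up for illustration, and "Kb" is read as kilobits.

```python
# Dimensions copied from the memory hierarchy table.
BRAM_BITS = (1 << 15) * 256        # 2^15 rows x 256 bits
URAM_BANK_BITS = 4096 * 72         # one bank: 4096 rows x 72 bits
URAM_TOTAL_BITS = 16 * URAM_BANK_BITS
HBM_BYTES = 1 << 33                # 33-bit byte address

print(BRAM_BITS // 8 // 1024)        # 1024 KiB, i.e. 1 MiB of BRAM
print(URAM_BANK_BITS)                # 294912 bits per bank (~294 Kb)
print(URAM_TOTAL_BITS // 8 // 1024)  # 576 KiB of URAM across 16 banks
print(HBM_BYTES // 2**30)            # 8 GiB of HBM address space
```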

Address Space Layout#

BRAM Address (15-bit):

[14:0] Row address → 32,768 rows
       Each row: 256 bits = 16 x 16-bit masks (one per neuron group)
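
As an illustration of the row format, the sketch below splits a 256-bit BRAM row into its sixteen 16-bit masks. The bit ordering (group 0 in the least-significant 16 bits) is an assumption, and `split_bram_row` is a hypothetical helper, not a function from hs_bridge.

```python
from typing import List

GROUPS = 16      # neuron groups per core
MASK_BITS = 16   # bits per group mask

def split_bram_row(row: int) -> List[int]:
    """Split a 256-bit row into 16 masks.

    Assumes group 0 occupies bits [15:0], group 1 bits [31:16], and so on.
    """
    if not 0 <= row < (1 << (GROUPS * MASK_BITS)):
        raise ValueError("row must fit in 256 bits")
    return [(row >> (g * MASK_BITS)) & 0xFFFF for g in range(GROUPS)]

# Example: group 0 mask = 0x00FF, group 3 mask = 0x8001, all others zero.
row = (0x8001 << (3 * MASK_BITS)) | 0x00FF
masks = split_bram_row(row)
print([hex(m) for m in masks[:4]])  # ['0xff', '0x0', '0x0', '0x8001']
```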

URAM Address (12-bit per bank):

[11:0] Row address → 4,096 rows per bank
       16 banks total
       72 bits per row = 2 neurons x 36 bits each
       Total: 8,192 neurons per bank x 16 banks = 131,072 neurons
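
One way to picture the URAM layout is to decompose a 17-bit neuron address into a bank, a row, and a slot within the 72-bit word (4 + 12 + 1 bits). The field order used here is an assumption chosen for illustration; the actual mapping is defined by the RTL and may differ.

```python
def decode_neuron_address(addr: int):
    """Split a 17-bit neuron address into (bank, row, slot).

    Assumed layout (not confirmed by the source):
      bits [16:13] bank (16 banks)
      bits [12:1]  row  (4096 rows)
      bit  [0]     slot (which of the 2 neurons in the 72-bit word)
    """
    if not 0 <= addr < (1 << 17):
        raise ValueError("neuron address must fit in 17 bits")
    slot = addr & 0x1
    row = (addr >> 1) & 0xFFF
    bank = (addr >> 13) & 0xF
    return bank, row, slot

def split_uram_word(word: int):
    """Split a 72-bit word into two 36-bit neuron fields (slot 0 assumed low)."""
    if not 0 <= word < (1 << 72):
        raise ValueError("word must fit in 72 bits")
    mask36 = (1 << 36) - 1
    return word & mask36, (word >> 36) & mask36

print(decode_neuron_address(0))        # (0, 0, 0)
print(decode_neuron_address(8192))     # (1, 0, 0)
print(decode_neuron_address(131071))   # (15, 4095, 1)
```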

HBM Address (33-bit):

[32:0] Byte address → 8 GB addressable space
       Stores synapse data in pointer-chain format:
       [31:0]  Next pointer (32-bit HBM address)
       [47:32] Synapse weight (16-bit)
       [63:48] Target neuron ID (16-bit)
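
The field positions listed above can be exercised with a small pack/unpack pair. This sketch treats one entry as a 64-bit integer and leaves the weight as a raw 16-bit value, since the source does not say whether it is signed; the function names are hypothetical.

```python
def pack_synapse_entry(next_ptr: int, weight: int, target: int) -> int:
    """Build a 64-bit entry: [31:0] next pointer, [47:32] weight, [63:48] target."""
    if not 0 <= next_ptr < (1 << 32):
        raise ValueError("next_ptr must fit in 32 bits")
    if not 0 <= weight < (1 << 16) or not 0 <= target < (1 << 16):
        raise ValueError("weight and target must fit in 16 bits")
    return (target << 48) | (weight << 32) | next_ptr

def unpack_synapse_entry(entry: int):
    """Return (next_ptr, weight, target) from a 64-bit entry."""
    next_ptr = entry & 0xFFFF_FFFF
    weight = (entry >> 32) & 0xFFFF   # raw bits; signedness is not specified
    target = (entry >> 48) & 0xFFFF
    return next_ptr, weight, target

entry = pack_synapse_entry(next_ptr=0x0000_1000, weight=0x0020, target=0x0ABC)
print(hex(entry))                   # 0xabc002000001000
print(unpack_synapse_entry(entry))  # (4096, 32, 2748)
```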

Data Flow and Execution Phases#

Phase 1: External Event Processing#

1. Host → PCIe → pcie2fifos → Input FIFO
2. command_interpreter extracts axon spike events
3. external_events_processor:
   - Reads BRAM (axon data) via input_data_handler
   - Generates HBM read requests for synapse pointers
4. hbm_processor:
   - Fetches synapse pointer chains from HBM
   - Prefetches synapse data
5. pointer_fifo_controller:
   - Distributes pointers to 16 neuron groups
   - Sets spike flags in BRAM/URAM
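
The pointer-chain fetch in step 4 behaves like a linked-list traversal. The sketch below models HBM as a Python dict and follows the next-pointer field until a terminator is reached. The end-of-chain value (0 here) and the hop limit are assumptions; the source does not state how a chain ends.

```python
END_OF_CHAIN = 0  # assumed sentinel; the real terminator is not documented here

def walk_chain(hbm: dict, head: int, max_hops: int = 1024):
    """Follow a pointer chain and return a list of (target, weight) pairs."""
    synapses = []
    ptr = head
    hops = 0
    while ptr != END_OF_CHAIN:
        if hops >= max_hops:
            raise RuntimeError("chain exceeds max_hops (possible cycle)")
        entry = hbm[ptr]
        target = (entry >> 48) & 0xFFFF   # [63:48]
        weight = (entry >> 32) & 0xFFFF   # [47:32]
        synapses.append((target, weight))
        ptr = entry & 0xFFFF_FFFF         # [31:0] next pointer
        hops += 1
    return synapses

# Two-entry chain: 0x100 -> 0x200 -> end.
hbm = {
    0x100: (7 << 48) | (5 << 32) | 0x200,
    0x200: (3 << 48) | (9 << 32) | END_OF_CHAIN,
}
print(walk_chain(hbm, 0x100))  # [(7, 5), (3, 9)]
```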

Phase 2: Internal Event Processing#

1. internal_events_processor:
   - Reads 16 URAM banks (neuron states) @450 MHz
   - Applies synaptic inputs from pointer FIFOs
   - Updates membrane potentials
   - Detects threshold crossings (spikes)
   - Writes back updated states
2. Spike outputs → spike_fifo_controller
3. spike_fifo_controller:
   - Collects from 8 spike FIFOs (round-robin)
   - Sends spike events back to external_events_processor,
     or to the host via the Output FIFO
4. Output FIFO → pcie2fifos → PCIe → Host
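
The round-robin collection in step 3 can be modeled in a few lines. This is a behavioral sketch only: the class name and methods are invented for illustration, and the real controller's grant timing is not described in this document.

```python
from collections import deque

class RoundRobinCollector:
    """Behavioral model of round-robin draining across several FIFOs."""

    def __init__(self, n_fifos: int = 8):
        self.fifos = [deque() for _ in range(n_fifos)]
        self.next_idx = 0  # FIFO that gets first chance on the next pop

    def push(self, idx: int, spike: int) -> None:
        self.fifos[idx].append(spike)

    def pop(self):
        """Return (fifo_index, spike) from the next non-empty FIFO, or None."""
        n = len(self.fifos)
        for offset in range(n):
            idx = (self.next_idx + offset) % n
            if self.fifos[idx]:
                # Start the next search just after the FIFO that was served.
                self.next_idx = (idx + 1) % n
                return idx, self.fifos[idx].popleft()
        return None

rr = RoundRobinCollector()
rr.push(0, 10)
rr.push(0, 11)
rr.push(3, 30)
order = [rr.pop(), rr.pop(), rr.pop(), rr.pop()]
print(order)  # [(0, 10), (3, 30), (0, 11), None]
```

FIFO 3 is served before the second entry of FIFO 0, which is the starvation-avoidance property that round-robin arbitration is meant to provide.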

Clock Domains#

The design uses two primary clock domains:

Clock Domain	Frequency	Usage
aclk	225 MHz	PCIe, command interpreter, BRAM, most processing logic
aclk450	450 MHz	URAM access for high-throughput neuron updates

Clock Domain Crossing:

Reset signals synchronized with reset_synchronizer.sv (2-FF synchronizer)
Data crossing handled by async FIFOs with independent read/write clocks
Critical path timing improved with hbm_register_slice in AXI4 HBM interface
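
For intuition, a 2-FF synchronizer delays an incoming signal by two destination-clock edges. The model below captures only that cycle-level behavior; it does not model metastability, and it is not a transcription of reset_synchronizer.sv.

```python
class TwoFlopSynchronizer:
    """Cycle-level model of a two-flip-flop synchronizer."""

    def __init__(self, reset_value: int = 1):
        self.ff1 = reset_value
        self.ff2 = reset_value

    def clock(self, async_in: int) -> int:
        """Advance one destination-clock edge and return the synchronized output."""
        # Both flops sample on the same edge: ff2 takes the OLD ff1 value.
        self.ff2, self.ff1 = self.ff1, async_in
        return self.ff2

sync = TwoFlopSynchronizer(reset_value=1)
outputs = [sync.clock(0) for _ in range(3)]
print(outputs)  # [1, 0, 0] -> the change becomes visible on the second edge
```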

Module Hierarchy#

single_core (top-level)
├── core_wrapper
│   ├── reset_synchronizer (aclk domain)
│   ├── reset_synchronizer (aclk450 domain)
│   ├── hbm_register_slice (AXI4 pipeline)
│   └── core
│       ├── pcie2fifos (PCIe ↔ FIFO bridge)
│       ├── command_interpreter
│       ├── input_data_handler (BRAM arbiter)
│       │   └── BRAM (2^15 x 256)
│       ├── external_events_processor
│       ├── hbm_processor
│       │   └── AXI4 Master (HBM interface)
│       ├── pointer_fifo_controller
│       │   └── 16 x pointer_FIFO (32-bit)
│       ├── internal_events_processor
│       │   └── 16 x URAM (4096 x 72-bit) @450MHz
│       └── spike_fifo_controller
│           └── 8 x spike_FIFO (17-bit)

Key Terms and Definitions#

Term	Definition
Axon	Presynaptic neuron output; external events stored in BRAM
BRAM	Block RAM - On-chip memory for axon/external event data
HBM	High Bandwidth Memory - Off-chip DRAM for synaptic connectivity
URAM	UltraRAM - High-density on-chip memory for neuron states
Neuron Group	8,192 neurons sharing a URAM bank and processing pipeline
Pointer Chain	Linked-list structure in HBM storing synapses for each axon
Spike	Action potential - neuron firing event when threshold is crossed
Synapse	Connection between neurons with associated weight
AXI4	ARM Advanced eXtensible Interface - High-performance memory protocol
DMA	Direct Memory Access - Host-FPGA data transfer without CPU involvement
FIFO	First-In-First-Out buffer for asynchronous data transfer
Round-Robin	Fair scheduling algorithm cycling through multiple requesters
Register Slice	Pipeline stage for timing closure in high-speed interfaces
FWFT	First-Word Fall-Through - FIFO mode with zero-latency reads

Interface Specifications#

PCIe Interface (pcie2fifos.v)#

Protocol: AXI4 (512-bit data width)
Clock: 225 MHz (aclk)
Bandwidth: ~14.4 GB/s theoretical (512 bits × 225 MHz / 8 bits per byte)
Latency: Command-to-response ~10-20 cycles typical
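
The bandwidth and latency figures above can be reproduced directly. This sketch converts the quoted cycle counts to nanoseconds at the aclk frequency; the names are illustrative.

```python
DATA_WIDTH_BITS = 512
ACLK_HZ = 225_000_000

# Peak transfer rate if one 512-bit beat moves every aclk cycle.
peak_bytes_per_s = DATA_WIDTH_BITS * ACLK_HZ // 8
cycle_ns = 1e9 / ACLK_HZ

def cycles_to_ns(cycles: int) -> float:
    """Convert aclk cycles to nanoseconds."""
    return cycles * cycle_ns

print(peak_bytes_per_s / 1e9)              # 14.4 (GB/s)
print(round(cycles_to_ns(10), 1))          # 44.4 (ns)
print(round(cycles_to_ns(20), 1))          # 88.9 (ns)
```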

HBM Interface (hbm_processor.v)#

Protocol: AXI4 (256-bit data width)
Address Width: 33 bits (8 GB addressable)
Clock: 225 MHz (aclk)
Bandwidth: 400+ GB/s (multiple HBM channels)
Read Latency: ~100-200 ns typical

BRAM Interface (input_data_handler.v)#

Width: 256 bits
Depth: 32,768 rows (15-bit address)
Read Latency: 3 cycles
Arbitration: Command interpreter has priority over external events processor
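
The combination of fixed-priority arbitration and a 3-cycle read latency can be sketched as a small cycle-based model. Everything here (class name, `clock` method, the behavior of a losing request) is an assumption for illustration; in particular, this model simply drops a request that loses arbitration, so the requester must retry.

```python
from collections import deque

class BramModel:
    """Fixed-priority arbiter (CI over EEP) in front of a fixed-latency read port."""

    def __init__(self, latency: int = 3):
        self.mem = {}
        # Requests in flight; index 0 is the one that completes this cycle.
        self.pipe = deque([None] * latency, maxlen=latency)

    def clock(self, ci_addr=None, eep_addr=None):
        """Advance one cycle; return (requester, data) for a completed read, or None."""
        if ci_addr is not None:          # command interpreter wins ties
            grant = ("CI", ci_addr)
        elif eep_addr is not None:
            grant = ("EEP", eep_addr)
        else:
            grant = None
        done = self.pipe[0]
        self.pipe.append(grant)          # maxlen drops the completed slot
        if done is None:
            return None
        who, addr = done
        return who, self.mem.get(addr, 0)

bram = BramModel()
bram.mem[5] = 0xAB
bram.mem[9] = 0xCD
results = [
    bram.clock(ci_addr=5, eep_addr=9),  # both request: CI granted, EEP must retry
    bram.clock(eep_addr=9),             # EEP retries and is granted
    bram.clock(),
    bram.clock(),                       # CI data returns 3 cycles after issue
    bram.clock(),                       # EEP data returns one cycle later
]
print(results)  # [None, None, None, ('CI', 171), ('EEP', 205)]
```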

URAM Interface (internal_events_processor.v)#

Width: 72 bits (2 neurons × 36 bits)
Depth: 4,096 rows per bank (12-bit address)
Banks: 16 independent banks
Clock: 450 MHz (aclk450)
Read Latency: 1 cycle
Special Feature: Read-modify-write hazard detection and resolution
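
A read-modify-write hazard arises when a second update to the same address arrives before the first write has landed in memory. One common resolution is to forward the newest in-flight value instead of reading stale memory. The sketch below models that idea; it is an assumed technique for illustration and not a description of how internal_events_processor resolves the hazard.

```python
class RmwPipeline:
    """Accumulator whose writes reach memory a few updates after they are issued."""

    def __init__(self, write_delay: int = 2):
        self.mem = {}
        self.in_flight = []          # (addr, value) pairs, oldest first
        self.write_delay = write_delay

    def update(self, addr: int, delta: int) -> int:
        """Add delta to mem[addr], forwarding from in-flight writes on a hazard."""
        value = self.mem.get(addr, 0)
        for a, v in self.in_flight:
            if a == addr:
                value = v            # hazard: a newer value exists than memory holds
        new_value = value + delta
        self.in_flight.append((addr, new_value))
        if len(self.in_flight) > self.write_delay:
            a, v = self.in_flight.pop(0)
            self.mem[a] = v          # oldest write finally lands
        return new_value

    def flush(self) -> None:
        """Retire all outstanding writes in order."""
        for a, v in self.in_flight:
            self.mem[a] = v
        self.in_flight.clear()

p = RmwPipeline()
steps = [p.update(7, 1), p.update(7, 2), p.update(7, 3)]
p.flush()
print(steps, p.mem[7])  # [1, 3, 6] 6
```

Without the forwarding loop, the second and third updates would read the stale value from memory and the final total would be wrong.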

Cross-References#

Software Stack Integration#

Python API: hs_bridge/ directory contains the host-side software
- network.py: High-level network definition
- compile_network.py: Converts network to HBM memory layout
- FPGA_Execution/fpga_controller.py: Sends commands to command_interpreter
- wrapped_dmadump/: DMA library for PCIe communication

Hardware Configuration#

System Information: CRI_stack_information - Details on 40-board cluster
Configuration Files: hs_bridge/config.yaml - FPGA/network parameters

Documentation Flow#

Start with this README for overall architecture
Read CRI_proj/ or N_cores/ module documentation for implementation details
Refer to interface specifications for integration
Cross-reference with Python code for command protocols

Implementation Notes#

Design Evolution#

The codebase shows evidence of scaling from 8 to 16 neuron groups, although the spike path currently uses 8 FIFOs:

Commented code in spike_fifo_controller.v shows previous 16-FIFO support (now 8)
pointer_fifo_controller.v has full 16-FIFO implementation
Address widths expanded: 12-bit → 13-bit for row addresses

Performance Optimization#

Pipeline Depth: External events processor uses 3-stage pipeline
Register Slicing: HBM interface includes register slice for timing closure
Dual Clock: URAM runs at 450 MHz (2x) for doubled throughput
Round-Robin: Fair arbitration prevents starvation in multi-FIFO controllers

Safety Features#

Reset Synchronization: Prevents metastability across clock domains
Hazard Resolution: Read-modify-write conflicts detected in internal_events_processor
Flow Control: FIFO full/empty signals prevent data loss
Default States: All state machines include default cases that return to the reset state

Getting Started#

Prerequisites#

Xilinx Vivado for FPGA synthesis (version compatible with XCVU37p)
Python 3.x with hs_bridge package for host control
PCIe Gen3 x16 connection to FPGA board
Access to CRI cluster or compatible hardware setup

Build Flow#

Synthesize modules in CRI_proj/ or N_cores/
Integrate with Xilinx HBM and PCIe IP cores
Place and route with timing constraints for 225/450 MHz
Generate bitstream and program FPGA
Use hs_bridge Python library to configure and run networks

Testing#

Unit tests: Simulate individual modules with test vectors
Integration tests: Full-core simulation with spike input/output
Hardware validation: Use simple network patterns to verify connectivity

Future Development#

Potential areas for enhancement:

Scaling: Support for 32+ neuron groups per core
Precision: Configurable bit widths for weights and potentials
Learning: Hardware support for on-chip STDP or other plasticity rules
Multi-Core: Inter-core communication for distributed networks
Monitoring: Built-in performance counters and debug interfaces

Contact and Support#

For questions about this hardware implementation:

Refer to hs_bridge documentation for software interface
Check individual module documentation for detailed logic descriptions
See CRI_stack_information for system architecture and configuration

Last Updated: December 2025 (Generated Documentation)
Hardware Version: 16 neuron groups, 225/450 MHz dual-clock design
Target Device: Xilinx XCVU37p with HBM