Appendix: Hardware Glossary for Software Experts#
This appendix defines all hardware terms and concepts used throughout the documentation. Each entry includes a software analogy where applicable to bridge the gap between hardware and software thinking.
A#
Address Passthrough#
What it is: A technique where a memory controller echoes back the requested address along with the returned data.
Software analogy: Like a database query that returns both the data and the primary key you searched for. Useful when multiple reads are in flight and you need to correlate responses with requests.
Why it matters: In pipelined hardware, you might issue multiple reads before the first one completes. Address passthrough tells you which read just finished.
Arbiter#
What it is: Hardware logic that decides which of multiple competing requesters gains access to a shared resource (memory, bus, FIFO).
Software analogy: Like a mutex or semaphore, but implemented in hardware logic gates. Multiple threads want the same resource; the arbiter picks one based on priority, round-robin fairness, or other policy.
Common types:
Priority arbiter: Always picks highest-priority requester (like VIP queue)
Round-robin arbiter: Cycles through requesters fairly (like
itertools.cycle())First-come-first-served: Picks whoever requested first (like a queue)
Why it matters: Prevents conflicts when multiple hardware modules try to access the same memory or bus simultaneously.
AXI4 (Advanced eXtensible Interface)#
What it is: A memory-mapped communication protocol designed by ARM for on-chip data transfer. It’s the “language” FPGA modules speak to transfer data.
Software analogy: Like a REST API for hardware—defines request/response formats, handshake protocols, and data structures. Instead of HTTP methods (GET, POST), you have read and write transactions.
Five channels:
Read Address (AR): “I want to read from address X”
Read Data (R): “Here’s the data you requested”
Write Address (AW): “I want to write to address X”
Write Data (W): “Here’s the data to write”
Write Response (B): “Write complete, here’s the status”
Key signals:
VALID: “I have valid data/request”READY: “I’m ready to accept data/request”Transaction occurs only when both are high (handshake)
Why it matters: All on-chip communication in the FPGA uses AXI4. Understanding it is like understanding HTTP for web development.
See: hardware_map.md Section 4, Chapter 1 (DMA transfers)
AXI4 Master#
What it is: A hardware module that initiates AXI4 transactions (sends read/write requests).
Software analogy: Like an HTTP client that makes requests to servers. The FPGA’s DMA engine is a master; it requests data from host memory.
Why it matters: Determines who “drives” the bus. Masters control data flow; slaves respond.
AXI4 Slave#
What it is: A hardware module that responds to AXI4 transactions (serves read/write requests).
Software analogy: Like an HTTP server that responds to client requests. Host memory is a slave; it serves data when the FPGA requests it.
Why it matters: Slaves must be ready to handle requests at any time. They implement the actual storage or computation.
Axon#
What it is: In neuromorphic systems, an external input to the network. In hardware terms, axons don’t have internal state (no membrane potential)—they’re just spike sources.
Software analogy: Like a publisher in a pub/sub system. When an axon fires, it publishes a spike event that gets delivered to all subscribed neurons.
Storage: Axon spike masks are stored in BRAM as one-hot bit vectors (256 bits per row).
Why it matters: Distinguishing axons (stateless inputs) from neurons (stateful computations) determines memory organization and processing flow.
See: Chapter 1 (BRAM organization), Chapter 2 (Phase 1 processing)
B#
Backpressure#
What it is: A flow control mechanism where a downstream module signals “I’m full, stop sending data” to an upstream module.
Software analogy: Like TCP windowing or a bounded queue’s .full() flag. When a buffer fills up, backpressure tells the sender to pause.
Implementation: Typically via a FULL or !READY signal. When asserted, upstream must stop writing.
Why it matters: Prevents data loss when fast producers overwhelm slow consumers. Critical in pipelined systems.
See: Chapter 2 (FIFO interfaces), hardware_map.md (AXI4 handshakes)
Beat#
What it is: A single data transfer within a burst transaction. A burst is composed of multiple beats.
Software analogy: Like a single element in a batch API request. If you request 16 records in one query, each record is one “beat.”
Example: A 16-beat burst transfers 16 data words sequentially.
Why it matters: Bursts amortize transaction overhead. Instead of 16 separate requests (expensive), send one request for 16 beats (efficient).
BRAM (Block RAM)#
What it is: Fast on-chip SRAM built into the FPGA fabric. Organized into dual-port blocks (can read and write simultaneously from different ports).
Specifications:
Capacity: 1 MB total (256 rows × 256 bits per row)
Latency: 3 clock cycles @ 225 MHz (~13 ns)
Technology: 6-transistor SRAM cells (fast, power-hungry)
Software analogy: Like L1 cache—very fast, very expensive (in silicon area), small capacity. Use for frequently accessed data.
Usage in neuromorphic system:
Stores axon spike masks (one-hot encoded, 256 bits per row)
Dual-port allows host PC to write new spikes while FPGA reads current spikes
Why it matters: BRAM is the fastest memory accessible to FPGA logic (except registers). Ideal for low-latency, small datasets.
See: Chapter 1 Section 1.1, Chapter 2 Phase 0
Burst#
What it is: An AXI4 transaction that transfers multiple consecutive data words in a single request.
Software analogy: Like a batch database query. Instead of:
for i in range(16):
data[i] = read(addr + i) # 16 separate requests
You do:
data = read_burst(addr, length=16) # 1 request, 16 data transfers
Why it matters: Reduces protocol overhead. Each request has setup time (address, handshake); bursts amortize this over multiple data words.
See: hardware_map.md (AXI4 timing), Chapter 1 (HBM bursts)
Burst Length#
What it is: The number of beats (data words) in a burst transaction.
AXI4 limits: 1-256 beats for INCR bursts.
Example: Burst length 16 means the transaction transfers 16 consecutive data words.
Why it matters: Longer bursts → higher bandwidth efficiency, but require contiguous addresses.
C#
Clock Domain#
What it is: A region of a circuit where all flip-flops are clocked by the same clock signal. Everything in a clock domain updates simultaneously on clock edges.
Software analogy: Like a thread with its own event loop. All state updates happen synchronously at clock ticks (like await points in async Python).
Example in neuromorphic system:
225 MHz domain: BRAM, HBM interfaces, most control logic
450 MHz domain: URAM, internal events processor (2x faster for neuron updates)
Why it matters: Data transfer between clock domains requires special synchronization (FIFOs with gray code) to avoid metastability.
See: hardware_map.md (clock generation), Chapter 2 (dual clock operation)
Clock Domain Crossing (CDC)#
What it is: Transferring data from one clock domain to another (e.g., 225 MHz → 450 MHz).
Software analogy: Like passing data between threads with different event loops. You need thread-safe mechanisms (mutex, queue) to avoid race conditions.
Hardware solution: Asynchronous FIFOs with gray-code counters to prevent metastability.
Why it matters: Improper CDC causes random bit flips, data corruption, or circuit hangs. Must use proven synchronization techniques.
See: hardware_map.md Section 3 (FIFO synchronization)
Combinational Logic#
What it is: Digital circuits where outputs depend only on current inputs (no memory, no state). Examples: AND gates, multiplexers, adders.
Software analogy: Pure functions in functional programming—same inputs always produce same outputs, no side effects, no state.
def combinational(a, b):
return a & b # Output depends only on a, b (no self.state)
Verilog:
assign output = (a & b) | c; // Combinational: output updates whenever a, b, c change
Why it matters: Combinational logic is fast (no clock delay) but creates no memory. Contrast with sequential logic (flip-flops, registers) which stores state.
See: hardware_map.md (logic gates), Chapter 2 (state machines use both)
Command Opcode#
What it is: An 8-bit identifier in PCIe command packets that specifies the operation type.
Software analogy: Like HTTP methods (GET, POST, PUT, DELETE) or opcodes in assembly language (MOV, ADD, JMP).
Example opcodes in neuromorphic system:
0x01: Write to BRAM (axon spikes)0x02: Write to HBM (network structure)0x03: Write to URAM (neuron states)0x04: Read from URAM0xC8: Execute (run one timestep)
Why it matters: The command interpreter decodes opcodes to route data to correct hardware modules.
See: Chapter 1 Section 1.2, Chapter 2 Phase 0
D#
DMA (Direct Memory Access)#
What it is: A technique where an FPGA (or peripheral) transfers data to/from host memory without involving the CPU. The FPGA becomes a “bus master” and directly reads/writes RAM.
Software analogy: Like mmap() or shared memory. Instead of:
# Slow: CPU copies data
data = host_memory.read()
fpga.write(data)
You have:
# Fast: FPGA directly accesses host memory
fpga.dma_read(host_address, size)
Why it matters: DMA bypasses the CPU bottleneck, achieving 10-100x higher bandwidth. Essential for large data transfers (network initialization, spike outputs).
How it works:
FPGA sends PCIe TLP (Transaction Layer Packet) with memory address
Host memory controller responds with data
No CPU involvement—fully offloaded
See: Chapter 1 Section 1.2 (initialization), hardware_map.md Section 2 (PCIe)
Double Buffering#
What it is: Using two buffers that alternate roles—while one is being read (consumed), the other is being written (produced).
Software analogy:
class DoubleBuffer:
def __init__(self):
self.buffer_a = [0] * 1000
self.buffer_b = [0] * 1000
self.active = 'a' # Which buffer is being read
def swap(self):
# Swap roles: reader becomes writer, writer becomes reader
self.active = 'b' if self.active == 'a' else 'a'
Usage in neuromorphic system: External events processor uses double-buffered BRAMs:
Present BRAM: Being read/cleared (processing current timestep spikes)
Future BRAM: Being written (collecting spikes for next timestep)
After execute, they swap roles
Why it matters: Enables pipelining—overlap computation with I/O. No need to wait for reads to finish before accepting new writes.
See: Verilog sources (external_events_processor_simple.v)
DRAM (Dynamic RAM)#
What it is: Memory technology that stores bits as charge in capacitors. “Dynamic” because charge leaks, requiring periodic refresh.
Structure: 1 transistor + 1 capacitor per bit (1T1C). Compact but slow.
Software analogy: Like disk storage—high capacity, slow access, requires maintenance (refresh = defragmentation).
Types:
DDR4: Desktop/server RAM (~20 GB/s per channel)
HBM2: 3D-stacked high-bandwidth RAM (~400 GB/s, used in neuromorphic system)
Contrast with SRAM:
DRAM: 1T1C, dense, slow (50-100 ns), needs refresh
SRAM: 6T, fast (1-3 ns), expensive, no refresh
Why it matters: DRAM provides bulk storage (8 GB HBM) for synaptic weights. Too slow for direct neuron computation (hence URAM for neuron states).
See: hardware_map.md Section 1.3 (HBM structure)
E#
execRun_ctr (Execution Run Counter)#
What it is: A hardware register that counts timesteps. Increments after each execute() command completes.
Software analogy:
class NeuromorphicEngine:
def __init__(self):
self.timestep = 0 # Like execRun_ctr
def execute(self):
# Run one timestep
self.timestep += 1
Why it matters: Used to tag spike events with their timestep. When spikes return to host, you know when they occurred.
See: Chapter 2 (spike packet format)
execRun_timer (Execution Run Timer)#
What it is: A hardware counter that measures clock cycles elapsed during a timestep.
Software analogy:
import time
start = time.perf_counter()
execute()
elapsed_cycles = (time.perf_counter() - start) * clock_frequency
Why it matters: Profiling tool. Tells you how many clock cycles (and thus microseconds) a timestep took. Useful for optimization.
F#
FIFO (First-In-First-Out)#
What it is: A hardware buffer that stores data in order received and outputs data in the same order.
Software analogy: queue.Queue() in Python—.put() adds to tail, .get() removes from head.
Hardware interface:
// Write side
input full, // 1 = FIFO full, cannot write
output wren, // 1 = write data_in this cycle
output data_in, // Data to write
// Read side
input empty, // 1 = FIFO empty, no data
output rden, // 1 = read data_out this cycle
input data_out, // Data read
FWFT mode (First-Word Fall-Through): data_out always shows next item (zero latency). Like .peek() in Python.
Why it matters: FIFOs decouple clock domains, buffer mismatched rates (fast producer, slow consumer), and enable pipelining.
See: hardware_map.md Section 3, Chapter 2 (pointer FIFOs, spike FIFOs)
Flip-Flop#
What it is: A basic sequential logic element that stores one bit. Updates on clock edge.
Software analogy: Like a single-bit instance variable:
class FlipFlop:
def __init__(self):
self.q = 0 # Current state
def clock_edge(self, d):
self.q = d # Update state on clock tick
Verilog:
always @(posedge clk) begin
q <= d; // q updates to d on rising clock edge
end
D-type flip-flop: Most common. Has one data input (D) and one output (Q). On clock edge, Q ← D.
Why it matters: Flip-flops are the fundamental unit of memory in digital circuits. Registers, counters, state machines—all built from flip-flops.
See: hardware_map.md Section 1.2 (FPGA fabric)
FPGA (Field-Programmable Gate Array)#
What it is: A chip containing millions of configurable logic blocks (LUTs, flip-flops) connected by programmable routing. You write code (Verilog/VHDL) that gets “compiled” into a configuration bitstream that rewires the chip.
Software analogy: Imagine if you could rewrite the CPU’s microarchitecture at runtime. An FPGA is like a blank canvas where you design custom hardware for your application.
Neuromorphic system uses: Xilinx XCVU37p UltraScale+ VU37P
1.1 million LUTs (programmable logic gates)
2.2 million flip-flops (registers)
50 MB BRAM (on-chip cache)
4.5 MB URAM (ultra-dense cache)
16 lanes PCIe Gen3 (host communication)
32 HBM2 channels (memory bandwidth)
Why it matters: FPGAs offer specialized, massively parallel computation. Neuromorphic networks need to update 100,000+ neurons in microseconds—impossible on CPU, possible on FPGA.
See: hardware_map.md Section 1.2, Chapter 1 (compilation)
FWFT (First-Word Fall-Through)#
What it is: A FIFO operating mode where the output data is always valid (showing the next item) without requiring a read strobe first.
Normal FIFO:
if not fifo.empty():
fifo.read_enable = True # Request data
# Wait 1 cycle
data = fifo.data_out # Data appears next cycle
FWFT FIFO:
if not fifo.empty():
data = fifo.data_out # Data already available (0 latency)
fifo.read_enable = True # Advance to next item
Why it matters: Reduces latency by 1 cycle. Important in high-speed pipelines where every cycle counts.
H#
Handshake#
What it is: A two-signal protocol (VALID and READY) where data transfer occurs only when both signals are high.
Protocol:
Sender: Asserts
VALIDwhen data is available, holds data stableReceiver: Asserts
READYwhen it can accept dataTransfer occurs: When
VALID && READYon clock edge
Software analogy:
while True:
sender.valid = sender.has_data()
receiver.ready = not receiver.is_full()
if sender.valid and receiver.ready:
receiver.data = sender.data # Transfer!
sender.pop()
break
Why it matters: Prevents data loss. Sender can’t send until receiver is ready; receiver doesn’t miss data because sender holds it stable.
See: hardware_map.md (AXI4 channels), Chapter 2 (HBM read valid/ready)
Hazard#
What it is: A conflict in pipelined hardware when consecutive operations access the same resource (typically same memory address) in ways that could cause data corruption.
Types:
Read-after-write (RAW): Read tries to fetch old value before write completes
Write-after-read (WAR): Write might corrupt data before read completes
Write-after-write (WAW): Second write might complete before first
Software analogy: Like race conditions in multithreaded code:
# Thread 1
x = x + 1 # Read x, add 1, write back
# Thread 2 (starts during Thread 1)
y = x # Which value of x do I see?
Hardware solution: Hazard detection logic checks if addresses match in pipeline stages. If match, either:
Stall: Pause until pipeline clears
Bypass/Forward: Route fresh data directly from pipeline register
Why it matters: URAM has 3-cycle read latency. If you read address A, then write address A within 3 cycles, hazard detection prevents reading stale data.
See: Chapter 2 Section 2.2 (internal_events_processor hazard logic)
HBM (High Bandwidth Memory)#
What it is: 3D-stacked DRAM technology providing extreme bandwidth (~400 GB/s). Multiple DRAM dies stacked vertically, connected by Through-Silicon Vias (TSVs).
Specifications:
Capacity: 8 GB (for synaptic connectivity data)
Bandwidth: 400+ GB/s (32 channels × 14 GB/s per channel)
Latency: ~100-200 ns (slow compared to BRAM/URAM, fast compared to DDR4)
Organization: 2 pseudo-channels × 16 banks per channel
Software analogy: Like a distributed database with 32 shards. Each shard (channel) serves 256 MB and can handle requests independently (parallel access).
Usage in neuromorphic system:
Region 1: Axon pointers (maps axon ID → synapse list address)
Region 2: Neuron pointers (maps neuron ID → synapse list address)
Region 3: Synapses (actual connection data: target neuron + weight)
Why it matters: Stores the entire network structure (billions of synapses for large networks). Bandwidth supports updating millions of neurons per millisecond.
See: hardware_map.md Section 1.3, Chapter 1 Section 1.1, Chapter 2 Phase 1
I#
INCR Burst#
What it is: An AXI4 burst type where addresses increment sequentially for each beat.
Example:
Burst start address: 0x1000
Burst length: 4
Addresses accessed: 0x1000, 0x1004, 0x1008, 0x100C (assuming 32-bit words)
Contrast with WRAP bursts: Addresses wrap around within a boundary (used for cache lines).
Why it matters: Simplest burst type. Used for sequential data access (reading synapse lists, writing neuron states).
L#
Latency#
What it is: The time delay between when a request is issued and when the response is received.
Software analogy: Like ping time in networking, or function call overhead.
Examples in neuromorphic system:
BRAM latency: 3 cycles @ 225 MHz = ~13 ns
URAM latency: 1 cycle @ 450 MHz = ~2 ns
HBM latency: ~100-200 ns
PCIe latency: ~1-10 µs (depends on packet size, distance)
Why it matters: Latency determines pipeline depth. 3-cycle BRAM latency means you need 3 pipeline stages between read request and data availability.
Contrast with throughput: Latency is “time per operation,” throughput is “operations per time.”
See: hardware_map.md (memory specifications), Chapter 2 (pipeline filling)
LUT (Lookup Table)#
What it is: A programmable logic element in an FPGA. Typically a 6-input, 1-output function implemented as a 64-bit SRAM.
How it works:
6 inputs → 2^6 = 64 possible input combinations
64-bit SRAM stores the output value for each combination
LUT acts as a 6-input truth table
Software analogy:
# 2-input LUT (simplified)
lut_contents = [0, 0, 0, 1] # 4-bit SRAM for 2 inputs
def lut(a, b):
index = (a << 1) | b # Convert inputs to index
return lut_contents[index]
# This LUT implements: output = a AND b
# 00 → 0, 01 → 0, 10 → 0, 11 → 1
Why it matters: LUTs are the fundamental building blocks of FPGA logic. Any combinational function of up to 6 inputs can be implemented in 1 LUT. Complex functions use multiple LUTs connected together.
See: hardware_map.md Section 1.2
M#
Metastability#
What it is: An undefined state in digital circuits where a flip-flop’s output oscillates between 0 and 1, unable to settle. Occurs during clock domain crossings when setup/hold time requirements are violated.
Software analogy: Like a race condition where a variable reads as garbage because two threads wrote simultaneously.
Cause: When a flip-flop’s input changes too close to the clock edge, the output can become metastable (neither 0 nor 1, or fluctuating).
Solution: Multi-stage synchronizers (2-3 flip-flops in series) give time for metastability to resolve. First flip-flop might be metastable, but probability that second flip-flop is also metastable is astronomically low.
Why it matters: Metastability causes random bit flips, data corruption, system crashes. Must use proven synchronization techniques for clock domain crossings (async FIFOs with gray code).
See: hardware_map.md Section 3 (FIFO synchronization)
N#
Neuron Group#
What it is: A set of 8,192 neurons that share a URAM bank and processing pipeline. The neuromorphic system has 16 neuron groups (0-15), totaling 131,072 neurons.
Organization:
Group 0: Neurons 0-8,191 (URAM bank 0)
Group 1: Neurons 8,192-16,383 (URAM bank 1)
…
Group 15: Neurons 122,880-131,071 (URAM bank 15)
Why groups? Parallel processing. All 16 groups can be updated simultaneously (16-way parallelism).
Software analogy: Like database sharding. Each shard (neuron group) has its own storage (URAM bank) and compute pipeline.
Why it matters: Determines memory addresses, routing logic, and parallelism. Understanding groups is key to understanding performance scaling.
See: Chapter 1 Section 1.1, Chapter 2 Section 2.2
P#
PCIe (Peripheral Component Interconnect Express)#
What it is: A high-speed serial communication bus connecting the CPU to peripherals (GPUs, FPGAs, SSDs). Uses differential signaling over point-to-point links.
Neuromorphic system: PCIe Gen3 x16
Gen3: Third generation (8 GT/s per lane)
x16: 16 lanes (parallel connections)
Bandwidth: ~14 GB/s bidirectional
Software analogy: Like USB or Ethernet, but much faster and lower latency. From software, you use it via memory-mapped I/O or DMA.
How it works:
Physical layer: Differential pairs (16 lanes × 2 directions = 32 pairs)
Data link layer: Packets with CRC error detection
Transaction layer: TLP (Transaction Layer Packets) for reads/writes
Why it matters: PCIe is the bridge between host CPU (software) and FPGA (hardware). All data transfer (inputs, outputs, configuration) flows through PCIe.
See: hardware_map.md Section 2, Chapter 1 Section 1.2
Pipeline#
What it is: A technique where operations are broken into stages, with multiple operations executing simultaneously at different stages.
Software analogy:
# No pipeline: 3 cycles per item
for item in data:
stage1(item) # 1 cycle
stage2(item) # 1 cycle
stage3(item) # 1 cycle
# Total: 3N cycles for N items
# Pipeline: 1 cycle per item after initial fill
# Cycle 0: stage1(item[0])
# Cycle 1: stage1(item[1]), stage2(item[0])
# Cycle 2: stage1(item[2]), stage2(item[1]), stage3(item[0])
# Cycle 3: stage1(item[3]), stage2(item[2]), stage3(item[1])
# Total: N+2 cycles for N items (3x speedup for large N)
Example in neuromorphic system: BRAM read pipeline
Cycle 0: Issue address for read 0
Cycle 1: Issue address for read 1 (read 0 in pipeline stage 1)
Cycle 2: Issue address for read 2 (read 1 in stage 1, read 0 in stage 2)
Cycle 3: Read 0 data available, issue address for read 3
Why it matters: Pipelines enable high throughput despite high latency. BRAM has 3-cycle latency but can start a new read every cycle (throughput = 1 read/cycle).
See: hardware_map.md (memory timing), Chapter 2 (pipeline filling)
Pipeline Depth#
What it is: The number of stages in a pipeline, or equivalently, the number of clock cycles between input and output.
Example: BRAM has 3-cycle read latency → pipeline depth = 3.
Why it matters: Determines how many cycles you must wait before first result. Also affects hazard detection (must check all pipeline stages for address conflicts).
Pointer Chain#
What it is: A linked-list structure in HBM where each neuron/axon has a pointer to the start of its synapse list.
Structure:
Axon a0 → Pointer: 0x00080000 (address of synapse list in HBM)
HBM[0x00080000] = [synapse_0, synapse_1, ..., synapse_4]
Each synapse = {target: h0, weight: 1000}
Software analogy:
# Like a dictionary of lists
synapses = {
'a0': [('h0', 1000), ('h1', 1000), ...],
'a1': [('h0', 1000), ('h1', 1000), ...],
}
# In hardware, stored as pointers:
pointers = {'a0': 0x80000, 'a1': 0x80005}
memory = {
0x80000: [('h0', 1000), ('h1', 1000), ...],
0x80005: [('h0', 1000), ('h1', 1000), ...],
}
Why it matters: Enables variable fan-out (some neurons have 10 synapses, others 10,000). Pointer indirection allows efficient memory usage.
See: Chapter 1 Section 1.1 (HBM layout), Chapter 2 Phase 1
Priority Arbitration#
What it is: An arbitration scheme where high-priority requesters always win over low-priority requesters.
Software analogy: Like a priority queue or VIP line.
Example: If both CPU and DMA request the bus, CPU (high priority) always wins.
Disadvantage: Low-priority requesters can starve (never get access).
See also: Round-robin (fair alternative)
R#
Read Latency#
What it is: The number of clock cycles from when a read address is issued to when the data becomes valid.
Examples:
BRAM: 3 cycles
URAM: 1 cycle
HBM: ~100-200 cycles (at 225 MHz)
Why it matters: Determines pipeline depth and minimum loop iteration time.
See: Memory specifications throughout documentation
Register#
What it is: A storage element (or group of flip-flops) that holds a multi-bit value.
Software analogy: Like an instance variable in a class.
Verilog:
reg [7:0] counter; // 8-bit register
always @(posedge clk) begin
counter <= counter + 1; // Updates on every clock edge
end
Why it matters: Registers store state in sequential circuits. Counters, addresses, data buffers—all implemented as registers.
Register Slice#
What it is: A pipeline stage inserted into a data path to improve timing (break long combinational paths).
Software analogy: Like adding an intermediate variable to break up a complex expression:
# Hard to optimize (long dependency chain)
result = f(g(h(i(j(x)))))
# Easier (can pipeline/parallelize)
temp1 = j(x)
temp2 = i(temp1)
temp3 = h(temp2)
temp4 = g(temp3)
result = f(temp4)
Why it matters: High-speed interfaces (450 MHz) require short logic paths. Register slices reduce maximum combinational delay, allowing higher clock frequencies.
See: internal_events_processor (450 MHz URAM access)
RMW (Read-Modify-Write)#
What it is: A pattern where you read a value, modify it, and write it back.
Software analogy:
x = memory[addr] # Read
x = x + 1 # Modify
memory[addr] = x # Write
Hardware challenges: With 3-cycle read latency, must ensure no other operation accesses the same address during the RMW sequence (hazard detection).
Example in neuromorphic system: Masked URAM writes
Read 72-bit word containing 2 neurons
Modify upper 36 bits (neuron voltage)
Write back full 72-bit word
Why it matters: Common pattern requiring careful hazard management.
See: Chapter 2 Section 2.2 (internal_events_processor)
Round-Robin#
What it is: A fair arbitration algorithm that cycles through requesters in order, giving each a turn.
Software analogy:
requesters = [0, 1, 2, 3, 4, 5, 6, 7]
current = 0
while True:
if requesters[current].has_request():
service(requesters[current])
current = (current + 1) % len(requesters) # Cycle: 0→1→2→...→7→0
Verilog:
reg [2:0] addr; // 3-bit counter for 8 requesters
always @(posedge clk) begin
addr <= addr + 1; // Automatically wraps 7→0
end
Usage in neuromorphic system:
Pointer FIFO controller: 16-way round-robin (checks ptr0, ptr1, …, ptr15, ptr0, …)
Spike FIFO controller: 8-way round-robin (checks spk0, spk1, …, spk7, spk0, …)
Why it matters: Ensures fairness—no requester starves. Every requester gets regular turns regardless of activity.
See: Chapter 2 (FIFO controllers), Verilog sources
S#
Sequential Logic#
What it is: Digital circuits with memory—outputs depend on both current inputs and past state.
Software analogy: Objects with instance variables:
class Counter:
def __init__(self):
self.count = 0 # State
def increment(self):
self.count += 1 # Output depends on previous state
Verilog:
reg [7:0] count;
always @(posedge clk) begin
count <= count + 1; // State updates on clock edges
end
Building blocks: Flip-flops, registers, state machines, counters, shift registers.
Contrast with combinational logic: Combinational has no memory (pure functions).
Why it matters: All computation with memory/state requires sequential logic.
See: hardware_map.md (flip-flops), Chapter 2 (state machines)
Sign Extension#
What it is: Expanding a signed integer to a wider bit width by replicating the sign bit.
Example:
8-bit signed: 11111010 (-6 in two's complement)
16-bit signed: 11111111 11111010 (still -6)
Rule: Copy the most significant bit (sign bit) into all new upper bits.
Why it matters: Prevents corruption when mixing different bit widths in arithmetic. Hardware often needs to extend weights (16-bit) to match neuron voltages (36-bit).
See: Chapter 2 (synaptic accumulation)
Spike#
What it is: An action potential—a neuron firing event that occurs when membrane potential crosses threshold.
In hardware: Represented as a 17-bit value: {valid_bit, neuron_address[16:0]}
Example: Neuron 5 spikes → spike packet = 0x00005 with valid bit = 1
Why it matters: Spikes are the fundamental events in spiking neural networks. All computation revolves around spike propagation and accumulation.
See: Throughout documentation
SRAM (Static RAM)#
What it is: Memory technology using 6 transistors per bit (6T). “Static” because it holds state as long as powered (no refresh needed).
Characteristics:
Fast: 1-3 ns access time
Expensive: 6 transistors per bit (vs. 1T1C for DRAM)
Low density: Large silicon area
Examples: CPU caches (L1, L2), FPGA BRAM
Contrast with DRAM:
SRAM: 6T, fast, expensive, no refresh
DRAM: 1T1C, slow, cheap, needs refresh
Why it matters: SRAM is used for speed-critical data (BRAM for spike masks). DRAM is used for bulk storage (HBM for synapses).
See: hardware_map.md Section 1.1 (BRAM), Section 1.3 (HBM comparison)
State Machine#
What it is: A sequential circuit that transitions through a defined set of states based on inputs and current state.
Software analogy:
class StateMachine:
def __init__(self):
self.state = 'IDLE'
def update(self, input):
if self.state == 'IDLE' and input == 'start':
self.state = 'RUNNING'
elif self.state == 'RUNNING' and input == 'done':
self.state = 'IDLE'
Verilog:
reg [1:0] state;
localparam IDLE = 0, RUNNING = 1, DONE = 2;
always @(posedge clk) begin
case (state)
IDLE: if (start) state <= RUNNING;
RUNNING: if (done) state <= DONE;
DONE: state <= IDLE;
endcase
end
Example in neuromorphic system: external_events_processor
IDLE: Waiting for execute command
FILL_PIPE: Filling BRAM read pipeline (3 cycles)
READ: Reading spike masks and fetching synapses
DONE: All spikes processed
Why it matters: State machines coordinate complex multi-step operations. Almost every hardware module has at least one state machine for control flow.
See: Chapter 2 Section 2.2 (Verilog state machines)
Synapse#
What it is: A connection between two neurons with an associated weight. When the presynaptic neuron spikes, the postsynaptic neuron’s voltage increases by the synaptic weight.
Hardware representation: 32-bit value
Bits [31:29]: Opcode (usually 0 for normal synapse)
Bits [28:16]: Target neuron address (13 bits)
Bits [15:0]: Synaptic weight (signed 16-bit integer)
Example: 0x0000_03E8 = target neuron 0, weight 1000
Storage: Synapses stored in HBM (billions of them for large networks).
Why it matters: Synapses define the network structure. All learning involves modifying synaptic weights.
See: Chapter 1 Section 1.1 (HBM synapse region), Chapter 2 Phase 1
T#
Threshold#
What it is: The membrane potential value at which a neuron fires (spikes).
Example: If threshold = 2000 and neuron voltage reaches 2000, the neuron spikes and resets to 0.
Hardware: Stored as a 36-bit signed integer in configuration registers.
Software analogy:
if neuron.voltage >= threshold:
neuron.spike()
neuron.voltage = 0 # Reset
Why it matters: Determines network sensitivity and dynamics. Lower threshold → more spikes, higher threshold → sparse activity.
See: Introduction (neuron model), Chapter 2 (threshold checking)
Throughput#
What it is: The amount of data processed per unit time.
Software analogy: Requests per second (RPS), or bandwidth in networking.
Examples:
HBM throughput: 400 GB/s (can transfer 400 billion bytes per second)
BRAM throughput: 1 read per cycle @ 225 MHz = 225 million reads/s
Pipeline throughput: 1 result per cycle (after initial fill)
Contrast with latency: Latency = time per operation, throughput = operations per time.
Relationship: High latency can still have high throughput if pipelined.
Why it matters: Throughput determines how fast you can process large datasets. Latency determines responsiveness for individual operations.
Transaction ID (TID)#
What it is: An identifier tag attached to requests to allow out-of-order completion.
Software analogy:
# Send multiple requests with IDs
send_request(addr=0x1000, tid=1)
send_request(addr=0x2000, tid=2)
send_request(addr=0x3000, tid=3)
# Responses can return in any order
response = receive() # {tid: 3, data: ...} ← request 3 finished first
response = receive() # {tid: 1, data: ...} ← request 1 finished second
response = receive() # {tid: 2, data: ...} ← request 2 finished last
Why it matters: Allows parallelism. Without TIDs, you’d have to wait for request 1 to complete before issuing request 2. With TIDs, issue all requests immediately and match responses when they arrive.
See: AXI4 protocol, HBM memory controller
Transistor#
What it is: A semiconductor device that acts as an electrically-controlled switch. The fundamental building block of all digital circuits.
Types:
NMOS: Conducts when gate voltage is high (switch closes with 1)
PMOS: Conducts when gate voltage is low (switch closes with 0)
Software analogy: Like an if statement:
if gate_voltage:
output = input # Transistor "on" (conducting)
else:
output = disconnected # Transistor "off" (not conducting)
Scale: Modern FPGAs contain billions of transistors. A 6T SRAM cell has 6 transistors, a DRAM cell has 1 transistor.
Why it matters: Everything in hardware—logic gates, memory, CPUs—is built from transistors.
See: hardware_map.md Section 1 (SRAM cells, DRAM cells)
U#
URAM (UltraRAM)#
What it is: High-density on-chip DRAM-like memory blocks in Xilinx UltraScale+ FPGAs. Faster and denser than BRAM.
Specifications:
Capacity: 4.5 MB total (16 banks × 288 KB per bank)
Latency: 1 cycle @ 450 MHz (~2 ns)
Technology: 1T1C DRAM-like cells (denser than BRAM’s 6T SRAM)
Organization: 16 banks, each 4096 words × 72 bits
Software analogy: Like L2 cache—larger than L1 (BRAM) but still much faster than main memory (HBM).
Usage in neuromorphic system:
Stores neuron membrane potentials (131,072 neurons × 36 bits)
Each bank holds 8,192 neurons (1 neuron group)
Dual-neuron packing: 2 neurons per 72-bit word ([71:36]=upper, [35:0]=lower)
Why it matters: URAM is the perfect middle ground—larger than BRAM, faster than HBM. Ideal for neuron state storage (frequent access, moderate size).
See: Chapter 1 Section 1.1, Chapter 2 Section 2.2 (internal_events_processor)
Memory Hierarchy Summary#
From fastest to slowest:
Memory |
Capacity |
Latency |
Bandwidth |
Use Case |
|---|---|---|---|---|
Registers |
~1 KB |
0 cycles |
N/A |
Pipeline state |
URAM |
4.5 MB |
1 cycle (2 ns) |
~200 GB/s |
Neuron states |
BRAM |
1 MB |
3 cycles (13 ns) |
~50 GB/s |
Spike masks |
HBM |
8 GB |
100-200 ns |
400 GB/s |
Synaptic weights |
Host DDR4 |
64+ GB |
1-10 µs |
20 GB/s |
Long-term storage |
Software analogy:
Registers = CPU registers
URAM = L1 cache
BRAM = L2 cache
HBM = Main RAM
Host DDR4 = Disk/SSD
Key Concepts Summary#
Pipelining#
Break operations into stages, overlap execution. Latency stays same, throughput increases.
State Machines#
Sequential control logic stepping through states (IDLE → RUNNING → DONE).
Hazard Detection#
Prevent read-modify-write conflicts by tracking pipeline addresses.
Clock Domain Crossing#
Synchronize data between different clock frequencies using async FIFOs.
Handshake Protocol#
VALID && READY ensures reliable data transfer with flow control.
Round-Robin Arbitration#
Fair scheduling by cycling through requesters in order.
Backpressure#
Downstream signals “I’m full” to pause upstream sender.
Notation Conventions#
Throughout the documentation, you’ll see:
Hexadecimal:
0x1234or0xABCDBinary:
0b1010or4'b1010(4-bit binary)Bit ranges:
[31:0]means bits 31 down to 0 (32 bits total)Bit indexing:
data[7]means bit 7 of dataActive-low signals:
resetn(n suffix) means 0=active, 1=inactiveRegister notation:
reg [7:0] counter= 8-bit register named counter
Further Reading#
For deeper hardware understanding:
AXI4 specification: ARM IHI0022 (AXI protocol)
Xilinx UltraScale+ architecture: UG574 (FPGA fabric, memory)
PCIe specification: PCI-SIG PCIe Base 3.0
Digital design fundamentals: “Digital Design and Computer Architecture” by Harris & Harris
For neuromorphic computing:
Spiking neural networks: “Neuronal Dynamics” by Gerstner et al.
Hardware acceleration: “Computer Architecture: A Quantitative Approach” by Hennessy & Patterson
This glossary provides the foundation for understanding the hardware implementation details throughout the documentation. When you encounter an unfamiliar term, refer back to this appendix for clarification and context.