# Appendix: Hardware Glossary for Software Experts

This appendix defines the hardware terms and concepts used throughout the documentation. Where applicable, each entry includes a software analogy to bridge the gap between hardware and software thinking.

---

## A

### Address Passthrough

**What it is:** A technique where a memory controller echoes back the requested address along with the returned data.

**Software analogy:** Like a database query that returns both the data and the primary key you searched for. Useful when multiple reads are in flight and you need to correlate responses with requests.

**Why it matters:** In pipelined hardware, you might issue multiple reads before the first one completes. Address passthrough tells you which read just finished.

---

### Arbiter

**What it is:** Hardware logic that decides which of multiple competing requesters gains access to a shared resource (memory, bus, FIFO).

**Software analogy:** Like a mutex or semaphore, but implemented in logic gates. Multiple threads want the same resource; the arbiter picks one based on priority, round-robin fairness, or another policy.

**Common types:**
- **Priority arbiter:** Always picks the highest-priority requester (like a VIP queue)
- **Round-robin arbiter:** Cycles through requesters fairly (like `itertools.cycle()`)
- **First-come-first-served:** Picks whoever requested first (like a queue)

**Why it matters:** Prevents conflicts when multiple hardware modules try to access the same memory or bus simultaneously.

---

### AXI4 (Advanced eXtensible Interface)

**What it is:** A memory-mapped communication protocol designed by ARM for on-chip data transfer. It's the "language" FPGA modules speak to transfer data.

**Software analogy:** Like a REST API for hardware: it defines request/response formats, handshake protocols, and data structures. Instead of HTTP methods (GET, POST), you have read and write transactions.

**Five channels:**
1. **Read Address (AR):** "I want to read from address X"
2. **Read Data (R):** "Here's the data you requested"
3. **Write Address (AW):** "I want to write to address X"
4. **Write Data (W):** "Here's the data to write"
5. **Write Response (B):** "Write complete, here's the status"

**Key signals:**
- `VALID`: "I have valid data/request"
- `READY`: "I'm ready to accept data/request"
- **A transfer occurs only when both are high (handshake)**

**Why it matters:** All on-chip communication in the FPGA uses AXI4. Understanding it is like understanding HTTP for web development.

**See:** hardware_map.md Section 4, Chapter 1 (DMA transfers)

---

### AXI4 Master

**What it is:** A hardware module that initiates AXI4 transactions (sends read/write requests).

**Software analogy:** Like an HTTP client that makes requests to servers. The FPGA's DMA engine is a master; it requests data from host memory.

**Why it matters:** Determines who "drives" the bus. Masters control data flow; slaves respond.

---

### AXI4 Slave

**What it is:** A hardware module that responds to AXI4 transactions (serves read/write requests).

**Software analogy:** Like an HTTP server that responds to client requests. Host memory is a slave; it serves data when the FPGA requests it.

**Why it matters:** Slaves must be ready to handle requests at any time. They implement the actual storage or computation.

---

### Axon

**What it is:** In neuromorphic systems, an external input to the network. In hardware terms, axons have no internal state (no membrane potential); they are just spike sources.

**Software analogy:** Like a publisher in a pub/sub system. When an axon fires, it publishes a spike event that gets delivered to all subscribed neurons.

**Storage:** Axon spike masks are stored in BRAM as one-hot bit vectors (256 bits per row).

**Why it matters:** Distinguishing axons (stateless inputs) from neurons (stateful computations) determines memory organization and processing flow.

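
The spike-mask storage described above can be pictured with a small Python sketch. It is only an illustration: the mapping from axon ID to row and bit position (`row = id // 256`, `bit = id % 256`) and the helper names are assumptions made here, not something the source specifies.

```python
ROW_BITS = 256  # bits per BRAM row, as stated in the Axon entry


def build_spike_masks(firing_axons, num_rows):
    """Pack firing axon IDs into one integer bitmask per row.

    Assumed (hypothetical) mapping: axon i -> row i // 256, bit i % 256.
    """
    rows = [0] * num_rows
    for axon_id in firing_axons:
        row, bit = divmod(axon_id, ROW_BITS)
        if row >= num_rows:
            raise ValueError(f"axon {axon_id} does not fit in {num_rows} rows")
        rows[row] |= 1 << bit
    return rows


def firing_axons_in(rows):
    """Recover the sorted axon IDs whose bits are set in the mask rows."""
    return [
        r * ROW_BITS + b
        for r, mask in enumerate(rows)
        for b in range(ROW_BITS)
        if (mask >> b) & 1
    ]


masks = build_spike_masks([0, 5, 257], num_rows=2)
```

Because axons carry no state, a set bit is all that needs to be stored per axon per timestep; clearing the row resets every axon in it at once.
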
**See:** Chapter 1 (BRAM organization), Chapter 2 (Phase 1 processing)

---

## B

### Backpressure

**What it is:** A flow control mechanism where a downstream module signals "I'm full, stop sending data" to an upstream module.

**Software analogy:** Like TCP windowing or a bounded queue's `.full()` flag. When a buffer fills up, backpressure tells the sender to pause.

**Implementation:** Typically via a `FULL` or `!READY` signal. When asserted, upstream must stop writing.

**Why it matters:** Prevents data loss when fast producers overwhelm slow consumers. Critical in pipelined systems.

**See:** Chapter 2 (FIFO interfaces), hardware_map.md (AXI4 handshakes)

---

### Beat

**What it is:** A single data transfer within a burst transaction. A burst is composed of multiple beats.

**Software analogy:** Like a single element in a batch API request. If you request 16 records in one query, each record is one "beat."

**Example:** A 16-beat burst transfers 16 data words sequentially.

**Why it matters:** Bursts amortize transaction overhead. Instead of 16 separate requests (expensive), send one request for 16 beats (efficient).

---

### BRAM (Block RAM)

**What it is:** Fast on-chip SRAM built into the FPGA fabric. Organized into dual-port blocks (can read and write simultaneously from different ports).

**Specifications:**
- **Capacity:** 1 MB total (256 rows × 256 bits per row)
- **Latency:** 3 clock cycles @ 225 MHz (~13 ns)
- **Technology:** 6-transistor SRAM cells (fast, power-hungry)

**Software analogy:** Like L1 cache: very fast, very expensive (in silicon area), small capacity. Use for frequently accessed data.

**Usage in neuromorphic system:**
- Stores axon spike masks (one-hot encoded, 256 bits per row)
- Dual-port allows host PC to write new spikes while FPGA reads current spikes

**Why it matters:** BRAM is the fastest memory accessible to FPGA logic (except registers). Ideal for low-latency, small datasets.

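
The latency figure quoted above follows directly from the clock frequency. The short Python sketch below reproduces the arithmetic; the helper name `cycles_to_ns` is introduced here purely for illustration.

```python
def cycles_to_ns(cycles, clock_hz):
    """Convert a latency expressed in clock cycles to nanoseconds."""
    return cycles / clock_hz * 1e9


# 3 cycles at 225 MHz, the BRAM read latency quoted above
bram_latency_ns = cycles_to_ns(3, 225e6)

# One cycle at 225 MHz, useful as a unit when reading timing diagrams
cycle_ns = cycles_to_ns(1, 225e6)

print(f"BRAM latency: {bram_latency_ns:.1f} ns, one cycle: {cycle_ns:.2f} ns")
```

The same helper applies to any other cycle count in this glossary once the clock domain is known.
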
**See:** Chapter 1 Section 1.1, Chapter 2 Phase 0

---

### Burst

**What it is:** An AXI4 transaction that transfers multiple consecutive data words in a single request.

**Software analogy:** Like a batch database query. Instead of:

```python
for i in range(16):
    data[i] = read(addr + i)  # 16 separate requests
```

You do:

```python
data = read_burst(addr, length=16)  # 1 request, 16 data transfers
```

**Why it matters:** Reduces protocol overhead. Each request has setup time (address, handshake); bursts amortize this over multiple data words.

**See:** hardware_map.md (AXI4 timing), Chapter 1 (HBM bursts)

---

### Burst Length

**What it is:** The number of beats (data words) in a burst transaction.

**AXI4 limits:** 1-256 beats for INCR bursts.

**Example:** Burst length 16 means the transaction transfers 16 consecutive data words.

**Why it matters:** Longer bursts → higher bandwidth efficiency, but require contiguous addresses.

---

## C

### Clock Domain

**What it is:** A region of a circuit where all flip-flops are clocked by the same clock signal. Everything in a clock domain updates simultaneously on clock edges.

**Software analogy:** Like a thread with its own event loop. All state updates happen synchronously at clock ticks (like `await` points in async Python).

**Example in neuromorphic system:**
- **225 MHz domain:** BRAM, HBM interfaces, most control logic
- **450 MHz domain:** URAM, internal events processor (2x faster for neuron updates)

**Why it matters:** Data transfer between clock domains requires special synchronization (FIFOs with gray code) to avoid metastability.

**See:** hardware_map.md (clock generation), Chapter 2 (dual clock operation)

---

### Clock Domain Crossing (CDC)

**What it is:** Transferring data from one clock domain to another (e.g., 225 MHz → 450 MHz).

**Software analogy:** Like passing data between threads with different event loops. You need thread-safe mechanisms (mutex, queue) to avoid race conditions.

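
The Gray code mentioned under Clock Domain has one property that makes it suitable for crossing domains: consecutive counter values differ in exactly one bit, so a pointer sampled mid-transition is off by at most one count. A minimal Python sketch of the standard binary-reflected Gray code (the function names are chosen here for illustration) shows this:

```python
def to_gray(n):
    """Binary-reflected Gray code of a non-negative integer."""
    return n ^ (n >> 1)


def from_gray(g):
    """Invert to_gray by XOR-folding successively shifted copies."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n


# A 3-bit counter as it would appear after Gray encoding
codes = [to_gray(i) for i in range(8)]

# Number of bits that flip on each increment, including the 7 -> 0 wrap
flips = [bin(codes[i] ^ codes[(i + 1) % 8]).count("1") for i in range(8)]
```

A plain binary counter, by contrast, flips three bits going from 3 (`011`) to 4 (`100`), and a receiver sampling during that change could see any mixture of old and new bits.
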
**Hardware solution:** Asynchronous FIFOs with gray-code counters to prevent metastability from corrupting the transferred data.

**Why it matters:** Improper CDC causes random bit flips, data corruption, or circuit hangs. Must use proven synchronization techniques.

**See:** hardware_map.md Section 3 (FIFO synchronization)

---

### Combinational Logic

**What it is:** Digital circuits where outputs depend only on current inputs (no memory, no state). Examples: AND gates, multiplexers, adders.

**Software analogy:** Pure functions in functional programming: same inputs always produce same outputs, no side effects, no state.

```python
def combinational(a, b):
    return a & b  # Output depends only on a, b (no self.state)
```

**Verilog:**

```verilog
// `output` is a reserved word in Verilog, so the signal is named `out`
assign out = (a & b) | c;  // Combinational: out updates whenever a, b, or c changes
```

**Why it matters:** Combinational logic is fast (no clock delay) but stores nothing. Contrast with sequential logic (flip-flops, registers), which stores state.

**See:** hardware_map.md (logic gates), Chapter 2 (state machines use both)

---

### Command Opcode

**What it is:** An 8-bit identifier in PCIe command packets that specifies the operation type.

**Software analogy:** Like HTTP methods (GET, POST, PUT, DELETE) or opcodes in assembly language (MOV, ADD, JMP).

**Example opcodes in neuromorphic system:**
- `0x01`: Write to BRAM (axon spikes)
- `0x02`: Write to HBM (network structure)
- `0x03`: Write to URAM (neuron states)
- `0x04`: Read from URAM
- `0xC8`: Execute (run one timestep)

**Why it matters:** The command interpreter decodes opcodes to route data to the correct hardware modules.

**See:** Chapter 1 Section 1.2, Chapter 2 Phase 0

---

## D

### DMA (Direct Memory Access)

**What it is:** A technique where an FPGA (or peripheral) transfers data to/from host memory without involving the CPU. The FPGA becomes a "bus master" and directly reads/writes RAM.

**Software analogy:** Like `mmap()` or shared memory.
Instead of:

```python
# Slow: CPU copies data
data = host_memory.read()
fpga.write(data)
```

You have:

```python
# Fast: FPGA directly accesses host memory
fpga.dma_read(host_address, size)
```

**Why it matters:** DMA bypasses the CPU bottleneck, achieving 10-100x higher bandwidth. Essential for large data transfers (network initialization, spike outputs).

**How it works:**
1. FPGA sends PCIe TLP (Transaction Layer Packet) with memory address
2. Host memory controller responds with data
3. No CPU involvement: the transfer is fully offloaded

**See:** Chapter 1 Section 1.2 (initialization), hardware_map.md Section 2 (PCIe)

---

### Double Buffering

**What it is:** Using two buffers that alternate roles: while one is being read (consumed), the other is being written (produced).

**Software analogy:**

```python
class DoubleBuffer:
    def __init__(self):
        self.buffer_a = [0] * 1000
        self.buffer_b = [0] * 1000
        self.active = 'a'  # Which buffer is being read

    def swap(self):
        # Swap roles: reader becomes writer, writer becomes reader
        self.active = 'b' if self.active == 'a' else 'a'
```

**Usage in neuromorphic system:** External events processor uses double-buffered BRAMs:
- **Present BRAM:** Being read/cleared (processing current timestep spikes)
- **Future BRAM:** Being written (collecting spikes for next timestep)
- After execute, they swap roles

**Why it matters:** Enables pipelining by overlapping computation with I/O. No need to wait for reads to finish before accepting new writes.

**See:** Verilog sources (external_events_processor_simple.v)

---

### DRAM (Dynamic RAM)

**What it is:** Memory technology that stores bits as charge in capacitors. "Dynamic" because charge leaks, requiring periodic refresh.

**Structure:** 1 transistor + 1 capacitor per bit (1T1C). Compact but slow.

**Software analogy:** Like disk storage: high capacity, slow access, requires maintenance (refresh = defragmentation).

**Types:**
- **DDR4:** Desktop/server RAM (~20 GB/s per channel)
- **HBM2:** 3D-stacked high-bandwidth RAM (~400 GB/s, used in neuromorphic system)

**Contrast with SRAM:**
- DRAM: 1T1C, dense, slow (50-100 ns), needs refresh
- SRAM: 6T, fast (1-3 ns), expensive, no refresh

**Why it matters:** DRAM provides bulk storage (8 GB HBM) for synaptic weights. Too slow for direct neuron computation (hence URAM for neuron states).

**See:** hardware_map.md Section 1.3 (HBM structure)

---

## E

### execRun_ctr (Execution Run Counter)

**What it is:** A hardware register that counts timesteps. Increments after each `execute()` command completes.

**Software analogy:**

```python
class NeuromorphicEngine:
    def __init__(self):
        self.timestep = 0  # Like execRun_ctr

    def execute(self):
        # Run one timestep
        self.timestep += 1
```

**Why it matters:** Used to tag spike events with their timestep. When spikes return to host, you know when they occurred.

**See:** Chapter 2 (spike packet format)

---

### execRun_timer (Execution Run Timer)

**What it is:** A hardware counter that measures clock cycles elapsed during a timestep.

**Software analogy:**

```python
import time

start = time.perf_counter()
execute()  # Run one timestep (defined elsewhere)
elapsed_cycles = (time.perf_counter() - start) * clock_frequency  # clock_frequency in Hz
```

**Why it matters:** Profiling tool. Tells you how many clock cycles (and thus microseconds) a timestep took. Useful for optimization.

---

## F

### FIFO (First-In-First-Out)

**What it is:** A hardware buffer that stores data in the order received and outputs it in the same order.

**Software analogy:** `queue.Queue()` in Python: `.put()` adds to the tail, `.get()` removes from the head.

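
The `queue.Queue` analogy can be made concrete with a bounded queue, which behaves like a FIFO of fixed depth. This is a minimal sketch using only the Python standard library; the depth of 2 and the item names are arbitrary choices for illustration.

```python
import queue

fifo = queue.Queue(maxsize=2)      # A FIFO that can hold two entries
fifo.put_nowait("spike_a")
fifo.put_nowait("spike_b")
is_full = fifo.full()              # Hardware would assert its `full` flag here

try:
    fifo.put_nowait("spike_c")     # Attempt to write into a full FIFO
    overflowed = False
except queue.Full:
    overflowed = True              # Software raises; hardware must stall the writer

first = fifo.get_nowait()          # Entries come out in arrival order
second = fifo.get_nowait()
is_empty = fifo.empty()            # Hardware would assert its `empty` flag here
```

The difference from hardware is in what happens on overflow: Python raises an exception, whereas a hardware producer is expected to check `full` before asserting its write enable.
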
**Hardware interface:**

```verilog
// Port directions are from the point of view of the module using the FIFO
// Write side
input  full,      // 1 = FIFO full, cannot write
output wren,      // 1 = write data_in this cycle
output data_in,   // Data to write
// Read side
input  empty,     // 1 = FIFO empty, no data
output rden,      // 1 = read data_out this cycle
input  data_out,  // Data read
```

**FWFT mode (First-Word Fall-Through):** `data_out` always shows the next item (zero latency). Like peeking at the head of a queue without removing it.

**Why it matters:** FIFOs decouple clock domains, buffer mismatched rates (fast producer, slow consumer), and enable pipelining.

**See:** hardware_map.md Section 3, Chapter 2 (pointer FIFOs, spike FIFOs)

---

### Flip-Flop

**What it is:** A basic sequential logic element that stores one bit. Updates on a clock edge.

**Software analogy:** Like a single-bit instance variable:

```python
class FlipFlop:
    def __init__(self):
        self.q = 0  # Current state

    def clock_edge(self, d):
        self.q = d  # Update state on clock tick
```

**Verilog:**

```verilog
always @(posedge clk) begin
    q <= d;  // q updates to d on rising clock edge
end
```

**D-type flip-flop:** Most common. Has one data input (D) and one output (Q). On clock edge, Q ← D.

**Why it matters:** Flip-flops are the fundamental unit of memory in digital circuits. Registers, counters, and state machines are all built from flip-flops.

**See:** hardware_map.md Section 1.2 (FPGA fabric)

---

### FPGA (Field-Programmable Gate Array)

**What it is:** A chip containing millions of configurable logic blocks (LUTs, flip-flops) connected by programmable routing. You write code (Verilog/VHDL) that gets "compiled" into a configuration bitstream that rewires the chip.

**Software analogy:** Imagine if you could rewrite the CPU's microarchitecture at runtime. An FPGA is like a blank canvas where you design custom hardware for your application.

**Neuromorphic system uses:** Xilinx Virtex UltraScale+ VU37P (XCVU37P)
- 1.1 million LUTs (programmable logic gates)
- 2.2 million flip-flops (registers)
- 50 MB BRAM (on-chip cache)
- 4.5 Mb URAM (ultra-dense cache)
- 16 lanes PCIe Gen3 (host communication)
- 32 HBM2 channels (memory bandwidth)

**Why it matters:** FPGAs offer specialized, massively parallel computation. Neuromorphic networks need to update 100,000+ neurons in microseconds, which is impractical on a CPU but feasible on an FPGA.

**See:** hardware_map.md Section 1.2, Chapter 1 (compilation)

---

### FWFT (First-Word Fall-Through)

**What it is:** A FIFO operating mode where the output data is always valid (showing the next item) without requiring a read strobe first.

**Normal FIFO:**

```python
if not fifo.empty():
    fifo.read_enable = True  # Request data
    # Wait 1 cycle
    data = fifo.data_out     # Data appears next cycle
```

**FWFT FIFO:**

```python
if not fifo.empty():
    data = fifo.data_out     # Data already available (0 latency)
    fifo.read_enable = True  # Advance to next item
```

**Why it matters:** Reduces latency by 1 cycle. Important in high-speed pipelines where every cycle counts.

---

## H

### Handshake

**What it is:** A two-signal protocol (`VALID` and `READY`) where data transfer occurs only when both signals are high.

**Protocol:**
- **Sender:** Asserts `VALID` when data is available, holds data stable
- **Receiver:** Asserts `READY` when it can accept data
- **Transfer occurs:** When `VALID && READY` on clock edge

**Software analogy:**

```python
while True:
    sender.valid = sender.has_data()
    receiver.ready = not receiver.is_full()
    if sender.valid and receiver.ready:
        receiver.data = sender.data  # Transfer!
        sender.pop()
        break
```

**Why it matters:** Prevents data loss. Sender can't send until receiver is ready; receiver doesn't miss data because sender holds it stable.

**See:** hardware_map.md (AXI4 channels), Chapter 2 (HBM read valid/ready)

---

### Hazard

**What it is:** A conflict in pipelined hardware when consecutive operations access the same resource (typically the same memory address) in ways that could cause data corruption.

**Types:**
- **Read-after-write (RAW):** Read tries to fetch old value before write completes
- **Write-after-read (WAR):** Write might corrupt data before read completes
- **Write-after-write (WAW):** Second write might complete before first

**Software analogy:** Like race conditions in multithreaded code:

```python
# Thread 1
x = x + 1  # Read x, add 1, write back

# Thread 2 (starts during Thread 1)
y = x  # Which value of x do I see?
```

**Hardware solution:** Hazard detection logic checks if addresses match in pipeline stages. If they match, either:
- **Stall:** Pause until pipeline clears
- **Bypass/Forward:** Route fresh data directly from pipeline register

**Why it matters:** URAM has 3-cycle read latency. If you read address A, then write address A within 3 cycles, hazard detection prevents reading stale data.

**See:** Chapter 2 Section 2.2 (internal_events_processor hazard logic)

---

### HBM (High Bandwidth Memory)

**What it is:** 3D-stacked DRAM technology providing extreme bandwidth (~400 GB/s). Multiple DRAM dies stacked vertically, connected by Through-Silicon Vias (TSVs).

**Specifications:**
- **Capacity:** 8 GB (for synaptic connectivity data)
- **Bandwidth:** 400+ GB/s (32 channels × 14 GB/s per channel)
- **Latency:** ~100-200 ns (slow compared to BRAM/URAM, fast compared to DDR4)
- **Organization:** 2 pseudo-channels × 16 banks per channel

**Software analogy:** Like a distributed database with 32 shards. Each shard (channel) serves 256 MB and can handle requests independently (parallel access).

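
The shard analogy above can be sketched in Python using the capacity and bandwidth figures from the specifications. The contiguous address-to-channel mapping shown here is an assumption made for illustration (the source does not say how addresses are distributed across channels), and `locate` is a hypothetical helper.

```python
HBM_BYTES = 8 * 2**30                        # 8 GB total capacity
NUM_CHANNELS = 32
CHANNEL_BYTES = HBM_BYTES // NUM_CHANNELS    # 256 MB per channel


def locate(addr):
    """Return (channel, offset) for a byte address.

    Assumes channels are laid out back to back in the address space,
    which is an illustrative assumption, not a documented mapping.
    """
    if not 0 <= addr < HBM_BYTES:
        raise ValueError(f"address {addr:#x} is outside the 8 GB range")
    return divmod(addr, CHANNEL_BYTES)


channel, offset = locate(0x1000_0040)

# Aggregate bandwidth if all channels are busy at once (GB/s)
peak_gb_per_s = NUM_CHANNELS * 14
```

Under this assumed layout, requests that fall in different 256 MB windows go to different channels and can proceed in parallel, which is where the aggregate bandwidth comes from.
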
**Usage in neuromorphic system:**
- **Region 1:** Axon pointers (maps axon ID → synapse list address)
- **Region 2:** Neuron pointers (maps neuron ID → synapse list address)
- **Region 3:** Synapses (actual connection data: target neuron + weight)

**Why it matters:** Stores the entire network structure (billions of synapses for large networks). Bandwidth supports updating millions of neurons per millisecond.

**See:** hardware_map.md Section 1.3, Chapter 1 Section 1.1, Chapter 2 Phase 1

---

## I

### INCR Burst

**What it is:** An AXI4 burst type where addresses increment sequentially for each beat.

**Example:**

```
Burst start address: 0x1000
Burst length: 4
Addresses accessed: 0x1000, 0x1004, 0x1008, 0x100C (assuming 32-bit words)
```

**Contrast with WRAP bursts:** Addresses wrap around within a boundary (used for cache lines).

**Why it matters:** Simplest burst type. Used for sequential data access (reading synapse lists, writing neuron states).

---

## L

### Latency

**What it is:** The time delay between when a request is issued and when the response is received.

**Software analogy:** Like ping time in networking, or function call overhead.

**Examples in neuromorphic system:**
- **BRAM latency:** 3 cycles @ 225 MHz = ~13 ns
- **URAM latency:** 1 cycle @ 450 MHz = ~2 ns
- **HBM latency:** ~100-200 ns
- **PCIe latency:** ~1-10 µs (depends on packet size, distance)

**Why it matters:** Latency determines pipeline depth. 3-cycle BRAM latency means you need 3 pipeline stages between read request and data availability.

**Contrast with throughput:** Latency is "time per operation," throughput is "operations per time."

**See:** hardware_map.md (memory specifications), Chapter 2 (pipeline filling)

---

### LUT (Lookup Table)

**What it is:** A programmable logic element in an FPGA. Typically a 6-input, 1-output function implemented as a 64-bit SRAM.

**How it works:**
- 6 inputs → 2^6 = 64 possible input combinations
- 64-bit SRAM stores the output value for each combination
- LUT acts as a 6-input truth table

**Software analogy:**

```python
# 2-input LUT (simplified)
lut_contents = [0, 0, 0, 1]  # 4-bit SRAM for 2 inputs


def lut(a, b):
    index = (a << 1) | b  # Convert inputs to index
    return lut_contents[index]

# This LUT implements: output = a AND b
# 00 → 0, 01 → 0, 10 → 0, 11 → 1
```

**Why it matters:** LUTs are the fundamental building blocks of FPGA logic. Any combinational function of up to 6 inputs can be implemented in 1 LUT. Complex functions use multiple LUTs connected together.

**See:** hardware_map.md Section 1.2

---

## M

### Metastability

**What it is:** An undefined state in which a flip-flop's output hovers at an intermediate level or oscillates, and takes an unpredictable time to settle to 0 or 1. It typically arises at clock domain crossings, when setup/hold time requirements are violated.

**Software analogy:** Like a race condition where a variable reads as garbage because two threads wrote simultaneously.

**Cause:** When a flip-flop's input changes too close to the clock edge, the output can become metastable (neither 0 nor 1, or fluctuating).

**Solution:** Multi-stage synchronizers (2-3 flip-flops in series) give time for metastability to resolve. The first flip-flop might be metastable, but the probability that the second flip-flop is also metastable is astronomically low.

**Why it matters:** Metastability causes random bit flips, data corruption, and system crashes. Must use proven synchronization techniques for clock domain crossings (async FIFOs with gray code).

**See:** hardware_map.md Section 3 (FIFO synchronization)

---

## N

### Neuron Group

**What it is:** A set of 8,192 neurons that share a URAM bank and processing pipeline. The neuromorphic system has 16 neuron groups (0-15), totaling 131,072 neurons.

**Organization:**
- **Group 0:** Neurons 0-8,191 (URAM bank 0)
- **Group 1:** Neurons 8,192-16,383 (URAM bank 1)
- ...
- **Group 15:** Neurons 122,880-131,071 (URAM bank 15)

**Why groups?** Parallel processing. All 16 groups can be updated simultaneously (16-way parallelism).

**Software analogy:** Like database sharding. Each shard (neuron group) has its own storage (URAM bank) and compute pipeline.

**Why it matters:** Determines memory addresses, routing logic, and parallelism. Understanding groups is key to understanding performance scaling.

**See:** Chapter 1 Section 1.1, Chapter 2 Section 2.2

---

## P

### PCIe (Peripheral Component Interconnect Express)

**What it is:** A high-speed serial communication bus connecting the CPU to peripherals (GPUs, FPGAs, SSDs). Uses differential signaling over point-to-point links.

**Neuromorphic system:** PCIe Gen3 x16
- **Gen3:** Third generation (8 GT/s per lane)
- **x16:** 16 lanes (parallel connections)
- **Bandwidth:** ~14 GB/s in each direction

**Software analogy:** Like USB or Ethernet, but much faster and lower latency. From software, you use it via memory-mapped I/O or DMA.

**How it works:**
1. **Physical layer:** Differential pairs (16 lanes × 2 directions = 32 pairs)
2. **Data link layer:** Packets with CRC error detection
3. **Transaction layer:** TLP (Transaction Layer Packets) for reads/writes

**Why it matters:** PCIe is the bridge between host CPU (software) and FPGA (hardware). All data transfer (inputs, outputs, configuration) flows through PCIe.

**See:** hardware_map.md Section 2, Chapter 1 Section 1.2

---

### Pipeline

**What it is:** A technique where operations are broken into stages, with multiple operations executing simultaneously at different stages.

**Software analogy:**

```python
# No pipeline: 3 cycles per item
for item in data:
    stage1(item)  # 1 cycle
    stage2(item)  # 1 cycle
    stage3(item)  # 1 cycle
# Total: 3N cycles for N items

# Pipeline: 1 cycle per item after initial fill
# Cycle 0: stage1(item[0])
# Cycle 1: stage1(item[1]), stage2(item[0])
# Cycle 2: stage1(item[2]), stage2(item[1]), stage3(item[0])
# Cycle 3: stage1(item[3]), stage2(item[2]), stage3(item[1])
# Total: N+2 cycles for N items (3x speedup for large N)
```

**Example in neuromorphic system:** BRAM read pipeline
- **Cycle 0:** Issue address for read 0
- **Cycle 1:** Issue address for read 1 (read 0 in pipeline stage 1)
- **Cycle 2:** Issue address for read 2 (read 1 in stage 1, read 0 in stage 2)
- **Cycle 3:** Read 0 data available, issue address for read 3

**Why it matters:** Pipelines enable high throughput despite high latency. BRAM has 3-cycle latency but can start a new read every cycle (throughput = 1 read/cycle).

**See:** hardware_map.md (memory timing), Chapter 2 (pipeline filling)

---

### Pipeline Depth

**What it is:** The number of stages in a pipeline, or equivalently, the number of clock cycles between input and output.

**Example:** BRAM has 3-cycle read latency → pipeline depth = 3.

**Why it matters:** Determines how many cycles you must wait before the first result. Also affects hazard detection (must check all pipeline stages for address conflicts).

---

### Pointer Chain

**What it is:** A pointer-indirection structure in HBM, similar to an adjacency list: each neuron/axon has a pointer to the start of its synapse list.

**Structure:**

```
Axon a0 → Pointer: 0x00080000 (address of synapse list in HBM)
HBM[0x00080000] = [synapse_0, synapse_1, ..., synapse_4]
Each synapse = {target: h0, weight: 1000}
```

**Software analogy:**

```python
# Like a dictionary of lists
synapses = {
    'a0': [('h0', 1000), ('h1', 1000), ...],
    'a1': [('h0', 1000), ('h1', 1000), ...],
}

# In hardware, stored as pointers:
pointers = {'a0': 0x80000, 'a1': 0x80005}
memory = {
    0x80000: [('h0', 1000), ('h1', 1000), ...],
    0x80005: [('h0', 1000), ('h1', 1000), ...],
}
```

**Why it matters:** Enables variable fan-out (some neurons have 10 synapses, others 10,000). Pointer indirection allows efficient memory usage.

**See:** Chapter 1 Section 1.1 (HBM layout), Chapter 2 Phase 1

---

### Priority Arbitration

**What it is:** An arbitration scheme where high-priority requesters always win over low-priority requesters.

**Software analogy:** Like a priority queue or VIP line.

**Example:** If both CPU and DMA request the bus, CPU (high priority) always wins.

**Disadvantage:** Low-priority requesters can starve (never get access).

**See also:** Round-robin (fair alternative)

---

## R

### Read Latency

**What it is:** The number of clock cycles from when a read address is issued to when the data becomes valid.

**Examples:**
- **BRAM:** 3 cycles
- **URAM:** 1 cycle
- **HBM:** ~100-200 ns (roughly 23-45 cycles at 225 MHz)

**Why it matters:** Determines pipeline depth and minimum loop iteration time.

**See:** Memory specifications throughout documentation

---

### Register

**What it is:** A storage element (or group of flip-flops) that holds a multi-bit value.

**Software analogy:** Like an instance variable in a class.

**Verilog:**

```verilog
reg [7:0] counter;  // 8-bit register

always @(posedge clk) begin
    counter <= counter + 1;  // Updates on every clock edge
end
```

**Why it matters:** Registers store state in sequential circuits. Counters, addresses, and data buffers are all implemented as registers.

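
The 8-bit counter in the Register entry can be modelled in Python to show one behaviour that differs from ordinary integers: a fixed-width register silently wraps when it overflows. The `Register` class below is a hypothetical software model written for this illustration, not part of the system's code.

```python
class Register:
    """Software model of a `width`-bit register updated on clock edges."""

    def __init__(self, width):
        self.mask = (1 << width) - 1  # e.g. 0xFF for 8 bits
        self.q = 0                    # Current stored value

    def clock(self, d):
        # Hardware keeps only the low `width` bits of whatever is driven in
        self.q = d & self.mask


counter = Register(8)
for _ in range(256 + 3):           # More increments than an 8-bit value can hold
    counter.clock(counter.q + 1)   # Equivalent of: counter <= counter + 1
```

After 259 clock edges the model holds 3 rather than 259, mirroring the Verilog counter, which returns to 0 after reaching 255.
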
---

### Register Slice

**What it is:** A pipeline stage inserted into a data path to improve timing (break long combinational paths).

**Software analogy:** Like adding intermediate variables to break up a complex expression:

```python
# Hard to optimize (long dependency chain)
result = f(g(h(i(j(x)))))

# Easier (can pipeline/parallelize)
temp1 = j(x)
temp2 = i(temp1)
temp3 = h(temp2)
temp4 = g(temp3)
result = f(temp4)
```

**Why it matters:** High-speed interfaces (450 MHz) require short logic paths. Register slices reduce maximum combinational delay, allowing higher clock frequencies.

**See:** internal_events_processor (450 MHz URAM access)

---

### RMW (Read-Modify-Write)

**What it is:** A pattern where you read a value, modify it, and write it back.

**Software analogy:**

```python
x = memory[addr]  # Read
x = x + 1         # Modify
memory[addr] = x  # Write
```

**Hardware challenges:** With 3-cycle read latency, must ensure no other operation accesses the same address during the RMW sequence (hazard detection).

**Example in neuromorphic system:** Masked URAM writes
- Read 72-bit word containing 2 neurons
- Modify upper 36 bits (neuron voltage)
- Write back full 72-bit word

**Why it matters:** Common pattern requiring careful hazard management.

**See:** Chapter 2 Section 2.2 (internal_events_processor)

---

### Round-Robin

**What it is:** A fair arbitration algorithm that cycles through requesters in order, giving each a turn.

**Software analogy:**

```python
# has_request() and service() are placeholders for the arbiter's inputs and grant
requesters = [0, 1, 2, 3, 4, 5, 6, 7]
current = 0
while True:
    if has_request(requesters[current]):
        service(requesters[current])
    current = (current + 1) % len(requesters)  # Cycle: 0→1→2→...→7→0
```

**Verilog:**

```verilog
reg [2:0] addr;  // 3-bit counter for 8 requesters

always @(posedge clk) begin
    addr <= addr + 1;  // Automatically wraps 7→0
end
```

**Usage in neuromorphic system:**
- **Pointer FIFO controller:** 16-way round-robin (checks ptr0, ptr1, ..., ptr15, ptr0, ...)
- **Spike FIFO controller:** 8-way round-robin (checks spk0, spk1, ..., spk7, spk0, ...)

**Why it matters:** Ensures fairness, so no requester starves. Every requester gets regular turns regardless of activity.

**See:** Chapter 2 (FIFO controllers), Verilog sources

---

## S

### Sequential Logic

**What it is:** Digital circuits with memory: outputs depend on both current inputs and past state.

**Software analogy:** Objects with instance variables:

```python
class Counter:
    def __init__(self):
        self.count = 0  # State

    def increment(self):
        self.count += 1  # Output depends on previous state
```

**Verilog:**

```verilog
reg [7:0] count;

always @(posedge clk) begin
    count <= count + 1;  // State updates on clock edges
end
```

**Building blocks:** Flip-flops, registers, state machines, counters, shift registers.

**Contrast with combinational logic:** Combinational has no memory (pure functions).

**Why it matters:** All computation with memory/state requires sequential logic.

**See:** hardware_map.md (flip-flops), Chapter 2 (state machines)

---

### Sign Extension

**What it is:** Expanding a signed integer to a wider bit width by replicating the sign bit.

**Example:**

```
8-bit signed:  11111010 (-6 in two's complement)
16-bit signed: 11111111 11111010 (still -6)
```

**Rule:** Copy the most significant bit (sign bit) into all new upper bits.

**Why it matters:** Prevents corruption when mixing different bit widths in arithmetic. Hardware often needs to extend weights (16-bit) to match neuron voltages (36-bit).

**See:** Chapter 2 (synaptic accumulation)

---

### Spike

**What it is:** An action potential: a neuron firing event that occurs when membrane potential crosses threshold.

**In hardware:** Represented as an 18-bit value, a valid bit followed by a 17-bit neuron address: `{valid_bit, neuron_address[16:0]}`

**Example:** Neuron 5 spikes → address field = `0x00005`, valid bit = 1

**Why it matters:** Spikes are the fundamental events in spiking neural networks. All computation revolves around spike propagation and accumulation.

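
A short Python sketch shows how such a spike word could be packed and unpacked. It assumes the `{valid_bit, neuron_address[16:0]}` concatenation places the valid bit directly above a 17-bit address; the function names are hypothetical and used only for illustration.

```python
ADDR_BITS = 17                      # neuron_address[16:0]
ADDR_MASK = (1 << ADDR_BITS) - 1    # 0x1FFFF


def pack_spike(neuron, valid=True):
    """Build a spike word with the valid bit above the address (assumed layout)."""
    if not 0 <= neuron <= ADDR_MASK:
        raise ValueError(f"neuron {neuron} does not fit in {ADDR_BITS} bits")
    return (int(valid) << ADDR_BITS) | neuron


def unpack_spike(word):
    """Split a spike word back into (valid, neuron_address)."""
    return bool((word >> ADDR_BITS) & 1), word & ADDR_MASK


word = pack_spike(5)                # Neuron 5 fires
valid, neuron = unpack_spike(word)
```

A 17-bit address field is just wide enough to index all 131,072 neurons (2^17) across the 16 neuron groups.
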
**See:** Throughout documentation

---

### SRAM (Static RAM)

**What it is:** Memory technology using 6 transistors per bit (6T). "Static" because it holds state as long as powered (no refresh needed).

**Characteristics:**
- **Fast:** 1-3 ns access time
- **Expensive:** 6 transistors per bit (vs. 1T1C for DRAM)
- **Low density:** Large silicon area

**Examples:** CPU caches (L1, L2), FPGA BRAM

**Contrast with DRAM:**
- SRAM: 6T, fast, expensive, no refresh
- DRAM: 1T1C, slow, cheap, needs refresh

**Why it matters:** SRAM is used for speed-critical data (BRAM for spike masks). DRAM is used for bulk storage (HBM for synapses).

**See:** hardware_map.md Section 1.1 (BRAM), Section 1.3 (HBM comparison)

---

### State Machine

**What it is:** A sequential circuit that transitions through a defined set of states based on inputs and current state.

**Software analogy:**

```python
class StateMachine:
    def __init__(self):
        self.state = 'IDLE'

    def update(self, event):  # `event` avoids shadowing the built-in input()
        if self.state == 'IDLE' and event == 'start':
            self.state = 'RUNNING'
        elif self.state == 'RUNNING' and event == 'done':
            self.state = 'IDLE'
```

**Verilog:**

```verilog
reg [1:0] state;
localparam IDLE = 0, RUNNING = 1, DONE = 2;

always @(posedge clk) begin
    case (state)
        IDLE:    if (start) state <= RUNNING;
        RUNNING: if (done)  state <= DONE;
        DONE:    state <= IDLE;
        default: state <= IDLE;  // Recover from the unused encoding 3
    endcase
end
```

**Example in neuromorphic system:** external_events_processor
- **IDLE:** Waiting for execute command
- **FILL_PIPE:** Filling BRAM read pipeline (3 cycles)
- **READ:** Reading spike masks and fetching synapses
- **DONE:** All spikes processed

**Why it matters:** State machines coordinate complex multi-step operations. Almost every hardware module has at least one state machine for control flow.

**See:** Chapter 2 Section 2.2 (Verilog state machines)

---

### Synapse

**What it is:** A connection between two neurons with an associated weight. When the presynaptic neuron spikes, the postsynaptic neuron's voltage changes by the synaptic weight (the weight is signed).


**Hardware representation:** 32-bit value
- **Bits [31:29]:** Opcode (usually 0 for normal synapse)
- **Bits [28:16]:** Target neuron address (13 bits)
- **Bits [15:0]:** Synaptic weight (signed 16-bit integer)

**Example:** `0x0000_03E8` = target neuron 0, weight 1000

**Storage:** Synapses stored in HBM (billions of them for large networks).

**Why it matters:** Synapses define the network structure. All learning involves modifying synaptic weights.

**See:** Chapter 1 Section 1.1 (HBM synapse region), Chapter 2 Phase 1

---

## T

### Threshold

**What it is:** The membrane potential value at which a neuron fires (spikes).

**Example:** If threshold = 2000 and neuron voltage reaches 2000, the neuron spikes and resets to 0.

**Hardware:** Stored as a 36-bit signed integer in configuration registers.

**Software analogy:**
```python
if neuron.voltage >= threshold:
    neuron.spike()
    neuron.voltage = 0  # Reset
```

**Why it matters:** Determines network sensitivity and dynamics. Lower threshold → more spikes, higher threshold → sparse activity.

**See:** Introduction (neuron model), Chapter 2 (threshold checking)

---

### Throughput

**What it is:** The amount of data processed per unit time.

**Software analogy:** Requests per second (RPS), or bandwidth in networking.

**Examples:**
- **HBM throughput:** 400 GB/s (can transfer 400 billion bytes per second)
- **BRAM throughput:** 1 read per cycle @ 225 MHz = 225 million reads/s
- **Pipeline throughput:** 1 result per cycle (after initial fill)

**Contrast with latency:** Latency = time per operation, throughput = operations per time.

**Relationship:** High latency can still have high throughput if pipelined.

**Why it matters:** Throughput determines how fast you can process large datasets. Latency determines responsiveness for individual operations.

---

### Transaction ID (TID)

**What it is:** An identifier tag attached to requests to allow out-of-order completion.


**Software analogy:**
```python
# Illustrative pseudocode: send_request() and receive() stand in for a memory interface

# Send multiple requests with IDs
send_request(addr=0x1000, tid=1)
send_request(addr=0x2000, tid=2)
send_request(addr=0x3000, tid=3)

# Responses can return in any order
response = receive()  # {tid: 3, data: ...} ← request 3 finished first
response = receive()  # {tid: 1, data: ...} ← request 1 finished second
response = receive()  # {tid: 2, data: ...} ← request 2 finished last
```

**Why it matters:** Allows parallelism. Without TIDs, you'd have to wait for request 1 to complete before issuing request 2. With TIDs, you can issue all requests immediately and match responses as they arrive.

**See:** AXI4 protocol, HBM memory controller

---

### Transistor

**What it is:** A semiconductor device that acts as an electrically controlled switch. The fundamental building block of all digital circuits.

**Types:**
- **NMOS:** Conducts when gate voltage is high (switch closes with 1)
- **PMOS:** Conducts when gate voltage is low (switch closes with 0)

**Software analogy:** Like an `if` statement (NMOS behavior shown):
```python
def nmos_switch(gate_voltage, input_signal):
    if gate_voltage:
        return input_signal  # Transistor "on" (conducting)
    return None              # Transistor "off" (not conducting): output disconnected
```

**Scale:** Modern FPGAs contain billions of transistors. A 6T SRAM cell has 6 transistors; a 1T1C DRAM cell has 1 transistor plus a capacitor.

**Why it matters:** Everything in hardware (logic gates, memory, CPUs) is built from transistors.

**See:** hardware_map.md Section 1 (SRAM cells, DRAM cells)

---

## U

### URAM (UltraRAM)

**What it is:** High-density on-chip memory blocks in Xilinx UltraScale+ FPGAs. Faster and denser than BRAM.

**Specifications:**
- **Capacity:** 4.5 Mb total (16 banks × 288 Kb per bank), which is 576 KB
- **Latency:** 1 cycle @ 450 MHz (~2 ns)
- **Technology:** SRAM-based, like BRAM, but built in larger 288 Kb blocks
- **Organization:** 16 banks, each 4096 words × 72 bits

**Software analogy:** Like L2 cache: larger than L1 (BRAM) but still much faster than main memory (HBM).

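
The organization figures above can be cross-checked with a little arithmetic. The sketch below only multiplies out the stated bank geometry (16 banks of 4096 words × 72 bits); the variable names are illustrative.

```python
BANKS = 16
WORDS_PER_BANK = 4096
BITS_PER_WORD = 72

bits_per_bank = WORDS_PER_BANK * BITS_PER_WORD   # 294,912 bits
total_bits = BANKS * bits_per_bank               # 4,718,592 bits

# Report each size in binary kilobits/megabits and in kilobytes.
print(f"per bank: {bits_per_bank // 1024} Kib = {bits_per_bank // 8 // 1024} KiB")
print(f"total:    {total_bits / 1024**2} Mib = {total_bits // 8 // 1024} KiB")
```

Read this way, 288 and 4.5 are kilobit and megabit figures; in bytes the same geometry is 36 KiB per bank and 576 KiB in total.
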

**Usage in neuromorphic system:**
- Stores neuron membrane potentials (131,072 neurons × 36 bits)
- Each bank holds 8,192 neurons (1 neuron group)
- Dual-neuron packing: 2 neurons per 72-bit word ([71:36]=upper, [35:0]=lower)

**Why it matters:** URAM is a good middle ground: larger than BRAM, faster than HBM. Ideal for neuron state storage (frequent access, moderate size).

**See:** Chapter 1 Section 1.1, Chapter 2 Section 2.2 (internal_events_processor)

---

## Memory Hierarchy Summary

From fastest to slowest:

| Memory | Capacity | Latency | Bandwidth | Use Case |
|--------|----------|---------|-----------|----------|
| **Registers** | ~1 KB | 0 cycles | N/A | Pipeline state |
| **URAM** | 4.5 Mb | 1 cycle (2 ns) | ~200 GB/s | Neuron states |
| **BRAM** | 1 MB | 3 cycles (13 ns) | ~50 GB/s | Spike masks |
| **HBM** | 8 GB | 100-200 ns | 400 GB/s | Synaptic weights |
| **Host DDR4** | 64+ GB | 1-10 µs | 20 GB/s | Long-term storage |

**Software analogy:**
- Registers = CPU registers
- URAM = L1 cache (ranked by latency; the URAM entry compares it to L2 by capacity)
- BRAM = L2 cache
- HBM = Main RAM
- Host DDR4 = Disk/SSD

---

## Key Concepts Summary

### Pipelining
Break operations into stages and overlap their execution. Latency stays the same, throughput increases.

### State Machines
Sequential control logic stepping through states (IDLE → RUNNING → DONE).

### Hazard Detection
Prevent read-modify-write conflicts by tracking pipeline addresses.

### Clock Domain Crossing
Synchronize data between different clock frequencies using async FIFOs.

### Handshake Protocol
`VALID && READY` ensures reliable data transfer with flow control.

### Round-Robin Arbitration
Fair scheduling by cycling through requesters in order.

### Backpressure
Downstream signals "I'm full" to pause upstream sender.

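
The handshake and backpressure summaries above can be illustrated with a small cycle-by-cycle model. This is a sketch under stated assumptions: the `simulate` function, the FIFO depth, and the consumer that drains one entry every third cycle are all hypothetical and chosen only to make the stall visible.

```python
from collections import deque


def simulate(items, fifo_depth=2, drain_every=3, max_cycles=100):
    """Cycle-by-cycle model of a VALID/READY link feeding a bounded FIFO.

    The consumer pops one entry every `drain_every` cycles, so the FIFO
    fills up, READY drops, and the producer stalls (backpressure).
    """
    pending = deque(items)      # data the producer still has to send
    fifo = deque()              # downstream buffer
    received, stalls = [], 0

    for cycle in range(max_cycles):
        if not pending and not fifo:
            break
        valid = bool(pending)               # producer has data to offer
        ready = len(fifo) < fifo_depth      # consumer has room this cycle
        if valid and ready:                 # transfer only when both are high
            fifo.append(pending.popleft())
        elif valid:
            stalls += 1                     # producer holds its data and waits
        if fifo and cycle % drain_every == drain_every - 1:
            received.append(fifo.popleft())  # slow consumer drains one entry
    return received, stalls


received, stalls = simulate([10, 20, 30, 40, 50])
print(received, stalls)
```

No data is dropped or reordered even though the producer is faster than the consumer; the cost shows up only as stall cycles, which disappear if the FIFO is made deep enough to absorb the burst.
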

---

## Notation Conventions

Throughout the documentation, you'll see:

- **Hexadecimal:** `0x1234` or `0xABCD`
- **Binary:** `0b1010` or `4'b1010` (4-bit binary)
- **Bit ranges:** `[31:0]` means bits 31 down to 0 (32 bits total)
- **Bit indexing:** `data[7]` means bit 7 of data
- **Active-low signals:** `resetn` (n suffix) means 0=active, 1=inactive
- **Register notation:** `reg [7:0] counter` = 8-bit register named counter

---

## Further Reading

For deeper hardware understanding:
- **AXI4 specification:** ARM IHI0022 (AXI protocol)
- **Xilinx UltraScale+ architecture:** UG574 (FPGA fabric, memory)
- **PCIe specification:** PCI-SIG PCIe Base 3.0
- **Digital design fundamentals:** "Digital Design and Computer Architecture" by Harris & Harris

For neuromorphic computing:
- **Spiking neural networks:** "Neuronal Dynamics" by Gerstner et al.
- **Hardware acceleration:** "Computer Architecture: A Quantitative Approach" by Hennessy & Patterson

---

This glossary provides the foundation for understanding the hardware implementation details throughout the documentation. When you encounter an unfamiliar term, refer back to this appendix for clarification and context.