# 1.2 How the Network is Written to the FPGA Now that we know *where* everything ends up, let's understand *how* it gets there. This requires understanding how the host computer communicates with the FPGA hardware. ### What is Host-FPGA Communication? Think of the host (your computer running Python) and the FPGA (the specialized chip) as two separate computers that need to talk to each other. They communicate through a physical link called **PCIe** (Peripheral Component Interconnect Express). #### Analogy: Sending Mail Between Buildings Imagine two buildings (Host and FPGA) connected by a mail chute: ``` ┌─────────────────────┐ ┌─────────────────────┐ │ HOST BUILDING │ │ FPGA BUILDING │ │ │ │ │ │ Person (Python) │ PCIe "Mail Chute" │ Mailroom Worker │ │ writes a letter: │ ═══════════════════► │ (pcie2fifos.v) │ │ "Put 0x03E8 at │ │ reads letter │ │ address 0x8000" │ │ and delivers to │ │ │ │ Storage Room (HBM) │ └─────────────────────┘ └─────────────────────┘ ``` **Key differences from actual mail:** 1. **Speed:** PCIe sends "letters" (data packets) at ~14 GB/second 2. **Automation:** Software libraries handle packing/unpacking automatically 3. **Direct Memory Access (DMA):** During initialization, the host pushes data directly to FPGA memory without CPU involvement in each transfer #### The Physical Link: PCIe **PCIe (Peripheral Component Interconnect Express)** is a high-speed serial communication standard. **Physical layer:** 16 wires going from Host motherboard to FPGA card - Each wire (lane) carries 8 Gigabits/second - 16 lanes × 8 Gb/s = 128 Gb/s raw = ~14 GB/s usable bandwidth **What travels on PCIe:** Packets called **TLPs (Transaction Layer Packets)** - Each packet has: Header (address, command type) + Payload (data) - Example packet: "Write 32 bytes of data to FPGA address 0xD000_0000" **Two communication modes:** 1. **Host-to-FPGA (PCIe Memory Write TLPs):** - Host sends packet: "Write this data to FPGA address X" - FPGA receives packet, stores data - Used for: Initialization data transfers (HBM programming, commands, parameters) 2. **FPGA-to-Host (PCIe Memory Read TLPs):** - FPGA sends packet: "Read data from Host address Y and send it to me" - Host memory responds with data - Used for: Output spike data retrieval after execution During network **initialization**, we use **mode 1 exclusively** - the host pushes all network data to the FPGA via PCIe Memory Write TLPs. The FPGA is passive and only receives. During **execution**, the FPGA may use mode 2 to send spike outputs back to the host. --- ### The Software-Hardware Stack for Initialization When `CRI_network(target="CRI")` is called, a multi-layer software and hardware stack springs into action: ``` Layer 7: User Code ┌────────────────────────────────────────────────────────────────┐ │ from hs_api import CRI_network │ │ network = CRI_network(axons, connections, config, outputs, │ │ target="CRI") │ └────────────────┬───────────────────────────────────────────────┘ │ Calls ▼ Layer 6: hs_api Internals ┌────────────────▼───────────────────────────────────────────────┐ │ hs_api/api.py: CRI_network.__init__() │ │ - Validates network structure │ │ - Creates connectome object │ │ - Calls: from hs_bridge import network │ │ - Instantiates: self.CRI = network(...) │ └────────────────┬───────────────────────────────────────────────┘ │ Calls ▼ Layer 5: hs_bridge Network Class ┌────────────────▼───────────────────────────────────────────────┐ │ hs_bridge/network.py: network.__init__() │ │ - Calls compiler to generate HBM data │ │ - Calls controller to program FPGA │ └────────────────┬───────────────────────────────────────────────┘ │ Calls ▼ Layer 4: fpga_compiler (HBM Data Generation) ┌────────────────▼───────────────────────────────────────────────┐ │ hs_bridge/FPGA_Execution/fpga_compiler.py │ │ - create_axon_ptrs(): Builds axon pointer array │ │ - create_neuron_ptrs(): Builds neuron pointer array │ │ - create_synapses(): Builds synapse data array │ │ - Output: NumPy arrays ready for DMA transfer │ └────────────────┬───────────────────────────────────────────────┘ │ Calls ▼ Layer 3: fpga_controller (FPGA Programming) ┌────────────────▼───────────────────────────────────────────────┐ │ hs_bridge/FPGA_Execution/fpga_controller.py │ │ - write_parameters_simple(): Programs neuron counts │ │ - write_neuron_type(): Programs neuron model parameters │ │ - clear(): Zeros URAM │ │ - [Calls dmadump to transfer HBM data] │ └────────────────┬───────────────────────────────────────────────┘ │ Calls ▼ Layer 2: DMA Library (PCIe Transfer) ┌────────────────▼───────────────────────────────────────────────┐ │ hs_bridge/wrapped_dmadump/dmadump.py │ │ - dma_dump_write(data, length, ...): Sends data Host→FPGA │ │ - Underlying C library interfaces with Linux kernel driver │ └────────────────┬───────────────────────────────────────────────┘ │ PCIe TLPs ▼ Layer 1: FPGA Hardware Modules ┌────────────────▼───────────────────────────────────────────────┐ │ Verilog Modules (synthesized into FPGA fabric): │ │ - pcie2fifos.v: Receives PCIe packets → Input FIFO │ │ - command_interpreter.v: Parses commands, routes data │ │ - hbm_processor.v: Writes data to HBM │ │ - internal_events_processor.v: Writes data to URAM │ └─────────────────────────────────────────────────────────────────┘ ``` --- ### Step-by-Step: Initialization Sequence Let's trace the exact sequence of events when you run: ```python network = CRI_network(axons, connections, config, outputs, target="CRI") ``` #### **Phase 1: Network Compilation (Software - hs_bridge)** **Step 1.1: CRI_network.__init__() validates and calls compiler** File: `hs_api/api.py` (lines 141-156) ```python if self.target == "CRI": logging.info("Initilizing to run on hardware") self.connectome.pad_models() formatedOutputs = self.connectome.get_outputs_idx() print("formatedOutputs:", formatedOutputs) self.CRI = network( # ← Calls hs_bridge.network class self.connectome, formatedOutputs, self.config, simDump=simDump, coreOveride=coreID, ) self.CRI.initalize_network() # ← Triggers actual initialization ``` **Step 1.2: hs_bridge network class creates compiler** File: `hs_bridge/network.py` (conceptual - not shown in our files, but referenced) ```python def initalize_network(self): # Create compiler compiler = fpga_compiler( data=[self.axon_ptrs, self.neuron_ptrs, self.synapses], N_neurons=self.N_neurons, outputs=self.outputs, coreID=self.coreID ) # Generate HBM programming data compiler.create_axon_ptrs() # ← Generate axon pointer data compiler.create_neuron_ptrs() # ← Generate neuron pointer data compiler.create_synapses() # ← Generate synapse data # Program FPGA self.program_fpga() ``` **Step 1.3: fpga_compiler generates axon pointers** File: `hs_bridge/FPGA_Execution/fpga_compiler.py` (lines 157-200) ```python def create_axon_ptrs(self, simDump=False): '''Creates the necessary adxdma_dump commands to program axon pointers into HBM''' axn_ptrs = np.fliplr(self.axon_ptrs) # Reverse for little-endian batchCmd = [] for r, d in enumerate(axn_ptrs): # For each row cmd = [] for p in d: # For each pointer in row (8 pointers per row) # p = (start_row, end_row) tuple # Build 32-bit pointer: [31:23]=length, [22:0]=start_address binAddr = np.binary_repr(p[1] - p[0], PTR_LEN_BITS) + \ np.binary_repr(p[0] + SYN_BASE_ADDR, PTR_ADDR_BITS) # binAddr is now 32-bit string like "000000001" + "00000000000000000000000" # Convert to bytes (4 bytes per pointer) cmd = cmd + [int(binAddr[:8], 2), # Byte 0 int(binAddr[8:16], 2), # Byte 1 int(binAddr[16:24], 2), # Byte 2 int(binAddr[24:], 2)] # Byte 3 # Prepend HBM write command header # [511:504]=0x02 (HBM write opcode) # [503:496]=coreID # [495:0]=address + data rowAddress = '1' + np.binary_repr(r + AXN_BASE_ADDR, 23) # 24-bit HBM address cmd = self.HBM_OP_RW_LIST + \ [int(rowAddress[:8], 2), int(rowAddress[8:16], 2), int(rowAddress[16:], 2)] + cmd cmd.reverse() # Reverse for endianness batchCmd = batchCmd + cmd # Send to FPGA via DMA exitCode = dmadump.dma_dump_write(np.array(batchCmd), len(batchCmd), 1, 0, 0, 0, dmadump.DmaMethodNormal) ``` **What's happening here:** - `self.axon_ptrs` is a NumPy array: `[[start0, end0], [start1, end1], ...]` - For our network, axon 0: `[0, 0]` → length=1, start=0 - Converts to binary format: 9 bits for length + 23 bits for address - Adds HBM write command opcode (0x02) - Calls `dmadump.dma_dump_write()` to send via PCIe **Example for Axon 0 pointer:** ```python p = (0, 0) # Start row 0, end row 0 (1 row total) length = 0 - 0 = 0... wait, that's wrong! # Actually the code does p[1] - p[0] but these are (start, end) inclusive # So if start=0, end=0, that means 1 row (from 0 to 0 inclusive) # But the binary repr treats it as end - start = 0 # Actually looking closer, length = p[1] - p[0] = end - start # If there's 1 row, and we use inclusive indexing, end would equal start # So length = 0... but we want to represent "1 row" # # Let me re-read: PTR_LEN_BITS = 9, stores number of rows # The pointer stores: how many rows of synapses # For axon 0 with 5 synapses, that fits in 1 row (8 synapses per row) # So length should be 1 # # Looking at line 176: binAddr = np.binary_repr(p[1] - p[0], PTR_LEN_BITS) # If p = (start_row, end_row) and there's 1 row: # If 0-indexed and end is exclusive: p = (0, 1) → 1 - 0 = 1 ✓ # If 0-indexed and end is inclusive: p = (0, 0) → 0 - 0 = 0 ✗ # The code must use exclusive end indexing # So for axon 0: p = (0, 1) meaning rows [0, 1) = row 0 # # Correcting: p = (0, 1) # Start row 0, end row 1 (exclusive) = 1 row length = 1 - 0 = 1 # Binary: 0b000000001 (9 bits) start = 0 + SYN_BASE_ADDR = 0 + 0x8000 = 0x8000 # Binary: 23 bits binAddr = "000000001" + "00000000000001000000000" # 32 bits total = 0b00000000100000000000001000000000 = 0x0080_0000 Bytes: [0x00, 0x80, 0x00, 0x00] (little-endian order in array) ``` **Step 1.4: fpga_compiler generates neuron pointers** File: `hs_bridge/FPGA_Execution/fpga_compiler.py` (lines 225-268) Same process as axon pointers, but for `self.neuron_ptrs` array. Writes to HBM starting at `NRN_BASE_ADDR = 0x4000`. **Step 1.5: fpga_compiler generates synapses** File: `hs_bridge/FPGA_Execution/fpga_compiler.py` (lines 271-360) ```python def create_synapses(self, simDump=False): weights = self.synapses # 2D array: rows × 8 synapses per row bigCmdList = [] for r, d in enumerate(weights): # For each synapse row cmd = [] for w in d: # For each synapse in row (up to 8) if w[0] == 0: # Regular synapse # w = (opcode, target_address, weight) # Build 32-bit synapse: [31:29]=op, [28:16]=addr, [15:0]=weight binCmd = np.binary_repr(0, SYN_OP_BITS) + \ np.binary_repr(int(w[1]), SYN_ADDR_BITS) + \ np.binary_repr(int(w[2]), SYN_WEIGHT_BITS) # Example: op=0 (3 bits), addr=0 (13 bits), weight=1000 (16 bits) # binCmd = "000" + "0000000000000" + "0000001111101000" # = 0b000_0000000000000_0000001111101000 # = 0x0000_03E8 cmd = cmd + [int(binCmd[:8], 2), int(binCmd[8:16], 2), int(binCmd[16:24], 2), int(binCmd[24:], 2)] elif w[0] == 1: # Spike output entry # w = (1, neuron_index) binSpike = np.binary_repr(4, SYN_OP_BITS) + \ 12*'0' + \ np.binary_repr(w[1], 17) # OpCode=100 (4 in decimal), address=neuron index, weight=0 cmd = cmd + [int(binSpike[:8], 2), int(binSpike[8:16], 2), int(binSpike[16:24], 2), int(binSpike[24:], 2)] # Prepend HBM write command rowAddress = '1' + np.binary_repr(r + SYN_BASE_ADDR, 23) cmd = self.HBM_OP_RW_LIST + \ [int(rowAddress[:8], 2), int(rowAddress[8:16], 2), int(rowAddress[16:], 2)] + cmd cmd = np.flip(np.array(cmd, dtype=np.uint64)) bigCmdList.append(cmd) # Send to FPGA in batches split = np.concatenate(bigCmdList) n = 10 # Batch size while True: element = split[:n*64] split = split[n*64:] if element.size == 0: break exitCode = dmadump.dma_dump_write(element, len(element), 1, 0, 0, 0, dmadump.DmaMethodNormal) ``` **Example for first synapse (a0 → h0, weight=1000):** ```python w = (0, 0, 1000) # (opcode=0, target=h0=0, weight=1000) binCmd = "000" + "0000000000000" + "0000001111101000" = 0x0000_03E8 Bytes: [0x00, 0x00, 0x03, 0xE8] ``` At this point, all HBM data is prepared as NumPy arrays. Now we need to send it! --- #### **Phase 2: DMA Transfer (PCIe Communication)** **Step 2.1: dmadump.dma_dump_write() prepares DMA** File: `hs_bridge/wrapped_dmadump/dmadump.py` (Python wrapper for C library) ```python def dma_dump_write(data, length, flag1, flag2, flag3, flag4, method): ''' Sends data from host memory to FPGA via PCIe Memory Write TLPs Parameters: - data: NumPy array containing bytes to send - length: Number of bytes - method: DmaMethodNormal (0) for normal transfer Returns: - 0 on success, non-zero on error ''' # This Python function calls a C extension # The C library (adxdma_dmadump.cpp) handles: # 1. Calls ADXDMA_WriteDMA() from vendor library # 2. Vendor library interfaces with Linux/Windows kernel driver # 3. Kernel driver builds PCIe Memory Write TLPs # 4. TLPs are sent to FPGA's BAR (Base Address Register) address # 5. FPGA receives via PCIe endpoint and writes to Input FIFO ``` **Step 2.2: Physical DMA operation** What actually happens on the hardware: ``` 1. Host allocates DMA buffer in RAM: Virtual address: 0x7FFF_1234_5000 (example - OS virtual memory) Physical address: 0x1_2345_6000 (translated by OS page tables) Size: length bytes (e.g., 64 bytes for one 512-bit packet) 2. Host copies data into DMA buffer: memcpy(dma_buffer, data, length) 3. Host PCIe driver sends Memory Write TLP(s) directly to FPGA: The ADXDMA_WriteDMA() library function calls the kernel driver, which: - Builds PCIe Memory Write Transaction Layer Packets (TLPs) - Sends them to the FPGA's PCIe Base Address Register (BAR) address - No MMIO register programming needed - data goes directly to FPGA 4. PCIe TLP travels from Host → FPGA: Physical link: 16 lanes × differential pairs Packet format: Header + Payload + CRC The FPGA PCIe endpoint receives the TLP 5. FPGA PCIe Endpoint presents data via AXI4: - PCIe endpoint IP block decodes the TLP - Presents as AXI4 write transaction to pcie2fifos.v - AXI4 signals: awaddr, awvalid, wdata, wvalid, etc. 6. pcie2fifos.v receives AXI4 write: - Accepts write when wvalid=1 and wready=1 - Extracts 512-bit payload from s_axi_wdata - Writes to Input FIFO - FIFO stores data for command_interpreter.v to process **What pcie2fifos.v does (black box view):** INPUT: AXI4 Protocol (complex handshaking: awaddr, awvalid, awready, wdata, wvalid, wready) → Bursty timing, requires coordination between sender and receiver OUTPUT: FIFO Interface (simple: fifo_dout[511:0], fifo_empty, fifo_rd_en) → Smooth timing, command_interpreter reads at its own pace **Transformation:** Complex AXI4 protocol → Simple FIFO read interface **Buffering:** Can store up to 16 × 512-bit packets **Data:** The 512-bit payload is unchanged, just the access method differs Think of it like a mail slot: the mail carrier (PCIe) can drop letters whenever they arrive, and the recipient (command_interpreter) can pick them up whenever convenient. The letters aren't changed, just stored temporarily (up to 16 letters) so sender and receiver don't have to coordinate timing. ``` **IMPORTANT:** The FPGA is **passive** during initialization - it only receives data. The host is the DMA "master" that pushes data to the FPGA. There is **no FPGA DMA engine** reading from host memory during this process. --- ### PCIe Packet Details **PCIe Memory Write TLP (Host → FPGA):** ``` TLP Header (16 bytes for 64-bit addressing): ┌────────────────────────────────────────────────────────────┐ │ [127:125] Fmt = 011 (Memory Write, 64-bit address, data) │ │ [124:120] Type = 00000 (Memory Write) │ │ [119:110] Length = 16 DW (64 bytes = 16 dwords = 512 bits) │ │ [109:96] Requester ID = 00:00.0 (Host PCIe Root Complex) │ │ [95:88] Tag = 7 (identifies this transaction) │ │ [87:80] Last DW BE = 0xF (all bytes valid) │ │ [79:72] First DW BE = 0xF (all bytes valid) │ │ [71:64] Address[63:32] = 0x0000_0000 (upper 32 bits) │ │ [63:2] Address[31:2] = BAR0_BASE >> 2 (FPGA address) │ │ [1:0] Reserved = 0b00 │ └────────────────────────────────────────────────────────────┘ Payload (64 bytes = 512 bits): [511:504] = 0x02 (opcode: HBM write command) [503:496] = 0x00 (coreID) [495:280] = padding [279] = 0x1 (write flag) [278:256] = 0x8000 (HBM row address) [255:0] = synapse/pointer data (256 bits) CRC (4 bytes): 0x1A2B3C4D (example - calculated by PCIe controller) ``` **What happens when FPGA receives this TLP:** 1. **PCIe Endpoint IP Block:** - Receives serial data on 16 differential lane pairs - Deserializes and decodes TLP - Checks CRC (discards if bad) - Extracts address and payload 2. **Address Decode:** - Address 0x0000_0000_XXXX_XXXX falls within BAR0 (Base Address Register 0) - Routes to AXI4 master connected to pcie2fifos.v 3. **AXI4 Write Transaction:** ```verilog // PCIe endpoint drives these signals to pcie2fifos.v: s_axi_awaddr = 64'h0000_0000_XXXX_XXXX // Address (ignored by pcie2fifos) s_axi_awvalid = 1'b1 // Address valid s_axi_wdata = 512'h02... // The 512-bit payload s_axi_wvalid = 1'b1 // Data valid s_axi_wlast = 1'b1 // Last beat in burst ``` 4. **pcie2fifos.v accepts write:** ```verilog always @(posedge aclk) begin if (s_axi_wvalid && s_axi_wready) begin input_fifo_din <= s_axi_wdata[511:0]; input_fifo_wr_en <= 1'b1; end end ``` 5. **Input FIFO stores data:** - FIFO is a BRAM primitive (Xilinx XPM_FIFO) - Stores the 512-bit word - Asserts ~empty signal - command_interpreter.v reads on next cycle **Note:** If data exceeds one TLP's maximum payload size (typically 256 bytes), the PCIe driver automatically splits it into multiple TLPs. For our 512-bit (64-byte) packets, one TLP is sufficient. --- #### **Phase 3: FPGA Reception and HBM Programming** **Step 3.1: pcie2fifos.v receives packet** File: `hardware_code/gopa/CRI_proj/pcie2fifos.v` **What is pcie2fifos.v?** `pcie2fifos.v` is a **simple AXI4 slave bridge**, NOT a DMA engine. It: - Has NO MMIO registers for DMA control - Has NO ability to become PCIe bus master - Simply accepts AXI4 writes from the PCIe endpoint and stores them in a FIFO - Similarly, provides AXI4 reads from a different FIFO for outgoing data Think of it like a mailbox: - **Incoming mail slot (Input FIFO):** PCIe endpoint drops packets here - **Outgoing mail slot (Output FIFO):** command_interpreter puts responses here - pcie2fifos.v is just the slots - it doesn't "go get" mail from anywhere ```verilog // Simplified code from pcie2fifos.v // AXI4 Write Data Channel Handler always @(posedge aclk) begin if (s_axi_wvalid && s_axi_wready) begin // Received 512-bit word from PCIe endpoint input_fifo_wr_en <= 1'b1; input_fifo_din <= s_axi_wdata[511:0]; end end // Input FIFO instantiation (Xilinx XPM_FIFO primitive) xpm_fifo_sync #( .FIFO_WRITE_DEPTH(16), // Can store 16 × 512-bit packets .WRITE_DATA_WIDTH(512), // 512 bits per entry .READ_DATA_WIDTH(512) ) input_fifo ( .wr_clk(aclk), .wr_en(input_fifo_wr_en), .din(input_fifo_din), .dout(input_fifo_dout), .empty(input_fifo_empty), .full(input_fifo_full) ); ``` **What's happening physically:** - `s_axi_wdata` is 512 physical wires coming from PCIe endpoint - On clock rising edge where both `wvalid=1` and `wready=1`, data transfers - `input_fifo_wr_en` signal triggers FIFO write - FIFO is a BRAM primitive (36Kb blocks) configured as 16-deep × 512-bit - FIFO write pointer increments, `empty` flag deasserts - command_interpreter.v can now read from FIFO **Step 3.2: command_interpreter.v parses command** File: `hardware_code/gopa/CRI_proj/command_interpreter.v` ```verilog // State machine (simplified) reg [2:0] state; localparam IDLE = 0, READ_CMD = 1, ROUTE_DATA = 2; always @(posedge aclk) begin case (state) IDLE: begin if (!input_fifo_empty) begin input_fifo_rd_en <= 1'b1; state <= READ_CMD; end end READ_CMD: begin // FIFO output valid (FWFT mode) cmd_word <= input_fifo_dout[511:0]; opcode <= input_fifo_dout[511:504]; // Top 8 bits coreID <= input_fifo_dout[503:496]; // Next 8 bits payload <= input_fifo_dout[495:0]; // Remaining 496 bits state <= ROUTE_DATA; end ROUTE_DATA: begin case (opcode) 8'h02: begin // HBM write command // Extract HBM address from payload hbm_addr <= payload[495:472]; // 24-bit address hbm_data <= payload[255:0]; // 256-bit data hbm_wr_en <= 1'b1; // Signal hbm_processor to write end 8'h03: begin // Clear URAM command // Extract neuron address // Signal internal_events_processor end 8'h04: begin // Network parameters // Extract n_inputs, n_outputs // Store in registers end // ... other opcodes endcase state <= IDLE; end endcase end ``` **For our HBM write (opcode 0x02):** ``` Input: 512-bit word from Input FIFO Bits [511:504] = 0x02 → opcode = HBM write Bits [503:496] = 0x00 → coreID = 0 Bits [495:472] = 24-bit HBM row address Example: 0x800000 = row 0 in axon pointer region Bits [471:0] = HBM data (256 bits of actual pointers/synapses + padding) Command interpreter extracts: hbm_addr = 0x000000 (row address, relative to base) hbm_data[255:0] = pointer data Asserts hbm_wr_en signal to hbm_processor ``` **Step 3.3: hbm_processor.v writes to HBM** File: `hardware_code/gopa/CRI_proj/hbm_processor.v` ```verilog // HBM write state machine (simplified) reg [2:0] hbm_state; localparam HBM_IDLE = 0, HBM_WRITE_ADDR = 1, HBM_WRITE_DATA = 2; always @(posedge aclk) begin case (hbm_state) HBM_IDLE: begin if (hbm_wr_en) begin // Received write request from command_interpreter hbm_wr_addr_reg <= hbm_addr; hbm_wr_data_reg <= hbm_data; hbm_state <= HBM_WRITE_ADDR; end end HBM_WRITE_ADDR: begin // AXI4 Write Address Channel m_axi_awvalid <= 1'b1; m_axi_awaddr <= {hbm_wr_addr_reg, 5'b00000}; // Convert row to byte addr m_axi_awlen <= 8'd0; // 1 beat m_axi_awsize <= 3'd5; // 32 bytes = 2^5 if (m_axi_awready) begin m_axi_awvalid <= 1'b0; hbm_state <= HBM_WRITE_DATA; end end HBM_WRITE_DATA: begin // AXI4 Write Data Channel m_axi_wvalid <= 1'b1; m_axi_wdata <= {256'b0, hbm_wr_data_reg}; // Pad to 512 bits (HBM bus width) m_axi_wstrb <= 64'hFFFFFFFF; // All bytes valid m_axi_wlast <= 1'b1; // Last beat if (m_axi_wready) begin m_axi_wvalid <= 1'b0; hbm_state <= HBM_IDLE; // Write complete end end endcase end ``` **What's happening physically:** - `m_axi_awaddr` is a 33-bit wire bus to HBM controller - When `awvalid=1` and HBM controller asserts `awready=1`, address transfers - Next cycle: `wdata[511:0]` bus carries 512 bits (256 bits of data + 256 bits padding) - HBM controller decodes address: stack, channel, bank, row, column - HBM performs DRAM write: 1. Activate row (if different row than last access) 2. Write data to sense amplifiers 3. Precharge (close row) - Takes ~100-200ns total - `wready` asserts when HBM controller accepts data **Step 3.4: HBM physically stores the data** Inside the HBM chip (physical DRAM operation): ``` Address decoding: 33-bit address 0x0_0100_0000 (example for row 0x8000 × 32 bytes) [32:30] Stack select = 0b000 → Stack 0 [29:27] Channel select = 0b000 → Channel 0 within stack [26:13] Row address = 0b00000000100000 → Row 0x0020 [12:5] Column address = 0b00000000 → Column 0 [4:0] Byte offset = 0b00000 → Byte 0 HBM controller sequence: 1. Activate command: Open row 0x0020 in Bank 0 - Wordline voltage applied - Entire row (512 bytes) read into sense amps (row buffer) 2. Write command: Write 32 bytes at column 0 - Drive bitlines with new data - Sense amps latch data - Capacitors in DRAM cells charge/discharge 3. Precharge command: Close row - Write data from sense amps back to cells - Wordline deasserted 4. Data now stored in DRAM cells (1 transistor + 1 capacitor per bit) - Will persist for ~64ms before refresh needed ``` --- #### **Phase 4: Additional Initialization Steps** **Step 4.1: Program network parameters** File: `hs_bridge/FPGA_Execution/fpga_controller.py:683-721` ```python def write_parameters_simple(n_outputs, n_inputs, coreID=0, simDump=False): """Writes the network parameters to the FPGA""" command = np.zeros(512) command[:8] = list(np.binary_repr(4, 8)) # Opcode 0x04 command[8:16] = list(np.binary_repr(coreID, 8)) command[-17:] = list(np.binary_repr(n_inputs, 17)) # 17-bit input count command[-34:-17] = list(np.binary_repr(n_outputs, 17)) # 17-bit output count command = to_dump_format(command) # Convert to byte array exitCode = dmadump.dma_dump_write(command, len(command), ...) ``` This sends a command to `internal_events_processor.v` telling it: - How many input axons exist (5 in our network) - How many output neurons exist (5 in our network) **Step 4.2: Program neuron types** File: `hs_bridge/FPGA_Execution/fpga_controller.py:724-775` ```python def write_neuron_type(stopAddr, Threshold, neuronModel, shift, leak, coreID=0): """Configures neuron model parameters""" command = np.zeros(512) command[:8] = list(np.binary_repr(8, 8)) # Opcode 0x08 command[8:16] = list(np.binary_repr(coreID, 8)) command[-34:-17] = list(np.binary_repr(stopAddr, 17)) # Last neuron index command[-70:-34] = list(np.binary_repr(Threshold, 36)) # Spike threshold command[-72:-70] = list(np.binary_repr(neuronModel, 2)) # 0=IF, 1=LIF, etc. command[-78:-72] = list(np.binary_repr(shift, 6)) # Leak shift amount command[-84:-78] = list(np.binary_repr(leak, 6)) # Leak value command = to_dump_format(command) exitCode = dmadump.dma_dump_write(command, len(command), ...) ``` This configures: - **Threshold = 2000**: Neurons spike when V ≥ 2000 - **Neuron model = LIF**: Leaky integrate-and-fire - **Leak parameters**: How much voltage leaks each timestep The FPGA stores these in internal registers, which `internal_events_processor.v` uses during neuron updates. **Step 4.3: Clear URAM (zero all membrane potentials)** File: `hs_bridge/FPGA_Execution/fpga_controller.py:191-236` ```python def clear(n_internal, simDump=False, coreID=0): """This function clears the membrane potentials on the fpga.""" coreBits = np.binary_repr(coreID, 5) + 3*'0' for i in range(int(np.ceil(n_internal / ng_num))): # ng_num = 16 neurons/group commandTail = np.array([0]*55 + [int(coreBits, 2), 3], dtype=np.uint64) numCol = 16 # 16 columns (neuron groups) clearCommandList = [] for column in range(numCol): # Build clear command for this neuron group clearCommandList.append( np.concatenate([clear_address_packet(row=i, col=column), commandTail]) ) clearCommand = np.concatenate(clearCommandList) exitCode = dmadump.dma_dump_write(clearCommand, len(clearCommand), ...) ``` This sends opcode 0x03 commands to `internal_events_processor.v`, which writes zeros to all URAM addresses. **What happens in hardware:** ```verilog // internal_events_processor.v receives clear command always @(posedge aclk450) begin if (clear_cmd) begin // For each neuron in this group uram_addr <= neuron_row; uram_we <= 1'b1; uram_din <= 72'b0; // Write all zeros end end ``` This zeroes the membrane potential for all neurons. After this, every neuron starts with V=0. --- ### Summary: Complete Initialization Flow ``` User Python Code: network = CRI_network(target="CRI") ↓ hs_api validates network ↓ hs_bridge.network.__init__() ↓ fpga_compiler generates HBM data: - Axon pointers array - Neuron pointers array - Synapses array ↓ dmadump.dma_dump_write() sends data via PCIe: - Host allocates DMA buffer in RAM - Host sends Memory Write TLPs to FPGA - Data flows: Host RAM → PCIe → FPGA PCIe Endpoint → pcie2fifos.v → Input FIFO ↓ command_interpreter.v parses commands: - Opcode 0x02 → HBM write - Routes data to hbm_processor ↓ hbm_processor.v writes to HBM: - AXI4 transaction to HBM controller - Physical DRAM write (activate → write → precharge) ↓ fpga_controller.write_parameters_simple(): - Sends opcode 0x04 - Programs n_inputs, n_outputs ↓ fpga_controller.write_neuron_type(): - Sends opcode 0x08 - Programs threshold, neuron model, leak ↓ fpga_controller.clear(): - Sends opcode 0x03 - Zeros all URAM (membrane potentials) ↓ FPGA is now initialized: ✓ HBM contains network structure (pointers, synapses, weights) ✓ URAM cleared (all neurons at V=0) ✓ Network parameters programmed (threshold, neuron model) ✓ Ready to receive inputs and execute ``` **Time elapsed:** Typically 10-100 milliseconds depending on network size - Small network (our example): ~10 ms - Large network (millions of synapses): ~100 ms - Dominated by PCIe transfer time for large synapse arrays --- ## Conclusion Network initialization is a **one-time compilation and transfer process** that transforms your high-level Python network definition into a physical configuration in the FPGA's memory hierarchy. Once initialized: - **HBM stores the network structure** (connections and weights) - this doesn't change during execution - **URAM stores neuron states** (membrane potentials) - this updates every timestep - **BRAM stores input patterns** (which axons are firing) - this changes every timestep In the next chapter, we'll see how this initialized network comes to life when we send inputs and execute timesteps.