# FPGA HARVARD ARCHITECTURE

#### Von Neumann Architecture



- Shared memory space
- CPU to memory is bottleneck
- Data and instructions have to share the memory interface
- The basic word size is the same for data and instructions

### **Harvard Architecture**



- Separate memory spaces
- Data and instructions access can happen at the same time
- The basic word size can be different for data and instructions

### **Modified Harvard Architecture**



- Common Main memory space
- CPU to memory bottleneck only for things not in the cache
- Separate memory spaces for data cache and instruction cache
- Data and instructions access can happen at the same time
- The basic word size can be different for data and instructions
- Most modern processors use this approach inside the CPU chip

## Linear Sort in FPGA

The instruction memory is completely separate from the data memory



# Performance Comparison of Linear UBXSort vs C++ std::sort()

- UBXSort is our patented sorting algorithm, US patent # 5,278,987
- It has a complexity of O(N)
- Typical conventional sorting algorithms have complexity of O(N \* log(N))
- The larger the dataset size, the more advantageous of the UBXSort algorithm
- The benchmark test compares the C++ implementation of UBXSort against the sort function included in the standard C++ library
- Test is performed on a machine with Intel(R) Xeon(R) CPU E5-2637 v2 @ 3.50GHz, 16 cores, 512GB RAM
- The host runs on CentOS 6.7, using gcc-4.x, libstdc++.so.6.0.13
- The test meassures the time it takes to sort up to 6 million double precision float numbers



**Record Count** 

# **FPGA Sort Implementation advantage**



- The C code to sort one character of the key field
- This is the step that takes most of the time. The for loop cycles through all of the records. For each cycle it reads the character to be sorted and does the steps needed to generate the sorted order for that column, which include: memory reads, memory writes, arithmetic functions and decision branch points.
- In the FPGA implementation all of this work is done in the time that it takes to do one memory read, about 12 ns. Therefore the sorting time for 100,000,000 records on 8 character (or 64 bit) field is about 10 seconds.

### UBXSort C++ code to sort one character position

```
void UBSorter::ProcColumn(unsigned char * pBuf)
int i, j, j0;
unsigned char c;
 i = startRcd :
 for (i = 0; i < nRcds ; ++i) {
  c = pBuf[j * stride_];
   if (flags_[c]) { pDest_[ curPocket_[c] ] = j; }
   else {
     iniPocket [c] = j;
     flags [c] = true;
   curPocket_[c] = j;
   i = pSrc [i];
i0 = -1;
 for (i = 0; i < 256; ++i) {
   if (flags [i]) {
     if (i0 == -1) {
       j0 = curPocket [i];
        startRcd_ = iniPocket_[i];
     } else {
        pDest [j0] = iniPocket [i];
       j0 = curPocket [i];
   }
endRcd = j0;
pDest [endRcd ] = -1;
```

### **FPGA Implementation**



#### Verilog or VHDL

-- IP VLNV: xilinx.com:ip:blk\_mem\_gen:8.4 -- IP Revision: 2

LIBRARY ieee; USE ieee.std\_logic\_1164.ALL; USE ieee.numeric\_std.ALL;

LIBRARY blk\_mem\_gen\_v8\_4\_2; USE blk\_mem\_gen\_v8\_4\_2.blk\_mem\_gen\_v8\_4\_2;

ENTITY rcd\_mem IS PORT ( clka : IN STD\_LOGIC; ena : IN STD\_LOGIC, VECTOR(0 DOWNTO 0); addra : IN STD\_LOGIC\_VECTOR(7 DOWNTO 0); dina : IN STD\_LOGIC\_VECTOR(31 DOWNTO 0); douta : OUT STD\_LOGIC\_VECTOR(31 DOWNTO 0) );

END rcd\_mem;

ARCHITECTURE rcd\_mem\_arch OF rcd\_mem IS COMPONENT blk\_mem\_gen\_v8\_4\_2 IS GENERIC ( C\_FAMILY : STRING; C\_XDEVICEFAMILY : STRING; C\_ELABORATION\_DIR : STRING; C\_INTERFACE\_TYPE : INTEGER;

#### 70 more lines of text

C\_DISABLE\_WARN\_BHV\_RANGE : INTEGER; C\_COUNT\_36K\_BRAM : STRING; C\_COUNT\_18K\_BRAM : STRING; C\_EST\_POWER\_SUMMARY : STRING

#### PORT (

clka : IN STD\_LOGIC; rsta : IN STD\_LOGIC; ena : IN STD\_LOGIC; regcea : IN STD\_LOGIC; wea : IN STD\_LOGIC\_VECTOR(0 DOWNTO 0); addra : IN STD\_LOGIC\_VECTOR(7 DOWNTO 0); dina : IN STD\_LOGIC\_VECTOR(31 DOWNTO 0); douta : OUT STD\_LOGIC\_VECTOR(31 DOWNTO 0);

#### 70 more lines of text

s\_axi\_dbiterr : OUT STD\_LOGIC; s\_axi\_rdaddrecc : OUT STD\_LOGIC\_VECTOR(7 DOWNTO 0) ); END COMPONENT blk\_mem\_gen\_v8\_4\_2; ATTRIBUTE X\_CORE\_INFO OF rcd\_mem\_arch: ARCHITECTURE IS "blk\_mem\_gen\_v8\_4\_2,Vivado 2018.3"; ATTRIBUTE CHECK\_LICENSE\_TYPE : STRING; ATTRIBUTE CHECK\_LICENSE\_TYPE OF rcd\_mem\_arch : ARCHITECTURE IS "rcd\_mem,blk\_mem\_gen\_v8\_4\_2,{}"; ATTRIBUTE CORE\_GENERATION\_INFO : STRING;

## **FPGA Chip Implementation**



#### Block of memory

The colored lines are a representation of the connections between the logical elements

Slice of 8 LUTs 6 of which are used

