Vector Processors and Data Level Parallelism

Introduction

A complex number consists of a real and an imaginary component and is usually written in the form a + bi, where a and b are either integer or floating-point values and i (the imaginary unit) satisfies i^2 = -1. Sometimes in engineering, the letter j is used in place of i because i is used for other values.

Multiplying two complex numbers is done by applying the FOIL (Firsts, Outers, Inners and Lasts) method, similar to that of binomial multiplication. For example, multiplying (a + bi)(c + di) is accomplished as follows:

Firsts: a * c

Outers: a * di

Inners: bi * c

Lasts: bi * di

This produces (a + bi)(c + di) = ac + adi + bci + bdi^2. Keep in mind that i^2 = -1, so the last term becomes -bd. Combining the terms returns the product to the form a + bi: (ac - bd) + (ad + bc)i.

An example using actual values: (2.5 + 3i)(4.0 + 2i)

Firsts: 2.5 * 4.0

Outers: 2.5 * 2i

Inners: 3i * 4.0

Lasts: 3i * 2i

This produces 10 + 5i + 12i + 6i^2 = 10 + 17i + 6(-1) = 4 + 17i.
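
The FOIL steps above can be sketched in Python. This is a minimal illustration, not part of the assignment: the function name complex_multiply and its parameter order are assumptions chosen for this example. It computes the four partial products, applies i^2 = -1, and cross-checks the result against Python's built-in complex type.

```python
def complex_multiply(a, b, c, d):
    """Multiply (a + bi)(c + di) using FOIL; return (real, imaginary)."""
    firsts = a * c  # real term
    outers = a * d  # coefficient of i
    inners = b * c  # coefficient of i
    lasts = b * d   # coefficient of i^2
    real = firsts - lasts  # i^2 = -1 turns the "lasts" term negative
    imag = outers + inners
    return real, imag


real, imag = complex_multiply(2.5, 3.0, 4.0, 2.0)
print(real, imag)  # 4.0 17.0

# Cross-check against Python's native complex arithmetic.
assert complex(real, imag) == complex(2.5, 3.0) * complex(4.0, 2.0)
```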

Some programming languages support complex numbers natively (Python, MATLAB, Fortran). Others added support in a later revision (C, starting with C99). Some programming languages have no native support for complex numbers.

Assignment Definition

Consider the following high-level language code which multiplies two vectors that contain single-precision complex numbers:

Values a, b and c are vectors; the _re suffix denotes the real component and the _im suffix denotes the imaginary component of each element in a vector.
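
A loop of the kind described above, multiplying two vectors of complex numbers stored as separate real and imaginary arrays, might look like the following Python sketch. It is an illustration under assumptions rather than the assignment's own listing: the function name multiply_complex_vectors is hypothetical, the vector length is taken from the input, and Python floats are double precision rather than single precision.

```python
def multiply_complex_vectors(a_re, a_im, b_re, b_im):
    """Element-wise complex multiply: c[i] = a[i] * b[i].

    Each complex vector is stored as two parallel lists, one holding
    the real components (_re) and one the imaginary components (_im).
    """
    n = len(a_re)
    c_re = [0.0] * n
    c_im = [0.0] * n
    for i in range(n):
        # (a_re + a_im*i)(b_re + b_im*i), combined using i^2 = -1
        c_re[i] = a_re[i] * b_re[i] - a_im[i] * b_im[i]
        c_im[i] = a_re[i] * b_im[i] + a_im[i] * b_re[i]
    return c_re, c_im


c_re, c_im = multiply_complex_vectors([2.5], [3.0], [4.0], [2.0])
print(c_re, c_im)  # [4.0] [17.0]
```

Each iteration needs four multiplies, one subtract and one add, which is the work the vector instructions have to cover.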

Convert this loop into pseudo RV64V assembly code using strip mining, assuming the following architectural features:

Vector registers: v0 – v31

MVL (maximum vector length) = 64

Instructions:

vld (vector load)

vst (vector store)

vadd (vector add)

vsub (vector subtract)

vmul (vector multiply)

bne (branch if not equal)*

blt (branch if less than)*

j (unconditional jump)*

addi (integer add immediate)*

ori (logical or immediate)*

Note: instructions with an asterisk indicate the instructions are used only for setting initial index value and increments, and for loop control.
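
Strip mining splits a loop over n elements into pieces of at most MVL elements, so that each piece fits in a vector register. The Python sketch below shows one common arrangement, in which the odd-sized piece (n mod MVL) is processed first and the remaining pieces use the full MVL. The function names and the length 150 are assumptions made for illustration only; they do not come from the assignment, and the inner loop merely stands in for one pass of vector instructions.

```python
MVL = 64  # maximum vector length, as given in the assignment


def strip_lengths(n, mvl=MVL):
    """Return the vector length used by each strip-mined iteration."""
    lengths = []
    remainder = n % mvl
    if remainder:
        lengths.append(remainder)  # odd-sized piece goes first
    lengths.extend([mvl] * (n // mvl))  # then the full-length pieces
    return lengths


def strip_mined_multiply(a_re, a_im, b_re, b_im, mvl=MVL):
    """Complex vector multiply, processed one strip at a time."""
    n = len(a_re)
    c_re = [0.0] * n
    c_im = [0.0] * n
    low = 0  # index of the first element in the current strip
    for vl in strip_lengths(n, mvl):
        # This inner loop models one set of vector instructions
        # operating on vl elements.
        for i in range(low, low + vl):
            c_re[i] = a_re[i] * b_re[i] - a_im[i] * b_im[i]
            c_im[i] = a_re[i] * b_im[i] + a_im[i] * b_re[i]
        low += vl
    return c_re, c_im


print(strip_lengths(150))  # [22, 64, 64]
```

The number of entries returned by strip_lengths is the number of times the strip-mined loop body runs, which equals n / MVL rounded up.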

If the vector processor implements chaining with two lanes and has a single vector load/store unit, using the pseudo assembly code from question 1, show how convoys would be constructed to execute in the vector pipeline. How many chimes are required to execute the convoys?

Assume the functional units in the vector processor have the following startup overheads: load/store unit, 12 cycles; multiply unit, 7 cycles; add/subtract unit, 6 cycles. How many clock cycles are required for each iteration of the loop, including startup overhead?

How many iterations are required to complete processing the vectors?

Instruction Formats

vld (vector load): vld vD, vec_ref

vst (vector store): vst vD, vec_ref

vadd (vector add): vadd vD, vS1, vS2

vsub (vector subtract): vsub vD, vS1, vS2

vmul (vector multiply): vmul vD, vS1, vS2

bne (branch if not equal): bne x1, x2, target_label

blt (branch if less than): blt x1, x2, target_label

j (unconditional jump): j target_label

addi (integer add immediate): addi xD, xS1, const

ori (logical or immediate): ori xD, xS1, const

Format Definitions

vD = destination vector register

vS1 = first source vector register

vS2 = second source vector register

vec_ref = vector reference

x1 = first general purpose register for comparison

x2 = second general purpose register for comparison

xD = destination general purpose register

xS1 = first source general purpose register

xS2 = second source general purpose register

target_label = label of the target instruction for branch

const = an integer constant