ARC VPX Specific Details

The ARC VPX family of processors combines the ARCv2 baseline ISA with the ARCv2 Vector DSP ISA extension. The MLI Library implementation for this family relies heavily on the vector extension to achieve high efficiency.

VPX Memory Allocation

VPX Memory Alignment

Accumulator

Operands Limitations and Shifting Ranges

VPX Memory Allocation

Almost all kernels are implemented with vector instructions and assume that their operands reside in the vector memory (VCCM). This means that:

The memory referenced by the data container of every input and output tensor must be allocated within the VCCM memory region.

The memory pointed to by the data container of the mli_lut structure must be allocated within the VCCM memory region.

Tensor structures, LUT structures, configuration structures, and memory pointed to by containers inside the el_params field of a tensor may be allocated within any memory region.

This applies to:

All functions from the kernels group (see MLI Kernels (Operators))
All functions from the conversion group (see Data Conversion Group)

This doesn’t apply to:

All functions from the helpers group (see Helper Functions Group)
All functions from the move group (see Data Movement)

VPX Memory Alignment

Addresses of all elements, including data, quantization parameters, and structure fields, must be aligned on an element boundary. This also applies to data allocated in the vector memory (VCCM). Addresses of vectors and vector elements must be properly aligned on a vector-element boundary.

Important

There is one type of memory access that has 8-bit alignment: a unit-stride vector load or store with 8-bit elements (fx8 and sa8 data). For best performance, vector loads and stores of such data should use even byte addresses (aligned on a 16-bit boundary). This can be achieved by using even shapes or memstrides for sa8 and fx8 tensors. Odd byte addresses are allowed but are less efficient.
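The effect of shapes and memstrides on byte alignment can be checked with simple address arithmetic. The sketch below is illustrative only and is not part of the MLI API: it assumes a hypothetical 8-bit tensor whose innermost dimension is contiguous, with memstrides counted in elements, and lists the byte address at which each innermost row starts. Row starts are only one possible source of odd addresses, so treat this as a quick sanity check rather than a complete test.

```python
from itertools import product


def row_start_addresses(base_addr, shape, mem_stride):
    """Byte addresses where each innermost row of an 8-bit tensor starts.

    shape and mem_stride are per-dimension tuples, with strides counted in
    elements; for sa8/fx8 data one element is one byte. The innermost
    dimension is assumed to be contiguous (stride 1). All names here are
    hypothetical and chosen for illustration.
    """
    outer_shape = shape[:-1]
    outer_strides = mem_stride[:-1]
    addresses = []
    for index in product(*(range(n) for n in outer_shape)):
        offset = sum(i * s for i, s in zip(index, outer_strides))
        addresses.append(base_addr + offset)
    return addresses


def all_rows_even(base_addr, shape, mem_stride):
    """True if every innermost row starts on a 16-bit (even byte) boundary."""
    rows = row_start_addresses(base_addr, shape, mem_stride)
    return all(addr % 2 == 0 for addr in rows)
```

With an even base address, a 2x3 tensor stored densely (row memstride 3) places its second row on an odd address, while padding the row memstride to 4 keeps both rows on even addresses.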

Accumulator

The accumulator width used in calculations depends on the Xvec_guard_bit_option HW configuration parameter. See the Quantization: Influence of Accumulator Bit Depth section for more information on how it influences the usage of the library. The following table summarizes the available options and how many accumulations each allows without overflow.

VPX HW Accumulator Width
Kernel Type	Description	guard bit option = 2	guard bit option = 1	guard bit option = 0
`sa8`	Accumulator width (bits)	24 (8 guard bits)	20 (4 guard bits)	16 (0 guard bits)
`sa8`	MACs guaranteed without overflow	256	16	1
`fx16`	Accumulator width (bits)	40 (8 guard bits)	36 (4 guard bits)	32 (0 guard bits)
`fx16`	MACs guaranteed without overflow	256	16	1
`fx16_fx8_fx8`	Accumulator width (bits)	40 (16 guard bits)	36 (12 guard bits)	32 (8 guard bits)
`fx16_fx8_fx8`	MACs guaranteed without overflow	65536	4096	256
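In every row of the table, the guaranteed MAC count equals 2 raised to the number of guard bits. The sketch below, which is illustrative and not part of the MLI API, encodes the guard bits from the table and compares them with the number of MACs needed for one output value. The function names are hypothetical, and the guarantee is treated here as a worst-case bound: exceeding it means overflow becomes possible, not that it necessarily happens.

```python
# Guard bits per kernel type and guard bit option, copied from the table above.
GUARD_BITS = {
    "sa8": {2: 8, 1: 4, 0: 0},
    "fx16": {2: 8, 1: 4, 0: 0},
    "fx16_fx8_fx8": {2: 16, 1: 12, 0: 8},
}


def macs_without_overflow(kernel_type, guard_bit_option):
    """Number of worst-case MACs that are guaranteed not to overflow."""
    return 2 ** GUARD_BITS[kernel_type][guard_bit_option]


def accumulation_fits(kernel_type, guard_bit_option, macs_per_output):
    """True if one output value needs no more MACs than the guaranteed count."""
    limit = macs_without_overflow(kernel_type, guard_bit_option)
    return macs_per_output <= limit
```

For instance, a 3x3 convolution over 16 input channels needs 144 MACs per output value and stays within the sa8 guarantee for guard bit option 2, while the same kernel over 64 channels (576 MACs) does not.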

Operands Limitations and Shifting Ranges

This section describes VPX-specific limitations of kernels. In this section, \(n_\text{tensor}\) denotes the number of fractional bits of a tensor and \(s_\text{fx,tensor}\) is its scale in the case of an asymmetric data type (see Data Formats).

Weighted Kernels

For the following kernels:

conv2d
depthwise_conv2d
transpose_conv2d
group_conv2d
fully_connected
rnn_dense
gru_cell
lstm_cell

First, to avoid negative shifts (below the lower bound) and excessively large internal shifts (above the upper bound), the following shift restrictions must be adhered to:

Kernel Type	Restriction
`fx8`	\(0 \leq n_{in} + n_{weight} - n_{out} \leq 15\)
`fx16` and `fx16_fx8_fx8`	\(0 \leq n_{in} + n_{weight} - n_{out} \leq 31\)
`sa8_sa8_sa32`	No Limitations

Second, the following restrictions relate to left-shifting the bias inside the accumulator:

Kernel Type	Restriction
`fx8`	\(0 \leq n_{in} + n_{weight} - n_{bias} \leq 8\)
`fx16`	\(0 \leq n_{in} + n_{weight} - n_{bias} \leq 16\)
`fx16_fx8_fx8`	\(0 \leq n_{in} + n_{weight} - n_{bias} \leq 24\)
`sa8_sa8_sa32`	No Limitations
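Both sets of restrictions can be checked before deploying a model. The following sketch is an illustration, not part of the MLI API: it copies the upper bounds from the two tables above and reports whether the output shift and the bias shift are within range. The function and dictionary names are hypothetical.

```python
# Upper bounds copied from the two tables above; None means "No Limitations".
OUT_SHIFT_MAX = {"fx8": 15, "fx16": 31, "fx16_fx8_fx8": 31, "sa8_sa8_sa32": None}
BIAS_SHIFT_MAX = {"fx8": 8, "fx16": 16, "fx16_fx8_fx8": 24, "sa8_sa8_sa32": None}


def _in_range(shift, upper):
    """True if 0 <= shift <= upper, or if there is no limitation."""
    return upper is None or 0 <= shift <= upper


def check_weighted_kernel(kernel_type, n_in, n_weight, n_out, n_bias):
    """Return (out_ok, bias_ok) for the output and bias shift restrictions."""
    out_ok = _in_range(n_in + n_weight - n_out, OUT_SHIFT_MAX[kernel_type])
    bias_ok = _in_range(n_in + n_weight - n_bias, BIAS_SHIFT_MAX[kernel_type])
    return out_ok, bias_ok
```

As an example, an fx8 kernel with 7 fractional bits for input, weights, and output but only 3 for the bias passes the output check (shift of 7) and fails the bias check (shift of 11, above the limit of 8).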

Avepool

FX16

To avoid negative shifts (below the lower bound) and excessively large internal shifts (above the upper bound), the input and output fractional bits must satisfy:

\[-14 - \text{ceil}(\text{log}_2 (\text{Wk} \cdot \text{Hk})) < n_\text{in} - n_\text{out} < 16 - \text{ceil}(\text{log}_2 (\text{Wk} \cdot \text{Hk}))\]

with \(\text{Wk}\) and \(\text{Hk}\) the width and height of the kernel respectively.
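The FX16 bound above shifts with the kernel area, which is easy to get wrong by hand. A minimal sketch of the check follows; it is illustrative only, and the function name is hypothetical. The ceiling of the base-2 logarithm is computed with integer arithmetic to avoid floating-point rounding.

```python
def avepool_fx16_ok(n_in, n_out, kernel_width, kernel_height):
    """Check -14 - c < n_in - n_out < 16 - c, with c = ceil(log2(Wk * Hk))."""
    area = kernel_width * kernel_height
    c = (area - 1).bit_length()  # equals ceil(log2(area)) for area >= 1
    return -14 - c < n_in - n_out < 16 - c
```

For a 3x3 kernel the area is 9, so the ceiling term is 4 and the allowed difference between input and output fractional bits lies strictly between -18 and 12.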

SA8

To avoid excessively large internal shifts (below the lower bound) and negative shifts (above the upper bound), the input and output scale factors must satisfy:

\[127 \cdot 2^{-15} \cdot \text{Wk} \cdot \text{Hk} < \frac{s_\text{fx,in} \cdot 2^{-n_\text{in}}} {s_\text{fx,out} \cdot 2^{-n_\text{out}}} < 64 \cdot \text{Wk} \cdot \text{Hk}\]

with \(\text{Wk}\) and \(\text{Hk}\) the width and height of the kernel respectively.
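The SA8 condition compares the ratio of the effective input and output scales with bounds that grow with the kernel area. The sketch below evaluates it in floating point; it is an illustration with a hypothetical function name, not part of the MLI API.

```python
def avepool_sa8_ok(s_in, n_in, s_out, n_out, kernel_width, kernel_height):
    """Check 127 * 2**-15 * Wk*Hk < ratio < 64 * Wk*Hk for the in/out scales."""
    area = kernel_width * kernel_height
    # Ratio of the effective scales s * 2**-n of the input and the output.
    ratio = (s_in * 2.0 ** -n_in) / (s_out * 2.0 ** -n_out)
    return 127 * 2.0 ** -15 * area < ratio < 64 * area
```

For a 2x2 kernel the ratio must lie strictly between roughly 0.0155 and 256, so identical input and output quantization (ratio 1) is always accepted.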

RNN Dense

FX16 and FX16_FX8_FX8

\[0 \leq n_\text{in} + n_\text{weights} - n_\text{out}\]

SA8_SA8_SA32

\[acc\_scale = \frac{s_{fx,in} \cdot s_{fx,weights}}{s_{fx,out}} \cdot 2^{n_{in} + n_{weights} - n_{out}}\]

\[0 < acc\_scale \leq 2^{32 - acc\_size - \text{ceil}(\text{log}_2 (input\_count))}\]

where \(acc\_size\) is the accumulator size including the guard bits and \(input\_count\) is the number of inputs. This restriction avoids saturation when the accumulators of multiple inputs are combined: each accumulator is scaled first, and the scaled accumulators are then added in 32-bit vectors.
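The SA8_SA8_SA32 restriction can be evaluated numerically once the quantization parameters, the accumulator size, and the number of inputs are known. The sketch below follows the two formulas exactly as written above; it is illustrative, the function name is hypothetical, and the accumulator size must be taken from the accumulator width of the target HW configuration.

```python
def rnn_dense_sa8_ok(s_in, s_weights, s_out, n_in, n_weights, n_out,
                     acc_size, input_count):
    """Check 0 < acc_scale <= 2**(32 - acc_size - ceil(log2(input_count)))."""
    acc_scale = (s_in * s_weights / s_out) * 2.0 ** (n_in + n_weights - n_out)
    log2_inputs = (input_count - 1).bit_length()  # ceil(log2(input_count))
    limit = 2.0 ** (32 - acc_size - log2_inputs)
    return 0 < acc_scale <= limit
```

With a 24-bit accumulator and two inputs the limit is 2^7 = 128; with a single input it rises to 2^8 = 256, since no headroom is needed for adding accumulators.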

Leaky and Parametric ReLU

To avoid an extra shift-left instruction in the inner loop, the number of fractional bits of the ‘slope_coeff’/‘alpha’ tensor must not be negative:

Kernel	Kernel Type	Restriction
Leaky ReLU	`fx8` and `fx16`	\(0 \leq n_{slope\_coeff}\)
Parametric ReLU	`fx8` and `fx16`	\(0 \leq n_{alpha}\)

Element-wise Add and Element-wise Sub

FX16

The restrictions below relate to shifting both inputs so that their fractional bits align.

\[\text{abs}(n_\text{in1} - n_\text{in2}) \leq 15\]

\[\text{max}(n_\text{in1}, n_\text{in2}) - 31 \leq n_\text{out} \leq \text{max}(n_\text{in1}, n_\text{in2}) + 31\]
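The two FX16 conditions above can be combined into a single check. The following is a minimal, illustrative sketch with a hypothetical function name, not part of the MLI API.

```python
def eltwise_fx16_ok(n_in1, n_in2, n_out):
    """Check both FX16 element-wise add/sub restrictions."""
    n_max = max(n_in1, n_in2)
    inputs_ok = abs(n_in1 - n_in2) <= 15
    output_ok = n_max - 31 <= n_out <= n_max + 31
    return inputs_ok and output_ok
```

For inputs with 15 and 8 fractional bits, any output with between -16 and 46 fractional bits is accepted, whereas inputs with 15 and -1 fractional bits are rejected because they differ by 16.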

SA8

No VPX-specific limitations (see Element-wise Kernels Group for general limitations and requirements).