ARC VPX Specific Details

The ARC VPX family of processors combines the ARCv2 baseline ISA with the ARCv2 Vector DSP ISA extension. The MLI Library implementation for this family relies heavily on the vector extension to achieve high efficiency.

VPX Memory Allocation

Almost all kernels are implemented with vector instructions and assume that their operands reside in the vector memory (VCCM). This means that:

  • The memory referenced by the data container of every input and output tensor must be allocated within the VCCM memory region.

  • The memory pointed to by the data container of the mli_lut structure must be allocated within the VCCM memory region.

  • Tensor structures, LUT structures, configuration structures, and the memory pointed to by containers inside the el_params field of a tensor may be allocated within any memory region.


VPX Memory Alignment

Addresses of all elements, including data, quantization parameters, and structure fields, must be aligned on an element boundary. This also applies to data allocated in the vector memory (VCCM): addresses of vectors and vector elements must be aligned on a vector-element boundary.

Important

There is one type of memory access that has 8-bit alignment: a unit-stride vector load or store with 8-bit elements (fx8 and sa8 data). For the best performance, vector loads and stores of such data should use even byte addresses (aligned on a 16-bit boundary). This can be achieved by using even shapes or memstrides for sa8 and fx8 tensors. Odd byte addresses are allowed but are less efficient.
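The even-address recommendation can be checked off-target before a model is deployed. The following is a minimal sketch, not part of the MLI API: the function name and its arguments are hypothetical, and it assumes that each unit-stride vector access starts at the beginning of an innermost row, so that only the base address and the outer memstrides decide whether accesses start on even bytes.

```python
def sa8_rows_start_on_even_addresses(base_addr, shape, mem_stride):
    """Return True if every innermost-row start address is even.

    base_addr  -- byte address of the first element (hypothetical input)
    shape      -- tensor dimensions, innermost dimension last
    mem_stride -- distance in elements between consecutive indices of each
                  dimension; for 8-bit data one element is one byte
    """
    if base_addr % 2 != 0:
        return False
    # A row starts at base_addr + sum(index_i * mem_stride_i) over the outer
    # dimensions, so an odd stride only matters when that dimension has
    # more than one index.
    for dim, stride in zip(shape[:-1], mem_stride[:-1]):
        if dim > 1 and stride % 2 != 0:
            return False
    return True


# A 4x5 tensor stored densely has odd row starts; padding the row
# memstride to 6 restores even row starts.
print(sa8_rows_start_on_even_addresses(0x1000, (4, 5), (5, 1)))
print(sa8_rows_start_on_even_addresses(0x1000, (4, 5), (6, 1)))
```

Padding the memstride rather than the shape keeps the logical tensor size unchanged at the cost of one unused byte per row.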

Accumulator

The accumulator width used in calculations depends on the Xvec_guard_bit_option HW configuration parameter. See the Quantization: Influence of Accumulator Bit Depth section for more information on how it influences the usage of the library. The following table summarizes the available options and how many accumulations each allows without overflow.

VPX HW Accumulator width

Kernel Type    Description                        guard bit option = 2   guard bit option = 1   guard bit option = 0
-------------  ---------------------------------  ---------------------  ---------------------  ---------------------
sa8            Accum width                        24 (8 guard bits)      20 (4 guard bits)      16 (0 guard bits)
               MACs guaranteed without overflow   256                    16                     1
fx16           Accum width                        40 (8 guard bits)      36 (4 guard bits)      32 (0 guard bits)
               MACs guaranteed without overflow   256                    16                     1
fx16_fx8_fx8   Accum width                        40 (16 guard bits)     36 (12 guard bits)     32 (8 guard bits)
               MACs guaranteed without overflow   65536                  4096                   256
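The table entries follow a simple pattern: the guaranteed number of MACs is two to the power of the guard-bit count, and the guard-bit count is the accumulator width minus the width of one product. The sketch below reproduces that arithmetic. The helper names are hypothetical, and the product width (input bits plus weight bits) is inferred from the table rather than stated in it.

```python
def guard_bits(acc_width, in_bits, weight_bits):
    """Bits left in the accumulator above one full-width product."""
    # The product of two signed operands is assumed to occupy
    # in_bits + weight_bits bits; this matches the table rows.
    return acc_width - (in_bits + weight_bits)


def macs_without_overflow(acc_width, in_bits, weight_bits):
    """Number of worst-case products that can be summed safely."""
    return 2 ** guard_bits(acc_width, in_bits, weight_bits)


# (accumulator width, input bits, weight bits) taken from the table
for name, acc, a, b in [("sa8", 24, 8, 8), ("fx16", 36, 16, 16),
                        ("fx16_fx8_fx8", 32, 16, 8)]:
    print(name, guard_bits(acc, a, b), macs_without_overflow(acc, a, b))
```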

Operands Limitations and Shifting Ranges

This section describes VPX-specific limitations of the kernels. In this section, \(n_\text{tensor}\) denotes the number of fractional bits of a tensor and \(s_\text{fx,tensor}\) is its scale in the case of an asymmetric data type (see Data Formats).

Weighted Kernels

The restrictions in this section apply to the following kernels:

  • conv2d

  • depthwise_conv2d

  • transpose_conv2d

  • group_conv2d

  • fully_connected

  • rnn_dense

  • gru_cell

  • lstm_cell

Firstly, to avoid negative shifts (lower bound) and overly large internal shifts (upper bound), the following shift restrictions must be adhered to:

Kernel Type             Restriction
----------------------  --------------------------------------------------
fx8                     \(0 \leq n_{in} + n_{weight} - n_{out} \leq 15\)
fx16 and fx16_fx8_fx8   \(0 \leq n_{in} + n_{weight} - n_{out} \leq 31\)
sa8_sa8_sa32            No limitations

Secondly, the following restrictions apply to the left shift of the bias inside the accumulator:

Kernel Type     Restriction
--------------  ---------------------------------------------------
fx8             \(0 \leq n_{in} + n_{weight} - n_{bias} \leq 8\)
fx16            \(0 \leq n_{in} + n_{weight} - n_{bias} \leq 16\)
fx16_fx8_fx8    \(0 \leq n_{in} + n_{weight} - n_{bias} \leq 24\)
sa8_sa8_sa32    No limitations
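Both sets of restrictions for weighted kernels can be validated together when quantization parameters are chosen. The following is a minimal sketch with a hypothetical helper name; it is not an MLI function, and the bounds are copied directly from the two tables above.

```python
# Upper bounds copied from the two tables above.
OUT_SHIFT_MAX = {"fx8": 15, "fx16": 31, "fx16_fx8_fx8": 31}
BIAS_SHIFT_MAX = {"fx8": 8, "fx16": 16, "fx16_fx8_fx8": 24}


def weighted_kernel_violations(kernel_type, n_in, n_weight, n_out, n_bias):
    """Return a list of violated restrictions (empty if all hold)."""
    if kernel_type == "sa8_sa8_sa32":
        return []  # no VPX-specific limitations for this type
    violations = []
    out_shift = n_in + n_weight - n_out
    if not 0 <= out_shift <= OUT_SHIFT_MAX[kernel_type]:
        violations.append(f"output shift {out_shift} out of range")
    bias_shift = n_in + n_weight - n_bias
    if not 0 <= bias_shift <= BIAS_SHIFT_MAX[kernel_type]:
        violations.append(f"bias shift {bias_shift} out of range")
    return violations


# An fx8 layer whose bias has far fewer fractional bits than the products
print(weighted_kernel_violations("fx8", n_in=7, n_weight=7, n_out=7, n_bias=3))
```

Returning the list of violations instead of a single boolean makes it easier to report which tensor needs to be requantized.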

Avepool

FX16

To avoid negative shifts (lower bound) and overly large internal shifts (upper bound), the input and output fractional bits must satisfy:

\[-14 - \text{ceil}(\text{log}_2 (\text{Wk} \cdot \text{Hk})) < n_\text{in} - n_\text{out} < 16 - \text{ceil}(\text{log}_2 (\text{Wk} \cdot \text{Hk}))\]

with \(\text{Wk}\) and \(\text{Hk}\) the width and height of the kernel respectively.
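As an illustration, the FX16 condition can be evaluated with integer arithmetic only. This is a sketch with a hypothetical function name; it uses the fact that for a positive integer \(a\), \(\text{ceil}(\text{log}_2(a))\) equals the bit length of \(a - 1\).

```python
def avepool_fx16_shift_ok(n_in, n_out, kernel_width, kernel_height):
    """Check the FX16 average-pooling restriction on fractional bits."""
    area = kernel_width * kernel_height
    # ceil(log2(area)) for area >= 1, computed without floating point
    log_area = (area - 1).bit_length()
    return -14 - log_area < n_in - n_out < 16 - log_area


# For a 3x3 kernel ceil(log2(9)) = 4, so the allowed range is
# -18 < n_in - n_out < 12.
print(avepool_fx16_shift_ok(15, 4, 3, 3))
print(avepool_fx16_shift_ok(15, 3, 3, 3))
```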

SA8

To avoid overly large internal shifts (lower bound) and negative shifts (upper bound), the input and output scale factors must satisfy:

\[127 \cdot 2^{-15} \cdot \text{Wk} \cdot \text{Hk} < \frac{s_\text{fx,in} \cdot 2^{-n_\text{in}}} {s_\text{fx,out} \cdot 2^{-n_\text{out}}} < 64 \cdot \text{Wk} \cdot \text{Hk}\]

with \(\text{Wk}\) and \(\text{Hk}\) the width and height of the kernel respectively.
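The SA8 condition compares the ratio of the effective input and output scales with bounds that grow with the kernel area. A minimal sketch follows, with a hypothetical function name and plain Python floats standing in for the scale values:

```python
def avepool_sa8_scale_ok(s_in, n_in, s_out, n_out, kernel_width, kernel_height):
    """Check the SA8 average-pooling restriction on scale factors."""
    area = kernel_width * kernel_height
    # Ratio of effective scales: (s_in * 2^-n_in) / (s_out * 2^-n_out)
    ratio = (s_in * 2.0 ** -n_in) / (s_out * 2.0 ** -n_out)
    lower = 127 * 2.0 ** -15 * area
    upper = 64 * area
    return lower < ratio < upper


# Equal input and output scales give a ratio of 1, which lies well
# inside the bounds for a 3x3 kernel.
print(avepool_sa8_scale_ok(16384, 15, 16384, 15, 3, 3))
```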

RNN Dense

FX16 and FX16_FX8_FX8

\[0 \leq n_\text{in} + n_\text{weights} - n_\text{out}\]

SA8_SA8_SA32

\[\begin{aligned} acc\_scale &= \frac{s_{fx,in} \cdot s_{fx,weights}}{s_{fx,out}} \cdot 2^{n_{in} + n_{weights} - n_{out}} \\ 0 &< acc\_scale \leq 2^{32 - acc\_size - \text{ceil}(\text{log}_2 (input\_count))} \end{aligned}\]

where \(acc\_size\) is the accumulator size including the guard bits. This restriction avoids saturation when the accumulators of multiple inputs are combined after scaling, because the accumulators are scaled and then added in 32-bit vectors.
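The SA8_SA8_SA32 restriction can be evaluated numerically as sketched below. The function name is hypothetical, and input_count is assumed here to be the number of inputs whose accumulators are added together, which the text implies but does not define explicitly.

```python
def rnn_dense_sa8_acc_scale_ok(s_in, s_weights, s_out,
                               n_in, n_weights, n_out,
                               acc_size, input_count):
    """Check 0 < acc_scale <= 2^(32 - acc_size - ceil(log2(input_count)))."""
    acc_scale = (s_in * s_weights / s_out) * 2.0 ** (n_in + n_weights - n_out)
    # ceil(log2(input_count)) for input_count >= 1, without floating point
    log_inputs = (input_count - 1).bit_length()
    limit = 2.0 ** (32 - acc_size - log_inputs)
    return 0 < acc_scale <= limit


# With a 24-bit accumulator and two inputs the limit is 2^(32 - 24 - 1) = 128.
print(rnn_dense_sa8_acc_scale_ok(1, 1, 1, 4, 3, 0, acc_size=24, input_count=2))
```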

Leaky and Parametric ReLU

To avoid an extra shift-left instruction in the inner loop, a negative number of fractional bits is not permitted for the ‘slope_coeff’/‘alpha’ tensor:

Kernel            Kernel Type    Restriction
----------------  -------------  -----------------------------
Leaky ReLU        fx8 and fx16   \(0 \leq n_{slope\_coeff}\)
Parametric ReLU   fx8 and fx16   \(0 \leq n_{alpha}\)

Element-wise Add and Element-wise Sub

FX16

The restrictions below relate to shifting both inputs so that their fractional bits align.

\[\text{abs}(n_\text{in1} - n_\text{in2}) \leq 15\]
\[\text{max}(n_\text{in1}, n_\text{in2}) - 31 \leq n_\text{out} \leq \text{max}(n_\text{in1}, n_\text{in2}) + 31\]
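The two FX16 conditions can be combined into one check, sketched below with a hypothetical function name:

```python
def eltwise_fx16_shifts_ok(n_in1, n_in2, n_out):
    """Check the FX16 element-wise add/sub restrictions on fractional bits."""
    max_in = max(n_in1, n_in2)
    # Aligning the two inputs may shift one of them by at most 15 bits
    inputs_ok = abs(n_in1 - n_in2) <= 15
    # The output may differ from the aligned inputs by at most 31 bits
    output_ok = max_in - 31 <= n_out <= max_in + 31
    return inputs_ok and output_ok


print(eltwise_fx16_shifts_ok(15, 0, 15))
print(eltwise_fx16_shifts_ok(15, -1, 15))
```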

SA8

No VPX specific limitations (see Element-wise Kernels Group for general limitations/requirements).