ARC VPX Specific Details¶
The ARC VPX family of processors combines the ARCv2 baseline ISA with the ARCv2 Vector DSP ISA extension. The MLI Library implementation for this family makes extensive use of the vector extension, which allows it to achieve high efficiency.
VPX Memory Allocation¶
The implementation of almost all kernels uses vector instructions and assumes that operands reside in vector memory (VCCM). This means that:
- The memory referenced by the data container of every input and output tensor must be allocated within the VCCM memory region.
- The memory pointed to by the data container of the mli_lut structure must be allocated within the VCCM memory region.
- Tensor structures, LUT structures, configuration structures, and the memory pointed to by containers inside the el_params field of a tensor may be allocated within any memory region.
- This applies to:
All functions from kernels group (see MLI Kernels (Operators))
All functions related to conversion group (see Data Conversion Group)
- This doesn’t apply to:
All functions from helpers group (see Helper Functions Group)
All functions from move group (see Data Movement)
VPX Memory Alignment¶
Addresses of all elements, including data, quantization parameters, and structure fields, must be aligned on an element boundary. This also applies to data allocated in vector memory (VCCM). Addresses of vectors and vector elements must be properly aligned on a vector-element boundary.
Important
There is one type of memory access that has 8-bit alignment: a unit-stride vector load or store with 8-bit elements (fx8 and sa8 data). For the best performance, vector load and store accesses for such data must use even byte addresses (aligned on a 16-bit boundary). This can be achieved by using even shapes or memstrides for sa8 and fx8 tensors. Odd byte addresses are allowed but less efficient.
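The even-address recommendation above can be checked with a few lines of C. This is a minimal sketch, not part of the MLI API: the helper names is_16bit_aligned and round_up_to_even are hypothetical, and the memstride is assumed to be expressed in 8-bit elements, so that an even base address combined with an even memstride keeps every row start on an even byte address.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: true when the byte address is even,
 * i.e. aligned on a 16-bit boundary. */
static bool is_16bit_aligned(const void *ptr)
{
    return ((uintptr_t)ptr & 1u) == 0;
}

/* Hypothetical helper: rounds a memstride (counted in 8-bit elements)
 * up to the next even value; even values are returned unchanged. */
static size_t round_up_to_even(size_t memstride)
{
    return (memstride + 1u) & ~(size_t)1u;
}
```

For example, a row of 7 sa8 elements would be padded to a memstride of 8 so that each subsequent row starts on an even byte address.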
Accumulator¶
The accumulator width used in calculations depends on the Xvec_guard_bit_option HW configuration parameter. See the Quantization: Influence of Accumulator Bit Depth section for more information on how it influences the usage of the library. The following table summarizes the available options and how many accumulations each allows without overflow.
Kernel Type | Description | guard bit option = 2 | guard bit option = 1 | guard bit option = 0 |
---|---|---|---|---|
 | Accum width | 24 (8 guard bits) | 20 (4 guard bits) | 16 (0 guard bits) |
 | MACs without overflow (guaranteed) | 256 | 16 | 1 |
 | Accum width | 40 (8 guard bits) | 36 (4 guard bits) | 32 (0 guard bits) |
 | MACs without overflow (guaranteed) | 256 | 16 | 1 |
 | Accum width | 40 (16 guard bits) | 36 (12 guard bits) | 32 (8 guard bits) |
 | MACs without overflow (guaranteed) | 65536 | 4096 | 256 |
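The MAC counts in the table are consistent with each guard bit doubling the accumulation headroom, that is, \(2^{g}\) worst-case accumulations for \(g\) guard bits. The sketch below reproduces the table values under that assumption; the function name macs_without_overflow is hypothetical and not part of the MLI API.

```c
#include <stdint.h>

/* Hypothetical helper: number of worst-case multiply-accumulate
 * operations that fit in the accumulator without overflow, assuming
 * each guard bit doubles the headroom (2^guard_bits).
 * Only valid for guard_bits < 32. */
static uint32_t macs_without_overflow(unsigned guard_bits)
{
    return (uint32_t)1u << guard_bits;
}
```

With 8 guard bits this gives 256, with 4 guard bits 16, and with 0 guard bits a single MAC, matching the rows above.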
Operands Limitations and Shifting Ranges¶
This section describes VPX specific limitations to kernels. In this section, \(n_\text{tensor}\) denotes the fractional bits of a tensor and \(s_\text{fx,tensor}\) is its scale in case of an asymmetric data type (see Data Formats).
Weighted Kernels¶
For the following kernels:
conv2d
depthwise_conv2d
transpose_conv2d
group_conv2d
fully_connected
rnn_dense
gru_cell
lstm_cell
Firstly, to avoid negative shifts (below the lower bound) and excessively large internal shifts (above the upper bound), the following shift restrictions must be adhered to:
Kernel Type | Restriction |
---|---|
 | \(0 \leq n_{in} + n_{weight} - n_{out} \leq 15\) |
 | \(0 \leq n_{in} + n_{weight} - n_{out} \leq 31\) |
 | No Limitations |
Secondly, the following restrictions relate to shifting left the bias inside an accumulator:
Kernel Type | Restriction |
---|---|
 | \(0 \leq n_{in} + n_{weight} - n_{bias} \leq 8\) |
 | \(0 \leq n_{in} + n_{weight} - n_{bias} \leq 16\) |
 | \(0 \leq n_{in} + n_{weight} - n_{bias} \leq 24\) |
 | No Limitations |
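Both sets of restrictions for weighted kernels have the same shape: the difference \(n_{in} + n_{weight} - n_{x}\) must lie between 0 and a kernel-type-specific upper bound, where \(n_{x}\) is \(n_{out}\) or \(n_{bias}\). A minimal sketch of such a check follows; the function name shift_in_range and its parameters are hypothetical, and the caller is assumed to supply the upper bound that applies to the kernel type in use.

```c
#include <stdbool.h>

/* Hypothetical helper: checks 0 <= n_in + n_weight - n_target <= max_shift,
 * where n_target is the number of fractional bits of the output or bias
 * tensor and max_shift is the upper bound for the kernel type. */
static bool shift_in_range(int n_in, int n_weight, int n_target, int max_shift)
{
    int shift = n_in + n_weight - n_target;
    return shift >= 0 && shift <= max_shift;
}
```

For instance, with \(n_{in} = 7\), \(n_{weight} = 7\), and \(n_{out} = 7\) the shift is 7, which satisfies an upper bound of 15, whereas \(n_{in} = 3\), \(n_{weight} = 3\), and \(n_{out} = 7\) gives a negative shift and is rejected.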
Avepool¶
FX16
To avoid negative shifts (below the lower bound) and excessively large internal shifts (above the upper bound), the input and output fractional bits must satisfy the following restriction:
with \(\text{Wk}\) and \(\text{Hk}\) the width and height of the kernel respectively.
SA8
To avoid large internal shifts below the lower bound and negative shifts above the upper bound, the input and output scale factors must satisfy the following restriction:
with \(\text{Wk}\) and \(\text{Hk}\) the width and height of the kernel respectively.
RNN Dense¶
FX16 and FX16_FX8_FX8
SA8_SA8_SA32
where \(acc\_size\) is the accumulator size including the guard bits. The restriction avoids saturation when the accumulators of multiple inputs are combined after scaling, since the accumulators are scaled and added in 32-bit vectors.
Leaky and Parametric ReLU¶
To avoid an extra shift-left instruction in the inner loop, a negative number of fractional bits is not permitted for the slope_coeff (Leaky ReLU) or alpha (Parametric ReLU) tensor:
Kernel | Kernel Type | Restriction |
---|---|---|
Leaky ReLU | | \(0 \leq n_{slope\_coeff}\) |
Parametric ReLU | | \(0 \leq n_{alpha}\) |
Element-wise Add and Element-wise Sub¶
FX16
The following restriction relates to shifting both inputs so that their fractional bits align.
SA8
No VPX specific limitations (see Element-wise Kernels Group for general limitations/requirements).