.. _data_mvmt: Data Movement ============= Most processors and accelerators achieve optimal performance by keeping the kernel input data in "close" local memories on those processors (for example, CCMs). Meanwhile, the total amount of data that needs to be operated on is usually bigger than the sizes of those CCMs, so you must copy data into CCMs for processing. The functions in the Data Movement Group assist with this moving (copying) of data. For further efficiency, the APIs allow data to be manipulated in several ways, thus allowing copy and manipulation to happen at the same time, rather than in two separate steps. The supported transformations are described here and shown pictorially in Figure :ref:`f_data_mv_grp_t` - Copy - a simple copy operation only - Slicing - select a subset of a larger tensor and copy it to the destination tensor - Concatenation - copy the content of a smaller tensor into a larger one - Subsample - Reduce the size of the output by taking only every nth element from the source - Permute - Reorder dimensions of a tensor - Padding - Add zeros around the specified dimensions More than one transform can be combined into a single operation. .. _f_data_mv_grp_t: .. figure:: ../images/data_mv_grp_transfm.png :align: center :alt: Data Movement Group Transformations Data Movement Group Transformations .. The data movement APIs support both synchronous (blocking) and asynchronous (non-blocking) APIs. The caller can choose which is most suitable for their use-case. The functions are described in the following sections. Synchronous Copy from Source Tensor to Destination Tensor --------------------------------------------------------- This function performs a blocking data copy from the source tensor to the destination tensor according to the settings in the configuration structure, ``mli_mov_cfg``. Any combination of the previously-mentioned transformations can be specified. The destination tensor needs to contain a valid pointer to a large enough buffer. The size of this buffer needs to be specified in the capacity field of the destination tensor. The other fields of the destination tensor are filled by the move function. The function returns after the complete data transfer completes. For platforms with a DMA, the synchronous move function internally acquires one or more DMA channels from a pool of resources. The application needs to use the function ``mli_mov_set_num_dma_ch`` to assign a set of channels to MLI for its exclusive use. See section :ref:`dma_res_mgmt` for a detailed explanation. .. code:: c mli_status mli_mov_tensor_sync ( mli_tensor *src, mli_mov_cfg *cfg, mli_tensor *dst); .. where ``mli_mov_cfg`` is defined as: .. code:: c typedef struct { uint32_t offset[MLI_MAX_RANK]; uint32_t size[MLI_MAX_RANK]; uint32_t sub_sample_step[MLI_MAX_RANK]; uint32_t dst_offset[MLI_MAX_RANK]; int32_t dst_mem_stride[MLI_MAX_RANK]; uint8_t perm_dim[MLI_MAX_RANK]; uint8_t padding_pre[MLI_MAX_RANK]; uint8_t padding_post[MLI_MAX_RANK]; } mli_mov_cfg_t; .. The fields of this structure are described in Table :ref:`t_mli_mov_cfg_desc`. All the fields are arrays with size ``MLI_MAX_RANK``. Fields are stored in order starting from the one with the largest stride between the data portions. For example, for a matrix A(rows, columns), shape[0] = rows, shape[1] = columns. The data move function does not change the number of dimensions. The rank of the source tensor determines the amount of values that are read from the array. The other values are "don't care". The size of the array is defined by ``MLI_MAX_RANK``. .. tabularcolumns:: |\Y{0.25}|\Y{0.15}|\Y{0.4}| .. _t_mli_mov_cfg_desc: .. table:: mli_mov_cfg Structure Field Description :align: center :class: longtable +---------------------+----------------+---------------------------------------------------------------------+ | **Field Name** | **Type** | **Description** | +=====================+================+=====================================================================+ | ``offset`` | ``uint32_t[]`` | Start coordinate in the source tensor. Values must be smaller | | | | than the shape of the source tensor. | +---------------------+----------------+---------------------------------------------------------------------+ | | | Size of the copy in elements per dimension. | | | | | | ``size`` | ``uint32_t[]`` | Restrictions: | | | | | | | | :math:`Size[d] + offset[d] <= src->shape[d]` | | | | | | | | If :math:`size=0` is provided, the size is computed from the input | | | | size and the cfg parameters. | +---------------------+----------------+---------------------------------------------------------------------+ | | | Subsample factor for each dimension. Default value is 1, which | | ``sub_sample_step`` | ``uint32_t[]`` | means no subsampling. For example, a subsample step of 3 means that | | | | every third sample is copied to the output. | +---------------------+----------------+---------------------------------------------------------------------+ | ``dst_offset`` | ``uint32_t[]`` | Start coordinate in the destination tensor. Values must be | | | | smaller than the memstride of the destination tensor. | +---------------------+----------------+---------------------------------------------------------------------+ | | | Distance in elements to the next dimension in the destination | | ``dst_mem_stride`` | ``int32_t[]`` | tensor. If zero, it is computed from input tensor and cfg | | | | parameters. | +---------------------+----------------+---------------------------------------------------------------------+ | ``perm_dim`` | ``uint8_t[]`` | Array to specify reordering of dimensions. For example, to convert | | | | from CHW layout to HWC layout this array would be {1, 2, 0}. | +---------------------+----------------+---------------------------------------------------------------------+ | ``padding_pre`` | ``uint8_t[]`` | Number of padded samples before the input data for each dimension. | | | | Padding is a virtual extension of the input tensor. | | | | Padded samples are set to zero. | +---------------------+----------------+---------------------------------------------------------------------+ | ``padding_post`` | ``uint8_t[]`` | Number of padded samples after the data for each dimension. | | | | Padding is a virtual extension of the input tensor. | | | | Padded samples are set to zero. | +---------------------+----------------+---------------------------------------------------------------------+ .. It is possible to combine multiple 'operations' in one move. In that the internal order of how the parameters are applied is relevant: - padding_pre/padding_post (using cfg->padding_pre and cfg->padding_post as described in figure :ref:`f_mli_mov_cfg_params_pad`) - crop (using cfg->offset and cfg->size as described in figure :ref:`f_mli_mov_cfg_params_crop`) - subsampling (using cfg->sub_sample_step as described in figure :ref:`f_mli_mov_cfg_params_sub`) - permute (using cfg->perm_dim) - write at offset (using cfg->dst_offset and cfg->dst_mem_stride) .. _f_mli_mov_cfg_params_pad: .. figure:: ../images/mli_mov_cfg_params_pad.png :align: center mli_mov_cfg Structure Parameters for padding .. _f_mli_mov_cfg_params_crop: .. figure:: ../images/mli_mov_cfg_params_crop.png :align: center mli_mov_cfg Structure Parameters for cropping .. _f_mli_mov_cfg_params_sub: .. figure:: ../images/mli_mov_cfg_params_sub.png :align: center mli_mov_cfg Structure Parameters for subsampling Ensure that you satisfy the following conditions before calling the function: - ``src`` tensor must be valid. - ``dst`` tensor must contain a valid pointer to a buffer with sufficient capacity (that is, the total amount of elements in input tensor). Other fields are filled by the kernel (shape, rank and element-specific parameters). - Buffers of ``src`` and ``dst`` tensors must point to different, non-overlapped memory regions. For ``src`` tensor of **sa8** or **sa32** types, and in case of per-axis quantization, the ``el_params`` field of ``dst`` tensor is filled by the function using ``src`` quantization parameters. The following fields are affected: - ``dst.el_params.sa.zero_point.mem.pi16`` and related capacity field - ``dst.el_params.sa.scale.mem.pi16`` and related capacity field - ``dst.el_params.sa.scale_frac_bits.mem.pi8`` and related capacity field Depending on the state of the preceding pointer fields, ensure that you choose only one of the following options to initialize all the fields in a consistent way: - If you initialize the pointers with ``nullptr``, then corresponding fields from the ``in`` tensor are copied to ``dst`` tensor. No copy of quantization parameters itself is performed. - If you initialize the pointers and capacity fields with the corresponding fields from the ``in`` tensor, then no action is applied. - If you initialize the pointers and capacity fields with pre-allocated memory and its capacity, then a copy of quantization parameters itself is performed. Capacity of allocated memory must be big enough to keep related data from input tensor. Some operations (like padding, concat, slice, subsample) when applied on the quantization axis will affect the quantization parameters (e.g. If subsampling on the quantization axis is applied, also the quantization parameters will be subsampled). For these cases the output tensor should contain valid pre-allocated memory buffers to get correct quantization parameters. If this is not done, and the pointers are initialized with ``nullptr`` or with the same pointers as the src tensor, the result is undefined. - In case of per-axis quantization of the src tensor, the axis needs to match the (permuted) quantization axis in the dst tensor. - A combination of per-axis and per-tensor quantization is not allowed. - In case of padding on the quantization axis, the quantization parameters for the padded area will be set to scale=1, scale_frac_bits=0, zero_point=0 Depending on the debug level (see section :ref:`err_codes`) this function performs a parameter check and returns the result as an ``mli_status`` code as described in section :ref:`kernl_sp_conf`. Helper Functions for Data Move Config Struct -------------------------------------------- When only one of the transformations is needed during the copy, several helper functions can be used to fill the config struct. These are described in :ref:`t_desc_helper_func`. The arguments to the function are copied into the cfg struct while the remaining parameters are set to their default values. In the case of multiple transformations, there is a generic helper function available or the user can manually fill the cfg struct parameters. Note that the mli_mov_cfg structure is described in detail in :ref:`t_mli_mov_cfg_desc`. .. tabularcolumns:: |\Y{0.4}|\Y{0.6}| .. _t_desc_helper_func: .. table:: Description of Helper Functions for Data Move Config Struct :align: center :class: longtable +------------------------------------+---------------------------------------------------------------------+ | **Function Name** | **Description** | +====================================+=====================================================================+ | .. code:: c | | | | | | mli_mov_cfg_for_copy( | Fills the cfg struct with the values needed for a full tensor | | mli_mov_cfg_t *cfg) | copy and sets all the other fields to a neutral value. | | .. | | | | - **cfg**: pointer to the config structure that is filled | +------------------------------------+---------------------------------------------------------------------+ | .. code:: c | | | | | | mli_mov_cfg_for_slice ( | Fill the cfg struct with the values needed for copying a | | mli_mov_cfg_t *cfg, | slice from the source to the destination tensor. | | int* offsets | | | int* sizes, | - **cfg**: pointer to the config structure that is filled | | int* dst_mem_stride); | | | .. | - **offsets**: Start coordinate in the source tensor. Values must | | | be smaller than the shape of the source tensor. | | | | | | - **sizes**: Size of the copy in elements per dimension. | | | | | | - **dst_mem_stride**: Distance in elements to the next dimension in | | | the destination tensor. | +------------------------------------+---------------------------------------------------------------------+ | .. code:: c | | | | | | mli_mov_cfg_for_concat( | Fill the cfg struct with the values needed for copying a complete | | mli_mov_cfg_t *cfg, | tensor into a larger tensor at a specified position. | | int* dst_offsets, | | | int* dst_mem_stride); | - **cfg**: pointer to the config structure that is filled | | .. | | | | - **dst_offsets**: Start coordinate in the destination tensor. | | | Values must be smaller than the memstride of the destination | | | tensor. | | | | | | - **dst_mem_strides**: Distance in elements to the next dimension | | | in the destination tensor. | +------------------------------------+---------------------------------------------------------------------+ | .. code:: c | | | | | | mli_mov_cfg_for_subsample( | Fill the cfg struct with the values needed for subsampling a | | mli_mov_cfg_t *cfg, | tensor. | | int* sub_sample_step, | | | int* dst_mem_stride); | - **cfg**: pointer to the config structure that is filled | | .. | | | | - **subsample_step**: Subsample factor for each dimension. Default | | | value is 1, which means no subsampling | | | | | | - **dst_mem_strides**: Distance in elements to the next dimension | | | in the destination tensor | +------------------------------------+---------------------------------------------------------------------+ | .. code:: c | | | | | | mli_mov_cfg_for_permute( | | | mli_mov_cfg_t *cfg, | Fill the cfg struct with the values needed for reordering the order | | uint8_t* perm_dim); | of the dimensions in a tensor. | | | | | .. | - **cfg**: pointer to the config structure that is filled | | | | | | - **perm_dim**: Array to specify reordering of dimensions, see | | | :ref:`t_mli_mov_cfg_desc` for details | +------------------------------------+---------------------------------------------------------------------+ | .. code:: c | | | | | | mli_mov_cfg_for_padding2d_chw( | Fill the cfg struct with the values needed to add zero padding to a | | mli_mov_cfg_t *cfg, | tensor in CHW layout. | | uint8_t padleft, | | | uint8_t padright, | - **cfg**: pointer to the config structure that is filled | | uint8_t padtop, | | | uint8_t padbot, | - **padleft**: number of zero samples to be added to the left of | | int* dst_mem_stride); | the W dimension | | .. | | | | - **padright**: number of zero samples to be added to the right of | | | the W dimension | | | | | | - **padtop**: number of zero samples to be added at the top of the | | | H dimension | | | | | | - **padbot**: number of zero samples to be added at the bottom of | | | the H dimension | | | | | | - **dst_mem_strides**: Distance in elements to the next dimension | | | in the destination tensor | +------------------------------------+---------------------------------------------------------------------+ | .. code:: c | | | | | | mli_mov_cfg_for_padding2d_hwc( | Fill the cfg struct with the values needed to add zero padding to a | | mli_mov_cfg_t *cfg, | tensor in HWC layout. | | uint8_t padleft, | | | uint8_t padright, | - **cfg**: pointer to the config structure that is filled | | uint8_t padtop, | | | uint8_t padbot, | - **padleft**: number of zero samples to be added to the left of | | int* dst_mem_stride); | the W dimension | | .. | | | | - **padright**: number of zero samples to be added to the right of | | | the W dimension | | | | | | - **padtop**: number of zero samples to be added at the top of the | | | H dimension | | | | | | - **padbot**: number of zero samples to be added at the bottom of | | | the H dimension | | | | | | - **dst_mem_strides**: Distance in elements to the next dimension | | | in the destination tensor | +------------------------------------+---------------------------------------------------------------------+ | .. code:: c | | | | | | mli_mov_cfg_all( | This function fills the cfg struct with the values provided as | | mli_mov_cfg_t *cfg, | function arguments. It is recommended that applications use this | | int* offsets, | function instead of direct structure access, so that application | | int* sizes, | code does not have to change if the structure format ever changes. | | int* subsample_step, | | | int* dst_offsets, | - **cfg**: pointer to the config structure that is filled | | int* dst_mem_strides, | | | uint8_t* perm_dim, | - **offsets**: Start coordinate in the source tensor. Values must | | uint8_t* pad_pre, | be smaller than the shape of the source tensor. | | uint8_t* pad_post); | | | .. | - **sizes**: Size of the copy in elements per dimension. | | | | | | - **subsample_step**: Subsample factor for each dimension. Default | | | value is 1, which means no subsampling | | | | | | - **dst_offsets**: Start coordinate in the destination tensor. | | | Values must be smaller than the memstride of the destination | | | tensor. | | | | | | - **dst_mem_strides**: Distance in elements to the next dimension | | | in the destination tensor | | | | | | - **perm_dim**: Array to specify reordering of dimensions. | | | | | | - **pad_pre**: Number of padded samples before the data for each | | | dimension. Padded samples are set to zero. | | | | | | - **pad_post**: Number of padded samples after the data for each | | | dimension. Padded samples are set to zero | +------------------------------------+---------------------------------------------------------------------+ .. Asynchronous Data Move Functions -------------------------------- Certain implementations might choose to perform other processing while the move operations are in progress. This is especially helpful for systems that use a DMA to move the data. The asynchronous API can be used in that case. The operation is divided into three separate steps, each with corresponding APIs: 1. Preparation (DMA programming) 2. Start processing (trigger DMA) 3. Done notification (DMA finished, data is ready) – via either callback or polling Between steps 2 & 3, the application can do other processing. These APIs use the ``mli_mov_handle_t`` type. The definition of this type is private to the implementation, but to avoid dynamic memory allocation the definition is put in the public header file. This way the caller can allocate it on the stack. .. Preparation ^^^^^^^^^^^ The ``mli_mov_prepare`` function is called first to set up the transfer. The functionality of this function varies depending on the implementation, but it often performs DMA initialization. Table :ref:`t_mli_mov_prep` describes the parameters of this function. .. code:: c mli_status mli_mov_prepare(mli_mov_handle_t* h, const mli_tensor* src, const mli_mov_cfg_t* cfg, mli_tensor* dst); .. .. tabularcolumns:: |\Y{0.35}|\Y{0.65}| .. _t_mli_mov_prep: .. table:: mli_mov_prepare Parameters :align: center +--------------------------+-------------------------------------------------------------+ | **Parameter Name** | **Description** | +==========================+=============================================================+ | ``mli_mov_handle_t* h`` | Pointer to a handle obtained by ``mli_mov_acquire_handle``. | | | See :ref:`dma_res_mgmt` for details | +--------------------------+-------------------------------------------------------------+ | ``mli_tensor* src`` | Pointer to Source tensor | +--------------------------+-------------------------------------------------------------+ | ``mli_mov_cfg_t* cfg`` | Pointer to a cfg structure (see | | | :ref:`t_mli_mov_cfg_desc` for details) | +--------------------------+-------------------------------------------------------------+ | ``mli_tensor* dst`` | Pointer to Destination tensor | +--------------------------+-------------------------------------------------------------+ .. Depending on the debug level (see section :ref:`err_codes`), this function performs a parameter check and returns the result as an ``mli_status`` code as described in section :ref:`kernl_sp_conf`. Start Processing ^^^^^^^^^^^^^^^^ The ``mli_mov_start`` function is called to begin the previously-setup transfer. Table :ref:`t_mli_mov_start` describes the parameters of this function. If this function is called without first calling ``mli_mov_prepare`` for a given handle, the DMA might be triggered with an old configuration leading to undefined behavior. In a debug build, an assert is triggered. .. code:: c mli_status mli_mov_start(mli_mov_handle_t* h, const mli_tensor* src, const mli_mov_cfg_t* cfg, mli_tensor* dst); .. .. _t_mli_mov_start: .. table:: mli_mov_start Parameters :align: center :widths: auto +--------------------------+--------------------------------------------+ | **Parameter Name** | **Description** | +==========================+============================================+ | ``mli_mov_handle_t* h`` | Pointer to handle used when calling | | | associated ``mli_move_prepare`` | +--------------------------+--------------------------------------------+ | ``mli_tensor* src`` | Pointer to Source tensor | +--------------------------+--------------------------------------------+ | ``mli_mov_cfg_t* cfg`` | Pointer to a cfg structure (see | | | :ref:`t_mli_mov_cfg_desc` for description) | +--------------------------+--------------------------------------------+ | ``mli_tensor* dst`` | Pointer to Destination tensor | +--------------------------+--------------------------------------------+ .. Depending on the debug level (see section :ref:`err_codes`), this function performs a parameter check and returns the result as an ``mli_status`` code as described in section :ref:`kernl_sp_conf`. Done Notification - Callback ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ You can register a callback function which is called after the data move is finished. A callback is registered with the following function. The parameters are described in Table :ref:`t_mli_mov_regcb`. .. code:: c mli_status mli_mov_registercallback(mli_mov_handle_t* h, void (*cb)(int32_t), int32_t cookie); .. .. _t_mli_mov_regcb: .. table:: mli_mov_registercallback Parameters :align: center :widths: auto +--------------------------+-------------------------------------------------+ | **Parameter Name** | **Description** | +==========================+=================================================+ | ``mli_mov_handle_t* h`` | Pointer to handle used when calling associated | | | ``mli_move_prepare`` | +--------------------------+-------------------------------------------------+ | ``void (*cb)(int32_t)`` | Pointer to user-supplied callback function | +--------------------------+-------------------------------------------------+ | ``int32_t cookie`` | Parameter passed to callback function | +--------------------------+-------------------------------------------------+ .. .. note:: If a callback is used, ``mli_mov_registercallback`` must be called before ``mli_mov_start`` to avoid race conditions. A race condition would arise if the DMA transaction is faster than the registration of the callback and would cause the callback to not be called. .. If a callback function has been registered, this callback is called after the DMA transaction completes, and the value of cookie is passed in as an argument. Done Notification - Polling ^^^^^^^^^^^^^^^^^^^^^^^^^^^ You can also simply poll for the completion of the DMA transaction using this function: .. code:: c bool mli_mov_isdone(mli_mov_handle_t* h); .. This function takes a pointer to the handle used for ``mli_mov_prepare`` and returns: - True - if the transaction is complete - False - if the transaction is still in progress You can also wait for the DMA to compete using the following function: .. code:: c mli_status mli_mov_wait(mli_mov_handle_t* h); .. This function takes a pointer to the handle used for ``mli_mov_prepare`` and returns after the transaction completes or in case of an error. Restrictions for Source and Destination Tensors ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ``src`` and ``dst`` tensors for all functions of asynchronous data move set must comply to the following conditions: - ``src`` tensor must be valid. - ``dst`` tensor must contain a valid pointer to a buffer with sufficient capacity. (that is, the total amount of elements in input tensor). Other fields are filled by the kernel (shape, rank and element-specific parameters). - Buffers of ``src`` and ``dst`` tensors must point to different, non-overlapped memory regions. For ``src`` tensor of **sa8** or **sa32** types, and in case of per-axis quantization, the ``el_params`` field of ``dst`` tensor is filled by the function using ``src`` quantization parameters. The following fields are affected: - ``dst.el_params.sa.zero_point.mem.pi16`` and related capacity field - ``dst.el_params.sa.scale.mem.pi16`` and related capacity field - ``dst.el_params.sa.scale_frac_bits.mem.pi8`` and related capacity field Depending on the state of the preceding pointer fields, ensure that you choose only one of the following options to initialize all the fields in a consistent way: - If you initialize the pointers with ``nullptr``, then corresponding fields from the ``in`` tensor are copied to ``dst`` tensor. No copy of quantization parameters itself is performed. - If you initialize the pointers and capacity fields with the corresponding fields from the ``in`` tensor, then no action is applied. - If you initialize the pointers and capacity fields with pre-allocated memory and its capacity, then a copy of quantization parameters itself is performed. Capacity of allocated memory must be big enough to keep related data from input tensor. .. _dma_res_mgmt: DMA Resource Management ----------------------- The MLI API permits multiple mov transactions occurring in parallel, if the particular target hardware has a DMA engine which supports multiple channels. MLI also assumes that other parts of the system might want to access the DMA Engine at the same time and relies on the application/caller to provide it with a pool of available DMA channels that can be used exclusively by MLI. The following functions are used for this purpose: The ``mli_mov_set_num_dma_ch`` is called once at initialization time to assign a set of channels to MLI for its exclusive use. .. code:: c mli_status mli_mov_set_num_dma_ch(int ch_offset, int num_ch); .. - ``ch_offset`` - first channel number that MLI should use - ``num_ch`` - max number of channels that MLI can use The asynchronous move functions require a handle to a DMA resource. This handle can be obtained from the pool using ``mli_mov_acquire_handle``: .. code:: c mli_status mli_mov_acquire_handle(int num_ch, mli_mov_handle_t* h); .. - ``num_ch`` - Number of DMA channels required for this move. Certain complex transactions might be more efficient when multiple channels can be used. By default, a value of 1 should be used. - ``mli_mov_handle_t* h`` - Pointer to a handle type which is initialized by this function After the move has completed, the resources must be released back to the pool to avoid exhaustion: .. code:: c mli_status mli_mov_release_handle(mli_mov_handle_t* h); .. - ``mli_mov_handle_t* h`` - Pointer to a handle type which is used by the now-completed transaction Depending on the debug level (see section :ref:`err_codes`) this function performs a parameter check and returns the result as an ``mli_status`` code as described in section :ref:`kernl_sp_conf`. .. note:: The synchronous move function ``mli_mov_tensor_sync`` manages these DMA operations internally. ..