Using the API

This section illustrates using TIDL APIs to leverage deep learning in user applications. The overall flow is as follows:

  • Create a Configuration object to specify the set of parameters required for network execution.
  • Create Executor objects - one to manage overall execution on the EVEs, the other for C66x DSPs.
  • Use the Execution Objects (EO) created by the Executor to process frames. There are two approaches to processing frames using Execution Objects:
Table 1 TIDL API Use Cases

Use case: Each EO processes a single frame. The network consists of a single Layer Group and the entire Layer Group is processed on a single EO. See Each EO processes a single frame.
Application/network characteristics:
  • The latency to run the network on a single frame meets application requirements.
  • If frame processing time is fairly similar across EVE and C66x, multiple EOs can be used to increase throughput.
  • The latency to read an input frame can be hidden by using 2 EOPs with the same EO.
Examples: one_eo_per_frame, imagenet, segmentation

Use case: Split processing a single frame across multiple EOs using an ExecutionObjectPipeline. The network consists of 2 or more Layer Groups. See Frame split across EOs and Using EOPs for double buffering.
Application/network characteristics:
  • Network execution must be split across EVE and C66x to meet single frame latency requirements.
  • Splitting can lower the overall latency because memory bound layers tend to run faster on the C66x.
Examples: two_eo_per_frame, two_eo_per_frame_opt, ssd_multibox

Refer to the API Reference section for API documentation.

Use Cases

Each EO processes a single frame

In this approach, the network is set up as a single Layer Group. An EO runs the entire layer group on a single frame. To increase throughput, frame processing can be pipelined across available EOs. For example, on AM5749, frames can be processed by 4 EOs: one each on EVE1, EVE2, DSP1, and DSP2.

_images/tidl-one-eo-per-frame.png

Fig. 3 Processing a frame with one EO. Not to scale. Fn: Frame n, LG: Layer Group.

  1. Determine if there are any TIDL capable compute cores on the AM57x Processor:

        uint32_t num_eve = Executor::GetNumDevices(DeviceType::EVE);
        uint32_t num_dsp = Executor::GetNumDevices(DeviceType::DSP);
    
  2. Create a Configuration object by reading it from a file or by initializing it directly (a sketch of direct initialization appears after the listings below). The example below parses a configuration file and initializes the Configuration object. See examples/test/testvecs/config/infer for examples of configuration files.

        Configuration c;
        if (!c.ReadFromFile(config_file))
            return false;
    
  3. Create Executors on C66x and EVE. In this example, all available C66x and EVE cores are used (lines 2-3 of the listing below and CreateExecutor).

  4. Create a vector of available ExecutionObjects from both Executors (lines 7-8 and CollectEOs).

  5. Allocate input and output buffers for each ExecutionObject (AllocateMemory)

  6. Run the network on each input frame. The frames are processed with available execution objects in a pipelined manner. The additional num_eos iterations are required to flush the pipeline (lines 15-26).

    • Wait for the EO to finish processing. If the EO is not processing a frame (the first iteration on each EO), the call to ProcessFrameWait returns false. ReportTime is used to report host and device execution times.
    • Read a frame and start running the network. ProcessFrameStartAsync is asynchronous and returns before processing is complete. ReadFrame is application specific and used to read an input frame for processing. For example, with OpenCV, ReadFrame is implemented using OpenCV APIs to capture a frame from the camera.
     1         // Create Executors - use all the DSP and EVE cores available
     2         unique_ptr<Executor> e_dsp(CreateExecutor(DeviceType::DSP, num_dsp, c));
     3         unique_ptr<Executor> e_eve(CreateExecutor(DeviceType::EVE, num_eve, c));
     4
     5         // Accumulate all the EOs from across the Executors
     6         vector<ExecutionObject *> EOs;
     7         CollectEOs(e_eve.get(), EOs);
     8         CollectEOs(e_dsp.get(), EOs);
     9
    10         AllocateMemory(EOs);
    11
    12         // Process frames with EOs in a pipelined manner
    13         // additional num_eos iterations to flush the pipeline (epilogue)
    14         int num_eos = EOs.size();
    15         for (int frame_idx = 0; frame_idx < c.numFrames + num_eos; frame_idx++)
    16         {
    17             ExecutionObject* eo = EOs[frame_idx % num_eos];
    18
    19             // Wait for previous frame on the same eo to finish processing
    20             if (eo->ProcessFrameWait())
    21                 ReportTime(eo);
    22
    23             // Read a frame and start processing it with current eo
    24             if (ReadFrame(eo, frame_idx, c, input_data_file))
    25                 eo->ProcessFrameStartAsync();
    26         }

    Listing 1 CreateExecutor
    Executor* CreateExecutor(DeviceType dt, int num, const Configuration& c)
    {
        if (num == 0) return nullptr;
    
        DeviceIds ids;
        for (int i = 0; i < num; i++)
            ids.insert(static_cast<DeviceId>(i));
    
        return new Executor(dt, ids, c);
    }
    
    Listing 2 CollectEOs
    void CollectEOs(const Executor *e, vector<ExecutionObject *>& EOs)
    {
        if (!e) return;
    
        for (unsigned int i = 0; i < e->GetNumExecutionObjects(); i++)
            EOs.push_back((*e)[i]);
    }
    
    Listing 3 AllocateMemory
    void AllocateMemory(const vector<ExecutionObject *>& eos)
    {
        // Allocate input and output buffers for each execution object
        for (auto eo : eos)
        {
            size_t in_size  = eo->GetInputBufferSizeInBytes();
            size_t out_size = eo->GetOutputBufferSizeInBytes();
            void*  in_ptr   = malloc(in_size);
            void*  out_ptr  = malloc(out_size);
            assert(in_ptr != nullptr && out_ptr != nullptr);
    
            ArgInfo in  = { ArgInfo(in_ptr,  in_size)};
            ArgInfo out = { ArgInfo(out_ptr, out_size)};
            eo->SetInputOutputBuffer(in, out);
        }
    }
    

The complete example is available at /usr/share/ti/tidl/examples/one_eo_per_frame/main.cpp.
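
Step 2 above notes that a Configuration can also be initialized directly instead of being read from a file. The snippet below is a minimal, illustrative sketch of that approach, not part of the shipped example; the field names follow the TIDL API Configuration class, and the exact set of fields required depends on the network being deployed.

    // Illustrative sketch: initialize a Configuration in code instead of
    // calling ReadFromFile(). Values shown here are placeholders; adjust
    // them to match the deployed network.
    Configuration c;
    c.numFrames     = 1;
    c.inWidth       = 224;        // input frame width in pixels
    c.inHeight      = 224;        // input frame height in pixels
    c.inNumChannels = 3;          // input frame channels
    c.netBinFile    = "tidl_net_imagenet_jacintonet11v2.bin";
    c.paramsBinFile = "tidl_param_imagenet_jacintonet11v2.bin";  // assumed file name
    c.inData        = "input.bin";    // application specific input file
    c.outData       = "output.bin";   // application specific output file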

Note

The double buffering technique described in Using EOPs for double buffering can be used with a single ExecutionObject to overlap reading an input frame with the processing of the previous input frame. Refer to examples/imagenet/main.cpp.
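
A rough sketch of that pattern follows, assuming (as in the imagenet example) that an ExecutionObjectPipeline can be constructed from a single ExecutionObject; each EOP then owns its own input and output buffers, so reading the next frame overlaps with processing the current one:

    // Sketch: double buffering on a single EO. Both EOPs share the same
    // ExecutionObject, but each EOP has its own input/output buffers.
    ExecutionObject *eo = (*e_eve)[0];
    std::vector<ExecutionObjectPipeline *> eops;
    eops.push_back(new ExecutionObjectPipeline({eo}));  // buffer set A
    eops.push_back(new ExecutionObjectPipeline({eo}));  // buffer set B

The processing loop then cycles over these EOPs exactly as it cycles over EOs in the listing above.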

Frame split across EOs

This approach is typically used to reduce the latency of processing a single frame. Certain network layers, such as SoftMax and Pooling, run faster on the C66x than on EVE. Running these layers on C66x can lower the per-frame latency.

Time to process a single 224x224x3 frame on the AM574x IDK EVM (Arm @ 1GHz, C66x @ 0.75GHz, EVE @ 0.65GHz) with JacintoNet11 (tidl_net_imagenet_jacintonet11v2.bin), TIDL API v1.1:

EVE        C66x       EVE + C66x
~112ms     ~120ms     ~64ms [1]

[1] The BatchNorm and Convolution layers are placed in one Layer Group and run on EVE; the Pooling, InnerProduct and SoftMax layers are placed in a second Layer Group and run on C66x. The EVE layer group takes ~57.5ms and the C66x layer group takes ~6.5ms.

_images/tidl-frame-across-eos.png

Fig. 4 Processing a frame across EOs. Not to scale. Fn: Frame n, LG: Layer Group.

The network consists of 2 Layer Groups. Execution Objects are organized into Execution Object Pipelines (EOP). Each EOP processes a frame. The API manages inter-EO synchronization.

  1. Determine if there are any TIDL capable compute cores on the AM57x Processor:

        uint32_t num_eve = Executor::GetNumDevices(DeviceType::EVE);
        uint32_t num_dsp = Executor::GetNumDevices(DeviceType::DSP);
    
  2. Create a Configuration object by reading it from a file or by initializing it directly. The example below parses a configuration file and initializes the Configuration object. See examples/test/testvecs/config/infer for examples of configuration files.

        Configuration c;
        if (!c.ReadFromFile(config_file))
            return false;
    
  3. Update the default layer group index assignment. Pooling (layer 12), InnerProduct (layer 13) and SoftMax (layer 14) are added to a second layer group. Refer to Overriding layer group assignment for details.

        const int DSP_LG = 2;
        c.layerIndex2LayerGroupId = { {12, DSP_LG}, {13, DSP_LG}, {14, DSP_LG} };
    
  4. Create Executors on C66x and EVE. The EVE Executor runs layer group 1 and the C66x Executor runs layer group 2.

  5. Create two Execution Object Pipelines. Each EOP contains one EVE and one C66x Execution Object (a sketch of steps 4 and 5 appears after the listings below).

  6. Allocate input and output buffers for each ExecutionObject in the EOP. (AllocateMemory)

  7. Run the network on each input frame. The frames are processed with available EOPs in a pipelined manner. For ease of use, EOP and EO present the same interface to the user.

    • Wait for the EOP to finish processing. If the EOP is not processing a frame (the first iteration on each EOP), the call to ProcessFrameWait returns false. ReportTime is used to report host and device execution times.
    • Read a frame and start running the network. ProcessFrameStartAsync is asynchronous and returns before processing is complete. ReadFrame is application specific and used to read an input frame for processing. For example, with OpenCV, ReadFrame is implemented using OpenCV APIs to capture a frame from the camera.
            int num_eops = EOPs.size();
            for (int frame_idx = 0; frame_idx < c.numFrames + num_eops; frame_idx++)
            {
                EOP* eop = EOPs[frame_idx % num_eops];
    
                // Wait for previous frame on the same EOP to finish processing
                if (eop->ProcessFrameWait())
                {
                    ReportTime(eop);
                }

                // Read a frame and start processing it with the current EOP
                if (ReadFrame(eop, frame_idx, c, input))
                    eop->ProcessFrameStartAsync();
            }
    
    Listing 4 AllocateMemory
    void AllocateMemory(const vector<ExecutionObjectPipeline *>& eops)
    {
        // Allocate input and output buffers for each execution object
        for (auto eop : eops)
        {
            size_t in_size  = eop->GetInputBufferSizeInBytes();
            size_t out_size = eop->GetOutputBufferSizeInBytes();
            void*  in_ptr   = malloc(in_size);
            void*  out_ptr  = malloc(out_size);
            assert(in_ptr != nullptr && out_ptr != nullptr);
    
            ArgInfo in  = { ArgInfo(in_ptr,  in_size)};
            ArgInfo out = { ArgInfo(out_ptr, out_size)};
            eop->SetInputOutputBuffer(in, out);
        }
    }
    

The complete example is available at /usr/share/ti/tidl/examples/two_eo_per_frame/main.cpp. Another example of using the EOP is SSD.
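
The Executor and EOP creation in steps 4 and 5 is shown in the complete example; the snippet below is only a sketch of those two steps. It assumes two EVE and two C66x cores are available (as on AM5749), that the Executor constructor accepts a layer group id as its final argument, and that EOP in the listings is shorthand for ExecutionObjectPipeline.

    // Sketch of steps 4 and 5 (assumed constructor arguments shown).
    // EVE runs layer group 1, C66x runs layer group 2 (DSP_LG from step 3).
    DeviceIds ids = { static_cast<DeviceId>(0), static_cast<DeviceId>(1) };
    Executor eve(DeviceType::EVE, ids, c, 1);
    Executor dsp(DeviceType::DSP, ids, c, DSP_LG);

    // Each EOP pairs one EVE EO (layer group 1) with one C66x EO (layer group 2)
    std::vector<ExecutionObjectPipeline *> EOPs;
    EOPs.push_back(new ExecutionObjectPipeline({eve[0], dsp[0]}));
    EOPs.push_back(new ExecutionObjectPipeline({eve[1], dsp[1]}));

The EOPs created this way are used exactly like EOs in the processing loop of step 7, since EOP and EO present the same interface.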

Using EOPs for double buffering

The timeline shown in Fig. 4 indicates that EO-EVE1 waits for processing on EO-DSP1 to complete before it starts processing its next frame. The example can be optimized further by overlapping the processing of Fn-2 on EO-DSP1 with Fn on EO-EVE1. This is illustrated in Fig. 5.

_images/tidl-frame-across-eos-opt.png

Fig. 5 Optimizing using double buffered EOPs. Not to scale. Fn: Frame n, LG: Layer Group.

EOP1 and EOP2 use the same EOs: EO-EVE1 and EO-DSP1. Each EOP has its own input and output buffers. This enables EOP2 to read an input frame while EOP1 is processing its input frame, which in turn enables EOP2 to start processing on EO-EVE1 as soon as EOP1 completes processing on EO-EVE1.

The only change in the code compared to Frame split across EOs is to create an additional set of EOPs for double buffering:

Listing 5 Setting up EOPs for double buffering
        // On AM5749, create a total of 4 pipelines (EOPs):
        // EOPs[0] : { EVE1, DSP1 }
        // EOPs[1] : { EVE1, DSP1 } for double buffering
        // EOPs[2] : { EVE2, DSP2 }
        // EOPs[3] : { EVE2, DSP2 } for double buffering

        const uint32_t pipeline_depth = 2;  // 2 EOs in EOP => depth 2
        std::vector<EOP *> EOPs;
        uint32_t num_pipe = std::max(num_eve, num_dsp);
        for (uint32_t i = 0; i < num_pipe; i++)
            for (uint32_t j = 0; j < pipeline_depth; j++)
                EOPs.push_back(new EOP( { (*eve)[i % num_eve],
                                          (*dsp)[i % num_dsp] } ));

Note

EOP1, EOP2, EOP3 and EOP4 in Fig. 5 correspond to EOPs[0], EOPs[1], EOPs[2] and EOPs[3] in Listing 5, respectively.

The complete example is available at /usr/share/ti/tidl/examples/two_eo_per_frame_opt/main.cpp.

Sizing device side heaps

The TIDL API allocates 2 heaps for device-side allocations during network setup/initialization:

  • Parameter heap: Configuration::PARAM_HEAP_SIZE, default size 9MB, 1 per Executor
  • Network heap: Configuration::NETWORK_HEAP_SIZE, default size 64MB, 1 per ExecutionObject

Depending on the network being deployed, these defaults may be smaller or larger than required. To determine the exact sizes required for the heaps, use the following approach:

Start with the default heap sizes. The API displays heap usage statistics when Configuration::showHeapStats is set to true.

Configuration configuration;
bool status = configuration.ReadFromFile(config_file);
configuration.showHeapStats = true;

If the heaps are larger than required by device-side allocations, the API displays usage statistics; Free > 0 indicates that a heap is larger than required.

# ./test_tidl -n 1 -t e -c testvecs/config/infer/tidl_config_j11_v2.txt
API Version: 01.01.00.00.e4e45c8
[eve 0]         TIDL Device Trace: PARAM heap: Size 9437184, Free 6556180, Total requested 2881004
[eve 0]         TIDL Device Trace: NETWORK heap: Size 67108864, Free 47047680, Total requested 20061184

Update the application to set the heap sizes to the “Total requested size” displayed:

configuration.PARAM_HEAP_SIZE   = 2881004;
configuration.NETWORK_HEAP_SIZE = 20061184;
# ./test_tidl -n 1 -t e -c testvecs/config/infer/tidl_config_j11_v2.txt
API Version: 01.01.00.00.e4e45c8
[eve 0]         TIDL Device Trace: PARAM heap: Size 2881004, Free 0, Total requested 2881004
[eve 0]         TIDL Device Trace: NETWORK heap: Size 20061184, Free 0, Total requested 20061184

Now, the heaps are sized as required by network execution (i.e. Free is 0) and the configuration.showHeapStats = true line can be removed.

Note

If the default heap sizes are smaller than required, the device reports an allocation failure and indicates the required minimum size. For example:

# ./test_tidl -n 1 -t e -c testvecs/config/infer/tidl_config_j11_v2.txt
API Version: 01.01.00.00.0ba86d4
[eve 0]         TIDL Device Error:  Allocation failure with NETWORK heap, request size 161472, avail 102512
[eve 0]         TIDL Device Error: Network heap must be >= 20061184 bytes, 19960944 not sufficient. Update Configuration::NETWORK_HEAP_SIZE
TIDL Error: [src/execution_object.cpp, Wait, 548]: Allocation failed on device

Note

The memory for parameter and network heaps is itself allocated from OpenCL global memory (CMEM). Refer Insufficient OpenCL global memory for details.

In addition, the following environment variables can be used to override the heap sizes and the heap allocation optimization level (1 or 2) specified by default or by the application.

TIDL_PARAM_HEAP_SIZE_EVE
TIDL_PARAM_HEAP_SIZE_DSP
TIDL_NETWORK_HEAP_SIZE_EVE
TIDL_NETWORK_HEAP_SIZE_DSP
TIDL_EXTMEM_ALLOC_OPT_EVE
TIDL_EXTMEM_ALLOC_OPT_DSP
# # for example,
# TIDL_PARAM_HEAP_SIZE_EVE=3000000 TIDL_NETWORK_HEAP_SIZE_EVE=21000000 TIDL_PARAM_HEAP_SIZE_DSP=3000000 TIDL_NETWORK_HEAP_SIZE_DSP=2000000 ./tidl_classification -g 2 -d 1 -e 4 -l ./imagenet.txt -s ./classlist.txt -i ./clips/test10.mp4 -c ./stream_config_j11_v2.txt

Accessing outputs of network layers

TIDL API v1.1 and higher provides the following APIs to access the output buffers associated with network layers:

  • ExecutionObject::WriteLayerOutputsToFile - write outputs from each layer into individual files. Files are named <filename_prefix>_<layer_index>.bin.
  • ExecutionObject::GetOutputsFromAllLayers - Get output buffers from all layers.
  • ExecutionObject::GetOutputFromLayer - Get a single output buffer from a layer.

See examples/layer_output/main.cpp, ProcessTrace() for examples of using these tracing APIs.
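
For illustration only (this is not the exact ProcessTrace() code), per-layer outputs could be written out once a frame finishes processing; the file name prefix below is arbitrary, and output tracing may need to be enabled in the Configuration (see the API Reference):

    // Sketch: after the frame completes, dump each layer's output.
    // Files are written as <prefix>_<layer_index>.bin, per the description above.
    if (eo->ProcessFrameWait())
    {
        ReportTime(eo);
        eo->WriteLayerOutputsToFile("trace/frame_");  // arbitrary, hypothetical prefix
    }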

Note

The ExecutionObject::GetOutputsFromAllLayers method can be memory intensive if the network has a large number of layers. This method allocates sufficient host memory to hold all output buffers from all layers.