Performance Considerations

Realtime or Better

Machine learning models are massive computation graphs that push computing hardware to the limit. NatML is specifically designed for interactive applications, and as such has many considerations for maximizing performance.

NatML Hub shows realtime performance data for every model on the platform. Use this as a guide when choosing models.

Model Considerations

The single most important factor in runtime performance is the choice of model. Different model architectures can have very different performance characteristics. Below are some general guidelines when designing or choosing models:

Supported Operators

When NatML runs inference on a model, it will always try to run the model computation graph on a dedicated machine learning accelerator (like CoreML on Apple platforms and NNAPI on Android). If the model graph contains any operators that are unsupported by the accelerator, the computation must fall back to the CPU.

You can check what layers are supported by ML accelerators by looking at their Layer Coverage.

This process incurs a sizable penalty, both because of the memory transfers involved and because CPU execution is typically much slower than dedicated ML hardware. As a result, it is highly recommended to use models which can be executed entirely by the ML accelerator.

Dynamic Features

When working on vision tasks, some models can accept dynamically-sized inputs (i.e. images of arbitrary sizes). This is in contrast with models that accept statically-sized inputs, which require resizing the input image to a fixed size before prediction.

Using models with dynamic inputs is strongly discouraged, because doing so usually translates into orders of magnitude more computation.

Currently, models with dynamic inputs will always use the CPU instead of NNAPI on Android.
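When your model takes a statically-sized input, you can downscale incoming images on the GPU before creating the input feature. Below is a minimal sketch using Unity's Graphics.Blit; the 224x224 input size is an assumption for illustration, so substitute your model's actual input dimensions:

```csharp
// Resize a texture to a model's static input size before prediction.
Texture2D Downscale (Texture2D source, int width, int height) {
    // Blit through a temporary RenderTexture to resize on the GPU
    var renderTexture = RenderTexture.GetTemporary(width, height);
    Graphics.Blit(source, renderTexture);
    // Read the resized pixels back into a Texture2D
    var previous = RenderTexture.active;
    RenderTexture.active = renderTexture;
    var result = new Texture2D(width, height, TextureFormat.RGBA32, false);
    result.ReadPixels(new Rect(0, 0, width, height), 0, 0);
    result.Apply();
    RenderTexture.active = previous;
    RenderTexture.ReleaseTemporary(renderTexture);
    return result;
}

// Usage: resize once, then create the input feature
var resized = Downscale(cameraTexture, 224, 224);
var feature = new MLImageFeature(resized);
```

Note that the readback performed by ReadPixels stalls the GPU, so for streaming applications you may prefer to keep the resize and feature creation off the hot path where possible.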

Model Quantization

NatML has preliminary support for running quantized models. Quantized models store the coefficients of a model's computation graph as either half-precision floats (fp16) or integers. Though model quantization is not a silver bullet for performance, it typically results in better performance on certain accelerators.

Hub Considerations

The second layer of optimization lies with NatML Hub. It is highly recommended to upload your models to Hub, even if you have no intention of distributing them. This is because Hub is able to optimize performance for every model, and for every specific device which fetches the model.

When MLModelData is fetched from Hub, Hub makes several choices that can strongly influence the runtime performance of the model. Some of these choices include:

  • Accelerators to run a model on, like the CPU, SIMD accelerators, or neural processors.

  • Quantization schemes, like half-float (fp16) or integer quantization.

  • On-device graph optimization, where some accelerators might benefit more than others.

  • Graph execution schemes, like distributing execution between accelerators.

Because Hub collects performance analytics data, it is able to analyze the performance of a given model on a given device, for all of the choices it makes. As such, it can perform a purely data-driven model optimization specific to each device that requests model data.

Hub's optimization is purely data-driven, and factors in the target device in addition to the model.
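Fetching model data from Hub follows the standard NatML pattern below. The model tag is a placeholder, so substitute your own model's tag:

```csharp
// Fetch optimized model data from NatML Hub.
var modelData = await MLModelData.FromHub("@author/some-model");
// Deserialize the model. By this point, Hub has already chosen
// accelerators, quantization, and graph optimizations for this device.
var model = modelData.Deserialize();
```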

Predictor Considerations

The third layer of optimization lies within the predictor. The main concern around predictors is memory allocations and memory copies. A predictor is responsible for converting input features into data that can be ingested by the model, and for converting prediction feature data into an output type that can easily be used by developers.

MLFeature implementations already provide highly-optimized routines for converting input features into prediction-ready data. If your predictor cannot benefit from these routines, then you should attempt to write highly-parallelized conversion code. For this, we highly recommend using Unity's Burst compiler or other SIMD routines.
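As an illustrative sketch of a Burst-compiled conversion, the job below converts interleaved RGBA pixel bytes into normalized floats, as a model might expect. Your predictor's actual conversion will differ:

```csharp
using Unity.Burst;
using Unity.Collections;
using Unity.Jobs;

// A Burst-compiled job that normalizes pixel bytes in parallel.
[BurstCompile]
struct NormalizeJob : IJobParallelFor {

    [ReadOnly] public NativeArray<byte> pixels;     // RGBA32 pixel data
    [WriteOnly] public NativeArray<float> result;   // normalized floats

    public void Execute (int index) {
        // Scale each byte from [0, 255] to [0, 1]
        result[index] = pixels[index] / 255f;
    }
}

// Schedule the job across worker threads, then wait for completion
var job = new NormalizeJob { pixels = pixels, result = result };
job.Schedule(pixels.Length, 64).Complete();
```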

We do not recommend using compute shaders for conversions in predictors because the GPU readback can take much more time than the actual conversion.

For converting prediction feature data into a usable output type, we recommend working directly on the native feature data instead of copying the data into a managed array.
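For example, a classification predictor can compute the argmax of its output logits directly on the native buffer, avoiding a managed-array copy. In the hypothetical helper below, `logits` is assumed to be a pointer to the feature's native float buffer and `length` its element count; how you obtain these depends on your feature type:

```csharp
// Compute the argmax of a classification output directly on the
// native feature data, without copying it into a managed array.
unsafe (int label, float score) ArgMax (float* logits, int length) {
    var maxIndex = 0;
    var maxScore = float.MinValue;
    for (var i = 0; i < length; ++i)
        if (logits[i] > maxScore) {
            maxScore = logits[i];
            maxIndex = i;
        }
    return (maxIndex, maxScore);
}
```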

When working with predictors, you should minimize the number of memory copies you perform.

Feature Considerations

Features are the final point of optimization. When you instantiate an MLFeature, whether by using an implicit conversion or by using a specific feature constructor, no computation is expected to happen. As such, features are very lightweight objects that can be used liberally.

An MLFeature will usually only perform any computation when it is used to make a prediction.

The only thing to keep in mind is to reuse features wherever possible. This can provide a slight performance gain because such features can reuse any allocations they might have made previously, as opposed to recreating them from scratch.

Asynchronous Predictions

Once all the above optimizations have been made, your model might still not run in realtime. This doesn't mean it can't be used; in fact, many models run slower than realtime in interactive applications. In this situation, it becomes beneficial to run predictions asynchronously.

NatML provides the MLAsyncPredictor which is a wrapper around any existing predictor for this purpose:

// Create a predictor
var predictor = new MLClassificationPredictor(...);
// Then make it async!
var asyncPredictor = predictor.ToAsync();

The async predictor spins up a dedicated worker thread for making predictions, completely freeing up your app to perform other processing:

// Before, we used to make predictions on the main thread:
var (label, confidence) = predictor.Predict(...);
// Now, we can make predictions on a dedicated worker thread:
var (label, confidence) = await asyncPredictor.Predict(...);

When making predictions in streaming applications (like in camera apps), you can check if the async predictor is ready to make more predictions, so as not to backup the processing queue:

// If the predictor is ready, queue more work
if (asyncPredictor.readyForPrediction) {
    var output = await asyncPredictor.Predict(...);
}

Finally, you must Dispose the predictor when you are done with it, so as not to leave threads and other resources dangling.
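A common pattern is to dispose the async predictor when the owning component is destroyed. The sketch below assumes a hypothetical `Classifier` component that holds the predictor:

```csharp
public class Classifier : MonoBehaviour {

    MLAsyncPredictor<(string label, float confidence)> asyncPredictor;

    void OnDestroy () {
        // Terminate the worker thread and release native resources
        asyncPredictor?.Dispose();
    }
}
```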

Do not use predictors from multiple threads. Once you create an MLAsyncPredictor from an inner predictor, do not use the inner predictor directly.