Machine learning models are massive computation graphs that push computing hardware to the limit. NatML is specifically designed for interactive applications, and as such has many considerations for maximizing performance.
The single most important factor in runtime performance is the choice of model. Different model architectures can have vastly different performance characteristics. Below are some general guidelines for designing or choosing models:
When NatML runs inference on a model, it will always try to run the model computation graph on a dedicated machine learning accelerator (like CoreML on Apple and NNAPI on Android). If the model graph has any operators which are unsupported by the accelerator, it must then move the computation back to the CPU.
This process incurs a sizable penalty, both because of the memory transfers involved and because CPU execution is typically much slower than dedicated ML hardware. As a result, we highly recommend using models that can be executed entirely by the ML accelerator.
When working on vision tasks, some models can accept dynamically-sized inputs (i.e. images of arbitrary sizes). This is in contrast with models that accept statically-sized inputs, which require resizing the input image to a fixed size before prediction.
Using models with dynamic inputs is strongly discouraged, because doing so usually translates into orders of magnitude more computation.
NatML has preliminary support for running quantized models. Quantized models store the coefficients of a model's computation graph as either half-precision floats (`fp16`) or integers. Though model quantization is not a silver bullet for performance, it typically results in better performance on certain accelerators.
The second layer of optimization lies with NatML Hub. We highly recommend uploading your models to Hub, even if you have no intention of distributing them. This is because Hub can optimize each model for the specific device that fetches it.
When `MLModelData` is fetched from Hub, Hub has several choices to make that can greatly influence the runtime performance of the model. Some of these choices include:
- Which accelerators to run a model on, like the CPU, SIMD accelerators, or neural processors.
- Quantization schemes, like half-float (`fp16`) or integer quantization.
- On-device graph optimizations, where some accelerators might benefit more than others.
- Graph execution schemes, like distributing execution between accelerators.
Because Hub collects performance analytics data, it is able to analyze the performance of a given model on a given device, for all of the choices it makes. As such, it can perform a purely data-driven model optimization specific to each device that requests model data.
The third layer of optimization lies within the predictor. The main concerns with predictors are memory allocations and memory copies. A predictor is responsible for converting input features into data that can be ingested by the model, and for converting prediction feature data into an output type that developers can easily use.
`MLFeature` implementations already provide highly-optimized routines for converting input features into prediction-ready data. If your predictor cannot benefit from these routines, then you should attempt to write highly-parallelized conversion code. For this, we highly recommend using Unity's Burst compiler or other SIMD routines.
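As a sketch, a parallelized conversion might look like the following Burst-compiled job, which normalizes raw pixel bytes into floating-point tensor data. The buffer layout and the `[0, 255]` to `[-1, 1]` normalization range here are assumptions for illustration, not requirements of any particular model:

```csharp
using Unity.Burst;
using Unity.Collections;
using Unity.Jobs;

// Illustrative sketch: normalize raw pixel bytes into float tensor data in parallel.
[BurstCompile]
struct NormalizeJob : IJobParallelFor {

    [ReadOnly] public NativeArray<byte> pixelBuffer;    // Source pixel data
    [WriteOnly] public NativeArray<float> tensorBuffer; // Destination tensor data

    public void Execute (int idx) {
        // Scale each byte from [0, 255] to [-1, 1]
        tensorBuffer[idx] = 2f * pixelBuffer[idx] / 255f - 1f;
    }
}
```

The job can then be scheduled across Unity's worker threads with `new NormalizeJob { ... }.Schedule(pixelBuffer.Length, 64).Complete()`, letting Burst vectorize the inner loop.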
For converting prediction feature data into a usable output type, we recommend working directly on the native feature data instead of copying the data into a managed array.
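As a sketch of this idea, a classification argmax can be computed directly over a `NativeArray<float>` view of the prediction logits, avoiding a managed-array copy. The `logits` buffer here is a stand-in for the native feature data exposed by a prediction output, not a specific NatML API:

```csharp
using Unity.Collections;

// Compute the argmax of prediction logits directly on native memory,
// instead of copying the data into a managed float[] first.
static (int index, float score) ArgMax (NativeArray<float> logits) {
    var bestIndex = 0;
    var bestScore = float.MinValue;
    for (var i = 0; i < logits.Length; i++)
        if (logits[i] > bestScore) {
            bestScore = logits[i];
            bestIndex = i;
        }
    return (bestIndex, bestScore);
}
```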
Features are the final point of optimization. When you instantiate an `MLFeature`, whether by using an implicit conversion or a specific feature constructor, no computation is expected to happen. As such, features are very lightweight objects that can be used liberally.
The only thing to keep in mind is to reuse features wherever possible. This can provide a slight performance gain, because a reused feature can reuse any allocations it made previously instead of recreating them from scratch.
Once all of the above optimizations have been made, your model might still not run in realtime. This doesn't mean it can't be used; in fact, many models run slower than realtime in interactive applications. In this situation, it becomes beneficial to run predictions asynchronously.
NatML provides the `MLAsyncPredictor`, which wraps any existing predictor for this purpose:
```csharp
// Create a predictor
var predictor = new MLClassificationPredictor(...);
// Then make it async!
var asyncPredictor = predictor.ToAsync();
```
The async predictor spins up a dedicated worker thread for making predictions, completely freeing up your app to perform other processing:
```csharp
// Before, we used to make predictions on the main thread:
var (label, confidence) = predictor.Predict(...);
// Now, we can make predictions on a dedicated worker thread:
var (label, confidence) = await asyncPredictor.Predict(...);
```
When making predictions in streaming applications (like camera apps), you can check whether the async predictor is ready to make more predictions, so as not to back up its processing queue:
```csharp
// If the predictor is ready, queue more work
if (asyncPredictor.readyForPrediction) {
    var output = await asyncPredictor.Predict(...);
}
```
Finally, you must `Dispose` the predictor when you are done with it, so as not to leave threads and other resources dangling.
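For example, in a Unity component you might create the async predictor once and dispose it during teardown. The component name and the predictor construction here are illustrative:

```csharp
using UnityEngine;

// Illustrative component: create the async predictor once, dispose it on teardown.
public class Classifier : MonoBehaviour {

    MLAsyncPredictor<(string label, float confidence)> predictor;

    void Start () {
        // Create the predictor, then wrap it for async prediction
        predictor = new MLClassificationPredictor(...).ToAsync();
    }

    void OnDisable () {
        // Release the worker thread and any other resources
        predictor?.Dispose();
    }
}
```

Tying `Dispose` to `OnDisable` (or `OnDestroy`) ensures the worker thread is torn down with the component that owns it.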