Difference between Training and Inference in Deep Learning Frameworks
Training and inference follow a similar workflow in a deep learning framework. During training, a known data set is fed through an untrained neural network. The framework compares the network's output with the known results for that data set, computes the error, and updates the weights in the layers of the neural network according to how correct or incorrect the output was. This re-evaluation is essential to training, as it adjusts the neural network to improve its performance on the task it is learning.
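To make that loop concrete, here is a minimal training sketch in PyTorch, assuming a hypothetical two-layer network and randomly generated "known" data purely for illustration:

```python
import torch
import torch.nn as nn

# Untrained network: the weights start at random values.
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# A "known" data set: inputs together with their correct labels.
inputs = torch.randn(64, 10)
labels = torch.randint(0, 2, (64,))

for epoch in range(5):
    optimizer.zero_grad()
    outputs = model(inputs)          # run the known data through the network
    loss = loss_fn(outputs, labels)  # compare results against the known answers
    loss.backward()                  # compute how wrong each weight was
    optimizer.step()                 # update the weights in every layer
    print(f"epoch {epoch}: loss {loss.item():.4f}")
```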


Unlike training, inference does not re-evaluate or adjust the layers of the neural network based on the results. Inference applies the knowledge captured in a trained neural network model and uses it to infer a result. When a new, unseen data set is fed through the trained network, it outputs a prediction whose quality depends on the predictive accuracy of that network. Inference comes after training, since it requires a trained neural network model.
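Inference then reduces to a single forward pass with the weights frozen. A minimal sketch, reusing the trained model from the training example above:

```python
import torch

model.eval()                     # no further weight updates
new_data = torch.randn(1, 10)    # an unknown input the network has never seen
with torch.no_grad():            # no gradients: the layers are not re-evaluated
    prediction = model(new_data).argmax(dim=1)
print(f"predicted class: {prediction.item()}")
```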

Important aspects of Inference
While a deep learning training system can be used to run inference, the important aspects of inference make it a poor fit. Deep learning training systems are optimized to process large amounts of data and repeatedly re-evaluate the neural network, which demands high-performance compute, and therefore more energy and more cost. Inference, by contrast, may involve smaller data sets but must be hyperscaled out to many devices.

Optimizing with TensorRT
TensorRT is NVIDIA's deep learning inference platform. Built on CUDA, it works with NVIDIA GPUs to deliver highly efficient deep learning inference performance. TensorRT uses FP32 arithmetic for inference by default to obtain the highest possible accuracy, but FP16 and INT8 precisions can be used instead with minimal impact on accuracy in many cases. Mixed FP32 and FP16 computation can also be used in TensorRT to further improve performance.
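As a rough illustration of why reduced precision usually costs so little accuracy, the following NumPy sketch casts FP32 values to FP16 and measures the rounding error introduced; the exact figures will vary from run to run:

```python
import numpy as np

# One million FP32 values standing in for network weights or activations.
fp32 = np.random.randn(1_000_000).astype(np.float32)

# Cast to FP16 and back, then measure how much each value changed.
fp16 = fp32.astype(np.float16).astype(np.float32)
rel_err = np.abs(fp32 - fp16) / (np.abs(fp32) + 1e-12)

print(f"mean relative error from FP16 cast: {rel_err.mean():.2e}")
print(f"max  relative error from FP16 cast: {rel_err.max():.2e}")
```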

Trained models from any deep learning framework can be imported into TensorRT and optimized with platform-specific kernels to maximize performance on Tesla GPUs in the data center and on the Jetson embedded platform.
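A rough sketch of that import path using the TensorRT Python API is shown below. It assumes a model already exported to a hypothetical model.onnx file, and the exact calls may differ between TensorRT versions (this follows the 8.x API):

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)

# Parse a model exported from any framework to ONNX ("model.onnx" is hypothetical).
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)
with open("model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        raise RuntimeError("failed to parse the ONNX model")

# Allow FP16 kernels where they help, giving mixed FP32/FP16 execution.
config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)

# Build the optimized inference engine and save it for deployment.
engine = builder.build_serialized_network(network, config)
with open("model.plan", "wb") as f:
    f.write(engine)
```
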
Inference at the Data Center

While many GPUs can perform some level of inference, NVIDIA introduced the T4 GPU specifically optimized for inference inside data centers. Responsiveness is key to user engagement for services such as conversational AI, recommender systems, and visual search. The T4 delivers breakthrough performance in FP32, FP16, INT8, and INT4 precisions, and this low-profile, single-slot GPU draws an energy-efficient 70 W without the need for additional power cables.

Inference on the Edge
With the low-power Jetson GPU module, latency is greatly reduced because inference runs in real time on the device itself. This is vital when connectivity is not possible, as with remote devices, or when the latency of sending information to and from a data center is too long. Because data does not need to be sent to the cloud, an edge device also offers better privacy and security.
