defensive distillation

Defensive distillation is an adversarial training technique used in machine learning, specifically in deep learning. The technique protects neural networks from adversarial attacks by making a model’s classification process more flexible, so that small manipulations of the input are less likely to change its output and the model is less vulnerable to exploitation.

In distillation training, a student neural network is trained to predict the output probabilities of a teacher neural network rather than hard class labels. The aim is to improve the generalization and robustness of the student model while maintaining its performance.
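As a rough illustration of what predicting the teacher's output probabilities means, the sketch below uses a temperature-scaled softmax, a common way to produce soft probabilities in distillation. The temperature value, the example logits, and the function names are assumptions made for this example and do not come from the sources.

```python
import numpy as np

def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities; a higher temperature gives a softer distribution."""
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled = scaled - scaled.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature):
    """Cross-entropy between the teacher's soft labels and the student's predictions."""
    soft_targets = softmax_with_temperature(teacher_logits, temperature)
    student_probs = softmax_with_temperature(student_logits, temperature)
    per_example = -(soft_targets * np.log(student_probs + 1e-12)).sum(axis=-1)
    return float(per_example.mean())

# Hypothetical teacher logits for a single input with three classes.
teacher_logits = np.array([[8.0, 2.0, 1.0]])

hard = softmax_with_temperature(teacher_logits, temperature=1.0)
soft = softmax_with_temperature(teacher_logits, temperature=20.0)  # assumed temperature

# A student that matches the teacher incurs a lower loss than one that disagrees.
loss_match = distillation_loss(teacher_logits, teacher_logits, temperature=20.0)
loss_mismatch = distillation_loss(np.array([[1.0, 2.0, 8.0]]), teacher_logits, temperature=20.0)
```

Here `soft` spreads probability across all three classes, whereas `hard` puts nearly all of it on one class. Training the student against the softer targets is what passes on the teacher's knowledge about how classes relate to one another.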

The main advantage of the distillation approach is that it adapts to previously unknown threats. Adversarial learning, the other highly effective adversarial training method, requires continuously feeding the signatures of all known vulnerabilities and attacks into the system. Distillation, by comparison, is more dynamic and requires less human intervention.

The main drawback is that, while the student model has more leeway to reject input manipulation, it is still bound by the general rules of the teacher model. With enough computing power and fine-tuning on the attacker’s part, both models can be reverse-engineered to discover fundamental exploits. Distillation models are also vulnerable to so-called poisoning attacks, in which a threat actor corrupts the initial training data.

In conclusion, defensive distillation is a promising approach to improving the robustness of neural networks against adversarial attacks. By exploiting the knowledge contained in the soft probabilities of a teacher network, a student network can learn to be more resilient to malicious inputs. However, it is important to recognize its limitations and to keep developing it and combining it with other defenses, such as adversarial learning, to strengthen the security of machine learning systems.

Sources:

https://deepai.org/machine-learning-glossary-and-terms/defensive-distillation

https://www.activeloop.ai/resources/glossary/defensive-distillation/