LAD: Layer-Wise Adaptive Distillation for BERT Model Compression
Recent advances with large-scale pre-trained language models (e.g., BERT) have brought significant potential to natural language processing. However, the large model size hinders their use in IoT and edge devices. Several studies have utilized task-specific knowledge distillation to compress pre-trained language models. However,