ALDEN: Dual-Level Disentanglement with Meta-learning for Generalizable Audio Deepfake Detection

Yuxiong Xu, Bin Li^*, Weixiang Li, Sara Mandelli, Viola Negroni, Sheng Li

Shenzhen University, Politecnico di Milano, Afirstsoft Technology Group Co.
ACM MM 2025
^*Corresponding author

An illustration of the ALDEN framework, structured along two key axes: low-level signal disentanglement (vertical) and high-level semantic disentanglement (horizontal). ALDEN incorporates dual-level disentangled learning (scissors) and meta-learning (recycling) to improve generalization across different vocoders. By focusing on vocoder-agnostic features and synthetic-relevant cues, ALDEN enhances the model's generalization ability while minimizing sensitivity to irrelevant variations.

Framework

Overall framework of the proposed ALDEN. ALDEN consists of three key components: (a) an adversarial-training-based disentanglement learning (ADL) module, which employs a multi-task learning strategy to disentangle vocoder-specific features f^d from vocoder-agnostic features f^a; (b) a reconstruction-based disentanglement learning (RDL) module, which uses audio reconstruction to disentangle f^a, content features f^c, and speaker features f^s; and (c) a vocoder-agnostic meta-learning (VAML) module, which mitigates overfitting to specific vocoders and facilitates the effective updating of the vocoder-agnostic encoder E_a and the forgery classifier C_a.

Algorithm

Algorithm 1: The Proposed ALDEN Framework

Cross-vocoder and In-the-wild Scenarios

Table 2: EER (%) comparison of ADD methods under cross-vocoder scenarios.

Figure 3: EER (%) comparison of ADD methods under the in-the-wild scenario.
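The results on this page are reported as equal error rate (EER), the operating point at which the false acceptance rate equals the false rejection rate. As a reading aid, the following is a minimal sketch of how EER can be computed from detector scores. It is not the authors' evaluation script: the function name `compute_eer`, the convention that higher scores mean "more bona fide", and the toy scores are all assumptions made for illustration.

```python
def compute_eer(bonafide_scores, spoof_scores):
    """Return (eer, threshold) for a score-based detector.

    Assumed convention (hypothetical): a higher score means the
    utterance looks more bona fide, and a sample is accepted when
    its score is >= threshold.
    """
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best = None
    for t in thresholds:
        # False rejection rate: bona fide samples scored below t.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        # False acceptance rate: spoofed samples scored at or above t.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        # Keep the threshold where the two error rates are closest.
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]


# Toy scores, invented purely for illustration.
bonafide = [0.9, 0.8, 0.7, 0.3]
spoof = [0.1, 0.2, 0.4, 0.6]
eer, threshold = compute_eer(bonafide, spoof)
print(f"EER = {100 * eer:.1f}% at threshold {threshold}")
```

On a finite score set the two error rates rarely cross exactly, so this sketch averages them at the closest threshold; evaluation toolkits often interpolate along the ROC curve instead, which can give slightly different values.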

Detailed Results on Different Datasets

Table 3: EER (%) comparison under cross-vocoder evaluation on the 21DF dataset.

Table 4: EER (%) comparison under cross-vocoder evaluation on the WF dataset.

Table 5: EER (%) comparison under cross-vocoder evaluation on the LSV dataset.

Table 6: EER (%) comparison under cross-vocoder evaluation on the CVF dataset.

Table 7: Ablation results (EER (%)) for each component of the ALDEN framework.

Figure 4: Ablation results (EER (%)) across various classification loss functions. CE denotes cross-entropy loss, and LN denotes LogitNorm loss.
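For readers unfamiliar with the two losses compared in this ablation, the sketch below contrasts plain cross-entropy with LogitNorm in its commonly published form, where the logit vector is divided by its L2 norm and a temperature before cross-entropy is applied. This is an illustrative sketch, not the ALDEN training code: the function names, the default `tau=1.0`, and the example logits are assumptions, and the temperature actually used in the paper is not stated on this page.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy of one sample, using a stable log-sum-exp."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]


def logitnorm_loss(logits, target, tau=1.0, eps=1e-7):
    """LogitNorm: cross-entropy on L2-normalized, temperature-scaled logits.

    tau is a placeholder temperature (assumption); eps avoids a
    division by zero for an all-zero logit vector.
    """
    norm = math.sqrt(sum(z * z for z in logits)) + eps
    scaled = [z / (norm * tau) for z in logits]
    return cross_entropy(scaled, target)


# Two logit vectors pointing in the same direction with different magnitudes.
small, large = [2.0, 1.0], [20.0, 10.0]
print(cross_entropy(small, 0), cross_entropy(large, 0))    # CE shrinks as magnitude grows
print(logitnorm_loss(small, 0), logitnorm_loss(large, 0))  # LogitNorm is (nearly) unchanged
```

The point of the normalization is that the loss can no longer be reduced simply by inflating logit magnitude, which is the usual motivation given for LogitNorm as a remedy for overconfident predictions.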

Figure 5: Ablation results (EER (%)) across various hyperparameter settings.

Table 8: Model complexity and throughput (samples per second) of different ADD methods.
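Throughput in samples per second is typically obtained by timing repeated forward passes over a fixed batch. The following is a minimal sketch of such a measurement under stated assumptions: `measure_throughput`, `model_fn`, the warm-up and run counts, and the dummy model are all hypothetical, and the page does not describe the authors' actual benchmarking protocol.

```python
import time


def measure_throughput(model_fn, batch, n_warmup=2, n_runs=10):
    """Return samples processed per second by model_fn on batch.

    model_fn is any callable taking a batch (hypothetical stand-in
    for a detector's forward pass). Warm-up runs are excluded from
    the timing so one-off setup costs do not distort the result.
    """
    for _ in range(n_warmup):
        model_fn(batch)
    start = time.perf_counter()
    for _ in range(n_runs):
        model_fn(batch)
    elapsed = time.perf_counter() - start
    return n_runs * len(batch) / elapsed


def dummy_model(batch):
    # Placeholder workload: sum each "waveform" in the batch.
    return [sum(x) for x in batch]


# 8 fake one-second waveforms at 16 kHz (all zeros, illustration only).
batch = [[0.0] * 16000 for _ in range(8)]
print(f"{measure_throughput(dummy_model, batch):.1f} samples/s")
```

When the model runs on a GPU, kernels are launched asynchronously, so a real measurement would also need to synchronize the device before reading the timer.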

BibTeX

If you find our work useful, please consider citing:

@inproceedings{xu2025alden,
author = {Xu, Yuxiong and Li, Bin and Li, Weixiang and Mandelli, Sara and Negroni, Viola and Li, Sheng},
title = {ALDEN: Dual-Level Disentanglement with Meta-learning for Generalizable Audio Deepfake Detection},
year = {2025},
url = {https://doi.org/10.1145/3746027.3754741},
doi = {10.1145/3746027.3754741},
booktitle = {Proceedings of the 33rd ACM International Conference on Multimedia},
pages = {7277–7286},
numpages = {10},
}