Title | Authors | Venue | Cited by | Year |
Chat-3d: Data-efficiently tuning large language model for universal dialogue of 3d scenes | Z Wang, H Huang, Y Zhao, Z Zhang, Z Zhao | NAACL 2025, 2023 | 57 | 2023 |
Connecting Multi-modal Contrastive Representations | Z Wang, Y Zhao, X Cheng, H Huang, J Liu, L Tang, L Li, Y Wang, A Yin, ... | NeurIPS 2023, 2023 | 35 | 2023 |
Make-a-voice: Revisiting voice large language models as scalable multilingual and multitask learners | R Huang, C Zhang, Y Wang, D Yang, J Tian, Z Ye, L Liu, Z Wang, Z Jiang, ... | Proceedings of the 62nd Annual Meeting of the Association for Computational …, 2024 | 34* | 2024 |
Wavtokenizer: an efficient acoustic discrete codec tokenizer for audio language modeling | S Ji, Z Jiang, W Wang, Y Chen, M Fang, J Zuo, Q Yang, X Cheng, Z Wang, ... | ICLR 2025, 2024 | 33 | 2024 |
Chat-3d v2: Bridging 3d scene and large language models with object identifiers | H Huang*, Z Wang*, R Huang, L Liu, X Cheng, Y Zhao, T Jin, Z Zhao | arXiv preprint arXiv:2312.08168, 2023 | 31 | 2023 |
MixSpeech: Cross-Modality Self-Learning with Audio-Visual Stream Mixup for Visual Speech Translation and Recognition | X Cheng, T Jin, R Huang, L Li, W Lin, Z Wang, Y Wang, H Liu, A Yin, ... | ICCV 2023, 15735-15745, 2023 | 25 | 2023 |
Lumina-Next: Making Lumina-T2X Stronger and Faster with Next-DiT | L Zhuo, R Du, H Xiao, Y Li, D Liu, R Huang, W Liu, L Zhao, FY Wang, ... | NeurIPS 2024, 2024 | 21* | 2024 |
3DRP-Net: 3D Relative Position-aware Network for 3D Visual Grounding | Z Wang, H Huang, Y Zhao, L Li, X Cheng, Y Zhu, A Yin, Z Zhao | EMNLP 2023, 2023 | 19 | 2023 |
Distilling Coarse-to-Fine Semantic Matching Knowledge for Weakly Supervised 3D Visual Grounding | Z Wang, H Huang, Y Zhao, L Li, X Cheng, Y Zhu, A Yin, Z Zhao | ICCV 2023, 2023 | 19 | 2023 |
FreeBind: Free Lunch in Unified Multimodal Space via Knowledge Fusion | Z Wang, Z Zhang, X Cheng, R Huang, L Liu, Z Ye, H Huang, Y Zhao, T Jin, ... | ICML 2024, 2024 | 18* | 2024 |
Frieren: Efficient Video-to-Audio Generation with Rectified Flow Matching | Y Wang, W Guo, R Huang, J Huang, Z Wang, F You, R Li, Z Zhao | NeurIPS 2024, 2024 | 17* | 2024 |
Chat-scene: Bridging 3d scene and large language models with object identifiers | H Huang*, Y Chen*, Z Wang*, R Huang, R Xu, T Wang, L Liu, X Cheng, ... | NeurIPS 2024, 2024 | 15 | 2024 |
Wavchat: A survey of spoken dialogue models | S Ji, Y Chen, M Fang, J Zuo, J Lu, H Wang, Z Jiang, L Zhou, S Liu, ... | arXiv preprint arXiv:2411.13577, 2024 | 15 | 2024 |
Extending multi-modal contrastive representations | Z Wang, Z Zhang, L Liu, Y Zhao, H Huang, T Jin, Z Zhao | NeurIPS 2024, 2023 | 10 | 2023 |
Omnibind: Large-scale omni multimodal representation via binding spaces | Z Wang, Z Zhang, H Zhang, L Liu, R Huang, X Cheng, H Zhao, Z Zhao | ICLR 2025, 2024 | 9 | 2024 |
Controlspeech: Towards simultaneous zero-shot speaker cloning and zero-shot language style control with decoupled codec | S Ji, J Zuo, W Wang, M Fang, S Zheng, Q Chen, Z Jiang, H Huang, ... | arXiv preprint arXiv:2406.01205, 2024 | 9 | 2024 |
TransFace: Unit-Based Audio-Visual Speech Synthesizer for Talking Head Translation | X Cheng, R Huang, L Li, T Jin, Z Wang, A Yin, M Li, X Duan, Z Zhao | ACL 2024, 2023 | 8 | 2023 |
Scene-robust natural language video localization via learning domain-invariant representations | Z Wang, Y Zhao, H Huang, Y Xia, Z Zhao | ACL 2023, 144-160, 2023 | 6 | 2023 |
Action Imitation in Common Action Space for Customized Action Image Synthesis | W Lin, J Chen, J Shi, Z Guo, Y Zhu, Z Wang, T Jin, Z Zhao, F Wu, ... | NeurIPS 2024, 2024 | 3 | 2024 |
VoiceTuner: Self-Supervised Pre-training and Efficient Fine-tuning For Voice Generation | R Huang, Y Wang, R Hu, X Xu, Z Hong, D Yang, X Cheng, Z Wang, ... | Proceedings of the 32nd ACM International Conference on Multimedia, 10630-10639, 2024 | 2 | 2024 |