ESPnet Hackathon 2023
TL;DR: My experience collaborating with Prof. Shinji's research group at CMU during the ESPnet Hackathon. Our paper, Exploring Speech Recognition, Translation, and Understanding with Discrete Speech Units: A Comparative Study, was accepted at ICASSP 2024.
1 ESPnet Hackathon 2023
Back in the summer of 2023, I joined the ESPnet Hackathon after seeing it mentioned in a post by Prof. Hung-yi Lee, my current MS advisor. At the time, I was a third-year undergrad with only a basic understanding of speech processing. Little did I know that those three days would become the most remarkable experience in my research journey.
The intense schedule of meetings pushed everyone to submit PRs to the ESPnet toolkit. CI tests were passed, many bugs were fixed, and new scripts were developed at an astonishing speed. It is worth mentioning that Prof. Shinji turned out to be the most productive member, submitting hundreds of PRs during those three days.
2 Discrete Unit ASR
At that time, I was studying Meta's dGSLM and had developed an interest in discrete tokens; I had even given a talk on it at Prof. Hsuan-Tien's regular group meeting. Fortunately, I encountered a group of researchers there with similar interests. We gathered in a Slack channel and decided to extend the `asr2` recipe to many other corpora, including AISHELL, WenetSpeech, CommonVoice, GigaSpeech, SwitchBoard, TEDLIUM3, etc.
We define `asr1` and `asr2` as follows (a sketch of the `asr2` unit preparation appears after this list):
- `asr1`: an ASR task using Kaldi's fbank features as the input signal.
- `asr2`: an ASR task modeled as machine translation from HuBERT units (with deduplication + BPE) to text (with BPE).
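To make the `asr2` input concrete, here is a minimal sketch of the deduplication step, assuming the HuBERT cluster indices have already been extracted. `units_to_text` is a hypothetical helper name, not ESPnet's actual API:

```python
from itertools import groupby

def units_to_text(units):
    """Hypothetical helper: collapse consecutive repeated HuBERT cluster
    indices and render them as space-separated pseudo-tokens, ready for
    BPE training."""
    deduped = [k for k, _ in groupby(units)]          # [5,5,5,12,12,7] -> [5,12,7]
    return " ".join(f"<unit_{u}>" for u in deduped)   # "<unit_5> <unit_12> <unit_7>"

# A subword model (e.g. SentencePiece BPE) is then trained on these
# strings, so asr2 becomes translation from unit BPE to text BPE.
print(units_to_text([5, 5, 5, 12, 12, 7]))
```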
2.1 Training an ASR Model on a Huge Corpus
I chose to conduct discrete ASR experiments on the WenetSpeech corpus, a 10,000-hour Chinese speech corpus with transcriptions. At that time, I had no idea what working with 10,000 hours of data meant.
We utilized a relatively small encoder-decoder model (E-Branchformer encoder, Transformer decoder) with 41M parameters. However, running experiments on a 2TB corpus proved challenging: I encountered endless disk-quota-exceeded errors, CUDA OOMs, and even a server shutdown (for which I am truly sorry) while extracting HuBERT features, training the k-means model, and training the ASR model.
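For context, here is a minimal sketch of that offline quantization stage. It is not ESPnet's actual script: it assumes torchaudio's bundled HUBERT_BASE and scikit-learn's MiniBatchKMeans as stand-ins, and the layer index, cluster count, and file name are illustrative choices.

```python
import torch
import torchaudio
from sklearn.cluster import MiniBatchKMeans

# Pretrained HuBERT bundled with torchaudio (a stand-in for ESPnet's scripts).
bundle = torchaudio.pipelines.HUBERT_BASE
model = bundle.get_model().eval()

wav, sr = torchaudio.load("sample.wav")  # hypothetical utterance
wav = torchaudio.functional.resample(wav, sr, bundle.sample_rate)

with torch.no_grad():
    # extract_features returns the hidden states of each transformer layer;
    # an intermediate layer is typically used for clustering (index 6 here
    # is an assumption, not the hackathon's exact choice).
    feats, _ = model.extract_features(wav)
    layer_feats = feats[6].squeeze(0)  # (num_frames, feature_dim)

# Fit k-means over the features. Real recipes pool features from many
# utterances and use far more clusters; kept tiny here so a short clip
# is enough to run the sketch.
kmeans = MiniBatchKMeans(n_clusters=100, batch_size=1024)
kmeans.fit(layer_feats.numpy())

units = kmeans.predict(layer_feats.numpy())  # the discrete unit sequence
```

Each utterance's unit sequence then goes through the deduplication + BPE step sketched earlier before ASR training.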
I would now think twice before running experiments on WenetSpeech again.
2.2 Results
`asr1` and `pengcheng` are baseline ASR models that convert fbank features into transcriptions; `asr2` is our proposed method, which models ASR as machine translation from discrete Chinese HuBERT units to Chinese BPE transcriptions.
From Table 1 we can see that the `asr2` model (41M) outperforms the `asr1` baseline. On top of that, training on discrete units is significantly faster than training on fbank features. Even though the 41M model cannot beat the previous 114M results on ESPnet, we believe this is due to the model size limitation.
3 Acknowledgments
During the Hackathon, everyone was kind and approachable. Among them, Xuankai and Kwanghee helped me the most:
I met Xuankai (personal website), a senior PhD student at CMU. He was our project leader and provided fast, detailed responses to my difficulties on Slack. Without his help, I would not have been able to complete the experiments. He also gave me some refreshing advice about doing research from a senior's perspective. I am sincerely grateful for his help.
I also met Kwanghee (personal website), an outstanding master's student at CMU. We had several conversations. He was preparing to fly to the US to start his MS journey that summer. He worked actively throughout the hackathon and was incredibly kind. He even shared his reasons for choosing to pursue a master's at CMU, which was invaluable for an undergrad like me.
3.1 Special Thanks
Working and collaborating with the CMU team was a life-changing experience for me. Being able to work with prominent researchers gave me many insights. I would like to thank Kwanghee and Xuankai again; your help means a lot to me. Huge thanks!