Benchmark Intended Grouping of Open Speech (BIGOS)
In the rapidly developing field of Automatic Speech Recognition (ASR), a robust ecosystem is needed to monitor technological progress and compare the effectiveness of different solutions across various applications. Inspired by the global trend towards transparent and collective AI development, UAM CAI (Adam Mickiewicz University Center for Artificial Intelligence) proudly introduces the BIGOS corpus (Benchmark Intended Grouping of Open Speech) and the Polish ASR Leaderboard (PAL).
Polish ASR Leaderboard (PAL)
The mission of PAL is to establish a dynamic Polish ASR evaluation ecosystem, ensuring equal opportunities for benchmarking both commercial and open ASR systems. Our vision is for PAL to serve as a comprehensive resource, informing potential ASR users about the advantages, limitations, and expected performance of ASR technology in various practical scenarios.
Ultimately, we aim to bridge the gap between controlled benchmarking (typically described in scientific publications) and continuous, real-world evaluations of ASR applications, which are often conducted internally by Big Tech companies.
Our goal is for the PAL leaderboard to become the primary source of information for anyone considering ASR technology for Polish (and in the future, other languages). To achieve this, comprehensive evaluation data is essential, accurately reflecting specific use cases and linguistic characteristics. This is made possible through BIGOS (Benchmark Intended Grouping of Open Speech).
BIGOS – Making Open ASR Data More Useful
The mission of BIGOS is to make open ASR speech datasets more accessible and useful. We discover, organize, and refine existing ASR datasets, enhancing their availability and value for ASR development and evaluation. Our goal is to save time for ASR researchers and developers by providing standardized data formats and efficient data management tools, utilizing industry best practices such as the Hugging Face dataset framework.
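As a minimal sketch of what this standardization looks like in practice, the snippet below streams a BIGOS corpus through the Hugging Face `datasets` library and iterates over a few audio/transcription pairs. The repository ID and column names are assumptions for illustration; the dataset card on the HF hub has the authoritative values.

```python
# Minimal sketch: consuming a BIGOS corpus via the Hugging Face `datasets` library.
# The repository ID and column names are assumptions; consult the dataset card.
# A subset (config) name may also be required: load_dataset(repo, subset, ...).
from datasets import load_dataset

# Stream the test split so the whole corpus is not downloaded up front.
bigos = load_dataset("amu-cai/pl-asr-bigos-v2", split="test", streaming=True)

for sample in bigos.take(3):
    audio = sample["audio"]    # dict with "array" (waveform) and "sampling_rate"
    text = sample["ref_orig"]  # assumed name of the reference transcription column
    print(f'{len(audio["array"])} samples @ {audio["sampling_rate"]} Hz: "{text}"')
```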
The integration of AMU BIGOS and the Polish ASR Leaderboard provides the community with:
✔ The largest unified collection of open Polish speech datasets, selected to maximize evaluation usability and ease of use.
✔ The most extensive ASR system benchmark for Poland, covering both commercial and open-source systems.
✔ A scalable data management framework for cataloging and curating ASR speech data.
✔ An expandable evaluation framework for benchmarking new ASR systems.
Evaluation Tasks and Data
The Polish ASR leaderboard currently utilizes the following evaluation corpora:
● BIGOS V2 – a collection of 12 well-known speech datasets for Polish ASR development, including Google’s FLEURS, Facebook’s MLS, Mozilla’s Common Voice, CLARIN-PL, and others.
For more information, refer to the dataset card on the HF hub.
● PELCRA for BIGOS – a collection of annotated conversational speech data for linguistic research and ASR development, created by the PELCRA group at the University of Łódź. It includes datasets such as SpokesMix, SpokesBiz, and DiaBiz.
For more information, refer to the dataset card on the HF hub.
Each corpus consists of multiple subsets with unique acoustic and linguistic properties, resulting in a diverse set of evaluation tasks with varying levels of difficulty in assessing ASR capabilities.
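Because each subset is exposed as a separate dataset configuration on the hub, the available evaluation tasks can be enumerated programmatically. A short sketch follows, assuming the repository IDs shown (verify them against the dataset cards):

```python
# Sketch: listing the evaluation subsets (exposed as HF dataset configurations).
# Both repository IDs are assumptions; see the dataset cards for exact values.
from datasets import get_dataset_config_names

for repo in ("amu-cai/pl-asr-bigos-v2", "pelcra/pl-asr-pelcra-for-bigos"):
    configs = get_dataset_config_names(repo)
    print(f"{repo}: {len(configs)} subsets")
    for name in configs:
        print("  -", name)
```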
Ranking in Action: The Polish ASR Barometer
A total of 19 systems (8 commercial and 11 publicly available) were evaluated on 24 unique subsets, comprising over 4,000 recordings and 10 hours of speech. In total, more than 80,000 unique recording-ASR output pairs were used to calculate accuracy results.
We observed that Whisper-large models achieve the highest performance on both BIGOS and PELCRA tasks. However, because PELCRA contains conversational speech, whereas BIGOS mostly features read speech, error rates on PELCRA are more than twice as high.
| Type | Best System | WER (BIGOS) [%] | WER (PELCRA) [%] |
|---|---|---|---|
| Open-source | Whisper Large | 8.38 | 23.4 |
| Commercial | Whisper Cloud | 10.05 | 23.5 |
We also observed that the freely available Whisper models (large and medium) offer accuracy comparable to commercial services from Google and Microsoft. In contrast, other free models, such as NeMo, MMS, and Wav2Vec, deliver subpar accuracy.
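At a small scale, word error rates like those above can be reproduced with the open `jiwer` package. A sketch follows; the normalization chain here (lowercasing, punctuation stripping) is a common convention, not necessarily the exact pipeline PAL uses.

```python
# Sketch: scoring reference/hypothesis pairs with the `jiwer` package.
# The normalization below is a typical choice, not necessarily PAL's exact pipeline.
import jiwer

normalize = jiwer.Compose([
    jiwer.ToLowerCase(),
    jiwer.RemovePunctuation(),
    jiwer.RemoveMultipleSpaces(),
    jiwer.Strip(),
])

# Toy reference transcriptions and ASR outputs (one substitution error in the second pair).
references = ["Ala ma kota.", "Dzień dobry, jak się masz?"]
hypotheses = ["ala ma kota", "dzień dobry jak się marz"]

wer = jiwer.wer(
    [normalize(r) for r in references],
    [normalize(h) for h in hypotheses],
)
print(f"WER: {100 * wer:.2f}%")  # aggregate WER over both utterances
```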
Check out the full results on the Hugging Face dashboard.
Our Vision and Next Steps
We recognize the following limitations of BIGOS and PAL:
● Impact of Data Quality on Reliability: Despite our efforts to maintain open data, some recordings and transcriptions remain of lower quality. We continuously refine the BIGOS corpus to eliminate such instances from evaluations.
● Representativeness of Data for Real-World Scenarios: Open datasets become outdated over time. Given the evolving use cases of ASR and its target audiences, it is essential to systematically add new datasets and analyze ASR performance across various socio-demographic dimensions to ensure the leaderboard remains reflective of real-world ASR capabilities.
● Risk of Data Leakage: Since BIGOS corpora originate from public resources, there is a risk that evaluated systems were trained on the test data. Therefore, it is necessary to collect new, unpublished test sets covering scenarios that open datasets miss or that PAL users specifically request. Keeping these test sets private prevents dataset contamination and ensures a fairer comparison framework.
● Limited Language Support: Currently, BIGOS and PAL focus solely on Polish. We believe that applying an established data curation process to other languages would significantly reduce the overall cost of developing a comprehensive ASR benchmark, though it would not entirely eliminate the expensive data preparation stages.
We plan to address these limitations to become a trusted and widely recognized resource. By incorporating diverse benchmarks that strongly correlate with real-world use cases, we aim to make the leaderboard valuable for businesses. Our goal is to bridge the gap between academic research and practical applications, continuously updating and improving the rankings based on feedback from both the research community and industry professionals. This ensures that our benchmarks remain rigorous, comprehensive, and up-to-date.
Through these efforts, we hope to contribute to the advancement of this field by providing a platform that accurately measures and drives ASR progress in Polish and beyond.
If you are developing ASR systems or speech datasets and would like to collaborate with us, get in touch!
Corpus Maintainer: Michał Juńczyk
Source Corpus Authors
License: Creative Commons Attribution-ShareAlike 4.0 (CC BY-SA 4.0)