NusaCrowd: Open Source Initiative for Indonesian NLP Resources
Samuel Cahyawijaya, Holy Lovenia, Alham Fikri Aji, Genta Winata, Bryan Wilie, Fajri Koto, Rahmad Mahendra, Christian Wibisono, Ade Romadhony, Karissa Vincentio, Jennifer Santoso, David Moeljadi, Cahya Wirawan, Frederikus Hudi, Muhammad Satrio Wicaksono, Ivan Parmonangan, Ika Alfina, Ilham Firdausi Putra, Samsul Rahmadani, Yulianti Oenang, Ali Septiandri, James Jaya, Kaustubh Dhole, Arie Suryani, Rifki Afina Putri, Dan Su, Keith Stevens, Made Nindyatama Nityasya, Muhammad Adilazuarda, Ryan Hadiwijaya, Ryandito Diandaru, Tiezheng Yu, Vito Ghifari, Wenliang Dai, Yan Xu, Dyah Damapuspita, Haryo Wibowo, Cuk Tho, Ichwanul Karo Karo, Tirana Fatyanosa, Ziwei Ji, Graham Neubig, Timothy Baldwin, Sebastian Ruder, Pascale Fung, Herry Sujaini, Sakriani Sakti, Ayu Purwarianti
Abstract
We present NusaCrowd, a collaborative initiative to collect and unify existing resources for Indonesian languages, including opening access to previously non-public resources. Through this initiative, we have brought together 137 datasets and 118 standardized data loaders. The quality of the datasets has been assessed manually and automatically, and their value is demonstrated through multiple experiments. NusaCrowd’s data collection enables the creation of the first zero-shot benchmarks for natural language understanding and generation in Indonesian and the local languages of Indonesia. Furthermore, NusaCrowd brings the creation of the first multilingual automatic speech recognition benchmark in Indonesian and the local languages of Indonesia. Our work strives to advance natural language processing (NLP) research for languages that are under-represented despite being widely spoken.
- Anthology ID:
- 2023.findings-acl.868
- Original:
- 2023.findings-acl.868v1
- Version 2:
- 2023.findings-acl.868v2
- Volume:
- Findings of the Association for Computational Linguistics: ACL 2023
- Month:
- July
- Year:
- 2023
- Address:
- Toronto, Canada
- Editors:
- Anna Rogers, Jordan Boyd-Graber, Naoaki Okazaki
- Venue:
- Findings
- Publisher:
- Association for Computational Linguistics
- Pages:
- 13745–13818
- URL:
- https://aclanthology.org/2023.findings-acl.868
- DOI:
- 10.18653/v1/2023.findings-acl.868
- Bibkey:
- cahyawijaya-etal-2023-nusacrowd
- Cite (ACL):
- Samuel Cahyawijaya, Holy Lovenia, Alham Fikri Aji, Genta Winata, Bryan Wilie, Fajri Koto, Rahmad Mahendra, Christian Wibisono, Ade Romadhony, Karissa Vincentio, Jennifer Santoso, David Moeljadi, Cahya Wirawan, Frederikus Hudi, Muhammad Satrio Wicaksono, Ivan Parmonangan, Ika Alfina, Ilham Firdausi Putra, Samsul Rahmadani, et al. 2023. NusaCrowd: Open Source Initiative for Indonesian NLP Resources. In Findings of the Association for Computational Linguistics: ACL 2023, pages 13745–13818, Toronto, Canada. Association for Computational Linguistics.
- Cite (Informal):
- NusaCrowd: Open Source Initiative for Indonesian NLP Resources (Cahyawijaya et al., Findings 2023)
- PDF:
- https://aclanthology.org/2023.findings-acl.868.pdf
Export citation
@inproceedings{cahyawijaya-etal-2023-nusacrowd,
    title = "{N}usa{C}rowd: Open Source Initiative for {I}ndonesian {NLP} Resources",
    author = "Cahyawijaya, Samuel and
      Lovenia, Holy and
      Aji, Alham Fikri and
      Winata, Genta and
      Wilie, Bryan and
      Koto, Fajri and
      Mahendra, Rahmad and
      Wibisono, Christian and
      Romadhony, Ade and
      Vincentio, Karissa and
      Santoso, Jennifer and
      Moeljadi, David and
      Wirawan, Cahya and
      Hudi, Frederikus and
      Wicaksono, Muhammad Satrio and
      Parmonangan, Ivan and
      Alfina, Ika and
      Putra, Ilham Firdausi and
      Rahmadani, Samsul and
      Oenang, Yulianti and
      Septiandri, Ali and
      Jaya, James and
      Dhole, Kaustubh and
      Suryani, Arie and
      Putri, Rifki Afina and
      Su, Dan and
      Stevens, Keith and
      Nityasya, Made Nindyatama and
      Adilazuarda, Muhammad and
      Hadiwijaya, Ryan and
      Diandaru, Ryandito and
      Yu, Tiezheng and
      Ghifari, Vito and
      Dai, Wenliang and
      Xu, Yan and
      Damapuspita, Dyah and
      Wibowo, Haryo and
      Tho, Cuk and
      Karo Karo, Ichwanul and
      Fatyanosa, Tirana and
      Ji, Ziwei and
      Neubig, Graham and
      Baldwin, Timothy and
      Ruder, Sebastian and
      Fung, Pascale and
      Sujaini, Herry and
      Sakti, Sakriani and
      Purwarianti, Ayu",
    editor = "Rogers, Anna and
      Boyd-Graber, Jordan and
      Okazaki, Naoaki",
    booktitle = "Findings of the Association for Computational Linguistics: ACL 2023",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.findings-acl.868",
    doi = "10.18653/v1/2023.findings-acl.868",
    pages = "13745--13818",
    abstract = "We present NusaCrowd, a collaborative initiative to collect and unify existing resources for Indonesian languages, including opening access to previously non-public resources. Through this initiative, we have brought together 137 datasets and 118 standardized data loaders. The quality of the datasets has been assessed manually and automatically, and their value is demonstrated through multiple experiments. NusaCrowd{'}s data collection enables the creation of the first zero-shot benchmarks for natural language understanding and generation in Indonesian and the local languages of Indonesia. Furthermore, NusaCrowd brings the creation of the first multilingual automatic speech recognition benchmark in Indonesian and the local languages of Indonesia. Our work strives to advance natural language processing (NLP) research for languages that are under-represented despite being widely spoken.",
}