Authors
Gabriel Orlanski, Kefan Xiao, Xavier Garcia, Jeffrey Hui, Joshua Howland, Jonathan Malmaud, Jacob Austin, Rishabh Singh, Michele Catasta
Publication date
2023/7/3
Conference
International Conference on Machine Learning
Pages
26619-26645
Publisher
PMLR
Description
Current benchmarks for evaluating neural code models focus on only a small subset of programming languages, excluding many popular languages such as Go or Rust. To ameliorate this issue, we present the BabelCode framework for execution-based evaluation of any benchmark in any language. BabelCode enables new investigations into the qualitative performance of models' memory, runtime, and individual test case results. Additionally, we present a new code translation dataset called Translating Python Programming Puzzles (TP3) from the Python Programming Puzzles (Schuster et al., 2021) benchmark that involves translating expert-level Python functions to any language. With both BabelCode and the TP3 benchmark, we investigate if balancing the distributions of 14 languages in a training dataset improves a large language model's performance on low-resource languages. Training a model on a balanced corpus results in, on average, 12.34% higher $pass@k$ across all tasks and languages compared to the baseline. We find that this strategy achieves 66.48% better $pass@k$ on low-resource languages at the cost of only a 12.94% decrease on high-resource languages. In our three translation tasks, this strategy yields, on average, 30.77% better low-resource $pass@k$ while having 19.58% worse high-resource $pass@k$.
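The abstract reports results in terms of $pass@k$ without restating the metric. For reference, execution-based code benchmarks conventionally use the unbiased estimator of Chen et al. (2021); it is assumed here, not confirmed by the abstract, that BabelCode follows the same convention:

$$pass@k = \mathbb{E}_{\text{tasks}}\left[1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}\right]$$

where $n$ is the number of samples generated per task and $c$ is the number of those samples that pass every test case.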
Total citations
Scholar articles
G Orlanski, K Xiao, X Garcia, J Hui, J Howland… - International Conference on Machine Learning, 2023