A wrapper library for the tokenizers from Project CodeNet.
Install the library directly from this repository with pip:
pip install git+https://github.com/ashirafj/codenet-tokenizers
For Python source code:
from codenet_tokenizers.tokenizers import PyTokenizer
tokenizer = PyTokenizer()
For C source code:
from codenet_tokenizers.tokenizers import CTokenizer
tokenizer = CTokenizer()
For C++ source code:
from codenet_tokenizers.tokenizers import CppTokenizer
tokenizer = CppTokenizer()
For Java source code:
from codenet_tokenizers.tokenizers import JavaTokenizer
tokenizer = JavaTokenizer()
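When the language is only known at run time, the four classes above could be chosen from a small lookup table. The following is a minimal sketch, not part of the library: `TOKENIZER_CLASSES` and `tokenizer_for` are hypothetical names, the file-extension keys are this sketch's own choice, and it assumes every tokenizer class is constructed without arguments, as in the snippets above.

```python
import importlib

# Class names are taken from the snippets above.
# The extension keys are an assumption made for this sketch.
TOKENIZER_CLASSES = {
    ".py": "PyTokenizer",
    ".c": "CTokenizer",
    ".cpp": "CppTokenizer",
    ".java": "JavaTokenizer",
}


def tokenizer_for(extension):
    """Return a tokenizer instance for a file extension such as '.py'."""
    try:
        class_name = TOKENIZER_CLASSES[extension.lower()]
    except KeyError:
        raise ValueError(f"no tokenizer for extension {extension!r}") from None
    # Import lazily so the lookup table can be inspected
    # even when the library is not installed.
    module = importlib.import_module("codenet_tokenizers.tokenizers")
    return getattr(module, class_name)()
```

With the library installed, `tokenizer = tokenizer_for(".java")` would then be equivalent to `tokenizer = JavaTokenizer()`.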
To tokenize the source code, group the tokens by line, and remove unnecessary tokens:
normalized_tokens = tokenizer.normalize_separated(code)
To normalize the source code based on the tokenized results:
normalized_code = tokenizer.normalize(code)
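To give a rough idea of what per-line tokenization and normalization mean, here is a standalone sketch built on Python's standard `tokenize` module. It is not the library's implementation, and the exact return values of `normalize_separated` and `normalize` may differ. The helper names `tokens_by_line` and `normalize_sketch`, and the choice of which token types count as unnecessary, are assumptions made for illustration only.

```python
import io
import tokenize
from typing import Dict, List

# Token types dropped in this sketch (an assumption, not the library's rule):
# comments, line breaks, indentation markers, and the end-of-file marker.
SKIPPED = {
    tokenize.COMMENT,
    tokenize.NL,
    tokenize.NEWLINE,
    tokenize.INDENT,
    tokenize.DEDENT,
    tokenize.ENDMARKER,
}


def tokens_by_line(code: str) -> List[List[str]]:
    """Group token strings by the source line on which each token starts."""
    lines: Dict[int, List[str]] = {}
    for tok in tokenize.generate_tokens(io.StringIO(code).readline):
        if tok.type in SKIPPED:
            continue
        lines.setdefault(tok.start[0], []).append(tok.string)
    # Lines that held only skipped tokens (blank or comment-only) are omitted.
    return [lines[n] for n in sorted(lines)]


def normalize_sketch(code: str) -> str:
    """Rebuild the code with one space between the remaining tokens."""
    return "\n".join(" ".join(line) for line in tokens_by_line(code))


sample = "x = 1  # set x\nprint(x)\n"
print(tokens_by_line(sample))    # [['x', '=', '1'], ['print', '(', 'x', ')']]
print(normalize_sketch(sample))  # two lines: "x = 1" and "print ( x )"
```

Because indentation tokens are dropped, the rebuilt text in this sketch is meant for comparison or model input and is not guaranteed to be runnable Python.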