# rb-tokenizer

`rb-tokenizer` is a flexible, rule-based tokenizer written in Rust, designed to make text tokenization customizable and extensible. It supports a wide range of applications, from simple text parsing to complex programming-language lexers.

## Features
- Customizable Tokenization: Easily define your own tokenization rules with regular expressions and symbols (see the sketch after this list for the general idea).
- Extensible Architecture: Add new rule types to suit your specific tokenization needs.
- Performance: Optimized for speed and efficiency, handling large texts swiftly.
- Easy Integration: Designed to be integrated into larger parsing or text analysis projects.
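If you are wondering what "rule-based" means in practice, the following is a minimal, self-contained sketch of the general idea, using only the standard library. It is *not* `rb-tokenizer`'s implementation or API; it only illustrates the model the crate builds on: a tokenizer is an ordered list of rules, and each rule is tried at the current input position until one matches.

```rust
// Illustrative sketch only -- not rb-tokenizer's API.
// A rule is a token type plus a matcher that returns the length of a match
// at the start of the input, if any.
struct Rule {
    token_type: &'static str,
    matcher: fn(&str) -> Option<usize>,
}

fn match_number(s: &str) -> Option<usize> {
    let len = s.chars().take_while(|c| c.is_ascii_digit()).count();
    if len > 0 { Some(len) } else { None }
}

fn match_identifier(s: &str) -> Option<usize> {
    let mut chars = s.chars();
    if !matches!(chars.next(), Some(c) if c.is_ascii_alphabetic() || c == '_') {
        return None;
    }
    Some(1 + chars.take_while(|c| c.is_ascii_alphanumeric() || *c == '_').count())
}

fn tokenize(rules: &[Rule], mut input: &str) -> Vec<(&'static str, String)> {
    let mut tokens = Vec::new();
    while !input.is_empty() {
        let c = input.chars().next().unwrap();
        // Skip whitespace between tokens.
        if c.is_whitespace() {
            input = &input[c.len_utf8()..];
            continue;
        }
        // Try each rule in order; the first one that matches wins.
        let mut matched = false;
        for rule in rules {
            if let Some(len) = (rule.matcher)(input) {
                tokens.push((rule.token_type, input[..len].to_string()));
                input = &input[len..];
                matched = true;
                break;
            }
        }
        if !matched {
            // No rule matched here; a real tokenizer would report an error instead.
            input = &input[c.len_utf8()..];
        }
    }
    tokens
}

fn main() {
    let rules = [
        Rule { token_type: "Number", matcher: match_number },
        Rule { token_type: "Identifier", matcher: match_identifier },
    ];
    // Prints: [("Identifier", "ADD"), ("Number", "42"), ("Identifier", "x_1")]
    println!("{:?}", tokenize(&rules, "ADD 42 x_1"));
}
```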
## Installation

Ensure you have Rust installed on your system. You can install Rust and Cargo via [rustup](https://rustup.rs).

Add `rb-tokenizer` to your `Cargo.toml`:
```toml
[dependencies]
rb-tokenizer = { git = "https://github.com/maniartech/rb-tokenizer.git" }
```
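Since this is a git dependency, you may want to pin it for reproducible builds; Cargo's `rev`, `tag`, and `branch` keys on git dependencies support this.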
## Usage

To use `rb-tokenizer` in your project, start by creating a `Tokenizer` instance and adding rules:
```rust
use rb_tokenizer::Tokenizer;

fn main() {
    let mut tokenizer = Tokenizer::new();

    // Regex rules: pattern, token type, optional sub-type.
    tokenizer.add_regex_rule(r"^\d+", "Number", None);
    tokenizer.add_regex_rule(r"^[a-zA-Z_][a-zA-Z0-9_]*", "Identifier", None);

    // Symbol rules match an exact string; the third argument supplies the
    // sub-type (compare the token_sub_type values in the output below).
    tokenizer.add_symbol_rule("(", "Operator", Some("OpenParen"));
    tokenizer.add_symbol_rule(")", "Operator", Some("CloseParen"));
    tokenizer.add_symbol_rule("+", "Operator", Some("Plus"));

    let tokens = tokenizer.tokenize("ADD(2 + 2)").unwrap();
    println!("{:?}", tokens);
    // Output:
    // [
    //     Token { token_type: "Identifier", token_sub_type: None, value: "ADD", line: 1, column: 1 },
    //     Token { token_type: "Operator", token_sub_type: Some("OpenParen"), value: "(", line: 1, column: 4 },
    //     Token { token_type: "Number", token_sub_type: None, value: "2", line: 1, column: 5 },
    //     Token { token_type: "Operator", token_sub_type: Some("Plus"), value: "+", line: 1, column: 7 },
    //     Token { token_type: "Number", token_sub_type: None, value: "2", line: 1, column: 9 },
    //     Token { token_type: "Operator", token_sub_type: Some("CloseParen"), value: ")", line: 1, column: 10 }
    // ]
}
```
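In a real application you will usually want to handle the failure case instead of calling `unwrap()`. The snippet below is a sketch rather than documented API: it assumes `tokenize` returns a `Result` (as the example above implies) and that the `Token` fields shown in the Debug output are publicly readable; the error type is not documented here, so it is only printed via Debug.

```rust
match tokenizer.tokenize("ADD(2 + 2)") {
    Ok(tokens) => {
        for token in &tokens {
            // Field names follow the Debug output above; assumes public fields.
            println!("{:?} {:?} (line {:?}, column {:?})",
                     token.token_type, token.value, token.line, token.column);
        }
    }
    // Error type unknown from this README, so just Debug-print it.
    Err(err) => eprintln!("tokenization failed: {:?}", err),
}
```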
## Examples

You can find more examples in the `tests/` directory of the repository, demonstrating various use cases and configurations.
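If you have cloned the repository, you can run those test cases locally with `cargo test`.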
## Contributing

Contributions to `rb-tokenizer` are welcome! Here are a few ways you can help:
- Reporting Issues: Found a bug or have a feature request? Please open an issue.
- Pull Requests: Want to contribute code? Pull requests are warmly welcomed. Please ensure your code adheres to the project's coding standards and includes tests, if applicable.
- Documentation: Improvements to documentation or new examples are always appreciated.
Before contributing, please read our CONTRIBUTING.md guide.
## License

`rb-tokenizer` is distributed under the MIT License. See LICENSE for more information.
## Acknowledgments

- Inspired by the flexibility of rule-based tokenization in various programming languages and frameworks.