The repo provides the `main` binary built with FastLLM (https://github.com/ztxz16/fastllm.git) for the latest mainstream ARMv8-based Android mobile devices (i.e. smartphones and tablets). The binary has been tested and confirmed compatible with Qualcomm Snapdragon 8+ Gen 1, Snapdragon 8 Gen 2, and MediaTek Dimensity 8100 devices.
Install the Termux application on your Android device. Make sure your device has more than 6GB of RAM.
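A quick way to check the available memory from inside Termux is to read `/proc/meminfo` (a minimal sketch, assuming the kernel exposes it as on stock Android builds; this check is not part of the official setup):

```bash
# Print the total RAM reported by the kernel (value is in kB).
grep MemTotal /proc/meminfo
```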
Download a supported model file from HuggingFace. You only need the file with the `.flm` suffix. It is best to download it directly on your target Android device.
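If you prefer the command line, the model can also be fetched directly inside Termux with `curl`. This is an optional sketch; the HuggingFace URL below is a placeholder, not a real link, so substitute the actual address of the `.flm` file you chose:

```bash
# Install curl in Termux if it is not already present.
pkg install -y curl
# Download the model file; replace the placeholder URL with the real one.
curl -L -o chatglm2-6b-int4.flm \
  "https://huggingface.co/<user>/<repo>/resolve/main/chatglm2-6b-int4.flm"
```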
Download the `main` binary file. Copy or move the `main` file and the model file to a storage path on your target device, e.g. `downloads`.
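After this step, the downloads folder of your device should contain both files, for example:

```
downloads/
├── main
└── chatglm2-6b-int4.flm
```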
Open Termux and execute the command: `termux-setup-storage`.
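`termux-setup-storage` asks for the storage permission and creates a `storage` symlink in the Termux home directory. Before moving on, you can confirm that the shared downloads folder and the two files are visible; the model filename here is the example used below:

```bash
# After granting the permission prompt, shared storage is
# reachable under ~/storage.
ls storage/downloads
# Expect to see: chatglm2-6b-int4.flm  main
```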
Execute the command: `mv storage/downloads/<model_filename>.flm . && mv storage/downloads/main . && chmod 777 main`. Replace `<model_filename>.flm` with the filename of your model, e.g. `chatglm2-6b-int4.flm`. (Note that this example command works ONLY if you put the aforementioned two files under the `downloads` directory, which is STRONGLY recommended for common users!)
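Filled in with the example model name, the full command reads:

```bash
# Move both files into the Termux home directory and make the
# binary executable (777 is what the instructions above use).
mv storage/downloads/chatglm2-6b-int4.flm . && \
mv storage/downloads/main . && \
chmod 777 main
```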
Run the streamlined inference of the language model with the command: `./main -p <model_filename>.flm`.
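With the example filename, the launch command is simply:

```bash
# Start inference with the downloaded model.
./main -p chatglm2-6b-int4.flm
```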
You're ready for mobile on-device inference with the latest GLM(s)!
- If you encounter an error like `FORTIFY: read: count XXXXXXXX > SSIZE_MAX`, try adding `-l` at the end of the command to enable low memory mode inference. For instance, the Qwen-7B-Chat-int4.flm model has been observed to require low memory mode on devices with ≤12GB of RAM.
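With the Qwen model mentioned above, the low memory invocation looks like this:

```bash
# Append -l to enable low memory mode inference.
./main -p Qwen-7B-Chat-int4.flm -l
```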