In 2024, AI is ushering in the era of the AI PC. On May 20, Microsoft introduced the Copilot+ PC concept, in which an NPU lets a PC run SLMs and LLMs more efficiently. We can combine models from the Phi-3 family with the new AI PCs to build a simple, personalized Copilot application. This post uses an Intel AI PC together with Intel's OpenVINO, the Intel NPU Acceleration Library, and Microsoft's DirectML to create a local Copilot. An on-demand recording of the May 20 Microsoft Copilot+ PC event is available.
Phi-3-Mini is a Transformer-based language model with 3.8 billion parameters. It was trained on high-quality, educationally useful data, augmented with new data sources consisting of various NLP synthetic texts and both internal and external chat datasets, which significantly improves its chat capabilities. The Mini version comes in two variants, 4K and 128K, named for the context length (in tokens) each supports.
Phi-3-mini is a 3.8B parameter language model, available in two context lengths: 128K and 4K.
Phi-3-Small is a Transformer-based language model with 7 billion parameters. It was trained on high-quality, educationally useful data, augmented with new data sources consisting of various NLP synthetic texts and both internal and external chat datasets, which significantly improves its chat capabilities. Phi-3-Small is also trained more intensively on multilingual datasets than Phi-3-Mini. It comes in two variants, 8K and 128K, named for the context length (in tokens) each supports.
Phi-3-small is a 7B parameter language model, available in two context lengths: 128K and 8K.
Phi-3-Medium is a Transformer-based language model with 14 billion parameters. It was trained on high-quality, educationally useful data, augmented with new data sources consisting of various NLP synthetic texts and both internal and external chat datasets, which significantly improves its chat capabilities. It comes in two variants, 4K and 128K, named for the context length (in tokens) each supports.
Phi-3-medium is a 14B parameter language model, available in two context lengths: 128K and 4K.
Phi-3-Vision is a lightweight, state-of-the-art open multimodal model built on datasets that include synthetic data and filtered publicly available websites, with a focus on very high-quality, reasoning-dense data for both text and vision. It belongs to the Phi-3 model family, and the multimodal version supports a 128K context length (in tokens). The model underwent a rigorous enhancement process, incorporating both supervised fine-tuning and direct preference optimization, to ensure precise instruction adherence and robust safety measures.
Phi-3-vision is a 4.2B parameter multimodal model with language and vision capabilities.
For AI PCs, I personally recommend Phi-3-mini. Phi-3-small, Phi-3-vision, and Phi-3-medium are better suited to NVIDIA CUDA devices.
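A rough back-of-the-envelope calculation helps explain this recommendation. The sketch below estimates only the memory needed to hold the weights, as parameter count times bytes per weight. It ignores the KV cache, activations, and quantization metadata, so treat the numbers as lower bounds. The parameter counts come from the descriptions above; the helper name `weight_memory_gb` is illustrative and not part of any library.

```python
# Approximate parameter counts (in billions) from the model descriptions above.
PHI3_PARAMS_B = {
    "Phi-3-mini": 3.8,
    "Phi-3-small": 7.0,
    "Phi-3-medium": 14.0,
    "Phi-3-vision": 4.2,
}

BITS_PER_WEIGHT = {"fp16": 16, "int8": 8, "int4": 4}


def weight_memory_gb(params_billion: float, precision: str) -> float:
    """Lower-bound estimate of weight storage in GB (1 GB = 1e9 bytes)."""
    bytes_per_weight = BITS_PER_WEIGHT[precision] / 8
    # (params_billion * 1e9 weights) * bytes_per_weight / 1e9 bytes per GB
    return params_billion * bytes_per_weight


if __name__ == "__main__":
    for name, params in PHI3_PARAMS_B.items():
        sizes = ", ".join(
            f"{p}: {weight_memory_gb(params, p):.1f} GB" for p in BITS_PER_WEIGHT
        )
        print(f"{name:<13} {sizes}")
```

Under these assumptions, Phi-3-mini needs about 7.6 GB for FP16 weights and about 1.9 GB at INT4, while Phi-3-medium needs about 28 GB at FP16, which is why the smaller model is the more comfortable fit for a laptop-class device.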
An NPU (Neural Processing Unit) is a dedicated processor, or a processing unit on a larger SoC, designed specifically for accelerating neural network operations and AI tasks. Unlike general-purpose CPUs and GPUs, NPUs are optimized for data-driven parallel computing, making them highly efficient at processing massive multimedia data such as videos and images and at processing data for neural networks. They are particularly adept at handling AI-related tasks, such as speech recognition, background blurring in video calls, and photo or video editing processes like object detection.
While many AI and machine learning workloads run on GPUs, there’s a crucial distinction between GPUs and NPUs. GPUs are known for their parallel computing capabilities, but not all GPUs are equally efficient beyond processing graphics. NPUs, on the other hand, are purpose-built for complex computations involved in neural network operations, making them highly effective for AI tasks.
In summary, NPUs are the math whizzes that turbocharge AI computations, and they play a key role in the emerging era of AI PCs!
This example is based on Intel's latest Intel Core Ultra processor.
The Intel® NPU is an AI inference accelerator integrated into Intel client CPUs, starting with the Intel® Core™ Ultra generation of CPUs (formerly known as Meteor Lake). It enables energy-efficient execution of artificial neural network tasks.
Intel NPU Acceleration Library
The Intel NPU Acceleration Library (https://github.com/intel/intel-npu-acceleration-library) is a Python library designed to boost the efficiency of your applications by leveraging the power of the Intel Neural Processing Unit (NPU) to perform high-speed computations on compatible hardware.
Install the Python Library with pip
pip install intel-npu-acceleration-library
Note: The project is still under development, but the reference model is already very complete.
Using Intel NPU acceleration through this library does not change the usual coding workflow. You only need to use the library to quantize the original Phi-3 model to a target precision, such as FP16 or INT4:
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import intel_npu_acceleration_library
import torch

model_id = "microsoft/Phi-3-mini-4k-instruct"
# Load the original Phi-3 model and its tokenizer from Hugging Face.
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", use_cache=True, trust_remote_code=True
).eval()
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Compile the model for the NPU, converting the weights to FP16.
print("Compile model for the NPU")
model = intel_npu_acceleration_library.compile(model, dtype=torch.float16)
After the model has been quantized successfully, continue by calling the NPU to run the Phi-3 model.
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
)

# Sampling is disabled, so decoding is greedy and the output is deterministic.
generation_args = {
    "max_new_tokens": 500,
    "return_full_text": False,
    "temperature": 0.0,
    "do_sample": False,
}

query = "<|system|>You are a helpful AI assistant.<|end|><|user|>Can you introduce yourself?<|end|><|assistant|>"
output = pipe(query, **generation_args)
# Print the reply; a bare expression only displays a value in a notebook.
print(output[0]['generated_text'])
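The query string above follows the Phi-3 chat format: each turn is wrapped as `<|role|>` ... `<|end|>`, and the prompt ends with `<|assistant|>` to cue the reply. A small helper like the one below can build that string from a list of messages. The name `build_phi3_prompt` is hypothetical and not part of any library. It mirrors the single-line layout used in this example; the tokenizer's own chat template may insert newlines between turns, so prefer that template when exact formatting matters.

```python
def build_phi3_prompt(messages):
    """Build a Phi-3 style prompt from [{"role": ..., "content": ...}, ...]."""
    parts = [f"<|{m['role']}|>{m['content']}<|end|>" for m in messages]
    # The trailing assistant tag asks the model to produce the next reply.
    return "".join(parts) + "<|assistant|>"


messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Can you introduce yourself?"},
]
query = build_phi3_prompt(messages)
print(query)
```

Keeping the messages as structured data makes it easier to append later turns to the conversation than editing one long string by hand.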
While the code runs, we can check the status of the NPU in Task Manager.
DirectML is a high-performance, hardware-accelerated DirectX 12 library for machine learning. DirectML provides GPU acceleration for common machine learning tasks across a broad range of supported hardware and drivers, including all DirectX 12-capable GPUs from vendors such as AMD, Intel, NVIDIA, and Qualcomm.
When used standalone, the DirectML API is a low-level DirectX 12 library and is suitable for high-performance, low-latency applications such as frameworks, games, and other real-time applications. The seamless interoperability of DirectML with Direct3D 12 as well as its low overhead and conformance across hardware makes DirectML ideal for accelerating machine learning when both high performance is desired, and the reliability and predictability of results across hardware is critical.
Note: The latest DirectML already supports NPUs (https://devblogs.microsoft.com/directx/introducing-neural-processor-unit-npu-support-in-directml-dev...).
DirectML is a machine learning library developed by Microsoft. It is designed to accelerate machine learning workloads on Windows devices, including desktops, laptops, and edge devices.
CUDA is NVIDIA’s parallel computing platform and programming model. It allows developers to harness the power of NVIDIA GPUs for general-purpose computing, including machine learning and scientific simulations.
The choice between DirectML and CUDA depends on your specific use case, hardware availability, and preferences. If you’re looking for broader compatibility and ease of setup, DirectML might be a good choice. However, if you have NVIDIA GPUs and need highly optimized performance, CUDA remains a strong contender. In summary, both DirectML and CUDA have their strengths and weaknesses, so consider your requirements and available hardware when making a decision.
In the era of AI, the portability of AI models is very important. ONNX Runtime makes it easy to deploy trained models to different devices: developers use a unified API for model inference without having to worry about the underlying inference framework. For generative AI, ONNX Runtime has also added code optimizations (https://onnxruntime.ai/docs/genai/). With the optimized ONNX Runtime, quantized generative AI models can run inference on different devices. Generative AI with ONNX Runtime exposes its inference API through Python, C#, and C/C++. Deployment on iPhone can, of course, use the C++ API of Generative AI with ONNX Runtime.
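One way to picture the "unified API, different devices" idea is through ONNX Runtime's execution providers: the same session code runs on DirectML, CUDA, or the CPU depending on which provider is selected. The sketch below picks a provider from whatever the installed build reports. The preference order and the helper name `pick_provider` are my own assumptions for illustration, not a recommendation from ONNX Runtime.

```python
# Preference order is an assumption for illustration: try DirectML first,
# then CUDA, and fall back to the CPU provider.
PREFERRED_PROVIDERS = [
    "DmlExecutionProvider",
    "CUDAExecutionProvider",
    "CPUExecutionProvider",
]


def pick_provider(available, preferred=PREFERRED_PROVIDERS):
    """Return the first preferred provider that is actually available."""
    for name in preferred:
        if name in available:
            return name
    raise RuntimeError(f"No preferred provider among: {available}")


if __name__ == "__main__":
    try:
        import onnxruntime as ort
        available = ort.get_available_providers()
    except ImportError:
        # Demo fallback when ONNX Runtime is not installed on this machine.
        available = ["CPUExecutionProvider"]
    print("Using:", pick_provider(available))
```

The chosen name can then be passed in the `providers` list when creating an `onnxruntime.InferenceSession`.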
Compile the Generative AI with ONNX Runtime library
winget install --id=Kitware.CMake -e
git clone https://github.com/microsoft/onnxruntime.git
cd .\onnxruntime\
./build.bat --build_shared_lib --skip_tests --parallel --use_dml --config Release
cd ../
git clone https://github.com/microsoft/onnxruntime-genai.git
cd .\onnxruntime-genai\
mkdir ort
cd ort
mkdir include
mkdir lib
cd ..
copy ..\onnxruntime\include\onnxruntime\core\providers\dml\dml_provider_factory.h ort\include
copy ..\onnxruntime\include\onnxruntime\core\session\onnxruntime_c_api.h ort\include
copy ..\onnxruntime\build\Windows\Release\Release\*.dll ort\lib
copy ..\onnxruntime\build\Windows\Release\Release\onnxruntime.lib ort\lib
python build.py --use_dml
Install the library
pip install .\onnxruntime_genai_directml-0.3.0.dev0-cp310-cp310-win_amd64.whl
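With the wheel installed, a generation loop might look like the sketch below. It assumes the 0.3.x Python API that matches the wheel version above (later releases of onnxruntime-genai changed parts of this interface), and the model folder is a placeholder for a Phi-3 ONNX model built for DirectML. The function name `generate_with_genai` is hypothetical.

```python
def generate_with_genai(model_dir, prompt, max_length=512):
    """Generate a reply token by token, assuming the onnxruntime-genai 0.3.x API."""
    # Imported lazily so this module still loads where the wheel is missing.
    import onnxruntime_genai as og

    model = og.Model(model_dir)
    tokenizer = og.Tokenizer(model)
    stream = tokenizer.create_stream()

    params = og.GeneratorParams(model)
    params.set_search_options(max_length=max_length)
    params.input_ids = tokenizer.encode(prompt)

    generator = og.Generator(model, params)
    pieces = []
    while not generator.is_done():
        generator.compute_logits()
        generator.generate_next_token()
        token = generator.get_next_tokens()[0]
        # Decode incrementally so partial text is available during generation.
        pieces.append(stream.decode(token))
    return "".join(pieces)


# Hypothetical usage; the folder name below is a placeholder:
# prompt = "<|user|>Can you introduce yourself?<|end|><|assistant|>"
# print(generate_with_genai("./phi3-mini-4k-instruct-onnx-directml", prompt))
```

Collecting the decoded pieces in a list keeps the function easy to reuse; printing each piece inside the loop instead would give a streaming effect in the console.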
OpenVINO is an open-source toolkit for optimizing and deploying deep learning models. It provides boosted deep learning performance for vision, audio, and language models from popular frameworks like TensorFlow, PyTorch, and more. OpenVINO can also run the Phi-3 model on the CPU and GPU.
Note: At the time of writing, OpenVINO does not support the NPU.
pip install git+https://github.com/huggingface/optimum-intel.git
pip install git+https://github.com/openvinotoolkit/nncf.git
pip install openvino-nightly
As with the NPU, OpenVINO runs generative AI models in quantized form. We first need to quantize the Phi-3 model, which can be done from the command line with optimum-cli:
INT4
optimum-cli export openvino --model "microsoft/Phi-3-mini-4k-instruct" --task text-generation-with-past --weight-format int4 --group-size 128 --ratio 0.6 --sym --trust-remote-code ./openvinomodel/phi3/int4
FP16
optimum-cli export openvino --model "microsoft/Phi-3-mini-4k-instruct" --task text-generation-with-past --weight-format fp16 --trust-remote-code ./openvinomodel/phi3/fp16
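To make the INT4 flags above more concrete: as I understand the optimum-intel options, `--group-size 128` means every 128 weights share one scale, `--sym` selects symmetric quantization (no zero-point offset), and `--ratio 0.6` asks for roughly 60% of the layers in INT4 with the rest kept in INT8. The sketch below illustrates symmetric group-wise 4-bit quantization in plain Python. It is a simplified illustration of the idea, not the exact scheme NNCF implements, and the function names are hypothetical.

```python
def quantize_group_sym_int4(weights):
    """Symmetric 4-bit quantization of one group: returns (codes, scale)."""
    max_abs = max(abs(w) for w in weights)
    # Map the largest magnitude to 7 so all codes fit in the range [-8, 7].
    scale = max_abs / 7 if max_abs > 0 else 1.0
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate weights from integer codes and the group scale."""
    return [c * scale for c in codes]


def quantize_sym_int4(weights, group_size=128):
    """Split a flat weight list into groups, each with its own scale."""
    return [
        quantize_group_sym_int4(weights[start:start + group_size])
        for start in range(0, len(weights), group_size)
    ]


if __name__ == "__main__":
    sample = [0.7, -0.35, 0.1, 0.0]
    codes, scale = quantize_group_sym_int4(sample)
    print("codes:", codes)
    print("restored:", dequantize(codes, scale))
```

Smaller groups track the local weight range more closely and reduce rounding error, at the cost of storing more scales.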
The converted model is written to the output directory given at the end of each command (for example, ./openvinomodel/phi3/int4).
Use OVModelForCausalLM to load the model from its path (model_dir), together with the related configuration (ov_config = {"PERFORMANCE_HINT": "LATENCY", "NUM_STREAMS": "1", "CACHE_DIR": ""}) and the hardware-accelerated device (GPU.0):
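from optimum.intel.openvino import OVModelForCausalLM
from transformers import AutoConfig

model_dir = "./openvinomodel/phi3/int4"  # output directory of the export step
ov_config = {"PERFORMANCE_HINT": "LATENCY", "NUM_STREAMS": "1", "CACHE_DIR": ""}

ov_model = OVModelForCausalLM.from_pretrained(
    model_dir,
    device='GPU.0',
    ov_config=ov_config,
    config=AutoConfig.from_pretrained(model_dir, trust_remote_code=True),
    trust_remote_code=True,
)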
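Once loaded, `ov_model` can be used much like a regular Hugging Face causal language model. The helper below is a sketch of a single generation call, assuming a tokenizer loaded from the same `model_dir`. The function name `generate_reply` and the `max_new_tokens` default are illustrative choices of mine.

```python
def generate_reply(model, tokenizer, prompt, max_new_tokens=128):
    """Generate a reply and return only the newly produced text."""
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Drop the prompt tokens so only the model's continuation is decoded.
    prompt_len = inputs["input_ids"].shape[1]
    return tokenizer.decode(output_ids[0][prompt_len:], skip_special_tokens=True)


# Hypothetical usage with the objects from the loading step:
# from transformers import AutoTokenizer
# tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
# prompt = "<|user|>Can you introduce yourself?<|end|><|assistant|>"
# print(generate_reply(ov_model, tokenizer, prompt))
```

Slicing off the prompt tokens before decoding plays the same role as `return_full_text=False` in the pipeline example earlier.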
While the code runs, we can check the status of the GPU in Task Manager.
Phi-3 technical report https://aka.ms/phi3-tech-report