Running llama.cpp on VisionFive 2 (RISC-V)
A comprehensive tutorial for compiling and installing llama.cpp with OpenBLAS acceleration on the StarFive VisionFive 2 board. This guide uses the latest version of llama.cpp and fixes common compilation errors.
Prerequisites
Hardware: StarFive VisionFive 2 (RISC-V64)
OS: Debian-based system (tested on latest Image69 updates)
(link to the Debian Image69 system image)
What You'll Get
- Full llama.cpp installation with OpenBLAS acceleration
- Significantly improved inference performance
- System-wide installation of llama.cpp tools
- llama-server accessible from anywhere
Step 1: Install Dependencies
First, update your system and install required packages:
sudo apt update
sudo apt-get install -y libopenblas-dev wget g++ git cmake build-essential
What we're installing:
- libopenblas-dev - OpenBLAS library for optimized linear algebra operations
- g++ - C++ compiler (fixes common make errors)
- cmake - Build system required for llama.cpp
- git - For cloning the repository
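If you want a quick sanity check that everything installed correctly (optional; package names as above):
dpkg -l libopenblas-dev g++ cmake git
g++ --version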
Step 2: Clone llama.cpp Repository
Clone the latest version of llama.cpp:
cd ~
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
Step 3: Configure and Compile with CMake
This is the critical step. We need to disable RISC-V vector extensions (RVV) because GCC 11.3.0 doesn't support the modern extensions that llama.cpp tries to use by default.
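If you want to confirm the compiler first (optional; Image69 ships GCC 11.x, which is what triggers the RVV problem described above):
gcc --version
Then clean any previous build and configure: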
rm -rf build
mkdir build
cd build
cmake .. \
-DLLAMA_BUILD_SERVER=ON \
-DGGML_BLAS=ON \
-DGGML_BLAS_VENDOR=OpenBLAS \
-DGGML_NATIVE=OFF \
-DGGML_RVV=OFF \
-DCMAKE_BUILD_TYPE=Release \
-DCMAKE_C_FLAGS="-march=rv64gc -mabi=lp64d" \
-DCMAKE_CXX_FLAGS="-march=rv64gc -mabi=lp64d"
make -j$(nproc)
Configuration explained:
- -DLLAMA_BUILD_SERVER=ON - Builds the llama-server binary
- -DGGML_BLAS=ON - Enables BLAS acceleration
- -DGGML_BLAS_VENDOR=OpenBLAS - Uses OpenBLAS specifically
- -DGGML_NATIVE=OFF - Disables native CPU optimizations
- -DGGML_RVV=OFF - Critical: disables RISC-V Vector extensions (fixes compilation errors)
- -DCMAKE_BUILD_TYPE=Release - Optimized release build
- -march=rv64gc -mabi=lp64d - Standard RISC-V architecture flags
Compilation time: Expect 10-20 minutes depending on your system.
Step 4: Verify Compilation Success
Check that compilation completed successfully:
ls ~/llama.cpp/build/bin/
You should see binaries including:
- llama-server
- llama-cli
- llama-quantize
- And many more tools
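As a quick functional check, most of these binaries accept --version (the path assumes the build tree above):
~/llama.cpp/build/bin/llama-cli --version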
Step 5: Install System-Wide (Optional but Recommended)
Install llama.cpp binaries and libraries system-wide:
cd ~/llama.cpp/build
sudo make install
sudo ldconfig
This installs everything to /usr/local/bin and /usr/local/lib.
Verify installation:
which llama-server
llama-server --version
You should now be able to run llama-server from anywhere without specifying the full path.
Step 6: Download a Model
Download a GGUF format model. For example, a 7B model:
mkdir -p ~/models
cd ~/models
# Example: Download a quantized model
wget https://huggingface.co/soob3123/amoral-gemma3-1B-v2-gguf/resolve/main/amoral-gemma3-1B-v2-Q4_K_M.gguf
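Before moving on, you can confirm the download completed and check its size:
ls -lh ~/models/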
Step 7: Run llama-server
Start the server:
llama-server --host 0.0.0.0 --port 8080 --model ~/models/amoral-gemma3-1B-v2-Q4_K_M.gguf
Access from browser:
Navigate to http://YOUR_VISIONFIVE2_IP:8080
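You can also test it from the command line. llama-server exposes an HTTP API, including a /health endpoint and a /completion endpoint (replace the IP with your board's address):
curl http://YOUR_VISIONFIVE2_IP:8080/health
curl http://YOUR_VISIONFIVE2_IP:8080/completion -H "Content-Type: application/json" -d '{"prompt": "Hello", "n_predict": 16}'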
Performance Comparison
With OpenBLAS enabled (this guide):
- First prompt: ~532 seconds (7B model)
- Subsequent prompts: Faster due to context caching
Without OpenBLAS:
- First prompt: ~1038 seconds (7B model)
- Almost 2x slower!
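If you want to reproduce these timings on your own setup, llama-bench (built alongside llama-server) gives repeatable prompt-processing and text-generation numbers; the model path here is just the one from Step 6:
llama-bench -m ~/models/amoral-gemma3-1B-v2-Q4_K_M.gguf -p 128 -n 32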
Bonus: Create a systemd Service
To run llama-server automatically on boot:
sudo nano /etc/systemd/system/llama-server.service
Add this content (adjust paths as needed):
[Unit]
Description=Llama.cpp Server
After=network.target
[Service]
Type=simple
User=user
WorkingDirectory=/home/user
ExecStart=/usr/local/bin/llama-server --host 0.0.0.0 --port 8080 --model /home/user/models/amoral-gemma3-1B-v2-Q4_K_M.gguf
Restart=on-failure
RestartSec=5
[Install]
WantedBy=multi-user.target
Enable and start the service:
sudo systemctl daemon-reload
sudo systemctl enable llama-server
sudo systemctl start llama-server
sudo systemctl status llama-server
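To follow the server logs in real time:
sudo journalctl -u llama-server -f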
Troubleshooting
Error: "unexpected ISA string at end: 'v_zvfh_zicbop'"
Solution: Your GCC version doesn't support modern RISC-V vector extensions. Make sure you include -DGGML_RVV=OFF in your cmake command (this guide already includes it).
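You can confirm the flag was picked up by inspecting the CMake cache (assuming the build directory from Step 3); the option should show as OFF:
grep GGML_RVV ~/llama.cpp/build/CMakeCache.txt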
Error: "g++: command not found"
Solution:
sudo apt-get install g++
OpenBLAS not being used
Verify in cmake output:
-- Found BLAS: /usr/lib/riscv64-linux-gnu/libopenblas.so
-- Including BLAS backend
If you don't see this, reinstall OpenBLAS:
sudo apt-get install --reinstall libopenblas-dev
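You can also double-check that the installed server binary actually links against OpenBLAS (library names can vary depending on whether ggml was built as shared libraries):
ldd $(which llama-server) | grep -i blas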
Additional Tools Included
After installation, you'll have access to many useful tools:
- llama-cli - Command-line chat interface
- llama-quantize - Model quantization tool
- llama-bench - Benchmarking utility
- llama-server - Web server with API
- And 40+ other utilities
Tips for Best Performance
- Use quantized models: Q4_K_M or Q5_K_M offer good balance of quality and speed
- Adjust context size: use --ctx-size 2048 for faster inference with smaller contexts
- Monitor resources: use htop to check CPU usage
- Use appropriate batch sizes: --batch-size 512 can help with throughput (see the example command after this list)
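Combining these options, a typical invocation might look like this (using the model from Step 6; adjust the path to your setup):
llama-server --host 0.0.0.0 --port 8080 --model ~/models/amoral-gemma3-1B-v2-Q4_K_M.gguf --ctx-size 2048 --batch-size 512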
Recommended Models for VisionFive 2
- 7B models with Q4 quantization - Best balance
- 3B models - Very responsive
- 13B models with Q2/Q3 quantization - Possible but slow
Avoid unquantized or large models (>13B) as they will be extremely slow.
Credits
Based on original work by the community and improved with modern cmake build system. Thanks to the llama.cpp team and the RISC-V community for making this possible.
Additional Resources
Last updated: November 2025
Tested on: VisionFive 2, Debian-based Image69, GCC 11.3.0
llama.cpp version: Latest main branch
This post was written with the help of Claude, but its content was created by the owner of this blog; Claude only helped make the writing clearer and more detailed.
