Running llama.cpp on VisionFive 2 (RISC-V)

A comprehensive tutorial for compiling and installing llama.cpp with OpenBLAS acceleration on the StarFive VisionFive 2 board. This guide uses the latest version of llama.cpp and fixes common compilation errors.

Prerequisites

Hardware: StarFive VisionFive 2 (RISC-V64)
OS: Debian-based system (tested on latest Image69 updates)

(link to debian system image69)

What You'll Get

  • Full llama.cpp installation with OpenBLAS acceleration
  • Significantly improved inference performance
  • System-wide installation of llama.cpp tools
  • llama-server accessible from anywhere

Step 1: Install Dependencies

First, update your system and install required packages:

sudo apt update
sudo apt-get install -y libopenblas-dev wget g++ git cmake build-essential

What we're installing:

  • libopenblas-dev - OpenBLAS library for optimized linear algebra operations
  • g++ - C++ compiler (fixes common make errors)
  • cmake - Build system required for llama.cpp
  • git - For cloning the repository
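
Before moving on, you can optionally confirm the toolchain is present. The exact versions depend on your image; the stock VisionFive 2 Debian image ships GCC 11.3.0:

g++ --version | head -n 1
cmake --version | head -n 1
dpkg -s libopenblas-dev | grep -i '^version'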

Step 2: Clone llama.cpp Repository

Clone the latest version of llama.cpp:

cd ~
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp

Step 3: Configure and Compile with CMake

This is the critical step. We need to disable RISC-V vector extensions (RVV) because GCC 11.3.0 doesn't support the modern extensions that llama.cpp tries to use by default.
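
Optionally, you can verify this yourself before configuring: check the compiler version and the ISA string the kernel reports for the JH7110 cores (the U74 cores do not implement the vector extension, and GCC 11.3.0 does not accept the newer extension names, which is why RVV must stay off):

gcc --version | head -n 1
grep -m1 isa /proc/cpuinfo

Then configure and build: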

rm -rf build
mkdir build
cd build
cmake .. \
  -DLLAMA_BUILD_SERVER=ON \
  -DGGML_BLAS=ON \
  -DGGML_BLAS_VENDOR=OpenBLAS \
  -DGGML_NATIVE=OFF \
  -DGGML_RVV=OFF \
  -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_C_FLAGS="-march=rv64gc -mabi=lp64d" \
  -DCMAKE_CXX_FLAGS="-march=rv64gc -mabi=lp64d"
make -j$(nproc)

Configuration explained:

  • -DLLAMA_BUILD_SERVER=ON - Builds the llama-server binary
  • -DGGML_BLAS=ON - Enables BLAS acceleration
  • -DGGML_BLAS_VENDOR=OpenBLAS - Uses OpenBLAS specifically
  • -DGGML_NATIVE=OFF - Disables native CPU optimizations
  • -DGGML_RVV=OFF - Critical: Disables RISC-V Vector extensions (fixes compilation errors)
  • -DCMAKE_BUILD_TYPE=Release - Optimized release build
  • -march=rv64gc -mabi=lp64d - Standard RISC-V architecture flags

Compilation time: Expect 10-20 minutes depending on your system.

Step 4: Verify Compilation Success

Check that compilation completed successfully:

ls ~/llama.cpp/build/bin/

You should see binaries including:

  • llama-server
  • llama-cli
  • llama-quantize
  • And many more tools
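
You can also run one of them straight from the build tree to confirm it executes; it should print the build number and commit hash:

~/llama.cpp/build/bin/llama-cli --version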

Step 5: Install System-Wide (Optional but Recommended)

Install llama.cpp binaries and libraries system-wide:

cd ~/llama.cpp/build
sudo make install
sudo ldconfig

This installs everything to /usr/local/bin and /usr/local/lib.

Verify installation:

which llama-server
llama-server --version

You should now be able to run llama-server from anywhere without specifying the full path.

Step 6: Download a Model

Download a model in GGUF format. For example, a small quantized 1B model:

mkdir -p ~/models
cd ~/models
# Example: Download a quantized model
wget https://huggingface.co/soob3123/amoral-gemma3-1B-v2-gguf/resolve/main/amoral-gemma3-1B-v2-Q4_K_M.gguf

Step 7: Run llama-server

Start the server:

llama-server --host 0.0.0.0 --port 8080 --model ~/models/amoral-gemma3-1B-v2-Q4_K_M.gguf

Access from browser:
Navigate to http://YOUR_VISIONFIVE2_IP:8080
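
The server also exposes an HTTP API, so you can test it without a browser. A minimal request against the /completion endpoint (field names per the upstream llama-server documentation; adjust the IP to your board) looks like this:

curl http://YOUR_VISIONFIVE2_IP:8080/completion \
  -H "Content-Type: application/json" \
  -d '{"prompt": "The VisionFive 2 is", "n_predict": 32}'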

Performance Comparison

With OpenBLAS enabled (this guide):

  • First prompt: ~532 seconds (7B model)
  • Subsequent prompts: Faster due to context caching

Without OpenBLAS:

  • First prompt: ~1038 seconds (7B model)
  • Almost 2x slower!
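
Your numbers will differ with model size and quantization. To measure your own build, the bundled llama-bench tool is the easiest option; a short run might look like this (the prompt/generation lengths are arbitrary examples, and -t 4 matches the four JH7110 cores):

llama-bench -m ~/models/amoral-gemma3-1B-v2-Q4_K_M.gguf -p 64 -n 32 -t 4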

Bonus: Create a systemd Service

To run llama-server automatically on boot:

sudo nano /etc/systemd/system/llama-server.service

Add this content (adjust paths as needed):

[Unit]
Description=Llama.cpp Server
After=network.target

[Service]
Type=simple
User=user
WorkingDirectory=/home/user
ExecStart=/usr/local/bin/llama-server --host 0.0.0.0 --port 8080 --model /home/user/models/amoral-gemma3-1B-v2-Q4_K_M.gguf
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target

Enable and start the service:

sudo systemctl daemon-reload
sudo systemctl enable llama-server
sudo systemctl start llama-server
sudo systemctl status llama-server
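
If the service does not come up, the journal usually tells you why (a wrong model path or an already-used port are the typical causes):

journalctl -u llama-server -f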

Troubleshooting

Error: "unexpected ISA string at end: 'v_zvfh_zicbop'"

Solution: Your GCC version doesn't support modern RISC-V vector extensions. Make sure you include -DGGML_RVV=OFF in your cmake command (this guide already includes it).

Error: "g++: command not found"

Solution:

sudo apt-get install g++

OpenBLAS not being used

Verify in cmake output:

-- Found BLAS: /usr/lib/riscv64-linux-gnu/libopenblas.so
-- Including BLAS backend

If you don't see this, reinstall OpenBLAS:

sudo apt-get install --reinstall libopenblas-dev
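
You can also check how the installed binary resolves its libraries. Depending on the llama.cpp version, OpenBLAS may show up directly or through a separate ggml BLAS backend library, so grep loosely:

ldd /usr/local/bin/llama-server | grep -iE 'blas|ggml'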

Additional Tools Included

After installation, you'll have access to many useful tools:

  • llama-cli - Command-line chat interface
  • llama-quantize - Model quantization tool (see the example after this list)
  • llama-bench - Benchmarking utility
  • llama-server - Web server with API
  • And 40+ other utilities
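
For instance, llama-quantize can re-quantize a full-precision GGUF file. The filenames below are placeholders for your own model, not files from this guide:

# both filenames are placeholders; start from an f16/f32 GGUF export of your model
llama-quantize ~/models/my-model-f16.gguf ~/models/my-model-Q4_K_M.gguf Q4_K_M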

Tips for Best Performance

  1. Use quantized models: Q4_K_M or Q5_K_M offer good balance of quality and speed
  2. Adjust context size: Use --ctx-size 2048 for faster inference with smaller contexts
  3. Monitor resources: Use htop to check CPU usage
  4. Use appropriate batch sizes: --batch-size 512 can help with throughput (see the combined example below)
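
Putting tips 2 and 4 together, a quick interactive test with llama-cli could look like this (the prompt and token count are just examples):

llama-cli -m ~/models/amoral-gemma3-1B-v2-Q4_K_M.gguf --ctx-size 2048 --batch-size 512 -n 64 -p "Explain RISC-V in one sentence."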

Recommended Models for VisionFive 2

  • 7B models with Q4 quantization - Best balance
  • 3B models - Very responsive
  • 13B models with Q2/Q3 quantization - Possible but slow

Avoid unquantized or large models (>13B) as they will be extremely slow.

Credits

Based on original work by the community and updated for the modern CMake build system. Thanks to the llama.cpp team and the RISC-V community for making this possible.

Last updated: November 2025
Tested on: VisionFive 2, Debian-based Image69, GCC 11.3.0
llama.cpp version: Latest main branch

This blog was written with the help of Claude, but its content was created by the blog's owner; Claude only helped make it better written and more detailed.
