# llama.cpp Setup Guide
## Overview
llama.cpp is a C/C++ implementation of LLaMA-family model inference, optimized for efficient CPU execution. This guide walks through building llama.cpp and running GGUF models with it.
## Installation
### Prerequisites
- C++ compiler (GCC, Clang, or MSVC)
- CMake
- Git
- Python 3.8+ (for scripts)
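To confirm the prerequisites are installed, you can check each tool from a shell (exact versions will vary by system):
```bash
# Any recent C++ compiler is fine; check whichever one you have
gcc --version || clang --version
cmake --version
git --version
python3 --version   # 3.8+ is only needed for the conversion/helper scripts
```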
### Steps
- Clone the repository:
```bash
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
```
- Build the project:
```bash
cmake -B build
cmake --build build --config Release
```
- Download a GGUF model file (e.g., from Hugging Face)
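As one way to do the download step, the `huggingface-cli` tool (from the `huggingface_hub` Python package) can fetch a GGUF file directly; the repository and file names below are placeholders, so substitute the model you actually want:
```bash
pip install huggingface_hub
# Download a single GGUF file into the local models/ directory (names are examples only)
huggingface-cli download TheBloke/Llama-2-7B-GGUF llama-2-7b.Q4_K_M.gguf --local-dir models
```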
## Usage
### Basic Model Running
- Place your GGUF model file in the models directory
- Run the model:
```bash
./build/bin/main -m models/your-model.gguf
```
(Recent llama.cpp builds name this binary `llama-cli` instead of `main`; use whichever is present in `build/bin/`.)
### Command Line Options
- `-m`: Specify model path
- `-n`: Number of tokens to generate
- `-t`: Number of threads to use
- `--ctx-size`: Context window size
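For example, these options can be combined in a single run (the model path, prompt, and values are placeholders to adapt):
```bash
# Generate up to 256 tokens with 8 threads and a 4096-token context window
./build/bin/main -m models/your-model.gguf -p "Summarize what a GGUF file is." -n 256 -t 8 --ctx-size 4096
```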
## Optimization Tips
### CPU Optimization
- Use `-t` to set an optimal thread count (usually the number of physical CPU cores)
- Enable AVX2/AVX512 during compilation if supported (see the configure sketch after this list)
- Adjust batch size with `--batch-size`
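As a sketch, instruction-set support can be requested explicitly at configure time. The option names differ between llama.cpp versions (newer trees use `GGML_`-prefixed CMake options, older ones `LLAMA_`-prefixed), so treat the flags below as assumptions and check them against your checkout with `cmake -B build -LH`:
```bash
# Assumed option names for a recent checkout; verify before relying on them
cmake -B build -DGGML_AVX2=ON -DGGML_AVX512=ON
cmake --build build --config Release
```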
### Memory Management
- Use `--ctx-size` to adjust the context window
- Monitor RAM usage and adjust accordingly
- Consider using memory mapping for large models (see the example after this list)
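llama.cpp memory-maps GGUF files by default; the flags below control that behavior (worth confirming against `--help` for your build):
```bash
# Lock the mapped model in RAM so it is not swapped out (requires enough free memory)
./build/bin/main -m models/your-model.gguf --mlock
# Disable memory mapping and load the whole model into RAM up front instead
./build/bin/main -m models/your-model.gguf --no-mmap
```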
## Common Issues and Solutions
### Memory Errors
- Reduce context size
- Use smaller model variants
- Enable memory mapping
### Performance Issues
- Optimize thread count
- Enable hardware-specific optimizations
- Use appropriate batch sizes (a starting point is sketched below)
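As an illustrative starting point rather than a tuned recommendation, combining these adjustments might look like:
```bash
# Threads roughly equal to physical cores; batch size trades RAM for prompt-processing speed
./build/bin/main -m models/your-model.gguf -t 8 --batch-size 512 --ctx-size 2048
```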