llama.cpp Setup Guide

Overview

llama.cpp is a C/C++ implementation of LLM inference, originally built around Meta's LLaMA models and optimized for efficient CPU execution. This guide walks through building llama.cpp and running GGUF models with it.

Installation

Prerequisites

  • C++ compiler (GCC, Clang, or MSVC)
  • CMake
  • Git
  • Python 3.8+ (for scripts)
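
To confirm the prerequisites are in place, each of these should print a version string:

```bash
# Any one working C++ compiler is enough; "command not found" means it is missing
g++ --version || clang++ --version
cmake --version
git --version
python3 --version
```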

Steps

  1. Clone the repository:

```bash
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
```

  2. Build the project:

```bash
cmake -B build
cmake --build build --config Release
```

  3. Download a GGUF model file (e.g., from Hugging Face); one way to do this is shown below
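
One convenient way to do step 3 is the Hugging Face CLI (installed via the huggingface_hub Python package). The repository and file name below are only an example; substitute any GGUF model you prefer:

```bash
pip install huggingface_hub

# Example only: fetch a 4-bit quantized 7B model into the local models/ directory
huggingface-cli download TheBloke/Llama-2-7B-GGUF llama-2-7b.Q4_K_M.gguf \
  --local-dir models
```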

Usage

Basic Model Running

  1. Place your GGUF model file in the models directory
  2. Run the model:

```bash
./build/bin/llama-cli -m models/your-model.gguf
```

(Older versions of llama.cpp named this binary main rather than llama-cli.)

Command Line Options

  • -m: Path to the GGUF model file
  • -n: Number of tokens to generate
  • -t: Number of CPU threads to use
  • -c, --ctx-size: Context window size in tokens
  • -p: Prompt text to start generation from
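
Putting the options together, a typical one-shot invocation might look like this (model path, prompt, and sizes are placeholders to adapt):

```bash
# Generate up to 256 tokens from a prompt with 8 threads and a 4096-token context
./build/bin/llama-cli -m models/your-model.gguf \
  -p "Explain memory mapping in one paragraph." \
  -n 256 -t 8 --ctx-size 4096
```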

Optimization Tips

CPU Optimization

  • Use -t to set the thread count (usually the number of physical CPU cores, not hyperthreads)
  • AVX2/AVX-512 support is detected automatically when you compile on the target machine; the system_info line printed at startup shows which extensions the build actually uses
  • Adjust batch size with --batch-size; a benchmarking sketch follows this list
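
Rather than guessing, you can measure: the build also produces a llama-bench tool alongside llama-cli that reports prompt-processing and generation speed. As a sketch, it accepts a comma-separated list of thread counts to compare (the model path is a placeholder):

```bash
# Benchmark the same model at 4, 8, and 16 threads and compare tokens/sec
./build/bin/llama-bench -m models/your-model.gguf -t 4,8,16
```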

Memory Management

  • Use --ctx-size to shrink or grow the context window; the KV cache scales with it, so a smaller context saves RAM
  • Monitor RAM usage and adjust accordingly
  • Memory mapping (mmap) is enabled by default, so large models are paged from disk on demand; see the flags demonstrated below
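
For example, --mlock pins the weights in RAM for steadier latency (at the cost of requiring that much free memory), while --no-mmap falls back to ordinary file reads:

```bash
# Pin model weights in RAM so the OS cannot page them out
# (on Linux this may require raising the memlock ulimit)
./build/bin/llama-cli -m models/your-model.gguf --mlock

# Disable memory mapping and load the model with regular reads instead
./build/bin/llama-cli -m models/your-model.gguf --no-mmap
```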

Common Issues and Solutions

Memory Errors

  • Reduce the context size (-c)
  • Use a smaller model variant or a more aggressive quantization (e.g., Q4_K_M instead of Q8_0)
  • Keep memory mapping enabled (the default) so weights are paged from disk rather than loaded fully into RAM; a minimal low-memory invocation follows this list
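
A minimal low-memory invocation combining these ideas (file name and sizes are placeholders):

```bash
# A 4-bit quant plus a 2048-token context keeps the memory footprint small
./build/bin/llama-cli -m models/your-model.Q4_K_M.gguf -c 2048 -n 128 \
  -p "Hello"
```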

Performance Issues

  • Tune the thread count (start at the number of physical cores and measure)
  • Confirm the build uses your CPU's SIMD extensions; a quick check is shown below
  • Tune --batch-size for your workload
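
On Linux, a quick way to see which SIMD extensions your CPU advertises, to compare against the system_info line llama.cpp prints at startup:

```bash
# List AVX-family feature flags reported by the CPU (Linux only)
grep -o 'avx[0-9a-z_]*' /proc/cpuinfo | sort -u
```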

Additional Resources

  • llama.cpp repository and documentation: https://github.com/ggerganov/llama.cpp
  • GGUF models on Hugging Face: https://huggingface.co/models?library=gguf

Built to share knowledge <3