Master the Power of Meta's 7 Billion Parameter Model on TANGNET
Llama 2 7B is Meta's powerful open-source language model with 7 billion parameters, trained on 2 trillion tokens. It's the sweet spot between capability and efficiency, perfect for running on your Raspberry Pi 5 with 16GB RAM.
Choose the right quantization for your hardware and quality needs:
Quantization | Size (GB) | RAM Required | Quality | Use Case
---|---|---|---|---
Q2_K | 2.83 | ~3.5 GB | ⭐⭐ | Emergency/testing only
Q3_K_S | 2.95 | ~3.5 GB | ⭐⭐⭐ | Low-memory systems
Q3_K_M | 3.30 | ~4 GB | ⭐⭐⭐ | Balanced memory/quality
Q4_K_M ✓ | 4.08 | ~5 GB | ⭐⭐⭐⭐ | Recommended for Pi 5
Q5_K_M | 4.78 | ~6 GB | ⭐⭐⭐⭐⭐ | High quality, more RAM
Q6_K | 5.53 | ~7 GB | ⭐⭐⭐⭐⭐ | Maximum quality
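Whichever quantization you pick, the download pattern is the same; here is a minimal sketch, assuming the files come from TheBloke's Llama-2-7B-GGUF repository on Hugging Face (the same source used in the troubleshooting section below):

# Download the recommended Q4_K_M file into llama.cpp's models directory
cd ~/llama.cpp/models
wget https://huggingface.co/TheBloke/Llama-2-7B-GGUF/resolve/main/llama-2-7b.Q4_K_M.gguf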
ask7b "What is the meaning of life?"
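The ask7b helper isn't defined in this section; if it isn't already set up on Node 02, a minimal sketch of what it might look like as a shell function in ~/.bashrc, assuming the paths and defaults used in the commands below:

# Hypothetical ask7b wrapper around llama-cli with the Q4_K_M model
ask7b() {
    ~/llama.cpp/build/bin/llama-cli \
        -m ~/llama.cpp/models/llama-2-7b.Q4_K_M.gguf \
        -t 8 -n 256 -c 2048 --color \
        -p "$1"
}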
cd ~/llama.cpp/build
./bin/llama-cli -m ../models/llama-2-7b.Q4_K_M.gguf \
-t 8 -n 128 -c 2048 --color \
-p "Explain quantum computing in simple terms"
cd ~/llama.cpp/build
./bin/llama-cli -m ../models/llama-2-7b.Q4_K_M.gguf \
-t 8 -c 2048 --color -i \
--reverse-prompt "User:" \
--in-prefix " " \
--in-suffix "Assistant:"
./bin/llama-cli -m ../models/llama-2-7b.Q4_K_M.gguf \
-t 8 -n 512 -c 2048 --color \
--temp 0.8 --top-k 40 --top-p 0.9 \
--repeat-penalty 1.1 \
-p "Write a short story about a robot learning to paint:"
./bin/llama-cli -m ../models/llama-2-7b.Q4_K_M.gguf \
-t 8 -n 256 -c 2048 --color \
--temp 0.3 --top-k 10 \
-p "List the steps to configure a Raspberry Pi cluster:"
./bin/llama-cli -m ../models/llama-2-7b.Q4_K_M.gguf \
-t 4 -n 64 -c 1024 --color \
--batch-size 256 \
-p "Summarize this in one sentence:"
Key llama-cli parameters:

Flag | Parameter | Range / Default | Notes
---|---|---|---
-t | Number of CPU threads to use | Default: 4, Pi 5: 8 | Use all 8 cores on the Pi 5 for best performance
-n | Max tokens to generate | Range: 1-2048, Default: 128 | Higher = longer responses, more time
-c | Context window size | Max: 4096, Safe: 2048 | Reduce if running out of memory
--temp | Temperature (creativity) | Range: 0.0-2.0, Default: 0.8 | Lower = focused, higher = creative
--top-k | Top-K sampling | Range: 1-100, Default: 40 | Limits word choices for coherence
--top-p | Nucleus sampling | Range: 0.0-1.0, Default: 0.9 | Cumulative probability cutoff
--repeat-penalty | Repetition penalty | Range: 0.0-2.0, Default: 1.1 | Prevents repetitive output
--batch-size | Batch size for prompt processing | Range: 1-512, Default: 512 | Lower = less memory usage
--mlock | Lock model in RAM | Flag: present/absent | Prevents swapping, faster inference
./bin/llama-cli -m ../models/llama-2-7b.Q4_K_M.gguf \
-t 8 -n 256 -c 2048 --color \
-p "[INST] <<SYS>>
You are a helpful AI assistant. Be concise and accurate.
<</SYS>>

User question here [/INST]"
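To avoid retyping the template, a small helper can wrap any question in the same [INST]/<<SYS>> format. This is a sketch only; the chat7b name is hypothetical, and the paths and defaults are taken from the commands above:

# Hypothetical chat7b helper that applies the Llama 2 chat template
chat7b() {
    local sys="You are a helpful AI assistant. Be concise and accurate."
    ~/llama.cpp/build/bin/llama-cli \
        -m ~/llama.cpp/models/llama-2-7b.Q4_K_M.gguf \
        -t 8 -n 256 -c 2048 --color \
        -p "[INST] <<SYS>>
$sys
<</SYS>>

$1 [/INST]"
}

chat7b "What makes the Raspberry Pi 5 good for local LLMs?"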
# Check available memory
free -h
# Run with memory lock
./bin/llama-cli -m ../models/llama-2-7b.Q4_K_M.gguf \
--mlock -t 8 -c 1024 -n 128 \
-p "Your prompt"
# Create prompts file
echo "Explain AI" > prompts.txt
echo "What is quantum computing?" >> prompts.txt
echo "How do neural networks work?" >> prompts.txt
# Process in batch
while IFS= read -r prompt; do
    ./bin/llama-cli -m ../models/llama-2-7b.Q4_K_M.gguf \
        -t 8 -n 128 -c 1024 --color -p "$prompt" \
        >> responses.txt
    echo "---" >> responses.txt
done < prompts.txt
# Stream response to file
./bin/llama-cli -m ../models/llama-2-7b.Q4_K_M.gguf \
-t 8 -n 512 -c 2048 \
-p "Write a detailed guide about Raspberry Pi clusters" \
2>&1 | tee output.txt
# From another node
ssh brand@192.168.1.43 "ask7b 'What is the weather like?'"
# With proper escaping for complex prompts
ssh brand@192.168.1.43 'ask7b "Explain the concept of \"distributed computing\" in simple terms"'
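Rather than hand-escaping quotes, bash's printf %q can quote an arbitrary prompt so it survives the remote shell's parsing. A minimal sketch (the remote_ask7b name is hypothetical):

# Hypothetical wrapper that safely quotes any prompt before sending it over SSH
remote_ask7b() {
    local quoted
    printf -v quoted '%q' "$1"
    ssh brand@192.168.1.43 "ask7b $quoted"
}

remote_ask7b 'Explain the concept of "distributed computing" in simple terms'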
# Start llama.cpp server
cd ~/llama.cpp/build
./bin/llama-server -m ../models/llama-2-7b.Q4_K_M.gguf \
-t 8 -c 2048 --host 0.0.0.0 --port 8080
# Query from another machine
curl -X POST http://192.168.1.43:8080/completion \
-H "Content-Type: application/json" \
-d '{
"prompt": "Hello, how are you?",
"n_predict": 128,
"temperature": 0.7
}'
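The /completion endpoint returns JSON, so piping through jq makes the generated text easier to use in scripts. This assumes jq is installed and that the response exposes the generated text in a "content" field, as recent llama.cpp server builds do:

# Extract just the generated text from the server response
curl -s -X POST http://192.168.1.43:8080/completion \
    -H "Content-Type: application/json" \
    -d '{"prompt": "Hello, how are you?", "n_predict": 128, "temperature": 0.7}' \
    | jq -r '.content'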
./bin/llama-cli -m ../models/llama-2-7b.Q4_K_M.gguf \
-t 8 -n 256 -c 2048 --temp 0.3 --top-k 10 \
-p "[INST] You are a senior software engineer. Explain microservices architecture with examples. [/INST]"
./bin/llama-cli -m ../models/llama-2-7b.Q4_K_M.gguf \
-t 8 -n 512 -c 2048 --temp 0.9 --top-p 0.95 \
-p "[INST] You are a creative storyteller. Write a short sci-fi story about AI consciousness. [/INST]"
./bin/llama-cli -m ../models/llama-2-7b.Q4_K_M.gguf \
-t 8 -n 256 -c 2048 --temp 0.2 \
-p "[INST] Write a Python function to connect to multiple Raspberry Pis via SSH and run commands. [/INST]"
#!/bin/bash
# Save as llama-monitor.sh; logs each query with the SoC temperature and timing
LOG_FILE="/home/brand/llama_usage.log"
cd ~/llama.cpp/build || exit 1
TEMP=$(vcgencmd measure_temp | cut -d'=' -f2)
echo "=== Llama 2 7B Query ===" >> $LOG_FILE
echo "Timestamp: $(date)" >> $LOG_FILE
echo "Temperature: $TEMP" >> $LOG_FILE
echo "Prompt: $1" >> $LOG_FILE
time ./bin/llama-cli -m ../models/llama-2-7b.Q4_K_M.gguf \
-t 8 -n 128 -c 2048 --color -p "$1" \
2>&1 | tee -a $LOG_FILE
echo "=== End Query ===" >> $LOG_FILE
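Make the script executable once, then pass the prompt as a single quoted argument:

chmod +x llama-monitor.sh
./llama-monitor.sh "Explain how GGUF quantization works"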
#!/bin/bash
# Generate daily AI insights
TOPICS=("AI trends" "Raspberry Pi projects" "Distributed computing")
OUTPUT_FILE="/home/brand/daily_insights_$(date +%Y%m%d).txt"
cd ~/llama.cpp/build || exit 1
echo "# Daily AI Insights - $(date)" > "$OUTPUT_FILE"
for topic in "${TOPICS[@]}"; do
    echo -e "\n## $topic\n" >> "$OUTPUT_FILE"
    ./bin/llama-cli -m ../models/llama-2-7b.Q4_K_M.gguf \
        -t 8 -n 256 -c 2048 --temp 0.7 \
        -p "Provide a brief update on recent developments in $topic:" \
        >> "$OUTPUT_FILE"
    echo -e "\n---\n" >> "$OUTPUT_FILE"
done
echo "Daily insights saved to: $OUTPUT_FILE"
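To generate the report automatically, the script can be scheduled with cron. A sketch, assuming it is saved as /home/brand/daily-insights.sh (the filename is not specified above):

# crontab -e, then add: run every day at 06:00 and log any output or errors
0 6 * * * /home/brand/daily-insights.sh >> /home/brand/daily_insights_cron.log 2>&1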
# Watch the CPU temperature in real time
watch -n 1 vcgencmd measure_temp
# Monitor CPU and memory usage
htop
# Reduce context and batch size
./bin/llama-cli -m ../models/llama-2-7b.Q4_K_M.gguf \
-t 8 -n 64 -c 512 --batch-size 128 \
-p "Your prompt"
# Or use smaller quantization
wget https://huggingface.co/TheBloke/Llama-2-7B-GGUF/resolve/main/llama-2-7b.Q3_K_M.gguf
# Check CPU throttling
vcgencmd get_throttled
# Ensure using all cores
nproc # Should show 8 for Pi 5
# Run with optimal settings (a negative niceness usually requires sudo)
nice -n -10 ./bin/llama-cli -m ../models/llama-2-7b.Q4_K_M.gguf \
-t 8 --mlock -n 128 -c 1024 \
-p "Your prompt"
# Verify model integrity
md5sum ../models/llama-2-7b.Q4_K_M.gguf
# Check file permissions
ls -la ../models/
# Test with minimal parameters
./bin/llama-cli -m ../models/llama-2-7b.Q4_K_M.gguf \
-p "Test"
Use Node 01 (TinyLlama) for quick queries, Node 02 (Llama 2 7B) for complex tasks:
#!/bin/bash
# Smart query router
QUERY="$1"
WORD_COUNT=$(echo "$QUERY" | wc -w)
if [ $WORD_COUNT -lt 10 ]; then
    echo "Routing to TinyLlama (fast)..."
    ssh brand@192.168.1.31 "tangnet '$QUERY'"
else
    echo "Routing to Llama 2 7B (detailed)..."
    ssh brand@192.168.1.43 "ask7b '$QUERY'"
fi
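Saved as, say, route-query.sh (a hypothetical name), the router is called the same way regardless of which node answers:

chmod +x route-query.sh
./route-query.sh "What is TANGNET?"
./route-query.sh "Compare three strategies for distributing inference across a Raspberry Pi cluster"

The first query is under 10 words and goes to TinyLlama; the second is routed to Llama 2 7B.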
# Query both models simultaneously
(ssh brand@192.168.1.31 "tangnet '$1'" > tiny_response.txt) &
(ssh brand@192.168.1.43 "ask7b '$1'" > llama_response.txt) &
wait
echo "=== TinyLlama Response ==="
cat tiny_response.txt
echo -e "\n=== Llama 2 7B Response ==="
cat llama_response.txt
Get answers from multiple models and compare:
# Future implementation with more nodes
# Query 3+ models and find consensus
# Perfect for fact-checking and validation
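Until more nodes come online, the same idea can be prototyped with the two existing models: fan the question out to every node in parallel, collect the answers, and compare them side by side. A minimal sketch, assuming the node addresses and helpers above; a third node would just be one more entry in the list:

#!/bin/bash
# Hypothetical consensus prototype: send one question to every node, then review all answers
QUERY="$1"
NODES=("brand@192.168.1.31 tangnet" "brand@192.168.1.43 ask7b")
i=0
for node in "${NODES[@]}"; do
    host=${node% *}   # everything before the space: user@ip
    cmd=${node#* }    # everything after the space: helper command on that node
    ssh "$host" "$cmd '$QUERY'" > "response_$i.txt" &
    i=$((i + 1))
done
wait
cat response_*.txt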