Master the Power of Meta's 7 Billion Parameter Model on TANGNET
Llama 2 7B is Meta's powerful open-source language model with 7 billion parameters, trained on 2 trillion tokens. It's the sweet spot between capability and efficiency, perfect for running on your Raspberry Pi 5 with 16GB RAM.
Choose the right quantization for your hardware and quality needs:
Quantization | Size (GB) | RAM Required | Quality | Use Case
---|---|---|---|---
Q2_K | 2.83 | ~3.5 GB | ⭐⭐ | Emergency/testing only
Q3_K_S | 2.95 | ~3.5 GB | ⭐⭐⭐ | Low-memory systems
Q3_K_M | 3.30 | ~4 GB | ⭐⭐⭐ | Balanced memory/quality
Q4_K_M ✓ | 4.08 | ~5 GB | ⭐⭐⭐⭐ | Recommended for Pi 5
Q5_K_M | 4.78 | ~6 GB | ⭐⭐⭐⭐⭐ | High quality, more RAM
Q6_K | 5.53 | ~7 GB | ⭐⭐⭐⭐⭐ | Maximum quality
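Whichever quantization you pick, the download pattern is the same; here is a minimal sketch, assuming the files come from TheBloke's Llama-2-7B-GGUF repository on Hugging Face (the same source used in the troubleshooting section below):

# Download the recommended Q4_K_M file into llama.cpp's models directory
cd ~/llama.cpp/models
wget https://huggingface.co/TheBloke/Llama-2-7B-GGUF/resolve/main/llama-2-7b.Q4_K_M.gguf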
ask7b "What is the meaning of life?"
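The ask7b helper isn't defined in this section; if it isn't already set up on Node 02, a minimal sketch of what it might look like as a shell function in ~/.bashrc, assuming the paths and defaults used in the commands below:

# Hypothetical ask7b wrapper around llama-cli with the Q4_K_M model
ask7b() {
    ~/llama.cpp/build/bin/llama-cli \
        -m ~/llama.cpp/models/llama-2-7b.Q4_K_M.gguf \
        -t 8 -n 256 -c 2048 --color \
        -p "$1"
}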
cd ~/llama.cpp/build
./bin/llama-cli -m ../models/llama-2-7b.Q4_K_M.gguf \
-t 8 -n 128 -c 2048 --color \
-p "Explain quantum computing in simple terms"
cd ~/llama.cpp/build
./bin/llama-cli -m ../models/llama-2-7b.Q4_K_M.gguf \
-t 8 -c 2048 --color -i \
--reverse-prompt "User:" \
--in-prefix " " \
--in-suffix "Assistant:"
./bin/llama-cli -m ../models/llama-2-7b.Q4_K_M.gguf \
-t 8 -n 512 -c 2048 --color \
--temp 0.8 --top-k 40 --top-p 0.9 \
--repeat-penalty 1.1 \
-p "Write a short story about a robot learning to paint:"
./bin/llama-cli -m ../models/llama-2-7b.Q4_K_M.gguf \
-t 8 -n 256 -c 2048 --color \
--temp 0.3 --top-k 10 \
-p "List the steps to configure a Raspberry Pi cluster:"
./bin/llama-cli -m ../models/llama-2-7b.Q4_K_M.gguf \
-t 4 -n 64 -c 1024 --color \
--batch-size 256 \
-p "Summarize this in one sentence:"
Key llama-cli parameters:

Flag | Parameter | Range / Default | Notes
---|---|---|---
-t | Number of CPU threads to use | Default: 4, Pi 5: 8 | Use all 8 cores on the Pi 5 for best performance
-n | Max tokens to generate | Range: 1-2048, Default: 128 | Higher = longer responses, more time
-c | Context window size | Max: 4096, Safe: 2048 | Reduce if running out of memory
--temp | Temperature (creativity) | Range: 0.0-2.0, Default: 0.8 | Lower = focused, higher = creative
--top-k | Top-K sampling | Range: 1-100, Default: 40 | Limits word choices for coherence
--top-p | Nucleus sampling | Range: 0.0-1.0, Default: 0.9 | Cumulative probability cutoff
--repeat-penalty | Repetition penalty | Range: 0.0-2.0, Default: 1.1 | Prevents repetitive output
--batch-size | Batch size for prompt processing | Range: 1-512, Default: 512 | Lower = less memory usage
--mlock | Lock model in RAM | Flag: present/absent | Prevents swapping, faster inference
./bin/llama-cli -m ../models/llama-2-7b.Q4_K_M.gguf \
-t 8 -n 256 -c 2048 --color \
-p "[INST] <<SYS>>
You are a helpful AI assistant. Be concise and accurate.
<</SYS>>

User question here [/INST]"
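To avoid retyping the template, a small helper can wrap any question in the same [INST]/<<SYS>> format. This is a sketch only; the chat7b name is hypothetical, and the paths and defaults are taken from the commands above:

# Hypothetical chat7b helper that applies the Llama 2 chat template
chat7b() {
    local sys="You are a helpful AI assistant. Be concise and accurate."
    ~/llama.cpp/build/bin/llama-cli \
        -m ~/llama.cpp/models/llama-2-7b.Q4_K_M.gguf \
        -t 8 -n 256 -c 2048 --color \
        -p "[INST] <<SYS>>
$sys
<</SYS>>

$1 [/INST]"
}

chat7b "What makes the Raspberry Pi 5 good for local LLMs?"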
# Check available memory
free -h
# Run with memory lock
./bin/llama-cli -m ../models/llama-2-7b.Q4_K_M.gguf \
--mlock -t 8 -c 1024 -n 128 \
-p "Your prompt"
# Create prompts file
echo "Explain AI" > prompts.txt
echo "What is quantum computing?" >> prompts.txt
echo "How do neural networks work?" >> prompts.txt
# Process in batch
while IFS= read -r prompt; do
    ./bin/llama-cli -m ../models/llama-2-7b.Q4_K_M.gguf \
        -t 8 -n 128 -c 1024 --color -p "$prompt" \
        >> responses.txt
    echo "---" >> responses.txt
done < prompts.txt
# Stream response to file
./bin/llama-cli -m ../models/llama-2-7b.Q4_K_M.gguf \
-t 8 -n 512 -c 2048 \
-p "Write a detailed guide about Raspberry Pi clusters" \
2>&1 | tee output.txt
# From another node
ssh brand@192.168.1.43 "ask7b 'What is the weather like?'"
# With proper escaping for complex prompts
ssh brand@192.168.1.43 'ask7b "Explain the concept of \"distributed computing\" in simple terms"'
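Rather than hand-escaping quotes, bash's printf %q can quote an arbitrary prompt so it survives the remote shell's parsing. A minimal sketch (the remote_ask7b name is hypothetical):

# Hypothetical wrapper that safely quotes any prompt before sending it over SSH
remote_ask7b() {
    local quoted
    printf -v quoted '%q' "$1"
    ssh brand@192.168.1.43 "ask7b $quoted"
}

remote_ask7b 'Explain the concept of "distributed computing" in simple terms'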
# Start llama.cpp server
cd ~/llama.cpp/build
./bin/llama-server -m ../models/llama-2-7b.Q4_K_M.gguf \
-t 8 -c 2048 --host 0.0.0.0 --port 8080
# Query from another machine
curl -X POST http://192.168.1.43:8080/completion \
-H "Content-Type: application/json" \
-d '{
"prompt": "Hello, how are you?",
"n_predict": 128,
"temperature": 0.7
}'
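The /completion endpoint returns JSON, so piping through jq makes the generated text easier to use in scripts. This assumes jq is installed and that the response exposes the generated text in a "content" field, as recent llama.cpp server builds do:

# Extract just the generated text from the server response
curl -s -X POST http://192.168.1.43:8080/completion \
    -H "Content-Type: application/json" \
    -d '{"prompt": "Hello, how are you?", "n_predict": 128, "temperature": 0.7}' \
    | jq -r '.content'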
./bin/llama-cli -m ../models/llama-2-7b.Q4_K_M.gguf \
-t 8 -n 256 -c 2048 --temp 0.3 --top-k 10 \
-p "[INST] You are a senior software engineer. Explain microservices architecture with examples. [/INST]"
./bin/llama-cli -m ../models/llama-2-7b.Q4_K_M.gguf \
-t 8 -n 512 -c 2048 --temp 0.9 --top-p 0.95 \
-p "[INST] You are a creative storyteller. Write a short sci-fi story about AI consciousness. [/INST]"
./bin/llama-cli -m ../models/llama-2-7b.Q4_K_M.gguf \
-t 8 -n 256 -c 2048 --temp 0.2 \
-p "[INST] Write a Python function to connect to multiple Raspberry Pis via SSH and run commands. [/INST]"
#!/bin/bash
# Save as llama-monitor.sh; logs each query with the SoC temperature and timing
LOG_FILE="/home/brand/llama_usage.log"
cd ~/llama.cpp/build || exit 1
TEMP=$(vcgencmd measure_temp | cut -d'=' -f2)
echo "=== Llama 2 7B Query ===" >> $LOG_FILE
echo "Timestamp: $(date)" >> $LOG_FILE
echo "Temperature: $TEMP" >> $LOG_FILE
echo "Prompt: $1" >> $LOG_FILE
time ./bin/llama-cli -m ../models/llama-2-7b.Q4_K_M.gguf \
-t 8 -n 128 -c 2048 --color -p "$1" \
2>&1 | tee -a $LOG_FILE
echo "=== End Query ===" >> $LOG_FILE
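Make the script executable once, then pass the prompt as a single quoted argument:

chmod +x llama-monitor.sh
./llama-monitor.sh "Explain how GGUF quantization works"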
#!/bin/bash
# Generate daily AI insights
TOPICS=("AI trends" "Raspberry Pi projects" "Distributed computing")
OUTPUT_FILE="/home/brand/daily_insights_$(date +%Y%m%d).txt"
cd ~/llama.cpp/build || exit 1
echo "# Daily AI Insights - $(date)" > "$OUTPUT_FILE"
for topic in "${TOPICS[@]}"; do
    echo -e "\n## $topic\n" >> "$OUTPUT_FILE"
    ./bin/llama-cli -m ../models/llama-2-7b.Q4_K_M.gguf \
        -t 8 -n 256 -c 2048 --temp 0.7 \
        -p "Provide a brief update on recent developments in $topic:" \
        >> "$OUTPUT_FILE"
    echo -e "\n---\n" >> "$OUTPUT_FILE"
done
echo "Daily insights saved to: $OUTPUT_FILE"
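To generate the report automatically, the script can be scheduled with cron. A sketch, assuming it is saved as /home/brand/daily-insights.sh (the filename is not specified above):

# crontab -e, then add: run every day at 06:00 and log any output or errors
0 6 * * * /home/brand/daily-insights.sh >> /home/brand/daily_insights_cron.log 2>&1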
# Watch the CPU temperature in real time
watch -n 1 vcgencmd measure_temp
# Monitor CPU and memory usage
htop
# Reduce context and batch size
./bin/llama-cli -m ../models/llama-2-7b.Q4_K_M.gguf \
-t 8 -n 64 -c 512 --batch-size 128 \
-p "Your prompt"
# Or use smaller quantization
wget https://huggingface.co/TheBloke/Llama-2-7B-GGUF/resolve/main/llama-2-7b.Q3_K_M.gguf
# Check CPU throttling
vcgencmd get_throttled
# Ensure using all cores
nproc # Should show 8 for Pi 5
# Run with optimal settings (a negative niceness usually requires sudo)
nice -n -10 ./bin/llama-cli -m ../models/llama-2-7b.Q4_K_M.gguf \
-t 8 --mlock -n 128 -c 1024 \
-p "Your prompt"
# Verify model integrity
md5sum ../models/llama-2-7b.Q4_K_M.gguf
# Check file permissions
ls -la ../models/
# Test with minimal parameters
./bin/llama-cli -m ../models/llama-2-7b.Q4_K_M.gguf \
-p "Test"
Use Node 01 (TinyLlama) for quick queries, Node 02 (Llama 2 7B) for complex tasks:
#!/bin/bash
# Smart query router
QUERY="$1"
WORD_COUNT=$(echo "$QUERY" | wc -w)
if [ $WORD_COUNT -lt 10 ]; then
    echo "Routing to TinyLlama (fast)..."
    ssh brand@192.168.1.31 "tangnet '$QUERY'"
else
    echo "Routing to Llama 2 7B (detailed)..."
    ssh brand@192.168.1.43 "ask7b '$QUERY'"
fi
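Saved as, say, route-query.sh (a hypothetical name), the router is called the same way regardless of which node answers:

chmod +x route-query.sh
./route-query.sh "What is TANGNET?"
./route-query.sh "Compare three strategies for distributing inference across a Raspberry Pi cluster"

The first query is under 10 words and goes to TinyLlama; the second is routed to Llama 2 7B.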
# Query both models simultaneously
(ssh brand@192.168.1.31 "tangnet '$1'" > tiny_response.txt) &
(ssh brand@192.168.1.43 "ask7b '$1'" > llama_response.txt) &
wait
echo "=== TinyLlama Response ==="
cat tiny_response.txt
echo -e "\n=== Llama 2 7B Response ==="
cat llama_response.txt
Get answers from multiple models and compare:
# Future implementation with more nodes
# Query 3+ models and find consensus
# Perfect for fact-checking and validation
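Until more nodes come online, the same idea can be prototyped with the two existing models: fan the question out to every node in parallel, collect the answers, and compare them side by side. A minimal sketch, assuming the node addresses and helpers above; a third node would just be one more entry in the list:

#!/bin/bash
# Hypothetical consensus prototype: send one question to every node, then review all answers
QUERY="$1"
NODES=("brand@192.168.1.31 tangnet" "brand@192.168.1.43 ask7b")
i=0
for node in "${NODES[@]}"; do
    host=${node% *}   # everything before the space: user@ip
    cmd=${node#* }    # everything after the space: helper command on that node
    ssh "$host" "$cmd '$QUERY'" > "response_$i.txt" &
    i=$((i + 1))
done
wait
cat response_*.txt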