Ollama & Open WebUI: Private AI Infrastructure on macOS

Jun 7, 2025

Complete guide to setting up private AI infrastructure using Ollama for local LLM hosting and Open WebUI for a ChatGPT-like interface on macOS.

Private AI Infrastructure: Ollama & Open WebUI

Running AI models locally provides privacy, control, and cost-effectiveness compared to cloud-based solutions. Ollama simplifies local LLM deployment while Open WebUI delivers a familiar chat interface.
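
The split is straightforward: Ollama exposes an HTTP API on localhost port 11434, and Open WebUI is simply a client of that API. Once both pieces are installed (covered below), you can see the boundary by querying Ollama directly:

# List the models Ollama currently hosts
curl http://localhost:11434/api/tags

# Ask a model for a completion directly
curl http://localhost:11434/api/generate \
  -d '{"model": "llama2", "prompt": "Say hello", "stream": false}'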

Why Local AI Infrastructure?

Privacy Benefits

  • Data Sovereignty: All conversations stay on your hardware
  • No External Dependencies: Works offline without internet
  • GDPR/Compliance: Meet strict data protection requirements
  • Confidential Computing: Sensitive business data never leaves premises

Technical Advantages

  • Customization: Fine-tune models for specific use cases
  • No API Limits: Unlimited usage without rate limiting
  • Latency Control: Optimized for your hardware and use case
  • Cost Predictability: No per-token pricing or monthly fees

Ollama Installation and Setup

macOS Installation

# Method 1: Direct download (recommended)
# Download the macOS app from https://ollama.ai
# Move Ollama.app to /Applications and launch it once

# Method 2: Homebrew
brew install ollama

# Verify installation
ollama --version

Service Management

# Start Ollama service
ollama serve

# Run as a background service (Homebrew install only)
brew services start ollama

# Check service status
brew services list | grep ollama

Model Management

# List available models
ollama list

# Pull popular models
ollama pull llama2                    # 7B parameter model (~4GB)
ollama pull llama2:13b               # 13B parameter model (~7GB)
ollama pull codellama               # Code-specialized model
ollama pull mistral                 # Fast 7B model
ollama pull llama2-uncensored       # Uncensored variant

# Pull specific model sizes
ollama pull llama2:7b-chat-q4_0     # 4-bit quantized for efficiency
ollama pull llama2:7b-chat-q8_0     # 8-bit quantized for quality

# Check model details
ollama show llama2

Model Selection and Optimization

Hardware Requirements

# Check system specs
system_profiler SPHardwareDataType | grep -E "(Memory|Chip)"

# RAM recommendations:
# 7B models: 8GB+ RAM
# 13B models: 16GB+ RAM  
# 70B models: 64GB+ RAM

# Apple Silicon: full-precision FP16 variant (higher quality, needs roughly 13GB+ RAM)
ollama pull llama2:7b-chat-fp16

Performance Testing

# Benchmark model performance
time ollama run llama2 "Write a hello world program in Python"

# Monitor resource usage
top -pid $(pgrep ollama)

# Memory usage per model
ollama ps
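
For finer-grained numbers than time, the Ollama API reports token counts and durations (in nanoseconds) in its non-streaming responses. A rough tokens-per-second check, assuming jq is installed:

curl -s http://localhost:11434/api/generate \
  -d '{"model": "llama2", "prompt": "Write a hello world program in Python", "stream": false}' \
  | jq '{tokens: .eval_count, seconds: (.eval_duration / 1e9), tokens_per_sec: (.eval_count / (.eval_duration / 1e9))}'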

Model Configuration

# Create custom Modelfile
cat > Modelfile << EOF
FROM llama2

# Set parameters
PARAMETER temperature 0.7
PARAMETER num_ctx 4096
PARAMETER top_p 0.9

# System prompt
SYSTEM """
You are a helpful AI assistant specialized in technical documentation and code review.
Always provide accurate, well-structured responses with code examples when appropriate.
"""
EOF

# Build custom model
ollama create samcloud-assistant -f Modelfile

# Test custom model
ollama run samcloud-assistant "Explain Docker containers"

Open WebUI Installation

Docker Installation

# Install Docker (if not already installed)
brew install --cask docker

# Start Docker Desktop
open /Applications/Docker.app

# Verify Docker is running
docker --version

Open WebUI Deployment

# Pull Open WebUI image
docker pull ghcr.io/open-webui/open-webui:main

# Run Open WebUI with Ollama integration
docker run -d \
  --name open-webui \
  -p 3000:8080 \
  -e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
  -v open-webui:/app/backend/data \
  --restart unless-stopped \
  ghcr.io/open-webui/open-webui:main

# Check container status
docker ps | grep open-webui

# View logs
docker logs open-webui
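
If you prefer a declarative setup, the same container can be described with Docker Compose. This is an illustrative equivalent of the docker run command above; the directory is arbitrary, and external: true reuses the volume the earlier command created:

mkdir -p ~/open-webui && cd ~/open-webui

cat > docker-compose.yml << EOF
services:
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    ports:
      - "3000:8080"
    environment:
      - OLLAMA_BASE_URL=http://host.docker.internal:11434
    volumes:
      - open-webui:/app/backend/data
    restart: unless-stopped

volumes:
  open-webui:
    external: true
EOF

docker compose up -d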

Alternative: Native Installation

# Install the Python package (requires Python 3.11+)
pip install open-webui

# Point Open WebUI at the local Ollama instance (set before launching)
export OLLAMA_BASE_URL=http://localhost:11434

# Start Open WebUI
open-webui serve --host 0.0.0.0 --port 3000

Configuration and Customization

Open WebUI Configuration

# Access web interface
open http://localhost:3000

# Initial setup:
# 1. Create admin account
# 2. Configure Ollama connection
# 3. Test model connectivity
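
Before walking through the in-browser setup, it is worth confirming from the host that both services answer:

# Ollama API should list the installed models
curl -s http://localhost:11434/api/tags

# Open WebUI should respond on port 3000
curl -sI http://localhost:3000 | head -1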

Environment Variables

# Create configuration file
mkdir -p ~/.config/open-webui
cat > ~/.config/open-webui/config.env << EOF
# Ollama configuration
OLLAMA_BASE_URL=http://localhost:11434
OLLAMA_API_BASE_URL=http://localhost:11434/api

# Security settings
WEBUI_SECRET_KEY=$(openssl rand -hex 32)
ENABLE_SIGNUP=false
DEFAULT_USER_ROLE=user

# Feature flags
ENABLE_RAG=true
ENABLE_SEARCH=true
ENABLE_CHAT_IMPORT=true

# File upload limits (MB)
FILE_SIZE_LIMIT=25
IMAGE_SIZE_LIMIT=10
EOF

# Load configuration into the environment (native install)
set -a; source ~/.config/open-webui/config.env; set +a
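
For the Docker deployment, the same file can be handed to the container instead. One way, re-creating the container from the earlier example with --env-file:

# Recreate the container with the shared configuration file
docker rm -f open-webui
docker run -d \
  --name open-webui \
  -p 3000:8080 \
  --env-file ~/.config/open-webui/config.env \
  -v open-webui:/app/backend/data \
  --restart unless-stopped \
  ghcr.io/open-webui/open-webui:main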

Custom Themes and Branding

/* Custom CSS for Open WebUI */
:root {
  --primary-color: #3b82f6;
  --secondary-color: #1e293b;
  --accent-color: #fbbf24;
  --background-color: #0f172a;
  --text-color: #f8fafc;
}

.sidebar {
  background: var(--secondary-color);
  border-right: 1px solid var(--accent-color);
}

.chat-message {
  background: linear-gradient(135deg, var(--background-color), var(--secondary-color));
  border-left: 3px solid var(--accent-color);
}

Advanced AI Workflows

RAG (Retrieval Augmented Generation)

# Install vector database support
pip install chromadb sentence-transformers

# Configure document processing
mkdir -p ~/.config/open-webui/documents
cat > ~/.config/open-webui/rag-config.yaml << EOF
vector_store:
  type: chromadb
  persist_directory: ~/.config/open-webui/chromadb
  
embedding_model:
  name: sentence-transformers/all-MiniLM-L6-v2
  
document_processors:
  - pdf
  - txt
  - md
  - docx
  
chunk_size: 1000
chunk_overlap: 200
EOF

Custom AI Functions

# ~/.config/open-webui/functions/code_review.py
"""
title: Code Review Assistant
author: SamCloud
version: 1.0.0
"""

import ast
import subprocess

class CodeReviewTool:
    def __init__(self):
        self.name = "code_review"
        self.description = "Analyze code for issues and suggestions"
    
    def analyze_python(self, code: str) -> dict:
        """Analyze Python code for syntax and style issues."""
        try:
            # Parse AST for syntax validation
            tree = ast.parse(code)
            
            # Run pylint for style analysis (pylint must be installed)
            result = subprocess.run(
                ['pylint', '--from-stdin', 'temp.py'],
                input=code,
                capture_output=True,
                text=True
            )
            
            return {
                "syntax_valid": True,
                "style_issues": result.stdout,
                "suggestions": self.generate_suggestions(tree)
            }
        except SyntaxError as e:
            return {
                "syntax_valid": False,
                "error": str(e),
                "suggestions": ["Fix syntax error before proceeding"]
            }
    
    def generate_suggestions(self, ast_tree) -> list:
        """Generate code improvement suggestions."""
        suggestions = []
        
        for node in ast.walk(ast_tree):
            if isinstance(node, ast.FunctionDef):
                if not node.returns:
                    suggestions.append(f"Consider adding type hints to function '{node.name}'")
        
        return suggestions

Multi-Model Ensemble

# Create model ensemble script
mkdir -p ~/scripts
cat > ~/scripts/ai-ensemble.py << 'EOF'
#!/usr/bin/env python3
import requests
import json
from concurrent.futures import ThreadPoolExecutor

class ModelEnsemble:
    def __init__(self):
        self.models = ['llama2', 'mistral', 'codellama']
        self.ollama_url = 'http://localhost:11434'
    
    def query_model(self, model: str, prompt: str) -> dict:
        """Query an individual model via the Ollama API."""
        response = requests.post(
            f'{self.ollama_url}/api/generate',
            json={'model': model, 'prompt': prompt, 'stream': False}
        )
        data = response.json()
        return {
            'model': model,
            'response': data.get('response', ''),
            'eval_duration': data.get('eval_duration', 0)
        }
    
    def ensemble_query(self, prompt: str) -> list:
        """Query multiple models and combine responses."""
        with ThreadPoolExecutor(max_workers=len(self.models)) as executor:
            futures = [
                executor.submit(self.query_model, model, prompt)
                for model in self.models
            ]
            results = [future.result() for future in futures]
        
        # Combine and rank responses
        return self.rank_responses(results)
    
    def rank_responses(self, responses: list) -> list:
        """Rank responses by quality metrics."""
        # Simple ranking by response length and speed (eval_duration is in nanoseconds)
        scored = []
        for resp in responses:
            score = (
                len(resp['response']) * 0.7 +  # Content length
                (1000000 / max(resp['eval_duration'], 1)) * 0.3  # Speed factor, guard against zero
            )
            scored.append({**resp, 'score': score})
        
        return sorted(scored, key=lambda x: x['score'], reverse=True)

if __name__ == "__main__":
    ensemble = ModelEnsemble()
    prompt = input("Enter your question: ")
    results = ensemble.ensemble_query(prompt)
    
    print("\n=== AI Ensemble Results ===")
    for i, result in enumerate(results, 1):
        print(f"\n{i}. {result['model']} (Score: {result['score']:.2f})")
        print(f"Response: {result['response'][:200]}...")
EOF

chmod +x ~/scripts/ai-ensemble.py

Monitoring and Management

System Monitoring

# Create monitoring script
cat > ~/scripts/ai-monitor.sh << EOF
#!/bin/bash

echo "=== Ollama Service Status ==="
brew services list | grep ollama

echo -e "\n=== Active Models ==="
ollama ps

echo -e "\n=== System Resources ==="
top -l 1 | grep -E "(CPU usage|PhysMem)"

echo -e "\n=== Docker Containers ==="
docker ps --format "table {{.Names}}\t{{.Status}}\t{{.Ports}}"

echo -e "\n=== Disk Usage ==="
du -sh ~/.ollama/models/*

echo -e "\n=== Recent Logs ==="
docker logs open-webui --tail 10
EOF

chmod +x ~/scripts/ai-monitor.sh

# Run monitoring
~/scripts/ai-monitor.sh

Automated Backups

# Create backup script
cat > ~/scripts/ai-backup.sh << 'EOF'
#!/bin/bash

BACKUP_DIR="$HOME/Backups/AI-Infrastructure"
TIMESTAMP=$(date +%Y%m%d_%H%M%S)

# Create backup directory
mkdir -p "$BACKUP_DIR/$TIMESTAMP"

# Backup Ollama models
echo "Backing up Ollama models..."
cp -r ~/.ollama "$BACKUP_DIR/$TIMESTAMP/ollama"

# Backup Open WebUI data
echo "Backing up Open WebUI data..."
docker run --rm -v open-webui:/data -v "$BACKUP_DIR/$TIMESTAMP":/backup alpine \
  tar czf /backup/open-webui-data.tar.gz -C /data .

# Backup configurations
echo "Backing up configurations..."
cp -r ~/.config/open-webui "$BACKUP_DIR/$TIMESTAMP/config" 2>/dev/null || true

# Create backup manifest
cat > "$BACKUP_DIR/$TIMESTAMP/manifest.txt" << EOL
AI Infrastructure Backup
Created: $(date)
Ollama Version: $(ollama --version)
Models: $(ollama list | grep -v NAME | awk '{print $1}' | tr '\n' ', ')
Open WebUI: $(docker images | grep open-webui | awk '{print $2}')
EOL

echo "Backup completed: $BACKUP_DIR/$TIMESTAMP"
EOF

chmod +x ~/scripts/ai-backup.sh

# Schedule daily backups
(crontab -l 2>/dev/null; echo "0 2 * * * $HOME/scripts/ai-backup.sh") | crontab -
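
Restoring follows the same layout in reverse; replace <TIMESTAMP> with the backup directory you want to recover:

# Restore Open WebUI data into the named volume
docker stop open-webui
docker run --rm -v open-webui:/data -v "$HOME/Backups/AI-Infrastructure/<TIMESTAMP>":/backup alpine \
  tar xzf /backup/open-webui-data.tar.gz -C /data
docker start open-webui

# Restore Ollama models
brew services stop ollama
rsync -a "$HOME/Backups/AI-Infrastructure/<TIMESTAMP>/ollama/" ~/.ollama/
brew services start ollama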

Performance Optimization

Model Runtime Tuning

# Create a tuned variant of llama2 via a Modelfile
# (the quantized tags pulled earlier, e.g. q4_0/q8_0, already cover size reduction)
cat > Modelfile.optimized << EOF
FROM llama2

# Tune for Apple Silicon
PARAMETER num_gpu 1
PARAMETER num_thread 8
PARAMETER rope_freq_base 10000
PARAMETER rope_freq_scale 1

# Memory optimization
PARAMETER use_mmap true
PARAMETER use_mlock true
EOF

ollama create llama2-optimized -f Modelfile.optimized

# Test performance difference
time ollama run llama2 "Explain quantum computing"
time ollama run llama2-optimized "Explain quantum computing"

GPU Acceleration (Apple Silicon)

# Verify Metal support
system_profiler SPDisplaysDataType | grep "Metal"

# Monitor GPU usage
sudo powermetrics -n 1 -s gpu_power | grep -E "(GPU|package)"

# Optional environment overrides (support varies by Ollama version)
export OLLAMA_NUM_GPU=1
export OLLAMA_GPU_MEMORY_FRACTION=0.8

Security Considerations

Access Control

# Create firewall rules for the AI service ports (writing under /etc requires root)
sudo tee /etc/pf.anchors/ai-services << 'EOF'
# Block external access to Ollama and Open WebUI
block in proto tcp from any to any port {11434, 3000}

# Allow loopback-only access (quick stops evaluation so the block above never applies to lo0)
pass in quick on lo0 proto tcp from any to any port {11434, 3000}
EOF

# Reference the anchor from /etc/pf.conf, then reload the ruleset
sudo pfctl -f /etc/pf.conf

# Enable authentication in Open WebUI (stop the original container first so port 3000 is free)
docker run -d \
  --name open-webui-secure \
  -p 3000:8080 \
  -e ENABLE_SIGNUP=false \
  -e DEFAULT_USER_ROLE=pending \
  -e WEBUI_AUTH=true \
  -v open-webui:/app/backend/data \
  ghcr.io/open-webui/open-webui:main

Data Privacy

# Encrypt model storage with an encrypted APFS disk image
hdiutil create -size 50g -fs APFS -encryption AES-256 -volname "AI-Models" ~/AI-Models.dmg
hdiutil attach ~/AI-Models.dmg

# Move Ollama data to the encrypted volume (stop Ollama first)
brew services stop ollama
mv ~/.ollama /Volumes/AI-Models/
ln -s /Volumes/AI-Models/.ollama ~/.ollama
brew services start ollama

Troubleshooting

Common Issues

# Ollama service not starting
brew services restart ollama
lsof -i :11434  # Check port conflicts

# Model download failures
ollama pull llama2 --insecure  # Only when pulling from an insecure (HTTP) registry
rm -rf ~/.ollama/models/manifests/registry.ollama.ai/library/llama2  # Reset model

# Open WebUI connection issues
docker logs open-webui  # Check container logs
curl http://localhost:11434/api/tags  # Test Ollama API

# Memory issues
ollama ps  # Check loaded models
ollama stop llama2  # Unload specific model

Performance Debugging

# Debug model performance
ollama run llama2 --verbose "Test prompt"

# Monitor system resources
iostat 1  # Disk I/O
vm_stat 1  # Memory usage

Conclusion

Local AI infrastructure with Ollama and Open WebUI provides a powerful, private alternative to cloud-based AI services. The combination offers enterprise-grade privacy, unlimited usage, and complete customization while maintaining the familiar chat interface users expect.

This setup scales from personal use to team deployments, supporting various models and use cases while keeping sensitive data under your complete control.


Next: macOS Homelab Architecture: Infrastructure as Code