Course Outline

Introduction to EXO and Local AI Clustering

  • Overview of the EXO framework and the exo-explore ecosystem.
  • Comparison between centralized cloud inference and distributed local inference.
  • Architecture overview: libp2p device discovery, MLX backend, dashboard, and API layers.
  • Hardware requirements: Apple Silicon (M3 Ultra, M4 Pro/Max), Thunderbolt 5, and shared storage.

Installing EXO on macOS

  • Setting up Xcode, Metal Toolchain, and macOS prerequisites.
  • Installing uv, Node.js, and the Rust nightly toolchain.
  • Installing the specific macmon fork for Apple Silicon monitoring.
  • Cloning the repository and building the dashboard using npm.
  • Running EXO from source and verifying the localhost:52415 dashboard.

Installing EXO on Linux

  • Installing dependencies via apt or Homebrew on Linux.
  • Configuring uv, Node.js 18+, and the Rust nightly toolchain.
  • Building the dashboard and running EXO in CPU-only mode.
  • Directory layout: using XDG Base Directory paths for configuration, data, cache, and logs.
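
As a rough illustration of the directory-layout topic above, the sketch below resolves the four standard XDG base directories using the fallbacks defined in the XDG Base Directory specification. The `exo` subdirectory name and the helper `xdg_paths` are assumptions for illustration only. The actual folders EXO creates, and where it writes logs, should be checked against the project's documentation.

```python
import os
from pathlib import Path

# XDG variables and their spec-defined fallbacks, relative to the home directory.
XDG_DEFAULTS = {
    "config": ("XDG_CONFIG_HOME", ".config"),
    "data": ("XDG_DATA_HOME", ".local/share"),
    "cache": ("XDG_CACHE_HOME", ".cache"),
    "state": ("XDG_STATE_HOME", ".local/state"),
}


def xdg_paths(app_name="exo", env=None, home=None):
    """Return per-application XDG directories (hypothetical helper).

    app_name="exo" is an assumed subdirectory name, not confirmed by the outline.
    """
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)
    paths = {}
    for kind, (var, fallback) in XDG_DEFAULTS.items():
        value = env.get(var, "")
        # The spec says unset, empty, or relative values must be ignored.
        if value and Path(value).is_absolute():
            base = Path(value)
        else:
            base = home / fallback
        paths[kind] = base / app_name
    return paths


if __name__ == "__main__":
    for kind, path in xdg_paths().items():
        print(f"{kind:>6}: {path}")
```

Running the script on a Linux node prints the directories that the current environment resolves to, which can help when checking where configuration or cached models would be expected to live.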

Automatic Device Discovery and Cluster Formation

  • Understanding libp2p-based auto-discovery across local networks.
  • Configuring custom namespaces using EXO_LIBP2P_NAMESPACE for cluster isolation.
  • Verifying node membership via the dashboard cluster view.
  • Troubleshooting discovery failures and network segmentation issues.

Enabling RDMA over Thunderbolt 5

  • Understanding RDMA architecture and the project's claimed 99 percent reduction in inter-device latency.
  • Enabling RDMA in macOS Recovery mode using rdma_ctl.
  • Requirements for cables and port topology constraints on Mac Studio.
  • Ensuring macOS versions are matched across all cluster nodes.
  • Troubleshooting RDMA discovery and DHCP configuration.

Deploying Frontier Models

  • Using the dashboard to load and shard DeepSeek V3.1, Qwen3-235B, and Llama family models.
  • Previewing instance placements via the /instance/previews API endpoint.
  • Creating model instances with pipeline or tensor-parallel sharding.
  • Configuring custom model cards from the Hugging Face Hub.
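
To make the idea of pipeline sharding concrete, the sketch below splits a model's layers into contiguous ranges, one per node, sized in proportion to each node's memory. This is an illustration of the general technique only, not EXO's placement algorithm. The function name `pipeline_layer_ranges`, the node names, and the memory figures are all hypothetical.

```python
def pipeline_layer_ranges(num_layers, node_memory_gb):
    """Assign contiguous layer ranges to nodes, proportional to memory.

    Hypothetical helper for illustration. Returns {node: (start, end)} with
    end exclusive, covering layers 0..num_layers without gaps or overlap.
    """
    total = sum(node_memory_gb.values())
    if num_layers <= 0 or total <= 0:
        raise ValueError("num_layers and total memory must be positive")

    ranges = {}
    start = 0
    cumulative = 0.0
    for name, mem in node_memory_gb.items():
        cumulative += mem
        # Cut at the cumulative memory share so the ranges always sum to num_layers.
        end = round(num_layers * cumulative / total)
        ranges[name] = (start, end)
        start = end
    return ranges


if __name__ == "__main__":
    # Assumed cluster: one larger node and two smaller ones.
    cluster = {"studio-a": 512, "studio-b": 256, "studio-c": 256}
    for node, (lo, hi) in pipeline_layer_ranges(64, cluster).items():
        print(f"{node}: layers {lo}..{hi - 1}")
```

In pipeline sharding each node holds whole layers and passes activations to the next node, whereas tensor-parallel sharding splits the weights inside each layer across nodes. The sketch covers only the former.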

Monitoring and Troubleshooting

  • Reading EXO logs and understanding distributed tracing.
  • Interpreting cluster health through the dashboard cluster view.
  • Diagnosing worker node failures and reconnection behaviors.
  • Using EXO_TRACING_ENABLED for performance bottleneck analysis.

Cluster Maintenance and Updates

  • Updating EXO binaries and rebuilding the dashboard.
  • Migrating model caches and managing pre-downloaded models via NFS.
  • Gracefully removing nodes and rebalancing workloads.

Requirements

  • Fundamental understanding of networking concepts (IP addressing, subnetting, firewalls).
  • Experience with command-line administration on macOS or Linux.
  • Familiarity with Python package management (pip/uv) and Node.js tooling.

Target Audience

  • System administrators.
  • DevOps engineers.
  • AI infrastructure architects responsible for on-premise LLM deployment.

Duration: 21 hours