White Paper - OpenNebula AI Factory Reference Architecture

The growing demand for artificial intelligence is driving a fundamental shift in how computing infrastructure is designed and delivered. Around the world, cloud providers, HPC centers, telecommunications companies, and public institutions are investing in the creation of AI Factories — large-scale infrastructures built to train, deploy, and operate AI models efficiently and securely.

This white paper introduces the OpenNebula AI Factory Reference Architecture, a comprehensive framework for building secure, multi-tenant AI infrastructure that unifies high-performance computing, virtualization, and cloud orchestration within a single platform. Designed for AI training, inference, and service delivery, the architecture clearly separates shared infrastructure management from per-tenant operations, so that organizations can deliver scalable AI-as-a-Service offerings with full data sovereignty and near-native performance.

At its foundation, the AI Factory integrates GPU-accelerated servers (e.g., NVIDIA H100, GB200, GH200) interconnected via NVLink, InfiniBand, or Spectrum-X Ethernet, complemented by BlueField-3 DPUs for secure networking and high-performance distributed storage systems for low-latency data access. On top of this hardware layer, OpenNebula acts as the Infrastructure-as-a-Service (IaaS) layer, providing unified orchestration, virtualization, and automation capabilities across distributed clusters.

Each tenant operates within an isolated Virtual Data Center (VDC), managing its own AI platforms and frameworks—ranging from Kubernetes-based environments (e.g., Run:ai, Kubeflow, NVIDIA Dynamo) to bare-metal GPU execution for performance-critical workloads. OpenNebula’s multi-tenant model ensures complete resource isolation through ACL-based governance, while supporting chargeback, monitoring, and marketplace integration for self-service operations and flexible business models.

The white paper also details the key building blocks of the AI Factory ecosystem—including AI platforms, storage, monitoring, security, DevOps, and cluster management tools—alongside a four-layer user model that distinguishes between end users, service users, service administrators, and infrastructure administrators.

Finally, it demonstrates how OpenNebula achieves near-zero virtualization overhead through technologies such as PCI passthrough, SR-IOV, and CPU pinning, so that AI workloads and HPC jobs run at close to bare-metal performance while retaining the agility, scalability, and isolation required for modern multi-tenant AI infrastructures.
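
As a rough illustration of how PCI passthrough and CPU pinning might be requested for a GPU virtual machine, the sketch below renders a VM template as text. It is not taken from the white paper. The helper name `render_gpu_vm_template` and its parameters are hypothetical, and the attribute names (`PCI`, `TOPOLOGY`, `PIN_POLICY`) reflect a reading of OpenNebula's VM template syntax that should be checked against the documentation for the version in use. The PCI filter uses NVIDIA's vendor ID (`10de`) and the 3D controller class (`0302`) as assumed defaults.

```python
def render_gpu_vm_template(name, vcpus, memory_mb, gpus=1,
                           pci_vendor="10de", pci_class="0302"):
    """Render an OpenNebula-style VM template string (illustrative sketch).

    Assumptions: one socket, one thread per core, and every vCPU pinned to a
    dedicated physical core. Each requested GPU becomes one PCI section that
    selects a host device by vendor and class.
    """
    lines = [
        f'NAME = "{name}"',
        f"CPU = {vcpus}",
        f"VCPU = {vcpus}",
        f"MEMORY = {memory_mb}",
        # CPU pinning: ask the scheduler for dedicated cores.
        f'TOPOLOGY = [ PIN_POLICY = "CORE", SOCKETS = 1, '
        f"CORES = {vcpus}, THREADS = 1 ]",
    ]
    # PCI passthrough: one section per GPU handed to the guest.
    for _ in range(gpus):
        lines.append(
            f'PCI = [ VENDOR = "{pci_vendor}", CLASS = "{pci_class}" ]'
        )
    return "\n".join(lines)


# Hypothetical tenant VM with 16 pinned vCPUs, 128 GiB of RAM, and 2 GPUs.
template = render_gpu_vm_template("ai-train-01", vcpus=16,
                                  memory_mb=131072, gpus=2)
print(template)
```

Generating the template from a small function keeps the pinning and passthrough settings in one place, which can make it easier to apply the same policy across tenants; SR-IOV virtual functions for networking would need their own attributes and are left out of this sketch.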
