Senior AI Infrastructure & Platform Operations Engineer (remote in the EU)

100% Remote Full-time Open now

Company Description

Mirantis is the Kubernetes-native AI infrastructure company, enabling organizations to build and operate scalable, secure, and sovereign infrastructure for modern AI, machine learning, and data-intensive applications. By combining open source innovation with deep expertise in Kubernetes orchestration, Mirantis empowers platform engineering teams to deliver composable, production-ready developer platforms across any environment—on-premises, in the cloud, at the edge, or in sovereign data centers. As enterprises navigate the growing complexity of AI-driven workloads, Mirantis delivers the automation, GPU orchestration, and policy-driven control needed to manage infrastructure with confidence and agility. Committed to open standards and freedom from lock-in, Mirantis ensures that customers retain full control of their infrastructure strategy. Mirantis serves many of the world’s leading enterprises, including Adobe, DocuSign, Liberty Mutual, PayPal, Reliance Jio, Societe Generale, Splunk, and Volkswagen. Learn more at www.mirantis.com.

Job Description

We are building a European AI Infrastructure & Platform Operations team responsible for operating large-scale AI infrastructure environments powered by NVIDIA GPUs, high-performance networking, Kubernetes, and next-generation platform technologies. As a Senior AI Infrastructure & Platform Operations Engineer, you will serve as a technical leader within the operations organization, providing deep expertise across infrastructure, networking, platform operations, and service reliability. You will be responsible for driving operational excellence across complex production environments while acting as a key escalation point for critical incidents and challenging technical issues. This role combines hands-on technical operations with technical leadership, helping shape operational standards, reliability practices, automation initiatives, and the future evolution of AI-powered operational services through platforms such as k0rdent AI.

Responsibilities:

Technical Operations & Service Reliability
Lead the investigation and resolution of complex infrastructure, networking, and platform-related incidents.
Act as a senior escalation point for operational teams during critical service-impacting events.
Support large-scale NVIDIA GPU infrastructure and high-performance networking environments.
Troubleshoot complex Linux, Kubernetes, networking, storage, and hardware-related issues.
Analyze platform performance, capacity, stability, and reliability trends to proactively identify risks.
Lead root cause analysis activities and drive long-term corrective actions.
Collaborate with engineering teams, hardware vendors, and datacenter personnel to resolve complex technical challenges.
Participate in major incident management and service restoration activities.

Platform Operations & Engineering
Provide technical leadership for Kubernetes platform operations and supporting infrastructure services.
Drive improvements in platform reliability, observability, monitoring, and operational processes.
Identify opportunities to automate repetitive operational activities and improve operational efficiency.
Contribute to operational readiness reviews, infrastructure changes, upgrades, and service introductions.
Support the adoption and operation of AI-powered infrastructure services and operational capabilities through k0rdent AI.
Evaluate emerging technologies and operational practices to improve service delivery and platform resilience.

Technical Leadership
Mentor and support AI Infrastructure & Platform Operations Engineers.
Share technical knowledge through documentation, training sessions, and operational reviews.
Develop and maintain operational standards, runbooks, troubleshooting guides, and best practices.
Help define operational processes, escalation paths, and service reliability standards.
Act as a trusted technical advisor during operational planning and service improvement initiatives.

Qualifications

Required Skills & Experience:
7+ years of experience in infrastructure operations, platform operations, site reliability engineering, network operations, cloud operations, datacenter operations, or related technical roles.
Expert-level Linux administration and troubleshooting skills.
Strong networking expertise, including experience diagnosing complex performance, connectivity, and reliability issues.
Strong experience operating Kubernetes in production environments.
Experience supporting large-scale production infrastructure and distributed systems.
Proven experience leading technical investigations and managing complex incidents.
Experience performing root cause analysis and driving long-term operational improvements.
Strong understanding of observability, monitoring, and service reliability practices.
Excellent troubleshooting and analytical skills across multiple infrastructure domains.
Strong communication, collaboration, and stakeholder management skills.

Experience in one or more of the following areas is highly desirable:
NVIDIA GPU infrastructure and accelerated computing platforms.
InfiniBand networking and NVIDIA UFM.
AI infrastructure environments.
HPC environments.
Platform Engineering or Site Reliability Engineering (SRE).
Large-scale Kubernetes operations.
Infrastructure automation technologies and Infrastructure-as-Code practices.
Observability platforms such as Grafana, Prometheus, ELK, or OpenTelemetry.
Performance analysis and optimisation of distributed infrastructure platforms.
Technical leadership, mentoring, or team lead responsibilities.

Additional Information

We offer

Operate some of the most advanced AI infrastructure environments in production today.
Work with the latest NVIDIA GPU technologies, Kubernetes platforms, and high-performance networking environments.
Help define operational standards and reliability practices for next-generation AI infrastructure services.
Influence the adoption of AI-powered operational capabilities through k0rdent AI.
Work alongside highly skilled engineers solving complex infrastructure and platform challenges at scale.
Join a growing organisation investing heavily in AI infrastructure, platform services, and operational innovation.

We are a Leader for Container Management in G2 (#2 after AWS)!
