AI Cloud Engineer

LockedIn AI

33 Irving Pl, Manhattan, New York, NY 10003, United States

Job Category

Development

Job Description

# AI Cloud Engineer

**Location:** Remote (US) · Optional Hybrid in New York, NY
**Employment Type:** Full-Time
**Department:** Engineering
**Compensation:** $140,000 – $195,000 USD + Equity

## About LockedIn AI

LockedIn AI is the world's leading real-time AI interview and meeting copilot, trusted by more than 1 million users globally. We help professionals perform at their best during live interviews, coding assessments, and high-stakes meetings through advanced AI-powered assistance.

As we continue to scale our platform and AI capabilities, we're looking for an exceptional **AI Cloud Engineer** to build and optimize the cloud infrastructure that powers our machine learning systems, real-time inference workloads, and next-generation AI products.

## The Opportunity

We are seeking a cloud-native engineer who understands both modern cloud platforms and AI infrastructure. In this role, you'll design, deploy, and optimize the systems that support model training, fine-tuning, evaluation, and production inference at scale.

You'll work at the intersection of cloud engineering, AI operations, platform architecture, and cost optimization, ensuring our AI services remain reliable, scalable, secure, and cost-efficient as our user base continues to grow.

If you're passionate about GPUs, Kubernetes, AI workloads, infrastructure automation, and building systems that serve millions of users, we'd love to meet you.

## What You'll Do

### Architect AI-Native Cloud Infrastructure

* Design and build cloud environments optimized for AI and machine learning workloads.
* Create scalable GPU training and inference clusters on AWS, GCP, or Azure.
* Develop cloud architectures supporting model training, staging, evaluation, and production environments.
* Implement elastic scaling strategies for both training and inference workloads.

### Build Production AI Serving Infrastructure

* Deploy and manage large-scale AI inference systems with high availability and low latency.
* Operate model serving frameworks such as vLLM, Triton Inference Server, TensorRT, TGI, or similar technologies.
* Optimize inference performance through batching, GPU utilization, memory optimization, and intelligent routing.
* Design resilient load-balancing and failover strategies for AI services.

### Manage GPU & Training Platforms

* Provision GPU environments for model development and experimentation.
* Configure distributed training infrastructure across multiple GPUs and nodes.
* Implement job scheduling, resource allocation, and automated provisioning systems.
* Manage cloud AI platforms including SageMaker, Vertex AI, Azure ML, or equivalent services.

### Optimize Cloud Costs

* Monitor and reduce infrastructure spend across compute, networking, storage, and AI services.
* Implement FinOps practices for AI workloads.
* Improve GPU utilization and resource efficiency.
* Build dashboards and reporting systems for cloud cost visibility.

### Security & Compliance

* Design secure cloud networking architectures.
* Implement IAM policies, encryption, secrets management, and audit logging.
* Protect AI assets including model weights, embeddings, datasets, and inference endpoints.
* Support compliance initiatives including SOC 2, GDPR, and CCPA readiness.

### Infrastructure Automation & Observability

* Manage infrastructure through Terraform, Pulumi, CloudFormation, or similar tools.
* Build automated deployment and provisioning pipelines.
* Create monitoring and alerting systems for GPU health, inference performance, and infrastructure reliability.
* Ensure high availability and operational excellence across all AI systems.

## Required Qualifications

### Experience

* 3+ years of experience in Cloud Engineering, DevOps, Infrastructure Engineering, or related fields.
* Experience supporting AI/ML workloads in production environments.
* Hands-on expertise with GPU infrastructure and cloud AI services.
* Experience collaborating with AI, ML, and platform engineering teams.
* Startup or high-growth company experience is highly valued.

### Technical Skills

* Strong programming skills in Python, Go, Bash, or similar languages.
* Deep knowledge of AWS, GCP, or Azure cloud services.
* Strong Kubernetes administration and deployment experience.
* Experience with model serving technologies and AI inference platforms.
* Proficiency with Infrastructure-as-Code tools such as Terraform or Pulumi.
* Experience with monitoring platforms including Prometheus, Grafana, Datadog, or CloudWatch.

### Soft Skills

* Strong problem-solving and systems-thinking abilities.
* Excellent communication and technical documentation skills.
* Ability to operate independently and take ownership of critical infrastructure.
* Passion for building scalable AI systems and improving operational efficiency.

## Preferred Qualifications

* Experience running large-scale LLM inference platforms.
* Knowledge of distributed training systems and multi-node GPU clusters.
* Familiarity with NCCL, RDMA, InfiniBand, or high-performance networking.
* Experience supporting real-time streaming applications.
* Multi-cloud infrastructure experience.
* Previous startup or early-stage company experience.
* Open-source contributions or technical publications.

## Why Join LockedIn AI?

### Own Critical AI Infrastructure

You'll build and operate the cloud foundation that powers every AI capability across our platform.

### Massive Scale

Your work will directly impact over 1 million users worldwide.

### Equity & Ownership

Receive meaningful equity and help shape the future of the company.

### Remote-First Culture

Work from anywhere in the US with optional collaboration opportunities in New York City.

### Accelerated Growth

Join a rapidly growing AI company at a pivotal stage of expansion.

### Work at the AI Frontier

Collaborate with talented engineers and build products powered by the latest advancements in AI and machine learning.

## Benefits

* Competitive salary ($140,000–$195,000)
* Meaningful equity package
* Remote-first work environment
* Flexible schedule
* Career growth opportunities
* Access to cutting-edge AI technologies
* High-impact work with real business outcomes

## How to Apply

Please submit:

* Your resume or CV
* A brief note covering:
  * Why you're interested in joining LockedIn AI
  * Whether you've tried our product
  * Ideas or improvements you'd make to the platform
* Optional: GitHub profile, technical blog, open-source contributions, or portfolio projects

We're excited to hear from engineers passionate about building the infrastructure behind the next generation of AI-powered products.

**LockedIn AI is an equal opportunity employer. We celebrate diversity and are committed to creating an inclusive environment for all team members.**


Apply online via the website.