Full Time

Lead Machine Learning Engineer, LLM Infrastructure - Salesforce, Inc. - Palo Alto, CA (+1 other)

Salesforce, Inc.

Palo Alto, CA (+1 other)
Posted today

About the Role

We are seeking a Lead ML Engineer, LLM Post-Training Infrastructure to join the Salesforce AI Research Incubation Team. In this role, you will own the infrastructure and engineering systems that support LLM post-training, large-scale evaluation, and model deployment. You will build scalable, reliable pipelines for training orchestration, rollout generation, reward and feedback pipelines, experiment management, and model iteration, helping translate research ideas into production-grade systems.

This is an engineering-first role focused on ML infrastructure, distributed systems, and training/evaluation workflows rather than developing new model architectures or algorithms. You will work closely with research scientists, agent engineers, and platform teams to operationalize post-training and feedback-driven learning methods into robust, reusable systems. This is a lead-level individual contributor role with deep ownership of model-facing infrastructure and strong cross-functional influence.

Key Responsibilities:

● Design, build, and maintain infrastructure for LLM post-training, evaluation, and deployment.

● Own scalable pipelines for training orchestration, rollout generation, reward and feedback processing, checkpointing, and experiment management.

● Build reliable systems for feedback-driven model improvement, including human or AI feedback loops, large-scale offline evaluation, and regression detection.

● Partner closely with research scientists to turn new post-training methods into reusable engineering workflows.

● Collaborate with agent engineers and platform teams to integrate training and evaluation systems with production model and agent stacks.

● Optimize distributed training and inference workloads for reliability, throughput, cost efficiency, and observability.

● Drive best practices for reproducibility, versioning, monitoring, deployment, and operational excellence across ML systems.

Required Qualifications:

● 5+ years of