Job Overview

About Us

Nanyang Biologics (NYB) is a biotechnology company leveraging AI to revolutionize drug discovery and develop innovative therapeutics inspired by nature. By advancing AI-driven solutions in the pharmaceutical industry, we address the complex and costly challenges of discovering effective treatments for difficult diseases.

Our Team

Our team has a global track record of delivering projects with leading companies across the USA, India, Japan, Korea, Singapore, and other countries. Based in Vietnam and Singapore, our experts work collaboratively across functions and draw on global expertise to drive innovation and excellence.

Who We Are Looking For

We're hiring a Senior MLOps Engineer to own the reliability and scale of our GPU compute platform. Vecura runs 300+ scientific AI tools (protein structure prediction, molecular dynamics, docking, and more) across a wide range of GPU types, on both serverless and self-hosted deployments spanning cloud and on-prem. You'll own the platform layer between infrastructure and models: how GPU jobs are scheduled, queued, isolated, observed, and recovered. You think in SLOs and design for repeatability, building standard systems that scale across hundreds of models rather than one-off deployments. You'll work alongside our DevOps engineer (infra/cluster) and our AI engineers (model onboarding), owning the orchestration and reliability surface that connects them.

Roles and Responsibilities

Own the success-run ratio of GPU workloads as a measurable SLO; drive it up and keep it there.

Build and operate the GPU job scheduling and queueing layer: fair-share allocation, prioritization, backpressure, and recovery across a heterogeneous fleet.

Implement GPU partitioning and sharing (MIG, MPS, time-slicing) to raise utilization without destabilizing runs.

Profile and right-size workloads by per-model GPU memory, runtime, and failure characteristics; eliminate OOMs and silent failures.

Define a standard packaging/deployment contract for new models so onboarding is repeatable, not bespoke.

Build observability for the run lifecycle (metrics, logs, traces, alerting) so failures are caught and diagnosed fast.

Harden the orchestration stack (workflow engine, durable execution, retries/failover) against real failure modes.

Partner with the DevOps engineer on cluster/networking and with AI engineers to make their models production-ready.

Key Skills

Python, Kubernetes, AWS, GCP, machine learning, onboarding, AI

Requirements

5+ years in MLOps / ML platform / GPU systems engineering, with direct ownership of production reliability.

Deep experience operating GPU workloads at scale (NVIDIA stack: CUDA, drivers, GPU Operator, MIG/MPS).

Strong background in workload orchestration and scheduling — Kubernetes (Jobs/batch), Ray, Slurm, or equivalent.

Hands-on managed-ML platform experience on at least one major cloud, with working familiarity with the other: GCP (Cloud Run, Vertex AI) or AWS (SageMaker).

Solid understanding of cloud architecture (compute, networking, storage, IAM) across hybrid cloud + on-prem.

Proven track record raising reliability/utilization of a heterogeneous GPU fleet.

Solid software engineering (Python and one systems language) — you build platform tooling, not just configure it.

Observability and SRE fundamentals: SLOs, metrics, tracing, incident response.

Benefits

We provide a dynamic, fast-paced, and collaborative environment where problem-solving and agility are at the heart of what we do.

Along with a competitive salary, we foster a culture that values ambition, confidence, and humility, consistently pushing the boundaries of innovation.

If you're excited about working in a young, talented tech company and want to explore the world of AI and pharmaceuticals, we encourage you to apply.

Competitive salary (negotiable based on experience).

Workplace: No.45-57, Tran Xuan Soan, Hai Ba Trung, Ha Noi (Monday to Friday, 9:00-17:00).

Build a professional network through collaborations with pharmaceutical companies, industry leaders, and academic experts.

Work on impactful projects that address critical challenges in drug discovery and healthcare.

Employees are entitled to 2 work-from-home days per month, along with daily lunch provided by the company.

Holiday and Tet bonuses; performance-based bonus.

Social insurance contribution on full salary.

Additional Information

Role Signals

Operations