Resume

MLOps Engineer specializing in distributed LLM training and inference optimization. Currently leading GPU cluster infrastructure for SEA-LION at AI Singapore.

Skills

Technical: Python, Rust, SQL | PyTorch, HuggingFace, DeepSpeed, Megatron-LM | Docker, Slurm, AWS

AI/ML: LLM Training & Fine-tuning, Multi-GPU/Multi-node Training, Computer Vision, NLP, MLOps

DevOps: Monitoring & Logging, IaC, Terraform, Ansible

Education

JUN 2020 - MAY 2025 Singapore University of Social Sciences - BSc (Hons) Mathematics with a Minor in Data Science

Experience

DEC 2023 - PRESENT AI Singapore - AI Engineer, Infrastructure

  • Configured distributed training environments for multi-node LLM training with Megatron-LM and llm-foundry
  • Deployed a multi-node Slurm cluster with Pyxis/Enroot support on bare-metal DGX H100/H200 servers using Ansible
  • Developed monitoring and alerting system using Mimir, Prometheus and Grafana for cluster occupancy and GPU health
  • Migrated from EC2 to AWS ParallelCluster for cost savings and easier management of GPU instances
  • Implemented lifecycle policies for S3 buckets, reducing storage costs by up to 50%
  • Rewrote Python data processing code in Rust, achieving an 80% performance improvement
  • Built an onboarding and training programme for new employees and conducted hiring interviews for the role
  • Led knowledge sharing sessions on HPC and MLOps best practices for team leaders and members
  • Presented infrastructure architecture decisions to cross-functional teams for alignment
  • Created documentation and tutorials for cluster usage and MLOps workflows
  • Experimented with Slinky clusters, KAI Scheduler, and NVIDIA MIG

Key skills: Slurm, Ansible, PyTorch, Infrastructure, MLOps, AWS, FinOps, Rust

FEB 2023 - NOV 2023 AI Singapore - AI Apprentice (CAIE Associate AI Engineer)

  • Designed a document understanding model trained on synthetically generated data, eliminating the need for traditional bounding-box labeling
  • Built a document retrieval chatbot using LangChain and OpenAI function calling
  • Established a CI/CD pipeline with GitHub Actions for automated testing and linting
  • Coached teams on Deep Learning, Computer Vision, and NLP
  • Recipient of the “Outstanding Apprenticeship Award”

Key skills: Python, PyTorch, NLP, Computer Vision, MLOps, RAG, LangChain

DEC 2017 - FEB 2023 Republic of Singapore Air Force - Air Force Engineer

  • Specialist in vibration analysis and balancing for helicopter systems
  • Performed root cause analysis and troubleshooting for aircraft system rectifications
  • Managed multi-team task scheduling and coordination for bilateral mission preparations

Certifications

  • AWS Certified Cloud Practitioner
  • AWS Solutions Architect - Associate
  • AWS Certified Machine Learning - Associate