HPC Specialist – Role Overview

Arca dion is seeking an HPC (High-Performance Computing) Specialist responsible for the design, deployment, optimization, and management of high-performance computing systems used in scientific research, engineering simulations, AI/ML workloads, and large-scale data analytics.

Core Responsibilities

Architecture & System Design
- Design scalable HPC clusters (on-prem, cloud, or hybrid)
- Choose appropriate CPUs, GPUs, interconnects (e.g., InfiniBand), and storage
- Configure job schedulers such as Slurm or PBS, and cluster software stacks such as OpenHPC

Cluster Deployment & Maintenance
- Install and manage Linux-based compute nodes
- Maintain job schedulers and resource managers
- Integrate monitoring tools (Prometheus, Grafana, Nagios)

Performance Tuning & Optimization
- Benchmark workloads and tune for performance (e.g., MPI, CUDA, OpenMP)
- Optimize I/O and inter-node communication
- Ensure efficient job execution and queue handling

Cloud & Hybrid HPC Integration
- Deploy and manage cloud-based HPC environments (Azure CycleCloud, AWS ParallelCluster, Google Cloud HPC Toolkit)
- Optimize workload portability and orchestration (e.g., Singularity, Kubernetes with Kubeflow or Volcano)

AI/ML & GPU Workload Support
- Manage AI pipelines that require HPC acceleration (e.g., LLM training)
- Optimize GPU usage (NVIDIA A100/H100, AMD MI300)
- Interface with TensorFlow, PyTorch, and HPCML tools

Security & Compliance
- Implement security best practices for multi-user environments
- Support data governance for sensitive or regulated workloads
- Maintain audit trails and role-based access control

User & Application Support
- Assist researchers and data scientists with job submissions and optimization
- Develop documentation and training materials, and run code validation sessions

Technical Skills & Tools
- Schedulers: Slurm, PBS, Torque, LSF
- HPC OS & Config: RHEL/CentOS, Rocky, Ubuntu Server
- HPC File Systems: Lustre, BeeGFS, GPFS
- Parallel Computing: MPI, OpenMP, CUDA
- Monitoring/Telemetry: Prometheus, Grafana, Ganglia
- Cloud HPC: AWS HPC, Azure CycleCloud, GCP HPC Toolkit
- Containers: Singularity, Apptainer, Docker, Kubernetes
- AI/ML Support: PyTorch, TensorFlow, Horovod, MLflow
- DevOps Tools: Ansible, Terraform, Git, Jenkins

Qualifications
- Bachelor’s or Master’s in Computer Science, Engineering, Physics, or a related technical field
- 3–7 years of experience in HPC environments
- Experience supporting AI/ML teams is a major asset
- Certifications: NVIDIA DLI, AWS Certified HPC Specialist, Linux+ or RHCE