HPC Specialist – Role Overview

Arca dion is seeking an HPC (High-Performance Computing) Specialist responsible for the design, deployment, optimization, and management of high-performance computing systems, often used in scientific research, engineering simulations, AI/ML workloads, and large-scale data analytics.

Core Responsibilities

Architecture & System Design
- Design scalable HPC clusters (on-prem, cloud, or hybrid)
- Choose appropriate CPUs, GPUs, interconnects (e.g., InfiniBand), and storage
- Configure Slurm, PBS, or OpenHPC job schedulers

Cluster Deployment & Maintenance
- Install and manage Linux-based compute nodes
- Maintain job schedulers and resource managers
- Integrate monitoring tools (Prometheus, Grafana, Nagios)

Performance Tuning & Optimization
- Benchmark workloads and tune for performance (e.g., MPI, CUDA, OpenMP)
- Optimize I/O and inter-node communication
- Ensure efficient job execution and queue handling

Cloud & Hybrid HPC Integration
- Deploy and manage cloud-based HPC environments (Azure CycleCloud, AWS ParallelCluster, Google Cloud HPC Toolkit)
- Optimize workload portability and orchestration (e.g., Singularity, Kubernetes with KubeFlow or Volcano)

AI/ML & GPU Workload Support
- Manage AI pipelines that require HPC acceleration (e.g., LLM training)
- Optimize GPU usage (NVIDIA A100/H100, AMD MI300)
- Interface with TensorFlow, PyTorch, and HPCML tools

Security & Compliance
- Implement security best practices for multi-user environments
- Support data governance for sensitive or regulated workloads
- Maintain audit trails and role-based access control

User & Application Support
- Assist researchers and data scientists with job submissions and optimization
- Develop documentation, training materials, and run code validation sessions

Technical Skills & Tools

Schedulers: Slurm, PBS, Torque, LSF
HPC OS & Config: RHEL/CentOS, Rocky, Ubuntu Server
HPC File Systems: Lustre, BeeGFS, GPFS
Parallel Computing: MPI, OpenMP, CUDA
Monitoring/Telemetry: Prometheus, Grafana, Ganglia
Cloud HPC: AWS HPC, Azure CycleCloud, GCP HPC Toolkit
Containers: Singularity, Apptainer, Docker, Kubernetes
AI/ML Support: PyTorch, TensorFlow, Horovod, MLFlow
DevOps Tools: Ansible, Terraform, Git, Jenkins

Qualifications
- Bachelor’s or Master’s in Computer Science, Engineering, Physics, or a related technical field
- 3–7 years of experience in HPC environments
- Experience supporting AI/ML teams is a major asset
- Certifications: NVIDIA DLI, AWS Certified HPC Specialist, Linux+ or RHCE