Role Overview
Hands-on technical support role focused on maintaining and troubleshooting high-performance compute (HPC) and GPU-based AI infrastructure in a lab/data center environment. This position supports Linux-based systems, hardware infrastructure, and compute clusters.
Roles and Responsibilities
- Rack, stack, cable, and maintain servers and AI lab/data center hardware.
- Troubleshoot and repair GPU/CPU servers, compute clusters, and networking (including InfiniBand).
- Perform hardware diagnostics and replace/install server components in racks.
- Monitor and troubleshoot Linux-based systems, storage, and networking issues.
- Use Linux command-line tools for system validation, debugging, and performance checks.
- Collaborate with engineers to ensure stable operation of AI infrastructure.
- Manage lab space and support hardware inventory tracking and maintenance.
Required Skills & Experience
- 3–4+ years of experience in data centers, NOC environments, or similar technical infrastructure roles.
- Strong hands-on experience with server, GPU, and compute hardware troubleshooting.
- Solid understanding of computer architecture and high-performance compute systems.
- Strong Linux administration and troubleshooting skills.
- Ability to diagnose issues across hardware, OS, and networking stack.
- Experience performing component-level hardware installation and repair.
- Physically capable of handling equipment (lifting up to 50 lbs / 23 kg, and performing extended physical tasks).
- Basic scripting ability in Bash or Python.
- Exposure to Git, Jenkins, or similar tools is a plus.
Preferred Qualifications
- Experience with HPC or GPU compute clusters.
- Familiarity with InfiniBand and advanced networking environments.
- Hands-on experience building or maintaining custom server systems or PC hardware.
- Prior exposure to AI/ML infrastructure or research lab environments.
Education
- Bachelor’s degree preferred; equivalent hands-on experience accepted.
- Practical experience is prioritized over formal education.