Sr. Systems Design Engineer - Data Center GPU - AMD

AMD Markham

Job Description

At AMD, our mission is to build great products that accelerate next-generation computing experiences—from AI and data centers, to PCs, gaming and embedded systems. Grounded in a culture of innovation and collaboration, we believe real progress comes from bold ideas, human ingenuity and a shared passion to create something extraordinary. When you join AMD, you’ll discover the real differentiator is our culture.

THE ROLE

We are looking for a dynamic, energetic Senior Systems Design Engineer to join our growing Data Center GPU team. In this role, you will work closely with the automation, infrastructure, and validation teams to ensure scalability and reliability. You will also document processes and best practices and provide training for internal teams.

THE PERSON

As a Systems Design Engineer, you will drive balanced, scalable, and automated solutions. In this high-visibility position, your software systems engineering expertise will be essential to product definition, development, and root-cause resolution. You will have strong problem‑solving and debugging skills, excellent communication and collaboration abilities, and the ability to work in fast‑paced, cross‑functional environments.

KEY RESPONSIBILITIES

Containerization & Image Management
- Design, build, and maintain Docker images optimized for ML/AI workloads.
- Implement multi‑stage builds, image hardening, and vulnerability scanning.
- Manage Docker registries (e.g., Harbor) and enforce retention policies for large‑scale deployments.

Automation & Orchestration
- Develop and maintain Python‑based automation scripts for Conductor workflows.
- Implement CI/CD pipelines for automated container builds and workload deployment.
- Integrate orchestration frameworks (Conductor, Kubernetes, Slurm) for multi‑node workload execution.

ML/AI Workload Enablement
- Enable training and inference workloads using frameworks such as PyTorch, TensorFlow, and vLLM.
- Optimize distributed training and inference across multi‑node clusters using MPI and RDMA.
- Collaborate with app experts to benchmark and tune performance for AI/HPC workloads.

Infrastructure & Performance
- Integrate ROCm stack and GPU resource management into containerized environments.
- Troubleshoot latency, networking, and storage bottlenecks for at‑scale workloads.
- Implement monitoring and logging for containerized ML workloads.

PREFERRED EXPERIENCE

Strong proficiency in Python and automation frameworks.

Hands‑on experience with container tooling (Docker, Podman) and container orchestration (Kubernetes).

Familiarity with CI/CD tools (Jenkins, GitHub Actions) and infrastructure‑as‑code (Terraform, Ansible).

Knowledge of ML frameworks (PyTorch, TensorFlow) and GPU acceleration (ROCm, CUDA).

Understanding of RDMA networking and MPI for distributed workloads.

Prior experience enabling ML/AI workloads in production or HPC environments.

Exposure to orchestration platforms like Conductor or similar workflow engines.

ACADEMIC CREDENTIALS

Bachelor's or Master's degree in electrical or computer engineering, with a minimum of 5‑7 years of relevant experience.

LOCATION

Markham, ON

BENEFITS

Benefits offered are described in AMD benefits at a glance.

BASE PAY RANGE

$116,000.00/yr – $174,000.00/yr

LEGAL STATEMENTS

AMD does not accept unsolicited resumes from headhunters, recruitment agencies, or fee‑based recruitment services. AMD and its subsidiaries are equal‑opportunity, inclusive employers and will consider all applicants without regard to age, ancestry, color, marital status, medical condition, mental or physical disability, national origin, race, religion, political and/or third‑party affiliation, sex, pregnancy, sexual orientation, gender identity, military or veteran status, or any other characteristic protected by law. We encourage applications from all qualified candidates and will accommodate applicants’ needs under the respective laws throughout all stages of the recruitment and selection process.

Seniority Level: Mid‑Senior level

Employment Type: Full‑time

Job Function: Semiconductor Manufacturing
