Staff Software Engineer, Machine Learning Operations

Location: Chicago, Illinois, United States

Company: Grainger Businesses

Job Type: Full-time, Part-time

Salary: $122000 - $202000

Posted: Aug 28, 2025

Application deadline: Nov 7, 2025

Source: ZipRecruiter


Work Location Type: Hybrid

Req Number 323339

About Grainger:

W.W. Grainger, Inc., is a leading broad line distributor with operations primarily in North America, Japan and the United Kingdom. At Grainger, We Keep the World Working by serving more than 4.5 million customers worldwide with products and solutions delivered through innovative technology and deep customer relationships. Known for its commitment to service and award-winning culture, the Company had 2024 revenue of $17.2 billion across its two business models. In the High-Touch Solutions segment, Grainger offers approximately 2 million maintenance, repair and operating (MRO) products and services, including technical support and inventory management. In the Endless Assortment segment, Zoro.com offers customers access to more than 14 million products, and MonotaRO.com offers more than 24 million products. For more information, visit www.grainger.com.

Compensation:

The anticipated base pay compensation range for this position is $121,500.00 to $202,500.00.

Rewards and Benefits:

With benefits starting on day one, our programs provide choice and flexibility to meet team members' individual needs, including:

  • Medical, dental, vision, and life insurance plans with coverage starting on day one of employment and 6 free sessions each year with a licensed therapist to support your emotional wellbeing.
  • 18 paid time off (PTO) days annually for full-time employees (accrual prorated based on employment start date) and 6 company holidays per year.
  • 6% company contribution to a 401(k) Retirement Savings Plan each pay period, no employee contribution required.
  • Employee discounts, tuition reimbursement, student loan refinancing and free access to financial counseling, education, and tools.
  • Maternity support programs, nursing benefits, and up to 14 weeks paid leave for birth parents and up to 4 weeks paid leave for non-birth parents.

For additional information and details regarding Grainger's benefits, please click on the link below:

https://experience100.ehr.com/grainger/Home/Tools-Resources/Key-Resources/New-Hire

The pay range provided above is not a guarantee of compensation. The range reflects the potential base pay for this role at the time of this posting based on the job grade for this position. Individual base pay compensation will depend, in part, on factors such as geographic work location and relevant experience and skills.

The anticipated compensation range described above is subject to change and the compensation ultimately paid may be higher or lower than the range described above.

Grainger reserves the right to amend, modify, or terminate its compensation and benefit programs in its sole discretion at any time, consistent with applicable law.

Position Details:

The Machine Learning Platform & Operations team is focused on enabling machine learning scientists and engineers at Grainger to continuously develop, deploy, monitor, and refine machine learning models, as well as on improving the ML software development process. Our mission is to empower Grainger teams to effortlessly build, ship, and scale reliable machine learning, data science, and analytical solutions by proactively listening to our users, anticipating Grainger's evolving needs, and delivering self-service, quality-first platforms that accelerate business outcomes. You will work with machine learning, data engineering, network, security, and platform engineering teams to build core components of a scalable, self-service machine learning platform that powers customer-facing applications. You will play an important part in developing the tools and services that form the backbone of Grainger's AI-driven features, leveraging methods in Deep Learning, Natural Language Processing / Generative AI, Computer Vision, and beyond. This is an exciting opportunity to join a team fueling the next phase in Grainger Technology Group's data- and AI-driven modernization.

Our team is organized around three focus areas:

  • Machine Learning Operations & Infrastructure: Build and maintain core infrastructure components (e.g., Kubernetes clusters) and tooling enabling self-service development and deployment of a variety of applications leveraging GitOps practices.
  • Machine Learning Platform: Design and develop user-friendly software systems and interfaces supporting all stages of the machine learning development lifecycle.
  • Machine Learning Effectiveness & Enablement: Guide, partner, and consult with machine learning, product, and business domain teams from across the organization to foster responsible, scalable, and efficient development of high-quality ML systems.

For this role, we seek an individual with deep experience administering and maintaining scalable cloud infrastructure components to continue driving quality and reliability of our Machine Learning Operations & Infrastructure focus area. If you are passionate about driving improvements in system reliability and availability and are excited by the challenge of supporting high scale machine learning systems, this is the role for you.

You Will:

  • Build self-service and automated components of the machine learning platform to enable the development, deployment, and monitoring of machine learning models.
  • Design, monitor, and improve cloud infrastructure solutions that support applications executing at scale. Optimize infrastructure spend by conducting utilization reviews, forecasting capacity, and driving cost/performance tradeoffs for training and inference.
  • Architect multi-cluster/region topologies (e.g., with High Availability (HA), Disaster Recovery (DR), failover/federation, blue/green) for ML workloads and lead progressive delivery (canary, auto-rollback) patterns in CI/CD.
  • Ensure a rigorous deployment process using DevOps (GitOps) standards and mentor users in software development best practices. Evolve CI/CD from repo-local workflows to reusable pipeline templates with quality/performance gates; standardize GitOps objects/guardrails (e.g., Argo CD Applications/Projects, policy-as-code).
  • Define org-wide observability standards (logs/metrics/traces schemas, retention) for ML system and model reliability; drive adoption across teams and integrate with enterprise tools (Prometheus/Grafana + Splunk/Datadog).
  • Collaborate with the SRE team to define and drive SRE standards for ML systems by setting and reviewing SLOs/error budgets, partnering on org-wide reliability scorecards and improvement plans, and scaling blameless RCA rituals.
  • Institute compatibility and deprecation/versioning policies for clusters and runtimes; integrate enterprise SSO (Okta/AD) and define RBAC scopes across clusters / pipelines.
  • Own multi-component roadmap initiatives that measurably move platform & reliability OKRs; communicate major changes and incidents to org-wide forums and host cross-team design sessions.
  • Partner with teams across the business to enable reliable adoption of ML by hosting internal workshops, publishing playbooks/templates, and advising teams on adopting platform patterns safely.

You Have:

  • Bachelor's degree and 7+ years' relevant work experience or equivalent staff-level impact in platform / infrastructure roles.
  • Strong software engineering fundamentals and experience developing production-grade software; experience with Python, Golang, or a similar language preferred.
  • Experience leading org-wide platform initiatives (e.g., multi-cluster K8s, CI/CD platform evolution, observability standards) and mentoring senior engineers.
  • Strong working knowledge of cloud-based services as well as their capabilities and usage; AWS preferred.
  • Expertise with IaC tools and patterns to provision, manage, and deploy applications to multiple environments (e.g., Terraform, Ansible, Helm).
  • Deep expertise with GitOps practices and tools (Argo CD app-of-apps, RBAC, sync policies) as well as policy-as-code (OPA/Kyverno) for safe rollouts.
  • Familiarity with application monitoring and observability tools and integration patterns (e.g., Prometheus/Grafana, Splunk, Datadog, ELK).
  • Deep, hands-on experience with containers and Kubernetes (cluster operations/upgrades, HA/DR patterns).
  • Ability to work collaboratively and empathetically in a team environment.

Bonus:

  • Expertise in designing, analyzing, and troubleshooting large-scale distributed systems and/or working with accelerated compute (e.g., GPUs).
  • Experience driving machine learning system reliability and awareness of associated requirements (e.g., model/feature drift telemetry, evaluation services, and model-routing layers integrated with CI/CD).
  • You've built pragmatic Kubernetes extensions (think small CRDs or admission webhooks), helped teams adopt OpenTelemetry to standardize traces/metrics/logs, and led safe, multi-cluster Kubernetes upgrades with staged rollouts, thorough testing, and clean rollback.

Don't meet every single qualification? Studies show people are hesitant to apply if they don't meet all requirements listed in a job posting. If you feel you don't have all the desired experience, but it otherwise aligns with your background and you're excited about this role, we encourage you to apply. You could be a great candidate for this or other roles on our team.

We are committed to equal employment opportunity regardless of race, color, ancestry, religion, sex (including pregnancy), national origin, sexual orientation, age, citizenship, marital status, disability, gender identity or expression, protected veteran status or any other protected characteristic under federal, state, or local law. We are proud to be an equal opportunity workplace.

We are committed to fostering an inclusive, accessible work environment, which includes providing reasonable accommodations to individuals with disabilities during the application and hiring process as well as throughout the course of one's employment. Should you need a reasonable accommodation during the application and selection process, including, but not limited to, use of our website or any part of the application, interview, or hiring process, please advise us so that we can provide appropriate assistance.
