Required Qualifications
- Bachelor’s degree in Computer Science, Software Engineering, or a related technical field.
- 5+ years of experience in OMS Technical Operations, Platform Engineering, Site Reliability Engineering (SRE), or Production Support within high-volume, event-driven SaaS environments.
- Strong experience supporting Order Management Systems (OMS) and distributed integrations.
- Advanced knowledge of GraphQL queries, mutations, aliases, fragments, and variables.
- Strong understanding of REST APIs, JSON, and event-driven architectures (Pub/Sub, Kafka, Event Grid, or similar).
- Hands-on experience with observability and monitoring tools such as Splunk, Datadog, ELK Stack, or New Relic.
- Strong experience with Git-based version control and deployment processes.
- Proficiency in reading, debugging, and troubleshooting Java-based applications and custom extensions.
- Strong understanding of ITIL processes with an SRE mindset focused on automation and operational excellence.
- Excellent analytical, troubleshooting, and communication skills.
Roles & Responsibilities
Platform Reliability & Automation
- Design and implement automation solutions to improve OMS platform reliability and reduce manual intervention.
- Develop automated order remediation and recovery mechanisms for synchronization failures across integrated systems.
- Build tools and utilities using platform SDKs, APIs, and scripting to support operational efficiency.
- Drive self-healing capabilities and proactive platform monitoring.
Monitoring & Observability
- Develop and maintain dashboards to monitor API performance, GraphQL query execution, system health, and integration success rates.
- Implement and optimize alerting strategies to proactively identify stuck orders, inventory discrepancies, and integration failures.
- Analyze system metrics and trends to improve platform stability and performance.
Incident Management & Root Cause Analysis
- Serve as the technical escalation point for complex production incidents and platform issues.
- Perform deep-dive troubleshooting across application logs, integrations, APIs, and event-driven workflows.
- Lead root cause analysis efforts and implement long-term corrective actions.
- Document technical resolutions, workarounds, and operational best practices.
Performance & Platform Optimization
- Analyze API response times, integration bottlenecks, and application performance issues.
- Collaborate with engineering teams to recommend and implement platform improvements.
- Support scalability, reliability, and operational readiness initiatives.
Stakeholder & Vendor Collaboration
- Act as the technical liaison between business, architecture, engineering, and operations teams.
- Collaborate with platform vendors and internal teams on upgrades, enhancements, and release planning.
- Mentor support and operations teams on troubleshooting, GraphQL optimization, and technical best practices.
Release & Change Management
- Review and validate platform configurations, integrations, and deployments during release cycles.
- Support change management processes and operational readiness activities.
- Manage source control and branching strategies for operational fixes and configuration updates.
Preferred Qualifications
- Experience with Fluent Commerce OMS, including GraphQL APIs, Webhooks, and business rules.
- Experience supporting eCommerce, order fulfillment, or retail technology platforms.
- Familiarity with CI/CD pipelines and DevOps practices.
- Experience working in cloud-based SaaS environments.