Lead Site Reliability Engineer

Long Finch Technologies

Irving, TX

Posted On: Oct 03, 2024

Job Overview

Job Type

Contract - W2, Contract to Hire - W2, Contract - Independent, Contract to Hire - Independent

Experience

5 - 10 Years

Salary

Depends on Experience

Work Arrangement

On-Site

Travel Requirement

Required Skills

Lead

Job Description

In this role, you will:

· Lead complex technology initiatives including those that are companywide with broad impact

· Act as a key participant in developing standards and companywide best practices for engineering complex and large-scale technology solutions for technology engineering disciplines

· Design, code, test, debug, and document for projects and programs

· Review and analyze complex, large-scale technology solutions for tactical and strategic business objectives, enterprise technological environment, and technical challenges that require in-depth evaluation of multiple factors, including intangibles or unprecedented technical factors

· Make decisions in developing standards and companywide best practices for engineering and technology solutions that require an understanding of industry best practices and new technologies, while influencing and leading the technology team to meet deliverables and drive new initiatives

· Collaborate and consult with key technical experts, senior technology team, and external industry groups to resolve complex technical issues and achieve goals

· Independently troubleshoot and analyze production job failures related to data, network file delivery, and server and application issues, and provide solutions for recovery. Participate in root cause analysis and preventative actions to avoid recurring incidents.

· Participate in the buildout of automation to prevent problem recurrence, with the goal of automating response to all non-exceptional service conditions.

· Apply a technology background in software engineering and systems engineering to ensure the applications on-boarded to SRE are available, have full-stack observability, are integrated with CI/CD, and are always-on. Do so by introducing continuous improvement through code and automation, applying continuous testing (performance, functional), and providing operational insight through analytics.

· Assess the availability of critical business flows, identify service level objectives and indicators, and conduct destructive and resiliency testing to reach 99.995% availability for the firm's critical products and services, leading to improved customer experience and customer satisfaction.

· Develop original and/or complex code, provide coding guidance/review, and create documentation

· Introduce enterprise capabilities, tools, and innovation to improve availability in a multi-cloud ecosystem by evolving observability, monitoring, logging, CI/CD integration, continuous testing (performance, functional), continuous improvement, and standardization/automation of key SRE metrics and IT Service Operations processes.

· Evolve continuous inspection capabilities for code quality to identify problems before they manifest in production.

· Introduce and expand AIOps and robotic process automation (RPA) to solve complex operational and systemic issues and to improve availability of products to customers.

· Share support responsibilities for critical applications: identify systemic issues, conduct blameless postmortems and root cause analysis, and introduce strategic solutions in code that solve the problem and eliminate repeat issues.

· Be willing to work non-standard business hours on an on-call basis in a 24x7x365 environment.

· Lead projects or teams, or serve as a peer mentor

Required Qualifications:

· 5+ years of Software Engineering experience, or equivalent demonstrated through one or a combination of the following: work experience, training, military experience, education

· 5+ years of troubleshooting and systems administration experience across multiple OS platforms: Solaris, AIX, PKS, Kubernetes, OpenShift, Linux, Windows, VMware

· 3+ years of experience with web platforms: Java, Apache, Tomcat, WebLogic, Oracle

· 2+ years of experience with database technologies: Basic SQL, Cassandra, Oracle, PostgreSQL

· 2+ years of experience with observability tools: Traffic Manager, Message Processor, AppDynamics, Filebeat, Basemon, etc.

· 2+ years of experience using logging/monitoring tools: ELK, Filebeat, Splunk, Netcool, SiteScope, Kafka

Desired Qualifications:

· 5+ years of software development experience with languages such as Perl, Python, Java, JavaScript, Ruby, JSON, Angular, NodeJS

· 2+ years of experience with automation scripting: Bash, Shell, Ansible, Terraform, Azure DevOps

· 1+ year of experience with cloud technologies: PCF, Azure, AWS, GCP, etc.

· 2+ years of Incident Management System experience

· 2+ years of experience with Agile Scrum (Daily Standup, Sprint Planning, and Sprint Retrospective meetings)

· 2+ years of experience using JIRA

· 2+ years of experience with data services platforms: Big Data, Data Lake, Hadoop, Spark

· 1+ year of experience with AIOps tools: BigPanda, Moogsoft

· Experience with one or more CI/CD pipelines (GitHub, Jenkins) and automation tools: Gradle, Maven, Git, Ansible, Puppet

· Experience with one or more observability/monitoring tools: Elastic, Kibana, Grafana, AppDynamics, Kafka, BigPanda, Splunk

· Experience with one or more Data/Data Structures: Kafka, Apache Airflow, Logstash, Spark, Oracle, SQL, Mongo, Hadoop, Cloudera, AWS EMR, S3

· Knowledge of one or more additional capabilities: UiPath, Robotic Process Automation, and Capacity Management

· An industry standard certification

Job ID: LF240006

Posted By

Ashish Negi