Site Reliability Engineer (Splunk, Prometheus, Grafana) Hybrid

100% Remote Full-time Open now

Note: Hello! The manager would like to hold a Supplier Call to better explain the specifics they are looking for in this role. Please do not add anyone until after this call. I have scheduled this call for Friday 9/6/24 at 1 pm EST.

Description: This is a Site Reliability Engineer role for the Sam's Cash Application team. Role and responsibilities include:

Production Ticket Handling and Troubleshooting: Requires knowledge of: strong analytical and problem-solving skills; root cause analysis (RCA); root cause corrective action (RCCA). Guides team members in RCA and RCCA to identify the origins of defects and performance gaps and to prevent them. Analyzes complex problems involving multiple parties, networks, hardware, software, and cloud computing technologies.
Weighs immediate service restoration against root-cause investigation based on consequences and resource requirements. Analyzes issues and plans a series of steps to improve an application's availability and reliability, potentially including reconfiguration, integration, removal, or addition of application components. Analyzes trends to proactively prevent incidents and provides historical summary reports.
Disaster Recovery Planning: Requires knowledge of: disaster recovery procedures and processes; enterprise disaster recovery systems. Coordinates partial and full tests of contingency and disaster recovery plans. Creates and maintains data center contingency documents and action plans. Defines and documents contingency and disaster recovery procedures. Leads the identification of critical functions for the assigned area of responsibility. Creates and tests plans for operating in a remote backup environment. Coordinates the day-to-day activities of control measures used in recovery plans.
Monitoring and Alerting: Requires knowledge of: monitoring and alerting tools (Splunk, Prometheus, Grafana); monitoring metrics and key performance indicators (for example, availability, MTBF, MTTR); SLIs and SLOs (for example, request latency, availability, error rates, saturation); distributed tracing; alerting logic.
Establishes metrics to monitor network, software, or system performance. Establishes SLOs/SLAs to determine availability goals of systems and services. Sets alerting priorities by identifying the most important systems based on criticality. Oversees daily system monitoring, including verifying the integrity and availability of all hardware and services, reviewing system and application logs, and verifying the completion of scheduled jobs.
Leads end-to-end audits of monitors and alarms based on subsystem knowledge. Provides proactive updates to executive leadership on potential customer-impacting issues. Analyzes systems and makes recommendations to prevent possible incidents using knowledge of complex, company-wide systems.

Data Reporting and Metrics:

Advanced SQL skills to pull complex data reports from multiple sources; familiarity with Databricks or GCP BigQuery; ability to write advanced Splunk queries that join multiple indexes to stitch data together; use of a data-driven decision-making process to analyze the impact of production issues and prioritize them.

Additional Information:

What project or initiative will they be working on?

Sam's Cash Reward Project

Will this role be hybrid?

If hybrid, how many days per week will they need to be in the office?

2-3 times a week

Top 3 Skills Needed or Required

Strong technical, analytical, and problem-solving skills; experience triaging and troubleshooting production issues
Monitoring and alerting skills (Splunk, Prometheus, Grafana)
Data reporting and metrics skills (SQL, Python, PySpark, Databricks)

What is the makeup of the team?

Team of 8 engineers, including Java backend engineers, Site Reliability Engineer, and Data Engineers, supporting Sam's Cash Core Application Operations.

Additional Job Details

Location can be Sunnyvale, CA; Bentonville, AR; or Dallas, TX.

Required Skills: Grafana
Additional Skills: Cloud Developer
