- Automation
Rather than performing operational functions manually, the team aims to automate them. Such functions include:
- Continuous integration and continuous delivery
- Incident response
- Alerts
- Monitoring
They also monitor critical applications and services to minimize downtime and ensure their availability.
- Issue resolution
When a developer runs into a problem, this team investigates and resolves the issue.
Following the incident resolution, the engineer will revisit the issue and determine the cause to ensure it doesn’t happen again.
- Cross-team collaboration
Common tools and experience needed:
- Monitoring: such as AWS CloudWatch and New Relic
- Incident management/on-call: such as TWS and other alerting tools
- Project management and issue tracking: such as Jira and Trello
- Infrastructure orchestration: including Terraform and SaltStack
- Other responsibilities
- Administer production jobs
- Understand debugging info
- “Drain” traffic away from a cluster
- Roll back a bad software push
- Block or rate-limit unwanted traffic
- Bring up additional serving capacity
- Use the monitoring systems (for alerting and dashboards)
II. Qualifications
- 10+ years of experience, including significant experience in the DevSecOps and SRE space
- Experience in designing and running robust and highly scalable data platforms and data pipelines
- Experience with leading a team of experienced SRE / DevSecOps professionals
- Extensive experience with designing/supporting both streaming and batch ETL pipelines
- Clear understanding of distributed computing, especially in databases
- Experience with open-source technologies (Spark, Kafka, Presto, Hive, Cassandra etc.)
- Experience working on any of the Cloud platforms (GCP, AWS, Azure)
- Strong communication skills, including the ability to present to C-level executives
- Ability to manage numerous concurrent requests, prioritize them, and deliver