Senior Site Reliability Engineer (SRE)

Job Description

Alibaba is looking for a Senior Site Reliability Engineer (SRE) to help us improve and expand our rapidly growing products.

At Alibaba, we believe we can influence workplace culture in Iran by promoting integrity, authenticity, commitment to something beyond ourselves, and respectful dialogue, and you can be part of building that culture.

Benefits of working at Alibaba
  • Don't worry about your income; we adjust our pay in line with the job market.
  • We pay for your development plans.
  • We provide all the gadgets and devices you need for work (except for an Apple Watch!).
  • You won't be isolated in your team; we believe anyone at Alibaba can collaborate to help the team win!
  • If there were a Coolest Office Competition, we might win the gold medal.
  • You can take breaks in our relaxation areas and help yourself to drinks and fruit.
  • We believe diversity brings creativity.
  • We don't believe in HW (Hard Working!). We believe in HHW (Happily Hard Working!), so we have lots of fun alongside the work.

Site Reliability Engineering (SRE) ensures that Alibaba’s services, both our internally critical and our externally visible systems, have reliability and uptime appropriate to users' needs and a fast rate of improvement. Additionally, SREs keep an ever-watchful eye on the capacity and performance of our systems. Much of our software development focuses on optimizing existing systems, building infrastructure, and eliminating manual work through automation.

On the SRE team, you’ll have the opportunity to tackle the complex challenges of scale that are unique to Alibaba, while using your expertise in coding, algorithms, complexity analysis, and large-scale system design.

Responsibilities

  • Operate high-availability, fault-tolerant, scalable, distributed software in production: build monitoring into your code, tune dashboards, define alerts, and so on.
  • Help build systems and tooling to support automating our reliability and scaling efforts.
  • Work directly with application architects in a consulting capacity.
  • Support the release of new services through capacity planning, rollout planning, and release management.
  • In collaboration with application developers, define and implement monitoring strategies, SLIs/SLOs, and error budgets.
  • Troubleshoot and remediate issues with the services you manage.
  • Support a broad range of systems as second-line support.
  • Eliminate toil by automating procedures and processes.
  • Practice sustainable incident response and blameless postmortems.

Requirements

  • 2+ years of experience in a DevOps/SRE role running mission-critical services.
  • Mastery of at least one programming language (Python is a plus).
  • Practical knowledge of Kubernetes and other cloud-native technologies.
  • Experience with, and a strong interest in, automating infrastructure and monitoring using tools and services such as Ansible.
  • Experience applying CI and CD concepts.

Submit Your Application