Being the escalation path for on-call incident response and triage
Performing root cause analysis and post-mortems with an eye towards future prevention
Creating configuration management and infrastructure automation
Designing and implementing CI/CD pipelines for all that we build
Designing and delivering core infrastructure services
Proactively improving stability, security, and performance through analysis of metrics and monitoring data
Making sure every service has a complete high-availability and disaster recovery story
Maintaining security standards across everything we support
Producing documentation, runbooks, and support tooling for online support teams
Who You Are
A self-starter with a considerable breadth of technical knowledge and the ability to dig deep when necessary.
Someone who communicates well with people across dozens of teams and practices.
An engineer with a passion for excellence, a devotion to automation, and an eye for efficiency.
A consummate problem solver.
The systems we support are incredibly diverse, produced by dozens of teams from around the world. Accordingly, the ideal candidate will have a diverse skill set and always be eager to expand it. More importantly, they will be able to apply their conceptual understanding to new technologies and tools rapidly. Being a self-starter with a personal dedication to continuous learning is key. Below is a representative but non-exhaustive list of the skills we are looking for in a successful candidate.
Systems Administration: a strong understanding of *nix is mandatory. Familiarity with both RHEL and Debian family distros is preferred. Strong ad-hoc scripting skills are a plus. Understanding of core services like DNS, DHCP, LDAP, logging, etc.
Networking: a strong understanding of networking basics is mandatory. Switching/routing, VPNs, load balancing, proxying, network virtualization, firewall basics (especially iptables) and general netsec best practices.
Programming Languages: experience with Ruby and Python is preferred. Ability to dive into the code during triage or while trying to understand behavior is a must. Familiarity with C/C++, Go, Java, and Scala is desirable.
Automation: experience with configuration management and infrastructure as code tools is a must. Chef (preferred), Puppet, Terraform (preferred), Packer, CloudFormation, etc.
Distributed Systems: a strong understanding of distributed systems is a must. An understanding of the CAP theorem, techniques for high availability, service discovery, secret management, etc.
Virtualization, Containerization, Cloud Computing: AWS (preferred), GCP, Azure, VMware ecosystem, Kubernetes (preferred), Docker, Vagrant, etc.