Site Reliability Engineer
650 Fifth Avenue New York, NY 10019
In this role, you will build a fully automated infrastructure by leveraging cutting-edge technologies like Docker, Kubernetes, Ansible, and Terraform. The ideal candidate for this role approaches technology operations as a systems engineering discipline, obtains and analyzes data to identify trouble spots and optimization opportunities, and applies software development practices to improve the reliability of our platform and services. To succeed in this role, you need to be passionate about making constant incremental improvements to systems, laser-focused on availability and performance, and driven to automate all the things.
What You Will Do:
- Support development initiatives to build highly automated, tuned, and reliable systems and services. Contribute to design and implementation decisions, development, and ongoing refactoring.
- Implement tools that analyze and monitor performance and availability; use your findings to make informed decisions on how to improve existing systems and processes
- Bring SRE best practices in-house (post-mortems, trend analysis, availability standards, etc.) and help set the tone around service operations and reliability
- Develop and deliver timely reports on service metrics including but not limited to availability, capacity, performance, and latency across production systems
Who You Are:
- You are passionate about making better software and continuously improving the development, integration, and deployment processes
- You enjoy new technological challenges and are motivated to find creative solutions to solve them
- You are a highly motivated self-starter who thrives in a bottom-up, fast-paced, highly technical environment
- You know how to design, implement, and iterate CI/CD tooling and techniques to improve our ability to deliver software and services quickly and reliably
- Expertise in incident and problem management including timely problem identification, successful resolution, and root-cause analysis
- Strong verbal and written communication skills to communicate technology concepts and practices
- Strong expertise in monitoring tools (AppDynamics/App Insights/Sumo Logic/etc.)
- Experience with configuration management tools (Ansible, Chef, Puppet, etc.)
- Strong working knowledge of containers, container orchestration, and the AWS environment and its tools