Summary
I am a skilled Site Reliability Engineer specializing in automation and incident response, with a passion for keeping complex systems reliable. Over the past five years, I have reduced incident response times by 40% by implementing proactive monitoring strategies. I also bring hands-on web development experience and am proficient in a range of automation tools and technologies, including Kubernetes, Terraform, and Prometheus. I work well with cross-functional teams to achieve operational excellence.
Experience

Site Reliability Engineer III
Jan 2023 – Present
- Enterprise Infrastructure Automation: Designed and implemented large-scale configuration management solutions using industry-standard orchestration platforms, reducing manual operations by 85% across distributed server environments.
- Production Systems Engineering: Architected resilient distributed systems with automated failover capabilities, implementing comprehensive observability solutions and performance optimization strategies for high-traffic applications.
- CI/CD Pipeline Architecture: Built robust continuous integration and deployment workflows that enabled rapid, reliable software delivery while maintaining strict security and compliance standards.
- Platform Security & Identity Management: Implemented enterprise authentication systems and automated credential lifecycle management, ensuring secure access patterns across development and production environments.
- Infrastructure as Code Leadership: Developed comprehensive Infrastructure as Code practices using modern provisioning tools, enabling consistent, repeatable deployments across multiple cloud and on-premises environments.
Site Reliability Engineer II
Jun 2022 – Jan 2023 (8 months)
- Automated Infrastructure Operations: Led monthly patching cycles for 100+ application servers, developing Ansible playbooks and implementing automation platforms to reduce manual intervention by 80%.
- Security & Compliance: Implemented network security controls, access management solutions, and automated configuration management via Infrastructure as Code, ensuring enterprise compliance standards.
- DevOps Tooling: Architected and deployed automation tracking infrastructure with a database backend, and migrated container images to private registries for enhanced security and organizational control.
- Monitoring & Alerting: Designed and implemented comprehensive monitoring solutions for critical infrastructure components, resolved system failures proactively, and maintained 99.9%+ service availability.
- Process Improvement: Created detailed documentation and operational playbooks for application server maintenance, established automated reporting systems, and optimized Python-based infrastructure tools.
Site Reliability Engineer I
Apr 2019 – Jun 2022 (3 years 3 months)
- Ensured the reliability, scalability, and performance of the company’s web applications.
- Implemented monitoring and alerting systems using tools like Prometheus and Grafana to proactively identify and resolve issues before they impacted users, resulting in a 30% decrease in downtime.
- Collaborated with development teams to design and implement automated deployment pipelines using Jenkins and Ansible, reducing deployment times by 50%. For instance, a recent project saw us deploy on-demand updates for a critical application with zero downtime.
- Conducted root cause analysis for over 100 incidents and implemented preventative measures such as automated rollbacks and improved documentation, leading to a 25% reduction in recurring issues.
- Managed the company’s AWS infrastructure, including EC2 instances, S3 buckets, and RDS databases, optimizing costs and performance.
- Participated in the on-call rotation, providing 24/7 support and incident response for critical issues.
- Languages: Python, Go, Java, Ruby, Bash/Shell scripting
- Frameworks and Tools: Kubernetes, Docker, Terraform, Ansible, Jenkins, Prometheus, Grafana, ELK Stack (Elasticsearch, Logstash, Kibana), Nagios, Splunk, New Relic, AWS (Amazon Web Services), Azure, Google Cloud Platform
Systems Administrator I
Apr 2017 – Apr 2019 (2 years 1 month)
- Built automation to improve system reliability, availability, and overall performance of GoDaddy’s hosting environment.
- Designed and implemented monitoring to identify failures and automate their resolution.
- Influenced and designed cloud infrastructure services, and created solutions to problems faced by end users.
- Automated workflows to maximize efficiency.
- Languages and Technologies: Python, Node.js, ReactJS, MySQL, PostgreSQL
- Frameworks and Tools: Ansible, Jenkins, Docker, Puppet
Managed Services – Expert II
Mar 2016 – Apr 2017 (1 year 2 months)
- Server administrator for Linux, Windows Server 2008 R2, and Windows Server 2012 R2.
- Back-end automation programming using Bash, JavaScript, Node.js, PHP, and Python.
- Assisted Expert I staff with higher-tier issues.
Managed Services – Expert I
Aug 2015 – Mar 2016 (8 months)
- Server administrator for Linux, Windows Server 2008 R2, and Windows Server 2012 R2.
- Expertise in email servers, DNS management, software firewall management, and application troubleshooting.
- Windows Remote Desktop/IIS/SQL Server Management Studio, Parallels Plesk Control Panel
- Linux, Apache, MySQL, PHP (LAMP), WHM/cPanel
- Received customer calls for white-glove server administration services and troubleshooting.
Hosting Sales and Support
Nov 2014 – Aug 2015 (10 months)
- Level 2 tech support role with an influence on sales.
- Expertise in content management systems such as WordPress and Joomla.
- Managed and troubleshot PHP, HTML, FTP, email, MySQL, SQL Server, Windows, and Linux.
Inbound Sales and Technical Support
May 2014 – Nov 2014 (7 months)
- Level 1 general support role with a heavy influence on sales.
- Troubleshot email delivery over SMTP.