Summary

I am a Site Reliability Engineer specializing in automation and incident response for complex systems. Over the past five years, I have reduced incident response times by 40% by implementing proactive monitoring strategies. I draw on hands-on experience in incident response and web development to build automation that improves system reliability and performance. I am proficient with Kubernetes, Terraform, and Prometheus, and I work closely with cross-functional teams to achieve operational excellence.

Experience

Site Reliability Engineer III

Jan 2023 – Present

Enterprise Infrastructure Automation: Designed and implemented large-scale configuration management solutions using industry-standard orchestration platforms, reducing manual operations by 85% across distributed server environments.
Production Systems Engineering: Architected resilient distributed systems with automated failover capabilities, implementing comprehensive observability solutions and performance optimization strategies for high-traffic applications.
CI/CD Pipeline Architecture: Built robust continuous integration and deployment workflows that enabled rapid, reliable software delivery while maintaining strict security and compliance standards.
Platform Security & Identity Management: Implemented enterprise authentication systems and automated credential lifecycle management, ensuring secure access patterns across development and production environments.
Infrastructure as Code Leadership: Developed comprehensive Infrastructure as Code practices using modern provisioning tools, enabling consistent, repeatable deployments across multiple cloud and on-premises environments.

Site Reliability Engineer II

Jun 2022 – Jan 2023 (8 months)

Automated Infrastructure Operations: Led monthly patching cycles for 100+ application servers, developing Ansible playbooks and implementing automation platforms to reduce manual intervention by 80%.
Security & Compliance: Implemented network security controls, access management solutions, and automated configuration management via Infrastructure as Code, ensuring compliance with enterprise standards.
DevOps Tooling: Architected and deployed automation tracking infrastructure with a database backend, and migrated container images to private registries for enhanced security and organizational control.
Monitoring & Alerting: Designed and implemented comprehensive monitoring solutions for critical infrastructure components, resolved system failures proactively, and maintained 99.9%+ service availability.
Process Improvement: Created detailed documentation and operational playbooks for application server maintenance, established automated reporting systems, and optimized Python-based infrastructure tools.

Site Reliability Engineer I

Apr 2019 – Jun 2022 (3 years 3 months)

Responsible for ensuring the reliability, scalability, and performance of the company’s web applications.
Implemented monitoring and alerting systems using Prometheus and Grafana to identify and resolve issues before they impacted users, resulting in a 30% decrease in downtime.
Collaborated with development teams to design and implement automated deployment pipelines using Jenkins and Ansible, reducing deployment times by 50%. For instance, one project deployed on-demand updates to a critical application with zero downtime.
Conducted root cause analysis for over 100 incidents and implemented preventative measures such as automated rollbacks and improved documentation, leading to a 25% reduction in recurring issues.
Managed the company’s AWS infrastructure, including EC2 instances, S3 buckets, and RDS databases, optimizing costs and performance.
Participated in the on-call rotation, providing 24/7 support and incident response to resolve critical issues.
Languages: Python, Go, Java, Ruby, Bash/Shell scripting
Frameworks and Tools: Kubernetes, Docker, Terraform, Ansible, Jenkins, Prometheus, Grafana, ELK Stack (Elasticsearch, Logstash, Kibana), Nagios, Splunk, New Relic, AWS (Amazon Web Services), Azure, Google Cloud Platform

Systems Administrator I

Apr 2017 – Apr 2019 (2 years 1 month)

Built automation to improve system reliability, availability, and overall performance of GoDaddy’s hosting environment.
Designed and implemented monitoring to identify failures and automate their resolution.
Influenced and designed cloud infrastructure services, and created solutions to problems faced by end users.
Automated workflows to maximize efficiency.
Languages: Python, Node.js, ReactJS, MySQL, PostgreSQL
Frameworks and Tools: Ansible, Jenkins, Docker, Puppet

Managed Services – Expert II

Mar 2016 – Apr 2017 (1 year 2 months)

Server administrator for Linux, Windows Server 2008 R2, and Windows Server 2012 R2.
Back-end automation programming using Bash, JavaScript, Node.js, PHP, and Python.
Assisted Expert I staff with higher-tier issues.

Managed Services – Expert I

Aug 2015 – Mar 2016 (8 months)

Server administrator for Linux, Windows Server 2008 R2, and Windows Server 2012 R2.
Expertise in email servers, DNS management, software firewall management, and application troubleshooting.
Windows Remote Desktop/IIS/SQL Server Management Studio, Parallels Plesk Control Panel
Linux, Apache, MySQL, PHP (LAMP), WHM/cPanel
Received customer calls for white-glove server administration services and troubleshooting.

Hosting Sales and Support

Nov 2014 – Aug 2015 (10 months)

Level 2 tech support role with an influence on sales.
Expertise in content management systems such as WordPress and Joomla.
Managed and troubleshot: PHP, HTML, FTP, Email, MySQL, SQL Server, Windows, Linux

Inbound Sales and Technical Support

May 2014 – Nov 2014 (7 months)

Level 1 general support role with a heavy influence on sales.
Troubleshot email delivery over SMTP.