Nathaniel Eliot
+1 512 786 8842
resume@t9productions.com
40+ refs & full CV on LinkedIn
Career Objective
I help build the platforms other engineers depend on, with deep roots in Site Reliability Engineering, Platform Engineering, and the Kubernetes/CNCF ecosystem. I’m looking to support and lead pragmatic innovation in this novel technical era, at an organization with enterprise-class infrastructure or the desire to build it.
Preferred Approaches
Practices: SRE, SLOs/SLIs, Error Budgets, Incident Response, On-Call, GitOps, Infrastructure-as-Code, CI/CD, Service Mesh, Multi-cloud, Capacity Planning, Cost Optimization
Clustering System: Kubernetes, Helm, ClusterAPI, Istio, Argo, Vault, & other CNCF technologies
Agentic Development: Claude Code, Letta Code, OpenCode
Development Languages: Golang, Python, Rust, Ruby, Bash, JavaScript
Hosted Services: Amazon Web Services (AWS), Google Cloud (GCP), Azure, GitHub
Observability: Prometheus, Grafana, Loki, Alertmanager, Datadog, PagerDuty
Operating System: Linux (Ubuntu/Debian and Red Hat/CentOS families)
Other: Terraform, Docker, Git, Elasticsearch, PostgreSQL
Recent Experience
InfluxData
Senior Software Reliability Engineer, December 2024 - April 2026
Kubernetes, Vault, Istio, Argo, Prometheus, Grafana, Golang, AWS, GCP, Azure, Claude, Terraform
Stepped up as team lead after sudden departure of previous leadership. Upgraded Argo, Vault, Istio, and Kubernetes across all eighteen clusters of our multi-cloud environment. Rebuilt the Alertmanager pipeline to consolidate Slack, PagerDuty, and runbook routing, reducing alert fatigue and standardizing team-level severity handling. Drove a large-scale customer data recovery effort (fifteen phases, ~5,000 worker-hours) and landed the supporting engine fixes in the core platform. Built an OIDC/PKCE authentication pipeline enabling safe per-cluster observability access for AI agents, meaningfully improving incident diagnosis and architectural guidance.
Auditboard
Staff Software Engineer, July 2022 - April 2023
Kubernetes, Argo, Helm, Prometheus, Python, TypeScript, Golang, AWS, Azure, Datadog, Terraform
Diagnosed alert fatigue in the operations team and organized regular on-call reviews to bring alert load down to sustainable levels. Developed an SLO alerting framework in Datadog and Terraform, and trained the team on it, to further simplify and unify monitoring. Temporarily took over the core deployment process to drive team-wide and company-wide improvements, then trained the team on the new process. Recognized the need for, built consensus around, and implemented a company-wide version deploy policy, reducing the operational burden of variance. Maintained and advanced the underlying Kubernetes architecture across multiple cloud providers (AWS, Azure).
Indeed
Senior Site Reliability Engineer, June 2018 - July 2022
Kubernetes, Helm, Prometheus, Grafana, ClusterAPI, Python, AWS, Puppet, Datadog
Built several generations of deployment systems for Kubernetes, which replaced a prior Mesos-based system within 18 months and paved the way for a company-wide lift-and-shift to AWS. These clusters provide roughly 200 kCPUs and 500 TB of memory to over five thousand applications in sixteen datacenters worldwide. Successfully lobbied upper management to adopt the CNCF ecosystem more broadly, which removed costly-to-maintain dependencies and provided many novel capabilities to developers. Supported infrastructure development efforts across five client teams, providing valuable operational direction and oversight. Developed SLO alerting in Prometheus, Grafana, and Datadog, which reduced alert fatigue dramatically for those client teams and their supporting SRE on-call rotations.
The Greenfield Guild
Founder & CEO, January 2017 - December 2020
Kubernetes, Helm, Velero, Golang, JavaScript, Python, AWS, GCP, Terraform, Docker, Cassandra
Founded The Greenfield Guild to respond to a gap in available cloud and Agile expertise in small and medium businesses. Recruited, interviewed, hired, and managed a half-dozen employees and freelancers. Developed core architectural experiments in Terraform, Kubernetes, Docker, and WordPress. Composed job proposals for a variety of clients, from early-stage startups to large government entities. Attended conferences and networked with software vendors to provide early pipeline for the sales team. Increased firm’s visibility through a variety of means, including speaking opportunities and social media engagement.
Bazaarvoice
Staff DevOps Engineer, September 2013 - May 2016
AWS, Docker, CloudFormation, Puppet, Java, Scala, Golang, Cassandra, Elasticsearch, Datadog
Maintained and developed the core infrastructure (Cassandra, Elasticsearch, and custom Java and Scala code, deployed with CloudFormation and Puppet) for the new distributed data stack. Provided front-line operational support to relieve core developers during performance pushes. Took ownership of deployment for the largest customer team during a critical delivery push. Provided guidance and conducted experiments to further stabilize and test the core stack. Open-sourced a useful ancillary tool (cloudformation-ruby-dsl) written by coworkers. Developed an internal PaaS offering using Flynn.io, which provides a stable, decentralized, container-based build and deployment framework to numerous teams in a variety of environments.
Infochimps
Senior Operations Engineer, May 2011 - June 2013
AWS, Rackspace, Chef, Ruby, Python, JavaScript, Hadoop, HBase, Cassandra, Elasticsearch
Core developer on the cluster orchestration suite Ironfan across two major releases, including multi-cloud provider support. Supported eight developers through deployment and incident response. Standardized 85+ internal cloud servers across five active clients onto a common deployment stack with regular redeployment. Championed and built continuous integration (CI) for full-stack deployments. Wrote a lightweight AAA server for the metered data offering. Led Ironfan open-source community engagement: email, issue tracking, social media, and in-person and webcast talks. Continuously reduced infrastructure spend through usage guidance and data-store cleanup.
T9 Productions
Consultant, March 2003 - Present
Kubernetes, Flux, Helm, Golang, JavaScript, Python, AWS, Azure, Terraform, Docker, MySQL
Long-running independent consultancy on system architecture and developer infrastructure. Recent engagement: rebuilt a pre-funding startup’s architecture and developer pipeline with Kubernetes and CNCF tooling, reducing cost and unblocking deployments. Earlier work spans system administration (email, virtualization, developer tools, web infrastructure) and full-stack web development in a variety of languages.
Relevant Skills
Leadership: Focused on soft, goal-driven leadership. Experienced with leading both from the ranks and from positions of authority. Good at identifying and nurturing talent in fellow professionals. Comfortable with building consensus across diverse business functions. Familiar with a variety of business failure modes and potential remedies. Experienced working with distributed, remote-first teams in multiple timezones. Solid writing and copy-editing skills, including technical and policy writing.
Software Development: Primary focus in system automation, including customization and extension of many popular open-source packages. Strong skills in functional and object-oriented programming across many diverse languages. Comfortable with most layers of common development stacks, with a preference for deep system integration. Heavy emphasis on test-driven design and other agile development practices.
Reliability Engineering: Server administration and user technical support across all major platforms, in a wide range of software domains. Capable of end-to-end system implementation, including requirements gathering, architecture design, server provisioning and build, software development, product launch, and support infrastructure. Strong focus on SLOs and other SRE / observability practices, including educating other teams in their adoption. Heavy emphasis on repeatable infrastructure and other agile architecture practices. Strong preference for low-cost, open-source, and interoperable solutions. Practical experience with incident response, post-incident reviews, on-call rotation design, multi-cloud disaster recovery, and capacity planning at scale.