Senior HPC Cloud Engineer
Descartes Labs
OUR VISION
At EarthDaily Analytics (EDA), we strive to build a more sustainable planet by creating innovative solutions that combine satellite imagery of the Earth, modern software engineering, machine learning, and cloud computing to solve the toughest challenges in agriculture, energy and mining, insurance and risk mitigation, wildfire and forest intelligence, carbon-capture verification and more.
EDA’s signature Earth Observation mission, the EarthDaily Constellation (EDC), is currently under construction. The EDC will be the most powerful global change detection and change monitoring system ever developed, capable of generating unprecedented predictive analytics and insights. It will combine with the EarthPipeline data processing system to deliver scientific-grade data of the world every day, positioning EDA to meet the growing needs of diverse industries.
OUR TEAM
Our global, enterprise-wide team represents a variety of business lines and is made up of business development, sales, marketing, and support professionals, data scientists, software engineers, project managers, and finance, HR, and IT professionals. Our EarthInsights team is nimble and collaborative, and in preparation for launching a frontier, disruptive product built on the EDC, we are looking for an experienced Senior HPC Cloud Engineer to join our crew!
READY TO LAUNCH?
Do you want to work for one of the most exciting space companies at the forefront of global change detection and monitoring, designing, building, and optimizing high-performance computing infrastructure on AWS and other hyperscalers? The ideal candidate will have experience in cloud engineering, HPC engineering, DevOps, Python development, AWS cloud architecture, containerization and orchestration, and database design to support computationally intensive workloads at scale.
PREPARE FOR IMPACT!
As a key technical leader, reporting to the Director of Product Management for EarthInsights, you will architect cloud-native solutions that bridge traditional HPC paradigms with modern DevOps practices, enabling our organization to leverage elastic cloud resources for complex simulations, data processing, and scientific computing. You will be responsible for scaling our existing infrastructure while troubleshooting critical issues across distributed compute clusters spanning multiple cloud providers. This role demands both deep technical expertise in HPC technologies and the versatility to work across the full stack, from low-level computational debugging to high-level application development. Working remotely with a high degree of autonomy, you will drive infrastructure innovation, optimize resource utilization, and implement best practices that directly impact our computational capabilities and operational efficiency. Your work will be instrumental in accelerating product development, delivering high-value data to our clients, reducing infrastructure costs, and ensuring our HPC platforms remain reliable and performant.
RESPONSIBILITIES:
Cloud Infrastructure & HPC Management
- Design, architect, and deploy scalable high-performance computing (HPC) solutions on AWS using services including AWS ParallelCluster, AWS Batch, EC2, ECS, ECR, Lambda, and managed databases.
- Configure and optimize HPC job schedulers (Slurm, PBS) for resource allocation, job scheduling, and workload management across cloud compute clusters.
- Create and manage resource reservation strategies for high-performance compute applications to optimize cost and performance.
- Troubleshoot and resolve complex issues with AWS HPC clusters, including performance bottlenecks, job failures, and infrastructure instabilities.
- Optimize HPC resource reservation scripts and automation workflows to improve cluster efficiency and reduce operational overhead.
- Build and maintain cloud-native applications using AWS PaaS services, integrating with compute, storage, and database solutions.
- Develop serverless functions using AWS Lambda to automate workflows, process events, and orchestrate cloud resources.
- Modernize cron-based task scheduling by migrating to cloud-native schedulers including Lambda, ECS scheduled tasks, and EventBridge.
- Create, optimize, and maintain containerized applications using Docker for deployment across cloud environments.
- Deploy and manage container workloads using Amazon ECS (Elastic Container Service) and ECR (Elastic Container Registry).
- Implement container orchestration strategies for both batch processing and long-running services.
- Build and maintain container images optimized for HPC workloads.
- Design and execute machine image replication and migration strategies from AWS to provider-agnostic HPC cloud platforms.
- Ensure workload portability across heterogeneous cloud environments while maintaining performance characteristics.
- Develop infrastructure-as-code solutions that support multi-cloud deployments.
- Architect database schemas for optimal performance, scalability, and data integrity.
- Create and maintain databases and tables across various AWS database services (S3, RDS, Aurora, DynamoDB, Redshift).
- Design and implement data ingestion pipelines to process output files from HPC simulations and applications.
- Develop automated ETL workflows to transform, validate, and load data from diverse sources.
- Create Python scripts that generate structured data files for downstream analysis and reporting.
- Implement comprehensive monitoring, logging, and alerting solutions using CloudWatch, X-Ray, and third-party APM tools.
- Build and maintain CI/CD pipelines for infrastructure and application deployments.
- Develop tooling for application performance monitoring, traceability, and debugging in production environments.
- Implement infrastructure-as-code using Terraform, CloudFormation, or AWS CDK.
YOUR PAST MISSIONS
- Bachelor's degree in Computer Science, Computer Engineering, Computational Science, Data Science, or related technical field
- (Preferred) Master’s degree in Computer Science, High Performance Computing, Distributed Systems, or related technical field
- (Preferred) Relevant AWS certification (Solutions Architect Professional, DevOps Engineer Professional, Advanced Networking)
- 7-10 years of professional experience in cloud engineering, HPC engineering, DevOps, or related roles
- 5+ years of Python development experience for automation, scripting, and application development
- 5+ years of hands-on experience with AWS services and cloud architecture
- 3+ years of experience managing HPC clusters and job schedulers (Slurm, PBS, SGE, or similar)
- Proven experience with containerization technologies (Docker) and container orchestration (ECS, Kubernetes)
- Demonstrated experience with database design, schema architecture, and data pipeline development
- Experience troubleshooting complex distributed systems and infrastructure
- Experience with multi-cloud environments (AWS, Azure, GCP) and cloud migration projects
- Background in computational science, scientific computing, or engineering simulation workloads
- Familiarity with numerical computing, parallel computing concepts, and HPC application optimization
YOUR TOOLKIT
Core Technical Skills (Required):
- AWS Services: EC2, ECS, ECR, Lambda, S3, RDS, Aurora, VPC, IAM, CloudWatch, ParallelCluster, Batch
- HPC Technologies: Slurm, PBS, AWS ParallelCluster, job scheduling, resource management, cluster configuration
- Programming & Scripting: Python (advanced), Bash, SQL
- Containerization: Docker, Amazon ECS, ECR, container optimization
- Database Technologies: Relational databases (PostgreSQL, MySQL), schema design, query optimization, data modeling
- Linux/Unix: System administration, performance tuning, shell scripting
- Infrastructure as Code: Terraform, CloudFormation, or AWS CDK
- Version Control: Git, GitHub/GitLab/Bitbucket
- Networking: VPC design, security groups, load balancers, DNS
- Cloud Schedulers: EventBridge, Step Functions, ECS scheduled tasks
- Data Engineering: ETL pipelines, data transformation, Apache Airflow, AWS Glue
- Monitoring & Observability: CloudWatch, X-Ray, Datadog, Prometheus, Grafana
- CI/CD: Jenkins, GitLab CI, GitHub Actions, AWS CodePipeline
- Additional Languages: Go, Java, or compiled languages for HPC applications
- Multi-cloud platforms: Azure (Azure CycleCloud), GCP (Cloud HPC Toolkit)
- Storage Systems: Lustre, EFS, FSx, parallel file systems
- Advanced troubleshooting and root cause analysis for complex distributed systems
- Performance profiling and optimization for HPC workloads
- Capacity planning and cost optimization for cloud infrastructure
- Motivated self-starter who proactively identifies and addresses opportunities for improvement across the full stack
- Strong written and verbal communication skills for technical and non-technical audiences
- Ability to work independently in a remote environment with minimal supervision
- Collaborative mindset for cross-functional team engagement
- Adaptability to rapidly changing technologies and business requirements
OUR SPACE (including travel)
We’d love to welcome you to the EarthInsights team for this fully remote opportunity, open to individuals located in and working from the US or Canada. We follow Agile software development, with daily standups and a weekly Scrum cadence, in a fast-paced environment that requires adapting quickly to time-sensitive deliveries.
Ours is a fun, fast-paced and exciting work environment where we hold earth-smart (living sustainably), creativity and innovation, proactive communication, diversity and accountability as core values. And just like space exploration - we’re constantly evolving and pushing new boundaries.
Hours of work typically fall between 9:00am and 5:00pm Central Time, Monday to Friday, with periodic cross-over work required with other team members across a few time zones, in addition to occasional evening and weekend work. Team members need to be available for a minimum of six (6) hours daily during this period to facilitate collaboration.
YOUR COMPENSATION
Base Salary Range: $165k–$195k USD annually.
Placement within the range depends on job-related skills, experience, training, education, location, and business needs. The range is based on Minneapolis-market compensation for this role. We would consider paying at the top end of the range only for a candidate whose demonstrated experience, skills, and expertise warrant it.
WHY EARTHDAILY ANALYTICS?
- Competitive compensation and flexible time off
- Be part of a meaningful mission in one of North America's most innovative space companies developing sustainable solutions for our planet
- Great work environment and team with head office locations in Vancouver, Canada and Minneapolis, MN