Senior Machine Learning Engineer
About Tripstack
Founded in Toronto, Canada in 2016, Tripstack has been part of Etraveli Group since 2019. It is a B2B Flights as a Service provider and a world leader in virtual interlining.
Operating from offices in Canada, India, and Poland, Tripstack is the gateway into Etraveli Group’s world-leading tech platform, giving partners access to global flight content, virtual interlining, and a full suite of services including payments, fraud prevention, pricing, and customer support. As a world leader in virtual interlining technology, Tripstack connects non-partner, low-cost and full-service carriers, enabling the creation of unique and flexible itineraries through a simple, cost-effective API.
Its technology ingests over 30B price points and handles over 240 million searches daily.
Through partnerships with airlines, OTAs, and other distribution channels across the globe, Tripstack expands networks, drives new revenue streams, and offers more choice at competitive prices, all backed by robust technology and traveler protection.
For more information, visit:
The role
First, our data platform needs more support to continue to scale: Druid cluster maintenance, incident root-cause analysis, query and segment tuning, and the Spark practice we need to scale beyond today's MLOps workloads. All of these need someone with personal ownership, not another layer of coordination.
Second, we are migrating the entire stack from bare-metal VMs to Kubernetes on OpenStack in a new data centre. This will require careful planning, testing, and execution, with close attention to detail.
Third, and connected to both: we are expanding our ML practice. We have several ambitious models to launch in 2026. Our MLFlow-based training platform is live but early; the path from notebook to monitored production model needs a real owner.
This role exists because we need one person who can operate distributed data systems in production, lead a complex migration, and turn our ML platform into something any data scientist at Tripstack can ship against, all while partnering with SRE to ensure high availability and stability in our data operations.
Responsibilities
Operate and evolve our data platform
Own the operational health of Druid, Spark, Redpanda, Airflow, PostgreSQL, and Elasticsearch as production systems, e.g. segment lifecycle, JVM tuning, ingestion specs, broker/coordinator/overlord internals, partition design, consumer lag, replication tuning
Build the KPIs, alerting, dashboards, and runbooks that let us see cluster exhaustion before it becomes an incident, and diagnose it quickly when it does
Own the query, report, segment, and tiering optimizations that keep our analytics cost-effective and responsive under load
Lead the data-stack migration
Plan and execute the migration of Druid, Spark, Redpanda, and our orchestration layer from bare-metal VMs to Kubernetes on OpenStack with no downtime on stateful workloads
Design StatefulSet, PVC, pod disruption budget, and rolling-upgrade patterns that are safe for production data systems
Codify the migration with Infrastructure-as-Code (Terraform for OpenStack, Helm or Kustomize for K8s, GitOps via ArgoCD or Flux) so the result is reproducible and supportable by the whole team
Own and expand the ML platform
Evolve our MLFlow-based training and retraining platform into a reliable, multi-tenant product used by Data, Search, and Content teams
Build and operate the feature pipelines, model registries, and serving patterns that enable ML at scale
Define the deployment, rollback, drift-detection, and incident-response patterns for ML in production, tied into our existing PagerDuty and JIRA incident flow
Partner with Data Science on new model development: your job is to make sure "it works on my notebook" becomes "it ships, it is monitored, and we can prove its business impact"
Raise the bar on observability and reliability
Build the Prometheus / Grafana / distributed-tracing coverage our data systems need. Treat SLOs, error budgets, and post-incident discipline as table stakes.
Partner closely with SRE on hardware, networking, and K8s fundamentals, but own the data applications themselves end-to-end
What success looks like in year one
By 30 days: you have mapped our data platform end to end, identified the top operational risks on Druid and Spark, and shipped a fix for at least one of them. You are the named on-call owner for data platform incidents.
By 90 days: the data centre migration plan for the data stack is in execution with a clear, tested rollback path. Druid cluster-health KPIs and runbooks are in place, and we are measurably faster at RCA than we are today.
By 6 months: the data centre migration has settled and model deployments have scaled up significantly. Any data scientist at Tripstack can go from a trained model to a monitored production deployment in under a day, through tooling you own.
Requirements
Deep production operations experience with a real-time OLAP system at scale - Apache Druid strongly preferred; equivalent depth in ClickHouse or Apache Pinot considered. We expect you to have done segment lifecycle, JVM tuning, ingestion spec authoring, and RCA on broker/coordinator-class failures yourself.
Strong Kubernetes experience with stateful workloads - StatefulSets, PVCs, pod disruption budgets, rolling upgrades for data systems
Observability and RCA discipline - Prometheus, Grafana, distributed tracing, and the habit of writing the runbook that stops the next incident
Hands-on Apache Spark at scale - DAG execution, shuffle optimization, memory tuning, Spark-on-Kubernetes
7+ years building and operating production data systems, at least 2 of them on self-hosted or bare-metal infrastructure. You have been on-call for pipelines and services you built, and you have opinions about what good looks like.
Clear written and verbal English; comfortable working across Kraków, Toronto, Pune, and Stockholm time zones
Strong differentiators
Production experience with MLFlow or an equivalent ML lifecycle platform (e.g. Kubeflow, SageMaker, Vertex AI) and a serious opinion about when each is the right choice
Redpanda or Apache Kafka operations - partition design, consumer lag management, replication tuning
Infrastructure as Code at a senior level - Terraform (ideally against OpenStack), Helm or Kustomize, GitOps with ArgoCD or Flux
Fluency with agentic coding tools (Claude Code, Gemini, or equivalent) as a real part of your workflow - including the judgement to know when not to trust an AI-generated config for a production data system
Python that is production-grade, plus working knowledge of at least one JVM or Go-family language so you can integrate with our services directly
Additional experience that would be considered an asset
OpenStack familiarity - Neutron networking, Cinder/Ceph storage
Delta Lake or Apache Iceberg experience and architectural judgement about when to introduce a table format
Change Data Capture patterns, especially PostgreSQL-to-Druid streaming
Security and secrets management in K8s - Vault, network policies, encryption at rest
Exposure to travel, flights, or large-scale search and cache systems
Experience with experimentation platforms (e.g. Growthbook, internal A/B frameworks)
Benefits
We offer an opportunity to work with a young, dynamic, and growing team composed of high-caliber professionals. We value professionalism and promote a culture where individuals are encouraged to do more and be more. If you feel you share our passion for excellence and growth, then look no further. We have an ambitious mission, and we need a world-class team to make it a reality. Upgrade to a First Class team!
At Tripstack, we proudly believe in embracing diversity. This is true for our team, clients, communities and stakeholders. We are an equal opportunity employer and committed to creating a safe, healthy and accessible environment. We encourage applications regardless of race, colour, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity or any other grounds protected by law. Please let us know if you need any accommodations during any part of the recruitment process.
Tripstack thanks all applicants for their interest; however, only those selected to continue in the process will be contacted.
Learn more about us at www.tripstack.com
- Division: Tripstack
- Department: Technology
- Locations: Kraków
- Remote status: Hybrid