
Data Engineer - Forward Deployed (USA)

Applied Computing Technologies Ltd
Full-time
Remote
United States

Description

We’re building the data backbone for Orbital, an industrial AI system that ingests and learns from complex refinery and process data in real time. As our Data Engineer, you’ll architect and maintain the pipelines that land high-frequency time-series, lab, and historian data in a scalable Lakehouse architecture, ready for both deep learning models and real-time LLMs.

You’ll be working across AWS (EKS, S3, EBS, KMS, CloudWatch) and Databricks/PySpark, ensuring data is contextualised, synchronised, and optimised for both deep learning models and real-time LLM workloads.

This isn’t a traditional ETL role: you’ll be solving problems at the intersection of control systems, industrial data engineering, and AI enablement.

Location:
You will be based in the U.S. (or eligible to work there), ideally in Houston or willing to travel there extensively.

Core Responsibilities


Ingest & Contextualise Data

  • Ingest from OPC UA servers, process historians, IoT sensors, LIMS systems, alarms/events, and P&IDs.
  • Map signals to their physical processes (tags, units, hierarchies) for interpretability in AI pipelines.
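
To give a flavour of this contextualisation work, here is a minimal Python sketch that maps raw historian/OPC UA readings onto tags, units, and equipment hierarchy. The tag names, units, and asset names are illustrative placeholders, not Orbital's actual schema.

    # Minimal sketch: attaching physical-process context to raw historian signals.
    # Tag names, units, and hierarchy values here are illustrative placeholders.
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class TagContext:
        tag: str          # historian/OPC UA tag name
        unit: str         # engineering unit of the signal
        equipment: str    # physical asset the signal belongs to
        area: str         # plant hierarchy level (unit/area)

    # In practice this map would be loaded from P&IDs, LIMS, or an asset model.
    TAG_MAP = {
        "FIC101.PV": TagContext("FIC101.PV", "m3/h", "feed_pump_A", "crude_unit"),
        "TI205.PV":  TagContext("TI205.PV", "degC", "preheat_exchanger", "crude_unit"),
    }

    def contextualise(reading: dict) -> dict:
        """Merge a raw reading {'tag', 'ts', 'value'} with its physical context."""
        ctx = TAG_MAP.get(reading["tag"])
        if ctx is None:
            # Unknown tags are flagged rather than dropped, so drift stays visible downstream.
            return {**reading, "unit": None, "equipment": None, "area": None, "known": False}
        return {**reading, "unit": ctx.unit, "equipment": ctx.equipment, "area": ctx.area, "known": True}

    print(contextualise({"tag": "FIC101.PV", "ts": "2024-01-01T00:00:00Z", "value": 42.3}))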


Data Movement & Accessibility

  • Build pipelines that handle real-time streaming and batch ingestion into the Lakehouse (see the ingestion sketch after this list).
  • Manage synchronisation between historian archives, unstructured files, and AWS storage (S3/EBS).
  • Orchestrate Databricks Lakeflow/Connectors for integrating data into Lakebase/Lakehouse.
  • Handle secure, high-throughput transfers between historian archives and sandbox/live environments.
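
As a rough illustration of the streaming-plus-batch pattern this involves, the PySpark sketch below reads live readings from a Kafka topic into a Delta (Lakehouse) table and backfills historical exports into the same layout. The topic name, S3 paths, and record schema are assumptions for the example; a production job would also cover authentication, dead-letter handling, and watermarking.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

    spark = SparkSession.builder.appName("orbital-ingest").getOrCreate()

    schema = StructType([
        StructField("tag", StringType()),
        StructField("ts", TimestampType()),
        StructField("value", DoubleType()),
    ])

    # Real-time path: Kafka -> parse JSON payload -> append to the Lakehouse table.
    stream = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")
        .option("subscribe", "historian-readings")
        .load()
        .select(F.from_json(F.col("value").cast("string"), schema).alias("r"))
        .select("r.*")
    )

    (
        stream.writeStream.format("delta")
        .option("checkpointLocation", "s3://orbital-lake/checkpoints/readings")
        .outputMode("append")
        .start("s3://orbital-lake/bronze/readings")
    )

    # Batch path: backfill historical CSV exports into the same table layout.
    (
        spark.read.schema(schema).csv("s3://orbital-archive/historian-export/")
        .write.format("delta").mode("append").save("s3://orbital-lake/bronze/readings")
    )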


Change Tracking & Integrity

  • Detect and manage schema changes, signal drift, and inconsistencies across time (see the schema-drift sketch after this list).
  • Implement lineage and audit trails across Spark/Databricks and AWS pipelines.
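
A minimal sketch of the kind of schema-drift check this implies: compare an incoming batch's schema against the existing Lakehouse table before writing, and surface any added, removed, or retyped columns. The table path is a placeholder.

    from pyspark.sql import SparkSession, DataFrame

    spark = SparkSession.builder.appName("orbital-schema-check").getOrCreate()

    def schema_diff(incoming: DataFrame, target_path: str) -> dict:
        """Return columns added, removed, or retyped relative to the existing Delta table."""
        current = spark.read.format("delta").load(target_path)
        inc = {f.name: f.dataType.simpleString() for f in incoming.schema.fields}
        cur = {f.name: f.dataType.simpleString() for f in current.schema.fields}
        return {
            "added":   sorted(set(inc) - set(cur)),
            "removed": sorted(set(cur) - set(inc)),
            "retyped": sorted(c for c in set(inc) & set(cur) if inc[c] != cur[c]),
        }

    # Usage: block the write, or route to a review queue, if anything changed.
    # diff = schema_diff(batch_df, "s3://orbital-lake/bronze/readings")
    # if any(diff.values()):
    #     raise ValueError(f"Schema drift detected: {diff}")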



Data Preparation for AI

  • Build and maintain dual pipelines (a shared-transformation sketch follows this list):
    • Training → large-scale historical data prep for time-series + LLM training.
    • Inference → low-latency, real-time pipelines for anomaly detection, optimisation, and LLM search.
  • Support heterogeneous AI workloads (time-series forecasting and retrieval-augmented LLMs).
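
One common way to keep such dual pipelines from drifting apart is to share a single feature-preparation function between the batch (training) and streaming (inference) paths. The PySpark sketch below assumes Delta tables and uses illustrative paths, window sizes, and column names rather than Orbital's real layout.

    from pyspark.sql import SparkSession, DataFrame
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("orbital-features").getOrCreate()

    def add_features(df: DataFrame) -> DataFrame:
        """Derive per-minute model inputs from raw readings; identical for train and serve."""
        return (
            df.withColumn("minute", F.date_trunc("minute", F.col("ts")))
              .groupBy("tag", "minute")
              .agg(F.avg("value").alias("value_mean"), F.stddev("value").alias("value_std"))
        )

    # Training path: full historical scan, written once for model training.
    train = add_features(spark.read.format("delta").load("s3://orbital-lake/bronze/readings"))
    train.write.format("delta").mode("overwrite").save("s3://orbital-lake/gold/features_train")

    # Inference path: the same transformation over a stream, kept fresh for serving.
    live = add_features(spark.readStream.format("delta").load("s3://orbital-lake/bronze/readings"))
    (
        live.writeStream.format("delta")
        .option("checkpointLocation", "s3://orbital-lake/checkpoints/features_live")
        .outputMode("complete")
        .start("s3://orbital-lake/gold/features_live")
    )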


Database Performance & Optimisation

  • Tune PostgreSQL and Spark for high-throughput time-series workloads (partitioning, indexing, query optimisation; a sketch follows this list).
  • Optimise pipelines for both fast analytical queries and high-efficiency model training.
  • Deploy and manage data pipelines in AWS EKS (Kubernetes) with persistent EBS-backed storage.
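
For the PostgreSQL side, here is a minimal sketch of the partitioning and indexing work referenced above, driven from Python with psycopg2. Connection details, table, and index names are placeholders, and a real deployment would automate partition rollover.

    # Declarative range partitioning and indexing for a time-series table in PostgreSQL.
    import psycopg2

    DDL = """
    CREATE TABLE IF NOT EXISTS readings (
        tag_id  integer          NOT NULL,
        ts      timestamptz      NOT NULL,
        value   double precision
    ) PARTITION BY RANGE (ts);

    CREATE TABLE IF NOT EXISTS readings_2024_01
        PARTITION OF readings
        FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');

    -- Composite index matching the dominant access pattern: one tag over a time window.
    CREATE INDEX IF NOT EXISTS readings_2024_01_tag_ts
        ON readings_2024_01 (tag_id, ts);
    """

    # Placeholder connection string; credentials would come from IAM/KMS-backed secrets.
    with psycopg2.connect("dbname=orbital user=orbital") as conn:
        with conn.cursor() as cur:
            cur.execute(DDL)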

Technical Requirements

  • Deep expertise in PostgreSQL (partitioning, indexing, query optimisation, storage design).
  • Strong proficiency in Python for data processing, scripting, and pipeline orchestration.
  • Hands-on experience with AWS (EKS, S3, EBS, IAM, KMS, CloudWatch, etc.) for secure and scalable data pipelines.
  • Proven ability to work with Databricks and PySpark for large-scale distributed data processing.
  • Familiarity with time-series industrial data (control systems, DCS/SCADA logs, process historians).
  • Experience in unstructured data sync and management within hybrid cloud/on-prem environments.
  • Bonus: Knowledge of streaming frameworks (Kafka, Flink, Spark Streaming) or MLOps stacks for data versioning and lineage.

What Success Looks Like

  • Live data streams are contextualised, queryable, and AI-ready.
  • Schema changes and signal drift are detected and handled without breaking downstream workflows.
  • Training and inference pipelines run smoothly in parallel, optimised for scale and latency.
  • AI teams can focus on modelling because the data backbone is robust, fast, and reliable.