Workflow Orchestration: Development and Deployment Guide
1. Goal and Scope
This document guides participants through creating, managing, and updating workflows within the Simpl-Open orchestration platform. Following a code-first approach gives developers consistency, traceability, and reliability across all environments.
2. Local Development
Development must always begin in a local environment. This allows developers to rapidly iterate, test business logic, and validate DAG (Directed Acyclic Graph) structures without impacting production data.
2.1 Project Layout
This repository (template-code-location) serves as the single consolidated code location for all data services workflows. It contains the jobs, ops, and configurations previously spread across the data-processing, dataframe-level-anonymisation, and field-level-pseudo-anonymisation repositories.
template-code-location/
├── src/
│   └── template_code_location/
│       ├── repository.py                       # Unified entry point (all jobs/sensors/resources)
│       ├── data_processing/                    # Data cleaning & transformation ops/jobs
│       │   ├── config_models/
│       │   ├── jobs.py
│       │   └── ops.py
│       ├── dataframe_level_anonymisation/      # k-anonymity, l-diversity, t-closeness
│       │   ├── config_models/
│       │   ├── jobs.py
│       │   ├── ops.py
│       │   └── utils.py
│       ├── field_level_pseudo_anonymisation/   # Field-level encryption/hashing/redaction
│       │   ├── config_models/
│       │   ├── techniques/
│       │   ├── jobs.py
│       │   ├── ops.py
│       │   ├── unstructured_ops.py
│       │   └── utils.py
│       ├── jobs/                               # Template example jobs
│       └── ops/                                # Template example ops
├── tests/                                      # All tests (migrated from source repos)
├── Dockerfile
├── pyproject.toml
└── README.md
2.2 Code Examples (Ops, Jobs, and Definitions)
The orchestration logic should be modular. Here is a practical example of how to construct a workflow.
1. Defining Ops (ops.py)
Ops are the core units of computation. Keep them focused on a single task.
    from dagster import op


    @op
    def fetch_raw_data() -> list:
        """Fetches raw data from an external source."""
        return [{"id": 1, "value": "A"}, {"id": 2, "value": "B"}]


    @op
    def process_data(data: list) -> dict:
        """Transforms raw data into an aggregated format."""
        return {"processed_count": len(data), "status": "success"}
2. Assembling Jobs (jobs.py)
Jobs link ops together to form a dependency graph (workflow).
    from dagster import job

    from .ops import fetch_raw_data, process_data


    @job
    def data_processing_job():
        """A workflow that fetches and processes data."""
        raw_data = fetch_raw_data()
        process_data(raw_data)
3. Registering Definitions (repository.py)
This file acts as the entry point for the Simpl-Open orchestration platform to discover your code.
    from dagster import Definitions

    from .jobs import data_processing_job

    # The platform will load this Definitions object
    defs = Definitions(
        jobs=[data_processing_job],
        # You can also declare schedules, sensors, and resources here
    )
2.3 Best Practices & Constraints
- Separation of Concerns: Keep orchestration logic (how ops connect) strictly separate from heavy business logic (which should ideally live in separate Python modules/classes).
- Naming Conventions: Use snake_case for jobs and ops. Code locations should be named based on the domain they represent (e.g., inventory_sync_service).
- Dependency Management: All dependencies must be explicitly declared in pyproject.toml or requirements.txt.
- Environment Agnosticism: Avoid hardcoding credentials. Use environment variables to handle configuration.
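As an illustration of the environment-agnostic practice above, configuration can be resolved from environment variables at runtime and validated in one place. This is a minimal sketch using only the standard library; the variable names (SOURCE_BASE_URL, SOURCE_API_TOKEN, SOURCE_TIMEOUT_SECONDS) and the SourceSettings class are hypothetical and not part of the template.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceSettings:
    """Connection settings resolved from the environment at runtime."""
    base_url: str
    api_token: str
    timeout_seconds: int


def load_source_settings(env=None) -> SourceSettings:
    """Builds settings from environment variables and fails fast if a
    required variable is missing. All variable names are hypothetical."""
    env = os.environ if env is None else env
    required = ("SOURCE_BASE_URL", "SOURCE_API_TOKEN")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError(
            "Missing required environment variables: " + ", ".join(missing)
        )
    return SourceSettings(
        base_url=env["SOURCE_BASE_URL"],
        api_token=env["SOURCE_API_TOKEN"],
        # Optional variable with a default of 30 seconds.
        timeout_seconds=int(env.get("SOURCE_TIMEOUT_SECONDS", "30")),
    )
```

Passing a plain dictionary as env keeps the function easy to test without touching the real process environment.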
3. Publishing to Production (Gitea)
Once local validation is complete, the code must be published to the centralized Gitea repository.
- Repository Hosting: All workflows are stored in Gitea instances within the agent environment.
- Versioning: Workflows are versioned using Git. Each version of a workflow must correspond to a specific Git commit.
- Artifact Generation: Workflows are packaged as Docker container images.
  - Images must be pushed to the Gitea Integrated Container Registry.
- Tagging Policy: Use semantic versioning or Git commit SHAs. Avoid using the latest tag in production: it is mutable, so a deployment that references it is not reproducible and cannot be rolled back to a known image.
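The tagging policy can be enforced with a small check in the build tooling. The sketch below is an assumption about how such a check might look: the function name, the registry host, and the repository path are hypothetical, and the semantic-version pattern is deliberately simplified (no pre-release or build metadata).

```python
import re

# Simplified patterns: MAJOR.MINOR.PATCH (optional "v" prefix) or a 7-40 char hex SHA.
SEMVER = re.compile(r"^v?\d+\.\d+\.\d+$")
GIT_SHA = re.compile(r"^[0-9a-f]{7,40}$")


def build_image_reference(registry: str, repository: str, tag: str) -> str:
    """Returns 'registry/repository:tag', rejecting mutable or malformed tags."""
    if tag == "latest":
        raise ValueError("The 'latest' tag must not be used in production")
    if not (SEMVER.match(tag) or GIT_SHA.match(tag)):
        raise ValueError(f"Tag {tag!r} is neither a semantic version nor a Git SHA")
    return f"{registry}/{repository}:{tag}"


if __name__ == "__main__":
    # Hypothetical registry host and repository path.
    print(build_image_reference(
        "gitea.example.invalid", "data-services/template-code-location", "1.4.2"
    ))
```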
4. Review and Approval Process
To maintain high-quality standards and security, no code is committed directly to the main branch.
- Feature Branching: Developers must push their changes to a dedicated feature branch.
- Pull Request (PR): Open a Pull Request in Gitea from the feature branch to the main branch.
- Peer Review: At least one developer (other than the author) must review the code.
  - Reviewers check for logic errors, security vulnerabilities, and adherence to the standards defined in Section 2.
- Approval: Once comments are addressed and the reviewer provides an "Approve" status, the PR can be merged.
5. Production Deployment
After the code is merged and the artifact is published, the final step is deploying to the orchestration platform.
5.1 Deployment Pipeline
The deployment follows these automated steps:
- CI/CD Trigger: A merge to the main branch triggers the CI pipeline.
- Image Build: The pipeline builds the Docker image and pushes it to the Gitea Registry.
- Manifest Update: The deployment configuration (e.g., Helm values or Kubernetes manifests) is updated to reference the new image tag.
- Platform Reload: The Simpl-Open orchestration platform (Dagster) is notified of the change so that it reloads the affected code location.
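The manifest update step can be sketched as a pure function over the deployment values. This is an illustration only: the structure of the values dictionary (a deployments list whose entries carry a name and an image with a tag) is an assumption, and the real Helm values or Kubernetes manifests used by the platform may be shaped differently. The function name and the sample data are hypothetical.

```python
import copy


def set_image_tag(values: dict, deployment_name: str, new_tag: str) -> dict:
    """Returns a copy of `values` with the image tag of one deployment replaced.

    The input is left untouched so the old and new configuration can be diffed.
    """
    updated = copy.deepcopy(values)
    for deployment in updated.get("deployments", []):
        if deployment.get("name") == deployment_name:
            deployment["image"]["tag"] = new_tag
            return updated
    raise KeyError(f"No deployment named {deployment_name!r} in values")


if __name__ == "__main__":
    # Hypothetical values, loosely shaped like a Helm values file.
    current = {
        "deployments": [
            {
                "name": "template-code-location",
                "image": {
                    "repository": "gitea.example.invalid/data-services/template-code-location",
                    "tag": "1.4.1",
                },
            }
        ]
    }
    print(set_image_tag(current, "template-code-location", "1.4.2"))
```

Returning a modified copy rather than mutating in place makes it straightforward to review exactly what the pipeline changed before it is applied.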
5.2 Verification
To confirm a successful deployment:
- Dagster UI: Navigate to the "Deployment" or "Code Locations" tab. Verify that the loaded image tag matches the latest Git commit.
- Health Check: Trigger a "Test Run" of the job in the production environment using a limited data slice.
- Logs: Monitor the initialization logs in the Dagster daemon to ensure the code location was loaded without schema or dependency errors.