Workflow Orchestration: Development and Deployment Guide

1. Goal and Scope

The purpose of this document is to provide a comprehensive guide for participants to create, manage, update and delete workflows within the Simpl-Open orchestration platform. By following a code-first approach, developers ensure consistency, traceability, and reliability across all environments.

2. Local Development

Development must always begin in a local environment. This allows developers to rapidly iterate, test business logic, and validate DAG (Directed Acyclic Graph) structures without impacting production data.

2.1 Project Layout

This repository (template-code-location) serves as the single consolidated code location for all data services workflows. It imports jobs and ops from three external packages (data-processing, dataframe-level-anonymisation, and field-level-pseudo-anonymisation) which are installed as Git dependencies, and also provides a place for custom template jobs/ops.

template-code-location/
├── src/
│   └── template_code_location/
│       ├── __init__.py
│       ├── repository.py                  # Unified entry point (all jobs/sensors/resources)
│       ├── jobs/                           # Custom jobs specific to this code location
│       │   ├── __init__.py
│       │   └── jobs.py
│       └── ops/                            # Custom ops specific to this code location
│           ├── __init__.py
│           └── ops.py
├── tests/                                  # Unit & integration tests
├── Dockerfile
├── pyproject.toml                          # Dependencies & external package sources
└── README.md

2.2 External Dependencies (Git Packages)

The heavy-lifting logic lives in separate repositories, which are pulled in as installable Python packages declared in pyproject.toml under [tool.uv.sources]:

| Package | Purpose | Source |
|---|---|---|
| data-processing | Data cleaning & transformation jobs | Git (branch: develop) |
| dataframe-level-anonymisation | k-anonymity, l-diversity, t-closeness | Git (branch: develop) |
| field-level-pseudo-anonymisation | Field-level encryption/hashing/redaction | Git (branch: develop) |
| util-services | Shared resources, sensors, and logging | Git (tag: v0.5.0) |

These packages expose their jobs and ops, which are then imported and registered in repository.py.

2.3 Code Examples (Ops, Jobs, and Definitions)

The orchestration logic should be modular: ops, jobs, and the Definitions object each live in their own module.

2.3.1 Workflow creation

Here is a practical example of how to construct a workflow.

1. Defining Ops (ops/ops.py)
Ops are the core units of computation. Keep them focused on a single task.

from dagster import op

@op
def fetch_data() -> list:
    """Fetches raw data from a source."""
    return [{"id": 1, "value": "A"}, {"id": 2, "value": "B"}]

@op
def process_data(data: list) -> dict:
    """Processes raw data and returns a summary."""
    return {"count": len(data), "status": "success"}

2. Assembling Jobs (jobs/jobs.py)
Jobs link ops together to form a dependency graph (workflow).

from dagster import job
from ..ops.ops import fetch_data, process_data

@job
def data_processing_job():
    """A simple job that fetches and processes data."""
    raw = fetch_data()
    process_data(raw)

3. Registering Definitions (repository.py)
This file acts as the entry point for the Simpl-Open orchestration platform to discover your code. It imports jobs from local modules as well as from external packages.

from dagster import Definitions
from util_services.resources import s3_resource
from util_services.sensors import notify_success, notify_failure, notify_canceled
from util_services.custom_json_logger import simpl_json_logger

# External package jobs
from data_processing.jobs import remove_duplicates_job_s3, fill_missing_values_job_s3
from dataframe_level_anonymisation.jobs import k_anonymity_job_s3, l_diversity_job_s3
from field_level_pseudo_anonymisation.jobs import anonymise_pseudonymise_structured_job_s3

# Local template jobs
from template_code_location.jobs.jobs import data_processing_job

defs = Definitions(
    jobs=[data_processing_job, remove_duplicates_job_s3, ...],
    sensors=[notify_success, notify_failure, notify_canceled],
    resources={"s3": s3_resource.configured({"resource_name": "selfS3"})},
    loggers={"simpl": simpl_json_logger},
)

2.3.2 Workflow deletion

Here is a practical example of how to delete a workflow.

1. Pre-delete check
Before deleting an existing workflow, first check whether it is still needed or referenced:

  • Dagster UI: Navigate to the "Runs" tab. Locate the workflow's runs using the "Filter" button or the "Newer" and "Older" navigation buttons.
  • Details and logs: Check the run status, click the run's target to open its details, or click its UUID to open its logs.

2. Delete Ops (ops/ops.py)
If the workflow is defined as an op, delete its corresponding function from the ops.py file.

3. Delete Jobs (jobs/jobs.py)
If the workflow is defined as a job, delete its corresponding function from the jobs.py file.

4. Delete Registered Definitions (repository.py)
Delete the workflow's import and its entry in the Definitions object from repository.py.
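One way to support the deletion steps above is to scan the source tree for remaining references before and after removing the code. This is a sketch only: find_references is a hypothetical helper, not part of the platform, and it uses plain text matching on identifiers, so it will not detect dynamic or string-based lookups.

```python
import re
from pathlib import Path


def find_references(name: str, root: Path) -> list[tuple[Path, int]]:
    """Return (file, line number) pairs where name appears as a whole identifier."""
    # \b keeps "data_processing_job" from matching "data_processing_job_v2".
    pattern = re.compile(rf"\b{re.escape(name)}\b")
    hits = []
    for path in sorted(root.rglob("*.py")):
        lines = path.read_text(encoding="utf-8").splitlines()
        for number, line in enumerate(lines, start=1):
            if pattern.search(line):
                hits.append((path, number))
    return hits


if __name__ == "__main__":
    # Assumes the script is run from the repository root shown in Section 2.1.
    for path, number in find_references("data_processing_job", Path("src")):
        print(f"{path}:{number}")
```

After the job, its import, and its Definitions entry have been removed, the scan should report no remaining hits for that name.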

2.4 Best Practices & Constraints

  • Separation of Concerns: Keep orchestration logic (how ops connect) strictly separate from heavy business logic (which should ideally live in separate Python modules/classes).
  • Naming Conventions: Use snake_case for jobs and ops. Code locations should be named based on the domain they represent (e.g., inventory_sync_service).
  • Dependency Management: All dependencies must be explicitly declared in pyproject.toml or requirements.txt.
  • Environment Agnosticism: Avoid hardcoding credentials. Use environment variables to handle configuration.
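The environment-agnosticism rule above can be sketched with the standard library alone. The variable names (S3_ENDPOINT_URL, S3_ACCESS_KEY, S3_SECRET_KEY) and the S3Settings class are hypothetical and chosen for illustration; the actual configuration keys depend on the resources in use.

```python
import os
from dataclasses import dataclass
from typing import Mapping

# Hypothetical variable names; adapt them to the resource being configured.
REQUIRED_VARIABLES = ("S3_ENDPOINT_URL", "S3_ACCESS_KEY", "S3_SECRET_KEY")


@dataclass(frozen=True)
class S3Settings:
    endpoint_url: str
    access_key: str
    secret_key: str


def load_s3_settings(env: Mapping[str, str] = os.environ) -> S3Settings:
    """Build settings from environment variables, failing fast when one is missing."""
    missing = [name for name in REQUIRED_VARIABLES if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return S3Settings(*(env[name] for name in REQUIRED_VARIABLES))
```

Passing the environment as a parameter keeps the function easy to test with a plain dictionary, and failing at load time surfaces a misconfigured environment before any op runs.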

3. Publishing to Production (Gitea)

Once the local validation is complete, the code must be published to the centralized Gitea repository.

  1. Repository Hosting: All workflows are stored in Gitea instances within the agent environment.
  2. Versioning: Workflows are versioned using Git. Each version of a workflow must correspond to a specific Git commit.
  3. Artifact Generation: Workflows are packaged as Docker container images.
    • Images must be pushed to the Gitea Integrated Container Registry.
    • Tagging Policy: Use semantic versioning or Git commit SHAs. Avoid the latest tag in production: it is mutable, so a deployment that references it is not reproducible and cannot be rolled back reliably.
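The tagging policy above could be enforced in a build script along these lines. This is a simplified sketch: the semantic-version pattern accepts only MAJOR.MINOR.PATCH with an optional leading "v" (no pre-release or build suffixes), and the registry host in the usage example is a placeholder. The function names are hypothetical.

```python
import re

# Simplified patterns: plain MAJOR.MINOR.PATCH, or an abbreviated/full Git SHA.
SEMVER = re.compile(r"v?\d+\.\d+\.\d+")
COMMIT_SHA = re.compile(r"[0-9a-f]{7,40}")


def is_allowed_tag(tag: str) -> bool:
    """Accept semantic versions and commit SHAs; reject mutable tags such as 'latest'."""
    return bool(SEMVER.fullmatch(tag) or COMMIT_SHA.fullmatch(tag))


def image_reference(registry: str, image: str, tag: str) -> str:
    """Build a full image reference, refusing tags that violate the policy."""
    if not is_allowed_tag(tag):
        raise ValueError(f"Tag {tag!r} is not a semantic version or commit SHA")
    return f"{registry}/{image}:{tag}"


if __name__ == "__main__":
    # gitea.example.invalid is a placeholder registry host.
    print(image_reference("gitea.example.invalid/org", "template-code-location", "v1.2.3"))
```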

4. Review and Approval Process

To maintain high-quality standards and security, no code is committed directly to the main branch.

  1. Feature Branching: Developers must push their changes to a dedicated feature branch.
  2. Pull Request (PR): Open a Pull Request in Gitea from the feature branch to the main branch.
  3. Peer Review: At least one developer (other than the author) must review the code.
    • Reviewers check for logic errors, security vulnerabilities, and adherence to the standards defined in Section 2.
  4. Approval: Once comments are addressed and the reviewer provides an "Approve" status, the PR can be merged.

5. Production Deployment

After the code is merged and the artifact is published, the final step is deploying to the orchestration platform.

5.1 Deployment Pipeline

The deployment follows these automated steps:

  1. CI/CD Trigger: A merge to the main branch triggers the CI pipeline.
  2. Image Build: The pipeline builds the Docker image and pushes it to the Gitea Registry.
  3. Manifest Update: The deployment configuration (e.g., Helm values or Kubernetes manifests) is updated to reference the new image tag.
  4. Platform Reload: The Simpl-Open orchestration platform (Dagster) is notified of the change.
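The manifest update in step 3 amounts to rewriting one image tag in the deployment configuration. The sketch below operates on an already-parsed values dictionary; the layout (a "deployments" list whose entries carry "name" and "image" keys) is an assumption made for illustration, and the actual structure depends on the chart or manifests in use. set_image_tag is a hypothetical helper.

```python
import copy


def set_image_tag(values: dict, location_name: str, new_tag: str) -> dict:
    """Return a copy of values in which the named deployment references new_tag."""
    updated = copy.deepcopy(values)  # leave the caller's configuration untouched
    for deployment in updated.get("deployments", []):
        if deployment.get("name") == location_name:
            deployment["image"]["tag"] = new_tag
            return updated
    raise KeyError(f"No deployment named {location_name!r}")


if __name__ == "__main__":
    # Hypothetical values layout and repository path.
    current = {
        "deployments": [
            {
                "name": "template-code-location",
                "image": {"repository": "gitea.example.invalid/org/template-code-location", "tag": "v1.0.0"},
            }
        ]
    }
    print(set_image_tag(current, "template-code-location", "v1.1.0"))
```

Returning a modified copy rather than editing in place makes it simple to diff the old and new configuration before committing the change.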

5.2 Verification

To confirm a successful deployment:

  • Dagster UI:
    • Navigate to the "Deployment" or "Code Locations" tab. Verify that the loaded image tag matches the latest Git commit.
    • After a workflow deletion, navigate to the "Runs" tab and identify the deleted workflow. Click its target to open the details: a "Pipeline not found" message confirms that the workflow was deleted successfully. Click its UUID to open the logs, which must remain available after deletion.
  • Health Check: Trigger a "Test Run" of the job in the production environment using a limited data slice.
  • Logs: Monitor the initialization logs in the Dagster daemon to ensure the code location was loaded without schema or dependency errors.