Merge branch 'develop' into 'main'

SIMPL-24642 dev ---> main

See merge request simpl/simpl-open/development/data-services/template-code-location!7
2026-05-29 10:12:01 +02:00
parent bb95b381fe 337578fea5
commit f5e34fe8ba
12 changed files with 4960 additions and 1 deletions
--- a/.gitignore
+++ b/.gitignore
@@ -0,0 +1,23 @@
 # Python
 *.egg-info/
 **/__pycache__/
 *.pyc
 *.pyo
 # Virtual environments
 .venv/
 venv/
 env/
 # Test & coverage
 .pytest_cache/
 .coverage
 htmlcov/
 # UV lock file
 uv.lock
 # IDE / OS
 .idea/
 .vscode/
 *.DS_Store
--- a/Dockerfile
+++ b/Dockerfile
@@ -0,0 +1,71 @@
 FROM python:3.12-slim-bookworm
 # --- Install uv (pinned for reproducibility) ---
 COPY --from=ghcr.io/astral-sh/uv:0.10.8 /uv /uvx /bin/
 WORKDIR /app
 # Create non-root user with explicit UID/GID 1000
 RUN addgroup --gid 1000 appgroup && \
    adduser --uid 1000 --gid 1000 --disabled-password --gecos "" appuser
 # Install system dependencies:
 #   - git: required to fetch util-services from GitLab (tool.uv.sources)
 #   - build-essential / gcc / g++ / python3-dev / cmake: native extensions
 #     (scrubadub-spacy → spaCy, pycanon, etc.)
 #   - curl: optional healthcheck / runtime tooling
 RUN apt-get update && apt-get upgrade -y \
    && apt-get install -y --no-install-recommends \
    build-essential=12.9 \
    cmake=3.25.1-1 \
    gcc=4:12.2.0-3 \
    g++=4:12.2.0-3 \
    python3-dev=3.11.2-1+b1 \
    git=1:2.39.5-0+deb12u3 \
    curl=7.88.1-10+deb12u14 \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/* \
    && rm -rf /tmp/* \
    && rm -rf /var/tmp/*
 # Pre-own /app so appuser can write to it
 RUN chown -R appuser:appgroup /app
 # Copy project metadata and source
 COPY pyproject.toml .
 COPY uv.lock .
 COPY src/ ./src/
 # uv environment knobs:
 #   UV_COMPILE_BYTECODE  → compile .pyc files at install time for faster cold start
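 #   UV_LINK_MODE=copy    → copy files from the cache instead of hard-linking
 #                          (the BuildKit cache mount is a separate filesystem)
 #   UV_SYSTEM_PYTHON=1   → makes `uv pip` target the system Python; `uv sync`
 #                          below still installs into the project venv (/app/.venv)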
 ENV UV_COMPILE_BYTECODE=1
 ENV UV_LINK_MODE=copy
 ENV UV_SYSTEM_PYTHON=1
 # Install the project and all dependencies, respecting [tool.uv.sources]
 # (git source for util-services and pytorch-cpu index for torch)
 # BuildKit cache mount keeps the uv package cache across builds
 RUN --mount=type=cache,target=/root/.cache/uv \
    uv sync --frozen --no-dev
 # Put the project's venv on PATH (matches WORKDIR)
 ENV PATH="/app/.venv/bin:${PATH}"
 ENV PYTHONPATH="/app/src"
 # Make /app writable for the non-root user (e.g. spaCy model downloads)
 RUN chown -R 1000:1000 /app && chmod -R u+w /app
 # Provide a real home directory for appuser
 RUN mkdir -p /home/appuser && chown -R 1000:1000 /home/appuser
 ENV HOME=/home/appuser
 USER appuser
 # Sanity-check: fail the build early if the dagster CLI is missing
 RUN dagster --version
 EXPOSE 4000
 CMD ["dagster", "code-server", "start", "-h", "0.0.0.0", "-p", "4000", "-f", "src/template_code_location/repository.py"]
--- a/documents/Development
+++ b/documents/Development
@@ -0,0 +1,151 @@
 # Workflow Orchestration: Development and Deployment Guide
 ## 1. Goal and Scope
 This guide explains how participants create, manage, and update workflows within the Simpl-Open orchestration platform.
 By following a *code-first approach*, developers ensure consistency, traceability, and reliability across all environments.
 ## 2. Local Development
 Development must always begin in a local environment. This allows developers to rapidly iterate, test business logic, and validate DAG (Directed Acyclic Graph) structures without impacting production data.
 ### 2.1 Project Layout
 This repository (`template-code-location`) serves as the **single consolidated code location** for all data services workflows. It imports jobs and ops from three external packages (`data-processing`, `dataframe-level-anonymisation`, and `field-level-pseudo-anonymisation`) which are installed as Git dependencies, and also provides a place for custom template jobs/ops.
 ```text
 template-code-location/
 ├── src/
 │   └── template_code_location/
 │       ├── __init__.py
 │       ├── repository.py                  # Unified entry point (all jobs/sensors/resources)
 │       ├── jobs/                           # Custom jobs specific to this code location
 │       │   ├── __init__.py
 │       │   └── jobs.py
 │       └── ops/                            # Custom ops specific to this code location
 │           ├── __init__.py
 │           └── ops.py
 ├── tests/                                  # Unit & integration tests
 ├── Dockerfile
 ├── pyproject.toml                          # Dependencies & external package sources
 └── README.md
 ```
 ### 2.2 External Dependencies (Git Packages)
 The heavy-lifting logic lives in separate repositories, pulled in as installable Python packages via `pyproject.toml` and `[tool.uv.sources]`:
 | Package | Purpose | Source |
 |---------|---------|--------|
 | `data-processing` | Data cleaning & transformation jobs | Git (branch: `0.4.0`) |
 | `dataframe-level-anonymisation` | k-anonymity, l-diversity, t-closeness | Git (branch: `0.6.0`) |
 | `field-level-pseudo-anonymisation` | Field-level encryption/hashing/redaction | Git (branch: `0.7.0`) |
 | `util-services` | Shared resources, sensors, and logging | Git (rev: `v0.6.1`) |
 These packages expose their jobs and ops, which are then imported and registered in `repository.py`.
 ### 2.3 Code Examples (Ops, Jobs, and Definitions)
 The orchestration logic should be modular. Here is a practical example of how to construct a workflow.
 **1. Defining Ops (`ops/ops.py`)**  
 Ops are the core units of computation. Keep them focused on a single task.
 ```python
 from dagster import op
 @op
 def fetch_data() -> list:
    """Fetches raw data from a source."""
    return [{"id": 1, "value": "A"}, {"id": 2, "value": "B"}]
 @op
 def process_data(data: list) -> dict:
    """Processes raw data and returns a summary."""
    return {"count": len(data), "status": "success"}
 ```
 **2. Assembling Jobs (`jobs/jobs.py`)**  
 Jobs link ops together to form a dependency graph (workflow).
 ```python
 from dagster import job
 from ..ops.ops import fetch_data, process_data
 @job
 def data_processing_job():
    """A simple job that fetches and processes data."""
    raw = fetch_data()
    process_data(raw)
 ```
 **3. Registering Definitions (`repository.py`)**  
 This file acts as the entry point for the Simpl-Open orchestration platform to discover your code. It imports jobs from local modules as well as from external packages.
 ```python
 from dagster import Definitions
 from util_services.resources import s3_resource
 from util_services.sensors import notify_success, notify_failure, notify_canceled
 from util_services.custom_json_logger import simpl_json_logger
 # External package jobs
 from data_processing.jobs import remove_duplicates_job_s3, fill_missing_values_job_s3
 from dataframe_level_anonymisation.jobs import k_anonymity_job_s3, l_diversity_job_s3
 from field_level_pseudo_anonymisation.jobs import anonymise_pseudonymise_structured_job_s3
 # Local template jobs
 from template_code_location.jobs.jobs import data_processing_job
 defs = Definitions(
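    jobs=[
        data_processing_job,
        remove_duplicates_job_s3,
        fill_missing_values_job_s3,
        k_anonymity_job_s3,
        l_diversity_job_s3,
        anonymise_pseudonymise_structured_job_s3,
        # The full list of registered jobs is in repository.py
    ],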
    sensors=[notify_success, notify_failure, notify_canceled],
    resources={"s3": s3_resource.configured({"resource_name": "selfS3"})},
    loggers={"simpl": simpl_json_logger},
 )
 ```
 ### 2.4 Best Practices & Constraints
 - **Separation of Concerns**: Keep orchestration logic (how ops connect) strictly separate from heavy business logic (which should ideally live in separate Python modules/classes).
 - **Naming Conventions**: Use `snake_case` for jobs and ops. Name code locations after the domain they represent (e.g., `inventory_sync_service`).
 - **Dependency Management**: All dependencies must be explicitly declared in `pyproject.toml` or `requirements.txt`.
 - **Environment Agnosticism**: Avoid hardcoding credentials. Use environment variables to handle configuration.
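As an illustration of the environment-agnostic practice above, the following sketch reads connection settings from environment variables and fails fast when one is missing. The variable names (`STORAGE_ENDPOINT_URL`, `STORAGE_BUCKET`, `STORAGE_ACCESS_KEY`) and the `StorageSettings` class are hypothetical and not taken from this repository.

```python
import os
from dataclasses import dataclass

# Hypothetical variable names, chosen only for this example.
REQUIRED_VARS = ("STORAGE_ENDPOINT_URL", "STORAGE_BUCKET", "STORAGE_ACCESS_KEY")


@dataclass(frozen=True)
class StorageSettings:
    endpoint_url: str
    bucket: str
    access_key: str


def load_storage_settings(env=None) -> StorageSettings:
    """Build settings from environment variables, failing fast on gaps."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return StorageSettings(
        endpoint_url=env["STORAGE_ENDPOINT_URL"],
        bucket=env["STORAGE_BUCKET"],
        access_key=env["STORAGE_ACCESS_KEY"],
    )
```

Passing a plain dictionary as `env` keeps the loader easy to test without touching the real process environment.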
 ## 3. Publishing to Production (Gitea)
 Once the local validation is complete, the code must be published to the centralized Gitea repository.
 1. **Repository Hosting**: All workflows are stored in Gitea instances within the agent environment.
 2. **Versioning**: Workflows are versioned using Git. Each version of a workflow must correspond to a specific Git commit.
 3. **Artifact Generation**: Workflows are packaged as Docker container images.
    - Images must be pushed to the Gitea Integrated Container Registry.
    - **Tagging Policy**: Use semantic versioning or Git commit SHAs. Avoid the `latest` tag in production so that deployments stay reproducible and rollbacks stay simple.
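The tagging policy could be enforced with a small check before an image is pushed. This is a minimal sketch under stated assumptions: the accepted formats (an optional `v` prefix, an optional pre-release suffix, and 7 to 40 hexadecimal characters for a commit SHA) and the function name `is_deployable_tag` are illustrative choices, not rules defined by the platform.

```python
import re

# Semantic version such as 0.1.0 or v0.6.1, optionally with a pre-release suffix.
SEMVER = re.compile(r"v?\d+\.\d+\.\d+(?:-[0-9A-Za-z.-]+)?")
# Abbreviated or full Git commit SHA.
COMMIT_SHA = re.compile(r"[0-9a-f]{7,40}")


def is_deployable_tag(tag: str) -> bool:
    """Return True for semantic versions or commit SHAs, False for mutable tags."""
    return bool(SEMVER.fullmatch(tag) or COMMIT_SHA.fullmatch(tag))


for candidate in ("0.1.0", "v0.6.1", "f5e34fe8ba", "latest", "develop"):
    print(candidate, is_deployable_tag(candidate))
```

Rejecting mutable tags such as `latest` at build time means a given tag always refers to the same image, which is what makes a rollback predictable.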
 ## 4. Review and Approval Process
 To maintain high quality and security standards, no code is pushed directly to the main branch.
 1. **Feature Branching**: Developers must push their changes to a dedicated feature branch.
 2. **Pull Request (PR)**: Open a Pull Request in Gitea from the feature branch to the main branch.
 3. **Peer Review**: At least one developer (other than the author) must review the code.
    - Reviewers check for logic errors, security vulnerabilities, and adherence to the standards defined in Section 2.
 4. **Approval**: Once comments are addressed and the reviewer provides an "Approve" status, the PR can be merged.
 ## 5. Production Deployment
 After the code is merged and the artifact is published, the final step is deploying to the orchestration platform.
 ### 5.1 Deployment Pipeline
 The deployment follows these automated steps:
 1. **CI/CD Trigger**: A merge to the main branch triggers the CI pipeline.
 2. **Image Build**: The pipeline builds the Docker image and pushes it to the Gitea Registry.
 3. **Manifest Update**: The deployment configuration (e.g., Helm values or Kubernetes manifests) is updated to reference the new image tag.
 4. **Platform Reload**: The Simpl-Open orchestration platform (Dagster) is notified of the change.
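The manifest update step can be pictured as a pure function over the deployment values. The sketch below assumes a Helm-style layout with an `image.tag` key; that layout, the sample repository path, and the function name `with_image_tag` are hypothetical, since the actual deployment configuration is not shown in this guide.

```python
import copy


def with_image_tag(values: dict, new_tag: str) -> dict:
    """Return a copy of the deployment values that references new_tag."""
    if new_tag == "latest":
        raise ValueError("mutable tags are not allowed in production")
    updated = copy.deepcopy(values)  # leave the caller's dictionary untouched
    updated.setdefault("image", {})["tag"] = new_tag
    return updated


# Hypothetical values structure, for illustration only.
current = {"image": {"repository": "registry.example.invalid/template-code-location", "tag": "0.0.1"}}
release = with_image_tag(current, "0.1.0")
print(release["image"]["tag"])
```

Returning a new dictionary rather than mutating the input keeps the previous values available, which simplifies comparing or reverting a change.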
 ### 5.2 Verification
 To confirm a successful deployment:
 - **Dagster UI**: Navigate to the "Deployment" or "Code Locations" tab. Verify that the loaded image tag matches the latest Git commit.
 - **Health Check**: Trigger a "Test Run" of the job in the production environment using a limited data slice.
 - **Logs**: Monitor the initialization logs in the Dagster daemon to ensure the code location was loaded without schema or dependency errors.
--- a/pipeline.variables.sh
+++ b/pipeline.variables.sh
@@ -1 +1 @@
-PROJECT_VERSION_NUMBER="0.0.1"
+PROJECT_VERSION_NUMBER="0.1.0"
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -0,0 +1,44 @@
 [build-system]
 requires = ["setuptools>=68.0", "wheel"]
 build-backend = "setuptools.build_meta"
 [project]
 name = "template-code-location"
 version = "0.1.0"
 description = "Consolidated code location for all data services workflows"
 requires-python = ">=3.12"
 dependencies = [
    "dagster>=1.8.13",
    "util-services",
    "data-processing",
    "dataframe-level-anonymisation",
    "field-level-pseudo-anonymisation",
 ]
 [tool.uv]
 exclude-dependencies = ["transformers", "spacy-transformers"]
 override-dependencies = [
    "util-services @ git+https://code.europa.eu/simpl/simpl-open/development/data-services/util-services.git@v0.6.1",
 ]
 [tool.uv.sources]
 torch = { index = "pytorch-cpu" }
 util-services = { git = "https://code.europa.eu/simpl/simpl-open/development/data-services/util-services.git", rev = "v0.6.1" }
 data-processing = { git = "https://code.europa.eu/simpl/simpl-open/development/data-services/data-processing.git", branch = "0.4.0" }
 dataframe-level-anonymisation = { git = "https://code.europa.eu/simpl/simpl-open/development/data-services/dataframe-level-anonymisation.git", branch = "0.6.0" }
 field-level-pseudo-anonymisation = { git = "https://code.europa.eu/simpl/simpl-open/development/data-services/field-level-pseudo-anonymisation.git", branch = "0.7.0" }
 [[tool.uv.index]]
 name = "pytorch-cpu"
 url = "https://download.pytorch.org/whl/cpu"
 explicit = true
 [project.optional-dependencies]
 dev = [
    "pytest>=8.0.0",
    "pytest-cov>=7.0.0",
    "pytest-mock>=3.0.0"
 ]
 [tool.setuptools.packages.find]
 where = ["src"]
--- a/src/template_code_location/__init__.py
+++ b/src/template_code_location/__init__.py
--- a/src/template_code_location/jobs/__init__.py
+++ b/src/template_code_location/jobs/__init__.py
--- a/src/template_code_location/jobs/jobs.py
+++ b/src/template_code_location/jobs/jobs.py
@@ -0,0 +1,9 @@
 from dagster import job
 from ..ops.ops import fetch_data, process_data
 @job
 def data_processing_job():
    """A simple job that fetches and processes data."""
    raw = fetch_data()
    process_data(raw)
--- a/src/template_code_location/ops/__init__.py
+++ b/src/template_code_location/ops/__init__.py
--- a/src/template_code_location/ops/ops.py
+++ b/src/template_code_location/ops/ops.py
@@ -0,0 +1,13 @@
 from dagster import op
 @op
 def fetch_data() -> list:
    """Fetches raw data from a source."""
    return [{"id": 1, "value": "A"}, {"id": 2, "value": "B"}]
 @op
 def process_data(data: list) -> dict:
    """Processes raw data and returns a summary."""
    return {"count": len(data), "status": "success"}
--- a/src/template_code_location/repository.py
+++ b/src/template_code_location/repository.py
@@ -0,0 +1,70 @@
 from dagster import Definitions
 from util_services.resources import s3_resource
 from util_services.sensors import (
    notify_success,
    notify_failure,
    notify_canceled
 )
 from util_services.custom_json_logger import simpl_json_logger
 # Data processing jobs
 from data_processing.jobs import (
    remove_duplicates_job_s3,
    fill_missing_values_job_s3,
    standardize_categorical_values_job_s3,
    correct_typos_job_s3,
    normalize_numeric_min_max_job_s3,
    normalize_datetime_job_s3,
    normalize_coordinates_job_s3,
    add_global_aggregations_job_s3,
    filter_dataset_job_s3,
    quality_job_s3
 )
 # Dataframe-level anonymisation jobs
 from dataframe_level_anonymisation.jobs import (
    k_anonymity_job_s3,
    l_diversity_job_s3,
    t_closeness_job_s3,
    read_write_semistructured_job_s3,
 )
 # Field-level pseudo-anonymisation jobs
 from field_level_pseudo_anonymisation.jobs import (
    anonymise_pseudonymise_structured_job_s3,
    depseudonymise_structured_job_s3,
    anonymise_pseudonymise_unstructured_job_s3,
    depseudonymise_unstructured_job_s3,
 )
 from template_code_location.jobs.jobs import data_processing_job
 defs = Definitions(
    jobs=[
        data_processing_job,
        # Data processing
        remove_duplicates_job_s3,
        fill_missing_values_job_s3,
        standardize_categorical_values_job_s3,
        correct_typos_job_s3,
        normalize_numeric_min_max_job_s3,
        normalize_datetime_job_s3,
        normalize_coordinates_job_s3,
        add_global_aggregations_job_s3,
        filter_dataset_job_s3,
        quality_job_s3,
        # Dataframe-level anonymisation
        k_anonymity_job_s3,
        l_diversity_job_s3,
        t_closeness_job_s3,
        read_write_semistructured_job_s3,
        # Field-level pseudo-anonymisation
        anonymise_pseudonymise_structured_job_s3,
        depseudonymise_structured_job_s3,
        anonymise_pseudonymise_unstructured_job_s3,
        depseudonymise_unstructured_job_s3,
    ],
    sensors=[notify_success, notify_failure, notify_canceled],
    resources={"s3": s3_resource.configured({"resource_name": "selfS3"})},
    loggers={"simpl": simpl_json_logger},
 )
--- a/uv.lock
+++ b/uv.lock