SIMPL-28034 init

2026-05-29 12:39:39 +02:00
parent f5daf2f748
commit bfc22e594a
1 changed files with 151 additions and 0 deletions
--- a/documents/Output
+++ b/documents/Output
@@ -0,0 +1,151 @@
+# Dagster Workflow – Input/Output Separation and Non-Overwrite Principles
+
+## 1. Objective
+
+This document describes how to configure Dagster workflows so that input datasets are **never modified or overwritten** during processing.
+
+The workflow must:
+- Read data from a **source location (input)**
+- Write processed data to a **separate destination (output)**
+- Preserve the original dataset unchanged
+
+---
+
+## 2. Key Principles
+
+To avoid overwriting input datasets, the following principles must always be applied when designing Dagster workflows:
+
+### 2.1 Separation of Input and Output
+- Input and output must **always refer to different storage locations**
+- This separation can be enforced via:
+  - Different `file_key` values
+  - Different buckets or paths
+  - Different prefixes or folders
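+
+As a minimal sketch of the "different buckets" option, the input and output ops can point at separate buckets. The config keys are the ones used in the pre-built workflow example in this document; the bucket name `dagster-workflow-output-bucket` is a hypothetical placeholder, not an existing resource.
+
+```yaml
+ops:
+  read_structured_from_s3:
+    config:
+      bucket_name: dagster-workflow-bucket          # input bucket (treated as read-only)
+      file_format: csv
+      file_key: input.csv
+
+  write_df_to_s3:
+    config:
+      bucket_name: dagster-workflow-output-bucket   # hypothetical, separate output bucket
+      file_format: csv
+      file_key: output.csv
+```
+
+With separate buckets, write permissions can also be limited to the output bucket, assuming the deployment's access policies allow this.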
+
+### 2.2 Read-Only Input
+- Input datasets must be treated as **immutable**
+- No operation should write back to the input path
+
+### 2.3 Explicit Output Configuration
+- Output destinations must be explicitly configured
+- Avoid default or implicit reuse of input configuration
+
+### 2.4 Idempotent Processing
+- Workflow execution must not produce side effects on the original dataset
+- Re-running the workflow with the same configuration must leave the source data unchanged and write only to the configured output location
+
+---
+
+## 3. Test Setup and Execution
+
+### 3.1 Dataset Configuration
+- Prepare an input dataset (e.g. `input.csv`) in a defined location:
+  - Example: `s3://dagster-workflow-bucket/input.csv`
+
+### 3.2 Workflow Configuration
+- Configure Dagster to:
+  - Read from the input dataset
+  - Write results to a different location (e.g. `output.csv`)
+
+### 3.3 Execution
+- Run the Dagster job
+- Ensure that:
+  - Data is successfully processed
+  - Output dataset is generated
+
+---
+
+## 4. Verification of No Overwrite
+
+To validate correct behavior:
+
+- Compare the input dataset **before and after execution**
+- Ensure that:
+  - The file content is unchanged (for example, the checksum is identical)
+  - The file timestamp or object version is unchanged (if applicable)
+- Verify that:
+  - The output dataset exists in a different location from the input
+  - The output contains only processed data
+
+---
+
+## 5. Example – Simpl-Open Pre-Built Workflow Configuration
+
+In Simpl-Open pre-built workflows, this principle is **already enforced by design**.
+
+Below is an example configuration:
+
+```yaml
+ops:
+  apply_l_diversity:
+    config:
+      generalisation_hierarchies:
+        age: simpl_age
+      ident:
+        - Name
+      k: 2
+      l: 3
+      quasi_identifiers:
+        - age
+      sensitive_attribute: Disease
+      supp_level: 50.0
+
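+  # Input: the source object, which is only read and never written to
+  read_structured_from_s3:
+    config:
+      bucket_name: dagster-workflow-bucket
+      file_format: csv
+      file_key: input.csv
+
+  # Output: same bucket, but a different file_key than the input
+  write_df_to_s3:
+    config:
+      bucket_name: dagster-workflow-bucket
+      file_format: csv
+      file_key: output.csv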
+```
+
+### Explanation
+
+- `read_structured_from_s3`
+  - Reads the dataset from `input.csv`
+
+- `write_df_to_s3`
+  - Writes the processed dataset to `output.csv`
+
+### Key Point
+
+Even when input and output share the **same bucket**, separation is maintained by using a **different `file_key` for the output**, because an S3 object is identified by its bucket and key together.
+
+This ensures that:
+- The input dataset (`input.csv`) is never overwritten
+- The output dataset is stored independently (`output.csv`)
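+
+For contrast, the following sketch shows the pattern to avoid. It assumes the same ops and config keys as the example configuration; because both ops resolve to the same bucket and `file_key`, the write step would target the input object itself.
+
+```yaml
+# Anti-pattern: do NOT use. Input and output point at the same object.
+ops:
+  read_structured_from_s3:
+    config:
+      bucket_name: dagster-workflow-bucket
+      file_format: csv
+      file_key: input.csv
+
+  write_df_to_s3:
+    config:
+      bucket_name: dagster-workflow-bucket
+      file_format: csv
+      file_key: input.csv   # same key as the input: the source object becomes the write target
+```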
+
+---
+
+## 6. Configuration Guidelines
+
+When creating or customizing Dagster workflows, follow these guidelines:
+
+- Always define a **dedicated output path**
+- Never point input and output at the same object: within a single bucket, never reuse the same `file_key` for both
+- Prefer:
+  - Different filenames (`input.csv` vs `output.csv`)
+  - Or structured paths:
+    - `/input/...`
+    - `/output/...`
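+- Validate the configuration before execution, for example by checking that the input and output `bucket_name` and `file_key` combinations differ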
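+
+A structured-path layout could look like the sketch below. The `input/` and `output/` prefixes and the file names are illustrative assumptions, and it is assumed that the ops accept a `file_key` containing a prefix.
+
+```yaml
+ops:
+  read_structured_from_s3:
+    config:
+      bucket_name: dagster-workflow-bucket
+      file_format: csv
+      file_key: input/dataset.csv       # hypothetical key under the input prefix
+
+  write_df_to_s3:
+    config:
+      bucket_name: dagster-workflow-bucket
+      file_format: csv
+      file_key: output/dataset.csv      # hypothetical key under the output prefix
+```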
+
+---
+
+## 7. Conclusion
+
+The separation between input and output datasets is a **design principle** for the Dagster workflows described in this document.
+
+Simpl-Open pre-built workflows already implement this approach by:
+- Clearly distinguishing input and output configurations
+- Ensuring safe, non-destructive data processing
+
+Adhering to these principles helps ensure:
+- Data integrity
+- Reproducibility
+- Safe pipeline execution without unintended overwrites