SIMPL-28034 init

Matteo Basile
2026-05-29 12:39:39 +02:00
parent f5daf2f748
commit bfc22e594a


@@ -0,0 +1,151 @@
# Dagster Workflow Input/Output Separation and Non-Overwrite Principles
## 1. Objective
This document describes how Dagster workflows must be configured so that input datasets are **never modified or overwritten** during processing.
The workflow must:
- Read data from a **source location (input)**
- Write processed data to a **separate destination (output)**
- Preserve the original dataset unchanged
---
## 2. Key Principles
To avoid overwriting input datasets, the following principles must always be applied when designing Dagster workflows:
### 2.1 Separation of Input and Output
- Input and output must **always refer to different storage locations**
- This separation can be enforced via:
  - Different `file_key` values
  - Different buckets or paths
  - Different prefixes or folders
### 2.2 Read-Only Input
- Input datasets must be treated as **immutable**
- No operation should write back to the input path
### 2.3 Explicit Output Configuration
- Output destinations must be explicitly configured
- Avoid default or implicit reuse of input configuration
### 2.4 Idempotent Processing
- Workflow execution must not produce side effects on the original dataset
- Re-running the workflow with the same configuration must not alter the source data and should yield the same output
---
## 3. Test Setup and Execution
### 3.1 Dataset Configuration
- Prepare an input dataset (e.g. `input.csv`) in a defined location:
  - Example: `s3://dagster-workflow-bucket/input.csv`
### 3.2 Workflow Configuration
- Configure Dagster to:
  - Read from the input dataset
  - Write results to a different location (e.g. `output.csv`)
### 3.3 Execution
- Run the Dagster job
- Ensure that:
  - Data is successfully processed
  - Output dataset is generated
---
## 4. Verification of No Overwrite
To validate correct behavior:
- Compare the input dataset **before and after execution**
- Ensure:
  - File content is unchanged (for example, by comparing checksums)
  - File timestamp and version ID are unchanged (the version ID applies only where object versioning is enabled)
- Verify that:
  - Output dataset exists in a different location
  - Output contains only processed data
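
The before/after comparison can be sketched in a few lines of Python. This is a minimal illustration, not part of the Simpl-Open workflows: `ObjectSnapshot`, `take_snapshot` and `input_unchanged` are hypothetical names, and the sketch assumes the object bytes (and, optionally, a version ID) have already been fetched from storage by whatever client is in use.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ObjectSnapshot:
    """State of one stored object at a point in time (hypothetical helper)."""
    sha256: str                       # digest of the full object content
    version_id: Optional[str] = None  # only set when the store versions objects


def take_snapshot(content: bytes, version_id: Optional[str] = None) -> ObjectSnapshot:
    """Fingerprint the content so two reads can be compared cheaply."""
    return ObjectSnapshot(hashlib.sha256(content).hexdigest(), version_id)


def input_unchanged(before: ObjectSnapshot, after: ObjectSnapshot) -> bool:
    """True when the digest (and the version ID, if recorded) is identical."""
    return before == after


# Usage: snapshot the input, run the job, snapshot again, then compare.
before = take_snapshot(b"Name,age,Disease\nAnna,34,Flu\n")
# ... run the Dagster job here ...
after = take_snapshot(b"Name,age,Disease\nAnna,34,Flu\n")
print(input_unchanged(before, after))
```

Comparing a content digest rather than only a timestamp catches the case where an object is rewritten with different bytes, while the optional version ID catches a rewrite with identical bytes on a versioned bucket.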
---
## 5. Example Simpl-Open Pre-Built Workflow Configuration
In Simpl-Open pre-built workflows, this principle is **already enforced by design**.
Below is an example configuration:
```yaml
ops:
  apply_l_diversity:
    config:
      generalisation_hierarchies:
        age: simpl_age
      ident:
        - Name
      k: 2
      l: 3
      quasi_identifiers:
        - age
      sensitive_attribute: Disease
      supp_level: 50.0
  read_structured_from_s3:
    config:
      bucket_name: dagster-workflow-bucket
      file_format: csv
      file_key: input.csv
  write_df_to_s3:
    config:
      bucket_name: dagster-workflow-bucket
      file_format: csv
      file_key: output.csv
```
### Explanation
- `read_structured_from_s3`
  - Reads the dataset from `input.csv`
- `write_df_to_s3`
  - Writes the processed dataset to `output.csv`
### Key Point
Even when input and output share the **same bucket**, separation is maintained by using a **different `file_key` for the output**.
This ensures that:
- The input dataset (`input.csv`) is never overwritten
- The output dataset is stored independently (`output.csv`)
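
The same rule can be checked programmatically before a run. The sketch below assumes the run configuration has been loaded into a plain Python dictionary with the op names used in the example above; `s3_location` and `check_no_overwrite` are hypothetical helpers, not functions provided by Dagster or Simpl-Open.

```python
def s3_location(run_config: dict, op_name: str) -> tuple:
    """Return (bucket_name, file_key) for one op in the run config."""
    cfg = run_config["ops"][op_name]["config"]
    return (cfg["bucket_name"], cfg["file_key"])


def check_no_overwrite(run_config: dict,
                       read_op: str = "read_structured_from_s3",
                       write_op: str = "write_df_to_s3") -> None:
    """Raise ValueError if the write op targets the object the read op reads."""
    source = s3_location(run_config, read_op)
    target = s3_location(run_config, write_op)
    if source == target:
        raise ValueError(
            f"output would overwrite input: s3://{source[0]}/{source[1]}"
        )


# Dictionary form of the read/write part of the example configuration.
run_config = {
    "ops": {
        "read_structured_from_s3": {"config": {
            "bucket_name": "dagster-workflow-bucket",
            "file_format": "csv", "file_key": "input.csv"}},
        "write_df_to_s3": {"config": {
            "bucket_name": "dagster-workflow-bucket",
            "file_format": "csv", "file_key": "output.csv"}},
    }
}
check_no_overwrite(run_config)  # passes: same bucket, different file_key
```

The check compares the bucket and key together, so reusing a key in a different bucket is accepted while an identical bucket and key pair is rejected.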
---
## 6. Configuration Guidelines
When creating or customizing Dagster workflows, follow these guidelines:
- Always define a **dedicated output path**
- Never reuse the same `file_key` for input and output within the same bucket
- Prefer:
  - Different filenames (`input.csv` vs `output.csv`)
  - Or structured paths:
    - `/input/...`
    - `/output/...`
- Validate configuration before execution
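
One way to apply the structured-path guideline is to derive the output key from the input key instead of typing both by hand. The helper below is a hypothetical sketch: `derive_output_key` and its default prefixes `input` and `output` are assumptions chosen for illustration, and real workflows may use a different layout.

```python
from pathlib import PurePosixPath


def derive_output_key(input_key: str,
                      input_prefix: str = "input",
                      output_prefix: str = "output") -> str:
    """Map 'input/<rest>' to 'output/<rest>'; reject keys outside the input prefix."""
    if input_prefix == output_prefix:
        raise ValueError("input and output prefixes must differ")
    # Object keys normally have no leading slash, so drop one if present.
    parts = PurePosixPath(input_key.lstrip("/")).parts
    if not parts or parts[0] != input_prefix:
        raise ValueError(f"{input_key!r} is not under the {input_prefix!r} prefix")
    return str(PurePosixPath(output_prefix, *parts[1:]))


print(derive_output_key("input/2026/patients.csv"))  # output/2026/patients.csv
```

Because the output key is computed from a different prefix, it cannot coincide with the input key, and a key that does not sit under the expected input prefix fails early instead of being written somewhere unexpected.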
---
## 7. Conclusion
The separation between input and output datasets is a **design principle** in Dagster workflows.
Simpl-Open pre-built workflows already implement this approach by:
- Clearly distinguishing input and output configurations
- Ensuring safe, non-destructive data processing
Adhering to these principles guarantees:
- Data integrity
- Reproducibility
- Safe pipeline execution without unintended overwrites