template-code-location/documents/Output Separation and Non-Overwrite Principles.md
2026-05-29 12:39:39 +02:00

# Dagster Workflow Input/Output Separation and Non-Overwrite Principles
## 1. Objective
The purpose of this document is to describe how Dagster workflows must be configured to ensure that input datasets are **never modified or overwritten** during processing.
The workflow must:
- Read data from a **source location (input)**
- Write processed data to a **separate destination (output)**
- Preserve the original dataset unchanged
---
## 2. Key Principles
To avoid overwriting input datasets, the following principles must always be applied when designing Dagster workflows:
### 2.1 Separation of Input and Output
- Input and output must **always refer to different storage locations**
- This separation can be enforced via:
  - Different `file_key` values
  - Different buckets or paths
  - Different prefixes or folders
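
The separation rule can be checked mechanically before a run. The following is a minimal sketch, assuming locations are written as `s3://bucket/key` URIs; `parse_s3_uri` and `is_separated` are hypothetical helper names, not part of Dagster or the Simpl-Open workflows.

```python
from urllib.parse import urlsplit


def parse_s3_uri(uri: str) -> tuple:
    """Split 's3://bucket/key' into (bucket, key)."""
    parts = urlsplit(uri)
    if parts.scheme != "s3" or not parts.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    # Keys containing '?' or '#' would need different handling; this sketch
    # only covers plain path-like keys.
    return parts.netloc, parts.path.lstrip("/")


def is_separated(input_uri: str, output_uri: str) -> bool:
    """True when the two URIs name different objects (bucket or key differs)."""
    return parse_s3_uri(input_uri) != parse_s3_uri(output_uri)


# Same bucket, different key: the two locations are separate.
print(is_separated(
    "s3://dagster-workflow-bucket/input.csv",
    "s3://dagster-workflow-bucket/output.csv",
))
```

A check like this only compares names; it says nothing about whether an op actually writes where its configuration says it does.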
### 2.2 Read-Only Input
- Input datasets must be treated as **immutable**
- No operation should write back to the input path
### 2.3 Explicit Output Configuration
- Output destinations must be explicitly configured
- Avoid default or implicit reuse of input configuration
### 2.4 Idempotent Processing
- Workflow execution should not produce side effects on the original dataset
- Re-running the workflow must not alter the source data
---
## 3. Test Setup and Execution
### 3.1 Dataset Configuration
- Prepare an input dataset (e.g. `input.csv`) in a defined location:
  - Example: `s3://dagster-workflow-bucket/input.csv`
### 3.2 Workflow Configuration
- Configure Dagster to:
  - Read from the input dataset
  - Write results to a different location (e.g. `output.csv`)
### 3.3 Execution
- Run the Dagster job
- Ensure that:
  - Data is successfully processed
  - Output dataset is generated
---
## 4. Verification of No Overwrite
To validate correct behavior:
- Compare input dataset **before and after execution**
- Ensure:
  - File content is unchanged
  - File timestamp/version is unchanged (if applicable)
- Verify that:
  - Output dataset exists in a different location
  - Output contains only processed data
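
The before/after comparison can be sketched as follows. This is an illustration on local files, not the workflow's own test code: `fingerprint` and `run_and_verify` are hypothetical names, and `job` stands for any callable that reads the source and writes the destination. On S3, the analogous check would compare the object's ETag or version ID before and after the run, which is not shown here.

```python
import hashlib
from pathlib import Path


def fingerprint(path: Path) -> tuple:
    """Return (sha256 hex digest, modification time in ns) for a file."""
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    return digest, path.stat().st_mtime_ns


def run_and_verify(job, source: Path, destination: Path) -> None:
    """Run `job`, then fail if the source changed or no separate output exists."""
    before = fingerprint(source)
    job(source, destination)

    # Content and timestamp of the input must be identical after the run.
    if fingerprint(source) != before:
        raise AssertionError(f"input {source} was modified")

    # The output must exist and must not be the input file itself.
    if destination.resolve() == source.resolve() or not destination.exists():
        raise AssertionError("output missing or not in a separate location")
```

Calling `run_and_verify` a second time with the same arguments also exercises the re-run requirement from section 2.4.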
---
## 5. Example Simpl-Open Pre-Built Workflow Configuration
In Simpl-Open pre-built workflows, this principle is **already enforced by design**.
Below is an example configuration:
```yaml
ops:
  apply_l_diversity:
    config:
      generalisation_hierarchies:
        age: simpl_age
      ident:
        - Name
      k: 2
      l: 3
      quasi_identifiers:
        - age
      sensitive_attribute: Disease
      supp_level: 50.0
  read_structured_from_s3:
    config:
      bucket_name: dagster-workflow-bucket
      file_format: csv
      file_key: input.csv
  write_df_to_s3:
    config:
      bucket_name: dagster-workflow-bucket
      file_format: csv
      file_key: output.csv
```
### Explanation
- `read_structured_from_s3`
  - Reads the dataset from `input.csv`
- `write_df_to_s3`
  - Writes the processed dataset to `output.csv`
### Key Point
Even when input and output share the **same bucket**, separation is guaranteed by:
- Using a **different `file_key` for output**

Because an S3 object is identified by its bucket and key together, this ensures that:
- The input dataset (`input.csv`) is never overwritten
- The output dataset is stored independently (`output.csv`)
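
The same reasoning can be applied to a run configuration before it is submitted. The sketch below assumes the configuration has already been loaded into a Python dict with the op names used in this section; `check_run_config` is a hypothetical helper, not a Dagster or Simpl-Open function.

```python
def check_run_config(run_config: dict) -> None:
    """Raise ValueError if output is not explicit or would hit the input object."""
    ops = run_config["ops"]
    source = ops["read_structured_from_s3"]["config"]
    target = ops["write_df_to_s3"]["config"]

    # Both ends must name their location explicitly; nothing is inherited.
    for op_name, cfg in (
        ("read_structured_from_s3", source),
        ("write_df_to_s3", target),
    ):
        for field in ("bucket_name", "file_key"):
            if not cfg.get(field):
                raise ValueError(f"{op_name}: missing explicit {field}")

    # An S3 object is identified by bucket and key together.
    same_bucket = source["bucket_name"] == target["bucket_name"]
    same_key = source["file_key"] == target["file_key"]
    if same_bucket and same_key:
        raise ValueError("write_df_to_s3 would overwrite the input object")


# The storage part of the configuration above, as a YAML loader would return it
# (apply_l_diversity is omitted because it has no storage settings).
run_config = {
    "ops": {
        "read_structured_from_s3": {
            "config": {
                "bucket_name": "dagster-workflow-bucket",
                "file_format": "csv",
                "file_key": "input.csv",
            }
        },
        "write_df_to_s3": {
            "config": {
                "bucket_name": "dagster-workflow-bucket",
                "file_format": "csv",
                "file_key": "output.csv",
            }
        },
    }
}

check_run_config(run_config)  # returns without raising
```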
---
## 6. Configuration Guidelines
When creating or customizing Dagster workflows, follow these guidelines:
- Always define a **dedicated output path**
- Never reuse the same `file_key` for input and output
- Prefer:
  - Different filenames (`input.csv` vs `output.csv`)
  - Or structured paths:
    - `/input/...`
    - `/output/...`
- Validate configuration before execution
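
One way to apply the structured-path guideline is to derive the output key from the input key, so the two can never coincide. This is a sketch under assumed conventions: the `input` and `output` prefix names and the `output_key_for` helper are illustrative, not something the workflows define.

```python
from pathlib import PurePosixPath


def output_key_for(input_key: str,
                   in_prefix: str = "input",
                   out_prefix: str = "output") -> str:
    """Map '<in_prefix>/<rest>' to '<out_prefix>/<rest>'."""
    if in_prefix == out_prefix:
        raise ValueError("input and output prefixes must differ")
    try:
        # Keep everything after the input prefix, e.g. '2024/data.csv'.
        rest = PurePosixPath(input_key).relative_to(in_prefix)
    except ValueError:
        raise ValueError(f"{input_key!r} is not under {in_prefix!r}/") from None
    return str(PurePosixPath(out_prefix) / rest)


print(output_key_for("input/2024/data.csv"))
```

Rejecting keys outside the input prefix keeps the mapping explicit: a misplaced file fails loudly instead of being written to an unexpected location.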
---
## 7. Conclusion
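The separation between input and output datasets is a **design principle** for these Dagster workflows: it comes from how each workflow is configured, so it must be preserved whenever a workflow is customized.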
Simpl-Open pre-built workflows already implement this approach by:
- Clearly distinguishing input and output configurations
- Ensuring safe, non-destructive data processing

Adhering to these principles guarantees:
- Data integrity
- Reproducibility
- Safe pipeline execution without unintended overwrites