SIMPL-28034 init
documents/Output Separation and Non-Overwrite Principles.md
# Dagster Workflow – Input/Output Separation and Non-Overwrite Principles

## 1. Objective

This document describes how Dagster workflows must be configured so that input datasets are **never modified or overwritten** during processing.

The workflow must:
- Read data from a **source location (input)**
- Write processed data to a **separate destination (output)**
- Preserve the original dataset unchanged

---

## 2. Key Principles

To avoid overwriting input datasets, the following principles must always be applied when designing Dagster workflows:

### 2.1 Separation of Input and Output
- Input and output must **always refer to different storage locations**
- This separation can be enforced via:
  - Different `file_key` values
  - Different buckets or paths
  - Different prefixes or folders

### 2.2 Read-Only Input
- Input datasets must be treated as **immutable**
- No operation should write back to the input path

### 2.3 Explicit Output Configuration
- Output destinations must be explicitly configured
- Avoid default or implicit reuse of input configuration

### 2.4 Idempotent Processing
- Workflow execution should not produce side effects on the original dataset
- Re-running the workflow must not alter the source data
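
The separation principle can be checked mechanically before a run starts. The following is a minimal sketch, not part of Simpl-Open: it assumes the run configuration has already been loaded into a Python dictionary and that the read and write ops are named as in the example configuration of this document (`read_structured_from_s3`, `write_df_to_s3`). The helper names `s3_location` and `assert_separate_io` are hypothetical. A caller would invoke `assert_separate_io(run_config)` just before launching the job and abort on `ValueError`.

```python
from typing import Any, Mapping, Tuple

# Op names taken from the example configuration in this document;
# other workflows may use different names.
READ_OP = "read_structured_from_s3"
WRITE_OP = "write_df_to_s3"


def s3_location(run_config: Mapping[str, Any], op_name: str) -> Tuple[str, str]:
    """Return (bucket_name, file_key) configured for one op."""
    cfg = run_config["ops"][op_name]["config"]
    return cfg["bucket_name"], cfg["file_key"]


def assert_separate_io(
    run_config: Mapping[str, Any],
    read_op: str = READ_OP,
    write_op: str = WRITE_OP,
) -> None:
    """Raise ValueError if the write op targets the same object as the read op."""
    source = s3_location(run_config, read_op)
    target = s3_location(run_config, write_op)
    if source == target:
        raise ValueError(
            f"Output would overwrite input: s3://{source[0]}/{source[1]}"
        )
```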

---

## 3. Test Setup and Execution

### 3.1 Dataset Configuration
- Prepare an input dataset (e.g. `input.csv`) in a defined location:
  - Example: `s3://dagster-workflow-bucket/input.csv`

### 3.2 Workflow Configuration
- Configure Dagster to:
  - Read from the input dataset
  - Write results to a different location (e.g. `output.csv`)

### 3.3 Execution
- Run the Dagster job
- Ensure that:
  - Data is successfully processed
  - Output dataset is generated

---

## 4. Verification of No Overwrite

To validate correct behavior:

- Compare the input dataset **before and after execution**
- Ensure:
  - File content is unchanged
  - File timestamp/version is unchanged (if applicable)
- Verify that:
  - Output dataset exists in a different location
  - Output contains only processed data
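
The before/after comparison can be automated. The sketch below uses local files as a stand-in for S3 objects so that it runs without any cloud access; the names `Fingerprint`, `fingerprint` and `verify_input_unchanged` are hypothetical, and `run_job` is a placeholder for whatever launches the Dagster job. Against S3, the analogous comparison would presumably use the object's `ETag`, `LastModified` and, on versioned buckets, `VersionId` instead of a local hash and modification time.

```python
import hashlib
from pathlib import Path
from typing import Callable, NamedTuple


class Fingerprint(NamedTuple):
    sha256: str    # content hash
    size: int      # size in bytes
    mtime_ns: int  # last-modified time in nanoseconds


def fingerprint(path: Path) -> Fingerprint:
    """Capture content hash, size and modification time of a file."""
    stat = path.stat()
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    return Fingerprint(digest, stat.st_size, stat.st_mtime_ns)


def verify_input_unchanged(
    input_path: Path, output_path: Path, run_job: Callable[[], None]
) -> None:
    """Run the job and check that the input is untouched and the output exists."""
    if input_path.resolve() == output_path.resolve():
        raise ValueError("Input and output refer to the same location")
    before = fingerprint(input_path)
    run_job()  # placeholder for the actual job launch
    after = fingerprint(input_path)
    if after != before:
        raise AssertionError(f"Input changed during run: {before} -> {after}")
    if not output_path.exists():
        raise AssertionError(f"Output was not generated: {output_path}")
```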

---

## 5. Example – Simpl-Open Pre-Built Workflow Configuration

In Simpl-Open pre-built workflows, this principle is **already enforced by design**.

Below is an example configuration:

```yaml
ops:
  apply_l_diversity:
    config:
      generalisation_hierarchies:
        age: simpl_age
      ident:
        - Name
      k: 2
      l: 3
      quasi_identifiers:
        - age
      sensitive_attribute: Disease
      supp_level: 50.0

  read_structured_from_s3:
    config:
      bucket_name: dagster-workflow-bucket
      file_format: csv
      file_key: input.csv

  write_df_to_s3:
    config:
      bucket_name: dagster-workflow-bucket
      file_format: csv
      file_key: output.csv
```

### Explanation

- `read_structured_from_s3`
  - Reads the dataset from `input.csv`

- `write_df_to_s3`
  - Writes the processed dataset to `output.csv`

### Key Point

Even when using the **same bucket**, separation is guaranteed by:
- Using a **different `file_key` for output**

This ensures that:
- The input dataset (`input.csv`) is never overwritten
- The output dataset is stored independently (`output.csv`)

---

## 6. Configuration Guidelines

When creating or customizing Dagster workflows, follow these guidelines:

- Always define a **dedicated output path**
- Never reuse the same `file_key` for input and output within the same bucket
- Prefer:
  - Different filenames (`input.csv` vs `output.csv`)
  - Or structured paths:
    - `/input/...`
    - `/output/...`
- Validate configuration before execution
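
One way to follow the structured-path guideline is to derive the output key from the input key instead of typing both by hand. The sketch below is illustrative only: the function `derive_output_key` and the default prefixes `input` and `output` are assumptions that mirror the `/input/...` and `/output/...` convention above, not an existing Simpl-Open helper. Under these assumptions, `derive_output_key("input/2024/data.csv")` yields `output/2024/data.csv`, and a key outside the input prefix is rejected rather than silently mapped.

```python
from pathlib import PurePosixPath


def derive_output_key(
    input_key: str, input_prefix: str = "input", output_prefix: str = "output"
) -> str:
    """Map '<input_prefix>/a/b.csv' to '<output_prefix>/a/b.csv'."""
    if input_prefix == output_prefix:
        raise ValueError("Input and output prefixes must differ")
    # A leading slash is dropped so that '/input/x.csv' and 'input/x.csv'
    # are handled the same way (a choice made for this sketch).
    parts = PurePosixPath(input_key.lstrip("/")).parts
    if len(parts) < 2 or parts[0] != input_prefix:
        raise ValueError(
            f"Input key {input_key!r} is not under the {input_prefix!r} prefix"
        )
    # Keep the relative path, swap only the top-level prefix.
    return str(PurePosixPath(output_prefix, *parts[1:]))
```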

---

## 7. Conclusion

The separation between input and output datasets is a **design principle** in Dagster workflows.

Simpl-Open pre-built workflows already implement this approach by:
- Clearly distinguishing input and output configurations
- Ensuring safe, non-destructive data processing

Adhering to these principles guarantees:
- Data integrity
- Reproducibility
- Safe pipeline execution without unintended overwrites