From bfc22e594a9821f90f4f303df02983b5f853c440 Mon Sep 17 00:00:00 2001
From: Matteo Basile
Date: Fri, 29 May 2026 12:39:39 +0200
Subject: [PATCH] SIMPL-28034 init

---
 ...Separation and Non-Overwrite Principles.md | 151 ++++++++++++++++++
 1 file changed, 151 insertions(+)
 create mode 100644 documents/Output Separation and Non-Overwrite Principles.md

diff --git a/documents/Output Separation and Non-Overwrite Principles.md b/documents/Output Separation and Non-Overwrite Principles.md
new file mode 100644
index 0000000..cc09d63
--- /dev/null
+++ b/documents/Output Separation and Non-Overwrite Principles.md
@@ -0,0 +1,151 @@
+# Dagster Workflow – Input/Output Separation and Non-Overwrite Principles
+
+## 1. Objective
+
+The purpose of this document is to describe how Dagster workflows must be configured to ensure that input datasets are **never modified or overwritten** during processing.
+
+The workflow must:
+- Read data from a **source location (input)**
+- Write processed data to a **separate destination (output)**
+- Preserve the original dataset unchanged
+
+---
+
+## 2. Key Principles
+
+To avoid overwriting input datasets, the following principles must always be applied when designing Dagster workflows:
+
+### 2.1 Separation of Input and Output
+- Input and output must **always refer to different storage locations**
+- This separation can be enforced via:
+  - Different `file_key` values
+  - Different buckets or paths
+  - Different prefixes or folders
+
+### 2.2 Read-Only Input
+- Input datasets must be treated as **immutable**
+- No operation should write back to the input path
+
+### 2.3 Explicit Output Configuration
+- Output destinations must be explicitly configured
+- Avoid default or implicit reuse of input configuration
+
+### 2.4 Idempotent Processing
+- Workflow execution should not produce side effects on the original dataset
+- Re-running the workflow must not alter the source data
+
+---
+
+## 3. Test Setup and Execution
+
+### 3.1 Dataset Configuration
+- Prepare an input dataset (e.g. `input.csv`) in a defined location:
+  - Example: `s3://dagster-workflow-bucket/input.csv`
+
+### 3.2 Workflow Configuration
+- Configure Dagster to:
+  - Read from the input dataset
+  - Write results to a different location (e.g. `output.csv`)
+
+### 3.3 Execution
+- Run the Dagster job
+- Ensure that:
+  - Data is successfully processed
+  - Output dataset is generated
+
+---
+
+## 4. Verification of No Overwrite
+
+To validate correct behavior:
+
+- Compare input dataset **before and after execution**
+- Ensure:
+  - File content is unchanged
+  - File timestamp/version is unchanged (if applicable)
+- Verify that:
+  - Output dataset exists in a different location
+  - Output contains only processed data
+
+---
+
+## 5. Example – Simpl-Open Pre-Built Workflow Configuration
+
+In Simpl-Open pre-built workflows, this principle is **already enforced by design**.
+
+Below is an example configuration:
+
+```yaml
+ops:
+  apply_l_diversity:
+    config:
+      generalisation_hierarchies:
+        age: simpl_age
+      ident:
+        - Name
+      k: 2
+      l: 3
+      quasi_identifiers:
+        - age
+      sensitive_attribute: Disease
+      supp_level: 50.0
+
+  read_structured_from_s3:
+    config:
+      bucket_name: dagster-workflow-bucket
+      file_format: csv
+      file_key: input.csv
+
+  write_df_to_s3:
+    config:
+      bucket_name: dagster-workflow-bucket
+      file_format: csv
+      file_key: output.csv
+```
+
+### Explanation
+
+- `read_structured_from_s3`
+  - Reads the dataset from `input.csv`
+
+- `write_df_to_s3`
+  - Writes the processed dataset to `output.csv`
+
+### Key Point
+
+Even when using the **same bucket**, separation is guaranteed by:
+- Using a **different `file_key` for output**
+
+This ensures that:
+- The input dataset (`input.csv`) is never overwritten
+- The output dataset is stored independently (`output.csv`)
+
+---
+
+## 6. Configuration Guidelines
+
+When creating or customizing Dagster workflows, follow these guidelines:
+
+- Always define a **dedicated output path**
+- Never reuse the same `file_key` for input and output
+- Prefer:
+  - Different filenames (`input.csv` vs `output.csv`)
+  - Or structured paths:
+    - `/input/...`
+    - `/output/...`
+- Validate configuration before execution
+
+---
+
+## 7. Conclusion
+
+The separation between input and output datasets is a **design principle** in Dagster workflows.
+
+Simpl-Open pre-built workflows already implement this approach by:
+- Clearly distinguishing input and output configurations
+- Ensuring safe, non-destructive data processing
+
+Adhering to these principles guarantees:
+- Data integrity
+- Reproducibility
+- Safe pipeline execution without unintended overwrites
\ No newline at end of file
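
Review note on section 6 ("Validate configuration before execution"): the document does not say how this validation should be done. One possible approach is a pre-flight check on the run configuration, sketched below. It assumes the configuration has already been loaded into a plain Python dictionary with the same shape as the example in section 5. The op names come from that example; the helper names `s3_location` and `assert_io_separated` are hypothetical and are not part of Dagster or Simpl-Open.

```python
from typing import Any, Mapping, Tuple

# Op names taken from the example configuration in section 5.
READ_OP = "read_structured_from_s3"
WRITE_OP = "write_df_to_s3"


def s3_location(run_config: Mapping[str, Any], op_name: str) -> Tuple[str, str]:
    """Return (bucket_name, file_key) for one op of the run config."""
    op_config = run_config["ops"][op_name]["config"]
    return op_config["bucket_name"], op_config["file_key"]


def assert_io_separated(run_config: Mapping[str, Any]) -> None:
    """Hypothetical pre-flight check: raise if the output would overwrite the input."""
    source = s3_location(run_config, READ_OP)
    target = s3_location(run_config, WRITE_OP)
    # The input is overwritten only when both bucket and key are identical.
    if source == target:
        raise ValueError(
            f"output s3://{target[0]}/{target[1]} would overwrite the input dataset"
        )


# The read/write part of the section 5 example, written as a dictionary.
example_config = {
    "ops": {
        READ_OP: {
            "config": {
                "bucket_name": "dagster-workflow-bucket",
                "file_format": "csv",
                "file_key": "input.csv",
            }
        },
        WRITE_OP: {
            "config": {
                "bucket_name": "dagster-workflow-bucket",
                "file_format": "csv",
                "file_key": "output.csv",
            }
        },
    }
}

# Same bucket, different file_key: the check passes without raising.
assert_io_separated(example_config)
```

The check compares the `(bucket_name, file_key)` pair, not `file_key` alone, so writing the same key to a different bucket is accepted, in line with section 2.1. It only inspects configuration values. It does not confirm that the input object is unchanged after a run, which is what the comparison in section 4 covers.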