# Dagster Workflow – Input/Output Separation and Non-Overwrite Principles

## 1. Objective

The purpose of this document is to describe how Dagster workflows must be configured to ensure that input datasets are **never modified or overwritten** during processing.

The workflow must:

- Read data from a **source location (input)**
- Write processed data to a **separate destination (output)**
- Preserve the original dataset unchanged

---

## 2. Key Principles

To avoid overwriting input datasets, the following principles must always be applied when designing Dagster workflows:

### 2.1 Separation of Input and Output

- Input and output must **always refer to different storage locations**
- This separation can be enforced via:
  - Different `file_key` values
  - Different buckets or paths
  - Different prefixes or folders

### 2.2 Read-Only Input

- Input datasets must be treated as **immutable**
- No operation should write back to the input path

### 2.3 Explicit Output Configuration

- Output destinations must be explicitly configured
- Avoid default or implicit reuse of input configuration

### 2.4 Idempotent Processing

- Workflow execution should not produce side effects on the original dataset
- Re-running the workflow must not alter the source data

---

## 3. Test Setup and Execution

### 3.1 Dataset Configuration

- Prepare an input dataset (e.g. `input.csv`) in a defined location:
  - Example: `s3://dagster-workflow-bucket/input.csv`

### 3.2 Workflow Configuration

- Configure Dagster to:
  - Read from the input dataset
  - Write results to a different location (e.g. `output.csv`)

### 3.3 Execution

- Run the Dagster job
- Ensure that:
  - Data is successfully processed
  - Output dataset is generated

---

## 4. Verification of No Overwrite

To validate correct behavior:

- Compare input dataset **before and after execution**
- Ensure:
  - File content is unchanged
  - File timestamp/version is unchanged (if applicable)
- Verify that:
  - Output dataset exists in a different location
  - Output contains only processed data

---

## 5. Example – Simpl-Open Pre-Built Workflow Configuration

In Simpl-Open pre-built workflows, this principle is **already enforced by design**.

Below is an example configuration:

```yaml
ops:
  apply_l_diversity:
    config:
      generalisation_hierarchies:
        age: simpl_age
      ident:
        - Name
      k: 2
      l: 3
      quasi_identifiers:
        - age
      sensitive_attribute: Disease
      supp_level: 50.0
  read_structured_from_s3:
    config:
      bucket_name: dagster-workflow-bucket
      file_format: csv
      file_key: input.csv
  write_df_to_s3:
    config:
      bucket_name: dagster-workflow-bucket
      file_format: csv
      file_key: output.csv
```

### Explanation

- `read_structured_from_s3`
  - Reads the dataset from `input.csv`
- `write_df_to_s3`
  - Writes the processed dataset to `output.csv`

### Key Point

Even when using the **same bucket**, separation is guaranteed by:

- Using a **different `file_key` for output**

This ensures that:

- The input dataset (`input.csv`) is never overwritten
- The output dataset is stored independently (`output.csv`)

---

## 6. Configuration Guidelines

When creating or customizing Dagster workflows, follow these guidelines:

- Always define a **dedicated output path**
- Never reuse the same `file_key` for input and output
- Prefer:
  - Different filenames (`input.csv` vs `output.csv`)
  - Or structured paths:
    - `/input/...`
    - `/output/...`
- Validate configuration before execution

---

## 7. Conclusion

The separation between input and output datasets is a **design principle** in Dagster workflows.
Simpl-Open pre-built workflows already implement this approach by:

- Clearly distinguishing input and output configurations
- Ensuring safe, non-destructive data processing

Adhering to these principles guarantees:

- Data integrity
- Reproducibility
- Safe pipeline execution without unintended overwrites
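
As a minimal sketch of the "validate configuration before execution" guideline (Section 6) and the before/after comparison (Section 4), the following Python shows one way these checks could be written. It assumes the run configuration has already been loaded into a Python dictionary and uses the op names from the Section 5 example. The helper names (`location`, `assert_io_separated`, `sha256_of`, `demo_local_check`) are hypothetical and are not part of Dagster or Simpl-Open, and local temporary files stand in for S3 objects.

```python
import hashlib
import tempfile
from pathlib import Path

# Op names and config keys mirror the example configuration in Section 5.
# All helpers below are hypothetical illustrations, not Dagster or Simpl-Open APIs.
READ_OP = "read_structured_from_s3"
WRITE_OP = "write_df_to_s3"


def location(op_config: dict) -> tuple:
    """Return the (bucket, key) pair an op reads from or writes to."""
    return (op_config["bucket_name"], op_config["file_key"])


def assert_io_separated(run_config: dict) -> None:
    """Raise ValueError if the write op targets the same object as the read op."""
    ops = run_config["ops"]
    source = location(ops[READ_OP]["config"])
    target = location(ops[WRITE_OP]["config"])
    if source == target:
        raise ValueError(
            f"output would overwrite input at s3://{source[0]}/{source[1]}"
        )


def sha256_of(path: Path) -> str:
    """Hash a file's bytes so its content can be compared before and after a run."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def demo_local_check() -> bool:
    """Simulate a run on local files and report whether the input stayed unchanged."""
    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp) / "input.csv"
        target = Path(tmp) / "output.csv"
        source.write_text("Name,age,Disease\nAlice,34,Flu\n")
        before = sha256_of(source)
        # Stand-in for the real job: read the input, write a result elsewhere.
        target.write_text(source.read_text().upper())
        return sha256_of(source) == before and target.exists()


if __name__ == "__main__":
    run_config = {
        "ops": {
            READ_OP: {"config": {"bucket_name": "dagster-workflow-bucket",
                                 "file_format": "csv", "file_key": "input.csv"}},
            WRITE_OP: {"config": {"bucket_name": "dagster-workflow-bucket",
                                  "file_format": "csv", "file_key": "output.csv"}},
        }
    }
    assert_io_separated(run_config)  # passes: the file_key values differ
    print("input unchanged:", demo_local_check())
```

The configuration check compares the bucket and `file_key` together, so two ops may share a bucket as long as their keys differ, which matches the Key Point in Section 5. Against real S3 storage, object metadata such as a version identifier could take the place of the local hash, depending on how the bucket is set up.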