
Matteo Basile bfc22e594a SIMPL-28034 init

2026-05-29 12:39:39 +02:00

3.9 KiB


Dagster Workflow – Input/Output Separation and Non-Overwrite Principles

1. Objective

This document describes how Dagster workflows must be configured so that input datasets are never modified or overwritten during processing.

The workflow must:

- Read data from a source location (input)
- Write processed data to a separate destination (output)
- Preserve the original dataset unchanged

2. Key Principles

To avoid overwriting input datasets, the following principles must always be applied when designing Dagster workflows:

2.1 Separation of Input and Output

- Input and output must always refer to different storage locations
- This separation can be enforced via:
  - Different file_key values
  - Different buckets or paths
  - Different prefixes or folders
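
The separation rule above can be checked mechanically. The following is a minimal sketch, not part of Dagster or Simpl-Open: `S3Location` and `is_separated` are hypothetical names, and the fields only mirror the `bucket_name` and `file_key` settings used later in this document. The key normalisation is an assumption made here to be conservative. S3 itself treats keys as literal strings, so this check is deliberately stricter than the storage layer.

```python
from dataclasses import dataclass
import posixpath


@dataclass(frozen=True)
class S3Location:
    """A bucket/key pair, mirroring bucket_name and file_key in the op config."""

    bucket_name: str
    file_key: str

    def normalized(self) -> tuple[str, str]:
        # Collapse duplicate slashes and "./" segments so that near-identical
        # keys such as "data//input.csv" and "data/input.csv" compare as equal.
        # This is a conservative assumption; S3 would treat them as distinct.
        key = posixpath.normpath(self.file_key.strip()).lstrip("/")
        return (self.bucket_name.strip(), key)


def is_separated(source: S3Location, destination: S3Location) -> bool:
    """Return True when the two locations point at different objects."""
    return source.normalized() != destination.normalized()


source = S3Location("dagster-workflow-bucket", "input.csv")
destination = S3Location("dagster-workflow-bucket", "output.csv")
separated = is_separated(source, destination)  # same bucket, different key
```

A different bucket, a different prefix, or a different file name each makes the two locations compare as distinct, which matches the three options listed above.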

2.2 Read-Only Input

- Input datasets must be treated as immutable
- No operation should write back to the input path

2.3 Explicit Output Configuration

- Output destinations must be explicitly configured
- The output must never fall back, by default or implicitly, to the input configuration
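
One way to honour the explicit-output rule is to refuse any fallback when the output key is missing. The sketch below is illustrative only: `resolve_output_key` is a hypothetical helper, and it assumes the write configuration is available as a plain dictionary with a `file_key` entry, as in the example configuration in this document.

```python
def resolve_output_key(write_config: dict) -> str:
    """Return the configured output file_key, or fail if it is missing.

    There is deliberately no fallback to the input configuration:
    a missing or empty value is an error, not a reason to reuse the input key.
    """
    file_key = write_config.get("file_key")
    if not isinstance(file_key, str) or not file_key.strip():
        raise ValueError("write config must set a non-empty file_key explicitly")
    return file_key


output_key = resolve_output_key(
    {"bucket_name": "dagster-workflow-bucket", "file_format": "csv", "file_key": "output.csv"}
)
```

Failing fast here means a misconfigured run stops before any data is written, rather than silently targeting the input path.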

2.4 Idempotent Processing

- Workflow execution must not produce side effects on the original dataset
- Re-running the workflow with the same configuration must not alter the source data

3. Test Setup and Execution

3.1 Dataset Configuration

Prepare an input dataset (e.g. input.csv) in a defined location:
- Example: s3://dagster-workflow-bucket/input.csv

3.2 Workflow Configuration

Configure Dagster to:
- Read from the input dataset
- Write results to a different location (e.g. output.csv)

3.3 Execution

- Run the Dagster job
- Ensure that:
  - Data is successfully processed
  - Output dataset is generated

4. Verification of No Overwrite

To validate correct behavior:

- Compare the input dataset before and after execution
- Ensure that:
  - File content is unchanged
  - File timestamp/version is unchanged (if applicable, e.g. the object's last-modified time or version ID)
- Verify that:
  - The output dataset exists in a different location
  - The output contains only processed data
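
The content comparison can be automated by hashing the input before and after the run. This is a minimal sketch under stated assumptions: it works on a local copy of the input file (for S3, the object would first need to be downloaded or its metadata compared instead), and `fingerprint`, `input_unchanged`, and the `run_job` callable are hypothetical names standing in for however the job is actually launched.

```python
import hashlib
from pathlib import Path
from typing import Callable


def fingerprint(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's content."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read in 1 MiB chunks so large datasets do not have to fit in memory.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def input_unchanged(path: Path, run_job: Callable[[], object]) -> bool:
    """Hash the input, call run_job(), hash again, and compare the digests."""
    before = fingerprint(path)
    run_job()
    return fingerprint(path) == before
```

A matching digest shows that the content is byte-for-byte identical. It says nothing about timestamps or versions, which still need the separate check listed above.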

5. Example – Simpl-Open Pre-Built Workflow Configuration

In Simpl-Open pre-built workflows, this principle is already enforced by design.

Below is an example configuration:

ops:
  apply_l_diversity:
    config:
      generalisation_hierarchies:
        age: simpl_age
      ident:
        - Name
      k: 2
      l: 3
      quasi_identifiers:
        - age
      sensitive_attribute: Disease
      supp_level: 50.0

  read_structured_from_s3:
    config:
      bucket_name: dagster-workflow-bucket
      file_format: csv
      file_key: input.csv

  write_df_to_s3:
    config:
      bucket_name: dagster-workflow-bucket
      file_format: csv
      file_key: output.csv

Explanation

read_structured_from_s3
- Reads the dataset from input.csv
write_df_to_s3
- Writes the processed dataset to output.csv

Key Point

Even when input and output share the same bucket, they remain separate because the output uses a different file_key.

This ensures that:

- The input dataset (input.csv) is never overwritten
- The output dataset is stored independently (output.csv)

6. Configuration Guidelines

When creating or customizing Dagster workflows, follow these guidelines:

- Always define a dedicated output path
- Never reuse the same file_key for input and output
- Prefer:
  - Different filenames (input.csv vs output.csv)
  - Or structured paths:
    - /input/...
    - /output/...
- Validate the configuration before execution, checking that the input and output locations differ
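
A pre-execution check along these lines could look as follows. It is a sketch, not an existing Simpl-Open or Dagster feature: `validate_run_config` is a hypothetical function, and it assumes the run configuration has already been loaded into a dictionary with the same shape and op names as the example in section 5.

```python
def validate_run_config(run_config: dict) -> None:
    """Raise ValueError if the read and write ops target the same S3 object."""
    ops = run_config.get("ops", {})
    try:
        source = ops["read_structured_from_s3"]["config"]
        target = ops["write_df_to_s3"]["config"]
        source_id = (source["bucket_name"], source["file_key"])
        target_id = (target["bucket_name"], target["file_key"])
    except KeyError as missing:
        # A missing op or field is treated as an error, never silently defaulted.
        raise ValueError(f"run config is missing {missing}") from None
    if source_id == target_id:
        raise ValueError(
            f"output would overwrite input: s3://{source_id[0]}/{source_id[1]}"
        )


run_config = {
    "ops": {
        "read_structured_from_s3": {
            "config": {
                "bucket_name": "dagster-workflow-bucket",
                "file_format": "csv",
                "file_key": "input.csv",
            }
        },
        "write_df_to_s3": {
            "config": {
                "bucket_name": "dagster-workflow-bucket",
                "file_format": "csv",
                "file_key": "output.csv",
            }
        },
    }
}

# Passes silently: the two ops use different file_key values.
validate_run_config(run_config)
```

Running the check before launching the job turns a potential overwrite into a configuration error that is raised before any data is read or written.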

7. Conclusion

Separating input and output datasets is a core design principle for the Dagster workflows described in this document.

Simpl-Open pre-built workflows already implement this approach by:

- Clearly distinguishing input and output configurations
- Ensuring safe, non-destructive data processing

Adhering to these principles helps ensure:

- Data integrity
- Reproducibility
- Safe pipeline execution without unintended overwrites
