template-code-location/documents/Output Separation and Non-Overwrite Principles.md
2026-05-29 12:39:39 +02:00

Dagster Workflow Input/Output Separation and Non-Overwrite Principles

1. Objective

This document describes how Dagster workflows must be configured so that input datasets are never modified or overwritten during processing.

The workflow must:

  • Read data from a source location (input)
  • Write processed data to a separate destination (output)
  • Preserve the original dataset unchanged

2. Key Principles

To avoid overwriting input datasets, the following principles must always be applied when designing Dagster workflows:

2.1 Separation of Input and Output

  • Input and output must always refer to different storage locations
  • This separation can be enforced via:
    • Different file_key values
    • Different buckets or paths
    • Different prefixes or folders

2.2 Read-Only Input

  • Input datasets must be treated as immutable
  • No operation should write back to the input path

2.3 Explicit Output Configuration

  • Output destinations must be explicitly configured
  • Avoid default or implicit reuse of input configuration

2.4 Idempotent Processing

  • Workflow execution should not produce side effects on the original dataset
  • Re-running the workflow with the same input and configuration must produce the same output and must not alter the source data

3. Test Setup and Execution

3.1 Dataset Configuration

  • Prepare an input dataset (e.g. input.csv) in a defined location:
    • Example: s3://dagster-workflow-bucket/input.csv

3.2 Workflow Configuration

  • Configure Dagster to:
    • Read from the input dataset
    • Write results to a different location (e.g. output.csv)

3.3 Execution

  • Run the Dagster job
  • Ensure that:
    • Data is successfully processed
    • Output dataset is generated

4. Verification of No Overwrite

To validate correct behavior:

  • Compare input dataset before and after execution
  • Ensure:
    • File content is unchanged
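    • File timestamp/version is unchanged (if applicable; for S3 objects, the LastModified timestamp, the ETag, and the version ID when versioning is enabled)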
  • Verify that:
    • Output dataset exists in a different location
    • Output contains only processed data
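
The content comparison above can be sketched as a small helper. This is a minimal illustration, not part of the Simpl-Open workflows: it assumes the input is reachable as a local file (a stand-in for the S3 object), and both verify_input_unchanged and the run_workflow callable are hypothetical names introduced here.

```python
import hashlib
from pathlib import Path
from typing import Callable


def fingerprint(path: Path) -> str:
    """Return the SHA-256 hex digest of the file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def verify_input_unchanged(input_path: Path, run_workflow: Callable[[], object]) -> None:
    """Run the workflow and raise if the input file's content changed.

    `run_workflow` is a hypothetical callable that triggers the job;
    how the job is actually launched is outside this sketch.
    """
    before = fingerprint(input_path)
    run_workflow()
    after = fingerprint(input_path)
    if before != after:
        raise RuntimeError(f"{input_path} was modified during execution")
```

A content hash detects any byte-level change, but not a rewrite with identical content; that case is what the timestamp/version check is for.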

5. Example Simpl-Open Pre-Built Workflow Configuration

In Simpl-Open pre-built workflows, this principle is already enforced by design.

Below is an example configuration:

ops:
  apply_l_diversity:
    config:
      generalisation_hierarchies:
        age: simpl_age
      ident:
        - Name
      k: 2
      l: 3
      quasi_identifiers:
        - age
      sensitive_attribute: Disease
      supp_level: 50.0

  read_structured_from_s3:
    config:
      bucket_name: dagster-workflow-bucket
      file_format: csv
      file_key: input.csv

  write_df_to_s3:
    config:
      bucket_name: dagster-workflow-bucket
      file_format: csv
      file_key: output.csv

Explanation

  • read_structured_from_s3
    • Reads the dataset from input.csv
  • write_df_to_s3
    • Writes the processed dataset to output.csv

Key Point

Even when input and output share the same bucket, they remain separate objects because the output uses a different file_key.

This ensures that:

  • The input dataset (input.csv) is never overwritten
  • The output dataset is stored independently (output.csv)
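
The same check can be expressed in code against a run config with the shape shown above. The sketch below is illustrative only: check_no_overwrite and s3_target are hypothetical helpers, not part of Dagster or Simpl-Open, and the run config is assumed to have already been loaded into a Python dictionary.

```python
from typing import Any, Mapping, Tuple


def s3_target(run_config: Mapping[str, Any], op_name: str) -> Tuple[str, str]:
    """Return (bucket_name, file_key) configured for the given op."""
    op_config = run_config["ops"][op_name]["config"]
    return op_config["bucket_name"], op_config["file_key"]


def check_no_overwrite(
    run_config: Mapping[str, Any],
    read_op: str = "read_structured_from_s3",
    write_op: str = "write_df_to_s3",
) -> None:
    """Raise ValueError if the write op targets the object the read op reads."""
    source = s3_target(run_config, read_op)
    destination = s3_target(run_config, write_op)
    if source == destination:
        raise ValueError(
            f"Output s3://{destination[0]}/{destination[1]} would overwrite the input"
        )


# Same values as the example configuration (apply_l_diversity omitted for brevity).
run_config = {
    "ops": {
        "read_structured_from_s3": {
            "config": {
                "bucket_name": "dagster-workflow-bucket",
                "file_format": "csv",
                "file_key": "input.csv",
            }
        },
        "write_df_to_s3": {
            "config": {
                "bucket_name": "dagster-workflow-bucket",
                "file_format": "csv",
                "file_key": "output.csv",
            }
        },
    }
}

check_no_overwrite(run_config)  # passes: same bucket, different file_key
```

The comparison is on the exact (bucket_name, file_key) pair, since two S3 keys that differ by even one character refer to different objects.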

6. Configuration Guidelines

When creating or customizing Dagster workflows, follow these guidelines:

  • Always define a dedicated output path
  • Never reuse the same file_key for input and output
  • Prefer:
    • Different filenames (input.csv vs output.csv)
    • Or structured paths:
      • /input/...
      • /output/...
  • Validate configuration before execution
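
One way to apply the structured-path guideline is to derive the output key from the input key instead of typing both by hand. This is a sketch under assumptions: derive_output_key is a hypothetical helper, the prefix names input and output are illustrative, and keys are assumed to carry no leading slash.

```python
from pathlib import PurePosixPath


def derive_output_key(
    input_key: str,
    input_prefix: str = "input",
    output_prefix: str = "output",
) -> str:
    """Map '<input_prefix>/<rest>' to '<output_prefix>/<rest>'.

    Raises ValueError if the prefixes are identical or the key does not
    live under the input prefix, so the result can never equal the input key.
    """
    if input_prefix == output_prefix:
        raise ValueError("input and output prefixes must differ")
    parts = PurePosixPath(input_key).parts
    if len(parts) < 2 or parts[0] != input_prefix:
        raise ValueError(f"{input_key!r} is not under the {input_prefix!r} prefix")
    return str(PurePosixPath(output_prefix, *parts[1:]))


derive_output_key("input/2024/data.csv")  # -> "output/2024/data.csv"
```

Deriving the key keeps the folder structure aligned between input and output and rules out accidental reuse of the input file_key.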

7. Conclusion

Separating input and output datasets is a design principle that every Dagster workflow covered by this document must follow.

Simpl-Open pre-built workflows already implement this approach by:

  • Clearly distinguishing input and output configurations
  • Ensuring safe, non-destructive data processing

Adhering to these principles guarantees:

  • Data integrity
  • Reproducibility
  • Safe pipeline execution without unintended overwrites