# Dagster Workflow – Input/Output Separation and Non-Overwrite Principles

## 1. Objective

This document describes how Dagster workflows must be configured so that input datasets are **never modified or overwritten** during processing.

The workflow must:
- Read data from a **source location (input)**
- Write processed data to a **separate destination (output)**
- Preserve the original dataset unchanged

---

## 2. Key Principles

To avoid overwriting input datasets, the following principles must always be applied when designing Dagster workflows:

### 2.1 Separation of Input and Output
- Input and output must **always refer to different storage locations**
- This separation can be enforced via:
  - Different `file_key` values
  - Different buckets or paths
  - Different prefixes or folders
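
The separation rule above can be checked mechanically before a job runs. The following is a minimal sketch, assuming locations are expressed as `s3://bucket/key` URIs; the names `S3Location`, `parse_s3_uri`, and `assert_separate` are hypothetical helpers for illustration and are not part of Dagster or Simpl-Open.

```python
from typing import NamedTuple, Tuple
from urllib.parse import urlparse


class S3Location(NamedTuple):
    """Bucket and object key extracted from an s3:// URI (hypothetical helper type)."""
    bucket: str
    key: str


def parse_s3_uri(uri: str) -> S3Location:
    """Split an s3://bucket/key URI into its bucket and key."""
    parsed = urlparse(uri)
    key = parsed.path[1:]  # drop the single "/" that separates bucket and key
    if parsed.scheme != "s3" or not parsed.netloc or not key:
        raise ValueError(f"not a complete S3 object URI: {uri!r}")
    return S3Location(bucket=parsed.netloc, key=key)


def assert_separate(input_uri: str, output_uri: str) -> Tuple[S3Location, S3Location]:
    """Raise ValueError if input and output resolve to the same bucket and key."""
    source = parse_s3_uri(input_uri)
    destination = parse_s3_uri(output_uri)
    if source == destination:
        raise ValueError(f"output would overwrite input: {input_uri!r}")
    return source, destination


if __name__ == "__main__":
    # Same bucket, different keys: accepted.
    assert_separate(
        "s3://dagster-workflow-bucket/input.csv",
        "s3://dagster-workflow-bucket/output.csv",
    )
```

The comparison is on the full (bucket, key) pair, so the same key in two different buckets is accepted, in line with the "different buckets or paths" option.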

### 2.2 Read-Only Input
- Input datasets must be treated as **immutable**
- No operation should write back to the input path

### 2.3 Explicit Output Configuration
- Output destinations must be explicitly configured
- Avoid default or implicit reuse of input configuration

### 2.4 Idempotent Processing
- Workflow execution must not produce side effects on the original dataset
- Re-running the workflow with the same configuration must leave the source data unchanged

---

## 3. Test Setup and Execution

### 3.1 Dataset Configuration
- Prepare an input dataset (e.g. `input.csv`) in a defined location:
  - Example: `s3://dagster-workflow-bucket/input.csv`

### 3.2 Workflow Configuration
- Configure Dagster to:
  - Read from the input dataset
  - Write results to a different location (e.g. `output.csv`)

### 3.3 Execution
- Run the Dagster job
- Ensure that:
  - Data is successfully processed
  - Output dataset is generated

---

## 4. Verification of No Overwrite

To validate correct behavior:

- Compare input dataset **before and after execution**
- Ensure:
  - File content is unchanged
  - Last-modified timestamp or object version is unchanged (if the storage exposes one)
- Verify that:
  - Output dataset exists in a different location
  - Output contains only the processed dataset
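
The before/after content comparison can be automated with a checksum. This is a minimal sketch under the assumption that the input dataset is available as a local file (for example, downloaded from the bucket before and after the run); `sha256_of` and `input_unchanged` are hypothetical helper names, and `run_job` stands in for whatever triggers the Dagster job.

```python
import hashlib
from pathlib import Path
from typing import Callable


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def input_unchanged(input_path: Path, run_job: Callable[[], object]) -> bool:
    """Hash the input, run the job, hash again, and report whether the content is identical."""
    before = sha256_of(input_path)
    run_job()  # assumption: this callable executes the workflow under test
    after = sha256_of(input_path)
    return before == after
```

A checksum covers only the content check; timestamp or version checks depend on what metadata the storage backend provides and are not shown here.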

---

## 5. Example – Simpl-Open Pre-Built Workflow Configuration

In Simpl-Open pre-built workflows, this principle is **already enforced by design**.

Below is an example configuration:

```yaml
ops:
  apply_l_diversity:
    config:
      generalisation_hierarchies:
        age: simpl_age
      ident:
        - Name
      k: 2
      l: 3
      quasi_identifiers:
        - age
      sensitive_attribute: Disease
      supp_level: 50.0

  read_structured_from_s3:
    config:
      bucket_name: dagster-workflow-bucket
      file_format: csv
      file_key: input.csv

  write_df_to_s3:
    config:
      bucket_name: dagster-workflow-bucket
      file_format: csv
      file_key: output.csv
```

### Explanation

- `read_structured_from_s3`
  - Reads the dataset from `input.csv`

- `write_df_to_s3`
  - Writes the processed dataset to `output.csv`

### Key Point

Even when input and output share the **same bucket**, separation is guaranteed by using a **different `file_key` for output**.

This ensures that:
- The input dataset (`input.csv`) is never overwritten
- The output dataset is stored independently (`output.csv`)

---

## 6. Configuration Guidelines

When creating or customizing Dagster workflows, follow these guidelines:

- Always define a **dedicated output path**
- Never reuse the same `bucket_name` and `file_key` combination for input and output
- Prefer:
  - Different filenames (`input.csv` vs `output.csv`)
  - Or structured paths:
    - `/input/...`
    - `/output/...`
- Validate configuration before execution
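
One way to validate configuration before execution is to inspect the run config and reject it when the read and write ops point at the same object. The sketch below assumes the run config has already been loaded into a Python dictionary with the same structure as the YAML example in section 5; `s3_target` and `validate_no_overwrite` are hypothetical helpers, not Dagster or Simpl-Open APIs.

```python
from typing import Any, Mapping, Tuple

# Op names taken from the example configuration in this document.
READ_OP = "read_structured_from_s3"
WRITE_OP = "write_df_to_s3"


def s3_target(run_config: Mapping[str, Any], op_name: str) -> Tuple[str, str]:
    """Return the (bucket_name, file_key) pair configured for one op."""
    config = run_config["ops"][op_name]["config"]
    return config["bucket_name"], config["file_key"]


def validate_no_overwrite(run_config: Mapping[str, Any]) -> None:
    """Raise ValueError if the write op targets the object that the read op loads."""
    source = s3_target(run_config, READ_OP)
    destination = s3_target(run_config, WRITE_OP)
    if source == destination:
        bucket, key = source
        raise ValueError(f"write op would overwrite input s3://{bucket}/{key}")


# Dictionary equivalent of the I/O part of the example configuration.
example_run_config = {
    "ops": {
        READ_OP: {"config": {"bucket_name": "dagster-workflow-bucket",
                             "file_format": "csv", "file_key": "input.csv"}},
        WRITE_OP: {"config": {"bucket_name": "dagster-workflow-bucket",
                              "file_format": "csv", "file_key": "output.csv"}},
    }
}

if __name__ == "__main__":
    validate_no_overwrite(example_run_config)  # passes: the file keys differ
```

Calling such a check just before launching the job makes a misconfigured output fail fast instead of silently replacing the source file.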

---

## 7. Conclusion

The separation between input and output datasets is a **design principle** in Dagster workflows.

Simpl-Open pre-built workflows already implement this approach by:
- Clearly distinguishing input and output configurations
- Ensuring safe, non-destructive data processing

Adhering to these principles helps ensure:
- Data integrity
- Reproducibility
- Safe pipeline execution without unintended overwrites