---
title: Quarantine
meta:
    description: Quarantine data sources store data that doesn't fit the schema.
---

# Quarantine data sources

Every data source in your workspace has an associated quarantine data source that stores rows that don't fit the schema. If you send rows that don't match the data source schema, they're automatically routed to the quarantine data source so that the ingestion process doesn't fail.

By convention, quarantine data sources follow the naming pattern `{datasource_name}_quarantine`. You can review quarantined rows at any time or perform operations on them using Pipes. This is a useful source of information when fixing issues in the origin source or applying changes during ingest.

## Review quarantined data

To check your quarantine data sources, run the `tb sql` command. For example:

```shell
tb sql "select * from <datasource_name>_quarantine limit 10"
```

Sample output from the `tb sql` command looks like the following:

```text
──────────────────────────────────────────────────────────────────
c__error_column: ['abslevel']
c__error: ["value '' on column 'abslevel' is not Float32"]
c__import_id: 01JKQPWT8GVXAN5GJ1VBD4XM27
day: 2014-07-30
station: Embassament de Siurana (Cornudella de Montsant)
volume: 11.57
insertion_date: 2025-02-10 10:36:20
──────────────────────────────────────────────────────────────────
```

The quarantine data source schema contains the columns of the original row and the following columns with information about the issues that caused the quarantine:

- `c__error_column` (`Array(String)`) contains the names of all the columns that hold an invalid value.
- `c__error` (`Array(String)`) contains all the errors that caused the ingestion to fail and led to the row being stored in quarantine. Together with `c__error_column`, this lets you identify which columns have problems and what the specific errors are, as the example after this list shows.
- `c__import_id` (`Nullable(String)`) contains the job identifier if the row was ingested through a job.
- `insertion_date` (`DateTime`) contains the timestamp at which the row was ingested into quarantine.
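
For example, a query like the following (a minimal sketch; replace `<datasource_name>` with your data source name) uses `c__error_column` to summarize which columns fail most often:

```shell
tb sql "SELECT arrayJoin(c__error_column) AS failed_column, count() AS errors FROM <datasource_name>_quarantine GROUP BY failed_column ORDER BY errors DESC"
```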

## Example: fixing quarantined data

This example uses the Electric Vehicle Population Data dataset. Create the data source, build the project, and append the data:

```shell
tb datasource create --url "https://data.wa.gov/api/views/f6w7-q2d2/rows.csv?accessType=DOWNLOAD" --name rows
tb build
tb datasource append rows "https://data.wa.gov/api/views/f6w7-q2d2/rows.csv?accessType=DOWNLOAD"
```

You should get the following quarantine error:
`Error appending fixtures for 'rows': There was an error with file contents: 564 rows in quarantine.`

Inspect the `rows_quarantine` data source to see the distinct errors:

```shell
tb sql "SELECT DISTINCT c__error FROM rows_quarantine"

# ────────────────────────────────────────────────────────────────────────────────────────────
# c__error: ["value '' on column 'postal_code' is not Int64", "value '' on column 'legislative_district' is not Int16", # "value '' on column 'c_2020_census_tract' is not Int64"]
# ────────────────────────────────────────────────────────────────────────────────────────────
# c__error: ["value '' on column 'electric_range' is not Int32", "value '' on column 'base_msrp' is not Int64"]
# ────────────────────────────────────────────────────────────────────────────────────────────
# c__error: ["value '' on column 'legislative_district' is not Int16"]
# ────────────────────────────────────────────────────────────────────────────────────────────
```

The problem is that some columns need to be either Nullable or have a DEFAULT value. In this case, add a DEFAULT value of 0 to them by editing the `datasources/rows.datasource` file:

``` {% title="datasources/rows.datasource" %}
DESCRIPTION >
    Generated from https://data.wa.gov/api/views/f6w7-q2d2/rows.csv?accessType=DOWNLOAD

SCHEMA >
    `vin__1_10_` String,
    `county` String,
    `city` String,
    `state` String,
    `postal_code` Int64 DEFAULT 0,
    `model_year` Int32,
    `make` String,
    `model` String,
    `electric_vehicle_type` String,
    `clean_alternative_fuel_vehicle__cafv__eligibility` String,
    `electric_range` Int32 DEFAULT 0,
    `base_msrp` Int64 DEFAULT 0,
    `legislative_district` Int16 DEFAULT 0,
    `dol_vehicle_id` Int64,
    `vehicle_location` String,
    `electric_utility` String,
    `c_2020_census_tract` Int64 DEFAULT 0
```
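
If you prefer to keep missing values as NULL rather than defaulting them to 0, the affected columns could instead be declared Nullable. A minimal sketch of that variation for two of the columns (queries reading them then need to handle NULLs, for example with `coalesce`):

```
    `electric_range` Nullable(Int32),
    `base_msrp` Nullable(Int64),
```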

If you're running the dev server, it rebuilds the edited resources automatically; otherwise, rebuild the project with `tb build`. Then append the data again:

```shell
tb build
tb datasource append rows "https://data.wa.gov/api/views/f6w7-q2d2/rows.csv?accessType=DOWNLOAD"
```

No errors this time; you're good to continue developing.
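
As an optional check (not part of the original walkthrough), you can confirm that the new append didn't quarantine anything by counting recently inserted rows in `rows_quarantine`:

```shell
tb sql "SELECT count() FROM rows_quarantine WHERE insertion_date > now() - INTERVAL 1 HOUR"
```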

## Recovering data from quarantine

Once you've fixed the schema issues that caused data to be quarantined, you can recover the quarantined data back to your main data source. The approach depends on the amount of quarantined data you need to recover.

### Small datasets

For small amounts of quarantined data, you can recover it directly using the CLI:

1. **Export the fixed data** from quarantine to a CSV file:

```shell
tb --cloud --output csv sql "select <query_fixing_issues> from ds_quarantine" --rows-limit 120 > rows.csv
```

Replace `<query_fixing_issues>` with a query that transforms the quarantined data to match your fixed schema. For example:

```shell
tb --cloud --output csv sql "select vin__1_10_, county, city, state, COALESCE(postal_code, 0) as postal_code, model_year, make, model from rows_quarantine" --rows-limit 120 > rows.csv
```

2. **Deploy your schema changes** to the workspace:

```shell
tb deploy
```

3. **Append the recovered data** to your data source:

```shell
tb --cloud datasource append ds rows.csv
```

### Large datasets (more than 120 rows)

For larger amounts of quarantined data, use a three-step deployment process with temporary resources:

#### Step 1: Deploy temporary resources

Create a temporary data source and copy pipe to process the quarantined data:

```shell
tb deploy
```

This deployment should include the following resources, sketched after this list:
- A temporary data source to host the recovered data
- A copy pipe that transforms quarantined data and writes to the temporary data source
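
A minimal sketch of these two resources, based on the Electric Vehicle Population example above: the names `rows_recovered` and `recover_quarantine` are hypothetical, and only a few columns are shown for brevity; in practice the temporary data source would mirror every column of the fixed `rows` schema.

``` {% title="datasources/rows_recovered.datasource" %}
DESCRIPTION >
    Temporary data source that holds rows recovered from quarantine

SCHEMA >
    `vin__1_10_` String,
    `county` String,
    `city` String,
    `state` String,
    `postal_code` Int64,
    `electric_range` Int32,
    `base_msrp` Int64,
    `legislative_district` Int16
```

``` {% title="pipes/recover_quarantine.pipe" %}
DESCRIPTION >
    Copy pipe that transforms quarantined rows and writes them to the temporary data source

NODE recover_quarantine_node
SQL >
    -- Depending on how the quarantine columns are typed, explicit casts
    -- (for example toInt64OrZero) may be needed in addition to COALESCE.
    SELECT
        vin__1_10_,
        county,
        city,
        state,
        COALESCE(postal_code, 0) AS postal_code,
        COALESCE(electric_range, 0) AS electric_range,
        COALESCE(base_msrp, 0) AS base_msrp,
        COALESCE(legislative_district, 0) AS legislative_district
    FROM rows_quarantine

TYPE COPY
TARGET_DATASOURCE rows_recovered
COPY_SCHEDULE @on-demand
```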

#### Step 2: Trigger the copy and deploy final schema

1. **Trigger the copy pipe** to process quarantined data into the temporary data source:

```shell
tb --cloud copy run <copy_pipe_name>
```

2. **Deploy the final schema changes** and create a copy pipe from the temporary data source to your fixed main data source:

```shell
tb --cloud deploy
```
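
A sketch of that final copy pipe, assuming the temporary data source mirrors the full fixed schema of `rows` (the name `restore_recovered` is hypothetical):

``` {% title="pipes/restore_recovered.pipe" %}
DESCRIPTION >
    Copy pipe that moves recovered rows from the temporary data source into the fixed main data source

NODE restore_recovered_node
SQL >
    SELECT *
    FROM rows_recovered

TYPE COPY
TARGET_DATASOURCE rows
COPY_SCHEDULE @on-demand
```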

3. **Trigger the final copy** from temporary to main data source:

```shell
tb --cloud copy run <final_copy_pipe_name>
```

#### Step 3: Clean up temporary resources

Delete the temporary data source and copy pipe files from your project, then deploy a final time so they're removed from the workspace.
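
For example, using the hypothetical file names from the sketches above:

```shell
rm datasources/rows_recovered.datasource pipes/recover_quarantine.pipe pipes/restore_recovered.pipe
```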

```shell
tb deploy
```

This approach processes large amounts of quarantined data without hitting the CLI export row limit, while maintaining data integrity throughout the recovery process.
