Esc
Start typing to search...

Data Contracts

Data contracts validate DataFrame column values against enum types at compile time, catching invalid data before runtime. This is especially useful for survey data, coded variables, and any dataset where columns should only contain specific values.

String Enum Contracts

The simplest form of data contract uses an enum type to validate string column values:

import DataFrame

type City = Berlin | Munich | Hamburg

let data : DataFrame { name: String, city: City, .. } = DataFrame.readCsv "users.csv"

If the CSV file contains a city value that's not Berlin, Munich, or Hamburg, you get a compile-time error:

Error: Data contract violation in column 'city' (file: users.csv)
       Invalid values: ["Cologne", "Frankfurt"]
       Valid values: Berlin, Munich, Hamburg

This catches data quality issues before your program ever runs.

Display-Mapped Variants

Real-world data often contains values with spaces, special characters, or different formatting than Keel identifiers allow. Use display strings to map variant names to actual data values:

import DataFrame

type Country
    = Germany
    | BurkinaFaso "Burkina Faso"
    | UnitedKingdom "United Kingdom"
    | CoteDIvoire "Côte d'Ivoire"

let data : DataFrame { country: Country, .. } = DataFrame.readCsv "countries.csv"

The variant names (BurkinaFaso, UnitedKingdom) are valid Keel identifiers, while the display strings ("Burkina Faso", "United Kingdom") match the actual values in your data.

Display matching is case-sensitive: "burkina faso" will not match "Burkina Faso".

Int+Label Contracts (Value Labels)

Survey data and statistical datasets often use integer codes with associated labels (like STATA value labels). Keel supports this with value-label contracts:

import DataFrame

type Education = Primary | Secondary | Tertiary | Postgraduate

let survey : DataFrame {
    respondent_id: Int,
    education: [(1, Education::Primary), (2, Education::Secondary), (3, Education::Tertiary), (4, Education::Postgraduate)]
} = DataFrame.readCsv "survey.csv"

This validates that the education column contains only the integers 1, 2, 3, or 4, and provides semantic meaning to each code.

Non-Exhaustive Value Labels

Use .. to allow additional values not in the mapping:

import DataFrame

type YesNo = No | Yes

let data : DataFrame {
    response: [(0, YesNo::No), (1, YesNo::Yes), ..]
} = DataFrame.readCsv "responses.csv"

Values 0 and 1 are labeled, but other integers (e.g., -1 for "missing", 9 for "refused") are allowed without causing a contract violation.

Open vs Closed Schemas

Data contracts work with both open and closed schemas:

import DataFrame

type Status = Active | Inactive | Pending

-- Closed schema: must match exactly these columns
let exact : DataFrame { id: Int, name: String, status: Status }
    = DataFrame.readCsv "users.csv"

-- Open schema: validates specified columns, allows extras
let flexible : DataFrame { status: Status, .. }
    = DataFrame.readCsv "users.csv"

With a closed schema, extra columns in the file cause an error. With an open schema (..), extra columns are ignored.

Multiple Contract Columns

You can apply contracts to multiple columns in the same schema:

import DataFrame

type City = Berlin | Munich | Hamburg
type Gender = Male | Female | Other

let data : DataFrame { city: City, gender: Gender, .. }
    = DataFrame.readCsv "demographics.csv"

Both columns are validated, and all violations are reported together.

Error Messages

Contract violations produce detailed error messages:

Error: Data contract violation in column 'city' (file: data.csv)
       Invalid values: ["Cologne", "Frankfurt"]
       Valid values: Berlin, Munich, Hamburg
Hint: Check that column values match the enum variants.
Note: Data contracts are validated at compile time for literal file paths.

For int+label contracts:

Error: Data contract violation in column 'education' (file: survey.csv)
       Invalid values: [5, 6, 99]
       Valid mappings: (1, Primary), (2, Secondary), (3, Tertiary), (4, Postgraduate)

When Validation Happens

Data contract validation occurs at compile time only when:

  1. The file path is a string literal (not a variable)
  2. The file exists and is readable
import DataFrame

type Status = Active | Inactive

-- Validated at compile time (literal path)
let data1 : DataFrame { status: Status, .. }
    = DataFrame.readCsv "users.csv"

-- NOT validated (variable path)
let path = "users.csv"
let data2 : DataFrame { status: Status, .. }
    = DataFrame.readCsv path

This allows flexibility when file paths come from configuration or user input while still providing compile-time safety for known data files.

Use Cases

Survey Data Validation

import DataFrame

type AgeGroup
    = Under18 "Under 18"
    | Young "18-34"
    | Middle "35-54"
    | Senior "55+"

type Satisfaction = VeryDissatisfied | Dissatisfied | Neutral | Satisfied | VerySatisfied

let survey : DataFrame {
    age_group: AgeGroup,
    satisfaction: [(1, Satisfaction::VeryDissatisfied), (2, Satisfaction::Dissatisfied),
                   (3, Satisfaction::Neutral), (4, Satisfaction::Satisfied),
                   (5, Satisfaction::VerySatisfied)]
} = DataFrame.readCsv "survey_results.csv"

Geographic Data

import DataFrame

type Region
    = NorthEast "North East"
    | NorthWest "North West"
    | Midlands
    | SouthEast "South East"
    | SouthWest "South West"
    | London

let sales : DataFrame { region: Region, amount: Float, .. }
    = DataFrame.readCsv "regional_sales.csv"

STATA-Style Value Labels

import DataFrame

type Employment = Employed | Unemployed | NotInLaborForce | Retired

let labor : DataFrame {
    person_id: Int,
    emp_status: [(1, Employment::Employed), (2, Employment::Unemployed),
                 (3, Employment::NotInLaborForce), (4, Employment::Retired)]
} = DataFrame.readDta "labor_force.dta"

Best Practices

  1. Define domain enums for all categorical variables in your data
  2. Use display strings for values that don't map to valid identifiers
  3. Use open schemas (..) during exploration, closed schemas for production
  4. Document your contracts — they serve as schema documentation
  5. Use non-exhaustive contracts (.. in value-label lists) when you expect unknown codes

Next Steps