Data Contracts
Data contracts validate DataFrame column values against enum types at compile time, catching invalid data before runtime. This is especially useful for survey data, coded variables, and any dataset where columns should only contain specific values.
String Enum Contracts
The simplest form of data contract uses an enum type to validate string column values:
import DataFrame
type City = Berlin | Munich | Hamburg
let data : DataFrame { name: String, city: City, .. } = DataFrame.readCsv "users.csv"
If the CSV file contains a city value that's not Berlin, Munich, or Hamburg, you get a compile-time error:
Error: Data contract violation in column 'city' (file: users.csv)
Invalid values: ["Cologne", "Frankfurt"]
Valid values: Berlin, Munich, Hamburg
This catches data quality issues before your program ever runs.
Display-Mapped Variants
Real-world data often contains values with spaces, special characters, or different formatting than Keel identifiers allow. Use display strings to map variant names to actual data values:
import DataFrame
type Country
= Germany
| BurkinaFaso "Burkina Faso"
| UnitedKingdom "United Kingdom"
| CoteDIvoire "Côte d'Ivoire"
let data : DataFrame { country: Country, .. } = DataFrame.readCsv "countries.csv"
The variant names (BurkinaFaso, UnitedKingdom) are valid Keel identifiers, while the display strings ("Burkina Faso", "United Kingdom") match the actual values in your data.
Display matching is case-sensitive: "burkina faso" will not match "Burkina Faso".
Int+Label Contracts (Value Labels)
Survey data and statistical datasets often use integer codes with associated labels (like STATA value labels). Keel supports this with value-label contracts:
import DataFrame
type Education = Primary | Secondary | Tertiary | Postgraduate
let survey : DataFrame {
respondent_id: Int,
education: [(1, Education::Primary), (2, Education::Secondary), (3, Education::Tertiary), (4, Education::Postgraduate)]
} = DataFrame.readCsv "survey.csv"
This validates that the education column contains only the integers 1, 2, 3, or 4, and provides semantic meaning to each code.
Non-Exhaustive Value Labels
Use .. to allow additional values not in the mapping:
import DataFrame
type YesNo = No | Yes
let data : DataFrame {
response: [(0, YesNo::No), (1, YesNo::Yes), ..]
} = DataFrame.readCsv "responses.csv"
Values 0 and 1 are labeled, but other integers (e.g., -1 for "missing", 9 for "refused") are allowed without causing a contract violation.
Open vs Closed Schemas
Data contracts work with both open and closed schemas:
import DataFrame
type Status = Active | Inactive | Pending
-- Closed schema: must match exactly these columns
let exact : DataFrame { id: Int, name: String, status: Status }
= DataFrame.readCsv "users.csv"
-- Open schema: validates specified columns, allows extras
let flexible : DataFrame { status: Status, .. }
= DataFrame.readCsv "users.csv"
With a closed schema, extra columns in the file cause an error. With an open schema (..), extra columns are ignored.
Multiple Contract Columns
You can apply contracts to multiple columns in the same schema:
import DataFrame
type City = Berlin | Munich | Hamburg
type Gender = Male | Female | Other
let data : DataFrame { city: City, gender: Gender, .. }
= DataFrame.readCsv "demographics.csv"
Both columns are validated, and all violations are reported together.
Error Messages
Contract violations produce detailed error messages:
Error: Data contract violation in column 'city' (file: data.csv)
Invalid values: ["Cologne", "Frankfurt"]
Valid values: Berlin, Munich, Hamburg
Hint: Check that column values match the enum variants.
Note: Data contracts are validated at compile time for literal file paths.
For int+label contracts:
Error: Data contract violation in column 'education' (file: survey.csv)
Invalid values: [5, 6, 99]
Valid mappings: (1, Primary), (2, Secondary), (3, Tertiary), (4, Postgraduate)
When Validation Happens
Data contract validation occurs at compile time only when:
- The file path is a string literal (not a variable)
- The file exists and is readable
import DataFrame
type Status = Active | Inactive
-- Validated at compile time (literal path)
let data1 : DataFrame { status: Status, .. }
= DataFrame.readCsv "users.csv"
-- NOT validated (variable path)
let path = "users.csv"
let data2 : DataFrame { status: Status, .. }
= DataFrame.readCsv path
This allows flexibility when file paths come from configuration or user input while still providing compile-time safety for known data files.
Use Cases
Survey Data Validation
import DataFrame
type AgeGroup
= Under18 "Under 18"
| Young "18-34"
| Middle "35-54"
| Senior "55+"
type Satisfaction = VeryDissatisfied | Dissatisfied | Neutral | Satisfied | VerySatisfied
let survey : DataFrame {
age_group: AgeGroup,
satisfaction: [(1, Satisfaction::VeryDissatisfied), (2, Satisfaction::Dissatisfied),
(3, Satisfaction::Neutral), (4, Satisfaction::Satisfied),
(5, Satisfaction::VerySatisfied)]
} = DataFrame.readCsv "survey_results.csv"
Geographic Data
import DataFrame
type Region
= NorthEast "North East"
| NorthWest "North West"
| Midlands
| SouthEast "South East"
| SouthWest "South West"
| London
let sales : DataFrame { region: Region, amount: Float, .. }
= DataFrame.readCsv "regional_sales.csv"
STATA-Style Value Labels
import DataFrame
type Employment = Employed | Unemployed | NotInLaborForce | Retired
let labor : DataFrame {
person_id: Int,
emp_status: [(1, Employment::Employed), (2, Employment::Unemployed),
(3, Employment::NotInLaborForce), (4, Employment::Retired)]
} = DataFrame.readDta "labor_force.dta"
Best Practices
- Define domain enums for all categorical variables in your data
- Use display strings for values that don't map to valid identifiers
- Use open schemas (
..) during exploration, closed schemas for production - Document your contracts — they serve as schema documentation
- Use non-exhaustive contracts (
..in value-label lists) when you expect unknown codes
Next Steps
- Learn about DataFrame operations for working with validated data
- See Types for more on enums and type aliases