DataFrame Type Safety
Keel enforces DataFrame column types at compile time. Before your program runs, the compiler reads the file, checks column names and types against your annotation, validates enum contracts, and tracks how schema changes flow through every operation. Column type errors are reported like any other type error — with a source location and a clear message.
Schema Annotations
You annotate a DataFrame binding with a record type describing the expected columns:
import DataFrame
let data : DataFrame { name: String, age: Int, city: String } =
case DataFrame.readCsv "users.csv" of
Ok df -> df
Err _ -> DataFrame.fromRecords []
Try itThe compiler reads users.csv at compile time and checks:
- All declared columns exist in the file
- Each column's inferred type matches the declared type
- (With a closed schema) no extra columns exist
If age is a floating-point column in the file but you declared Int, you get a compile-time error. If city doesn't exist in the file at all, you get an error listing the columns that do exist.
Named Row Schemas
When a record type annotation is long or used in multiple places, define a type alias to give it a short name. The alias is fully transparent — the compiler treats the alias name and the structural record type as identical.
import DataFrame
type alias Row = { x : Int, y : String }
let rows : [Row] =
[ { x = 1, y = "a" }
, { x = 2, y = "b" }
]
DataFrame.fromRecords rows
Try itA type alias Row = { x: Int, y: String } annotation on a [Row] list is equivalent to annotating with [{ x: Int, y: String }] directly. DataFrame.fromRecords accepts either form.
Named schemas are especially useful for multi-field records that appear in task contracts (task expecting { rows: [Row] }), module signatures, and repeated pipeline steps where the full structural type would be unwieldy to repeat.
Closed and Open Schemas
A schema without .. is closed: the file must contain exactly the declared columns, no more. Add .. to declare an open schema: declared columns are checked, extra columns are allowed.
Open schemas are useful during exploration or when reading files with many columns you don't need. Closed schemas are better for production code — they catch unexpected schema drift in source files.
For examples and how open and closed schemas interact with data contracts, see Data Contracts: Open vs Closed Schemas.
Column Type Inference
The compiler infers column types from the file itself, not from your annotation. For CSV, TSV, DTA, and Parquet, it reads a sample of rows (not the whole file) and asks Polars to determine each column's type. The result maps to Keel types:
This matters because column types in a file can differ from what you expect. Trusting user-supplied metadata or embedded labels for type inference would mean a mislabeled column — common in real-world datasets — silently propagates the wrong type into your program. Inferring from the data itself ensures the compiler's type is what the data actually contains. The annotation you write is then checked against that inferred type, not used as a substitute for it.
| File column type | Keel type |
|---|---|
| Integer | Int |
| Float / Double | Float |
| String / Utf8 | String |
| Boolean | Bool |
| Date | Date |
| Datetime | DateTime |
For Parquet files, column types are read from the file's metadata rather than by scanning rows, which makes inference faster. For JSON and JSONL files the entire file is read, since JSON has no schema metadata.
Column-Level Safety Through Operations
The compiler tracks the schema through chained operations. Every call that changes which columns exist, or what they're named, produces a refined type that subsequent operations check against.
import DataFrame
let data : DataFrame { name: String, age: Int, city: String, .. } =
case DataFrame.readCsv "users.csv" of
Ok df -> df
Err _ -> DataFrame.fromRecords []
-- select returns Result; ? propagates Err and unwraps Ok
let names = DataFrame.select [@name, @city] data?
-- Compile-time error: "age" doesn't exist in names
let oops = DataFrame.column @age names
Operations tracked this way include select, remove, rename, column, groupBy, partitionBy, and window aggregations. When you write DataFrame.column @age and age no longer exists in the schema at that point in the chain, the error names the missing column and lists what is available.
A typo in a column name — @nmae instead of @name — is caught at compile time, not at runtime on the first execution.
Column Literals
Keel uses @name as the syntax for column name arguments. A value written @CO has type DataFrameColumn — a distinct type that carries the column name as a string at runtime but is not interchangeable with String or Symbol at the type level.
data |> Table.freq @CO -- @CO : DataFrameColumn
data |> DataFrame.select [@CO, @SEX]
data |> Table.cross @CO @SEX
df |> DataFrame.setVarLabel @CO "Country of origin"
Try itColumn literals bind as values, so you can store them in variables and pass them around:
let col = @CO -- type: DataFrameColumn
data |> Table.freq col -- works — no annotation needed
Try itFor column names that contain spaces or hyphens, use the quoted form:
data |> DataFrame.column @"birth year"
data |> DataFrame.select [@"first-name", @"last-name"]
Try itThe DataFrameColumn type is only compatible with itself. Passing a String or :symbol where a DataFrameColumn is expected is a compile-time type error. This keeps column arguments visually distinct from ordinary strings in data operations.
Schema Refinement Examples
select narrows the schema to the listed columns:
-- Input: DataFrame { name: String, age: Int, city: String }
-- Output: Result DataFrame DataFrameError
let slim = DataFrame.select [@name, @city] data?
remove drops the listed columns:
-- Input: DataFrame { name: String, age: Int, city: String }
-- Output: Result DataFrame DataFrameError
let without_age = DataFrame.remove [@age] data?
rename updates the column name in the type. The first argument is the existing column (DataFrameColumn); the second is the new name (String):
-- Input: DataFrame { name: String, age: Int }
-- Output: Result DataFrame DataFrameError
let renamed = DataFrame.rename @age "years" data?
column extracts a column as a list, returning Result:
-- Input: DataFrame { name: String, age: Int }
-- Output: Result [Maybe Int] DataFrameError
let ages = (DataFrame.column @age data)?
File Format Support
Schema annotations work with all supported file formats:
| Format | Schema source | Notes |
|---|---|---|
| CSV / TSV | Sample rows | 100 rows read at compile time |
| Parquet | File footer | Schema from footer, no row scan |
| Stata DTA | File metadata + sample rows | Embedded variable labels available |
| JSON / JSONL | Full file | No schema metadata; entire file read |
Parquet is the most efficient format for compile-time checking: column names and types come from the file footer without reading any data rows.
When Checking Happens
Schema checking happens at compile time when the file path is evaluable at compile time. All three forms work:
import DataFrame
let path = "data.csv"
let dir = "data/"
let a : DataFrame { name: String, .. } =
case DataFrame.readCsv "data.csv" of
Ok df -> df
Err _ -> DataFrame.fromRecords []
let b : DataFrame { name: String, .. } =
case DataFrame.readCsv path of
Ok df -> df
Err _ -> DataFrame.fromRecords []
let c : DataFrame { name: String, .. } =
case DataFrame.readCsv (dir ++ "data.csv") of
Ok df -> df
Err _ -> DataFrame.fromRecords []
Try itWhen a function takes a path as a parameter and returns a typed DataFrame, the compiler extends this analysis to the call site. If the argument at the call site is a compile-time constant, the compiler substitutes it in, reads the file, and validates the function's declared return type against the actual schema:
-- modules/users.kl
fn load_users : String -> DataFrame { name: String, age: Int }
fn load_users dir =
case DataFrame.readCsv (dir ++ "users.csv") of
Ok df -> df
Err _ -> DataFrame.fromRecords []
-- main.kl
let data_dir = "data/"
let users = load_users data_dir -- compiler reads "data/users.csv" and checks
-- that name:String and age:Int exist
Each call site is checked independently. If two bindings call the same loader with different paths, each is validated against its own file:
let cohort1 = load_users "cohort1/" -- checks cohort1/users.csv
let cohort2 = load_users "cohort2/" -- checks cohort2/users.csv separately
When the path cannot be resolved at compile time — for example, when it comes from user input at runtime — checking is skipped for that call site. The annotation still constrains how columns can be used downstream, but the file itself isn't read until execution.
This is a deliberate boundary: Keel guarantees type safety for the code you write, not for data you receive at runtime. A function that accepts a file path from user input could receive any file; the compiler cannot read a file that doesn't exist yet. The annotation you write on such a binding is a contract you're asserting — if it's wrong, you get a runtime error, not a compile-time one. For files your code owns and ships with, use literal or constant paths to get the full compile-time guarantee.
Performance Characteristics
Compile-time checking reads files on every compilation. Understanding the cost helps you choose the right format and avoid surprises on large datasets.
Schema Inference
For CSV, TSV, and DTA, the compiler reads 100 rows to infer column types. This is hardcoded and not configurable. For JSON and JSONL, the entire file is read (JSON has no schema metadata).
Parquet is different: column names and types come from the Parquet footer alone, with no row scan required. For a 10 GB Parquet file, schema inference reads only a few kilobytes.
Schema inference results are cached within a single compilation — if the same file path appears multiple times in your code, the file is read only once. The cache is discarded between compilations; there is no on-disk cache.
Re-reading on every compilation is the conservative choice: a stale cache could hide the fact that a source file changed and now has different column types. Since the compile-time cost of a 100-row CSV read is low, Keel accepts this cost in exchange for the guarantee that the types you see always reflect the file that's actually on disk.
Data Contract Validation
When you declare an enum or value-label contract on a column (see Data Contracts), the compiler validates every row in the file against the declared values. This is more expensive than schema inference:
- CSV / TSV: the full file is scanned. A 10 million row CSV with one contract column will be fully read on every compile.
- Parquet: projection pushdown applies — only the contract columns are read from disk, not the full row width. For wide files with few contract columns this is significantly cheaper than CSV.
- Stata DTA: the full file is read. DTA has no lazy reader.
- JSON / JSONL: the full file is read.
The validation short-circuits on the first invalid value per column, so files that fail contracts are fast to reject. Files that pass validation always pay the full scan cost.
This asymmetry with schema inference — 100 rows there, every row here — is intentional. Column types are consistent across rows by definition: if row 1 is an integer, row 500,000 is too. But a data contract is a claim about values, and a single invalid value anywhere in the file is a bug. Sampling would give false confidence: a dataset where 99.9% of values are valid but 0.1% are corrupt would pass a sampled check and then produce incorrect results at runtime. Keel treats contract violations the same way it treats type errors — something that must not reach execution.
Keel-annotated Parquet
When you write a Parquet file using DataFrame.writeParquet on a DataFrame that carries metadata (value labels, variable labels, lineage), Keel embeds that metadata in the Parquet footer as a keel:metadata key. Reading such a file gives the compiler access to embedded value labels without scanning any rows.
For contract validation, Keel-annotated Parquet still scans actual column values — embedded labels are checked for consistency against your declared contract, but they don't replace the value scan. The scan is column-projected (only contract columns), so the cost scales with file height and the number of contract columns, not file width.
Skipping the value scan based on embedded labels would be unsafe: metadata is written by whoever produced the file, and can be wrong. A Parquet file could claim that column status contains only Active | Inactive while actually containing a third value introduced by a data pipeline that wasn't aware of the contract. The metadata check catches label mismatches — where the file says value 1 means "Employed" but your contract maps it to Unemployed — which is a separate problem from whether the values themselves are valid. Both checks are needed, and neither substitutes for the other.
Practical Guidance
If compile times matter with large datasets:
- Prefer Parquet over CSV for files with type annotations or contracts. Schema inference is metadata-only, and contract validation uses projection pushdown.
- Convert wide CSV files to Parquet before annotating them. A 200-column CSV scans all 200 columns during contract validation; a Parquet file with projection pushdown reads only the contract columns.
- Use open schemas (
..) on columns you don't declare contracts for — the compiler only validates columns you name. - Minimize contract columns to what your code actually depends on. Every contract column adds a full-height scan.
Next Steps
- Data Contracts — enum and value-label contracts for categorical columns
- DataFrame Expressions — composable column expressions
- Data Lineage — automatic provenance tracking