Data Lineage
Keel automatically tracks data provenance for every DataFrame. Each column records where it came from, what transformations were applied, and which parent DataFrames contributed to it. This lineage is available both in display output and through programmatic access.
Automatic Tracking
When you print a DataFrame, lineage appears below the data. It shows parent operations, column origins, and global operations — with no extra code required:
-- norun
-- tags: dataframe, lineage, provenance
-- Lineage appears automatically when printing a DataFrame
import DataFrame
import DataFrame.Expr as Expr
import Result

let sales =
    DataFrame.fromRecords
        [ { product = "Laptop", revenue = 1200 }
        , { product = "Phone", revenue = 800 }
        ]

let filtered =
    sales
        |> DataFrame.filter (@revenue |> Expr.gt 500)
        |> Result.withDefault sales

let result =
    filtered
        |> DataFrame.select [@product, @revenue]
        |> Result.withDefault filtered

-- Printing the DataFrame shows data AND lineage:
--
-- shape: (2, 2)
-- ...
-- Lineage:
-- Derived from: df#... (select)
-- revenue: from records
-- product: from records
-- Global operations: 1
result
Source Paths
DataFrame.sourcePath returns the file path a DataFrame was read from, or Nothing for DataFrames created in memory:
-- DataFrame.sourcePath returns Nothing for fromRecords
import DataFrame

let df =
    DataFrame.fromRecords
        [ { name = "Alice", age = 30 }
        , { name = "Bob", age = 25 }
        ]

DataFrame.sourcePath df
For DataFrames read with readCsv, readJson, or readParquet, this returns Just "/path/to/file.csv".
Parent Tracking (DAG)
Every DataFrame gets its own UUID. Derived DataFrames reference their parents, forming a directed acyclic graph (DAG). DataFrame.parents returns a list of records, each with id, name, operation, and lineage fields. The lineage field embeds the full lineage record of the parent DataFrame.
Root DataFrames have no parents:
-- Root DataFrames have no parents
import DataFrame
let df = DataFrame.fromRecords [{ name = "Alice", age = 30 }]
DataFrame.parents df
Derived DataFrames record which operation created them. You can count parents to verify the DAG structure:
-- Derived DataFrames track parent operations
import DataFrame
import List
import Result

let df =
    DataFrame.fromRecords
        [ { name = "Alice", age = 30 }
        , { name = "Bob", age = 25 }
        ]

let selected =
    case df |> DataFrame.select [@name] of
        Ok d -> d
        Err _ -> DataFrame.fromRecords []

-- Each parent record has id, name, operation, and lineage fields
List.length (DataFrame.parents selected)
Column Lineage
DataFrame.columnLineage returns lineage for a single column as Maybe Record. The record contains name, origin, transformations, and dependencies:
-- norun
-- tags: dataframe, lineage
-- DataFrame.columnLineage returns origin info for a column
import DataFrame
let df = DataFrame.fromRecords [{ name = "Alice", age = 30 }]
-- Returns Just { name, origin, transformations, dependencies }
-- origin.type is "FromRecords" for columns from DataFrame.fromRecords
DataFrame.columnLineage @name df
After a rename, the transformation history records the operation:
-- norun
-- tags: dataframe, lineage
-- After rename, the transformation tracks the operation
import DataFrame
import Result

let df = DataFrame.fromRecords [{ name = "Alice", age = 30 }]

let renamed =
    case (df |> DataFrame.rename @name "person") of
        Ok d -> d
        Err _ -> DataFrame.fromRecords [{ person = "Alice", age = 30 }]

-- The "person" column's lineage shows:
--   origin.type = "FromRecords" (original source)
--   transformations = [{ operation = "rename", description = "Renamed 'name' to 'person'" }]
DataFrame.columnLineage @person renamed
Origin Types
Each column's origin describes where it came from. The type field identifies the origin kind.
File
Columns read from CSV, JSON, or Parquet files. Origin includes path and originalName.
FromRecords
Columns from DataFrame.fromRecords or DataFrame.fromLists. A simple marker with no additional fields.
Computed
Columns created by withColumn or expressions. Origin includes operation and sourceColumns.
Aggregated
Columns produced by groupBy + agg. Origin includes sourceColumn, aggregationFunc, and groupByColumns:
import DataFrame
import DataFrame.Expr exposing col
import DataFrame.Expr as Expr
import List

let df =
    DataFrame.fromRecords
        [ { category = "A", value = 10 }
        , { category = "A", value = 20 }
        , { category = "B", value = 30 }
        ]

let avgExpr =
    col @value
        |> Expr.mean
        |> Expr.named "value"

let agged =
    df
        |> DataFrame.groupBy [@category]
        |> DataFrame.agg [avgExpr]

-- The "value" column exists in the aggregated result
DataFrame.columns agged |> List.nth 1
JoinedFrom
Columns brought in from the right side of a join. Origin includes sourceDataFrame and originalName:
-- norun
-- tags: dataframe, lineage, join
-- Joined columns track their source DataFrame
import DataFrame

let users =
    DataFrame.fromRecords
        [ { id = 1, name = "Alice" }
        , { id = 2, name = "Bob" }
        ]

let scores =
    DataFrame.fromRecords
        [ { id = 1, score = 95 }
        , { id = 2, score = 87 }
        ]

let joined =
    case (DataFrame.join [@id] [@id] JoinType::OneToOne scores users) of
        Ok df -> df
        Err _ -> DataFrame.fromRecords []

-- The "score" column's lineage shows:
--   origin.type = "JoinedFrom"
--   origin.sourceDataFrame = "right"
--   origin.originalName = "score"
DataFrame.columnLineage @score joined
Transformations and Global Operations
Lineage separates per-column transformations from global operations.
Per-column transformations are recorded on each affected column: select, drop, rename, withColumn, agg, join, concat. Each transformation has an operation name and a description:
-- norun
-- tags: dataframe, lineage
-- Columns track their transformation history
import DataFrame
import Result

let df =
    DataFrame.fromRecords
        [ { name = "Alice", age = 30 }
        , { name = "Bob", age = 25 }
        ]

let selected =
    df
        |> DataFrame.select [@name, @age]
        |> Result.withDefault df

-- Each column's transformations list records operations applied:
-- [{ operation = "select", description = "Selected columns: name, age" }]
DataFrame.columnLineage @name selected
Global operations affect all rows without changing column structure: filter, sort, head, tail, unique, sample, groupBy. They are tracked in the top-level globalOperations list:
-- norun
-- tags: dataframe, lineage
-- Global operations (filter, sort) are tracked separately
import DataFrame
import DataFrame.Expr as Expr
import Result

let df =
    DataFrame.fromRecords
        [ { name = "Alice", age = 30 }
        , { name = "Bob", age = 25 }
        , { name = "Carol", age = 35 }
        ]

-- Filter first so that both global operations appear in the lineage
let filtered =
    df
        |> DataFrame.filter (@age |> Expr.gt 20)
        |> Result.withDefault df

let result =
    case (filtered |> DataFrame.sort [@age]) of
        Ok sorted -> sorted
        Err _ -> filtered

-- The lineage record's globalOperations list contains:
-- [{ operation = "filter", description = "Filtered via Expr" },
--  { operation = "sort", description = "Sorted by age (ascending)" }]
DataFrame.lineage result
Multi-Source Operations
Joins produce two parents and merge source paths from both DataFrames:
-- Join produces two parents in the DAG
import DataFrame
import List

let users =
    DataFrame.fromRecords
        [ { id = 1, name = "Alice" }
        , { id = 2, name = "Bob" }
        ]

let scores =
    DataFrame.fromRecords
        [ { id = 1, score = 95 }
        , { id = 2, score = 87 }
        ]

let joined =
    case (DataFrame.join [@id] [@id] JoinType::OneToOne scores users) of
        Ok df -> df
        Err _ -> DataFrame.fromRecords []

List.length (DataFrame.parents joined)
DataFrame.concat produces N parents (one per input DataFrame) and deduplicates source paths.
Full Lineage Record
DataFrame.lineage returns the complete lineage record with all fields:
-- norun
-- tags: dataframe, lineage
-- Full lineage record structure
import DataFrame
import DataFrame.Expr as Expr
import Result

let df =
    DataFrame.fromRecords
        [ { name = "Alice", age = 30 }
        , { name = "Bob", age = 25 }
        ]

let filtered =
    df
        |> DataFrame.filter (@age |> Expr.gt 20)
        |> Result.withDefault df

let result =
    case filtered |> DataFrame.select [@name] of
        Ok d -> d
        Err _ -> DataFrame.fromRecords []

let lineage = DataFrame.lineage result

-- lineage is a Record with these fields:
--   id : String                        -- UUID for this DataFrame
--   columns : Record                   -- per-column lineage (keyed by column name)
--     name : Record
--       name : String                  -- current column name
--       origin : Record                -- where column came from
--         type : String                -- "File", "FromRecords", "Computed", etc.
--         ...                          -- type-specific fields
--       transformations : [Record]     -- list of operations applied
--         operation : String           -- e.g. "select", "rename"
--         description : String         -- human-readable description
--       dependencies : [String]        -- source column names
--   globalOperations : [Record]        -- operations affecting all rows
--   sourcePaths : [String]             -- file paths from read operations
--   parents : [Record]                 -- parent DataFrames in DAG
--     id : String                      -- parent UUID
--     name : String                    -- e.g. "df#a1b2c3d4"
--     operation : String               -- e.g. "select", "filter"
--     lineage : Record                 -- embedded parent lineage (recursive)
lineage
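Because the lineage value is an ordinary record, individual fields can be read from it directly. The following is a minimal sketch that counts the recorded global operations. It assumes dot access on a let-bound record works the same way as the (DataFrame.lineage df).id form used elsewhere on this page; no expected output is shown because it has not been run.

```keel
-- Sketch: read a single field from a lineage record.
-- Assumption: dot access on a let-bound record, as in (DataFrame.lineage df).id
import DataFrame
import DataFrame.Expr as Expr
import List
import Result

let df =
    DataFrame.fromRecords
        [ { name = "Alice", age = 30 }
        , { name = "Bob", age = 25 }
        ]

let filtered =
    df
        |> DataFrame.filter (@age |> Expr.gt 20)
        |> Result.withDefault df

let lineage = DataFrame.lineage filtered

-- Count the global operations recorded for the filtered DataFrame
List.length lineage.globalOperations
```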
Lineage Registry Lookups
Keel maintains a global lineage registry keyed by DataFrame UUID. Two functions let you query it programmatically.
lineageById
DataFrame.lineageById : String -> Maybe Record — looks up a DataFrame's lineage record by its unique UUID. Returns Nothing if the UUID is not in the registry:
-- lineageById looks up a DataFrame in the lineage registry by its UUID
import DataFrame
import Maybe

let df =
    DataFrame.fromRecords
        [ { name = "Alice", age = 30 }
        , { name = "Bob", age = 25 }
        ]

-- lineageById returns Maybe Record: Nothing if the id is not found
DataFrame.lineageById "nonexistent-id"
To get the UUID of a DataFrame you already hold, read (DataFrame.lineage df).id.
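As a rough illustration, the two calls can be combined into a round trip: read the id from a DataFrame's own lineage record, then pass it to lineageById. The binding name dfId is hypothetical, and the sketch assumes a DataFrame is registered as soon as it is created, which this page does not state explicitly.

```keel
-- Sketch: look up a DataFrame in the registry using its own UUID
import DataFrame

let df = DataFrame.fromRecords [{ name = "Alice", age = 30 }]

-- Read the UUID from the DataFrame's lineage record
let dfId = (DataFrame.lineage df).id

-- Should be a Just value if df is present in the registry (assumption)
DataFrame.lineageById dfId
```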
lineageByName
DataFrame.lineageByName : String -> [Record] — searches the registry by name prefix. Returns all matching lineage records as a list, or an empty list if there are no matches:
-- lineageByName searches the registry by name prefix, returns a list
import DataFrame
import List
-- lineageByName returns [Record] — empty list if no match
List.length (DataFrame.lineageByName "nonexistent-name")
DataFrame names in the registry take the form "df#<short-uuid>". Pass a prefix such as "df#a1b2" to narrow the search.
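Since every registry name starts with "df#", that bare prefix should match every registered DataFrame. The sketch below counts the matches. It assumes the DataFrame created here is already in the registry when the lookup runs, so the count depends on what else has been registered in the session.

```keel
-- Sketch: count registry entries whose name starts with "df#"
import DataFrame
import List

let df = DataFrame.fromRecords [{ name = "Alice", age = 30 }]

-- "df#" is the common prefix of all registry names
List.length (DataFrame.lineageByName "df#")
```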
Next Steps
- Learn about DataFrame Expressions for composable column operations
- Learn about DataFrame Reshaping for converting wide data to long format
- See the DataFrame stdlib reference for the complete function list