flightrisk.data

class flightrisk.data.loaders.KKBoxArtifacts(members, transactions, user_logs, labels)[source]

Bases: object

Bundle of validated KKBox raw frames.

Parameters:
  • members (DataFrame) – User master table.

  • transactions (DataFrame) – Transaction history (one row per payment event).

  • user_logs (DataFrame) – Daily listening logs.

  • labels (DataFrame) – Training labels keyed by msno.

members: DataFrame
transactions: DataFrame
user_logs: DataFrame
labels: DataFrame

flightrisk.data.loaders.load_kkbox(*, sample_frac=None, seed=1337)[source]

Load and validate the four KKBox raw tables.

A small sample_frac is useful in tests and notebooks. Sampling is performed on user IDs, so the transactions and logs of each sampled user are kept in full and stay consistent with one another.

Parameters:
  • sample_frac (float | None) – Optional fraction in (0, 1] to subsample by user ID.

  • seed (int) – Seed used when sample_frac is provided.

Returns:

A KKBoxArtifacts bundle ready for feature building.

Raises:

Return type:

KKBoxArtifacts
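
The user-level sampling described for sample_frac can be sketched as below. This is a minimal illustration rather than the library's implementation: the helper name sample_by_user is hypothetical, and it assumes every frame carries an msno user-ID column (the documentation above confirms this only for labels).

```python
import pandas as pd


def sample_by_user(frames, sample_frac, seed=1337, key="msno"):
    """Keep the same random subset of user IDs in every frame (hypothetical helper)."""
    if not 0 < sample_frac <= 1:
        raise ValueError("sample_frac must lie in (0, 1]")
    # Sample the distinct user IDs once, using the first frame as the user list.
    users = pd.Series(frames[0][key].unique())
    kept = set(users.sample(frac=sample_frac, random_state=seed))
    # Filter every frame against the same ID set so rows stay consistent per user.
    return [df[df[key].isin(kept)] for df in frames]


# Toy frames standing in for the members and user_logs tables.
members = pd.DataFrame({"msno": ["a", "b", "c", "d"]})
logs = pd.DataFrame(
    {"msno": ["a", "a", "b", "c", "d", "d"], "num_songs": [1, 2, 3, 4, 5, 6]}
)
members_s, logs_s = sample_by_user([members, logs], sample_frac=0.5)
```

Sampling IDs rather than rows means a sampled user keeps all of their log rows, which is the consistency property the docstring describes.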

class flightrisk.data.loaders.OrangeBelgiumArtifacts(features, treatment, outcome)[source]

Bases: object

Bundle for the Orange Belgium uplift benchmark.

Parameters:
  • features (DataFrame) – 178 anonymised covariates.

  • treatment (Series) – 0/1 treatment column from the original RCT.

  • outcome (Series) – 0/1 outcome (retention indicator).

features: DataFrame
treatment: Series
outcome: Series

flightrisk.data.loaders.load_orange_belgium(path=None)[source]

Load and validate the Orange Belgium uplift benchmark.

Parameters:

path (str | Path | None) – Optional override pointing at a single Parquet or CSV file. Defaults to data/raw/orange-belgium/orange_belgium.parquet and falls back to orange_belgium.csv if Parquet is absent.

Returns:

An OrangeBelgiumArtifacts bundle.

Raises:

FileNotFoundError – If no input file is found.

Return type:

OrangeBelgiumArtifacts
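
The Parquet-then-CSV fallback described for path might look like the following sketch. The function name resolve_orange_belgium_path and the base_dir parameter are hypothetical, added here only so the lookup can be exercised against a temporary directory; the real loader may resolve paths differently.

```python
import tempfile
from pathlib import Path

DEFAULT_DIR = Path("data/raw/orange-belgium")


def resolve_orange_belgium_path(path=None, base_dir=DEFAULT_DIR):
    """Pick the input file: explicit override, else Parquet, else CSV (sketch)."""
    if path is not None:
        candidates = [Path(path)]
    else:
        candidates = [
            Path(base_dir) / "orange_belgium.parquet",
            Path(base_dir) / "orange_belgium.csv",
        ]
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    raise FileNotFoundError(f"no input file found among: {candidates}")


# Demonstrate the fallback order inside a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / "orange_belgium.csv").write_text("placeholder")
    csv_only = resolve_orange_belgium_path(base_dir=base).name
    (base / "orange_belgium.parquet").write_bytes(b"placeholder")
    both_present = resolve_orange_belgium_path(base_dir=base).name
```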

class flightrisk.data.splits.TemporalSplit(train_idx, val_idx, test_idx, train_cutoff, val_cutoff)[source]

Bases: object

Indices for a strict temporal train / validation / test split.

Parameters:
  • train_idx (ndarray) – Row positions in the train slice.

  • val_idx (ndarray) – Row positions in the validation slice.

  • test_idx (ndarray) – Row positions in the test slice.

  • train_cutoff (Timestamp) – Last timestamp included in train.

  • val_cutoff (Timestamp) – Last timestamp included in validation.

train_idx: ndarray
val_idx: ndarray
test_idx: ndarray
train_cutoff: Timestamp
val_cutoff: Timestamp

flightrisk.data.splits.temporal_split(timestamps, *, train_cutoff, val_cutoff)[source]

Split rows into past / near-past / future slices, in that order.

The slices are disjoint: no row appears in more than one. Rows with timestamps later than val_cutoff go to test, rows in (train_cutoff, val_cutoff] go to validation, and rows at or before train_cutoff form the train slice.

Parameters:
  • timestamps (Series) – Per-row timestamps. Must be convertible to datetime.

  • train_cutoff (str | Timestamp) – Last timestamp kept in train.

  • val_cutoff (str | Timestamp) – Last timestamp kept in validation. Must be later than train_cutoff.

Returns:

A TemporalSplit with non-overlapping integer indices.

Raises:

ValueError – If val_cutoff does not lie strictly after train_cutoff.

Return type:

TemporalSplit
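
The interval rules above can be illustrated with a short sketch. It assumes pandas and NumPy and returns a plain tuple instead of a TemporalSplit; the name temporal_split_sketch is hypothetical, and the handling of missing timestamps (they fall into no slice here) is an assumption, not documented behaviour.

```python
import numpy as np
import pandas as pd


def temporal_split_sketch(timestamps, train_cutoff, val_cutoff):
    """Return (train_idx, val_idx, test_idx) as integer row positions."""
    ts = pd.to_datetime(timestamps)
    train_cutoff = pd.Timestamp(train_cutoff)
    val_cutoff = pd.Timestamp(val_cutoff)
    if val_cutoff <= train_cutoff:
        raise ValueError("val_cutoff must lie strictly after train_cutoff")
    # Both cutoffs are inclusive upper bounds, matching the documented intervals.
    is_train = (ts <= train_cutoff).to_numpy()
    is_val = ((ts > train_cutoff) & (ts <= val_cutoff)).to_numpy()
    is_test = (ts > val_cutoff).to_numpy()
    return np.flatnonzero(is_train), np.flatnonzero(is_val), np.flatnonzero(is_test)


stamps = pd.Series(["2017-01-15", "2017-02-01", "2017-02-20", "2017-03-05"])
train, val, test = temporal_split_sketch(stamps, "2017-01-31", "2017-02-28")
# train -> [0], val -> [1, 2], test -> [3]
```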

flightrisk.data.splits.stratified_rct_folds(treatment, outcome, *, n_splits=5, seed=1337)[source]

Build cross-validation folds that preserve the treatment/outcome ratio.

Stratification is done on the joint treatment * 2 + outcome label so each fold sees a representative slice of the four cells.

Parameters:
  • treatment (Series) – Binary 0/1 treatment indicator.

  • outcome (Series) – Binary 0/1 outcome indicator.

  • n_splits (int) – Number of folds.

  • seed (int) – Random seed for fold ordering.

Returns:

List of (train_idx, val_idx) tuples.

Raises:

ValueError – If treatment and outcome have different lengths.

Return type:

list[tuple[ndarray, ndarray]]
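
To make the joint label concrete, the sketch below builds treatment * 2 + outcome and deals each of the four cells across folds. It is an illustration under assumptions: the real function may delegate to a library splitter, and the round-robin assignment and the name stratified_folds_sketch are choices made here for a dependency-light example.

```python
import numpy as np


def stratified_folds_sketch(treatment, outcome, n_splits=5, seed=1337):
    """Return a list of (train_idx, val_idx) pairs stratified on the joint label."""
    treatment = np.asarray(treatment)
    outcome = np.asarray(outcome)
    if len(treatment) != len(outcome):
        raise ValueError("treatment and outcome must have the same length")
    # Joint label: 0 = control/no outcome, 1 = control/outcome,
    #              2 = treated/no outcome, 3 = treated/outcome.
    joint = treatment * 2 + outcome
    rng = np.random.default_rng(seed)
    fold_of = np.empty(len(joint), dtype=int)
    for cell in np.unique(joint):
        rows = rng.permutation(np.flatnonzero(joint == cell))
        # Deal the cell's rows round-robin so each fold gets a near-equal share.
        fold_of[rows] = np.arange(len(rows)) % n_splits
    return [
        (np.flatnonzero(fold_of != k), np.flatnonzero(fold_of == k))
        for k in range(n_splits)
    ]


treatment = np.array([0] * 10 + [1] * 10)
outcome = np.array([0, 1] * 10)
folds = stratified_folds_sketch(treatment, outcome, n_splits=5)
```

With five rows in each of the four cells and five folds, every validation fold in this toy example receives exactly one row per cell.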

flightrisk.data.hashing.sha256_file(path)[source]

Compute the SHA-256 of a single file, streaming in 1 MiB chunks.

Parameters:

path (str | Path) – Path to the file.

Returns:

Hex digest of the file contents.

Raises:

FileNotFoundError – If the file does not exist.

Return type:

str
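
Streaming a file through SHA-256 in 1 MiB chunks can be done with the standard library alone. The sketch below shows one common pattern; the name sha256_file_sketch is hypothetical and the actual implementation may differ in detail.

```python
import hashlib
import tempfile
from pathlib import Path

CHUNK_SIZE = 1024 * 1024  # 1 MiB


def sha256_file_sketch(path):
    """Hash a file without loading it fully into memory."""
    digest = hashlib.sha256()
    # open() raises FileNotFoundError by itself if the file is missing.
    with Path(path).open("rb") as handle:
        # iter() with a sentinel keeps calling read() until it returns b"".
        for chunk in iter(lambda: handle.read(CHUNK_SIZE), b""):
            digest.update(chunk)
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.bin"
    payload = b"flightrisk" * 200_000  # about 2 MB, so more than one chunk is read
    sample.write_bytes(payload)
    streamed = sha256_file_sketch(sample)
    one_shot = hashlib.sha256(payload).hexdigest()
```

Chunked reading gives the same digest as hashing the whole payload at once, while keeping memory use bounded by the chunk size.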

flightrisk.data.hashing.sha256_paths(paths)[source]

Hash a set of files into a single digest, independent of the order in which the paths are supplied.

Each file is hashed individually and the resulting name:digest lines are fed into a final SHA-256 in sorted order, so the result depends on file contents and names only, not directory ordering.

Parameters:

paths (Iterable[str | Path]) – Iterable of file paths under a common root.

Returns:

Combined hex digest.

Return type:

str
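
The two-stage scheme described above (per-file digests, then a final hash over sorted name:digest lines) can be sketched as follows. Several details are assumptions: the sketch uses the bare file name as the name, terminates each line with a newline, and reads each file in one go for brevity; sha256_paths_sketch is a hypothetical name.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_paths_sketch(paths):
    """Combine per-file SHA-256 digests into one order-independent digest."""
    lines = []
    for path in paths:
        path = Path(path)
        file_digest = hashlib.sha256(path.read_bytes()).hexdigest()
        lines.append(f"{path.name}:{file_digest}")
    combined = hashlib.sha256()
    # Sorting the lines removes any dependence on the input or directory order.
    for line in sorted(lines):
        combined.update((line + "\n").encode("utf-8"))
    return combined.hexdigest()


with tempfile.TemporaryDirectory() as root:
    first = Path(root) / "a.csv"
    second = Path(root) / "b.csv"
    first.write_bytes(b"alpha")
    second.write_bytes(b"beta")
    forward = sha256_paths_sketch([first, second])
    backward = sha256_paths_sketch([second, first])
    second.write_bytes(b"changed")
    after_edit = sha256_paths_sketch([first, second])
```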

flightrisk.data.ingest.dvc_pull(targets=None)[source]

Invoke dvc pull to materialise raw datasets.

Parameters:

targets (list[str] | None) – Optional list of DVC targets to pull. If None, pull everything tracked by the project.

Returns:

The exit code of the DVC subprocess.

Raises:

RuntimeError – If DVC is not installed.

Return type:

int
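
A wrapper of this kind is typically a thin layer over subprocess. The sketch below is an assumption about how such a call could be structured, not the project's code: the helper names are hypothetical, and the target path in the usage lines is a made-up example.

```python
import shutil
import subprocess


def build_dvc_pull_command(targets=None):
    """Build the argument list for `dvc pull`, with optional targets appended."""
    return ["dvc", "pull", *(targets or [])]


def dvc_pull_sketch(targets=None):
    """Run `dvc pull` and return its exit code (sketch)."""
    if shutil.which("dvc") is None:
        raise RuntimeError("DVC is not installed or not on PATH")
    # check=False: a non-zero exit code is returned to the caller, not raised.
    return subprocess.run(build_dvc_pull_command(targets), check=False).returncode


pull_all = build_dvc_pull_command()
# "data/raw/kkbox.dvc" is a hypothetical target used only for illustration.
pull_one = build_dvc_pull_command(["data/raw/kkbox.dvc"])
```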

flightrisk.data.ingest.kaggle_download_kkbox()[source]

Download the KKBox raw bundle via the Kaggle CLI as a fallback.

Requires KAGGLE_USERNAME / KAGGLE_KEY (or FLIGHTRISK_KAGGLE_USERNAME / FLIGHTRISK_KAGGLE_KEY) and the user to have accepted the competition rules.

Returns:

The directory where files were extracted.

Raises:

RuntimeError – If Kaggle credentials are missing or the CLI fails.

Return type:

Path
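
The credential requirement above could be checked before invoking the CLI, as in this sketch. The precedence shown (plain KAGGLE_* variables first, FLIGHTRISK_* as fallback) is an assumption, since the documentation only says that either pair is accepted; resolve_kaggle_credentials is a hypothetical helper.

```python
import os


def resolve_kaggle_credentials(env=None):
    """Return (username, key) from the environment or raise RuntimeError."""
    env = os.environ if env is None else env
    # Assumed precedence: KAGGLE_* first, then the FLIGHTRISK_-prefixed pair.
    username = env.get("KAGGLE_USERNAME") or env.get("FLIGHTRISK_KAGGLE_USERNAME")
    key = env.get("KAGGLE_KEY") or env.get("FLIGHTRISK_KAGGLE_KEY")
    if not username or not key:
        raise RuntimeError("Kaggle credentials are missing")
    return username, key


# Placeholder values, passed as a plain dict so no real secrets are needed.
creds = resolve_kaggle_credentials(
    {"FLIGHTRISK_KAGGLE_USERNAME": "demo-user", "FLIGHTRISK_KAGGLE_KEY": "demo-key"}
)
```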