flightrisk.data
- class flightrisk.data.loaders.KKBoxArtifacts(members, transactions, user_logs, labels)
Bases: object
Bundle of validated KKBox raw frames.
- Parameters:
members (DataFrame) – User master table.
transactions (DataFrame) – Transaction history (one row per payment event).
user_logs (DataFrame) – Daily listening logs.
labels (DataFrame) – Training labels keyed by msno.
- members: DataFrame
- transactions: DataFrame
- user_logs: DataFrame
- labels: DataFrame
- flightrisk.data.loaders.load_kkbox(*, sample_frac=None, seed=1337)
Load and validate the four KKBox raw tables.
A small sample_frac is useful in tests and notebooks: sampling is performed on user IDs, so transactions and logs stay consistent for each sampled user.
- Parameters:
- Returns:
A KKBoxArtifacts bundle ready for feature building.
- Raises:
FileNotFoundError – If any of the four raw files is missing.
ValueError – If sample_frac is out of range.
- Return type:
KKBoxArtifacts
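The user-level sampling described for load_kkbox can be illustrated with a small dependency-free sketch. It is not the library's implementation: rows are plain dicts rather than DataFrames, the function name sample_by_user is hypothetical, msno is assumed to be the user key in every table, and the valid range (0, 1] for sample_frac is an assumption, since the source only says the value must be "in range".

```python
import random


def sample_by_user(members, transactions, user_logs, sample_frac, seed=1337):
    """Keep a random fraction of users and only the rows that belong to them."""
    # Assumed valid range; the documented loader only states "out of range".
    if not 0.0 < sample_frac <= 1.0:
        raise ValueError("sample_frac must lie in (0, 1]")

    # Sample user IDs, not rows, so every table is filtered by the same users.
    user_ids = sorted({row["msno"] for row in members})
    n_keep = min(len(user_ids), max(1, round(len(user_ids) * sample_frac)))
    kept = set(random.Random(seed).sample(user_ids, n_keep))

    def keep(rows):
        return [row for row in rows if row["msno"] in kept]

    return keep(members), keep(transactions), keep(user_logs)
```

Because the filter is applied per user ID, a sampled user keeps all of their transactions and logs, and an unsampled user disappears from every table at once.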
- class flightrisk.data.loaders.OrangeBelgiumArtifacts(features, treatment, outcome)
Bases: object
Bundle for the Orange Belgium uplift benchmark.
- Parameters:
features (DataFrame) – 178 anonymised covariates.
treatment (Series) – 0/1 treatment column from the original RCT.
outcome (Series) – 0/1 outcome (retention indicator).
- features: DataFrame
- treatment: Series
- outcome: Series
- flightrisk.data.loaders.load_orange_belgium(path=None)
Load and validate the Orange Belgium uplift benchmark.
- Parameters:
path (str | Path | None) – Optional override pointing at a single Parquet or CSV file. Defaults to data/raw/orange-belgium/orange_belgium.parquet and falls back to orange_belgium.csv if the Parquet file is absent.
- Returns:
An OrangeBelgiumArtifacts bundle.
- Raises:
FileNotFoundError – If no input file is found.
- Return type:
OrangeBelgiumArtifacts
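The Parquet-then-CSV fallback can be sketched as a path-resolution helper. This is an illustration under assumptions, not the loader's code: resolve_orange_belgium_path and its base_dir parameter are hypothetical names, and the CSV is assumed to live in the same directory as the default Parquet file.

```python
from pathlib import Path

# Default directory taken from the documented default path.
DEFAULT_DIR = Path("data/raw/orange-belgium")


def resolve_orange_belgium_path(path=None, base_dir=DEFAULT_DIR):
    """Return the input file to read, preferring Parquet over CSV."""
    if path is not None:
        # An explicit override is used as given, with no fallback.
        candidate = Path(path)
        if not candidate.is_file():
            raise FileNotFoundError(candidate)
        return candidate

    # Assumption: the CSV fallback sits next to the Parquet file.
    for name in ("orange_belgium.parquet", "orange_belgium.csv"):
        candidate = Path(base_dir) / name
        if candidate.is_file():
            return candidate
    raise FileNotFoundError(f"no orange_belgium.parquet or .csv under {base_dir}")
```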
- class flightrisk.data.splits.TemporalSplit(train_idx, val_idx, test_idx, train_cutoff, val_cutoff)
Bases: object
Indices for a strict temporal train / validation / test split.
- Parameters:
- train_cutoff: Timestamp
- val_cutoff: Timestamp
- flightrisk.data.splits.temporal_split(timestamps, *, train_cutoff, val_cutoff)
Split rows into past / near-past / future slices, in that order.
No row may appear in more than one slice. Rows after val_cutoff go to test; rows in (train_cutoff, val_cutoff] go to validation; the remaining rows, at or before train_cutoff, form the train slice.
- Parameters:
- Returns:
A TemporalSplit with non-overlapping integer indices.
- Raises:
ValueError – If val_cutoff does not lie strictly after train_cutoff.
- Return type:
TemporalSplit
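The boundary rules for temporal_split can be made concrete with a minimal sketch. It uses standard-library datetime values and plain lists instead of pandas Timestamps, returns a bare tuple rather than a TemporalSplit, and the name temporal_split_sketch is hypothetical.

```python
from datetime import datetime


def temporal_split_sketch(timestamps, *, train_cutoff, val_cutoff):
    """Assign each row index to exactly one of train / validation / test."""
    if not val_cutoff > train_cutoff:
        raise ValueError("val_cutoff must lie strictly after train_cutoff")

    train_idx, val_idx, test_idx = [], [], []
    for i, ts in enumerate(timestamps):
        if ts > val_cutoff:        # strictly after val_cutoff -> test
            test_idx.append(i)
        elif ts > train_cutoff:    # in (train_cutoff, val_cutoff] -> validation
            val_idx.append(i)
        else:                      # at or before train_cutoff -> train
            train_idx.append(i)
    return train_idx, val_idx, test_idx


# Example call: a row stamped exactly at a cutoff falls on the earlier side.
stamps = [datetime(2017, 1, 1), datetime(2017, 2, 1), datetime(2017, 3, 2)]
split = temporal_split_sketch(
    stamps, train_cutoff=datetime(2017, 2, 1), val_cutoff=datetime(2017, 3, 1)
)
```

Each index lands in exactly one branch of the if/elif/else chain, which is what guarantees the slices never overlap.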
- flightrisk.data.splits.stratified_rct_folds(treatment, outcome, *, n_splits=5, seed=1337)
Build cross-validation folds that preserve the treatment/outcome ratio.
Stratification is done on the joint treatment * 2 + outcome label, so each fold sees a representative slice of the four treatment/outcome cells.
- Parameters:
- Returns:
List of (train_idx, val_idx) tuples.
- Raises:
ValueError – If treatment and outcome have different lengths.
- Return type:
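To show what stratifying on the joint label means, here is a dependency-free sketch. It is a stand-in, not the library's method: the real function may well delegate to an existing stratified splitter such as scikit-learn's StratifiedKFold, and the round-robin assignment and the name stratified_rct_folds_sketch are assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_rct_folds_sketch(treatment, outcome, *, n_splits=5, seed=1337):
    """Build (train_idx, val_idx) folds balanced over the four RCT cells."""
    if len(treatment) != len(outcome):
        raise ValueError("treatment and outcome must have the same length")

    # Joint label: 0 = control/no outcome, 1 = control/outcome,
    #              2 = treated/no outcome, 3 = treated/outcome.
    cells = defaultdict(list)
    for i, (t, y) in enumerate(zip(treatment, outcome)):
        cells[t * 2 + y].append(i)

    # Shuffle each cell, then deal its rows round-robin across the folds.
    rng = random.Random(seed)
    fold_members = [[] for _ in range(n_splits)]
    for label in sorted(cells):
        indices = cells[label]
        rng.shuffle(indices)
        for position, i in enumerate(indices):
            fold_members[position % n_splits].append(i)

    folds = []
    for k in range(n_splits):
        val_idx = sorted(fold_members[k])
        train_idx = sorted(
            i for j, members in enumerate(fold_members) if j != k for i in members
        )
        folds.append((train_idx, val_idx))
    return folds
```

Dealing each cell separately keeps the share of treated responders, treated non-responders, and the two control cells roughly equal across folds, which matters for uplift metrics that compare treated and control outcomes within a fold.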
- flightrisk.data.hashing.sha256_file(path)
Compute the SHA-256 of a single file, streaming in 1 MiB chunks.
- Parameters:
- Returns:
Hex digest of the file contents.
- Raises:
FileNotFoundError – If the file does not exist.
- Return type:
str
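Streaming a file through SHA-256 in 1 MiB chunks can be sketched with the standard library alone. The name sha256_file_sketch is hypothetical and the code is an illustration of the technique, not the module's source.

```python
import hashlib
from pathlib import Path

CHUNK_SIZE = 1024 * 1024  # 1 MiB, matching the documented chunk size


def sha256_file_sketch(path):
    """Return the SHA-256 hex digest of a file without loading it whole."""
    digest = hashlib.sha256()
    # Opening a missing file raises FileNotFoundError on its own.
    with Path(path).open("rb") as handle:
        # read() returns b"" at end of file, which stops the iterator.
        for chunk in iter(lambda: handle.read(CHUNK_SIZE), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Reading in fixed-size chunks keeps memory use constant regardless of file size, which matters for multi-gigabyte raw tables.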
- flightrisk.data.hashing.sha256_paths(paths)
Hash a deterministic ordering of files into one digest.
Each file is hashed individually and the resulting name:digest lines are fed into a final SHA-256 in sorted order, so the result depends on file contents and names only, not directory ordering.
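The combined digest can be sketched as follows. Several details are assumptions because the source does not state them: that "name" means the file's base name, that lines are joined with a newline, and that they are encoded as UTF-8. For brevity the sketch reads each file in one go rather than streaming; sha256_paths_sketch is a hypothetical name.

```python
import hashlib
from pathlib import Path


def sha256_paths_sketch(paths):
    """Combine per-file digests into one order-independent digest."""
    lines = []
    for path in paths:
        path = Path(path)
        file_digest = hashlib.sha256(path.read_bytes()).hexdigest()
        # Assumption: "name" is the base name, not the full path.
        lines.append(f"{path.name}:{file_digest}")

    combined = hashlib.sha256()
    # Sorting removes any dependence on the order the paths were supplied in.
    for line in sorted(lines):
        combined.update((line + "\n").encode("utf-8"))
    return combined.hexdigest()
```

Under these assumptions, passing the same files in a different order yields the same digest, while renaming a file or changing a single byte changes it.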
- flightrisk.data.ingest.dvc_pull(targets=None)
Invoke dvc pull to materialise raw datasets.
- Parameters:
targets (list[str] | None) – Optional list of DVC targets to pull. If None, pull everything tracked by the project.
- Returns:
The exit code of the DVC subprocess.
- Raises:
RuntimeError – If DVC is not installed.
- Return type:
int
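A wrapper with this contract could look like the sketch below. It is an assumption-laden illustration rather than the module's code: the helper names and the executable parameter are hypothetical, and detecting a missing installation via shutil.which is one plausible approach among several.

```python
import shutil
import subprocess


def build_dvc_pull_command(targets=None, executable="dvc"):
    """Return the argument list for `dvc pull`, with optional targets."""
    # With no targets, plain `dvc pull` fetches everything the project tracks.
    return [executable, "pull", *(targets or [])]


def dvc_pull_sketch(targets=None, executable="dvc"):
    """Run `dvc pull` and return the subprocess exit code."""
    if shutil.which(executable) is None:
        raise RuntimeError(f"{executable!r} not found on PATH; is DVC installed?")
    # check=False: a failed pull is reported through the exit code, not raised.
    return subprocess.run(build_dvc_pull_command(targets, executable), check=False).returncode
```

Returning the exit code instead of raising lets the caller decide whether a failed pull should trigger a fallback download.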
- flightrisk.data.ingest.kaggle_download_kkbox()
Download the KKBox raw bundle via the Kaggle CLI as a fallback.
Requires KAGGLE_USERNAME / KAGGLE_KEY (or FLIGHTRISK_KAGGLE_USERNAME / FLIGHTRISK_KAGGLE_KEY) and that the user has accepted the competition rules.
- Returns:
The directory where files were extracted.
- Raises:
RuntimeError – If Kaggle credentials are missing or the CLI fails.
- Return type:
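The credential check can be sketched as a lookup over the two documented variable pairs. The function name resolve_kaggle_credentials is hypothetical, and the precedence (plain KAGGLE_* before FLIGHTRISK_KAGGLE_*) as well as the requirement that both values come from the same pair are assumptions; the source does not specify either.

```python
import os


def resolve_kaggle_credentials(env=None):
    """Return (username, key) from the first complete credential pair."""
    env = os.environ if env is None else env
    # Assumed precedence: unprefixed variables first, then the FLIGHTRISK_ pair.
    for prefix in ("", "FLIGHTRISK_"):
        username = env.get(f"{prefix}KAGGLE_USERNAME")
        key = env.get(f"{prefix}KAGGLE_KEY")
        if username and key:
            return username, key
    raise RuntimeError(
        "Kaggle credentials missing: set KAGGLE_USERNAME/KAGGLE_KEY "
        "or FLIGHTRISK_KAGGLE_USERNAME/FLIGHTRISK_KAGGLE_KEY"
    )
```

Failing before the CLI is invoked gives a clearer error than letting the download fail on authentication.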