API Reference: Data
Dataset loaders
auditml.data.datasets
Dataset loading, splitting, and DataLoader creation for AuditML.
The key abstraction here is the member / non-member split — every privacy attack needs to know which samples were used for training (members) and which were not (non-members). This module provides reproducible, seed-controlled splits at the dataset level.
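As a small illustration of why the split matters, the sketch below turns a set of member indices into the ground-truth labels that a membership-inference attack would be scored against. The helper `membership_labels` is hypothetical and not part of AuditML; it only assumes that member indices are available as a NumPy array, as the return types in this module suggest.

```python
import numpy as np


def membership_labels(n_samples: int, member_indices: np.ndarray) -> np.ndarray:
    """Return a 0/1 array where 1 marks a member (training) sample.

    Hypothetical helper for illustration only; not an AuditML function.
    """
    labels = np.zeros(n_samples, dtype=np.int64)
    labels[member_indices] = 1
    return labels


# Toy example: 6 samples, of which 0, 2 and 5 were used for training.
labels = membership_labels(6, np.array([0, 2, 5]))
print(labels.tolist())  # [1, 0, 1, 0, 0, 1]
```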
DatasetInfo (dataclass)
get_dataset(name: str, train: bool = True, data_dir: str = './data', download: bool = True) -> Dataset
Load a torchvision dataset by name.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `name` | `str` | | required |
| `train` | `bool` | Load the training split if | `True` |
| `data_dir` | `str` | Root directory for downloaded data. | `'./data'` |
| `download` | `bool` | Download the dataset if not already present. | `True` |
Returns:
| Type | Description |
|---|---|
| `Dataset` | A torchvision dataset with the appropriate transforms applied. |
Source code in src/auditml/data/datasets.py
create_member_nonmember_split(dataset: Dataset, member_ratio: float = 0.5, seed: int = 42) -> tuple[Subset, Subset, np.ndarray, np.ndarray]
Split dataset into disjoint member and non-member subsets.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `dataset` | `Dataset` | The full training dataset. | required |
| `member_ratio` | `float` | Fraction of samples assigned to the member set. | `0.5` |
| `seed` | `int` | Random seed for reproducibility. | `42` |
Returns:
| Type | Description |
|---|---|
| `(member_subset, nonmember_subset, member_indices, nonmember_indices)` | |
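A seed-controlled disjoint split of this kind can be sketched at the index level as follows. This is a minimal sketch under stated assumptions, not the code in `src/auditml/data/datasets.py`: the function name `split_indices`, the use of `np.random.default_rng`, and the rounding of the member count are all illustrative choices.

```python
import numpy as np


def split_indices(n_samples: int, member_ratio: float = 0.5, seed: int = 42):
    """Partition range(n_samples) into disjoint member / non-member index arrays.

    Illustrative only: the real AuditML implementation may shuffle or round
    differently.
    """
    if not 0.0 < member_ratio < 1.0:
        raise ValueError("member_ratio must be strictly between 0 and 1")
    rng = np.random.default_rng(seed)      # seeded generator -> reproducible
    perm = rng.permutation(n_samples)      # shuffled copy of 0..n_samples-1
    n_members = int(n_samples * member_ratio)
    member_idx = np.sort(perm[:n_members])     # first chunk -> members
    nonmember_idx = np.sort(perm[n_members:])  # remainder -> non-members
    return member_idx, nonmember_idx


members, nonmembers = split_indices(10, member_ratio=0.5, seed=42)
```

Because both arrays are slices of one permutation, they cannot overlap, and calling the function again with the same seed returns the same arrays. The index arrays could then be wrapped with `torch.utils.data.Subset(dataset, indices)` to obtain subsets like the ones returned above.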
get_shadow_data_splits(dataset: Dataset, n_shadows: int = 5, member_ratio: float = 0.5, seed: int = 42) -> list[tuple[Subset, Subset, np.ndarray, np.ndarray]]
Create n_shadows independent member/non-member splits.
Each shadow model will be trained on its own member subset, providing diverse "in" vs "out" examples for the attack classifier.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `dataset` | `Dataset` | The full dataset to split. | required |
| `n_shadows` | `int` | Number of independent splits. | `5` |
| `member_ratio` | `float` | Fraction of the dataset each shadow's member set uses. | `0.5` |
| `seed` | `int` | Base seed; each split uses | `42` |
Returns:
| Type | Description |
|---|---|
| list of `(member_subset, nonmember_subset, member_indices, nonmember_indices)` | |
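The idea of several independent splits derived from one base seed can be sketched as below. How AuditML derives each split's seed is not stated in this reference, so the `seed + i` rule here is purely an assumption, and `shadow_index_splits` is a hypothetical name used for illustration.

```python
import numpy as np


def shadow_index_splits(n_samples: int, n_shadows: int = 5,
                        member_ratio: float = 0.5, seed: int = 42):
    """Return n_shadows (member_idx, nonmember_idx) pairs over range(n_samples).

    Assumption: split i is seeded with seed + i. The actual derivation used
    by AuditML is not documented here.
    """
    n_members = int(n_samples * member_ratio)
    splits = []
    for i in range(n_shadows):
        rng = np.random.default_rng(seed + i)  # assumed per-split seed
        perm = rng.permutation(n_samples)
        splits.append((perm[:n_members], perm[n_members:]))
    return splits


splits = shadow_index_splits(20, n_shadows=5, member_ratio=0.5, seed=42)
```

Each pair partitions the full index range on its own, so a given sample can be a member for one shadow model and a non-member for another, which is what gives the attack classifier both "in" and "out" examples.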
get_dataloaders(dataset_name: str, batch_size: int = 64, member_ratio: float = 0.5, num_workers: int = 2, seed: int = 42, data_dir: str = './data', download: bool = True) -> dict[str, DataLoader | np.ndarray]
Convenience loader that returns everything needed for an audit.
Returns:
| Type | Description |
|---|---|
| dict with keys: | |
Transforms
auditml.data.transforms
Standard transforms for each dataset.
get_transforms(dataset: str, train: bool = True) -> transforms.Compose
Return the appropriate transform for a given dataset and split.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `dataset` | `str` | | required |
| `train` | `bool` | If | `True` |