Adding custom columns#
The AppointmentScheduler class supports extending the synthetic dataset with custom categorical variables, allowing you to model additional patient attributes such as insurance type, region, provider group, or socioeconomic category.
This feature makes the dataset more flexible and adaptable to different simulation or research contexts without modifying the core generation pipeline.
Overview#
After running generate(), you can call scheduler.add_custom_column() to append a new column to the patients_df table.
Each column is created by sampling from a user-defined set of categories according to one of three probability models — normal, uniform, or Pareto — or from custom probabilities supplied directly.
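Conceptually, adding a column amounts to drawing one category label per patient from a weighted distribution. The sketch below illustrates that idea using only Python's standard library. It is not the library's implementation, and the helper name sample_categories is hypothetical.

```python
import random
from typing import List, Sequence


def sample_categories(
    categories: Sequence[str],
    probs: Sequence[float],
    n_rows: int,
    seed: int = 42,
) -> List[str]:
    """Draw one label per row, weighted by probs (hypothetical helper)."""
    rng = random.Random(seed)  # local generator, so the draw is repeatable
    return rng.choices(list(categories), weights=list(probs), k=n_rows)


# One label per (imaginary) patient row.
column = sample_categories(["Public", "Private"], [0.7, 0.3], n_rows=5)
print(column)
```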
Function signature#
add_custom_column(
    column_name: str,
    categories: List[str],
    *,
    distribution_type: str = "normal",
    custom_probs: Optional[List[float]] = None
) -> None
Parameters#
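| Parameter | Type | Description |
|---|---|---|
| column_name | str | Name of the new column to add to patients_df. |
| categories | List[str] | Category labels to sample from (e.g., ["Public", "Private"]). |
| distribution_type | str | Defines how probabilities are assigned when no explicit vector is given. Default = "normal". |
| custom_probs | Optional[List[float]] | User-specified probability vector matching categories. |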
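The bare * in the signature makes distribution_type and custom_probs keyword-only. The stand-in below copies the documented signature but does nothing (the name add_custom_column_stub is hypothetical); it only demonstrates how Python treats such arguments.

```python
from typing import List, Optional


def add_custom_column_stub(
    column_name: str,
    categories: List[str],
    *,
    distribution_type: str = "normal",
    custom_probs: Optional[List[float]] = None,
) -> None:
    """Stand-in with the documented signature; it performs no work."""
    return None


# Accepted: the option is passed by keyword.
add_custom_column_stub("region", ["North", "South"], distribution_type="uniform")

# Rejected by Python itself: a third positional argument is not allowed.
try:
    add_custom_column_stub("region", ["North", "South"], "uniform")
    positional_rejected = False
except TypeError:
    positional_rejected = True

print(positional_rejected)  # → True
```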
Distribution models#
The sampling probabilities can be generated automatically using one of the following internal methods:
| Distribution | Function | Behavior |
|---|---|---|
| Normal | | Creates a bell-shaped distribution centered on middle categories. Suitable for variables with a most common group (default). |
| Uniform | | Assigns nearly equal probability to all categories, with light noise. Useful for balanced attributes. |
| Pareto | | Generates a heavy-tailed distribution where the first category dominates. Models skewed real-world variables such as provider caseload or insurance concentration. |
All three functions apply a small multiplicative noise factor (±noise, default 10%) to each probability, so the generated distribution is never perfectly regular.
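To make the three shapes concrete, here is a minimal sketch of how such probability vectors could be built. Everything in it is an assumption made for illustration: the function names, the Gaussian spread, the power-law exponent standing in for the Pareto shape, and the uniform noise draw. The library's internal functions may differ.

```python
import math
import random
from typing import List


def _with_noise(weights: List[float], noise: float, rng: random.Random) -> List[float]:
    """Scale each weight by a factor in [1 - noise, 1 + noise], then normalize."""
    noisy = [w * rng.uniform(1 - noise, 1 + noise) for w in weights]
    total = sum(noisy)
    return [w / total for w in noisy]


def sketch_probs(n: int, kind: str = "normal", noise: float = 0.1, seed: int = 42) -> List[float]:
    """Return n probabilities with the requested shape (hypothetical helper)."""
    rng = random.Random(seed)
    if kind == "normal":
        center = (n - 1) / 2
        sigma = max(n / 4, 0.5)  # assumed spread, not taken from the library
        base = [math.exp(-0.5 * ((i - center) / sigma) ** 2) for i in range(n)]
    elif kind == "uniform":
        base = [1.0] * n
    elif kind == "pareto":
        # Assumed power-law decay, so that the first category dominates.
        base = [1.0 / (i + 1) ** 1.5 for i in range(n)]
    else:
        raise ValueError(f"unknown distribution kind: {kind!r}")
    return _with_noise(base, noise, rng)


for kind in ("normal", "uniform", "pareto"):
    print(kind, [round(p, 3) for p in sketch_probs(5, kind)])
```

With five categories, the normal shape peaks at the third label, the uniform shape stays close to 0.2 everywhere, and the Pareto-like shape puts the largest share on the first label.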
Behavior and validation#
patients_df must already be populated: you must call generate() before adding custom columns.
The column name cannot already exist in the table.
If custom_probs is provided:
    Its length must match the number of categories.
    Values must be positive and sum to a finite, nonzero total.
If no custom probabilities are supplied, the probabilities are generated using the selected distribution_type.
The column is added directly to patients_df, making it available for analysis or joins.
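The rules above can be sketched as a standalone check. This is an illustration only: the helper check_custom_column is hypothetical, a plain dict of columns stands in for patients_df, and raising ValueError is an assumption, since this page does not say which exception the library raises.

```python
import math
from typing import Dict, Optional, Sequence


def check_custom_column(
    table: Dict[str, list],
    column_name: str,
    categories: Sequence[str],
    custom_probs: Optional[Sequence[float]] = None,
) -> None:
    """Raise ValueError if a documented precondition fails (hypothetical helper)."""
    if not table or not any(len(values) for values in table.values()):
        raise ValueError("table is empty; call generate() first")
    if column_name in table:
        raise ValueError(f"column {column_name!r} already exists")
    if custom_probs is not None:
        if len(custom_probs) != len(categories):
            raise ValueError("custom_probs must have one entry per category")
        if any(p <= 0 for p in custom_probs):
            raise ValueError("custom_probs values must be positive")
        total = sum(custom_probs)
        if not math.isfinite(total) or total == 0:
            raise ValueError("custom_probs must sum to a finite, nonzero total")


patients = {"patient_id": [1, 2, 3]}  # stand-in for a populated patients_df

# Passes silently: three categories, three positive probabilities.
check_custom_column(patients, "clinic_branch", ["A", "B", "C"], [0.6, 0.3, 0.1])

# Fails: only two probabilities for three categories.
try:
    check_custom_column(patients, "clinic_branch", ["A", "B", "C"], [0.6, 0.4])
except ValueError as err:
    print("rejected:", err)
```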
Examples#
Example 1 – Uniform distribution#
from medscheduler import AppointmentScheduler
sched = AppointmentScheduler(seed=42)
# generate() must run first so that patients_df is populated.
sched.generate()
sched.add_custom_column(
    column_name="insurance_type",
    categories=["Public", "Private"],
    distribution_type="uniform"
)
# Inspect the first rows of the new column.
print(sched.patients_df[["patient_id", "insurance_type"]].head())
Example 2 – Normal distribution with more categories#
# Reuses the sched object created in Example 1.
sched.add_custom_column(
    column_name="region",
    categories=["North", "Center", "South"],
    distribution_type="normal"
)
Example 3 – Custom probabilities#
# Reuses the sched object created in Example 1.
sched.add_custom_column(
    column_name="clinic_branch",
    categories=["A", "B", "C"],
    custom_probs=[0.6, 0.3, 0.1]
)
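A quick way to sanity-check a column built from custom probabilities is to compare observed category shares against the requested ones. The sketch below does this on simulated draws from the standard library rather than on the real table, so it runs without medscheduler installed; the sample size of 10,000 is an arbitrary choice.

```python
import random
from collections import Counter

categories = ["A", "B", "C"]
custom_probs = [0.6, 0.3, 0.1]

# Simulated stand-in for the generated column.
rng = random.Random(42)
draws = rng.choices(categories, weights=custom_probs, k=10_000)

# Share of rows that fell into each category.
counts = Counter(draws)
observed = {label: counts[label] / len(draws) for label in categories}

for label, target in zip(categories, custom_probs):
    print(f"{label}: requested {target:.2f}, observed {observed[label]:.3f}")
```

On the real table, and assuming patients_df is a pandas DataFrame, sched.patients_df["clinic_branch"].value_counts(normalize=True) gives the same kind of summary.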
Practical applications#
Adding custom columns is particularly useful for:
Simulating heterogeneous service providers or regional health centers.
Creating patient segmentation variables for dashboards.
Building realistic mock data for machine learning or visualization demos.
Introducing controlled bias or imbalance in training datasets.
With the same random seed, each column is generated identically, so results are reproducible across runs.
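The reproducibility property can be illustrated with a seeded generator from the standard library. This is a sketch of the general idea, not of how AppointmentScheduler manages its seed internally; draw_column and its weights are hypothetical.

```python
import random
from typing import List


def draw_column(seed: int, n_rows: int = 8) -> List[str]:
    """Draw a small categorical column from a generator seeded with seed."""
    rng = random.Random(seed)
    return rng.choices(["North", "Center", "South"], weights=[0.25, 0.5, 0.25], k=n_rows)


first_run = draw_column(seed=42)
second_run = draw_column(seed=42)

# Same seed, same sequence of labels.
print(first_run == second_run)  # → True
```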