Adding custom columns#

The AppointmentScheduler class supports extending the synthetic dataset with custom categorical variables, allowing you to model additional patient attributes such as insurance type, region, provider group, or socioeconomic category.

This feature makes the dataset more flexible and adaptable to different simulation or research contexts without modifying the core generation pipeline.


Overview#

After running generate(), you can call

scheduler.add_custom_column()

to append a new column to the patients_df table.

Each column is created by sampling from a user-defined set of categories, using either one of three built-in probability models (normal, uniform, or Pareto) or custom probabilities supplied directly.


Function signature#

add_custom_column(
    column_name: str,
    categories: List[str],
    *,
    distribution_type: str = "normal",
    custom_probs: Optional[List[float]] = None
) -> None

Parameters#

| Parameter | Type | Description |
|---|---|---|
| column_name | str | Name of the new column to add to patients_df. Must not duplicate existing columns. |
| categories | list[str] | Category labels to sample from (e.g., ["Public", "Private"]). |
| distribution_type | {"normal", "uniform", "pareto"} | Defines how probabilities are assigned when no explicit vector is given. Default: "normal". |
| custom_probs | list[float], optional | User-specified probability vector matching categories. Must sum approximately to 1. |


Distribution models#

The sampling probabilities can be generated automatically using one of the following internal methods:

| Distribution | Function | Behavior |
|---|---|---|
| Normal | _normal_distribution() | Creates a bell-shaped distribution centered on the middle categories. Suitable for variables with one most common group. This is the default. |
| Uniform | _uniform_distribution() | Assigns nearly equal probability to all categories, with light noise. Useful for balanced attributes. |
| Pareto | _pareto_distribution() | Generates a heavy-tailed distribution where the first category dominates. Models skewed real-world variables such as provider caseload or insurance concentration. |

All three functions apply a small multiplicative noise factor (±noise, default 10%) to the generated probabilities, so the resulting proportions vary slightly around the idealized shape rather than matching it exactly.
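The three shapes can be illustrated with a short standalone sketch. It is not the library's internal code: the function names, the sigma and alpha values, and the exact way noise is applied are assumptions chosen only to reproduce the behavior described above (a bell shape peaking in the middle, near-equal weights, and a dominant first category, each perturbed by up to ±10% and then normalized).

```python
import math
import random


def apply_noise(weights, noise, rng):
    """Scale each weight by a random factor in [1 - noise, 1 + noise], then normalize."""
    noisy = [w * rng.uniform(1.0 - noise, 1.0 + noise) for w in weights]
    total = sum(noisy)
    return [w / total for w in noisy]


def normal_probs(n, noise=0.1, rng=None):
    """Bell-shaped weights that peak at the middle category."""
    rng = random.Random() if rng is None else rng
    center = (n - 1) / 2
    sigma = max(n / 4, 0.5)  # assumed spread, not taken from the library
    weights = [math.exp(-((i - center) ** 2) / (2 * sigma ** 2)) for i in range(n)]
    return apply_noise(weights, noise, rng)


def uniform_probs(n, noise=0.1, rng=None):
    """Equal weights, so only the noise makes the categories differ."""
    rng = random.Random() if rng is None else rng
    return apply_noise([1.0] * n, noise, rng)


def pareto_probs(n, alpha=1.5, noise=0.1, rng=None):
    """Power-law weights 1 / rank**alpha, so the first category dominates."""
    rng = random.Random() if rng is None else rng
    weights = [1.0 / (i + 1) ** alpha for i in range(n)]  # alpha is an assumed value
    return apply_noise(weights, noise, rng)


rng = random.Random(42)
for name, fn in [("normal", normal_probs), ("uniform", uniform_probs), ("pareto", pareto_probs)]:
    probs = fn(5, rng=rng)
    print(name, [round(p, 3) for p in probs])
```

Because the noise is multiplicative and bounded, it changes the proportions slightly but does not change which category is the most likely in the normal and Pareto cases for the parameters used here.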


Behavior and validation#

  • patients_df must already be populated: call generate() before adding custom columns.

  • The column name cannot already exist in the table.

  • If custom_probs is provided:

    • Its length must match the number of categories.

    • Values must be positive and sum to a finite, nonzero total.

  • If no custom probabilities are supplied, the probabilities are generated using the selected distribution_type.

  • The column is added directly to patients_df, making it available for analysis or joins.
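The checks listed above can be sketched as a standalone function. This is a hypothetical stand-in, not the library's implementation: the function name, the exception types, the error messages, and the final normalization step are all assumptions made for illustration.

```python
import math


def check_custom_column(existing_columns, column_name, categories, custom_probs=None):
    """Hypothetical validation helper mirroring the rules listed above.

    Returns normalized probabilities, or None when no custom_probs are given
    (in which case the selected distribution_type would be used instead).
    """
    # The patients table must exist and have columns, i.e. generate() has run.
    if not existing_columns:
        raise RuntimeError("patients_df is empty; call generate() first")

    # The new column name must not collide with an existing one.
    if column_name in existing_columns:
        raise ValueError(f"column {column_name!r} already exists")

    if custom_probs is None:
        return None

    # One probability per category.
    if len(custom_probs) != len(categories):
        raise ValueError("custom_probs must have the same length as categories")

    # Values must be positive, with a finite, nonzero total.
    if any(p <= 0 for p in custom_probs):
        raise ValueError("custom_probs values must be positive")
    total = sum(custom_probs)
    if not math.isfinite(total) or total == 0:
        raise ValueError("custom_probs must sum to a finite, nonzero total")

    # Assumption: rescale so the vector sums to exactly 1.
    return [p / total for p in custom_probs]


columns = ["patient_id", "name"]
print(check_custom_column(columns, "clinic_branch", ["A", "B", "C"], [6, 3, 1]))
```

Failing early with a specific message keeps a bad call from leaving a half-built column in the table.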


Examples#

Example 1 – Uniform distribution#

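from medscheduler import AppointmentScheduler

# Build the base tables first; add_custom_column() needs a populated patients_df.
sched = AppointmentScheduler(seed=42)
sched.generate()

# Two categories with roughly equal probability.
sched.add_custom_column(
    column_name="insurance_type",
    categories=["Public", "Private"],
    distribution_type="uniform"
)

# Inspect the first rows of the new column.
print(sched.patients_df[["patient_id", "insurance_type"]].head())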

Example 2 – Normal distribution with more categories#

# Reuses the sched instance created in Example 1.
sched.add_custom_column(
    column_name="region",
    categories=["North", "Center", "South"],
    distribution_type="normal"
)

Example 3 – Custom probabilities#

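# Reuses the sched instance created in Example 1.
# distribution_type is not used when custom_probs is supplied.
sched.add_custom_column(
    column_name="clinic_branch",
    categories=["A", "B", "C"],
    custom_probs=[0.6, 0.3, 0.1]
)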
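
A quick way to sanity-check a column built from custom probabilities is to compare the observed category shares with the targets. The sketch below does not call medscheduler: it uses the standard library's random.choices as a stand-in for the sampled column, and the row count of 10,000 is an arbitrary assumption. On the real table, the pandas call patients_df["clinic_branch"].value_counts(normalize=True) gives the same kind of summary.

```python
import random
from collections import Counter

categories = ["A", "B", "C"]
target_probs = [0.6, 0.3, 0.1]

# Stand-in for the sampled column: 10,000 weighted draws with a fixed seed.
rng = random.Random(42)
column = rng.choices(categories, weights=target_probs, k=10_000)

# Observed share of each category, to compare against the targets.
counts = Counter(column)
observed = {cat: counts[cat] / len(column) for cat in categories}

for cat, target in zip(categories, target_probs):
    print(f"{cat}: target={target:.2f} observed={observed[cat]:.3f}")
```

The observed shares should be close to the targets but not identical, since each row is an independent random draw.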

Practical applications#

Adding custom columns is particularly useful for:

  • Simulating heterogeneous service providers or regional health centers.

  • Creating patient segmentation variables for dashboards.

  • Building realistic mock data for machine learning or visualization demos.

  • Introducing controlled bias or imbalance in training datasets.

With a fixed random seed, each column is generated identically on every run, so results are reproducible.
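
The idea behind seed-based reproducibility can be shown with a minimal standalone sketch. The helper sample_column is hypothetical and uses the standard library's random module; how AppointmentScheduler manages its random state internally is not shown here.

```python
import random


def sample_column(categories, probs, n_rows, seed):
    """Hypothetical helper: draw n_rows labels from a dedicated, seeded generator."""
    rng = random.Random(seed)
    return rng.choices(categories, weights=probs, k=n_rows)


first = sample_column(["A", "B", "C"], [0.6, 0.3, 0.1], 1000, seed=42)
second = sample_column(["A", "B", "C"], [0.6, 0.3, 0.1], 1000, seed=42)

# Same seed and same inputs give the same sequence of labels.
print(first == second)
```

Changing the seed, the categories, or the probabilities produces a different column.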