Automated Equipopulated Binning
The framework includes a utility script, `adjust_binning.py`, that automatically calculates and updates the binning definitions in the configuration when an equipopulated binning of the variable specified in `var_dependence` is used. Bin edges are derived separately for each category, which gives a more robust statistical treatment in each analysis region. The general intention is an equipopulated data distribution in the SRlike region, given possible additional splittings (e.g., in `njets`).
The script is intended to be run before the fake factor and correction calculations:

    python adjust_binning.py --config PATH/CONFIG.yaml --cuts-config PATH/CONFIG_WITH_REGION_CUTS.yaml --processes QCD Wjets --cut-region ARlike

Argument | Type | Description |
---|---|---|
`--config` | string | Path to the YAML configuration file that will be modified. |
`--cuts-config` | string | (Optional) Path to a separate YAML file containing cut definitions. This is useful for correction YAML files, where cuts are not specified directly for each correction and are instead sourced from the cuts-config file, which can simply be the fake factor YAML configuration. |
`--processes` | list[string] | A list of processes to adjust (e.g., `QCD`, `Wjets`, `ttbar`, `process_fractions`). By default, all processes are included. |
`--cut-region` | string | The region cut to be applied when determining event counts (e.g., `SRlike`, `ARlike`). `process_fractions` always uses the `AR_cut`. |
`--dry-run` | bool | Preview the binning calculations and changes without modifying the config file. |
`--dataset` | string | Name of the dataset used for equipopulated binning adjustments. The default is `data`; a different dataset can be chosen, e.g., for MC studies. |
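To make the argument table more concrete, the following is a minimal sketch of how such a command-line interface could be declared with Python's `argparse`. It is not taken from `adjust_binning.py`: the function name `build_parser`, the choice to make `--config` required, and the `None` defaults for undocumented defaults are assumptions for illustration. Flag names follow the table.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented arguments (illustration only)."""
    parser = argparse.ArgumentParser(
        description="Adjust equipopulated binning in a YAML configuration."
    )
    # Assumption: the config path is mandatory.
    parser.add_argument("--config", type=str, required=True,
                        help="YAML configuration file that will be modified")
    parser.add_argument("--cuts-config", type=str, default=None,
                        help="optional YAML file that provides the cut definitions")
    # Assumption: None stands for "all processes", the documented default.
    parser.add_argument("--processes", nargs="+", default=None,
                        help="processes to adjust, e.g. QCD Wjets")
    # The default region is not documented, so none is assumed here.
    parser.add_argument("--cut-region", type=str, default=None,
                        help="region cut used when counting events, e.g. SRlike")
    parser.add_argument("--dry-run", action="store_true",
                        help="preview changes without writing the config")
    parser.add_argument("--dataset", type=str, default="data",
                        help="dataset used for the equipopulated binning")
    return parser


if __name__ == "__main__":
    example = build_parser().parse_args(
        ["--config", "PATH/CONFIG.yaml", "--processes", "QCD", "Wjets",
         "--cut-region", "ARlike", "--dry-run"]
    )
    print(example)
```

Using `nargs="+"` lets several process names follow a single `--processes` flag, as in the command shown earlier, and `action="store_true"` makes `--dry-run` a plain switch.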
Configuration
To enable this feature, add an `equipopulated_binning_options` block to the relevant process or correction section in your configuration file. The script uses this block to generate the new binning. It populates the `var_bins` key and also updates `split_categories` and `split_categories_binedges` if they are left blank, e.g., for continuous variables such as `pt_1`.
For example, a manually binned configuration for the `QCD` process:

    # Before: Manual binning for QCD process
    target_processes:
      QCD:
        split_categories:
          njets: ["==0", "==1", ">=2"]
        split_categories_binedges:
          njets: [-0.5, 0.5, 1.5, 12.5]
        var_dependence: pt_2
        var_bins:
          "==0": [30.0, 31.6, 33.5, 35.9, 38.9, 43.6, 51.6, 150.0]
          "==1": [30.0, 32.2, 35.3, 40.3, 50.9, 150.0]
          ">=2": [30.0, 32.7, 36.5, 42.5, 57.0, 150.0]

This can be replaced with a configuration for automated binning:

    # After: Configuration for automated binning
    target_processes:
      QCD:
        split_categories:
          njets: ["==0", "==1", ">=2"]
        split_categories_binedges:
          njets: [-0.5, 0.5, 1.5, 12.5]
        var_dependence: pt_2
        var_bins: {}  # This will be filled by the script
        equipopulated_binning_options:
          variable_config:
            pt_2:
              min: 30.0
              max: 150.0
              rounding: 2
          n_bins:  # strictly needed only for continuous variables
            njets: 3  # number of bins for njets ==0, ==1, >=2
          var_dependence_n_bins: [7, 5, 5]

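As an illustration of how the category strings in `split_categories` partition the events before each subset is binned on its own, here is a minimal Python sketch. The helpers `parse_category` and `split_by_category`, the representation of events as dictionaries, and the supported operator set are assumptions made for this example and are not the implementation used in `adjust_binning.py`. Merged categories are not handled.

```python
import operator
import re

# Assumed set of comparison operators for category strings such as "==0" or ">=2".
_OPERATORS = {
    "==": operator.eq,
    ">=": operator.ge,
    "<=": operator.le,
    ">": operator.gt,
    "<": operator.lt,
}

# Two-character operators are listed first so that ">=" is not read as ">".
_CUT_PATTERN = re.compile(r"\s*(==|>=|<=|>|<)\s*(-?\d+(?:\.\d+)?)\s*")


def parse_category(cut):
    """Turn a category string like '>=2' into a predicate on a number."""
    match = _CUT_PATTERN.fullmatch(cut)
    if match is None:
        raise ValueError(f"unsupported category string: {cut!r}")
    compare = _OPERATORS[match.group(1)]
    threshold = float(match.group(2))
    return lambda value: compare(value, threshold)


def split_by_category(events, variable, categories):
    """Assign each event to the first category whose cut it passes."""
    predicates = {cut: parse_category(cut) for cut in categories}
    groups = {cut: [] for cut in categories}
    for event in events:
        for cut, passes in predicates.items():
            if passes(event[variable]):
                groups[cut].append(event)
                break
    return groups


if __name__ == "__main__":
    # Toy events; in the real workflow these would come from the chosen dataset.
    events = [
        {"njets": 0, "pt_2": 31.0},
        {"njets": 1, "pt_2": 45.2},
        {"njets": 3, "pt_2": 60.1},
        {"njets": 0, "pt_2": 38.7},
        {"njets": 2, "pt_2": 33.3},
    ]
    groups = split_by_category(events, "njets", ["==0", "==1", ">=2"])
    for cut, members in groups.items():
        print(cut, len(members))
```

Each resulting group would then receive its own `var_bins` entry, with the number of bins taken from the matching position in `var_dependence_n_bins`.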
The `equipopulated_binning_options` block has the following structure:

Parameter | Type | Description |
---|---|---|
`variable_config` | dict | Defines parameters for variables used in binning. See sub-parameters below. |
`n_bins` | dict | Specifies the desired number of bins per variable in `split_categories` when the categories are not set there, as is the case for continuous variables. For example, `pt_1: 4` requests four equipopulated bins in `pt_1`. |
`var_dependence_n_bins` | int or list[int] or list[list[int]] | Number of bins used for the variable defined in `var_dependence`, given either as a single int (used for all categories) or as a (nested) list of integers defining the number of bins for each (nested) category. |
The `variable_config` for each variable contains:

Sub-parameter | Type | Description |
---|---|---|
`min` | float | The lower bound for the variable's range. The first bin edge is set to this value. It is applied as a cut before the equipopulated binning is calculated. |
`max` | float | The upper bound for the variable's range. The last bin edge is set to this value. It is applied as a cut before the equipopulated binning is calculated. |
`rounding` | int | The number of decimal places to which the calculated bin edges are rounded before they are written to the configuration file. |
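The roles of `min`, `max`, and `rounding` can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the code in `adjust_binning.py`: the function name `equipopulated_edges` is hypothetical, events are treated as unweighted, the range cut is taken as `min <= value < max`, and the inner edges come from `statistics.quantiles` with the `"inclusive"` method.

```python
import statistics


def equipopulated_edges(values, n_bins, vmin, vmax, rounding):
    """Return n_bins + 1 edges with roughly equal event counts per bin.

    Assumptions: unweighted events and a half-open range cut [vmin, vmax).
    """
    if n_bins < 1:
        raise ValueError("n_bins must be at least 1")
    # Apply the range cut before any quantile is computed.
    selected = sorted(v for v in values if vmin <= v < vmax)
    if len(selected) < 2:
        raise ValueError("need at least two events inside [vmin, vmax)")
    # quantiles() returns the n_bins - 1 inner cut points.
    inner = statistics.quantiles(selected, n=n_bins, method="inclusive")
    # The outer edges are pinned to the configured bounds; only the
    # inner edges are rounded.
    return [float(vmin)] + [round(q, rounding) for q in inner] + [float(vmax)]


if __name__ == "__main__":
    # Toy values standing in for the var_dependence variable (e.g. pt_2).
    values = [float(x) for x in range(200)]
    edges = equipopulated_edges(values, n_bins=4, vmin=30.0, vmax=150.0, rounding=2)
    print(edges)
```

Pinning the first and last edges to `min` and `max` matches the table above. With a small `rounding` and narrow distributions, rounded inner edges can coincide, so a real implementation would likely need to check for duplicate edges.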
Advanced Splitting Strategies
The script supports various one- and two-dimensional splitting strategies, determined by the structure of `split_categories`. The order of variables defines the nesting hierarchy.
- One-Dimensional Splitting
  - By a discrete variable (e.g., `njets`): Categories are explicitly listed.

    Number of bins of `var_dependence` set explicitly for each category:

        split_categories:
          njets: ["==0", "==1", ">=2"]
        split_categories_binedges:
          njets: [-0.5, 0.5, 1.5, 12.5]
        equipopulated_binning_options:
          variable_config:
            pt_2:
              min: 30.0
              max: 150.0
              rounding: 4
          var_dependence_n_bins: [7, 5, 5]

    Same number of bins of `var_dependence` in each category:

        split_categories:
          njets: ["==0", "==1", ">=2"]
        split_categories_binedges:
          njets: [-0.5, 0.5, 1.5, 12.5]
        equipopulated_binning_options:
          variable_config:
            pt_2:
              min: 30.0
              max: 150.0
              rounding: 4
          var_dependence_n_bins: 7

  - By a continuous variable (e.g., `pt_1`): An empty list `[]` is used as a placeholder. The script first creates equipopulated bins for `pt_1` and then bins `var_dependence` within each of those new `pt_1` categories.

        split_categories:
          pt_1: []
        split_categories_binedges:
          pt_1: []
        equipopulated_binning_options:
          variable_config:
            pt_2:
              min: 30.0
              max: 150.0
              rounding: 4
            pt_1:
              min: 0.0
              max: 1000.0
              rounding: 2
          n_bins:
            pt_1: 4  # Creates 4 categories for pt_1, which are then used for pt_2 binning
          var_dependence_n_bins: [7, 7, 6, 5]  # number of bins per pt_1 split

- Two-Dimensional (Nested) Splitting
  - Two discrete variables (e.g., `tau_decaymode_2` then `njets`). You can also merge categories using the `#||#` syntax.

        split_categories:
          tau_decaymode_2: ["==0", "==1", "==10", "==11"]
          njets: ["==0", "==1", ">=2"]
        split_categories_binedges:
          tau_decaymode_2: [-0.5, 0.5, 9.5, 11.5]
          njets: [-0.5, 0.5, 1.5, 12.5]
        equipopulated_binning_options:
          variable_config:
            pt_2:
              min: 30.0
              max: 150.0
              rounding: 4
          var_dependence_n_bins: [[9, 8, 7], [7, 7, 7], 6, 6]  # here 6 == [6, 6, 6]

  - Discrete then continuous variable (e.g., `njets` then `pt_1`). Equipopulated `pt_1` bins are created within each `njets` category.

        split_categories:
          njets: ["==0", "==1", ">=2"]
          pt_1: []  # Placeholder for equipopulated split
        split_categories_binedges:
          njets: [-0.5, 0.5, 1.5, 12.5]
          pt_1: []  # Placeholder for equipopulated split
        equipopulated_binning_options:
          variable_config:
            pt_2:
              min: 30.0
              max: 150.0
              rounding: 4
            pt_1:
              min: 0.0
              max: 3000.0
              rounding: 2
          n_bins:
            pt_1: 4  # Create 4 pt_1 bins inside each njets bin
          var_dependence_n_bins: [[9, 8, 7, 7], 10, 6]  # same logic as in the two-discrete-variables example

  - Two continuous variables (e.g., `deltaR_ditaupair` then `pt_1`).

        split_categories:
          deltaR_ditaupair: []
          pt_1: []
        split_categories_binedges:
          deltaR_ditaupair: []
          pt_1: []
        equipopulated_binning_options:
          variable_config:
            pt_2:
              min: 30.0
              max: 150.0
              rounding: 4
            deltaR_ditaupair:
              min: 0.0
              max: 5.0
              rounding: 3
            pt_1:
              min: 0.0
              max: 3000.0
              rounding: 2
          n_bins:
            deltaR_ditaupair: 2  # Create 2 bins for the first split
            pt_1: 4  # Create 4 bins for the second split
          var_dependence_n_bins: [[10, 10, 5, 5], [8, 8, 4, 4]]
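
The shorthand forms of `var_dependence_n_bins` used in the examples above (a single integer, a flat list, or a nested list with integers standing in for whole sub-lists) can be expanded into one explicit number per category. The following Python sketch shows one way to do this. The function name `expand_n_bins` and the `shape` argument (number of categories per split level) are assumptions for illustration and are not part of `adjust_binning.py`.

```python
def expand_n_bins(spec, shape):
    """Expand an int, list, or nested list into a full nested list.

    spec:  value of var_dependence_n_bins as written in the config.
    shape: number of categories per split level, e.g. (4, 3) for four
           tau_decaymode_2 categories with three njets categories each.
    """
    if not shape:
        # Innermost level: a single bin count is expected here.
        if not isinstance(spec, int):
            raise ValueError(f"expected an int, got {spec!r}")
        return spec
    if isinstance(spec, int):
        # One integer applies to every category below this level.
        return [expand_n_bins(spec, shape[1:]) for _ in range(shape[0])]
    if len(spec) != shape[0]:
        raise ValueError(f"expected {shape[0]} entries, got {len(spec)}")
    return [expand_n_bins(item, shape[1:]) for item in spec]


if __name__ == "__main__":
    print(expand_n_bins(7, (3,)))
    # [7, 7, 7]
    print(expand_n_bins([[9, 8, 7], [7, 7, 7], 6, 6], (4, 3)))
    # [[9, 8, 7], [7, 7, 7], [6, 6, 6], [6, 6, 6]]
```

Checking the list length against the number of categories at each level catches a common configuration mistake early, namely a bin-count list that does not match the categories it is meant to describe.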