Performs bootstrap hypothesis testing on the given dataset to find which elements have significant contributions.
The null hypothesis is that elements have, on average, no contribution, unless a reference mean is given. This can be
used both for a dataset of Shapley values (a Shapley table) and for a dataset of lesions if there are many samples for
each element, e.g., to check whether lesioning an element significantly impacted some feature of the system.
For more information, watch this brilliant tutorial:
https://www.youtube.com/watch?v=isEcgoCmlO0&t=893s
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `dataset` | `DataFrame` | The common use case is a Shapley table, but anything else works too, as long as it follows the same structure/format and you know what you're doing. | *required* |
| `p_value` | `float` | Significance threshold. Please correct for multiple comparisons first; the easiest and most conservative way is the Bonferroni correction: `p_value / n_elements`. | `0.05` |
| `bootstrap_samples` | `int` | Number of bootstrap resamples. The default of 10,000 is common practice. | `10000` |
| `reference_mean` | `Optional[int]` | Mean to test against, in case the means should not be compared with a zero-mean distribution. | `None` |
Returns:

| Type | Description |
| --- | --- |
| `DataFrame` | Distributions that are significantly different from H0. |
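As a quick illustration of the Bonferroni correction recommended for `p_value` (the element count below is hypothetical):

```python
# Bonferroni correction: divide the desired family-wise p-value by the number
# of elements (columns) being tested. n_elements here is a made-up example.
n_elements = 20
family_wise_p = 0.05
corrected_p = family_wise_p / n_elements
print(corrected_p)
```

The corrected threshold (0.0025 here) is what you would pass as `p_value`.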
Source code in msapy/utils.py
```python
# Imports as used in msapy/utils.py; `typechecked` is assumed to come from
# typeguard, and `sorter` is a helper defined elsewhere in the same module.
import warnings
from typing import Optional

import numpy as np
import pandas as pd
from tqdm import tqdm
from typeguard import typechecked


@typechecked
def bootstrap_hypothesis_testing(*,
                                 dataset: pd.DataFrame,
                                 p_value: float = 0.05,
                                 bootstrap_samples: int = 10_000,
                                 reference_mean: Optional[int] = None) -> pd.DataFrame:
    """
    Performs bootstrap hypothesis testing on the given dataset to find which elements have significant contributions.
    The null hypothesis is that elements have, on average, no contribution, unless a reference mean is given. This
    can be used both for a dataset of Shapley values (a Shapley table) and for a dataset of lesions if there are many
    samples for each element, e.g., to check whether lesioning an element significantly impacted some feature of the
    system.
    For more information, watch this brilliant tutorial:
    https://www.youtube.com/watch?v=isEcgoCmlO0&t=893s

    Args:
        dataset (pd.DataFrame):
            The common use case is a Shapley table, but anything else works too,
            as long as it follows the same structure/format and you know what you're doing.

        p_value (float):
            Significance threshold, default 0.05. **Please correct for multiple comparisons first**; the easiest
            and most conservative way is the Bonferroni correction: p_value / n_elements.

        bootstrap_samples (int):
            Number of bootstrap resamples. The default of 10_000 is common practice.

        reference_mean (Optional[int]):
            Mean to test against, in case the means should not be compared with a zero-mean distribution.

    Returns:
        pd.DataFrame: Distributions that are significantly different from H0.

    #TODO: really needs some performance optimization, probably with Numba and a change in the algorithm!
    """
    mean_adjusted_distributions = pd.DataFrame()
    bootstrapped_distributions = pd.DataFrame()
    significants = pd.DataFrame()
    percentile = (1 - p_value) * 100

    if p_value <= 0.:
        raise ValueError("A zero/negative value for p_value? What?")
    if bootstrap_samples <= 0:
        raise ValueError("A zero/negative value for bootstrap_samples? What?")
    elif bootstrap_samples < 1_000:
        warnings.warn("Bootstrap sample size is small, please go above 1000.", stacklevel=2)

    for distribution in tqdm(dataset.columns, total=len(dataset.columns), desc='Bootstrapping: '):
        if reference_mean is not None:  # adjusting the distribution to have the same mean as the reference.
            mean_adjusted_distributions[distribution] = \
                dataset[distribution] - dataset[distribution].mean() + reference_mean
        else:  # adjusting the distribution to center around zero.
            mean_adjusted_distributions[distribution] = \
                dataset[distribution] - dataset[distribution].mean()

        # resampling (with replacement) from the mean-adjusted distribution.
        values = mean_adjusted_distributions[distribution].to_numpy()
        bootstrapped_distributions[distribution] = [
            np.random.choice(values, values.size, replace=True).mean()
            for _ in range(bootstrap_samples)]

    # checking if the observed means fall outside the bootstrap percentile bounds.
    for bootstrapped_distribution in bootstrapped_distributions.columns:
        percentiles = np.percentile(bootstrapped_distributions[bootstrapped_distribution], [0, percentile])
        if not percentiles[0] <= dataset[bootstrapped_distribution].mean() <= percentiles[1]:
            significants[bootstrapped_distribution] = dataset[bootstrapped_distribution]

    significants = sorter(significants)
    return significants
```
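To see the procedure in action without installing msapy, here is a compact standalone sketch of the same steps (center each column under H0, bootstrap the mean, flag columns whose observed mean falls outside the percentile bounds), run on a hypothetical Shapley table where one element has a clearly nonzero mean. The column names, sample sizes, and `significant_columns` helper are all made up for illustration; this is not msapy's implementation.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical Shapley table: 200 samples per element. "signal" has a clearly
# nonzero mean contribution; "noise" is centered around zero.
shapley_table = pd.DataFrame({
    "signal": rng.normal(loc=2.0, scale=1.0, size=200),
    "noise": rng.normal(loc=0.0, scale=1.0, size=200),
})


def significant_columns(dataset, p_value=0.05, bootstrap_samples=10_000, seed=1):
    """Names of columns whose observed mean lies outside the bootstrap bounds."""
    rng = np.random.default_rng(seed)
    percentile = (1 - p_value) * 100
    hits = []
    for col in dataset.columns:
        values = dataset[col].to_numpy()
        centered = values - values.mean()  # impose H0: zero mean
        # bootstrap distribution of the mean, vectorized via index resampling
        idx = rng.integers(0, values.size, size=(bootstrap_samples, values.size))
        boot_means = centered[idx].mean(axis=1)
        low, high = np.percentile(boot_means, [0, percentile])
        if not low <= values.mean() <= high:
            hits.append(col)
    return hits


print(significant_columns(shapley_table))
```

With a mean of 2.0 against bootstrap means spread on the order of 1/√200, "signal" is flagged essentially every time, which matches what `bootstrap_hypothesis_testing` would report for such a column.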