Causal Inference
Statistics
- dataframe.statistics.ATEestimator(df, Y, T, B=500)[source]
Estimate the Average Treatment Effect (ATE) using a simple difference in means approach.
- Parameters:
df (DataFrame, required) – the input dataframe.
Y (str, required) – the column name of the outcome variable.
T (str, required) – the column name of the treatment variable.
B (int, optional) – the number of bootstrap samples, default is 500.
- Returns:
dict, containing the following key-value pairs:
'ATE': Average Treatment Effect.
'stddev': Standard deviation.
'p_value': p-value.
'95% confidence_interval': 95% confidence interval.
Example
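No example is given here in the source; the following minimal sketch mirrors the IPWestimator example below and assumes the same 'test_data_small' table with a numeric outcome column 'numerator' and a 0/1 'treatment' column. The difference-in-means estimate is mean(Y | T=1) − mean(Y | T=0), with the bootstrap (B resamples) supplying the stddev, p-value, and confidence interval.
import fast_causal_inference
import fast_causal_inference.dataframe.statistics as S

# assumed table and columns, following the IPWestimator example below
df = fast_causal_inference.readClickHouse('test_data_small')
Y = 'numerator'
T = 'treatment'
S.ATEestimator(df, Y, T, B=500)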
- dataframe.statistics.IPWestimator(df, Y, T, P, B=500)[source]
Estimate the Average Treatment Effect (ATE) using Inverse Probability of Treatment Weighting (IPTW).
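For reference, IPTW estimators of this kind conventionally take the Horvitz–Thompson form (a standard-textbook sketch; the library's exact weighting is not spelled out in this reference), where \(P_i\) is the propensity score column P:
\[\widehat{ATE} = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{T_i\,Y_i}{P_i} - \frac{(1-T_i)\,Y_i}{1-P_i}\right)\]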
- Parameters:
df (DataFrame, required) – the input dataframe.
Y (str, required) – the column name of the outcome variable.
T (str, required) – the column name of the treatment variable.
P (str, required) – the column name of the propensity score.
B (int, optional) – the number of bootstrap samples, default is 500.
- Returns:
dict, containing the following key-value pairs:
'ATE': Average Treatment Effect.
'stddev': Standard deviation.
'p_value': p-value.
'95% confidence_interval': 95% confidence interval.
Example
import fast_causal_inference
table = 'test_data_small'
df = fast_causal_inference.readClickHouse(table)
Y = 'numerator'
T = 'treatment'
P = 'weight'
import fast_causal_inference.dataframe.statistics as S
S.IPWestimator(df, Y, T, P, B=500)
- dataframe.statistics.boot_strap(func, sample_num, bs_num)[source]
Compute a two-sided bootstrap confidence interval of a statistic: boot_strap draws sample_num observations from the data, applies func to each resample, and repeats this bs_num times.
- Parameters:
func (str, required) – the function to apply to each resample.
sample_num (int, required) – the number of samples to draw per resample.
bs_num (int, required) – the number of bootstrap resamples.
- Returns:
list of calculated statistics.
Example
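No example is given here in the source; a minimal sketch follows, assuming func is an aggregate expression string such as 'avg(x1)' (as in the delta_method and permutation examples) and that boot_strap is invoked through the same DataFrame/agg pattern as the other functions in this module:
import fast_causal_inference
import fast_causal_inference.dataframe.statistics as S

df = fast_causal_inference.readClickHouse('test_data_small')
# bootstrap the sample mean of x1: 10000 rows per resample, 500 resamples (assumed usage)
df.boot_strap('avg(x1)', 10000, 500).show()
df.agg(S.boot_strap('avg(x1)', 10000, 500)).show()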
- dataframe.statistics.delta_method(expr=None, std=True)[source]
Compute the delta method on the given expression.
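For reference, the name refers to the standard first-order delta method: for a smooth function \(f\) of the vector of sample means \(\bar{x}\) with population mean \(\mu\),
\[\mathrm{Var}\big(f(\bar{x})\big) \approx \nabla f(\mu)^{\top}\,\Sigma\,\nabla f(\mu)\]
where \(\Sigma\) is the covariance matrix of the sample means; the std output is the square root of this variance.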
- Parameters:
expr (str, optional) – an expression of the form f(avg(x1), avg(x2), …), where f is a complex function expression and x1, x2 are column names; the columns involved must be numeric.
std (bool, optional) – Whether to return standard deviation, default is True.
- Returns:
DataFrame containing the var (std=False) or std (std=True) column computed by delta_method.
- Return type:
DataFrame
Example
import fast_causal_inference
import fast_causal_inference.dataframe.statistics as S
df = fast_causal_inference.readClickHouse('test_data_small')
df.groupBy('treatment').delta_method('avg(x1)', False).show()
df.groupBy('treatment').agg(S.delta_method('avg(x1)')).show()
This will output:
   treatment                    std
0          0  1.934587277675054E-4
1          1  1.9646284055862068E-4
   treatment                   var
0          0  0.013908944164367954
1          1  0.014016520272828797
- dataframe.statistics.kolmogorov_smirnov_test(sample_data, sample_index)[source]
This function is used to calculate the Kolmogorov-Smirnov test for goodness of fit. It returns the calculated statistic and the two-tailed p-value.
- Parameters:
sample_data (int, float or decimal, required) – the sample data column; Integer, Float or Decimal.
sample_index (int, required) – the sample index column; Integer.
- Returns:
Tuple with two elements:
calculated statistic: Float64.
calculated p-value: Float64.
Example:
import fast_causal_inference
import fast_causal_inference.dataframe.statistics as S
df = fast_causal_inference.readClickHouse('test_data_small')
>>> df.kolmogorov_smirnov_test('y', 'treatment').show()
[0.6382961593945475, 0.0]
>>> df.agg(S.kolmogorov_smirnov_test('y', 'treatment')).show()
[0.6382961593945475, 0.0]
- dataframe.statistics.mann_whitney_utest(sample_data, sample_index, alternative='two-sided', continuity_correction=1)[source]
This function is used to calculate the Mann-Whitney U test. It returns the calculated U-statistic and the two-tailed p-value.
- Parameters:
sample_data (str, required) – column name, the numerator of the metric, can use SQL expression, the column must be numeric.
sample_index (str, required) – column name, the index to represent the control group and the experimental group, 1 for the experimental group and 0 for the control group.
alternative (str, optional) – 'two-sided': the default, two-sided test; 'greater': one-tailed test in the positive direction; 'less': one-tailed test in the negative direction.
continuity_correction (bool, optional) – whether to apply continuity correction, default 1.
- Returns:
Tuple with two elements:
U-statistic: Float64.
p-value: Float64.
Example:
import fast_causal_inference
import fast_causal_inference.dataframe.statistics as S
df = fast_causal_inference.readClickHouse('test_data_small')
>>> df.mann_whitney_utest('x1', 'treatment').show()
[2380940.0, 0.0]
>>> df.agg(S.mann_whitney_utest('x1', 'treatment')).show()
[2380940.0, 0.0]
- dataframe.statistics.matrix_multiplication(*col, std=False, invert=False)[source]
Compute the matrix multiplication of the given columns.
- Parameters:
col (int, float or decimal, required) – columns to apply the function to.
std (bool, optional) – whether to return standard deviation, default False.
invert (bool, optional) – whether to invert the matrix, default False.
- Returns:
list of calculated statistics.
Example
import fast_causal_inference
import fast_causal_inference.dataframe.statistics as S
df = fast_causal_inference.readClickHouse('test_data_small')
df.matrix_multiplication('x1', 'x2', std=False, invert=False).show()
df.agg(S.matrix_multiplication('x1', 'x2', std=False, invert=False)).show()
df.agg(S.matrix_multiplication('x1', 'x2', std=True, invert=True)).show()
- dataframe.statistics.mean_z_test(sample_data, sample_index, population_variance_x, population_variance_y, confidence_level)[source]
This function is used to calculate the z-test for the mean of two independent samples of scores. It returns the calculated z-statistic and the two-tailed p-value.
- Parameters:
sample_data (str, required) – column name, the numerator of the metric; can use SQL expressions; the column must be numeric.
sample_index (str, required) – column name, the index representing the control and experimental groups: 1 for the experimental group and 0 for the control group.
population_variance_x (Float, required) – Variance for control group.
population_variance_y (Float, required) – Variance for experimental group.
confidence_level (Float, required) – Confidence level in order to calculate confidence intervals.
- Returns:
the calculated z-statistic and the two-tailed p-value.
Example
import fast_causal_inference
import fast_causal_inference.dataframe.statistics as S
df = fast_causal_inference.readClickHouse('test_data_small')
df.mean_z_test('y', 'treatment', 0.9, 0.9, 0.95).show()
df.agg(S.mean_z_test('y', 'treatment', 0.9, 0.9, 0.95)).show()
- dataframe.statistics.permutation(func, permutation_num, mde='0', mde_type='1')[source]
Perform a permutation test: apply func over permutation_num random permutations of the data.
- Parameters:
func (str, required) – the name of the function to apply, e.g. 'mannWhitneyUTest'.
permutation_num (int, required) – the number of permutations.
mde (str, optional) – the minimum detectable effect, default '0'.
mde_type (str, optional) – the MDE type, default '1'.
col (int, float or decimal, required) – the columns to apply the function to, passed as additional arguments (e.g. 'x1' in the example below).
- Returns:
list of calculated statistics.
Example
import fast_causal_inference
import fast_causal_inference.dataframe.statistics as S
df = fast_causal_inference.readClickHouse('test_data_small')
df.permutation('mannWhitneyUTest', 3, 'x1')
df.agg(S.permutation('mannWhitneyUTest', 3, 'x1')).show()
- dataframe.statistics.srm(x, groupby, ratio='[1,1]')[source]
Perform a Sample Ratio Mismatch (SRM) test.
- Parameters:
x (str, required) – column name, the numerator of the metric; can use SQL expressions; the column must be numeric. If you care whether the sum of x1 meets expectations, fill in x1 and it will calculate sum(x1); if you care whether the sample size meets expectations, fill in 1 and it will calculate sum(1).
groupby (str, required) – column name, representing the field for aggregation grouping, can support Integer/String.
ratio (str, optional) – the expected traffic ratio, default '[1,1]'; fill in according to the order of the groupby field, and each value must be > 0. For example, [1,1,2] means the expected ratio is 1:1:2.
- Returns:
DataFrame contains the following columns:
groupname: the name of the group.
f_obs: the observed traffic.
ratio: the expected traffic ratio.
chisquare: the calculated chi-square.
p-value: the calculated p-value.
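The underlying test is the standard chi-square goodness-of-fit test against the expected split (a sketch of the conventional formulation): with observed traffic \(f_{obs,g}\) per group and expected traffic \(f_{exp,g}\) proportional to the supplied ratio,
\[\chi^2 = \sum_{g}\frac{(f_{obs,g} - f_{exp,g})^2}{f_{exp,g}}, \qquad f_{exp,g} = \frac{ratio_g}{\sum_{k} ratio_k}\sum_{g'} f_{obs,g'}\]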
Example:
import fast_causal_inference.dataframe.statistics as S
>>> df.srm('x1', 'treatment', '[1,2]').show()
groupname         f_obs     ratio     chisquare   p-value
0          23058.627723  1.000000  48571.698643  0.000000
1            1.0054e+05  1.000000
>>> df.agg(S.srm('x1', 'treatment', '[1,2]')).show()
groupname         f_obs     ratio     chisquare   p-value
0          23058.627723  1.000000  48571.698643  0.000000
1            1.0054e+05  1.000000
- dataframe.statistics.student_ttest(sample_data, sample_index)[source]
This function is used to calculate the Student's t-test for the means of two independent samples of scores. It returns the calculated t-statistic and the two-tailed p-value.
- Parameters:
sample_data (str, required) – column name, the numerator of the metric; can use SQL expressions; the column must be numeric.
sample_index (str, required) – column name, the index representing the control and experimental groups: 1 for the experimental group and 0 for the control group.
- Returns:
Tuple with two elements:
calculated statistic: Float64.
calculated p-value: Float64.
Example
import fast_causal_inference
import fast_causal_inference.dataframe.statistics as S
df = fast_causal_inference.readClickHouse('test_data_small')
>>> df.student_ttest('y', 'treatment').show()
[-72.8602591880598, 0.0]
>>> df.agg(S.student_ttest('y', 'treatment')).show()
[-72.8602591880598, 0.0]
- dataframe.statistics.ttest_1samp(Y, alternative='two-sided', mu=0, X='')[source]
This function is used to calculate the t-test for the mean of one group of scores. It returns the calculated t-statistic and the two-tailed p-value.
- Parameters:
Y (str, required) – an expression of the form f(avg(x1), avg(x2), …), where f is a complex function expression and x1, x2 are column names; the columns involved must be numeric.
alternative (str, optional) – str, use ‘two-sided’ for two-tailed test, ‘greater’ for one-tailed test in the positive direction, and ‘less’ for one-tailed test in the negative direction.
mu (float, optional) – the mean of the null hypothesis.
X (str, optional) – an expression used as continuous covariates for CUPED variance reduction. It follows the regression approach and can take forms like 'avg(x1)/avg(x2)', 'avg(x3)', or 'avg(x1)/avg(x2)+avg(x3)'.
- Returns:
DataFrame contains the following columns:
estimate: the mean value of the statistic to be tested.
stderr: the standard error of the statistic to be tested.
t-statistic: the calculated t-statistic.
p-value: the calculated p-value.
lower: the lower bound of the confidence interval.
upper: the upper bound of the confidence interval.
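When X is supplied, CUPED adjusts the metric with the pre-experiment covariate in the usual way (a sketch of the standard formulation; the library fits the coefficient by regression per the X parameter above):
\[Y_{cuped} = Y - \theta\,(X - \bar{X}), \qquad \theta = \frac{\mathrm{Cov}(X, Y)}{\mathrm{Var}(X)}\]
which leaves the estimate's mean unchanged while shrinking its variance by a factor of \(1-\rho^2\), with \(\rho\) the correlation between X and Y.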
Example:
import fast_causal_inference.dataframe.statistics as S
import fast_causal_inference
df = fast_causal_inference.readClickHouse('test_data_small')
>>> df.groupBy('x_cat1').ttest_1samp('avg(numerator)/avg(denominator)', alternative='two-sided', mu=0).show()
>>> df.groupBy('x_cat1').agg(S.ttest_1samp('avg(numerator)', alternative='two-sided', mu=0, X='avg(numerator_pre)/avg(denominator_pre)')).show()
  x_cat1   estimate    stderr  t-statistic   p-value      lower      upper
0      B   1.455223  0.041401    35.149887  0.000000   1.374029   1.536417
1      E   1.753613  0.042083    41.670491  0.000000   1.671082   1.836143
2      D   1.752348  0.043173    40.589377  0.000000   1.667680   1.837016
3      C   1.804776  0.046642    38.694122  0.000000   1.713303   1.896249
4      A   2.108937  0.042558    49.554601  0.000000   2.025477   2.192398
  x_cat1   estimate    stderr  t-statistic   p-value      lower      upper
0      B  10.220695  0.261317    39.112304  0.000000   9.708205  10.733185
1      E  12.407975  0.267176    46.441156  0.000000  11.884002  12.931947
2      D  11.924641  0.258935    46.052716  0.000000  11.416831  12.432451
3      C  12.274732  0.281095    43.667495  0.000000  11.723457  12.826006
4      A  14.824860  0.241133    61.480129  0.000000  14.351972  15.297748
- dataframe.statistics.ttest_2samp(Y, index, alternative='two-sided', X='', pse='')[source]
This function is used to calculate the t-test for the means of two independent samples of scores. It returns the calculated t-statistic and the two-tailed p-value.
- Parameters:
Y (str, required) – an expression of the form f(avg(x1), avg(x2), …), where f is a complex function expression and x1, x2 are column names; the columns involved must be numeric.
index (str, required) – str, the treatment variable.
alternative (str, optional) – str, use ‘two-sided’ for two-tailed test, ‘greater’ for one-tailed test in the positive direction, and ‘less’ for one-tailed test in the negative direction.
X (str, optional) – an expression used as continuous covariates for CUPED variance reduction. It follows the regression approach and can take forms like 'avg(x1)/avg(x2)', 'avg(x3)', or 'avg(x1)/avg(x2)+avg(x3)'.
pse (str, optional) – an expression used as discrete covariates for post-stratification variance reduction. It groups by the covariate, calculates variances separately, and then weights them. It can be any complex function form, such as 'x_cat1'.
- Returns:
DataFrame contains the following columns:
estimate: the mean value of the statistic to be tested.
stderr: the standard error of the statistic to be tested.
t-statistic: the calculated t-statistic.
p-value: the calculated p-value.
lower: the lower bound of the confidence interval.
upper: the upper bound of the confidence interval.
Example:
import fast_causal_inference.dataframe.statistics as S
import fast_causal_inference
df = fast_causal_inference.readClickHouse('test_data_small')
>>> df.agg(S.ttest_2samp('avg(numerator)/avg(denominator)', 'treatment', alternative='two-sided', pse='x_cat1')).show()
>>> df.agg(S.ttest_2samp('avg(numerator)/avg(denominator)', 'treatment', alternative='two-sided', X='avg(numerator_pre)/avg(denominator_pre)')).show()
>>> df.groupBy('x_cat1').ttest_2samp('avg(numerator)', 'treatment', alternative='two-sided', X='avg(numerator_pre)').show()
>>> df.groupBy('x_cat1').agg(S.ttest_2samp('avg(numerator)/avg(denominator)', 'treatment', alternative='two-sided', X='avg(numerator_pre)/avg(denominator_pre)')).show()
      mean0     mean1  estimate    stderr  t-statistic   p-value     lower
0  0.791139  2.487152  1.696013  0.032986    51.416725  0.000000  1.631355
      upper
0  1.760672
      mean0     mean1  estimate    stderr  t-statistic   p-value     lower
0  0.793732  2.486118  1.692386  0.026685    63.419925  0.000000  1.640077
      upper
0  1.744694
  x_cat1     mean0      mean1   estimate    stderr  t-statistic   p-value
0      B  2.481226  17.787127  15.305901  0.365716    41.851896  0.000000
1      E  4.324137  19.437071  15.112935  0.370127    40.831785  0.000000
2      D  4.582961  19.156961  14.574000  0.373465    39.023766  0.000000
3      C  4.579375  19.816027  15.236652  0.419183    36.348422  0.000000
4      A  7.518409  22.195092  14.676682  0.342147    42.895825  0.000000
       lower      upper
0  14.588665  16.023138
1  14.387062  15.838808
2  13.841579  15.306421
3  14.414564  16.058739
4  14.005694  15.347671
  x_cat1     mean0     mean1  estimate    stderr  t-statistic   p-value
0      B  0.409006  2.202847  1.793841  0.053917    33.270683  0.000000
1      E  0.714211  2.435665  1.721455  0.056144    30.661265  0.000000
2      D  0.781435  2.455767  1.674332  0.058940    28.407344  0.000000
3      C  0.778977  2.562364  1.783388  0.065652    27.164280  0.000000
4      A  1.242126  2.766098  1.523972  0.060686    25.112311  0.000000
      lower     upper
0  1.688101  1.899581
1  1.611348  1.831562
2  1.558742  1.789923
3  1.654633  1.912142
4  1.404959  1.642984
- dataframe.statistics.welch_ttest(sample_data, sample_index)[source]
This function is used to calculate Welch's t-test for the means of two independent samples of scores. It returns the calculated t-statistic and the two-tailed p-value.
- Parameters:
sample_data (str, required) – column name, the numerator of the metric; can use SQL expressions; the column must be numeric.
sample_index (str, required) – column name, the index representing the control and experimental groups: 1 for the experimental group and 0 for the control group.
- Returns:
Tuple with two elements:
calculated statistic: Float64.
calculated p-value: Float64.
Example
import fast_causal_inference
import fast_causal_inference.dataframe.statistics as S
df = fast_causal_inference.readClickHouse('test_data_small')
>>> df.welch_ttest('y', 'treatment').show()
[-73.78492246858345, 0.0]
>>> df.agg(S.welch_ttest('y', 'treatment')).show()
[-73.78492246858345, 0.0]
- dataframe.statistics.xexpt_ttest_2samp(numerator, denominator, index, uin, metric_type='avg', group_buckets='[1,1]', alpha=0.05, MDE=0.005, power=0.8, X='')[source]
This function is used to calculate the t-test for the means of two independent samples of scores. It returns the calculated t-statistic and the two-tailed p-value.
- Parameters:
numerator (str, required) – column name, the numerator of the metric; can use SQL expressions; the column must be numeric.
denominator (str, required) – column name, the denominator of the metric; can use SQL expressions; the column must be numeric.
index (str, required) – column name, used to represent the control group and the experimental group.
uin (str, required) – column name, used to bucket samples; can use SQL expressions; int64 type.
metric_type (str, optional) – 'avg' (default): test a mean metric of the form avg(num)/avg(demo); 'sum': test a sum metric, in which case the denominator can be omitted or set to 1, otherwise the user is prompted.
group_buckets (list, optional) – the number of traffic buckets for each group; only effective when metric_type='sum'. The default is [1,1]; the number of elements must equal the number of groups, and only the ratio between them matters.
alpha (float, optional) – numeric, significance level, default 0.05.
MDE (float, optional) – numeric, minimum test difference, default 0.005.
power (float, optional) – numeric, statistical power, default 0.8.
X (str, optional) – an expression used as continuous covariates for CUPED variance reduction. It follows the regression approach and can take forms like 'avg(x1)/avg(x2)', 'avg(x3)', or 'avg(x1)/avg(x2)+avg(x3)'.
- Returns:
DataFrame contains the following columns:
groupname: the name of the group.
numerator: the mean of the numerator.
denominator: the mean of the denominator (only when metric_type=avg).
numerator_pre: the mean of the numerator before the experiment (only when metric_type=avg).
denominator_pre: the mean of the denominator before the experiment (only when metric_type=avg).
mean: the mean of the metric (only when metric_type=avg).
std_samp: the standard deviation of the metric (only when metric_type=avg).
ratio: group_buckets (only when metric_type=sum).
diff_relative: the relative difference between the two groups.
95%_relative_CI: the 95% confidence interval of the relative difference.
p-value: the calculated p-value.
t-statistic: the calculated t-statistic.
power: the calculated power.
recommend_samples: the recommended sample size.
Example:
import fast_causal_inference
import fast_causal_inference.dataframe.statistics as S
df = fast_causal_inference.readClickHouse('test_data_small')
>>> df.xexpt_ttest_2samp('numerator', 'denominator', 'treatment', uin='rand()', metric_type='sum', group_buckets=[1,1]).show()
groupname      numerator  ratio
0           23058.627723      1
1          100540.303112      1
diff_relative            95%_relative_CI   p-value  t-statistic     power  recommend_samples
  336.020323%  [320.514511%,351.526135%]  0.000000    42.478747  0.050458           24404575
>>> df.xexpt_ttest_2samp('numerator', 'denominator', 'treatment', uin='rand()', metric_type='sum', group_buckets=[1,1], X='avg(numerator_pre)/avg(denominator_pre)').show()
groupname      numerator  ratio  numerator_pre
0           23058.627723      1   21903.112431
1          100540.303112      1   23096.875608
diff_relative            95%_relative_CI   p-value  t-statistic     power  recommend_samples
  310.412514%  [299.416469%,321.408558%]  0.000000    55.335445  0.050911           12696830
>>> df.xexpt_ttest_2samp('numerator', 'denominator', 'treatment', uin='rand()', metric_type='avg', X='avg(numerator_pre)/avg(denominator_pre)').show()
groupname      numerator    denominator  numerator_pre  denominator_pre      mean  std_samp
0           23058.627723   29023.233157   21903.112431     29131.831739  0.793678  1.253257
1          100540.303112   40452.337656   23096.875608     30776.559777  2.486168  5.123161
diff_relative            95%_relative_CI   p-value  t-statistic      diff               95%_CI     power  recommend_samples
  213.246344%  [206.698202%,219.794486%]  0.000000    63.835777  1.692490  [1.640519,1.744461]  0.052570           14172490
>>> df.agg(S.xexpt_ttest_2samp('numerator', 'denominator', 'treatment', uin='rand()', metric_type='avg', alpha=0.05, MDE=0.005, power=0.8, X='avg(numerator_pre)+avg(x1)')).show()
groupname      numerator    denominator  numerator_pre  denominator_pre      mean  std_samp
0           23058.627723   29023.233157   21903.112431       -62.102593  1.057338  2.341991
1          100540.303112   40452.337656   23096.875608      -122.234609  2.732950  5.918014
diff_relative            95%_relative_CI   p-value  t-statistic      diff               95%_CI     power  recommend_samples
  158.474659%  [152.453710%,164.495607%]  0.000000    51.593567  1.675612  [1.611950,1.739274]  0.053041           11982290
Regression
- class dataframe.regression.DID[source]
Parameters
- Y:
Column name, refers to the outcome of interest, a numerical variable.
- treatment:
Column name, a Boolean variable, can only take values 0 or 1, where 1 represents the experimental group.
- time:
Column name, a Boolean variable, represents the time factor. time = 0 represents before the strategy takes effect, time = 1 represents after the strategy takes effect.
- (Optional parameter) X:
Pre-experiment covariates that can be used to reduce variance. Written in the form ['x1', 'x2', 'x3']; they must be numerical variables.
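The fitted model is the standard interaction regression visible in the summary output of the example below:
\[Y = \beta_0 + \beta_1\,treatment + \beta_2\,time + \beta_3\,(treatment \times time) + \gamma^{\top} X + \varepsilon\]
where \(\beta_3\) (the treatment*t_ob coefficient in the output) is the difference-in-differences estimate of the treatment effect.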
Example
import fast_causal_inference.dataframe.regression as Regression
model = Regression.DID()
model.fit(df=df, Y='y', treatment='treatment', time='t_ob', X=['x1', 'x2'])
model.summary()
# Call:
# lm( formula = y ~ treatment + t_ob + treatment*t_ob + x1 + x2 )
# Coefficients:
# .               Estimate   Std. Error  t value    Pr(>|t|)
# (Intercept)     4.461905   0.213302    20.918288  0.000000
# treatment       13.902920  0.291365    47.716586  0.000000
# t_ob            0.416831   0.280176    1.487748   0.136849
# treatment*t_ob  1.812698   0.376476    4.814905   0.000001
# x1              1.769065   0.100727    17.562939  0.000000
# x2              2.020569   0.047162    42.842817  0.000000
# Residual standard error: 9.222100 on 9994 degrees of freedom
# Multiple R-squared: 0.478329, Adjusted R-squared: 0.478068
# F-statistic: 1832.730042 on 5 and 9994 DF, p-value: 0.000000

# other ways
import fast_causal_inference.dataframe.regression as Regression
df.did('y', 'treatment', 't_ob', ['x1', 'x2', 'x3']).show()
df.agg(Regression.did('y', 'treatment', 't_ob', ['x1', 'x2', 'x3'])).show()
- class dataframe.regression.IV[source]
Instrumental Variable (IV) estimator class. Instrumental variables (IV) is a method used in statistics, econometrics, epidemiology, and related disciplines to estimate causal relationships when controlled experiments are not feasible or when a treatment is not successfully delivered to every unit in a randomized experiment. The idea behind IV is to use a variable, known as an instrument, that is correlated with the endogenous explanatory variables (the variables that are correlated with the error term), but uncorrelated with the error term itself. This allows us to isolate the variation in the explanatory variable that is purely due to the instrument and thus uncorrelated with the error term, which can then be used to estimate the causal effect of the explanatory variable on the dependent variable.
Here is an example:
\[t_{ob} = treatment + X_1 + X_2\]
\[Y = \hat{t}_{ob} + X_1 + X_2\]
\(X_1\) and \(X_2\) are independent variables or predictors.
\(t_{ob}\) is the dependent variable that you are trying to explain or predict.
\(treatment\) is an independent variable representing some intervention or condition that you believe affects \(t_{ob}\).
\(Y\) is the dependent variable that you are trying to explain or predict.
\(\hat{t}_{ob}\) is the predicted value of \(t_{ob}\) from the first equation.
We first regress \(t_{ob}\) on the treatment and the other exogenous variables \(X_1\) and \(X_2\) to get the predicted values \(\hat{t}_{ob}\). Then, we replace \(t_{ob}\) with \(\hat{t}_{ob}\) in the second equation and estimate the parameters. This gives us the causal effect of \(t_{ob}\) on \(Y\), purged of the endogeneity problem.
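In matrix form, with \(Z\) the matrix of instruments plus exogenous covariates and \(X\) the matrix of regressors, this two-stage procedure reduces to the standard 2SLS estimator (textbook form, stated here for reference):
\[\hat{\beta}_{2SLS} = \big(X^{\top} P_Z X\big)^{-1} X^{\top} P_Z Y, \qquad P_Z = Z\big(Z^{\top} Z\big)^{-1} Z^{\top}\]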
- Methods:
fit: Fits the model with the given formula.
summary: Displays the summary of the model fit.
Example
import fast_causal_inference.dataframe.regression as Regression
model = Regression.IV()
model.fit(df, formula='y~(t_ob~treatment)+x1+x2')
model.summary()
df.iv_regression('y~(t_ob~treatment)+x1+x2').show()
df.agg(Regression.iv_regression('y~(t_ob~treatment)+x1+x2')).show()
- class dataframe.regression.Logistic(tol=1e-12, iter=500)[source]
This class implements a Logistic Regression model.
Parameters
- tol : float
The tolerance for stopping criteria.
- iter : int
The maximum number of iterations.
Example
import fast_causal_inference
from fast_causal_inference.dataframe.regression import Logistic

table = 'test_data_small'
df = fast_causal_inference.readClickHouse(table)
X = ['x1', 'x2', 'x3', 'x4', 'x5', 'x_long_tail1', 'x_long_tail2']
Y = 't_ob'
logit = Logistic(tol=1e-6, iter=500)
logit.fit(Y, X, df)
logit.summary()
# Output:
#               x      beta
# 0     intercept  0.083472
# 1            x1  0.957999
# 2            x2  0.217600
# 3            x3  0.534323
# 4            x4 -0.006258
# 5            x5 -0.020528
# 6  x_long_tail1 -0.036267
# 7  x_long_tail2  0.000232

# predict
df_predict = logit.predict(df)
df_predict.select('prob').show()
#                     prob
# 0      0.549151214991665
# 1     0.8876947633647565
# 2    0.10790926234343089
# 3      0.791206731095578
# 4     0.7341882818925854
# ..                   ...
# 195  0.21966953201618872
# 196   0.5813872122369445
# 197   0.5766490178541132
# 198   0.5210472623083635
# 199  0.35841097345616885

logit.get_auc(df=df_predict, Y=Y, prob_col="prob")
# 0.7587271750805586
- class dataframe.regression.Ols(use_bias=True)[source]
This function is for an Ordinary Least Squares (OLS) model calculated using Stochastic Gradient Descent. The fit method is used to train the model using a specified regression formula and dataset. The effect method is used to make predictions based on the trained model, the regression formula, and a new dataset. The predicted results are stored in a column with a specified name in the DataFrame.
- Parameters:
use_bias : bool, default=True, whether to use an intercept
- Methods:
- fit(expr, df): Train the model
expr : str, regression formula
df : DataFrame, dataset
- effect(expr, df, effect_name): Predict
expr : str, regression formula
df : DataFrame, dataset
effect_name : str, column name for the prediction result, default is 'effect'
summary(): Display the summary of the model
Example
import fast_causal_inference.dataframe.regression as Regression
model = Regression.Ols(False)
model.fit('y~x1+x2+x3', df)
effect_df = model.effect('x1+x2+x3', df)
effect_df.show()
- class dataframe.regression.StochasticLinearRegression(learning_rate=1e-05, l1=0.1, batch_size=15, method='SGD')[source]
This function is for a Stochastic Linear Regression model. The fit method is used to train the model using a specified regression formula and a dataset. The effect method is used to make predictions based on the trained model, the regression formula, and a new dataset. The predicted results are stored in a column with a specified name in the DataFrame. The learning_rate, l1, batch_size, and method parameters are used to control the learning rate, L1 regularization coefficient, batch size, and optimization method respectively.
- Parameters:
learning_rate : float, default=0.00001, learning rate
l1 : float, default=0.1, L1 regularization coefficient
batch_size : int, default=15, batch size
method : str, default='SGD', optimization method
- Methods:
- fit(expr, df): Train the model
expr : str, regression formula
df : DataFrame, dataset
- effect(expr, df, effect_name): Predict
expr : str, regression formula
df : DataFrame, dataset
effect_name : str, column name for the prediction result, default is 'effect'
Example
import fast_causal_inference.dataframe.regression as Regression
model = Regression.StochasticLinearRegression(learning_rate=0.00001, l1=0.1, batch_size=15, method='SGD')
model.fit('y~x1+x2+x3', df)
effect_df = model.effect('x1+x2+x3', df)
effect_df.show()
- class dataframe.regression.StochasticLogisticRegression(learning_rate=1e-05, l1=0.1, batch_size=15, method='SGD')[source]
This function is for a Stochastic Logistic Regression model. The fit method is used to train the model using a specified regression formula and a dataset. The effect method is used to make predictions based on the trained model, the regression formula, and a new dataset. The predicted results are stored in a column with a specified name in the DataFrame. The learning_rate, l1, batch_size, and method parameters are used to control the learning rate, L1 regularization coefficient, batch size, and optimization method respectively.
- Parameters:
learning_rate : float, default=0.00001, learning rate
l1 : float, default=0.1, L1 regularization coefficient
batch_size : int, default=15, batch size
method : str, default='SGD', optimization method
- Methods:
- fit(expr, df): Train the model
expr : str, regression formula
df : DataFrame, dataset
- effect(expr, df, effect_name): Predict
expr : str, regression formula
df : DataFrame, dataset
effect_name : str, column name for the prediction result, default is 'effect'
Example
import fast_causal_inference.dataframe.regression as Regression
model = Regression.StochasticLogisticRegression(learning_rate=0.00001, l1=0.1, batch_size=15, method='SGD')
model.fit('y~x1+x2+x3', df)
effect_df = model.effect('x1+x2+x3', df)
effect_df.show()
- class dataframe.regression.Wls(weight='1', use_bias=True)[source]
This function is for a Weighted Least Squares (WLS) model. The fit method is used to train the model using a specified regression formula and dataset. The effect method is used to make predictions based on the trained model, the regression formula, and a new dataset. The predicted results are stored in a column with a specified name in the DataFrame. The weight parameter specifies the column name for weights in the DataFrame.
- Parameters:
weight : str, column name for weights
use_bias : bool, default=True, whether to use an intercept
- Methods:
- fit(expr, df): Train the model
expr : str, regression formula
df : DataFrame, dataset
- effect(expr, df, effect_name): Predict
expr : str, regression formula
df : DataFrame, dataset
effect_name : str, column name for the prediction result, default is 'effect'
summary(): Display the summary of the model
Example
import fast_causal_inference.dataframe.regression as Regression
model = Regression.Wls(weight='1', use_bias=False)
model.fit('y~x1+x2+x3', df)
effect_df = model.effect('x1+x2+x3', df)
effect_df.show()
Uplift
- class dataframe.uplift.CausalForest(depth=7, min_node_size=-1, mtry=3, num_trees=10, sample_fraction=0.7, weight_index='', honesty=False, honesty_fraction=0.5, quantile_num=50)[source]
This class implements the Causal Forest method for causal inference.
Parameters
- depth : int, default=7
The maximum depth of the tree.
- min_node_size : int, default=-1
The minimum node size.
- mtry : int, default=3
The number of variables randomly sampled as candidates at each split.
- num_trees : int, default=10
The number of trees to grow in the forest.
- sample_fraction : float, default=0.7
The fraction of observations to consider when fitting the forest.
- weight_index : str, default=''
The weight index.
- honesty : bool, default=False
Whether to use honesty when fitting the forest.
- honesty_fraction : float, default=0.5
The fraction of observations used to determine splits if honesty is used.
- quantile_num : int, default=50
The number of quantiles.
Methods
- fit(Y, T, X, df):
Fit the Causal Forest model to the input data.
- effect(df=None, X=[]):
Estimate the causal effect using the fitted model.
Example
import fast_causal_inference
from fast_causal_inference.dataframe.uplift import *

Y = 'y'
T = 'treatment'
table = 'test_data_small'
X = ['x1', 'x2', 'x3', 'x4', 'x5', 'x_long_tail1', 'x_long_tail2']
df = fast_causal_inference.readClickHouse(table)
df_train, df_test = df.split(0.5)

from fast_causal_inference.dataframe.uplift import CausalForest
model = CausalForest(depth=7, min_node_size=-1, mtry=3, num_trees=10, sample_fraction=0.7)
model.fit(Y, T, X, df_train)
- effect(df=None, X=[])[source]
Estimate the causal effect using the fitted model.
Parameters
- df : DataFrame, default=None
The input dataframe for which to estimate the causal effect. If None, use the dataframe from the fit method.
- X : list, default=[]
The covariates to use when estimating the causal effect, e.g. ['x1', 'x2', 'x3', 'x4', 'x5', 'x_long_tail1', 'x_long_tail2'].
Returns
- DataFrame
The output dataframe with the estimated causal effect.
Example
df_test_effect_cf = model.effect(df=df_test, X=['x1', 'x2', 'x3', 'x4', 'x5', 'x_long_tail1', 'x_long_tail2'])
df_train_effect_cf = model.effect(df=df_train, X=['x1', 'x2', 'x3', 'x4', 'x5', 'x_long_tail1', 'x_long_tail2'])
lift_train = get_lift_gain("effect", Y, T, df_train_effect_cf, discrete_treatment=True, K=100)
lift_test = get_lift_gain("effect", Y, T, df_test_effect_cf, discrete_treatment=True, K=100)
print(lift_train, lift_test)
hte_plot([lift_train, lift_test], labels=['train', 'test'])
- fit(Y, T, X, df)[source]
Fit the Causal Forest model to the input data.
Parameters
- Y : str
The outcome variable.
- T : str
The treatment variable.
- X : list
The numeric covariates; string columns are not supported. E.g. ['x1', 'x2', 'x3', 'x4', 'x5', 'x_long_tail1', 'x_long_tail2'].
- df : DataFrame
The input dataframe.
Returns
None
- class dataframe.uplift.CausalTree(depth=3, min_sample_ratio_leaf=0.001, bin_num=10)[source]
This class implements a Causal Tree for uplift/HTE analysis.
Parameters
- depth : int, default=3
The maximum depth of the tree.
- min_sample_ratio_leaf : float, default=0.001
The minimum sample ratio for a leaf.
- bin_num : int, default=10
The number of bins used to discretize each column in need_cut_X.
Example
import fast_causal_inference
from fast_causal_inference.dataframe.uplift import *

Y = 'y'
T = 'treatment'
table = 'test_data_small'
X = ['x1', 'x2', 'x3', 'x4', 'x5', 'x_long_tail1', 'x_long_tail2']
needcut_X = ['x1', 'x2', 'x3', 'x4', 'x5', 'x_long_tail1', 'x_long_tail2']
df = fast_causal_inference.readClickHouse(table)
df_train, df_test = df.split(0.5)

hte = CausalTree(depth=3, min_sample_ratio_leaf=0.001)
hte.fit(Y, T, X, needcut_X, df_train)
treeplot = hte.treeplot()  # causal tree plot
treeplot.render('digraph.gv', view=False)  # the full tree can be viewed and downloaded from digraph.gv.pdf

print(hte.feature_importance)
# Output:
#                featName    importance
# 1            x2_buckets  1.015128e+06
# 0            x1_buckets  2.181346e+05
# 3            x4_buckets  1.023273e+05
# 5  x_long_tail1_buckets  5.677131e+04
# 2            x3_buckets  2.537835e+04
# 6  x_long_tail2_buckets  2.536951e+04
# 4            x5_buckets  7.259992e+03

df_train_pred = hte.effect(df=df_train, keep_col='*')
df_test_pred = hte.effect(df=df_test, keep_col='*')
lift_train = get_lift_gain("effect", Y, T, df_train_pred, discrete_treatment=True, K=100)
lift_test = get_lift_gain("effect", Y, T, df_test_pred, discrete_treatment=True, K=100)
print(lift_train, lift_test)
hte_plot([lift_train, lift_test], labels=['train', 'test'])
# auuc: 0.6624369283393814
# auuc: 0.6532554148698826
#        ratio      lift      gain  ate  ramdom_gain
# 0   0.009990  2.164241  0.021621  1.0     0.009990
# 1   0.019980  2.131245  0.042582  1.0     0.019980
# 2   0.029970  2.056440  0.061632  1.0     0.029970
# 3   0.039960  2.177768  0.087024  1.0     0.039960
# 4   0.049950  2.175329  0.108658  1.0     0.049950
# ..       ...       ...       ...  ...          ...
# 95  0.959241  1.015223  0.973843  1.0     0.959241
# 96  0.969431  1.010023  0.979147  1.0     0.969431
# 97  0.979620  1.006843  0.986324  1.0     0.979620
# 98  0.989810  1.003508  0.993283  1.0     0.989810
# 99  1.000000  1.000000  1.000000  1.0     1.000000
# [100 rows x 5 columns]
#        ratio      lift      gain  ate  ramdom_gain
# 0   0.009810  1.948220  0.019112  1.0     0.009810
# 1   0.019620  2.221654  0.043588  1.0     0.019620
# 2   0.029429  2.419752  0.071212  1.0     0.029429
# 3   0.039239  2.288460  0.089797  1.0     0.039239
# 4   0.049049  2.343432  0.114943  1.0     0.049049
# ..       ...       ...       ...  ...          ...
# 95  0.959960  1.014897  0.974260  1.0     0.959960
# 96  0.969970  1.011624  0.981245  1.0     0.969970
# 97  0.979980  1.009358  0.989150  1.0     0.979980
# 98  0.989990  1.006340  0.996267  1.0     0.989990
# 99  1.000000  1.000000  1.000000  1.0     1.000000
# [100 rows x 5 columns]
- dataframe.uplift.get_lift_gain(ITE, Y, T, df, normalize=True, K=1000, discrete_treatment=True)[source]
Calculate the uplift & gain.
Parameters
- ITE : str
The Individual Treatment Effect column.
- Y : str
The outcome variable column.
- T : str
The treatment variable column.
- df : DataFrame
The input data.
- normalize : bool, optional
Whether to normalize the result, default is True.
- K : int, optional
The number of bins for discretization, default is 1000.
- discrete_treatment : bool, optional
Whether the treatment is discrete, default is True.
Returns
- LiftGainCurveResult
An object containing the result of the uplift & gain calculation.
Example
import fast_causal_inference
from fast_causal_inference.dataframe.uplift import *

Y = 'y'
T = 'treatment'
table = 'test_data_small'
X = 'x1+x2+x3+x4+x5+x_long_tail1+x_long_tail2'
needcut_X = 'x1+x2+x3+x4+x5+x_long_tail1+x_long_tail2'
df = fast_causal_inference.readClickHouse(table)
df_train, df_test = df.split(0.5)

hte = CausalTree(depth=3, min_sample_ratio_leaf=0.001)
hte.fit(Y, T, X, needcut_X, df_train)
df_train_pred = hte.effect(df=df_train, keep_col='*')
df_test_pred = hte.effect(df=df_test, keep_col='*')
lift_train = get_lift_gain("effect", Y, T, df_train_pred, discrete_treatment=True, K=100)
lift_test = get_lift_gain("effect", Y, T, df_test_pred, discrete_treatment=True, K=100)
print(lift_train, lift_test)
hte_plot([lift_train, lift_test], labels=['train', 'test'])
# auuc: 0.6624369283393814
# auuc: 0.6532554148698826
#        ratio      lift      gain  ate  ramdom_gain
# 0   0.009990  2.164241  0.021621  1.0     0.009990
# 1   0.019980  2.131245  0.042582  1.0     0.019980
# 2   0.029970  2.056440  0.061632  1.0     0.029970
# 3   0.039960  2.177768  0.087024  1.0     0.039960
# 4   0.049950  2.175329  0.108658  1.0     0.049950
# ..       ...       ...       ...  ...          ...
# 95  0.959241  1.015223  0.973843  1.0     0.959241
# 96  0.969431  1.010023  0.979147  1.0     0.969431
# 97  0.979620  1.006843  0.986324  1.0     0.979620
# 98  0.989810  1.003508  0.993283  1.0     0.989810
# 99  1.000000  1.000000  1.000000  1.0     1.000000
# [100 rows x 5 columns]
#        ratio      lift      gain  ate  ramdom_gain
# 0   0.009810  1.948220  0.019112  1.0     0.009810
# 1   0.019620  2.221654  0.043588  1.0     0.019620
# 2   0.029429  2.419752  0.071212  1.0     0.029429
# 3   0.039239  2.288460  0.089797  1.0     0.039239
# 4   0.049049  2.343432  0.114943  1.0     0.049049
# ..       ...       ...       ...  ...          ...
# 95  0.959960  1.014897  0.974260  1.0     0.959960
# 96  0.969970  1.011624  0.981245  1.0     0.969970
# 97  0.979980  1.009358  0.989150  1.0     0.979980
# 98  0.989990  1.006340  0.996267  1.0     0.989990
# 99  1.000000  1.000000  1.000000  1.0     1.000000
# [100 rows x 5 columns]
Match
- class dataframe.match.CaliperMatching(caliper=0.2)[source]
This class implements the Caliper Matching method for causal inference.
Parameters
- caliper : float, default=0.2
The caliper width for matching, in units of the standard deviation of the logit of the propensity score.
Methods
- fit(dataframe, treatment, score, exacts=[], alias='matching_index'):
Apply the Caliper Matching method to the input dataframe.
Example
import fast_causal_inference
import fast_causal_inference.dataframe.match as Match

df = fast_causal_inference.readClickHouse('test_data_small')
model = Match.CaliperMatching(0.5)
tmp = model.fit(df, treatment='treatment', score='weight', exacts=['x_cat1'])
match_df = tmp.filter("matching_index!=0")  # filter out the unmatched records
>>> print('sample size Before match: ')
>>> df.count().show()
>>> print('sample size After match: ')
>>> match_df.count().show()
sample size Before match:
10000
sample size After match:
9652
>>> import fast_causal_inference.dataframe.match as Match
>>> d1 = Match.smd(df, 'treatment', ['x1','x2'])
>>> print(d1)
     Control  Treatment       SMD
x1 -0.012658  -0.023996 -0.011482
x2  0.005631   0.037718  0.016156
>>> import fast_causal_inference.dataframe.match as Match
>>> d2 = Match.smd(match_df, 'treatment', ['x1','x2'])
>>> print(d2)
     Control  Treatment       SMD
x1 -0.015521  -0.025225 -0.009821
x2  0.004834   0.039698  0.017551
>>> Match.matching_plot(df_score, 'treatment', 'prob')
>>> Match.matching_plot(match_df, 'treatment', 'prob')
- fit(dataframe, treatment, score, exacts=[], alias='matching_index')[source]
Apply the Caliper Matching method to the input dataframe.
Parameters
- dataframe : DataFrame
The input dataframe.
- treatment : str
The treatment column name.
- score : str
The propensity score column name.
- exacts : list, default=[]
The column names for exact matching, e.g. ['x_cat1'].
- alias : str, default='matching_index'
The alias for the matching index column in the output dataframe.
Returns
- DataFrame
The output dataframe with an additional column for the matching index.
- dataframe.match.matching_plot(df, T, col, xlim=(0, 1), figsize=(8, 8), xlabel='', ylabel='density', legend=['Control', 'Treatment'])[source]
This function plots the overlaid distribution of col in df over the treatment and control groups.
Parameters
- df : DataFrame
The input dataframe containing the data to plot.
- T : str
The name of the treatment indicator column.
- col : str
The name of the column corresponding to the variable to plot.
- xlim : tuple, optional
The xlim of the plot; (0, 1) by default.
- figsize : tuple, optional
The size of the histogram; (8, 8) by default.
- xlabel : str, optional
The name of the xlabel; col by default.
- ylabel : str, optional
The name of the ylabel; 'density' by default.
- legend : iterable, optional
The legend; ['Control', 'Treatment'] by default.
Yields
An overlaid histogram of col for the treatment and control groups.
>>> import fast_causal_inference.dataframe.match as Match
>>> Match.matching_plot(df, 'treatment', 'x1')
- dataframe.match.smd(df, T, cols)[source]
Calculate the Standardized Mean Difference (SMD) for the input dataframe.
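SMD here follows the conventional definition (a sketch consistent with the example output below): the difference in group means scaled by the pooled standard deviation,
\[SMD = \frac{\bar{x}_{Treatment} - \bar{x}_{Control}}{\sqrt{\left(s_{Treatment}^2 + s_{Control}^2\right)/2}}\]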
Parameters
- df : DataFrame
The input dataframe.
- T : str
The treatment column name.
- cols : list or str
The column names to calculate the SMD for, as a list like ['x1', 'x2'] or a '+'-separated string.
Returns
- DataFrame
The output dataframe with the SMD results.
Example
>>> import fast_causal_inference.dataframe.match as Match
>>> d2 = Match.smd(match_df, 'treatment', ['x1','x2'])
>>> print(d2)
     Control  Treatment       SMD
x1 -0.015521  -0.025225 -0.009821
x2  0.004834   0.039698  0.017551