Causal Inference

Statistics

dataframe.statistics.ATEestimator(df, Y, T, B=500)[source]

Estimate the Average Treatment Effect (ATE) using a simple difference in means approach.

Parameters:
  • df (DataFrame, required) – the input dataframe.

  • Y (str, required) – the column name of the outcome variable.

  • T (str, required) – the column name of the treatment variable.

  • B (int, optional) – the number of bootstrap samples, default is 500.

Returns:

dict, containing the following key-value pairs:

  • ‘ATE’: Average Treatment Effect.
  • ‘stddev’: standard deviation.
  • ‘p_value’: p-value.
  • ‘95% confidence_interval’: 95% confidence interval.

Example
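A minimal sketch, assuming the same test_data_small table used in the IPWestimator example below:

import fast_causal_inference
import fast_causal_inference.dataframe.statistics as S
df = fast_causal_inference.readClickHouse('test_data_small')
Y = 'numerator'
T = 'treatment'
# returns a dict with keys 'ATE', 'stddev', 'p_value', '95% confidence_interval'
S.ATEestimator(df, Y, T, B=500)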

dataframe.statistics.IPWestimator(df, Y, T, P, B=500)[source]

Estimate the Average Treatment Effect (ATE) using Inverse Probability of Treatment Weighting (IPTW).

Parameters:
  • df (DataFrame, required) – the input dataframe.

  • Y (str, required) – the column name of the outcome variable.

  • T (str, required) – the column name of the treatment variable.

  • P (str, required) – the column name of the propensity score.

  • B (int, optional) – the number of bootstrap samples, default is 500.

Returns:

dict, containing the following key-value pairs:

  • ‘ATE’: Average Treatment Effect.
  • ‘stddev’: standard deviation.
  • ‘p_value’: p-value.
  • ‘95% confidence_interval’: 95% confidence interval.

Example

import fast_causal_inference
table = 'test_data_small'
df = fast_causal_inference.readClickHouse(table)
Y = 'numerator'
T = 'treatment'
P = 'weight'
import fast_causal_inference.dataframe.statistics as S
S.IPWestimator(df,Y,T,P,B=500)
dataframe.statistics.boot_strap(func, sample_num, bs_num)[source]

Compute a two-sided bootstrap confidence interval of a statistic: boot_strap draws sample_num rows from the data bs_num times and computes func on each resample.

Parameters:
  • func (str, required) – function to apply.

  • sample_num (int, required) – number of samples.

  • bs_num (int, required) – number of bootstrap samples.

Returns:

list of calculated statistics.

Example
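A hypothetical sketch, assuming func is an aggregate expression string such as ‘avg(x1)’ and that boot_strap follows the same df.agg(S.<func>) calling pattern as the other statistics helpers:

import fast_causal_inference
import fast_causal_inference.dataframe.statistics as S
df = fast_causal_inference.readClickHouse('test_data_small')
# draw sample_num=1000 rows per resample, repeat bs_num=500 times, compute avg(x1) on each
df.agg(S.boot_strap(func='avg(x1)', sample_num=1000, bs_num=500)).show()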

dataframe.statistics.delta_method(expr=None, std=True)[source]

Compute the delta method on the given expression.

Parameters:
  • expr (str, optional) – form like f(avg(x1), avg(x2), …), where f is a function expression, x1 and x2 are column names, and the columns involved must be numeric.

  • std (bool, optional) – Whether to return standard deviation, default is True.

Returns:

A DataFrame containing the var or std computed by delta_method.

Return type:

DataFrame

Example

import fast_causal_inference
import fast_causal_inference.dataframe.statistics as S
df = fast_causal_inference.readClickHouse('test_data_small')
df.groupBy('treatment').delta_method('avg(x1)', False).show()
df.groupBy('treatment').agg(S.delta_method('avg(x1)')).show()

This will output:

treatment             std
0         0         1.934587277675054E-4
1         1        1.9646284055862068E-4
treatment             var
0         0        0.013908944164367954
1         1        0.014016520272828797
dataframe.statistics.kolmogorov_smirnov_test(sample_data, sample_index)[source]

This function is used to calculate the Kolmogorov-Smirnov test for goodness of fit. It returns the calculated statistic and the two-tailed p-value.

Parameters:
  • sample_data (str, required) – column name of the sample data; the column must be numeric (Integer, Float or Decimal).

  • sample_index (str, required) – column name of the sample index indicating the group each observation belongs to.

Returns:

Tuple with two elements:

  • calculated statistic: Float64.
  • calculated p-value: Float64.

Example:

import fast_causal_inference
import fast_causal_inference.dataframe.statistics as S
df = fast_causal_inference.readClickHouse('test_data_small')
>>> df.kolmogorov_smirnov_test('y', 'treatment').show()
[0.6382961593945475, 0.0]
>>> df.agg(S.kolmogorov_smirnov_test('y', 'treatment')).show()
[0.6382961593945475, 0.0]
dataframe.statistics.mann_whitney_utest(sample_data, sample_index, alternative='two-sided', continuity_correction=1)[source]

This function is used to calculate the Mann-Whitney U test. It returns the calculated U-statistic and the two-tailed p-value.

Parameters:
  • sample_data (str, required) – column name, the numerator of the metric, can use SQL expression, the column must be numeric.

  • sample_index (str, required) – column name, the index to represent the control group and the experimental group, 1 for the experimental group and 0 for the control group.

  • alternative (str, optional) – ‘two-sided’: the default value, two-sided test. ‘greater’: one-tailed test in the positive direction. ‘less’: one-tailed test in the negative direction.

  • continuity_correction (bool, optional) – whether to apply continuity correction, default 1 (True).

Returns:

Tuple with two elements:

U-statistic: Float64. p-value: Float64.

Example:

import fast_causal_inference
import fast_causal_inference.dataframe.statistics as S
df = fast_causal_inference.readClickHouse('test_data_small')
>>> df.mann_whitney_utest('x1', 'treatment').show()
[2380940.0, 0.0]
>>> df.agg(S.mann_whitney_utest('x1', 'treatment')).show()
[2380940.0, 0.0]
dataframe.statistics.matrix_multiplication(*col, std=False, invert=False)[source]
Parameters:
  • col (int, float or decimal, required) – columns to apply the function to; must be numeric.

  • std (bool, optional) – whether to return the standard deviation, default is False.

  • invert (bool, optional) – whether to invert the matrix, default is False.

Returns:

list of calculated statistics.

Example

import fast_causal_inference
import fast_causal_inference.dataframe.statistics as S
df = fast_causal_inference.readClickHouse('test_data_small')
df.matrix_multiplication('x1', 'x2', std = False, invert = False).show()
df.agg(S.matrix_multiplication('x1', 'x2', std = False, invert = False)).show()
df.agg(S.matrix_multiplication('x1', 'x2', std = True, invert = True)).show()
dataframe.statistics.mean_z_test(sample_data, sample_index, population_variance_x, population_variance_y, confidence_level)[source]

This function is used to calculate the z-test for the mean of two independent samples of scores. It returns the calculated z-statistic and the two-tailed p-value.

Parameters:
  • sample_data (str, required) – column name, the numerator of the metric, can use SQL expression, the column must be numeric.

  • sample_index (str, required) – column name, the index to represent the control group and the experimental group, 1 for the experimental group and 0 for the control group.

  • population_variance_x (Float, required) – variance of the control group.

  • population_variance_y (Float, required) – variance of the experimental group.

  • confidence_level (Float, required) – confidence level used to calculate confidence intervals.

Returns:

Example

import fast_causal_inference
import fast_causal_inference.dataframe.statistics as S
df = fast_causal_inference.readClickHouse('test_data_small')
df.mean_z_test('y', 'treatment', 0.9, 0.9, 0.95).show()
df.agg(S.mean_z_test('y', 'treatment', 0.9, 0.9, 0.95)).show()
dataframe.statistics.permutation(func, permutation_num, mde='0', mde_type='1')[source]
Parameters:
  • func (str, required) – function to apply, e.g. ‘mannWhitneyUTest’.

  • permutation_num (int, required) – number of permutations.

  • col (str, required) – column to apply the function to; must be numeric.

  • mde (str, optional) – minimum detectable effect, default ‘0’.

  • mde_type (str, optional) – type of the minimum detectable effect, default ‘1’.

Returns:

list of calculated statistics.

Example

import fast_causal_inference
import fast_causal_inference.dataframe.statistics as S
df = fast_causal_inference.readClickHouse('test_data_small')
df.permutation('mannWhitneyUTest', 3, 'x1')
df.agg(S.permutation('mannWhitneyUTest', 3, 'x1')).show()
dataframe.statistics.srm(x, groupby, ratio='[1,1]')[source]

Perform a Sample Ratio Mismatch (SRM) test.

Parameters:
  • x (str, required) – column name, the numerator of the metric, can use SQL expression, the column must be numeric. If you are concerned about whether the sum of x1 meets expectations, you should fill in x1, then it will calculate sum(x1); If you are concerned about whether the sample size meets expectations, you should fill in 1, then it will calculate sum(1).

  • groupby (str, required) – column name, representing the field for aggregation grouping, can support Integer/String.

  • ratio (str, optional) – the expected traffic ratio as a list literal, default ‘[1,1]’. Fill in according to the order of the groupby field; each value must be > 0. For example, ‘[1,1,2]’ represents an expected ratio of 1:1:2.

Returns:

DataFrame contains the following columns:

  • groupname: the name of the group.
  • f_obs: the observed traffic.
  • ratio: the expected traffic ratio.
  • chisquare: the calculated chi-square.
  • p-value: the calculated p-value.

Example:

import fast_causal_inference.dataframe.statistics as S
>>> df.srm('x1', 'treatment', '[1,2]').show()
  groupname         f_obs     ratio     chisquare   p-value
0            23058.627723  1.000000  48571.698643  0.000000
1              1.0054e+05  1.000000
>>> df.agg(S.srm('x1', 'treatment', '[1,2]')).show()
  groupname         f_obs     ratio     chisquare   p-value
0            23058.627723  1.000000  48571.698643  0.000000
1              1.0054e+05  1.000000
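To check whether the sample sizes (rather than the sum of a metric) match the expected ratio, pass ‘1’ as x so that sum(1), i.e. the row count per group, is tested, per the parameter description above:

>>> df.srm('1', 'treatment', '[1,1]').show()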
dataframe.statistics.student_ttest(sample_data, sample_index)[source]

This function is used to calculate the t-test for the means of two independent samples of scores. It returns the calculated t-statistic and the two-tailed p-value.

Parameters:
  • sample_data (str, required) – column name, the numerator of the metric, can use SQL expression, the column must be numeric.

  • sample_index (str, required) – column name, the index to represent the control group and the experimental group, 1 for the experimental group and 0 for the control group.

Returns:

Tuple with two elements:

  • calculated statistic: Float64.
  • calculated p-value: Float64.

Example

import fast_causal_inference
import fast_causal_inference.dataframe.statistics as S
df = fast_causal_inference.readClickHouse('test_data_small')
>>> df.student_ttest('y', 'treatment').show()
[-72.8602591880598, 0.0]
>>> df.agg(S.student_ttest('y', 'treatment')).show()
[-72.8602591880598, 0.0]
dataframe.statistics.ttest_1samp(Y, alternative='two-sided', mu=0, X='')[source]

This function is used to calculate the t-test for the mean of one group of scores. It returns the calculated t-statistic and the two-tailed p-value.

Parameters:
  • Y (str, required) – form like f(avg(x1), avg(x2), …), where f is a function expression, x1 and x2 are column names, and the columns involved must be numeric.

  • alternative (str, optional) – use ‘two-sided’ for a two-tailed test, ‘greater’ for a one-tailed test in the positive direction, and ‘less’ for a one-tailed test in the negative direction.

  • mu (float, optional) – the mean of the null hypothesis.

  • X (str, optional) – an expression used as continuous covariates for CUPED variance reduction. It follows the regression approach and can be a simple form like ‘avg(x1)/avg(x2)’, ‘avg(x3)’, or ‘avg(x1)/avg(x2)+avg(x3)’.

Returns:

DataFrame contains the following columns:

  • estimate: the mean value of the statistic to be tested.
  • stderr: the standard error of the statistic to be tested.
  • t-statistic: the calculated t-statistic.
  • p-value: the calculated p-value.
  • lower: the lower bound of the confidence interval.
  • upper: the upper bound of the confidence interval.

Example:

import fast_causal_inference.dataframe.statistics as S
import fast_causal_inference
df = fast_causal_inference.readClickHouse('test_data_small')
>>> df.groupBy('x_cat1').ttest_1samp('avg(numerator)/avg(denominator)', alternative = 'two-sided', mu = 0).show()
>>> df.groupBy('x_cat1').agg(S.ttest_1samp('avg(numerator)', alternative = 'two-sided', mu = 0, X = 'avg(numerator_pre)/avg(denominator_pre)')).show()
x_cat1  estimate    stderr t-statistic   p-value     lower     upper
0      B  1.455223  0.041401   35.149887  0.000000  1.374029  1.536417
1      E  1.753613  0.042083   41.670491  0.000000  1.671082  1.836143
2      D  1.752348  0.043173   40.589377  0.000000  1.667680  1.837016
3      C  1.804776  0.046642   38.694122  0.000000  1.713303  1.896249
4      A  2.108937  0.042558   49.554601  0.000000  2.025477  2.192398
x_cat1   estimate    stderr t-statistic   p-value      lower      upper
0      B  10.220695  0.261317   39.112304  0.000000   9.708205  10.733185
1      E  12.407975  0.267176   46.441156  0.000000  11.884002  12.931947
2      D  11.924641  0.258935   46.052716  0.000000  11.416831  12.432451
3      C  12.274732  0.281095   43.667495  0.000000  11.723457  12.826006
4      A  14.824860  0.241133   61.480129  0.000000  14.351972  15.297748
dataframe.statistics.ttest_2samp(Y, index, alternative='two-sided', X='', pse='')[source]

This function is used to calculate the t-test for the means of two independent samples of scores. It returns the calculated t-statistic and the two-tailed p-value.

Parameters:
  • Y (str, required) – form like f(avg(x1), avg(x2), …), where f is a function expression, x1 and x2 are column names, and the columns involved must be numeric.

  • index (str, required) – the treatment variable.

  • alternative (str, optional) – use ‘two-sided’ for a two-tailed test, ‘greater’ for a one-tailed test in the positive direction, and ‘less’ for a one-tailed test in the negative direction.

  • X (str, optional) – an expression used as continuous covariates for CUPED variance reduction. It follows the regression approach and can be a simple form like ‘avg(x1)/avg(x2)’, ‘avg(x3)’, or ‘avg(x1)/avg(x2)+avg(x3)’.

  • pse (str, optional) – an expression used as discrete covariates for post-stratification variance reduction. It involves grouping by a covariate, calculating variances separately, and then weighting them. It can be any complex function form, such as ‘x_cat1’.

Returns:

DataFrame contains the following columns:

  • estimate: the mean value of the statistic to be tested.
  • stderr: the standard error of the statistic to be tested.
  • t-statistic: the calculated t-statistic.
  • p-value: the calculated p-value.
  • lower: the lower bound of the confidence interval.
  • upper: the upper bound of the confidence interval.

Example:

import fast_causal_inference.dataframe.statistics as S
import fast_causal_inference
df = fast_causal_inference.readClickHouse('test_data_small')
>>> df.agg(S.ttest_2samp('avg(numerator)/avg(denominator)', 'treatment', alternative = 'two-sided', pse = 'x_cat1')).show()
>>> df.agg(S.ttest_2samp('avg(numerator)/avg(denominator)', 'treatment', alternative = 'two-sided', X = 'avg(numerator_pre)/avg(denominator_pre)')).show()
>>> df.groupBy('x_cat1').ttest_2samp('avg(numerator)', 'treatment', alternative = 'two-sided', X = 'avg(numerator_pre)').show()
>>> df.groupBy('x_cat1').agg(S.ttest_2samp('avg(numerator)/avg(denominator)', 'treatment', alternative = 'two-sided', X = 'avg(numerator_pre)/avg(denominator_pre)')).show()
      mean0     mean1  estimate    stderr  t-statistic   p-value     lower     upper
0  0.791139  2.487152  1.696013  0.032986    51.416725  0.000000  1.631355  1.760672
      mean0     mean1  estimate    stderr  t-statistic   p-value     lower     upper
0  0.793732  2.486118  1.692386  0.026685    63.419925  0.000000  1.640077  1.744694
  x_cat1     mean0      mean1   estimate    stderr  t-statistic   p-value      lower      upper
0      B  2.481226  17.787127  15.305901  0.365716    41.851896  0.000000  14.588665  16.023138
1      E  4.324137  19.437071  15.112935  0.370127    40.831785  0.000000  14.387062  15.838808
2      D  4.582961  19.156961  14.574000  0.373465    39.023766  0.000000  13.841579  15.306421
3      C  4.579375  19.816027  15.236652  0.419183    36.348422  0.000000  14.414564  16.058739
4      A  7.518409  22.195092  14.676682  0.342147    42.895825  0.000000  14.005694  15.347671
  x_cat1     mean0     mean1  estimate    stderr  t-statistic   p-value     lower     upper
0      B  0.409006  2.202847  1.793841  0.053917    33.270683  0.000000  1.688101  1.899581
1      E  0.714211  2.435665  1.721455  0.056144    30.661265  0.000000  1.611348  1.831562
2      D  0.781435  2.455767  1.674332  0.058940    28.407344  0.000000  1.558742  1.789923
3      C  0.778977  2.562364  1.783388  0.065652    27.164280  0.000000  1.654633  1.912142
4      A  1.242126  2.766098  1.523972  0.060686    25.112311  0.000000  1.404959  1.642984
dataframe.statistics.welch_ttest(sample_data, sample_index)[source]

This function is used to calculate Welch’s t-test for the means of two independent samples of scores. It returns the calculated t-statistic and the two-tailed p-value.

Parameters:
  • sample_data (str, required) – column name, the numerator of the metric, can use SQL expression, the column must be numeric.

  • sample_index (str, required) – column name, the index to represent the control group and the experimental group, 1 for the experimental group and 0 for the control group.

Returns:

Tuple with two elements:

  • calculated statistic: Float64.
  • calculated p-value: Float64.

Example

import fast_causal_inference
import fast_causal_inference.dataframe.statistics as S
df = fast_causal_inference.readClickHouse('test_data_small')
>>> df.welch_ttest('y', 'treatment').show()
[-73.78492246858345, 0.0]
>>> df.agg(S.welch_ttest('y', 'treatment')).show()
[-73.78492246858345, 0.0]
dataframe.statistics.xexpt_ttest_2samp(numerator, denominator, index, uin, metric_type='avg', group_buckets='[1,1]', alpha=0.05, MDE=0.005, power=0.8, X='')[source]

This function is used to calculate the t-test for the means of two independent samples of scores. It returns the calculated t-statistic and the two-tailed p-value.

Parameters:
  • numerator (str, required) – column name, the numerator of the metric, can use SQL expression, the column must be numeric.

  • denominator (str, required) – column name, the denominator of the metric, can use SQL expression, the column must be numeric.

  • index (str, required) – column name, used to represent the control group and the experimental group.

  • uin (str, required) – column name, used to bucket samples, can use SQL expression, int64 type.

  • metric_type (str, optional) – ‘avg’: test a mean metric, avg(numerator)/avg(denominator); the default. ‘sum’: test a sum metric; in this case the denominator can be omitted or set to 1, otherwise an error is reported.

  • group_buckets (list, optional) – the number of traffic buckets for each group, only effective when metric_type=’sum’. The default is [1,1]; the number of elements must equal the number of groups, and only the relative ratio matters.

  • alpha (float, optional) – numeric, significance level, default 0.05.

  • MDE (float, optional) – numeric, minimum detectable effect, default 0.005.

  • power (float, optional) – numeric, statistical power, default 0.8.

  • X (str, optional) – an expression used as continuous covariates for CUPED variance reduction. It follows the regression approach and can be a simple form like ‘avg(x1)/avg(x2)’, ‘avg(x3)’, or ‘avg(x1)/avg(x2)+avg(x3)’.

Returns:

DataFrame contains the following columns:

  • groupname: the name of the group.
  • numerator: the mean of the numerator.
  • denominator: the mean of the denominator (only when metric_type=avg).
  • numerator_pre: the mean of the numerator before the experiment (only when metric_type=avg).
  • denominator_pre: the mean of the denominator before the experiment (only when metric_type=avg).
  • mean: the mean of the metric (only when metric_type=avg).
  • std_samp: the standard deviation of the metric (only when metric_type=avg).
  • ratio: group_buckets (only when metric_type=sum).
  • diff_relative: the relative difference between the two groups.
  • 95%_relative_CI: the 95% confidence interval of the relative difference.
  • p-value: the calculated p-value.
  • t-statistic: the calculated t-statistic.
  • power: the calculated power.
  • recommend_samples: the recommended sample size.

Example:

import fast_causal_inference
import fast_causal_inference.dataframe.statistics as S
df = fast_causal_inference.readClickHouse('test_data_small')
>>> df.xexpt_ttest_2samp('numerator', 'denominator', 'treatment', uin = 'rand()', metric_type = 'sum', group_buckets=[1,1]).show()
groupname   numerator     ratio
0           23058.627723  1
1           100540.303112 1
diff_relative 95%_relative_CI           p-value     t-statistic power       recommend_samples
336.020323%   [320.514511%,351.526135%] 0.000000    42.478747   0.050458    24404575
>>> df.xexpt_ttest_2samp('numerator', 'denominator', 'treatment', uin = 'rand()', metric_type = 'sum', group_buckets=[1,1], X = 'avg(numerator_pre)/avg(denominator_pre)').show()
groupname   numerator     ratio       numerator_pre
0           23058.627723  1           21903.112431
1           100540.303112 1           23096.875608
diff_relative 95%_relative_CI           p-value     t-statistic power       recommend_samples
310.412514%   [299.416469%,321.408558%] 0.000000    55.335445   0.050911    12696830
>>> df.xexpt_ttest_2samp('numerator', 'denominator', 'treatment', uin = 'rand()', metric_type = 'avg', X = 'avg(numerator_pre)/avg(denominator_pre)').show()
groupname   numerator     denominator  numerator_pre denominator_pre mean        std_samp
0           23058.627723  29023.233157 21903.112431  29131.831739    0.793678    1.253257
1           100540.303112 40452.337656 23096.875608  30776.559777    2.486168    5.123161
diff_relative 95%_relative_CI           p-value     t-statistic diff        95%_CI              power       recommend_samples
213.246344%   [206.698202%,219.794486%] 0.000000    63.835777   1.692490    [1.640519,1.744461] 0.052570    14172490
>>> df.agg(S.xexpt_ttest_2samp('numerator', 'denominator', 'treatment', uin = 'rand()', metric_type = 'avg', alpha = 0.05, MDE = 0.005, power = 0.8, X = 'avg(numerator_pre)+avg(x1)')).show()
groupname   numerator     denominator  numerator_pre denominator_pre mean        std_samp
0           23058.627723  29023.233157 21903.112431  -62.102593      1.057338    2.341991
1           100540.303112 40452.337656 23096.875608  -122.234609     2.732950    5.918014
diff_relative 95%_relative_CI           p-value     t-statistic diff        95%_CI              power       recommend_samples
158.474659%   [152.453710%,164.495607%] 0.000000    51.593567   1.675612    [1.611950,1.739274] 0.053041    11982290

Regression

class dataframe.regression.DID[source]

Parameters

Y:

Column name, refers to the outcome of interest, a numerical variable.

treatment:

Column name, a Boolean variable, can only take values 0 or 1, where 1 represents the experimental group.

time:

Column name, a Boolean variable, represents the time factor. time = 0 represents before the strategy takes effect, time = 1 represents after the strategy takes effect.

(Optional parameter) X:

Some covariates before the experiment, which can be used to reduce variance. Written in the form [‘x1’, ‘x2’, ‘x3’]; they must be numerical variables.

Example

import fast_causal_inference.dataframe.regression as Regression
model = Regression.DID()
model.fit(df=df,Y='y',treatment='treatment',time='t_ob',X=['x1','x2'])
model.summary()
# Call:
# lm( formula = y ~ treatment + t_ob + treatment*t_ob + x1 + x2 )

# Coefficients:
# .               Estimate    Std. Error  t value     Pr(>|t|)
# (Intercept)     4.461905    0.213302    20.918288   0.000000
# treatment       13.902920   0.291365    47.716586   0.000000
# t_ob            0.416831    0.280176    1.487748    0.136849
# treatment*t_ob  1.812698    0.376476    4.814905    0.000001
# x1              1.769065    0.100727    17.562939   0.000000
# x2              2.020569    0.047162    42.842817   0.000000

# Residual standard error: 9.222100 on 9994 degrees of freedom
# Multiple R-squared: 0.478329, Adjusted R-squared: 0.478068
# F-statistic: 1832.730042 on 5 and 9994 DF,  p-value: 0.000000

# other ways
import fast_causal_inference.dataframe.regression as Regression
df.did('y', 'treatment', 't_ob',['x1','x2','x3']).show()
df.agg(Regression.did('y', 'treatment', 't_ob',['x1','x2','x3'])).show()
class dataframe.regression.IV[source]

Instrumental Variable (IV) estimator class. Instrumental variables (IV) is a method used in statistics, econometrics, epidemiology, and related disciplines to estimate causal relationships when controlled experiments are not feasible or when a treatment is not successfully delivered to every unit in a randomized experiment. The idea behind IV is to use a variable, known as an instrument, that is correlated with the endogenous explanatory variables (the variables that are correlated with the error term), but uncorrelated with the error term itself. This allows us to isolate the variation in the explanatory variable that is purely due to the instrument and thus uncorrelated with the error term, which can then be used to estimate the causal effect of the explanatory variable on the dependent variable.

Here is an example:

\[t_{ob} = treatment + X_1 + X_2\]
\[Y = \hat{t}_{ob} + X_1 + X_2\]
  • \(X_1\) and \(X_2\) are independent variables or predictors.

  • \(t_{ob}\) is the dependent variable that you are trying to explain or predict.

  • \(treatment\) is an independent variable representing some intervention or condition that you believe affects \(t_{ob}\).

  • \(Y\) is the dependent variable that you are trying to explain or predict.

  • \(\hat{t}_{ob}\) is the predicted value of \(t_{ob}\) from the first equation.

We first regress \(t_{ob}\) on the treatment and the other exogenous variables \(X_1\) and \(X_2\) to get the predicted values \(\hat{t}_{ob}\). Then, we replace \(t_{ob}\) with \(\hat{t}_{ob}\) in the second equation and estimate the parameters. This gives us the causal effect of \(t_{ob}\) on \(Y\), purged of the endogeneity problem.

Methods:

  • fit: Fits the model with the given formula.

  • summary: Displays the summary of the model fit.

Example

import fast_causal_inference.dataframe.regression as Regression
model = Regression.IV()
model.fit(df,formula='y~(t_ob~treatment)+x1+x2')
model.summary()

df.iv_regression('y~(t_ob~treatment)+x1+x2').show()
df.agg(Regression.iv_regression('y~(t_ob~treatment)+x1+x2')).show()
fit(df, formula)[source]

Fits the model with the given formula.

Parameters:

df (DataFrame, required) – the input dataframe.

formula (str, required) – the formula used to fit the model, e.g. ‘y~(t_ob~treatment)+x1+x2’.

summary()[source]

Displays the summary of the model fit.

class dataframe.regression.Logistic(tol=1e-12, iter=500)[source]

This class implements a Logistic Regression model.

Parameters

tol : float

The tolerance for stopping criteria.

iter : int

The maximum number of iterations.

Example

import fast_causal_inference
from fast_causal_inference.dataframe.regression import Logistic
table = 'test_data_small'
df = fast_causal_inference.readClickHouse(table)
X = ['x1', 'x2', 'x3', 'x4', 'x5', 'x_long_tail1', 'x_long_tail2']
Y = 't_ob'

logit = Logistic(tol=1e-6, iter=500)
logit.fit(Y, X, df)
logit.summary()
# Output:
#                  x      beta
#     0     intercept  0.083472
#     1            x1  0.957999
#     2            x2  0.217600
#     3            x3  0.534323
#     4            x4 -0.006258
#     5            x5 -0.020528
#     6  x_long_tail1 -0.036267
#     7  x_long_tail2  0.000232

# predict
df_predict = logit.predict(df)
df_predict.select('prob').show()
#                             prob
# 0      0.549151214991665
# 1     0.8876947633647565
# 2    0.10790926234343089
# 3      0.791206731095578
# 4     0.7341882818925854
# ..                   ...
# 195  0.21966953201618872
# 196   0.5813872122369445
# 197   0.5766490178541132
# 198   0.5210472623083635
# 199  0.35841097345616885

logit.get_auc(df=df_predict,Y=Y,prob_col="prob")
# 0.7587271750805586
class dataframe.regression.Ols(use_bias=True)[source]

This function is for an Ordinary Least Squares (OLS) model calculated using Stochastic Gradient Descent. The fit method is used to train the model using a specified regression formula and dataset. The effect method is used to make predictions based on the trained model, the regression formula, and a new dataset. The predicted results are stored in a column with a specified name in the DataFrame.

Parameters:

use_bias : bool, default=True, whether to use an intercept

Methods:

fit(expr, df): Train the model.
  expr : str, regression formula
  df : DataFrame, dataset

effect(expr, df, effect_name): Predict.
  expr : str, regression formula
  df : DataFrame, dataset
  effect_name : str, column name for the prediction result, default is ‘effect’

summary(): Display the summary of the model.

Example

import fast_causal_inference.dataframe.regression as Regression
model = Regression.Ols(False)  # use_bias=False: fit without an intercept
model.fit('y~x1+x2+x3', df)
model.summary()
effect_df = model.effect('x1+x2+x3', df)
effect_df.show()
class dataframe.regression.StochasticLinearRegression(learning_rate=1e-05, l1=0.1, batch_size=15, method='SGD')[source]

This function is for a Stochastic Linear Regression model. The fit method is used to train the model using a specified regression formula and a dataset. The effect method is used to make predictions based on the trained model, the regression formula, and a new dataset. The predicted results are stored in a column with a specified name in the DataFrame. The learning_rate, l1, batch_size, and method parameters are used to control the learning rate, L1 regularization coefficient, batch size, and optimization method respectively.

Parameters:

learning_rate : float, default=0.00001, the learning rate
l1 : float, default=0.1, the L1 regularization coefficient
batch_size : int, default=15, the batch size
method : str, default=’SGD’, the optimization method

Methods:

fit(expr, df): Train the model.
  expr : str, regression formula
  df : DataFrame, dataset

effect(expr, df, effect_name): Predict.
  expr : str, regression formula
  df : DataFrame, dataset
  effect_name : str, column name for the prediction result, default is ‘effect’

Example

import fast_causal_inference.dataframe.regression as Regression
model = Regression.StochasticLinearRegression(learning_rate=0.00001, l1=0.1, batch_size=15, method='SGD')
model.fit('y~x1+x2+x3', df)
effect_df = model.effect('x1+x2+x3', df)
effect_df.show()
class dataframe.regression.StochasticLogisticRegression(learning_rate=1e-05, l1=0.1, batch_size=15, method='SGD')[source]

This function is for a Stochastic Logistic Regression model. The fit method is used to train the model using a specified regression formula and a dataset. The effect method is used to make predictions based on the trained model, the regression formula, and a new dataset. The predicted results are stored in a column with a specified name in the DataFrame. The learning_rate, l1, batch_size, and method parameters are used to control the learning rate, L1 regularization coefficient, batch size, and optimization method respectively.

Parameters:

learning_rate : float, default=0.00001, the learning rate
l1 : float, default=0.1, the L1 regularization coefficient
batch_size : int, default=15, the batch size
method : str, default=’SGD’, the optimization method

Methods:

fit(expr, df): Train the model.
  expr : str, regression formula
  df : DataFrame, dataset

effect(expr, df, effect_name): Predict.
  expr : str, regression formula
  df : DataFrame, dataset
  effect_name : str, column name for the prediction result, default is ‘effect’

Example

import fast_causal_inference.dataframe.regression as Regression
model = Regression.StochasticLogisticRegression(learning_rate=0.00001, l1=0.1, batch_size=15, method='SGD')
model.fit('y~x1+x2+x3', df)
effect_df = model.effect('x1+x2+x3', df)
effect_df.show()
class dataframe.regression.Wls(weight='1', use_bias=True)[source]

This function is for a Weighted Least Squares (WLS) model. The fit method is used to train the model using a specified regression formula and dataset. The effect method is used to make predictions based on the trained model, the regression formula, and a new dataset. The predicted results are stored in a column with a specified name in the DataFrame. The weight parameter specifies the column name for weights in the DataFrame.

Parameters:

weight : str, column name for weights
use_bias : bool, default=True, whether to use an intercept

Methods:

fit(expr, df): Train the model.
  expr : str, regression formula
  df : DataFrame, dataset

effect(expr, df, effect_name): Predict.
  expr : str, regression formula
  df : DataFrame, dataset
  effect_name : str, column name for the prediction result, default is ‘effect’

summary(): Display the summary of the model.

Example

import fast_causal_inference.dataframe.regression as Regression
model = Regression.Wls(weight='1', use_bias=False)
model.fit('y~x1+x2+x3', df)
effect_df = model.effect('x1+x2+x3', df)
effect_df.show()

Uplift

class dataframe.uplift.CausalForest(depth=7, min_node_size=-1, mtry=3, num_trees=10, sample_fraction=0.7, weight_index='', honesty=False, honesty_fraction=0.5, quantile_num=50)[source]

This class implements the Causal Forest method for causal inference.

Parameters

depth : int, default=7

The maximum depth of the tree.

min_node_size : int, default=-1

The minimum node size.

mtry : int, default=3

The number of variables randomly sampled as candidates at each split.

num_trees : int, default=10

The number of trees to grow in the forest.

sample_fraction : float, default=0.7

The fraction of observations to consider when fitting the forest.

weight_index : str, default=’’

The weight index.

honesty : bool, default=False

Whether to use honesty when fitting the forest.

honesty_fraction : float, default=0.5

The fraction of observations to use for determining splits if honesty is used.

quantile_num : int, default=50

The number of quantiles.

Methods

fit(Y, T, X, df):

Fit the Causal Forest model to the input data.

effect(df=None, X=[]):

Estimate the causal effect using the fitted model.

Example

import fast_causal_inference
from fast_causal_inference.dataframe.uplift import *
Y='y'
T='treatment'
table = 'test_data_small'
X = ['x1', 'x2', 'x3', 'x4', 'x5', 'x_long_tail1', 'x_long_tail2']
df = fast_causal_inference.readClickHouse(table)
df_train, df_test = df.split(0.5)
from fast_causal_inference.dataframe.uplift import CausalForest
model = CausalForest(depth=7, min_node_size=-1, mtry=3, num_trees=10, sample_fraction=0.7)
model.fit(Y, T, X, df_train)
effect(df=None, X=[])[source]

Estimate the causal effect using the fitted model.

Parameters

df : DataFrame, default=None

The input dataframe for which to estimate the causal effect. If None, use the dataframe from the fit method.

X : list, default=[]

The covariates to use when estimating the causal effect, e.g. [‘x1’, ‘x2’, ‘x3’, ‘x4’, ‘x5’, ‘x_long_tail1’, ‘x_long_tail2’].

Returns

DataFrame

The output dataframe with the estimated causal effect.

Example

df_test_effect_cf = model.effect(df=df_test, X=['x1', 'x2', 'x3', 'x4', 'x5', 'x_long_tail1', 'x_long_tail2'])
df_train_effect_cf = model.effect(df=df_train, X=['x1', 'x2', 'x3', 'x4', 'x5', 'x_long_tail1', 'x_long_tail2'])
lift_train = get_lift_gain("effect", Y, T, df_train_effect_cf, discrete_treatment=True, K=100)
lift_test = get_lift_gain("effect", Y, T, df_test_effect_cf, discrete_treatment=True, K=100)
print(lift_train,lift_test)
hte_plot([lift_train,lift_test],labels=['train','test'])
fit(Y, T, X, df)[source]

Fit the Causal Forest model to the input data.

Parameters

Y : str

The outcome variable.

T : str

The treatment variable.

X : list

The numeric covariates; strings are not supported. E.g. [‘x1’, ‘x2’, ‘x3’, ‘x4’, ‘x5’, ‘x_long_tail1’, ‘x_long_tail2’].

df : DataFrame

The input dataframe.

Returns

None

class dataframe.uplift.CausalTree(depth=3, min_sample_ratio_leaf=0.001, bin_num=10)[source]

This class implements a Causal Tree for uplift/HTE analysis.

Parameters

depth : int

The maximum depth of the tree.

min_sample_ratio_leaf : float

The minimum sample ratio for a leaf.

bin_num : int

The number of bins used when cutting the need_cut_X columns.

Example

import fast_causal_inference
from fast_causal_inference.dataframe.uplift import *
Y='y'
T='treatment'
table = 'test_data_small'
X = ['x1', 'x2', 'x3', 'x4', 'x5', 'x_long_tail1', 'x_long_tail2']
needcut_X = ['x1', 'x2', 'x3', 'x4', 'x5', 'x_long_tail1', 'x_long_tail2']
df = fast_causal_inference.readClickHouse(table)
df_train, df_test = df.split(0.5)
hte = CausalTree(depth = 3,min_sample_ratio_leaf=0.001)
hte.fit(Y,T,X,needcut_X,df_train)
treeplot = hte.treeplot() # causal tree plot
treeplot.render('digraph.gv', view=False) # the full tree image can be viewed in and downloaded from digraph.gv.pdf
print(hte.feature_importance)
# Output:
#                    featName    importance
#     1            x2_buckets  1.015128e+06
#     0            x1_buckets  2.181346e+05
#     3            x4_buckets  1.023273e+05
#     5  x_long_tail1_buckets  5.677131e+04
#     2            x3_buckets  2.537835e+04
#     6  x_long_tail2_buckets  2.536951e+04
#     4            x5_buckets  7.259992e+03

df_train_pred = hte.effect(df=df_train,keep_col='*')
df_test_pred = hte.effect(df=df_test,keep_col='*')
lift_train = get_lift_gain("effect", Y, T, df_train_pred,discrete_treatment=True, K=100)
lift_test = get_lift_gain("effect", Y, T, df_test_pred,discrete_treatment=True, K=100)
print(lift_train,lift_test)
hte_plot([lift_train,lift_test],labels=['train','test'])
# auuc: 0.6624369283393814
# auuc: 0.6532554148698826
#        ratio      lift      gain  ate  ramdom_gain
# 0   0.009990  2.164241  0.021621  1.0     0.009990
# 1   0.019980  2.131245  0.042582  1.0     0.019980
# 2   0.029970  2.056440  0.061632  1.0     0.029970
# 3   0.039960  2.177768  0.087024  1.0     0.039960
# 4   0.049950  2.175329  0.108658  1.0     0.049950
# ..       ...       ...       ...  ...          ...
# 95  0.959241  1.015223  0.973843  1.0     0.959241
# 96  0.969431  1.010023  0.979147  1.0     0.969431
# 97  0.979620  1.006843  0.986324  1.0     0.979620
# 98  0.989810  1.003508  0.993283  1.0     0.989810
# 99  1.000000  1.000000  1.000000  1.0     1.000000

# [100 rows x 5 columns]
#        ratio      lift      gain  ate  ramdom_gain
# 0   0.009810  1.948220  0.019112  1.0     0.009810
# 1   0.019620  2.221654  0.043588  1.0     0.019620
# 2   0.029429  2.419752  0.071212  1.0     0.029429
# 3   0.039239  2.288460  0.089797  1.0     0.039239
# 4   0.049049  2.343432  0.114943  1.0     0.049049
# ..       ...       ...       ...  ...          ...
# 95  0.959960  1.014897  0.974260  1.0     0.959960
# 96  0.969970  1.011624  0.981245  1.0     0.969970
# 97  0.979980  1.009358  0.989150  1.0     0.979980
# 98  0.989990  1.006340  0.996267  1.0     0.989990
# 99  1.000000  1.000000  1.000000  1.0     1.000000

# [100 rows x 5 columns]
effect(df, keep_col='*')[source]

Calculate the individual treatment effect.

Parameters

df : DataFrame

The input data.

keep_col : str, optional

The columns to keep. Defaults to ‘*’.

Returns

DataFrame

The result data.

dataframe.uplift.get_lift_gain(ITE, Y, T, df, normalize=True, K=1000, discrete_treatment=True)[source]

Calculate the uplift & gain.

Parameters

ITE : str

The Individual Treatment Effect column.

Y : str

The outcome variable column.

T : str

The treatment variable column.

df : DataFrame

The input data.

normalize : bool, optional

Whether to normalize the result, default is True.

K : int, optional

The number of bins for discretization, default is 1000.

discrete_treatment : bool, optional

Whether the treatment is discrete, default is True.

Returns

LiftGainCurveResult

An object containing the result of the uplift & gain calculation.

Example

import fast_causal_inference
from fast_causal_inference.dataframe.uplift import *
Y='y'
T='treatment'
table = 'test_data_small'
X = ['x1', 'x2', 'x3', 'x4', 'x5', 'x_long_tail1', 'x_long_tail2']
needcut_X = ['x1', 'x2', 'x3', 'x4', 'x5', 'x_long_tail1', 'x_long_tail2']
df = fast_causal_inference.readClickHouse(table)
df_train, df_test = df.split(0.5)
hte = CausalTree(depth = 3,min_sample_ratio_leaf=0.001)
hte.fit(Y,T,X,needcut_X,df_train)

df_train_pred = hte.effect(df=df_train,keep_col='*')
df_test_pred = hte.effect(df=df_test,keep_col='*')
lift_train = get_lift_gain("effect", Y, T, df_train_pred,discrete_treatment=True, K=100)
lift_test = get_lift_gain("effect", Y, T, df_test_pred,discrete_treatment=True, K=100)
print(lift_train,lift_test)
hte_plot([lift_train,lift_test],labels=['train','test'])
# auuc: 0.6624369283393814
# auuc: 0.6532554148698826
#        ratio      lift      gain  ate  ramdom_gain
# 0   0.009990  2.164241  0.021621  1.0     0.009990
# 1   0.019980  2.131245  0.042582  1.0     0.019980
# 2   0.029970  2.056440  0.061632  1.0     0.029970
# 3   0.039960  2.177768  0.087024  1.0     0.039960
# 4   0.049950  2.175329  0.108658  1.0     0.049950
# ..       ...       ...       ...  ...          ...
# 95  0.959241  1.015223  0.973843  1.0     0.959241
# 96  0.969431  1.010023  0.979147  1.0     0.969431
# 97  0.979620  1.006843  0.986324  1.0     0.979620
# 98  0.989810  1.003508  0.993283  1.0     0.989810
# 99  1.000000  1.000000  1.000000  1.0     1.000000

# [100 rows x 5 columns]
#        ratio      lift      gain  ate  ramdom_gain
# 0   0.009810  1.948220  0.019112  1.0     0.009810
# 1   0.019620  2.221654  0.043588  1.0     0.019620
# 2   0.029429  2.419752  0.071212  1.0     0.029429
# 3   0.039239  2.288460  0.089797  1.0     0.039239
# 4   0.049049  2.343432  0.114943  1.0     0.049049
# ..       ...       ...       ...  ...          ...
# 95  0.959960  1.014897  0.974260  1.0     0.959960
# 96  0.969970  1.011624  0.981245  1.0     0.969970
# 97  0.979980  1.009358  0.989150  1.0     0.979980
# 98  0.989990  1.006340  0.996267  1.0     0.989990
# 99  1.000000  1.000000  1.000000  1.0     1.000000

# [100 rows x 5 columns]
dataframe.uplift.hte_plot(results, labels=[])[source]

Plot the uplift & gain.

Parameters

results : list

A list of LiftGainCurveResult objects to be plotted.

labels : list, optional

A list of labels for the results, default is an empty list.

Returns

None

This function will display a plot.
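For instance, reusing the LiftGainCurveResult objects produced by get_lift_gain in the example above:

from fast_causal_inference.dataframe.uplift import hte_plot
# lift_train and lift_test are LiftGainCurveResult objects from get_lift_gain
hte_plot([lift_train, lift_test], labels=['train', 'test'])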

Match

class dataframe.match.CaliperMatching(caliper=0.2)[source]

This class implements the Caliper Matching method for causal inference.

Parameters

caliper : float, default=0.2

The caliper width for matching. Units are in terms of the standard deviation of the logit of the propensity score.

Methods

fit(dataframe, treatment, score, exacts=[], alias = ‘matching_index’):

Apply the Caliper Matching method to the input dataframe.

Example

import fast_causal_inference
import fast_causal_inference.dataframe.match as Match
df = fast_causal_inference.readClickHouse('test_data_small')
model = Match.CaliperMatching(0.5)
tmp = model.fit(df, treatment='treatment', score='weight', exacts=['x_cat1'])
match_df = tmp.filter("matching_index!=0") # filter out the unmatched records
>>> print('sample size Before match: ')
>>> df.count().show()
>>> print('sample size After match: ')
>>> match_df.count().show()
sample size Before match:
10000
sample size After match:
9652
>>> import fast_causal_inference.dataframe.match as Match
>>> d1 = Match.smd(df, 'treatment', ['x1','x2'])
>>> print(d1)
     Control  Treatment       SMD
x1 -0.012658  -0.023996 -0.011482
x2  0.005631   0.037718  0.016156
>>> import fast_causal_inference.dataframe.match as Match
>>> d2 = Match.smd(match_df, 'treatment', ['x1','x2'])
>>> print(d2)
     Control  Treatment       SMD
x1 -0.015521  -0.025225 -0.009821
x2  0.004834   0.039698  0.017551
>>> Match.matching_plot(df, 'treatment', 'weight')
>>> Match.matching_plot(match_df, 'treatment', 'weight')
fit(dataframe, treatment, score, exacts=[], alias='matching_index')[source]

Apply the Caliper Matching method to the input dataframe.

Parameters

dataframe : DataFrame

The input dataframe.

treatment : str

The treatment column name.

score : str

The propensity score column name.

exacts : list, default=[]

The column names for exact matching, e.g. [‘x_cat1’].

alias : str, default=’matching_index’

The alias for the matching index column in the output dataframe.

Returns

DataFrame

The output dataframe with an additional column for the matching index.
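Example (mirroring the class example above):

import fast_causal_inference.dataframe.match as Match
model = Match.CaliperMatching(0.5)
tmp = model.fit(df, treatment='treatment', score='weight', exacts=['x_cat1'], alias='matching_index')
match_df = tmp.filter("matching_index!=0")  # keep only matched records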

dataframe.match.matching_plot(df, T, col, xlim=(0, 1), figsize=(8, 8), xlabel='', ylabel='density', legend=['Control', 'Treatment'])[source]

This function plots the overlaid distribution of col in df over treat and control group.

Parameters

df : DataFrame

The input dataframe.

T : str

The name of the treatment indicator column in the dataframe.

col : str

The name of the column that corresponds to the variable to plot.

xlim : tuple, optional

The tuple of xlim of the plot; (0, 1) by default.

figsize : tuple, optional

The size of the histogram; (8, 8) by default.

xlabel : str, optional

The name of the xlabel; col by default.

ylabel : str, optional

The name of the ylabel; density by default.

legend : iterable, optional

The legend; Control and Treatment by default.

Yields

An overlaid histogram

>>> import fast_causal_inference.dataframe.match as Match
>>> Match.matching_plot(df,'treatment','x1')
dataframe.match.smd(df, T, cols)[source]

Calculate the Standardized Mean Difference (SMD) for the input dataframe.

Parameters

dfDataFrame

The input dataframe.

Tstr

The treatment column name.

colsstr

The column names to calculate the SMD, separated by ‘+’.

Returns

DataFrame

The output dataframe with the SMD results.

Example

>>> import fast_causal_inference.dataframe.match as Match
>>> d2 = Match.smd(match_df, 'treatment', ['x1','x2'])
>>> print(d2)
     Control  Treatment       SMD
x1 -0.015521  -0.025225 -0.009821
x2  0.004834   0.039698  0.017551