Dataframe Operations
dataframe
- class dataframe.dataframe.DataFrame(olap_engine=OlapEngineType.CLICKHOUSE)[source]
This class is used to create a DataFrame object.
- describe(cols='*')[source]
Returns the summary statistics for the columns in the DataFrame.
>>> df.describe()
>>> df.describe(['x1','x2'])
#     count    avg        std       min        quantile_0.25  quantile_0.5  quantile_0.75  quantile_0.90  quantile_0.99  max
# x1  10000.0  -0.018434  0.987606  -3.740101  -0.68929       -0.028665     0.654210       1.274144       2.321097       3.80166
# x2  10000.0  0.021976   1.986209  -8.893264  -1.28461       0.015400      1.357618       2.583523       4.725829       7.19662
- drop(*args)[source]
Drops specified columns from the DataFrame and returns a new DataFrame.
>>> new_df = df.drop('column1', 'column2')
>>> new_df = df.drop(['column1', 'column2'])
- filter(filter)[source]
Alias for the ‘where’ function. Filters rows using the given condition.
>>> df.filter("column1 > 1").show()
- groupBy(*args)[source]
Groups the DataFrame by the specified columns and returns a new DataFrame. Raises an Exception if the DataFrame is already in the need_agg state.
>>> new_df = df.groupBy('column1', 'column2')
>>> new_df = df.groupBy(['column1', 'column2'])
- orderBy(*args)[source]
Orders the DataFrame by the specified columns and returns a new DataFrame.
>>> import fast_causal_inference.dataframe.functions as Fn
>>> new_df = df.orderBy('column1', Fn.desc('column2'))
>>> new_df = df.orderBy(['column1', 'column2'])
- sample(fraction)[source]
Samples rows from the DataFrame without replacement. In the examples below, the argument is given both as a number (1000) and as a fraction (0.5).
>>> df1 = df.sample(1000)
>>> df1.count().show()
>>> df2 = df.sample(0.5)
>>> df2.count().show()
- select(*args)[source]
Selects specified columns from the DataFrame and returns a new DataFrame.
>>> new_df = df.select('column1', 'column2')
>>> new_df = df.select(['column1', 'column2'])
- split(test_size=0.5)[source]
This function splits the DataFrame into two DataFrames.
>>> df_train, df_test = df.split(0.5)
>>> print(df_train.count())
>>> print(df_test.count())
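The idea of splitting rows into two parts by a ratio can be sketched in plain Python. This is only an illustration: the helper name `split_rows`, the shuffling step, and the `seed` argument are assumptions, not documented behavior of `DataFrame.split`.

```python
import random

def split_rows(rows, test_size=0.5, seed=0):
    """Shuffle rows and cut them into (train, test) lists.

    Hypothetical helper: shuffling and the seed argument are assumptions,
    not documented behavior of DataFrame.split.
    """
    if not 0.0 <= test_size <= 1.0:
        raise ValueError("test_size must be between 0 and 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    # Number of rows that go to the test part
    n_test = int(round(len(shuffled) * test_size))
    return shuffled[n_test:], shuffled[:n_test]

train, test = split_rows(range(10), test_size=0.3)
print(len(train), len(test))
```

Every input row ends up in exactly one of the two parts, so the two counts always add up to the original row count.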
- toClickHouse(clickhouse_table_name)[source]
Writes the DataFrame's ClickHouse table to a new ClickHouse table with the given name.
Example
>>> df.toClickHouse("new_table")
- toCsv(csv_file_abs_path)[source]
Exports the data from the ClickHouse table to a CSV file at the given absolute path.
>>> df.toCsv("/path/to/output.csv")
- toPandas()[source]
Converts the result of the DataFrame to a pandas.DataFrame.
- toSparkDf()[source]
Converts the ClickHouse table to a Spark DataFrame.
Example
>>> import fast_causal_inference
>>> fast_causal_inference.set_default(tenant_id="",tenant_secret_key="")
>>> # For group_id and gaia_id, see the default file "!Spark 资源池.html" (Spark resource pool) in the notebook
>>> spark = fast_causal_inference.set_spark_session(group_id='', gaia_id='')
>>> df = fast_causal_inference.readClickHouse('test_data_small')
>>> spark_df = df.toSparkDf()
- toTdw(tdw_database, tdw_table, tdw_user=None, tdw_passward=None, group='tl', is_drop_table=False, overwrite=True, priPart=None)[source]
Writes the ClickHouse table to a TDW-thive table.
Parameters
- tdw_database (str):
The name of the TDW database.
- tdw_table (str):
The name of the TDW table.
- tdw_user (str, optional):
The username for TDW. Default is None.
- tdw_passward (str, optional):
The password for TDW. Default is None.
- group (str, optional):
The group for TDW. Default is ‘tl’.
- is_drop_table (bool, optional):
Whether to drop the existing TDW table. Default is False.
- overwrite (bool, optional):
Whether to overwrite the existing data in the TDW table. Default is True.
- priPart (str, optional):
The primary partition for the TDW table. Default is None.
Example
>>> df.toTdw("tdw_database", "tdw_table", group='tl')
>>> df.toTdw("tdw_database", "tdw_table", group='tl',priPart=['p_20220222'])
- union(df)[source]
Unions two DataFrames. The two DataFrames must have the same number of columns, and the columns must have the same names and order.
>>> df1 = df1.union(df2)
- withColumn(new_column, func)[source]
This function adds a new column to the DataFrame.
Example
import fast_causal_inference.dataframe.functions as Fn
import fast_causal_inference.dataframe.statistics as S
df1 = df.select('x1','x2','x3', 'numerator')
# Method 1: Select columns through df's index: df['col']
df1 = df1.withColumn('new_col', Fn.sqrt(df1['numerator']))
df1.show()
# Method 2: Select columns directly through string: Fn.col('new_col')
df1 = df1.withColumn('new_col2', Fn.pow('new_col', 2))
df1.show()
df1 = df1.withColumn('new_col3', Fn.col('new_col') * Fn.col('new_col'))
df1.show()
# Add constant
df2 = df1.withColumn('c1', Fn.lit(1))
df2 = df2.withColumn('c2', Fn.lit('1'))
df2.show()
# Nesting
df2 = df1.withColumn('c1', Fn.pow(Fn.sqrt(Fn.sqrt(Fn.col('x1'))), 2))
df2.show()
# +-*/% operations
df2 = df1.withColumn('c1', 22 + df1['x1'] / 2 + 2 / df1['x2'] * df1['x3'] % 2 - (df1['x2']))
df2.show()
df2 = df1.withColumn('c1', Fn.col('x1 + x2 * x3 + x3'))
df2.show()
# if
df2 = df1.withColumn('cc1', 'if(x1 > 0, 1, -1)')
df2.show()
df2 = df1.withColumn('cc1', Fn.If('x1 > 0',1,-1))
df2.show()
functions
- dataframe.functions.If(cond, x, y)[source]
If is used to create a new column based on the condition cond. If cond is true, x is returned; otherwise y is returned.
>>> import fast_causal_inference.dataframe.functions as Fn
>>> df_new = df.withColumn('new_column', Fn.If(df['weight'] > 0.5, '>0.5', '<0.5'))
>>> df_new = df.withColumn('new_column', Fn.If('weight>0.5', 1, 0))
>>> df_new.show()
- dataframe.functions.L1Distance(x, y)[source]
L1Distance is used to calculate the L1 distance between column x and y.
>>> import fast_causal_inference.dataframe.functions as Fn
>>> df_new = df.withColumn('new_column', Fn.L1Distance('weight', 'height'))
>>> df_new.avg('new_column').show()
- dataframe.functions.L1Norm(x)[source]
L1Norm is used to calculate the L1 norm of column x.
>>> import fast_causal_inference.dataframe.functions as Fn
>>> df_new = df.withColumn('new_column', Fn.L1Norm('weight'))
>>> df_new.avg('new_column').show()
- dataframe.functions.L1Normalize(x)[source]
L1Normalize is used to normalize column x using L1 norm.
>>> import fast_causal_inference.dataframe.functions as Fn
>>> df_new = df.withColumn('new_column', Fn.L1Normalize('weight'))
>>> df_new.avg('new_column').show()
- dataframe.functions.L2Distance(x, y)[source]
L2Distance is used to calculate the L2 distance between column x and y.
>>> import fast_causal_inference.dataframe.functions as Fn
>>> df_new = df.withColumn('new_column', Fn.L2Distance('weight', 'height'))
>>> df_new.avg('new_column').show()
- dataframe.functions.L2Norm(x)[source]
L2Norm is used to calculate the L2 norm of column x.
>>> import fast_causal_inference.dataframe.functions as Fn
>>> df_new = df.withColumn('new_column', Fn.L2Norm('weight'))
>>> df_new.avg('new_column').show()
- dataframe.functions.L2Normalize(x)[source]
L2Normalize is used to normalize column x using L2 norm.
>>> import fast_causal_inference.dataframe.functions as Fn
>>> df_new = df.withColumn('new_column', Fn.L2Normalize('weight'))
>>> df_new.avg('new_column').show()
- dataframe.functions.L2SquaredDistance(x, y)[source]
L2SquaredDistance is used to calculate the squared L2 distance between column x and y.
>>> import fast_causal_inference.dataframe.functions as Fn
>>> df_new = df.withColumn('new_column', Fn.L2SquaredDistance('weight', 'height'))
>>> df_new.avg('new_column').show()
- dataframe.functions.LinfDistance(x, y)[source]
LinfDistance is used to calculate the L-infinity distance between column x and y.
>>> import fast_causal_inference.dataframe.functions as Fn
>>> df_new = df.withColumn('new_column', Fn.LinfDistance('weight', 'height'))
>>> df_new.avg('new_column').show()
- dataframe.functions.LinfNorm(x)[source]
LinfNorm is used to calculate the L-infinity norm of column x.
>>> import fast_causal_inference.dataframe.functions as Fn
>>> df_new = df.withColumn('new_column', Fn.LinfNorm('weight'))
>>> df_new.avg('new_column').show()
- dataframe.functions.LinfNormalize(x)[source]
LinfNormalize is used to normalize column x using L-infinity norm.
>>> import fast_causal_inference.dataframe.functions as Fn
>>> df_new = df.withColumn('new_column', Fn.LinfNormalize('weight'))
>>> df_new.avg('new_column').show()
- dataframe.functions.LpDistance(x, y)[source]
LpDistance is used to calculate the Lp distance between column x and y.
>>> import fast_causal_inference.dataframe.functions as Fn
>>> df_new = df.withColumn('new_column', Fn.LpDistance('weight', 'height', 2))
>>> df_new.avg('new_column').show()
- dataframe.functions.LpNorm(x)[source]
LpNorm is used to calculate the Lp norm of column x.
>>> import fast_causal_inference.dataframe.functions as Fn
>>> df_new = df.withColumn('new_column', Fn.LpNorm('weight', 2))
>>> df_new.avg('new_column').show()
- dataframe.functions.LpNormalize(x)[source]
LpNormalize is used to normalize column x using Lp norm.
>>> import fast_causal_inference.dataframe.functions as Fn
>>> df_new = df.withColumn('new_column', Fn.LpNormalize('weight', 2))
>>> df_new.avg('new_column').show()
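For reference, the norm and distance families documented above follow the standard mathematical definitions, which can be sketched in plain Python. The helper names below are hypothetical and operate on ordinary Python lists; they illustrate the formulas only and are not the library's implementation.

```python
def lp_norm(v, p):
    """Lp norm of a vector for finite p >= 1."""
    return sum(abs(a) ** p for a in v) ** (1.0 / p)

def linf_norm(v):
    """L-infinity norm: the largest absolute component."""
    return max(abs(a) for a in v)

def lp_distance(u, v, p):
    """Lp distance: the Lp norm of the component-wise difference."""
    return lp_norm([a - b for a, b in zip(u, v)], p)

def lp_normalize(v, p):
    """Scale v so that its Lp norm is 1 (undefined for the zero vector)."""
    n = lp_norm(v, p)
    return [a / n for a in v]

u, v = [3.0, 4.0], [0.0, 0.0]
print(lp_norm(u, 1))              # L1 norm: sum of absolute values
print(lp_norm(u, 2))              # L2 norm: Euclidean length
print(linf_norm(u))               # L-infinity norm
print(lp_distance(u, v, 2))       # L2 distance
print(lp_distance(u, v, 2) ** 2)  # squared L2 distance
print(lp_normalize(u, 1))         # components rescaled to unit L1 norm
```

The L1 and L2 variants are the special cases p = 1 and p = 2, and the L-infinity variant is the limit as p grows.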
- dataframe.functions.abs(col)[source]
abs is used to calculate the absolute value of a column.
>>> import fast_causal_inference.dataframe.functions as Fn
>>> df_new = df.withColumn('new_column', Fn.abs('weight'))
>>> df_new.avg('new_column').show()
- dataframe.functions.acos(x)[source]
acos is used to calculate the arccosine of column x.
>>> import fast_causal_inference.dataframe.functions as Fn
>>> df_new = df.withColumn('new_column', Fn.acos('weight'))
>>> df_new.avg('new_column').show()
- dataframe.functions.asin(x)[source]
asin is used to calculate the arcsine of column x.
>>> import fast_causal_inference.dataframe.functions as Fn
>>> df_new = df.withColumn('new_column', Fn.asin('weight'))
>>> df_new.avg('new_column').show()
- dataframe.functions.atan(x)[source]
atan is used to calculate the arctangent of column x.
>>> import fast_causal_inference.dataframe.functions as Fn
>>> df_new = df.withColumn('new_column', Fn.atan('weight'))
>>> df_new.avg('new_column').show()
- dataframe.functions.avg(col1)[source]
avg is used to calculate the average value of a column.
Example:
import fast_causal_inference
import fast_causal_inference.dataframe.functions as Fn
df = fast_causal_inference.readClickHouse('test_data_small')
df.avg('numerator').show()
df.groupBy('treatment').avg('numerator').show()
df.groupBy('treatment').agg(Fn.avg('numerator').alias('numerator')).show()
- dataframe.functions.cbrt(x)[source]
cbrt is used to calculate the cube root of column x.
>>> import fast_causal_inference.dataframe.functions as Fn
>>> df_new = df.withColumn('new_column', Fn.cbrt('weight'))
>>> df_new.avg('new_column').show()
- dataframe.functions.ceil(x)[source]
ceil is used to calculate the smallest integer greater than or equal to the column x.
>>> import fast_causal_inference.dataframe.functions as Fn
>>> df_new = df.withColumn('new_column', Fn.ceil('weight'))
>>> df_new.avg('new_column').show()
- dataframe.functions.corr(x, y)[source]
corr is used to calculate the correlation between two columns.
Example:
import fast_causal_inference
import fast_causal_inference.dataframe.functions as Fn
df = fast_causal_inference.readClickHouse('test_data_small')
df.corr('numerator', 'numerator_pre').show()
df.groupBy('treatment').agg(Fn.corr('numerator', 'numerator_pre').alias('numerator')).show()
df.groupBy('treatment').corr('numerator', 'numerator_pre').show()
- dataframe.functions.cos(x)[source]
cos is used to calculate the cosine of column x.
>>> import fast_causal_inference.dataframe.functions as Fn
>>> df_new = df.withColumn('new_column', Fn.cos('weight'))
>>> df_new.avg('new_column').show()
- dataframe.functions.cosineDistance(x, y)[source]
cosineDistance is used to calculate the cosine distance between column x and y.
>>> import fast_causal_inference.dataframe.functions as Fn
>>> df_new = df.withColumn('new_column', Fn.cosineDistance('weight', 'height'))
>>> df_new.avg('new_column').show()
- dataframe.functions.cosineSimilarity(x, y)[source]
cosineSimilarity is used to calculate the cosine similarity between column x and y.
>>> import fast_causal_inference.dataframe.functions as Fn
>>> df_new = df.withColumn('new_column', Fn.cosineSimilarity('weight', 'height'))
>>> df_new.avg('new_column').show()
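The relationship between cosine similarity and cosine distance can be sketched as follows. This assumes the usual convention that distance is one minus similarity; the helper functions are hypothetical and work on plain Python lists rather than DataFrame columns.

```python
import math

def cosine_similarity(u, v):
    """Dot product divided by the product of the L2 norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosine_distance(u, v):
    """Assumed convention: 1 - cosine similarity."""
    return 1.0 - cosine_similarity(u, v)

print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors
print(cosine_distance([1.0, 2.0], [2.0, 4.0]))    # parallel vectors
```

Under this convention, vectors pointing in the same direction have a distance near 0 and orthogonal vectors have a distance of 1.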
- dataframe.functions.count(*, expr='*')[source]
count is used to count the number of rows.
Example:
import fast_causal_inference
import fast_causal_inference.dataframe.functions as Fn
df = fast_causal_inference.readClickHouse('test_data_small')
df.count().show()
df.groupBy('treatment').count().show()
df.groupBy('treatment').agg(Fn.count().alias('numerator')).show()
- dataframe.functions.covarPop(x, y)[source]
covarPop is used to calculate the population covariance between two columns.
Example:
import fast_causal_inference
import fast_causal_inference.dataframe.functions as Fn
df = fast_causal_inference.readClickHouse('test_data_small')
df.covarPop('numerator', 'numerator_pre').show()
df.groupBy('treatment').agg(Fn.covarPop('numerator', 'numerator_pre').alias('numerator')).show()
df.groupBy('treatment').covarPop('numerator', 'numerator_pre').show()
- dataframe.functions.covarSamp(x, y)[source]
covarSamp is used to calculate the sample covariance between two columns.
Example:
import fast_causal_inference
import fast_causal_inference.dataframe.functions as Fn
df = fast_causal_inference.readClickHouse('test_data_small')
df.covarSamp('numerator', 'numerator_pre').show()
df.groupBy('treatment').agg(Fn.covarSamp('numerator', 'numerator_pre').alias('numerator')).show()
df.groupBy('treatment').covarSamp('numerator', 'numerator_pre').show()
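The difference between population and sample covariance is only the divisor: n for the population form and n - 1 for the sample form. A minimal plain-Python sketch of the textbook formulas follows; the `covariance` helper is hypothetical and is not how the engine computes the result.

```python
def covariance(xs, ys, sample=True):
    """Covariance of two equal-length sequences.

    sample=True divides by n - 1 (sample covariance);
    sample=False divides by n (population covariance).
    """
    n = len(xs)
    if n != len(ys):
        raise ValueError("xs and ys must have the same length")
    if n < (2 if sample else 1):
        raise ValueError("not enough data points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    total = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return total / (n - 1 if sample else n)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
print(covariance(xs, ys, sample=False))  # population covariance
print(covariance(xs, ys, sample=True))   # sample covariance
```

For large row counts the two values are nearly identical; the distinction matters mostly for small groups.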
- dataframe.functions.desc(column)[source]
desc is used to sort a column in descending order.
>>> import fast_causal_inference.dataframe.functions as Fn
>>> df_new = df.orderBy(Fn.desc('weight'))
>>> df_new.show()
- dataframe.functions.e()[source]
e is used to get the mathematical constant e.
>>> import fast_causal_inference.dataframe.functions as Fn
>>> df_new = df.withColumn('new_column', Fn.e())
>>> df_new.avg('new_column').show()
- dataframe.functions.erf(x)[source]
erf is used to calculate the error function of column x.
>>> import fast_causal_inference.dataframe.functions as Fn
>>> df_new = df.withColumn('new_column', Fn.erf('weight'))
>>> df_new.avg('new_column').show()
- dataframe.functions.erfc(x)[source]
erfc is used to calculate the complementary error function of column x.
>>> import fast_causal_inference.dataframe.functions as Fn
>>> df_new = df.withColumn('new_column', Fn.erfc('weight'))
>>> df_new.avg('new_column').show()
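As a quick sanity check of how erf and erfc relate, the same functions in Python's standard `math` module satisfy erfc(x) = 1 - erf(x). The snippet below uses only the standard library and does not call the DataFrame API.

```python
import math

x = 0.5
erf_x = math.erf(x)    # error function
erfc_x = math.erfc(x)  # complementary error function

# The two values sum to 1 up to floating-point rounding.
print(erf_x, erfc_x, erf_x + erfc_x)
```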
- dataframe.functions.exp(x)[source]
exp is used to calculate e raised to the power of column x.
>>> import fast_causal_inference.dataframe.functions as Fn
>>> df_new = df.withColumn('new_column', Fn.exp('weight'))
>>> df_new.avg('new_column').show()
- dataframe.functions.exp10(x)[source]
exp10 is used to calculate 10 raised to the power of column x.
>>> import fast_causal_inference.dataframe.functions as Fn
>>> df_new = df.withColumn('new_column', Fn.exp10('weight'))
>>> df_new.avg('new_column').show()
- dataframe.functions.exp2(x)[source]
exp2 is used to calculate 2 raised to the power of column x.
>>> import fast_causal_inference.dataframe.functions as Fn
>>> df_new = df.withColumn('new_column', Fn.exp2('weight'))
>>> df_new.avg('new_column').show()
- dataframe.functions.floor(x)[source]
floor is used to calculate the largest integer less than or equal to the column x.
>>> import fast_causal_inference.dataframe.functions as Fn
>>> df_new = df.withColumn('new_column', Fn.floor('weight'))
>>> df_new.avg('new_column').show()
- dataframe.functions.gcd(x, y)[source]
gcd is used to calculate the greatest common divisor of column x and y.
>>> import fast_causal_inference.dataframe.functions as Fn
>>> df_new = df.withColumn('new_column', Fn.gcd('weight', 'height'))
>>> df_new.avg('new_column').show()
- dataframe.functions.intExp10(x)[source]
intExp10 is used to calculate 10 raised to the power of column x, and the result is an integer.
>>> import fast_causal_inference.dataframe.functions as Fn
>>> df_new = df.withColumn('new_column', Fn.intExp10('weight'))
>>> df_new.avg('new_column').show()
- dataframe.functions.intExp2(x)[source]
intExp2 is used to calculate 2 raised to the power of column x, and the result is an integer.
>>> import fast_causal_inference.dataframe.functions as Fn
>>> df_new = df.withColumn('new_column', Fn.intExp2('weight'))
>>> df_new.avg('new_column').show()
- dataframe.functions.isnotnull(x)[source]
isnotnull is used to check if column x is not null.
>>> import fast_causal_inference.dataframe.functions as Fn
>>> df_new = df.withColumn('new_column', Fn.isnotnull('weight'))
>>> df_new.avg('new_column').show()
- dataframe.functions.isnull(x)[source]
isnull is used to check if column x is null.
>>> import fast_causal_inference.dataframe.functions as Fn
>>> df_new = df.withColumn('new_column', Fn.isnull('weight'))
>>> df_new.avg('new_column').show()
- dataframe.functions.lcm(x, y)[source]
lcm is used to calculate the least common multiple of column x and y.
>>> import fast_causal_inference.dataframe.functions as Fn
>>> df_new = df.withColumn('new_column', Fn.lcm('weight', 'height'))
>>> df_new.avg('new_column').show()
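The greatest common divisor and least common multiple are tied together by the identity lcm(a, b) * gcd(a, b) = |a * b|. A small standard-library sketch on plain integers follows; the `lcm` helper is hypothetical and independent of the DataFrame functions.

```python
import math

def lcm(a, b):
    """Least common multiple via lcm(a, b) * gcd(a, b) == |a * b|."""
    if a == 0 or b == 0:
        return 0
    return abs(a * b) // math.gcd(a, b)

print(math.gcd(12, 18))
print(lcm(12, 18))
```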
- dataframe.functions.lgamma(x)[source]
lgamma is used to calculate the log gamma function of column x.
>>> import fast_causal_inference.dataframe.functions as Fn
>>> df_new = df.withColumn('new_column', Fn.lgamma('weight'))
>>> df_new.avg('new_column').show()
- dataframe.functions.lit(*cols)[source]
lit is used to create a constant column.
>>> import fast_causal_inference.dataframe.functions as Fn
>>> df_new = df.withColumn('constant', Fn.lit(1))
- dataframe.functions.ln(x)[source]
ln is used to calculate the natural logarithm of column x.
>>> import fast_causal_inference.dataframe.functions as Fn
>>> df_new = df.withColumn('new_column', Fn.ln('weight'))
>>> df_new.avg('new_column').show()
- dataframe.functions.log(x)[source]
log is used to calculate the natural logarithm of column x.
>>> import fast_causal_inference.dataframe.functions as Fn
>>> df_new = df.withColumn('new_column', Fn.log('weight'))
>>> df_new.avg('new_column').show()
- dataframe.functions.log10(x)[source]
log10 is used to calculate the base 10 logarithm of column x.
>>> import fast_causal_inference.dataframe.functions as Fn
>>> df_new = df.withColumn('new_column', Fn.log10('weight'))
>>> df_new.avg('new_column').show()
- dataframe.functions.log2(x)[source]
log2 is used to calculate the base 2 logarithm of column x.
>>> import fast_causal_inference.dataframe.functions as Fn
>>> df_new = df.withColumn('new_column', Fn.log2('weight'))
>>> df_new.avg('new_column').show()
- dataframe.functions.max(*cols)[source]
max is used to calculate the maximum value of a column.
Example:
import fast_causal_inference
import fast_causal_inference.dataframe.functions as Fn
df = fast_causal_inference.readClickHouse('test_data_small')
df.max('numerator').show()
df.groupBy('treatment').max('numerator').show()
df.groupBy('treatment').agg(Fn.max('numerator').alias('numerator')).show()
- dataframe.functions.mean(col1)[source]
mean is used to calculate the mean of a column.
Example:
import fast_causal_inference
import fast_causal_inference.dataframe.functions as Fn
df = fast_causal_inference.readClickHouse('test_data_small')
df.mean('numerator').show()
df.groupBy('treatment').mean('numerator').show()
df.groupBy('treatment').agg(Fn.mean('numerator').alias('numerator')).show()
- dataframe.functions.min(*cols)[source]
min is used to calculate the minimum value of a column.
Example:
import fast_causal_inference
import fast_causal_inference.dataframe.functions as Fn
df = fast_causal_inference.readClickHouse('test_data_small')
df.min('numerator').show()
df.groupBy('treatment').min('numerator').show()
df.groupBy('treatment').agg(Fn.min('numerator').alias('numerator')).show()
- dataframe.functions.mod(x, y)[source]
mod is used to calculate the modulo of column x by y.
>>> import fast_causal_inference.dataframe.functions as Fn
>>> df_new = df.withColumn('new_column', Fn.mod('weight', 2))
>>> df_new.avg('new_column').show()
- dataframe.functions.murmur_hash3_32(x)[source]
murmur_hash3_32 is used to calculate the 32-bit murmur3 hash of column x.
>>> import fast_causal_inference.dataframe.functions as Fn
>>> df_new = df.withColumn('new_column', Fn.murmur_hash3_32('weight'))
>>> df_new.avg('new_column').show()
- dataframe.functions.murmur_hash3_64(x)[source]
murmur_hash3_64 is used to calculate the 64-bit murmur3 hash of column x.
>>> import fast_causal_inference.dataframe.functions as Fn
>>> df_new = df.withColumn('new_column', Fn.murmur_hash3_64('weight'))
>>> df_new.avg('new_column').show()
- dataframe.functions.pi()[source]
pi is used to get the mathematical constant pi.
>>> import fast_causal_inference.dataframe.functions as Fn
>>> df_new = df.withColumn('new_column', Fn.pi())
>>> df_new.avg('new_column').show()
- dataframe.functions.pow(x, y)[source]
pow is used to calculate the column x raised to the power y.
>>> import fast_causal_inference.dataframe.functions as Fn
>>> df_new = df.withColumn('new_column', Fn.pow('weight', 2))
>>> df_new.avg('new_column').show()
- dataframe.functions.power(x, y)[source]
power is used to calculate the column x raised to the power y.
>>> import fast_causal_inference.dataframe.functions as Fn
>>> df_new = df.withColumn('new_column', Fn.power('weight', 2))
>>> df_new.avg('new_column').show()
- dataframe.functions.quantile(x, *, level)[source]
quantile is used to calculate the quantile of a column.
Example:
import fast_causal_inference
import fast_causal_inference.dataframe.functions as Fn
df = fast_causal_inference.readClickHouse('test_data_small')
df.quantile('numerator', level=0.5).show()
df.groupBy('treatment').quantile('numerator', level=0.5).show()
df.groupBy('treatment').agg(Fn.quantile('numerator', level=0.5).alias('numerator')).show()
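To make the meaning of `level` concrete, here is a plain-Python sketch of an exact quantile with linear interpolation between neighboring order statistics. The `quantile` helper below is hypothetical, and the interpolation rule is an assumption chosen for illustration; the engine may use a different or approximate method, so results on real data can differ slightly.

```python
import math

def quantile(values, level):
    """Exact quantile with linear interpolation (illustrative only)."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0.0 <= level <= 1.0:
        raise ValueError("level must be between 0 and 1")
    ordered = sorted(values)
    # Fractional position of the requested level in the sorted data
    pos = level * (len(ordered) - 1)
    lo = math.floor(pos)
    hi = math.ceil(pos)
    frac = pos - lo
    return ordered[lo] + (ordered[hi] - ordered[lo]) * frac

print(quantile([1, 2, 3, 4, 5], 0.5))  # median of an odd-length list
print(quantile([1, 2, 3, 4], 0.5))     # median of an even-length list
```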
- dataframe.functions.rand()[source]
rand is used to generate a random number.
>>> import fast_causal_inference.dataframe.functions as Fn
>>> df_new = df.withColumn('new_column', Fn.rand())
>>> df_new.avg('new_column').show()
- dataframe.functions.round(x, n='')[source]
round is used to round column x to n decimal places.
>>> import fast_causal_inference.dataframe.functions as Fn
>>> df_new = df.withColumn('new_column', Fn.round('weight', 2))
>>> df_new.avg('new_column').show()
- dataframe.functions.sin(x)[source]
sin is used to calculate the sine of column x.
>>> import fast_causal_inference.dataframe.functions as Fn
>>> df_new = df.withColumn('new_column', Fn.sin('weight'))
>>> df_new.avg('new_column').show()
- dataframe.functions.sqrt(x)[source]
sqrt is used to calculate the square root of a column.
>>> import fast_causal_inference.dataframe.functions as Fn
>>> df_new = df.withColumn('new_column', Fn.sqrt('weight'))
>>> df_new.avg('new_column').show()
- dataframe.functions.stddevPop(col)[source]
stddevPop is used to calculate the population standard deviation of a column.
Example:
import fast_causal_inference
import fast_causal_inference.dataframe.functions as Fn
df = fast_causal_inference.readClickHouse('test_data_small')
df.stddevPop('numerator').show()
df.groupBy('treatment').stddevPop('numerator').show()
df.groupBy('treatment').agg(Fn.stddevPop('numerator').alias('numerator')).show()
df.groupBy('treatment').agg({'numerator':'stddevPop', 'numerator_pre':'stddevPop'}).show()
- dataframe.functions.stddevSamp(col)[source]
stddevSamp is used to calculate the sample standard deviation of a column.
Example:
import fast_causal_inference
import fast_causal_inference.dataframe.functions as Fn
df = fast_causal_inference.readClickHouse('test_data_small')
df.stddevSamp('numerator').show()
df.groupBy('treatment').stddevSamp('numerator').show()
df.groupBy('treatment').agg(Fn.stddevSamp('numerator').alias('numerator')).show()
df.groupBy('treatment').agg({'numerator':'stddevSamp', 'numerator_pre':'stddevSamp'}).show()
- dataframe.functions.sum(col1)[source]
sum is used to calculate the sum of a column.
Example:
import fast_causal_inference
import fast_causal_inference.dataframe.functions as Fn
df = fast_causal_inference.readClickHouse('test_data_small')
df.sum('numerator').show()
df.groupBy('treatment').sum('numerator').show()
df.groupBy('treatment').agg(Fn.sum('numerator').alias('numerator')).show()
- dataframe.functions.tan(x)[source]
tan is used to calculate the tangent of column x.
>>> import fast_causal_inference.dataframe.functions as Fn
>>> df_new = df.withColumn('new_column', Fn.tan('weight'))
>>> df_new.avg('new_column').show()
- dataframe.functions.tgamma(x)[source]
tgamma is used to calculate the gamma function of column x.
>>> import fast_causal_inference.dataframe.functions as Fn
>>> df_new = df.withColumn('new_column', Fn.tgamma('weight'))
>>> df_new.avg('new_column').show()
- dataframe.functions.varPop(col)[source]
varPop is used to calculate the population variance of a column.
Example:
import fast_causal_inference
import fast_causal_inference.dataframe.functions as Fn
df = fast_causal_inference.readClickHouse('test_data_small')
df.varPop('numerator').show()
df.groupBy('treatment').agg(Fn.varPop('numerator').alias('numerator')).show()
df.groupBy('treatment').varPop('numerator').show()
df.groupBy('treatment').agg({'numerator':'varPop', 'numerator_pre':'varPop'}).show()
- dataframe.functions.varSamp(col)[source]
varSamp is used to calculate the sample variance of a column.
Example:
import fast_causal_inference
import fast_causal_inference.dataframe.functions as Fn
df = fast_causal_inference.readClickHouse('test_data_small')
df.varSamp('numerator').show()
df.groupBy('treatment').agg(Fn.varSamp('numerator').alias('numerator')).show()
df.groupBy('treatment').varSamp('numerator').show()
df.groupBy('treatment').agg({'numerator':'varSamp', 'numerator_pre':'varSamp'}).show()
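The population and sample variants (varPop vs. varSamp, stddevPop vs. stddevSamp) differ in the divisor: n for the population form, n - 1 for the sample form. Python's standard `statistics` module offers the same pairs, which makes for a quick local comparison; the snippet is a stand-in for intuition and does not call the DataFrame API.

```python
import statistics

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

pop_var = statistics.pvariance(data)   # divides by n
samp_var = statistics.variance(data)   # divides by n - 1
pop_std = statistics.pstdev(data)      # square root of pop_var
samp_std = statistics.stdev(data)      # square root of samp_var

print(pop_var, samp_var)
print(pop_std, samp_std)
```

The sample forms are always at least as large as the population forms, and the gap shrinks as the number of rows grows.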
features
- class dataframe.features.Bucketizer[source]
This class is used for bucketizing continuous variables into discrete bins.
- fit(df, inputCols, splitsArray, outputCols=[], if_string=True)[source]
This function applies the bucketizing transformation to the specified columns of the input dataframe.
- Parameters:
df (DataFrame) – The input dataframe to be transformed.
inputCols (list) – A list of column names in the dataframe to be bucketized.
splitsArray (list) – A list of lists, where each inner list contains the split points for bucketizing the corresponding column in inputCols.
outputCols (list, optional) – A list of output column names after bucketizing. If not provided, ‘_buckets’ will be appended to the original column names.
if_string (bool, optional) – A flag indicating whether the bin values should be treated as strings. Default is True.
- Returns:
The transformed dataframe with bucketized columns.
- Return type:
DataFrame
Example
>>> import fast_causal_inference
>>> import fast_causal_inference.dataframe.features as Features
>>> df = fast_causal_inference.readClickHouse('test_data_small')
>>> bucketizer = Features.Bucketizer()
>>> df_new = bucketizer.fit(df,['x1','x2'],[[1,3],[0,2]],if_string=True)
>>> df_new.select('x1','x2','x1_buckets','x2_buckets').head(5).show()
             x1           x2 x1_buckets x2_buckets
0  -0.131301907 -3.152383354          1          0
1  -0.966931088 -0.427920835          1          0
2   1.257744217 -2.050358546      [1,3)          0
3  -0.777228042 -2.621604715          1          0
4  -0.669571385  0.606404768          1      [0,2)
>>> df_new = bucketizer.fit(df,['x1','x2'],[[1,3],[0,2]],if_string=False)
>>> df_new.select('x1','x2','x1_buckets','x2_buckets').head(5).show()
             x1           x2  x1_buckets  x2_buckets
0  -0.131301907 -3.152383354           1           1
1  -0.966931088 -0.427920835           1           1
2   1.257744217 -2.050358546           2           1
3  -0.777228042 -2.621604715           1           1
4  -0.669571385  0.606404768           1           2
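The integer bucket numbers shown for if_string=False in the example above are consistent with a 1-based index into left-closed intervals defined by the split points. A minimal sketch of that rule using the standard `bisect` module follows; the `bucket_index` helper is hypothetical and is inferred from the example output, not taken from the library's source.

```python
from bisect import bisect_right

def bucket_index(value, splits):
    """Return a 1-based bucket index for value.

    splits must be sorted ascending. Bucket 1 holds values below the
    first split; each later bucket is left-closed, e.g. [1, 3) for
    splits [1, 3]. Inferred from the example output, not from the
    library's source.
    """
    return bisect_right(splits, value) + 1

# Values taken from the example rows above
print(bucket_index(-0.131301907, [1, 3]))  # x1, row 0
print(bucket_index(1.257744217, [1, 3]))   # x1, row 2
print(bucket_index(0.606404768, [0, 2]))   # x2, row 4
```

How the library labels values at or above the last split point is not shown in the example, so that case is left as this sketch's own convention.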
- class dataframe.features.OneHotEncoder[source]
This class implements the OneHotEncoder method for causal inference.
Parameters
- cols : list, default=None
The columns to be one-hot encoded.
Methods
- fit(dataframe):
Apply the OneHotEncoder method to the input dataframe.
Example
import fast_causal_inference
import fast_causal_inference.dataframe.features as Features
df = fast_causal_inference.readClickHouse('test_data_small')
one_hot = Features.OneHotEncoder()
df_new = one_hot.fit(df, cols=['x_cat1'])
df_new.printSchema()
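For readers unfamiliar with one-hot encoding, the general idea is that each distinct category becomes its own 0/1 indicator column. A plain-Python sketch follows; the `one_hot` helper, the sorted category order, and the output layout are assumptions for illustration and say nothing about how the library names or orders its output columns.

```python
def one_hot(values):
    """Return (categories, rows) where each row is a list of 0/1 flags.

    Illustrative only: categories are sorted for a stable column order,
    which is an assumption and not documented library behavior.
    """
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

categories, rows = one_hot(['a', 'b', 'a', 'c'])
print(categories)
for row in rows:
    print(row)
```

Each encoded row contains exactly one 1, in the position of that row's category.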