Dataframe Operations

dataframe

class dataframe.dataframe.DataFrame(olap_engine=OlapEngineType.CLICKHOUSE)[source]

This class is used to create a DataFrame object.

describe(cols='*')[source]

Returns the summary statistics for the columns in the DataFrame.

>>> df.describe()
>>> df.describe(['x1','x2'])
#       count       avg       std       min  quantile_0.25  quantile_0.5  quantile_0.75  quantile_0.90  quantile_0.99      max
# x1  10000.0 -0.018434  0.987606 -3.740101       -0.68929     -0.028665       0.654210       1.274144       2.321097  3.80166
# x2  10000.0  0.021976  1.986209 -8.893264       -1.28461      0.015400       1.357618       2.583523       4.725829  7.19662
drop(*args)[source]

Drops specified columns from the DataFrame and returns a new DataFrame.

>>> new_df = df.drop('column1', 'column2')
>>> new_df = df.drop(['column1', 'column2'])
filter(filter)[source]

Alias for the ‘where’ function. Filters rows using the given condition.

>>> df.filter("column1 > 1").show()
first()[source]

Returns the first row of the DataFrame.

>>> df.first()
groupBy(*args)[source]

Groups the DataFrame by the specified columns and returns a new DataFrame.

Raises an Exception if the DataFrame is already in the need_agg state.

>>> new_df = df.groupBy('column1', 'column2')
>>> new_df = df.groupBy(['column1', 'column2'])
head(n)[source]

Returns the first n rows of the DataFrame.

>>> df.head(3)
orderBy(*args)[source]

Orders the DataFrame by the specified columns and returns a new DataFrame.

>>> import fast_causal_inference.dataframe.functions as Fn
>>> new_df = df.orderBy('column1', Fn.desc('column2'))
>>> new_df = df.orderBy(['column1', 'column2'])
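The mixed ordering in the example (ascending on one column, descending on another) can be illustrated in plain Python. This is only a sketch of the resulting row order on in-memory rows, not the library's implementation; the sample rows are made up for illustration.

```python
# Rows as plain dicts; the values are illustrative only.
rows = [
    {"column1": 2, "column2": 5},
    {"column1": 1, "column2": 3},
    {"column1": 1, "column2": 9},
]

# Ascending on column1, descending on column2 (numeric, so the key is negated).
ordered = sorted(rows, key=lambda r: (r["column1"], -r["column2"]))
print([(r["column1"], r["column2"]) for r in ordered])  # [(1, 9), (1, 3), (2, 5)]
```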
printSchema()[source]

Prints the schema of the DataFrame.

sample(fraction)[source]

This function samples a fraction of rows without replacement from the DataFrame.

>>> df1 = df.sample(1000)
>>> df1.count().show()
>>> df2 = df.sample(0.5)
>>> df2.count().show()
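The two calls above pass either a row count (1000) or a fraction (0.5). The following plain-Python sketch shows sampling without replacement under the assumption, inferred only from that example, that values below 1 are read as fractions and larger values as row counts. `sample_rows` is a hypothetical helper, not part of the library.

```python
import random

def sample_rows(rows, fraction, seed=0):
    """Hypothetical helper: sample rows without replacement."""
    # Assumption: values below 1 are fractions, otherwise absolute row counts.
    n = int(len(rows) * fraction) if fraction < 1 else int(fraction)
    n = min(n, len(rows))
    return random.Random(seed).sample(rows, n)

rows = list(range(10000))
half = sample_rows(rows, 0.5)
thousand = sample_rows(rows, 1000)
print(len(half), len(thousand))  # 5000 1000
```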
select(*args)[source]

Selects specified columns from the DataFrame and returns a new DataFrame.

>>> new_df = df.select('column1', 'column2')
>>> new_df = df.select(['column1', 'column2'])

show()[source]

Prints the DataFrame, equivalent to print(dataframe).

>>> df.head(3).show()
split(test_size=0.5)[source]

This function splits the DataFrame into two DataFrames.

>>> df_train, df_test = df.split(0.5)
>>> print(df_train.count())
>>> print(df_test.count())
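As a rough illustration of what such a split does, here is a plain-Python sketch that shuffles rows and cuts them into two disjoint parts. It assumes the split is random and that test_size is the share of rows in the second part; the source states neither, and `split_rows` is a hypothetical helper.

```python
import random

def split_rows(rows, test_size=0.5, seed=0):
    """Hypothetical helper: shuffle rows and cut them into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_size))
    return shuffled[n_test:], shuffled[:n_test]

train, test = split_rows(range(100), test_size=0.3)
print(len(train), len(test))  # 70 30
```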
take(n)[source]

Returns the first n rows of the DataFrame.

>>> df.take(3)
toClickHouse(clickhouse_table_name)[source]

Writes the data of the ClickHouse table behind this DataFrame to another ClickHouse table.

Example

>>> df.toClickHouse("new_table")
toCsv(csv_file_abs_path)[source]

Exports the data from the ClickHouse table to a CSV file at the given absolute path.

>>> df.toCsv("/path/to/output.csv")
toPandas()[source]

Converts the result of the DataFrame to a pandas.DataFrame.

toSparkDf()[source]

Converts the ClickHouse table to a Spark DataFrame.

Example

>>> import fast_causal_inference
>>> fast_causal_inference.set_default(tenant_id="",tenant_secret_key="")
>>> spark = fast_causal_inference.set_spark_session(group_id='', gaia_id='') # for group_id and gaia_id, see the default notebook file "!Spark 资源池.html" (Spark resource pool)
>>> df = fast_causal_inference.readClickHouse('test_data_small')
>>> spark_df = df.toSparkDf()
toTdw(tdw_database, tdw_table, tdw_user=None, tdw_passward=None, group='tl', is_drop_table=False, overwrite=True, priPart=None)[source]

Exports the ClickHouse table to a TDW-thive table.

Parameters

tdw_database (str):

The name of the TDW database.

tdw_table (str):

The name of the TDW table.

tdw_user (str, optional):

The username for TDW. Default is None.

tdw_passward (str, optional):

The password for TDW. Default is None.

group (str, optional):

The group for TDW. Default is ‘tl’.

is_drop_table (bool, optional):

Whether to drop the existing TDW table. Default is False.

overwrite (bool, optional):

Whether to overwrite the existing data in the TDW table. Default is True.

priPart (str, optional):

The primary partition for the TDW table. Default is None.

Example

>>> df.toTdw("tdw_database", "tdw_table", group='tl')
>>> df.toTdw("tdw_database", "tdw_table", group='tl',priPart=['p_20220222'])
union(df)[source]

This function is used to union two DataFrames. The two DataFrames must have the same number of columns, and the columns must have the same names and order.

>>> df1 = df1.union(df2)
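The schema requirement can be sketched in plain Python as a check that runs before the rows are concatenated. This sketch assumes duplicates are kept (UNION ALL semantics), which the source does not state; `union_rows` is a hypothetical helper.

```python
def union_rows(cols_a, rows_a, cols_b, rows_b):
    """Hypothetical helper: append rows_b to rows_a if the schemas match."""
    # Same number of columns, same names, same order.
    if list(cols_a) != list(cols_b):
        raise ValueError("column names and order must match")
    return list(rows_a) + list(rows_b)

combined = union_rows(["x1", "x2"], [(1, 2)], ["x1", "x2"], [(3, 4), (1, 2)])
print(combined)  # [(1, 2), (3, 4), (1, 2)]
```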

where(filter)[source]

Filters rows using the given condition.

>>> df.where("column1 > 1").show()
withColumn(new_column, func)[source]

This function adds a new column to the DataFrame.

Example

import fast_causal_inference.dataframe.functions as Fn
import fast_causal_inference.dataframe.statistics as S

df1 = df.select('x1','x2','x3', 'numerator')

# Method 1: Select columns through df's index: df['col']
df1 = df1.withColumn('new_col', Fn.sqrt(df1['numerator']))
df1.show()

# Method 2: Select columns directly through string: Fn.col('new_col')
df1 = df1.withColumn('new_col2', Fn.pow('new_col', 2))
df1.show()

df1 = df1.withColumn('new_col3', Fn.col('new_col') * Fn.col('new_col'))
df1.show()

# Add constant
df2 = df1.withColumn('c1', Fn.lit(1))
df2 = df2.withColumn('c2', Fn.lit('1'))
df2.show()

# Nesting
df2 = df1.withColumn('c1', Fn.pow(Fn.sqrt(Fn.sqrt(Fn.col('x1'))), 2))
df2.show()

# +-*/% operations
df2 = df1.withColumn('c1', 22 + df1['x1'] / 2 + 2 / df1['x2'] * df1['x3'] % 2 - (df1['x2']))
df2.show()
df2 = df1.withColumn('c1', Fn.col('x1 + x2 * x3 + x3'))
df2.show()

# if
df2 = df1.withColumn('cc1', 'if(x1 > 0, 1, -1)')
df2.show()
df2 = df1.withColumn('cc1', Fn.If('x1 > 0',1,-1))
df2.show()
withColumnRenamed(old_name, new_name)[source]

Returns a new DataFrame by renaming an existing column.

>>> df.withColumnRenamed("column1", "new_column1").show()

functions

dataframe.functions.If(cond, x, y)[source]

If is used to create a new column based on the condition cond. If cond is true, x is returned; otherwise y is returned.

>>> import fast_causal_inference.dataframe.functions as Fn
>>> df_new = df.withColumn('new_column', Fn.If(df['weight'] > 0.5, '>0.5', '<0.5'))
>>> df_new = df.withColumn('new_column', Fn.If('weight>0.5', 1, 0))
>>> df_new.show()
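Row by row, the conditional behaves like an ordinary ternary expression. The sketch below mimics the first example on a small list of weights in plain Python; `if_` is a hypothetical stand-in, not the library function.

```python
def if_(cond, x, y):
    """Hypothetical stand-in: return x when cond is true, otherwise y."""
    return x if cond else y

weights = [0.2, 0.7, 0.5]
labels = [if_(w > 0.5, ">0.5", "<0.5") for w in weights]
print(labels)  # ['<0.5', '>0.5', '<0.5']
```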
dataframe.functions.L1Distance(x, y)[source]

L1Distance is used to calculate the L1 distance between column x and y.

>>> import fast_causal_inference.dataframe.functions as Fn
>>> df_new = df.withColumn('new_column', Fn.L1Distance('weight', 'height'))
>>> df_new.avg('new_column').show()
dataframe.functions.L1Norm(x)[source]

L1Norm is used to calculate the L1 norm of column x.

>>> import fast_causal_inference.dataframe.functions as Fn
>>> df_new = df.withColumn('new_column', Fn.L1Norm('weight'))
>>> df_new.avg('new_column').show()
dataframe.functions.L1Normalize(x)[source]

L1Normalize is used to normalize column x using L1 norm.

>>> import fast_causal_inference.dataframe.functions as Fn
>>> df_new = df.withColumn('new_column', Fn.L1Normalize('weight'))
>>> df_new.avg('new_column').show()
dataframe.functions.L2Distance(x, y)[source]

L2Distance is used to calculate the L2 distance between column x and y.

>>> import fast_causal_inference.dataframe.functions as Fn
>>> df_new = df.withColumn('new_column', Fn.L2Distance('weight', 'height'))
>>> df_new.avg('new_column').show()
dataframe.functions.L2Norm(x)[source]

L2Norm is used to calculate the L2 norm of column x.

>>> import fast_causal_inference.dataframe.functions as Fn
>>> df_new = df.withColumn('new_column', Fn.L2Norm('weight'))
>>> df_new.avg('new_column').show()
dataframe.functions.L2Normalize(x)[source]

L2Normalize is used to normalize column x using L2 norm.

>>> import fast_causal_inference.dataframe.functions as Fn
>>> df_new = df.withColumn('new_column', Fn.L2Normalize('weight'))
>>> df_new.avg('new_column').show()
dataframe.functions.L2SquaredDistance(x, y)[source]

L2SquaredDistance is used to calculate the squared L2 distance between column x and y.

>>> import fast_causal_inference.dataframe.functions as Fn
>>> df_new = df.withColumn('new_column', Fn.L2SquaredDistance('weight', 'height'))
>>> df_new.avg('new_column').show()
dataframe.functions.LinfDistance(x, y)[source]

LinfDistance is used to calculate the L-infinity distance between column x and y.

>>> import fast_causal_inference.dataframe.functions as Fn
>>> df_new = df.withColumn('new_column', Fn.LinfDistance('weight', 'height'))
>>> df_new.avg('new_column').show()
dataframe.functions.LinfNorm(x)[source]

LinfNorm is used to calculate the L-infinity norm of column x.

>>> import fast_causal_inference.dataframe.functions as Fn
>>> df_new = df.withColumn('new_column', Fn.LinfNorm('weight'))
>>> df_new.avg('new_column').show()
dataframe.functions.LinfNormalize(x)[source]

LinfNormalize is used to normalize column x using L-infinity norm.

>>> import fast_causal_inference.dataframe.functions as Fn
>>> df_new = df.withColumn('new_column', Fn.LinfNormalize('weight'))
>>> df_new.avg('new_column').show()
dataframe.functions.LpDistance(x, y)[source]

LpDistance is used to calculate the Lp distance between column x and y.

>>> import fast_causal_inference.dataframe.functions as Fn
>>> df_new = df.withColumn('new_column', Fn.LpDistance('weight', 'height', 2))
>>> df_new.avg('new_column').show()
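The distance functions in this family (L1Distance, L2Distance, L2SquaredDistance, LinfDistance, LpDistance) follow the standard Lp definitions. The plain-Python sketch below assumes each value is a numeric vector of equal length; `lp_distance` is a hypothetical helper written only to show the formulas.

```python
def lp_distance(x, y, p):
    """Hypothetical helper: Lp distance between two equal-length vectors."""
    if p == float("inf"):
        # L-infinity: the largest absolute coordinate difference.
        return max(abs(a - b) for a, b in zip(x, y))
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

x, y = (1.0, 2.0, 3.0), (4.0, 6.0, 3.0)
print(lp_distance(x, y, 1))             # 7.0
print(lp_distance(x, y, 2))             # 5.0
print(lp_distance(x, y, float("inf")))  # 4.0
```

The squared L2 distance is the sum of squared differences without the final root, 25.0 for these vectors.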
dataframe.functions.LpNorm(x)[source]

LpNorm is used to calculate the Lp norm of column x.

>>> import fast_causal_inference.dataframe.functions as Fn
>>> df_new = df.withColumn('new_column', Fn.LpNorm('weight', 2))
>>> df_new.avg('new_column').show()
dataframe.functions.LpNormalize(x)[source]

LpNormalize is used to normalize column x using Lp norm.

>>> import fast_causal_inference.dataframe.functions as Fn
>>> df_new = df.withColumn('new_column', Fn.LpNormalize('weight', 2))
>>> df_new.avg('new_column').show()
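The norm and normalize functions can be sketched the same way: a norm reduces a vector to a single magnitude, and normalizing divides each component by that magnitude. This assumes numeric vectors as input; `lp_norm` and `lp_normalize` are hypothetical helpers, and the zero vector is not handled.

```python
def lp_norm(x, p):
    """Hypothetical helper: Lp norm of a numeric vector."""
    if p == float("inf"):
        return max(abs(a) for a in x)
    return sum(abs(a) ** p for a in x) ** (1.0 / p)

def lp_normalize(x, p):
    """Hypothetical helper: scale x so that its Lp norm is 1."""
    norm = lp_norm(x, p)
    return tuple(a / norm for a in x)

v = (3.0, -4.0)
print(lp_norm(v, 1), lp_norm(v, 2), lp_norm(v, float("inf")))  # 7.0 5.0 4.0
print(lp_normalize(v, 2))  # (0.6, -0.8)
```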
dataframe.functions.abs(col)[source]

abs is used to calculate the absolute value of a column.

>>> import fast_causal_inference.dataframe.functions as Fn
>>> df_new = df.withColumn('new_column', Fn.abs('weight'))
>>> df_new.avg('new_column').show()
dataframe.functions.acos(x)[source]

acos is used to calculate the arccosine of column x.

>>> import fast_causal_inference.dataframe.functions as Fn
>>> df_new = df.withColumn('new_column', Fn.acos('weight'))
>>> df_new.avg('new_column').show()
dataframe.functions.any(col)[source]

any is an aggregate function that returns a single, arbitrarily chosen value of a column.

dataframe.functions.asin(x)[source]

asin is used to calculate the arcsine of column x.

>>> import fast_causal_inference.dataframe.functions as Fn
>>> df_new = df.withColumn('new_column', Fn.asin('weight'))
>>> df_new.avg('new_column').show()
dataframe.functions.atan(x)[source]

atan is used to calculate the arctangent of column x.

>>> import fast_causal_inference.dataframe.functions as Fn
>>> df_new = df.withColumn('new_column', Fn.atan('weight'))
>>> df_new.avg('new_column').show()
dataframe.functions.avg(col1)[source]

avg is used to calculate the average value of a column.

Example:

import fast_causal_inference.dataframe.functions as Fn
df = fast_causal_inference.readClickHouse('test_data_small')
df.avg('numerator').show()
df.groupBy('treatment').avg('numerator').show()
df.groupBy('treatment').agg(Fn.avg('numerator').alias('numerator')).show()
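What a grouped average computes can be shown on in-memory rows. This is a plain-Python sketch of the arithmetic only; `group_avg` is a hypothetical helper and the rows are made up.

```python
from collections import defaultdict

def group_avg(rows, key, value):
    """Hypothetical helper: average of `value` within each `key` group."""
    totals = defaultdict(lambda: [0.0, 0])  # key -> [running sum, row count]
    for row in rows:
        totals[row[key]][0] += row[value]
        totals[row[key]][1] += 1
    return {k: s / n for k, (s, n) in totals.items()}

rows = [
    {"treatment": 0, "numerator": 1.0},
    {"treatment": 0, "numerator": 3.0},
    {"treatment": 1, "numerator": 10.0},
]
print(group_avg(rows, "treatment", "numerator"))  # {0: 2.0, 1: 10.0}
```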
dataframe.functions.cbrt(x)[source]

cbrt is used to calculate the cube root of column x.

>>> import fast_causal_inference.dataframe.functions as Fn
>>> df_new = df.withColumn('new_column', Fn.cbrt('weight'))
>>> df_new.avg('new_column').show()
dataframe.functions.ceil(x)[source]

ceil is used to calculate the smallest integer greater than or equal to the column x.

>>> import fast_causal_inference.dataframe.functions as Fn
>>> df_new = df.withColumn('new_column', Fn.ceil('weight'))
>>> df_new.avg('new_column').show()
dataframe.functions.corr(x, y)[source]

corr is used to calculate the correlation between two columns.

Example:

import fast_causal_inference.dataframe.functions as Fn
df = fast_causal_inference.readClickHouse('test_data_small')
df.corr('numerator', 'numerator_pre').show()
df.groupBy('treatment').agg(Fn.corr('numerator', 'numerator_pre').alias('numerator')).show()
df.groupBy('treatment').corr('numerator', 'numerator_pre').show()
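Assuming corr denotes the Pearson correlation coefficient (the source does not name the variant), the computation can be sketched in plain Python as follows. `pearson_corr` is a hypothetical helper.

```python
import math

def pearson_corr(xs, ys):
    """Hypothetical helper: Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

xs = [1.0, 2.0, 3.0, 4.0]
print(pearson_corr(xs, [2.0, 4.0, 6.0, 8.0]))  # close to 1.0 (perfect positive)
print(pearson_corr(xs, [8.0, 6.0, 4.0, 2.0]))  # close to -1.0 (perfect negative)
```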
dataframe.functions.cos(x)[source]

cos is used to calculate the cosine of column x.

>>> import fast_causal_inference.dataframe.functions as Fn
>>> df_new = df.withColumn('new_column', Fn.cos('weight'))
>>> df_new.avg('new_column').show()
dataframe.functions.cosineDistance(x, y)[source]

cosineDistance is used to calculate the cosine distance between column x and y.

>>> import fast_causal_inference.dataframe.functions as Fn
>>> df_new = df.withColumn('new_column', Fn.cosineDistance('weight', 'height'))
>>> df_new.avg('new_column').show()
dataframe.functions.cosineSimilarity(x, y)[source]

cosineSimilarity is used to calculate the cosine similarity between column x and y.

>>> import fast_causal_inference.dataframe.functions as Fn
>>> df_new = df.withColumn('new_column', Fn.cosineSimilarity('weight', 'height'))
>>> df_new.avg('new_column').show()
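Under the standard definitions, cosine similarity is the dot product divided by the product of the L2 norms, and cosine distance is one minus that similarity. The plain-Python sketch below assumes numeric, non-zero vectors; both helper names are hypothetical.

```python
import math

def cosine_similarity(x, y):
    """Hypothetical helper: dot(x, y) / (|x| * |y|) for non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def cosine_distance(x, y):
    """Hypothetical helper: 1 - cosine similarity."""
    return 1.0 - cosine_similarity(x, y)

print(cosine_similarity((1.0, 0.0), (0.0, 1.0)))  # 0.0 (orthogonal vectors)
print(cosine_distance((1.0, 0.0), (0.0, 1.0)))    # 1.0
```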
dataframe.functions.count(*, expr='*')[source]

count is used to count the number of rows.

Example:

import fast_causal_inference.dataframe.functions as Fn
df = fast_causal_inference.readClickHouse('test_data_small')
df.count().show()
df.groupBy('treatment').count().show()
df.groupBy('treatment').agg(Fn.count().alias('numerator')).show()
dataframe.functions.covarPop(x, y)[source]

covarPop is used to calculate the population covariance between two columns.

Example:

import fast_causal_inference.dataframe.functions as Fn
df = fast_causal_inference.readClickHouse('test_data_small')
df.covarPop('numerator', 'numerator_pre').show()
df.groupBy('treatment').agg(Fn.covarPop('numerator', 'numerator_pre').alias('numerator')).show()
df.groupBy('treatment').covarPop('numerator', 'numerator_pre').show()
dataframe.functions.covarSamp(x, y)[source]

covarSamp is used to calculate the sample covariance between two columns.

Example:

import fast_causal_inference.dataframe.functions as Fn
df = fast_causal_inference.readClickHouse('test_data_small')
df.covarSamp('numerator', 'numerator_pre').show()
df.groupBy('treatment').agg(Fn.covarSamp('numerator', 'numerator_pre').alias('numerator')).show()
df.groupBy('treatment').covarSamp('numerator', 'numerator_pre').show()
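Population and sample covariance differ only in the divisor: n for the population form, n - 1 for the sample form. The plain-Python sketch below shows both; `covariance` is a hypothetical helper, not the library function.

```python
def covariance(xs, ys, sample=False):
    """Hypothetical helper: population (n) or sample (n - 1) covariance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    total = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return total / (n - 1 if sample else n)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
cov_pop = covariance(xs, ys)                # 10 / 4
cov_samp = covariance(xs, ys, sample=True)  # 10 / 3
print(cov_pop)  # 2.5
```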
dataframe.functions.desc(column)[source]

desc is used to sort a column in descending order.

>>> import fast_causal_inference.dataframe.functions as Fn
>>> df_new = df.orderBy(Fn.desc('weight'))
>>> df_new.show()
dataframe.functions.e()[source]

e is used to get the mathematical constant e.

>>> import fast_causal_inference.dataframe.functions as Fn
>>> df_new = df.withColumn('new_column', Fn.e())
>>> df_new.avg('new_column').show()
dataframe.functions.erf(x)[source]

erf is used to calculate the error function of column x.

>>> import fast_causal_inference.dataframe.functions as Fn
>>> df_new = df.withColumn('new_column', Fn.erf('weight'))
>>> df_new.avg('new_column').show()
dataframe.functions.erfc(x)[source]

erfc is used to calculate the complementary error function of column x.

>>> import fast_causal_inference.dataframe.functions as Fn
>>> df_new = df.withColumn('new_column', Fn.erfc('weight'))
>>> df_new.avg('new_column').show()
dataframe.functions.exp(x)[source]

exp is used to calculate e raised to the power of column x.

>>> import fast_causal_inference.dataframe.functions as Fn
>>> df_new = df.withColumn('new_column', Fn.exp('weight'))
>>> df_new.avg('new_column').show()
dataframe.functions.exp10(x)[source]

exp10 is used to calculate 10 raised to the power of column x.

>>> import fast_causal_inference.dataframe.functions as Fn
>>> df_new = df.withColumn('new_column', Fn.exp10('weight'))
>>> df_new.avg('new_column').show()
dataframe.functions.exp2(x)[source]

exp2 is used to calculate 2 raised to the power of column x.

>>> import fast_causal_inference.dataframe.functions as Fn
>>> df_new = df.withColumn('new_column', Fn.exp2('weight'))
>>> df_new.avg('new_column').show()
dataframe.functions.floor(x)[source]

floor is used to calculate the largest integer less than or equal to the column x.

>>> import fast_causal_inference.dataframe.functions as Fn
>>> df_new = df.withColumn('new_column', Fn.floor('weight'))
>>> df_new.avg('new_column').show()
dataframe.functions.gcd(x, y)[source]

gcd is used to calculate the greatest common divisor of column x and y.

>>> import fast_causal_inference.dataframe.functions as Fn
>>> df_new = df.withColumn('new_column', Fn.gcd('weight', 'height'))
>>> df_new.avg('new_column').show()
dataframe.functions.intExp10(x)[source]

intExp10 is used to calculate 10 raised to the power of column x, and the result is an integer.

>>> import fast_causal_inference.dataframe.functions as Fn
>>> df_new = df.withColumn('new_column', Fn.intExp10('weight'))
>>> df_new.avg('new_column').show()
dataframe.functions.intExp2(x)[source]

intExp2 is used to calculate 2 raised to the power of column x, and the result is an integer.

>>> import fast_causal_inference.dataframe.functions as Fn
>>> df_new = df.withColumn('new_column', Fn.intExp2('weight'))
>>> df_new.avg('new_column').show()
dataframe.functions.isnotnull(x)[source]

isnotnull is used to check if column x is not null.

>>> import fast_causal_inference.dataframe.functions as Fn
>>> df_new = df.withColumn('new_column', Fn.isnotnull('weight'))
>>> df_new.avg('new_column').show()
dataframe.functions.isnull(x)[source]

isnull is used to check if column x is null.

>>> import fast_causal_inference.dataframe.functions as Fn
>>> df_new = df.withColumn('new_column', Fn.isnull('weight'))
>>> df_new.avg('new_column').show()
dataframe.functions.lcm(x, y)[source]

lcm is used to calculate the least common multiple of column x and y.

>>> import fast_causal_inference.dataframe.functions as Fn
>>> df_new = df.withColumn('new_column', Fn.lcm('weight', 'height'))
>>> df_new.avg('new_column').show()
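For reference, gcd and lcm are tied together by the identity lcm(a, b) = |a * b| / gcd(a, b) for non-zero integers. A plain-Python sketch on scalars, where `lcm` is a hypothetical helper built on the standard library's math.gcd:

```python
import math

def lcm(a, b):
    """Least common multiple via gcd; returns 0 if either input is 0."""
    if a == 0 or b == 0:
        return 0
    return abs(a * b) // math.gcd(a, b)

print(math.gcd(12, 18), lcm(12, 18))  # 6 36
```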
dataframe.functions.lgamma(x)[source]

lgamma is used to calculate the log gamma function of column x.

>>> import fast_causal_inference.dataframe.functions as Fn
>>> df_new = df.withColumn('new_column', Fn.lgamma('weight'))
>>> df_new.avg('new_column').show()
dataframe.functions.lit(*cols)[source]

lit is used to create a constant column.

>>> import fast_causal_inference.dataframe.functions as Fn
>>> df_new = df.withColumn('constant', Fn.lit(1))
dataframe.functions.ln(x)[source]

ln is used to calculate the natural logarithm of column x.

>>> import fast_causal_inference.dataframe.functions as Fn
>>> df_new = df.withColumn('new_column', Fn.ln('weight'))
>>> df_new.avg('new_column').show()
dataframe.functions.log(x)[source]

log is used to calculate the natural logarithm of column x.

>>> import fast_causal_inference.dataframe.functions as Fn
>>> df_new = df.withColumn('new_column', Fn.log('weight'))
>>> df_new.avg('new_column').show()
dataframe.functions.log10(x)[source]

log10 is used to calculate the base 10 logarithm of column x.

>>> import fast_causal_inference.dataframe.functions as Fn
>>> df_new = df.withColumn('new_column', Fn.log10('weight'))
>>> df_new.avg('new_column').show()
dataframe.functions.log2(x)[source]

log2 is used to calculate the base 2 logarithm of column x.

>>> import fast_causal_inference.dataframe.functions as Fn
>>> df_new = df.withColumn('new_column', Fn.log2('weight'))
>>> df_new.avg('new_column').show()
dataframe.functions.max(*cols)[source]

max is used to calculate the maximum value of a column.

Example:

import fast_causal_inference.dataframe.functions as Fn
df = fast_causal_inference.readClickHouse('test_data_small')
df.max('numerator').show()
df.groupBy('treatment').max('numerator').show()
df.groupBy('treatment').agg(Fn.max('numerator').alias('numerator')).show()
dataframe.functions.mean(col1)[source]

mean is used to calculate the mean of a column.

Example:

import fast_causal_inference.dataframe.functions as Fn
df = fast_causal_inference.readClickHouse('test_data_small')
df.mean('numerator').show()
df.groupBy('treatment').mean('numerator').show()
df.groupBy('treatment').agg(Fn.mean('numerator').alias('numerator')).show()
dataframe.functions.min(*cols)[source]

min is used to calculate the minimum value of a column.

Example:

import fast_causal_inference.dataframe.functions as Fn
df = fast_causal_inference.readClickHouse('test_data_small')
df.min('numerator').show()
df.groupBy('treatment').min('numerator').show()
df.groupBy('treatment').agg(Fn.min('numerator').alias('numerator')).show()
dataframe.functions.mod(x, y)[source]

mod is used to calculate the modulo of column x by y.

>>> import fast_causal_inference.dataframe.functions as Fn
>>> df_new = df.withColumn('new_column', Fn.mod('weight', 2))
>>> df_new.avg('new_column').show()
dataframe.functions.murmur_hash3_32(x)[source]

murmur_hash3_32 is used to calculate the 32-bit murmur3 hash of column x.

>>> import fast_causal_inference.dataframe.functions as Fn
>>> df_new = df.withColumn('new_column', Fn.murmur_hash3_32('weight'))
>>> df_new.avg('new_column').show()
dataframe.functions.murmur_hash3_64(x)[source]

murmur_hash3_64 is used to calculate the 64-bit murmur3 hash of column x.

>>> import fast_causal_inference.dataframe.functions as Fn
>>> df_new = df.withColumn('new_column', Fn.murmur_hash3_64('weight'))
>>> df_new.avg('new_column').show()
dataframe.functions.pi()[source]

pi is used to get the mathematical constant pi.

>>> import fast_causal_inference.dataframe.functions as Fn
>>> df_new = df.withColumn('new_column', Fn.pi())
>>> df_new.avg('new_column').show()
dataframe.functions.pow(x, y)[source]

pow is used to calculate the column x raised to the power y.

>>> import fast_causal_inference.dataframe.functions as Fn
>>> df_new = df.withColumn('new_column', Fn.pow('weight', 2))
>>> df_new.avg('new_column').show()
dataframe.functions.power(x, y)[source]

power is used to calculate the column x raised to the power y.

>>> import fast_causal_inference.dataframe.functions as Fn
>>> df_new = df.withColumn('new_column', Fn.power('weight', 2))
>>> df_new.avg('new_column').show()
dataframe.functions.quantile(x, *, level)[source]

quantile is used to calculate the quantile of a column.

Example:

import fast_causal_inference.dataframe.functions as Fn
df = fast_causal_inference.readClickHouse('test_data_small')
df.quantile('numerator', level=0.5).show()
df.groupBy('treatment').quantile('numerator', level=0.5).show()
df.groupBy('treatment').agg(Fn.quantile('numerator', level=0.5).alias('numerator')).show()
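The level argument is the quantile position between 0 and 1, so level=0.5 asks for the median. The plain-Python sketch below computes an exact quantile with linear interpolation between neighbouring sorted values. This is an assumption for illustration only: the source does not say which method the engine uses, and engines often return approximate quantiles. `quantile` here is a hypothetical helper.

```python
def quantile(values, level):
    """Hypothetical helper: exact quantile with linear interpolation."""
    s = sorted(values)
    pos = level * (len(s) - 1)    # fractional index into the sorted values
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)

print(quantile([4.0, 1.0, 3.0, 2.0], 0.5))        # 2.5
print(quantile([1.0, 2.0, 3.0, 4.0, 5.0], 0.25))  # 2.0
```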
dataframe.functions.rand()[source]

rand is used to generate a random number.

>>> import fast_causal_inference.dataframe.functions as Fn
>>> df_new = df.withColumn('new_column', Fn.rand())
>>> df_new.avg('new_column').show()
dataframe.functions.round(x, n='')[source]

round is used to round column x to n decimal places.

>>> import fast_causal_inference.dataframe.functions as Fn
>>> df_new = df.withColumn('new_column', Fn.round('weight', 2))
>>> df_new.avg('new_column').show()
dataframe.functions.sin(x)[source]

sin is used to calculate the sine of column x.

>>> import fast_causal_inference.dataframe.functions as Fn
>>> df_new = df.withColumn('new_column', Fn.sin('weight'))
>>> df_new.avg('new_column').show()
dataframe.functions.sqrt(x)[source]

sqrt is used to calculate the square root of a column.

>>> import fast_causal_inference.dataframe.functions as Fn
>>> df_new = df.withColumn('new_column', Fn.sqrt('weight'))
>>> df_new.avg('new_column').show()
dataframe.functions.stddevPop(col)[source]

stddevPop is used to calculate the population standard deviation of a column.

Example:

import fast_causal_inference.dataframe.functions as Fn
df = fast_causal_inference.readClickHouse('test_data_small')
df.stddevPop('numerator').show()
df.groupBy('treatment').stddevPop('numerator').show()
df.groupBy('treatment').agg(Fn.stddevPop('numerator').alias('numerator')).show()
df.groupBy('treatment').agg({'numerator':'stddevPop', 'numerator_pre':'stddevPop'}).show()

dataframe.functions.stddevSamp(col)[source]

stddevSamp is used to calculate the sample standard deviation of a column.

Example:

import fast_causal_inference.dataframe.functions as Fn
df = fast_causal_inference.readClickHouse('test_data_small')
df.stddevSamp('numerator').show()
df.groupBy('treatment').stddevSamp('numerator').show()
df.groupBy('treatment').agg(Fn.stddevSamp('numerator').alias('numerator')).show()
df.groupBy('treatment').agg({'numerator':'stddevSamp', 'numerator_pre':'stddevSamp'}).show()
dataframe.functions.sum(col1)[source]

sum is used to calculate the sum of a column.

Example:

import fast_causal_inference.dataframe.functions as Fn
df = fast_causal_inference.readClickHouse('test_data_small')
df.sum('numerator').show()
df.groupBy('treatment').sum('numerator').show()
df.groupBy('treatment').agg(Fn.sum('numerator').alias('numerator')).show()
dataframe.functions.tan(x)[source]

tan is used to calculate the tangent of column x.

>>> import fast_causal_inference.dataframe.functions as Fn
>>> df_new = df.withColumn('new_column', Fn.tan('weight'))
>>> df_new.avg('new_column').show()
dataframe.functions.tgamma(x)[source]

tgamma is used to calculate the gamma function of column x.

>>> import fast_causal_inference.dataframe.functions as Fn
>>> df_new = df.withColumn('new_column', Fn.tgamma('weight'))
>>> df_new.avg('new_column').show()
dataframe.functions.varPop(col)[source]

varPop is used to calculate the population variance of a column.

Example:

import fast_causal_inference.dataframe.functions as Fn
df = fast_causal_inference.readClickHouse('test_data_small')
df.varPop('numerator').show()
df.groupBy('treatment').agg(Fn.varPop('numerator').alias('numerator')).show()
df.groupBy('treatment').varPop('numerator').show()
df.groupBy('treatment').agg({'numerator':'varPop', 'numerator_pre':'varPop'}).show()
dataframe.functions.varSamp(col)[source]

varSamp is used to calculate the sample variance of a column.

Example:

import fast_causal_inference.dataframe.functions as Fn
df = fast_causal_inference.readClickHouse('test_data_small')
df.varSamp('numerator').show()
df.groupBy('treatment').agg(Fn.varSamp('numerator').alias('numerator')).show()
df.groupBy('treatment').varSamp('numerator').show()
df.groupBy('treatment').agg({'numerator':'varSamp', 'numerator_pre':'varSamp'}).show()
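The Pop and Samp variants of variance and standard deviation differ only in the divisor (n versus n - 1). Python's standard statistics module exposes the same pairs, which makes the difference easy to check on a small list; the data below is made up for illustration.

```python
import statistics

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

var_pop = statistics.pvariance(data)   # population variance, divides by n
var_samp = statistics.variance(data)   # sample variance, divides by n - 1
std_pop = statistics.pstdev(data)      # square root of var_pop
std_samp = statistics.stdev(data)      # square root of var_samp

print(var_pop, std_pop)  # 4.0 2.0
```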

features

class dataframe.features.Bucketizer[source]

This class is used for bucketizing continuous variables into discrete bins.

fit(df, inputCols, splitsArray, outputCols=[], if_string=True)[source]

This function applies the bucketizing transformation to the specified columns of the input dataframe.

Parameters:
  • df (DataFrame) – The input dataframe to be transformed.

  • inputCols (list) – A list of column names in the dataframe to be bucketized.

  • splitsArray (list) – A list of lists, where each inner list contains the split points for bucketizing the corresponding column in inputCols.

  • outputCols (list, optional) – A list of output column names after bucketizing. If not provided, ‘_buckets’ will be appended to the original column names.

  • if_string (bool, optional) – A flag indicating whether the bin values should be treated as strings. Default is True.

Returns:

The transformed dataframe with bucketized columns.

Return type:

DataFrame

Example

>>> import fast_causal_inference
>>> import fast_causal_inference.dataframe.features as Features
>>> df = fast_causal_inference.readClickHouse('test_data_small')
>>> bucketizer = Features.Bucketizer()
>>> df_new = bucketizer.fit(df,['x1','x2'],[[1,3],[0,2]],if_string=True)
>>> df_new.select('x1','x2','x1_buckets','x2_buckets').head(5).show()
             x1            x2 x1_buckets x2_buckets
0  -0.131301907  -3.152383354          1          0
1  -0.966931088  -0.427920835          1          0
2   1.257744217  -2.050358546      [1,3)          0
3  -0.777228042  -2.621604715          1          0
4  -0.669571385   0.606404768          1      [0,2)

>>> df_new = bucketizer.fit(df,['x1','x2'],[[1,3],[0,2]],if_string=False)
>>> df_new.select('x1','x2','x1_buckets','x2_buckets').head(5).show()
             x1            x2 x1_buckets x2_buckets
0  -0.131301907  -3.152383354          1          1
1  -0.966931088  -0.427920835          1          1
2   1.257744217  -2.050358546          2          1
3  -0.777228042  -2.621604715          1          1
4  -0.669571385   0.606404768          1          2
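The integer bucket ids in the if_string=False output are consistent with a 1-based index into left-closed intervals defined by the split points. The plain-Python sketch below reproduces that mapping with the standard bisect module. The boundary handling is inferred from the [1,3) label in the example, and `bucketize` is a hypothetical helper, not the library's code.

```python
from bisect import bisect_right

def bucketize(value, splits):
    """Hypothetical helper: 1-based bucket index, buckets closed on the left."""
    # Values below the first split fall in bucket 1; a value equal to a
    # split point belongs to the bucket that starts at that split.
    return bisect_right(splits, value) + 1

x1 = [-0.131301907, -0.966931088, 1.257744217]
x2 = [-3.152383354, -0.427920835, 0.606404768]
print([bucketize(v, [1, 3]) for v in x1])  # [1, 1, 2]
print([bucketize(v, [0, 2]) for v in x2])  # [1, 1, 2]
```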
class dataframe.features.OneHotEncoder[source]

This class implements the OneHotEncoder method for causal inference.

Parameters

cols : list, default=None

The columns to be one-hot encoded.

Methods

fit(dataframe):

Apply the OneHotEncoder method to the input dataframe.

Example

import fast_causal_inference
import fast_causal_inference.dataframe.features as Features
df = fast_causal_inference.readClickHouse('test_data_small')
one_hot = Features.OneHotEncoder()
df_new = one_hot.fit(df, cols=['x_cat1'])
df_new.printSchema()
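One-hot encoding replaces a categorical column with one 0/1 indicator column per category. The plain-Python sketch below shows the idea on in-memory rows. The output column naming (`<col>_<category>`) is an assumption for illustration and may differ from what the library produces; `one_hot` is a hypothetical helper.

```python
def one_hot(rows, col):
    """Hypothetical helper: add one 0/1 indicator column per category of `col`."""
    categories = sorted({r[col] for r in rows})
    return [
        {**r, **{f"{col}_{c}": int(r[col] == c) for c in categories}}
        for r in rows
    ]

rows = [{"x_cat1": "a"}, {"x_cat1": "b"}, {"x_cat1": "a"}]
encoded = one_hot(rows, "x_cat1")
print(encoded[0])  # {'x_cat1': 'a', 'x_cat1_a': 1, 'x_cat1_b': 0}
```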