In PySpark, the withColumn() function adds a new column to a DataFrame or replaces an existing one. It lets you transform data by applying expressions or functions to existing columns.
The syntax of the withColumn() function:
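df.withColumn(colName, col)
Here, colName is a string naming the new (or existing) column, and col is a Column expression that produces its values.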
How to use the withColumn() function in PySpark:
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = SparkSession.builder.getOrCreate()

# Create a DataFrame
data = [("Alice", 25), ("Bob", 30), ("Charlie", 35)]
df = spark.createDataFrame(data, ["Name", "Age"])

# Add a new column with a constant value
df_new = df.withColumn("Gender", lit("Female"))
df_new.show()
Output:
+-------+---+------+
|   Name|Age|Gender|
+-------+---+------+
|  Alice| 25|Female|
|    Bob| 30|Female|
|Charlie| 35|Female|
+-------+---+------+
Transforming existing columns:
from pyspark.sql.functions import col

# Add a new column by applying a transformation to an existing column
df_new = df.withColumn("Age_in_5_years", col("Age") + 5)
df_new.show()
Output:
+-------+---+--------------+
|   Name|Age|Age_in_5_years|
+-------+---+--------------+
|  Alice| 25|            30|
|    Bob| 30|            35|
|Charlie| 35|            40|
+-------+---+--------------+
Replacing an existing column:
from pyspark.sql.functions import col, when

# Replace the Age column based on a condition
# (the replaced column now holds strings instead of integers)
df_new = df.withColumn("Age", when(col("Age") < 30, "Young").otherwise("Old"))
df_new.show()
Output:
+-------+-----+
|   Name|  Age|
+-------+-----+
|  Alice|Young|
|    Bob|  Old|
|Charlie|  Old|
+-------+-----+
In the examples above, we create a DataFrame and then use the withColumn() function to add a new column, derive a column from an existing one, and replace an existing column.
The first argument is the column name as a string, and the second is the Column expression for its values; a constant must be wrapped in lit(). The call returns the modified DataFrame, which can be used in further operations.
If you wish to rename an existing column, use the withColumnRenamed() function instead.
Note that the withColumn() function does not modify the original DataFrame; it creates a new DataFrame with the desired changes.