Filtering data in PySpark allows you to extract specific rows from a DataFrame based on certain conditions. You can use the filter() or where() methods to apply filtering operations. Here’s an explanation of how to filter data in PySpark:
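The examples in this post assume an active SparkSession and a DataFrame df with an integer “column1” and a string “column2”. A minimal setup sketch along those lines (the application name and sample values are placeholders, not part of the original examples):
from pyspark.sql import SparkSession

# Hypothetical setup so the snippets below have something to run against.
spark = SparkSession.builder.appName("filtering-example").getOrCreate()
df = spark.createDataFrame(
    [(5, "abc123"), (12, None), (20, "xyz")],
    ["column1", "column2"],
)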
Using the filter() method:
The filter() method allows you to specify the filtering condition as a Boolean expression. It returns a new DataFrame that contains only the rows satisfying the specified condition.
filtered_df = df.filter(df.column1 > 10)
In the above example, the DataFrame df is filtered to include only the rows where the value of “column1” is greater than 10. The resulting DataFrame filtered_df will contain those rows.
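If you find a SQL-style condition more readable, filter() also accepts the condition as an expression string; a brief sketch of the same filter on the hypothetical df:
# Equivalent filter written as a SQL expression string.
filtered_df = df.filter("column1 > 10")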
Using the where() method:
The where() method is an alias for filter() and provides the same functionality. It also allows you to specify the filtering condition as a Boolean expression.
filtered_df = df.where(df.column2.isNull())
Here, the DataFrame df is filtered to include only the rows where the value of “column2” is null. The resulting DataFrame filtered_df will contain those rows.
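The complementary check is isNotNull(), which keeps only the rows where the column has a value; for example, on the same hypothetical df:
# Keep rows where "column2" is not null.
filtered_df = df.where(df.column2.isNotNull())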
Combining multiple conditions:
You can use the logical operators & (and), | (or), and ~ (not) to combine multiple filtering conditions. Wrap each individual condition in parentheses so the operators bind correctly.
filtered_df = df.filter((df.column1 > 10) & (df.column2.isNull()))
In this example, the DataFrame df is filtered to include only the rows where “column1” is greater than 10 and “column2” is null.
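The or and not cases work the same way with | and ~; a short sketch on the hypothetical df that keeps rows where “column1” is greater than 10 or “column2” is not null:
# Combine conditions with | (or) and negate with ~ (not).
filtered_df = df.filter((df.column1 > 10) | (~df.column2.isNull()))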
Using functions for filtering:
PySpark provides a wide range of built-in functions that you can use for filtering. These functions can be applied to columns within the filtering condition.
from pyspark.sql.functions import col
filtered_df = df.filter(col("column1").contains("abc"))
Here, the DataFrame df is filtered to include only the rows where “column1” contains the substring “abc”. The col() function is used to reference the column within the filtering condition.
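A few other column methods are commonly used the same way inside filter(); a brief sketch on the hypothetical df (the values are illustrative only):
from pyspark.sql.functions import col

# Keep rows whose "column1" value is one of the listed values.
filtered_df = df.filter(col("column1").isin(10, 20))
# Keep rows whose "column2" value matches a SQL LIKE pattern.
filtered_df = df.filter(col("column2").like("%abc%"))
# Keep rows whose "column1" value falls within an inclusive range.
filtered_df = df.filter(col("column1").between(5, 15))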