Filtering data in PySpark allows you to extract specific rows from a DataFrame based on certain conditions. You can use the filter() or where() methods to apply filtering operations. Here’s an explanation of how to filter data in PySpark:
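The examples in this post assume an active SparkSession and a DataFrame df with an integer “column1” and a string “column2”. A minimal setup sketch along those lines (the application name and sample values are placeholders, not part of the original examples):
from pyspark.sql import SparkSession

# Hypothetical setup so the snippets below have something to run against.
spark = SparkSession.builder.appName("filtering-example").getOrCreate()
df = spark.createDataFrame(
    [(5, "abc123"), (12, None), (20, "xyz")],
    ["column1", "column2"],
)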
Using the filter() method:
The filter() method allows you to specify the filtering condition as a Boolean expression. It returns a new DataFrame that contains only the rows satisfying the specified condition.
filtered_df = df.filter(df.column1 > 10)
In the above example, the DataFrame df is filtered to include only the rows where the value of “column1” is greater than 10. The resulting DataFrame filtered_df will contain those rows.
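If you find a SQL-style condition more readable, filter() also accepts the condition as an expression string; a brief sketch of the same filter on the hypothetical df:
# Equivalent filter written as a SQL expression string.
filtered_df = df.filter("column1 > 10")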
Using the where() method:
The where() method is an alias for filter() and provides the same functionality. It also allows you to specify the filtering condition as a Boolean expression.
filtered_df = df.where(df.column2.isNull())
Here, the DataFrame df is filtered to include only the rows where the value of “column2” is null. The resulting DataFrame filtered_df will contain those rows.
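The complementary check is isNotNull(), which keeps only the rows where the column has a value; for example, on the same hypothetical df:
# Keep rows where "column2" is not null.
filtered_df = df.where(df.column2.isNotNull())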
Combining multiple conditions:
You can use the logical operators & (and), | (or), and ~ (not) to combine multiple filtering conditions. Wrap each individual condition in parentheses so the operators bind correctly.
filtered_df = df.filter((df.column1 > 10) & (df.column2.isNull()))
In this example, the DataFrame df is filtered to include only the rows where “column1” is greater than 10 and “column2” is null.
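The or and not cases work the same way with | and ~; a short sketch on the hypothetical df that keeps rows where “column1” is greater than 10 or “column2” is not null:
# Combine conditions with | (or) and negate with ~ (not).
filtered_df = df.filter((df.column1 > 10) | (~df.column2.isNull()))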
Using functions for filtering:
PySpark provides a wide range of built-in functions that you can use for filtering. These functions can be applied to columns within the filtering condition.
from pyspark.sql.functions import col
filtered_df = df.filter(col("column1").contains("abc"))
Here, the DataFrame df is filtered to include only the rows where “column1” contains the substring “abc”. The col() function is used to reference the column within the filtering condition.
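A few other column methods are commonly used the same way inside filter(); a brief sketch on the hypothetical df (the values are illustrative only):
from pyspark.sql.functions import col

# Keep rows whose "column1" value is one of the listed values.
filtered_df = df.filter(col("column1").isin(10, 20))
# Keep rows whose "column2" value matches a SQL LIKE pattern.
filtered_df = df.filter(col("column2").like("%abc%"))
# Keep rows whose "column1" value falls within an inclusive range.
filtered_df = df.filter(col("column1").between(5, 15))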