In PySpark, there are several ways to write filter criteria for selecting rows from a DataFrame. Here are some common approaches:
Using Comparison Operators:
You can use comparison operators such as >, <, >=, <=, ==, and != to write filter criteria.
filtered_df = df.filter(df.column1 > 10)
In this example, the DataFrame df is filtered to include only the rows where the value of “column1” is greater than 10.
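As a minimal runnable sketch, assuming a hypothetical DataFrame with a numeric "column1" (the data and column name here are only illustrative):
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
# Hypothetical sample data with a single numeric column.
df = spark.createDataFrame([(5,), (12,), (20,)], ["column1"])
filtered_df = df.filter(df.column1 > 10)
filtered_df.show()  # keeps the rows with 12 and 20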
Using Logical Operators:
PySpark combines multiple filter criteria with the operators & (and), | (or), and ~ (not). Note that Python's plain and, or, and not keywords do not work on Column expressions, and each condition must be wrapped in parentheses.
filtered_df = df.filter((df.column1 > 10) & (df.column2 == "value"))
Here, the DataFrame df is filtered to include only the rows where “column1” is greater than 10 and “column2” is equal to “value”.
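A minimal sketch combining several conditions, again with hypothetical columns "column1" and "column2":
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(5, "value"), (15, "value"), (15, "other")],
    ["column1", "column2"],
)
# Each condition is wrapped in parentheses before combining with & or |.
filtered_df = df.filter((df.column1 > 10) & (df.column2 == "value"))
negated_df = df.filter(~(df.column2 == "value"))  # "not" via ~
filtered_df.show()  # keeps only (15, "value")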
Using Functions:
PySpark provides a wide range of built-in functions that can be used in filter criteria. These functions allow you to perform various operations on columns.
from pyspark.sql.functions import col
filtered_df = df.filter(col("column1").contains("abc"))
In this example, the DataFrame df is filtered to include only the rows where “column1” contains the substring “abc”.
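As a sketch, here are a few of the built-in column functions (contains, startswith, isin) applied to a hypothetical string column:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("abcdef",), ("xyz",), ("abx",)], ["column1"])
contains_df = df.filter(col("column1").contains("abc"))      # substring match
starts_df = df.filter(col("column1").startswith("ab"))       # prefix match
isin_df = df.filter(col("column1").isin("xyz", "abcdef"))    # membership test
contains_df.show()  # keeps "abcdef"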
Using Regular Expressions:
PySpark supports regular expressions for more advanced pattern matching in filter criteria.
filtered_df = df.filter(df.column1.rlike("^A.*"))
Here, the DataFrame df is filtered to include only the rows where “column1” starts with the letter “A”.
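A quick runnable sketch with rlike, assuming a hypothetical string column "column1":
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("Apple",), ("Banana",), ("Avocado",)], ["column1"])
# rlike takes a Java-style regular expression; "^A.*" matches values starting with "A".
filtered_df = df.filter(df.column1.rlike("^A.*"))
filtered_df.show()  # keeps "Apple" and "Avocado"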
Using SQL Expressions:
PySpark also lets you write filter criteria as SQL-like expression strings passed to the filter() or where() methods.
filtered_df = df.filter("column1 > 10")
In this example, the DataFrame df is filtered using the SQL expression “column1 > 10”.
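A sketch showing filter() and where() with SQL expression strings, under the same hypothetical column names:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(5, "value"), (15, "value")], ["column1", "column2"])
filtered_df = df.filter("column1 > 10")
# where() is an alias for filter(); SQL conditions can be combined with AND/OR.
combined_df = df.where("column1 > 10 AND column2 = 'value'")
filtered_df.show()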
These are some of the ways to write filter criteria in PySpark. You can choose the approach that best fits your filtering requirements, whether it’s using comparison operators, logical operators, functions, regular expressions, or SQL expressions. Filter criteria allow you to selectively extract specific rows from a DataFrame based on your data analysis or processing needs.