To sort data in a PySpark DataFrame, use the orderBy() method. It takes one or more columns to sort by, and each column can be given its own sort order (ascending or descending).
Here’s an example of sorting data using the orderBy() method:
sorted_df = df.orderBy("column1", df.column2.desc())
In this example:
The DataFrame df is sorted primarily by “column1” in ascending order.
If there are ties in “column1”, the tied rows are ordered by “column2” in descending order.
You can also specify the sort order using the asc() or desc() functions from pyspark.sql.functions. Here’s an example:
from pyspark.sql.functions import desc
sorted_df = df.orderBy(desc("column1"), "column2")
In this case, “column1” is sorted in descending order using the desc() function, and “column2” is sorted in ascending order, which is the default for a column passed by name.
Additionally, you can sort by multiple columns by passing them as separate arguments to the orderBy() method. For example:
sorted_df = df.orderBy("column1", "column2", desc("column3"))
In this example, the DataFrame is sorted primarily by “column1” in ascending order, then by “column2” in ascending order, and finally by “column3” in descending order.
By default, the orderBy() method sorts the DataFrame in ascending order. To sort a column in descending order, wrap its name in the desc() function or call the column’s desc() method.
Sorting a PySpark DataFrame with the orderBy() method puts its rows in a defined order based on one or more columns, which facilitates analysis and downstream processing.