Data manipulation in PySpark involves applying transformations and actions to RDDs or DataFrames to filter, reshape, aggregate, or otherwise process the data. PySpark provides a wide range of functions and operations for this. Here are some commonly used techniques:
Transformations:
a. map(): Applies a function to each element of an RDD and returns a new RDD (PySpark DataFrames do not expose map() directly; use select() or withColumn() instead, or convert with df.rdd).
b. filter(): Filters elements based on a condition and returns a new RDD or DataFrame.
c. groupBy(): Groups the data based on a specified key or keys.
d. join(): Joins two RDDs or DataFrames based on a common key or keys.
e. orderBy(): Sorts the data based on specified columns.
f. select(): Selects specific columns from a DataFrame.
g. withColumn(): Adds a new column to a DataFrame or replaces an existing one; a short sketch follows this list.
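As a quick illustration of withColumn(), here is a minimal sketch. The column names (name, age, salary, bonus) and the sample rows are assumptions invented for this example:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("withcolumn-sketch").getOrCreate()

# Hypothetical sample data, made up for illustration
df = spark.createDataFrame(
    [("Alice", 30, 50000), ("Bob", 28, 45000)],
    ["name", "age", "salary"],
)

# withColumn() adds a new column derived from an existing one...
with_bonus = df.withColumn("bonus", F.col("salary") * 0.1)

# ...or replaces an existing column when the same name is reused
aged_up = df.withColumn("age", F.col("age") + 1)

Because withColumn() returns a new DataFrame, calls can be chained to build up several derived columns in sequence.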
Actions:
a. collect(): Retrieves all the elements from the RDD or DataFrame to the driver program.
b. count(): Returns the total number of elements in the RDD or DataFrame.
c. first(): Returns the first element in the RDD or DataFrame.
d. take(): Returns a specified number of elements from the RDD or DataFrame.
e. show(): Prints the first rows of a DataFrame (20 by default) in a tabular format; a sketch of these actions follows this list.
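To make the actions concrete, here is a minimal sketch on a small in-memory DataFrame. The data and column names are invented for this illustration:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("actions-sketch").getOrCreate()

# Hypothetical in-memory data, invented for this illustration
df = spark.createDataFrame(
    [("Alice", 30), ("Bob", 28), ("Carol", 35)],
    ["name", "age"],
)

df.show()           # prints the rows in a tabular format (up to 20 by default)
print(df.count())   # 3 -- total number of rows
print(df.first())   # Row(name='Alice', age=30) -- the first row
print(df.take(2))   # first two rows as a list of Row objects
print(df.collect()) # all rows brought back to the driver as a list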
Here’s an example that demonstrates some data manipulation operations using PySpark:
from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession as the entry point
spark = SparkSession.builder.appName("data-manipulation").getOrCreate()

# Creating a DataFrame from a CSV file (inferSchema makes age/salary numeric)
df = spark.read.csv("path/to/data.csv", header=True, inferSchema=True)

# Selecting specific columns
selected_df = df.select("name", "age")

# Filtering data based on a condition
filtered_df = df.filter(df.age > 25)

# Grouping data by a column and calculating aggregates
grouped_df = df.groupBy("gender").agg({"age": "avg", "salary": "max"})

# Joining two DataFrames on a common key
# (df1 and df2 are assumed to be DataFrames that share an "id" column)
joined_df = df1.join(df2, "id")

# Sorting data based on a column
sorted_df = df.orderBy("salary")

# Performing a map transformation on an RDD
rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5])
squared_rdd = rdd.map(lambda x: x ** 2)

# Retrieving elements from an RDD or DataFrame (these actions trigger execution)
collected_data = df.collect()
count = df.count()
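Note that the transformations in this example (select(), filter(), groupBy(), join(), orderBy(), map()) are evaluated lazily: Spark only executes them when an action such as collect(), count(), or show() is called on the result.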