The following steps verify a PySpark installation by running a simple script that counts the lines in a text file:
Prepare a Text File:
Create a text file with some content. For example, you can create a file named example.txt and add a few lines of text.
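If you would rather generate the file than type it by hand, a few lines of plain Python will do. This is only a sketch: the three sample lines are placeholders, and the file is written to the current working directory.

```python
from pathlib import Path

# Placeholder sample content; any few lines of text will do.
sample_lines = ["first line", "second line", "third line"]

# Write one entry per line and end the file with a newline.
example_path = Path("example.txt")
example_path.write_text("\n".join(sample_lines) + "\n", encoding="utf-8")

print("Wrote", len(sample_lines), "lines to", example_path.resolve())
```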
Write the PySpark Script:
Open a text editor and create a new file. Save it with a .py extension, such as line_count.py.
In the file, write the following PySpark script:
from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder \
    .appName("Line Count") \
    .getOrCreate()

# Read the text file into an RDD
lines_rdd = spark.sparkContext.textFile("path/to/example.txt")

# Count the number of lines in the RDD
line_count = lines_rdd.count()

# Print the result
print("Number of lines:", line_count)
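Replace "path/to/example.txt" with the actual path to the text file you created in step 1. In a local setup, a relative path is resolved against the directory you run the script from, so an absolute path is the safer choice.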
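If you are unsure what the actual path is, plain Python can print it for you. The sketch below assumes example.txt sits in the current working directory; adjust the name if yours differs. The file:// form is optional and simply makes it explicit that the file lives on the local filesystem.

```python
from pathlib import Path

# Assumes example.txt is in the current working directory; adjust as needed.
text_path = Path("example.txt").resolve()

# An absolute path does not depend on where the script is launched from.
print("Absolute path:", text_path)

# A file:// URI states explicitly that the file is on the local filesystem.
file_uri = text_path.as_uri()
print("File URI:", file_uri)
```

Either printed value can be pasted into the textFile(...) call in place of the placeholder path.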
Run the PySpark Script:
Open a terminal or command prompt.
Navigate to the directory where you saved the line_count.py file.
Run the following command to execute the PySpark script:
spark-submit line_count.py
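If everything is set up correctly, the output includes a line beginning with "Number of lines:" followed by the number of lines in the text file. With Spark's default logging, this line may appear among a number of log messages.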
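To confirm that the number Spark reports is right, you can count the lines independently in plain Python. The sketch below is not part of the PySpark script: count_lines is a hypothetical helper, and the demo file name is made up so the example runs on its own. Point the helper at your own text file to compare the result with Spark's.

```python
from pathlib import Path


def count_lines(path):
    """Count the lines in a text file without using Spark."""
    with open(path, encoding="utf-8") as handle:
        return sum(1 for _ in handle)


# Self-contained demo on a throwaway file (hypothetical name).
demo_path = Path("line_count_demo.txt")
demo_path.write_text("alpha\nbeta\ngamma\n", encoding="utf-8")

expected = count_lines(demo_path)
print("Number of lines:", expected)  # prints: Number of lines: 3
```

If the two counts differ, the most likely cause is that the script and the check are reading different files, so compare the paths first.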