PySpark can load data from files in a variety of formats, such as CSV, JSON, Parquet, Avro, and more. Here's a guide on how to load data from a file in each of the common formats in PySpark:
CSV File:
To load data from a CSV file, use the read.csv() method. Specify the file path, set the header parameter to True if the file has a header row, and optionally set other options, such as inferSchema, to infer the column data types automatically.
from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder.getOrCreate()

# Load data from a CSV file
df = spark.read.csv("path/to/data.csv", header=True, inferSchema=True)
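If you already know the column types, you can supply an explicit schema instead of inferSchema, which saves Spark an extra pass over the file. A minimal sketch, assuming hypothetical columns name and age:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Define the schema up front instead of inferring it
# (the column names here are illustrative placeholders)
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
])

df = spark.read.csv("path/to/data.csv", header=True, schema=schema)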
JSON File:
To load data from a JSON file, use the read.json() method and specify the file path. Unlike CSV, JSON records carry their own field names, so there is no header option, and Spark infers the schema from the data automatically.
# Load data from a JSON file (by default, one JSON object per line)
df = spark.read.json("path/to/data.json")
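If the file contains a single multi-line JSON document or a JSON array rather than one object per line, set the multiLine option. A short sketch:

# Load a file containing a multi-line JSON document or array
df = spark.read.json("path/to/data.json", multiLine=True)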
Parquet File:
Parquet is a columnar storage format commonly used in big data environments. To load data from a Parquet file, use the read.parquet() method. Parquet files embed their own schema, so no header or inferSchema options are needed.
# Load data from a Parquet file
df = spark.read.parquet("path/to/data.parquet")
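Because Parquet is columnar, Spark only needs to read the columns a query touches. A small sketch of column pruning, assuming hypothetical columns id and price:

# Only the selected columns are read from disk (column pruning)
df = spark.read.parquet("path/to/data.parquet").select("id", "price")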
Other File Formats:
PySpark supports various other file formats, such as Avro and ORC, as well as external sources like JDBC. You can use the respective read methods to load data from those formats. For example, to load Avro data, use the read.format("avro") method. Note that Avro support ships as a separate spark-avro package, which must be added to your application's dependencies (e.g., via --packages).
# Load data from an Avro file (requires the spark-avro package)
df = spark.read.format("avro").load("path/to/data.avro")
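For completeness, here are sketches for the other two sources named above. The ORC reader works like the Parquet one; for JDBC, the URL, table name, and credentials below are hypothetical placeholders, and the matching JDBC driver must be on Spark's classpath:

# Load data from an ORC file
df = spark.read.orc("path/to/data.orc")

# Load data from a database table over JDBC
# (url, dbtable, user, and password are placeholder values)
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://localhost:5432/mydb")
    .option("dbtable", "my_table")
    .option("user", "username")
    .option("password", "password")
    .load()
)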