PySpark can load data from files in a variety of formats, such as CSV, JSON, Parquet, Avro, and more. Here's a guide on how to load data from a file in each of the common formats in PySpark:
CSV File:
To load data from a CSV file, use the read.csv() method. Specify the file path, set the header parameter to True if the file has a header row, and optionally set other options, such as inferSchema, to infer the column data types automatically.
from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder.getOrCreate()

# Load data from a CSV file
df = spark.read.csv("path/to/data.csv", header=True, inferSchema=True)
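If you already know the column types, you can supply an explicit schema instead of inferSchema, which saves Spark an extra pass over the file. A minimal sketch, assuming hypothetical columns name and age:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Define the schema up front instead of inferring it
# (the column names here are illustrative placeholders)
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
])

df = spark.read.csv("path/to/data.csv", header=True, schema=schema)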
JSON File:
To load data from a JSON file, use the read.json() method and specify the file path. Unlike CSV, JSON records carry their own field names, so there is no header option, and Spark infers the schema from the data automatically.
# Load data from a JSON file (by default, one JSON object per line)
df = spark.read.json("path/to/data.json")
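If the file contains a single multi-line JSON document or a JSON array rather than one object per line, set the multiLine option. A short sketch:

# Load a file containing a multi-line JSON document or array
df = spark.read.json("path/to/data.json", multiLine=True)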
Parquet File:
Parquet is a columnar storage format commonly used in big data environments. To load data from a Parquet file, use the read.parquet() method. Parquet files embed their own schema, so no header or inferSchema options are needed.
# Load data from a Parquet file
df = spark.read.parquet("path/to/data.parquet")
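Because Parquet is columnar, Spark only needs to read the columns a query touches. A small sketch of column pruning, assuming hypothetical columns id and price:

# Only the selected columns are read from disk (column pruning)
df = spark.read.parquet("path/to/data.parquet").select("id", "price")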
Other File Formats:
PySpark supports various other file formats, such as Avro and ORC, as well as external sources like JDBC. You can use the respective read methods to load data from those formats. For example, to load Avro data, use the read.format("avro") method. Note that Avro support ships as a separate spark-avro package, which must be added to your application's dependencies (e.g., via --packages).
# Load data from an Avro file (requires the spark-avro package)
df = spark.read.format("avro").load("path/to/data.avro")
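For completeness, here are sketches for the other two sources named above. The ORC reader works like the Parquet one; for JDBC, the URL, table name, and credentials below are hypothetical placeholders, and the matching JDBC driver must be on Spark's classpath:

# Load data from an ORC file
df = spark.read.orc("path/to/data.orc")

# Load data from a database table over JDBC
# (url, dbtable, user, and password are placeholder values)
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://localhost:5432/mydb")
    .option("dbtable", "my_table")
    .option("user", "username")
    .option("password", "password")
    .load()
)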