pyspark.sql.DataFrameReader.schema

DataFrameReader.schema(schema)

Specifies the input schema.

Some data sources (e.g. JSON) can infer the input schema automatically from data. By specifying the schema here, the underlying data source can skip the schema inference step, and thus speed up data loading.

New in version 1.4.0.

Changed in version 3.4.0: Supports Spark Connect.

Parameters
schema : pyspark.sql.types.StructType or str

A pyspark.sql.types.StructType object or a DDL-formatted string (for example, "col0 INT, col1 DOUBLE").

Examples

>>> spark.read.schema("col0 INT, col1 DOUBLE")
<...readwriter.DataFrameReader object ...>

Specify the schema when reading a CSV file.

>>> import tempfile
>>> with tempfile.TemporaryDirectory(prefix="schema") as d:
...     spark.read.schema("col0 INT, col1 DOUBLE").format("csv").load(d).printSchema()
root
 |-- col0: integer (nullable = true)
 |-- col1: double (nullable = true)