pyspark.pandas.DataFrame.to_parquet

DataFrame.to_parquet(path, mode='w', partition_cols=None, compression=None, index_col=None, **options)

Write the DataFrame out as a Parquet file or directory.

Parameters
path: str, required

Path to write to.

mode: str

Python write mode, default ‘w’.

Note

mode also accepts the Spark write-mode strings ‘append’, ‘overwrite’, ‘ignore’, ‘error’, and ‘errorifexists’.

  • ‘append’ (equivalent to ‘a’): Append the new data to existing data.

  • ‘overwrite’ (equivalent to ‘w’): Overwrite existing data.

  • ‘ignore’: Silently ignore this operation if data already exists.

  • ‘error’ or ‘errorifexists’: Throw an exception if data already exists.

partition_cols: str or list of str, optional, default None

Names of the partitioning columns.

compression: str {‘none’, ‘uncompressed’, ‘snappy’, ‘gzip’, ‘lzo’, ‘brotli’, ‘lz4’, ‘zstd’}

Compression codec to use when saving to file. If None, the value of the Spark configuration spark.sql.parquet.compression.codec is used.

index_col: str or list of str, optional, default: None

Column names to be used in Spark to represent pandas-on-Spark’s index. The index name in pandas-on-Spark is ignored. By default (None), the index is not written.

options: dict

All other options passed directly into Spark’s data source.
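The equivalence between the Python-style and Spark-style values of mode listed above can be sketched in plain Python. The helper normalize_mode below is hypothetical and is not part of pyspark; it only mirrors the mapping stated in the note (‘w’ to ‘overwrite’, ‘a’ to ‘append’) and assumes any other value is rejected.

```python
# Hypothetical helper, not a pyspark API: it mirrors the equivalences
# documented for the `mode` parameter and nothing more.
_PYTHON_TO_SPARK_MODE = {"w": "overwrite", "a": "append"}
_SPARK_MODES = {"append", "overwrite", "ignore", "error", "errorifexists"}


def normalize_mode(mode: str) -> str:
    """Translate a Python-style write mode into its Spark write-mode string."""
    # Spark-style strings pass through unchanged; 'w' and 'a' are translated.
    spark_mode = _PYTHON_TO_SPARK_MODE.get(mode, mode)
    if spark_mode not in _SPARK_MODES:
        # Assumption: anything outside the documented set is treated as an error.
        raise ValueError(f"unsupported write mode: {mode!r}")
    return spark_mode
```

Under this sketch, normalize_mode('w') and normalize_mode('overwrite') name the same behavior, which is why the default mode='w' replaces existing data at path.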

Notes

Unlike pandas, pandas API on Spark treats path as a directory and writes multiple part files into it. pandas API on Spark respects HDFS properties such as ‘fs.default.name’.
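Because the output is a directory rather than a single file, it can help to inspect what was written. The following is a minimal sketch using only the standard library; list_part_files is a hypothetical helper, and it assumes Spark's usual conventions that data files are named with a part- prefix and that partitioned output is placed in column=value subdirectories.

```python
from pathlib import Path


def list_part_files(output_dir):
    """Return sorted relative paths of the part files under an output directory."""
    root = Path(output_dir)
    # rglob descends into partition subdirectories such as "country=KR/".
    # Marker files like "_SUCCESS" and hidden checksum files do not match "part-*".
    return sorted(
        p.relative_to(root).as_posix()
        for p in root.rglob("part-*")
        if p.is_file()
    )
```

This only works for a path on the local filesystem; output written to HDFS or an object store would need that system's own listing tools.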

Examples

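>>> import tempfile
>>> import pandas as pd
>>> import pyspark.pandas as ps
>>> path = tempfile.mkdtemp()  # assumption: any writable directory works here
>>> df = ps.DataFrame(dict(
...    date=list(pd.date_range('2012-1-1 12:00:00', periods=3, freq='M')),
...    country=['KR', 'US', 'JP'],
...    code=[1, 2, 3]), columns=['date', 'country', 'code'])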
>>> df
                 date country  code
0 2012-01-31 12:00:00      KR     1
1 2012-02-29 12:00:00      US     2
2 2012-03-31 12:00:00      JP     3
>>> df.to_parquet('%s/to_parquet/foo.parquet' % path, partition_cols='date')
>>> df.to_parquet(
...     '%s/to_parquet/foo.parquet' % path,
...     mode='overwrite',
...     partition_cols=['date', 'country'])
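
The index_col parameter is not exercised in the examples above. As a rough sketch, the function below writes a small frame with its index stored in a Spark column and reads it back with ps.read_parquet using the same column name. The function name roundtrip_with_index, the column name 'idx', and the sample data are illustrative assumptions; running it requires a working Spark installation.

```python
def roundtrip_with_index(out_dir):
    """Write a small frame with its index kept as column 'idx', then read it back."""
    # Imported lazily so that defining this function does not require Spark.
    import pyspark.pandas as ps

    psdf = ps.DataFrame(
        {"country": ["KR", "US", "JP"], "code": [1, 2, 3]},
        index=[10, 20, 30],
    )
    # index_col names the Spark column that will hold the index values;
    # without it, the index would not be written at all.
    psdf.to_parquet(out_dir, mode="overwrite", index_col="idx")
    # Passing the same name to read_parquet restores that column as the index.
    return ps.read_parquet(out_dir, index_col="idx")
```

Calling roundtrip_with_index with a writable directory should return a frame whose index holds 10, 20, and 30, although the row order after reading back is not guaranteed.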