Python Data Source

DataSource.name()

Returns a string representing the format name of this data source.

DataSource.reader(schema)

Returns a DataSourceReader instance for reading data.

DataSource.schema()

Returns the schema of the data source.

DataSource.streamReader(schema)

Returns a DataSourceStreamReader instance for reading streaming data.

DataSource.writer(schema, overwrite)

Returns a DataSourceWriter instance for writing data.
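
Taken together, these hooks define a data source. Below is a minimal batch sketch of a hypothetical "counter" source that emits a fixed range of integers; the class names and the "n" option are illustrative, not part of the API:

```python
from pyspark.sql.datasource import DataSource, DataSourceReader


class CounterReader(DataSourceReader):
    def __init__(self, schema, options):
        # "n" is a hypothetical option controlling how many rows to emit.
        self.n = int(options.get("n", 10))

    def read(self, partition):
        # Yield one tuple per row, matching the schema declared below.
        for i in range(self.n):
            yield (i,)


class CounterDataSource(DataSource):
    @classmethod
    def name(cls):
        # Short format name used with spark.read.format("counter").
        return "counter"

    def schema(self):
        # A DDL string is accepted in place of a StructType.
        return "id int"

    def reader(self, schema):
        # self.options holds the options supplied at read time.
        return CounterReader(schema, self.options)
```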

DataSourceReader.partitions()

Returns an iterator of partitions for this data source.

DataSourceReader.read(partition)

Generates data for a given partition and returns an iterator of tuples or rows.
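
When partitions() is implemented, Spark calls read(partition) once per returned partition, potentially on different executors. A sketch of a partitioned variant of the counter reader above; InputPartition carries a picklable value from the driver to the executors, and "numPartitions" is an illustrative option:

```python
from pyspark.sql.datasource import DataSourceReader, InputPartition


class PartitionedCounterReader(DataSourceReader):
    def __init__(self, schema, options):
        self.n = int(options.get("n", 10))
        self.num_partitions = int(options.get("numPartitions", 2))

    def partitions(self):
        # Called on the driver; each InputPartition value must be picklable.
        return [InputPartition(i) for i in range(self.num_partitions)]

    def read(self, partition):
        # Called once per partition, typically on an executor.
        # Emit every num_partitions-th value, offset by the partition index.
        for i in range(partition.value, self.n, self.num_partitions):
            yield (i,)
```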

DataSourceRegistration.register(dataSource)

Registers a Python user-defined data source.
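
Once registered, the source is addressable by its format name. A usage sketch, assuming the hypothetical CounterDataSource above and an active SparkSession:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Register the class itself; Spark instantiates it per query.
spark.dataSource.register(CounterDataSource)

df = spark.read.format("counter").option("n", 5).load()
df.show()  # five rows, id 0 through 4
```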

DataSourceStreamReader.commit(end)

Informs the source that Spark has completed processing all data for offsets less than or equal to end and will only request offsets greater than end in the future.

DataSourceStreamReader.initialOffset()

Returns the initial offset of the streaming data source.

DataSourceStreamReader.latestOffset()

Returns the most recent offset available.

DataSourceStreamReader.partitions(start, end)

Returns a list of InputPartition instances for the given start and end offsets.

DataSourceStreamReader.read(partition)

Generates data for a given partition and returns an iterator of tuples or rows.

DataSourceStreamReader.stop()

Stops this source and frees any resources it has allocated.
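
These methods form the micro-batch lifecycle: Spark calls initialOffset() on the first run, polls latestOffset() for new data, plans each batch with partitions(start, end), reads it with read(partition), and acknowledges progress with commit(end). A sketch of a hypothetical stream that emits a few incrementing numbers per batch; offsets are plain dicts that Spark persists as JSON in the checkpoint:

```python
from pyspark.sql.datasource import DataSourceStreamReader, InputPartition


class CounterStreamReader(DataSourceStreamReader):
    def __init__(self, schema, options):
        self.current = 0

    def initialOffset(self):
        # The starting offset; must be a JSON-serializable dict.
        return {"offset": 0}

    def latestOffset(self):
        # Advance by a fixed step each micro-batch (illustrative only).
        self.current += 3
        return {"offset": self.current}

    def partitions(self, start, end):
        # One partition covering the whole (start, end] range.
        return [InputPartition((start["offset"], end["offset"]))]

    def read(self, partition):
        first, last = partition.value
        for i in range(first, last):
            yield (i,)

    def commit(self, end):
        # Offsets up to `end` are fully processed; a real source could
        # garbage-collect data here. Nothing to do for this sketch.
        pass

    def stop(self):
        # Release connections or file handles; none are held here.
        pass
```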

DataSourceWriter.abort(messages)

Aborts this writing job due to task failures.

DataSourceWriter.commit(messages)

Commits this writing job with a list of commit messages.

DataSourceWriter.write(iterator)

Writes data into the data source.
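
write() runs once per task on the executors and returns a WriterCommitMessage; commit() and abort() then run on the driver with all collected messages. A sketch of a hypothetical sink that merely counts rows:

```python
from dataclasses import dataclass

from pyspark.sql.datasource import DataSourceWriter, WriterCommitMessage


@dataclass
class CountCommitMessage(WriterCommitMessage):
    # Commit messages must be picklable; a dataclass works well.
    count: int


class CountingWriter(DataSourceWriter):
    def write(self, iterator):
        # Runs on an executor over one partition of input rows.
        n = sum(1 for _ in iterator)
        return CountCommitMessage(count=n)

    def commit(self, messages):
        # Runs on the driver once every task has succeeded.
        total = sum(m.count for m in messages)
        print(f"committed {total} rows")

    def abort(self, messages):
        # Runs on the driver if any task fails; clean up partial output.
        print("write aborted; nothing to clean up in this sketch")
```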