pyspark.sql.functions.parse_url

pyspark.sql.functions.parse_url(url, partToExtract, key=None)

URL function: Extracts a specified part from a URL. If a key is provided and the part is QUERY, it returns the value of the query parameter with that key.

New in version 3.5.0.

Parameters
url : Column or str

A column of strings, each representing a URL.

partToExtract : Column or str

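A column of strings, each naming the part to extract from the URL, such as HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, or USERINFO. A plain str is interpreted as a column name, so wrap a constant part in sf.lit, for example sf.lit("HOST").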

key : Column or str, optional

A column of strings, each representing the key of a query parameter in the URL.

Returns
Column

A new column of strings, each representing the value of the extracted part from the URL.
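
To make the part names concrete, the sketch below approximates a few of them with Python's standard `urllib.parse`. It is an analogy for intuition only, not Spark's implementation: the helper name `extract_part` is hypothetical, only five common parts are mapped, and edge cases (malformed URLs, host case handling, empty components) may differ from what Spark returns.

```python
from urllib.parse import urlsplit


def extract_part(url, part):
    """Approximate a few parse_url part names with urllib (hypothetical helper)."""
    pieces = urlsplit(url)
    mapping = {
        "PROTOCOL": pieces.scheme,
        "HOST": pieces.hostname,
        "PATH": pieces.path,
        "QUERY": pieces.query,
        "REF": pieces.fragment,
    }
    # urllib reports absent components as '' or None; this sketch
    # normalizes both (and unknown part names) to None.
    return mapping.get(part) or None


url = "https://spark.apache.org/path?query=1"
for part in ("PROTOCOL", "HOST", "PATH", "QUERY", "REF"):
    print(part, extract_part(url, part))
```

The URL here has no fragment, so the `REF` lookup yields `None` in this sketch.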

Examples

Example 1: Extracting the query part from a URL

>>> from pyspark.sql import functions as sf
>>> df = spark.createDataFrame(
...   [("https://spark.apache.org/path?query=1", "QUERY")],
...   ["url", "part"]
... )
>>> df.select(sf.parse_url(df.url, df.part)).show()
+--------------------+
|parse_url(url, part)|
+--------------------+
|             query=1|
+--------------------+

Example 2: Extracting the value of a specific query parameter from a URL

>>> from pyspark.sql import functions as sf
>>> df = spark.createDataFrame(
...   [("https://spark.apache.org/path?query=1", "QUERY", "query")],
...   ["url", "part", "key"]
... )
>>> df.select(sf.parse_url(df.url, df.part, df.key)).show()
+-------------------------+
|parse_url(url, part, key)|
+-------------------------+
|                        1|
+-------------------------+
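
The key lookup in Example 2 can be pictured as a search through the `name=value` pairs of the query string. The pure-Python sketch below illustrates that idea under simplifying assumptions: the helper `query_value` is hypothetical, it returns the first match without URL-decoding, and it is not taken from Spark's source, so corner cases may behave differently.

```python
import re
from urllib.parse import urlsplit


def query_value(url, key):
    """Return the first raw value for `key` in the URL's query string, or None."""
    query = urlsplit(url).query
    # Match `key=` at the start of the query or right after an `&`.
    match = re.search(r"(?:^|&)" + re.escape(key) + r"=([^&]*)", query)
    return match.group(1) if match else None


print(query_value("https://spark.apache.org/path?query=1", "query"))  # 1
print(query_value("https://spark.apache.org/path?a=x&b=y", "b"))      # y
print(query_value("https://spark.apache.org/path?a=x", "missing"))    # None
```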

Example 3: Extracting the protocol part from a URL

>>> from pyspark.sql import functions as sf
>>> df = spark.createDataFrame(
...   [("https://spark.apache.org/path?query=1", "PROTOCOL")],
...   ["url", "part"]
... )
>>> df.select(sf.parse_url(df.url, df.part)).show()
+--------------------+
|parse_url(url, part)|
+--------------------+
|               https|
+--------------------+

Example 4: Extracting the host part from a URL

>>> from pyspark.sql import functions as sf
>>> df = spark.createDataFrame(
...   [("https://spark.apache.org/path?query=1", "HOST")],
...   ["url", "part"]
... )
>>> df.select(sf.parse_url(df.url, df.part)).show()
+--------------------+
|parse_url(url, part)|
+--------------------+
|    spark.apache.org|
+--------------------+

Example 5: Extracting the path part from a URL

>>> from pyspark.sql import functions as sf
>>> df = spark.createDataFrame(
...   [("https://spark.apache.org/path?query=1", "PATH")],
...   ["url", "part"]
... )
>>> df.select(sf.parse_url(df.url, df.part)).show()
+--------------------+
|parse_url(url, part)|
+--------------------+
|               /path|
+--------------------+