📄 JSON and XML Parsing in PySpark

Parsing Pipeline

Architecture Diagram

Detailed Explanation

Fundamental Challenge

Converting unstructured text into typed, queryable DataFrames while preserving semantic meaning.

Key Takeaway: PySpark provides robust support for JSON/XML, but complexity lies in handling nested structures, variable schemas, and performance optimization.

Schema Inference vs Explicit Schema

Approach	Method	Recommendation
Schema Inference	Sampling-based, automatic	Exploration only
Explicit Schema	`StructType`/`StructField` definition	Production workloads

Benefits of Explicit Schema:

Enables predicate pushdown
Enables column pruning
Significantly improves query performance
Avoids incorrect schema from non-representative samples

Malformed Record Handling

Mode	Behavior
PERMISSIVE (default)	Sets corrupt records to null
DROPMALFORMED	Silently drops corrupt records
FAILFAST	Throws exceptions on corrupt records

Nested Data Flattening

Operation	Purpose	Consideration
explode	Transform array elements to rows	Increases row count
select (dot notation)	Extract nested fields	e.g., `col("customer.name")`

Warning: Each explode operation increases row count—careful management prevents data explosion.

XML Processing

Requires spark-xml library with additional complexities:

Attributes vs child elements (key-value pairs in tags)
Mixed content (text interspersed with elements)
Namespace handling

Option	Purpose
`rowTag`	Specifies element defining each row
`attributePrefix`	Distinguishes attributes from child elements

Schema Evolution

The mergeSchema option enables automatic schema merging across files/batches.

Change Type	Support
Additive (new columns)	Well supported
Destructive (column removal/type changes)	May fail

Best Practice: Enable mergeSchema cautiously—test with small sample first.

Mathematical Foundations

Definition: Schema Evolution

Schema evolution transforms a record $R$ from schema $S_i$ to $S_j$ via a migration function $M_{i \rightarrow j}(R)$ , where:

M_{i \rightarrow j}(R) = \begin{cases} R & \text{if } S_j \subseteq S_i \text{ (narrowing)} \\ R \cup \{d_k: k \in S_j \setminus S_i\} & \text{if additive} \\ \text{error} & \text{if incompatible} \end{cases}

Parsing Complexity

For JSON document of depth $d$ and branching factor $b$ , parsing complexity is:

T_{\text{parse}} = O(b^d) \quad \text{(worst case)}, \qquad T_{\text{streaming}} = O(n) \text{ where } n = \text{token count}

Streaming parsers reduce space from $O(b^d)$ to $O(d)$ stack depth.

Schema Correctness Theorem

A schema $S$ is valid for document $D$ if and only if every value at path $p$ in $D$ matches the type constraint $S(p)$ :

\forall p \in \text{paths}(D): \text{type}(D(p)) \in S(p).\text{allowed\_types}

Inference from sample $D_{\text{sample}}$ overestimates types: $S_{\text{inferred}} \supseteq S_{\text{true}}$ .

Columnar Conversion Cost

Converting nested JSON to flat columns with $k$ levels and $n$ leaf fields:

C_{\text{flatten}} = n \times m \times k \quad \text{(total field extractions)}

where $m$ is the number of records.

Compression Ratio

After schema enforcement, columnar storage compression improves by:

\text{Ratio} = \frac{|R_{\text{untyped}}|}{|R_{\text{typed}}|} \approx \frac{\sum_i \text{bits}(R_i.\text{string})}{\sum_i \text{bits}(R_i.\text{typed})}

Typical improvement: 3-10x for numeric columns.

Key Insight

from_json with explicit schema is 2-5x faster than schema_of_json inference because it avoids a sampling pass. For nested data, use schema_of_json once to generate the schema, then hardcode it.

Summary

JSON/XML parsing on Spark benefits from streaming parsers for memory efficiency, explicit schemas for performance, and type enforcement for compression. Schema evolution requires careful compatibility analysis. Nested-to-flat conversion cost scales linearly with record count and nesting depth.

Key Concepts Table

Concept	Description	Configuration
Schema Inference	Automatically determine data types from sample	`inferSchema=true` (default for JSON)
Explicit Schema	Define schema programmatically	`schema=my_schema`
PERMISSIVE Mode	Set corrupt records to null	`mode=PERMISSIVE`
DROPMALFORMED Mode	Silently drop corrupt records	`mode=DROPMALFORMED`
FAILFAST Mode	Throw exception on corrupt records	`mode=FAILFAST`
from_json	Parse JSON string column to struct	`from_json(col, schema)`
get_json_object	Extract value using JSON path	`get_json_object(col, path)`
explode	Transform array to multiple rows	`explode(col("array_col"))`
mergeSchema	Merge schemas across files/batches	`mergeSchema=true`
columnNameOfCorruptRecord	Store malformed records in column	`columnNameOfCorruptRecord="_corrupt"`

Code Examples

JSON Parsing with Explicit Schema

from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
import json

spark = SparkSession.builder \
    .appName("JSONXMLParsing") \
    .getOrCreate()

# Define explicit schema for JSON data
order_schema = StructType([
    StructField("order_id", StringType(), False),
    StructField("timestamp", StringType(), True),
    StructField("customer", StructType([
        StructField("id", StringType(), False),
        StructField("name", StringType(), True),
        StructField("email", StringType(), True),
        StructField("address", StructType([
            StructField("street", StringType(), True),
            StructField("city", StringType(), True),
            StructField("state", StringType(), True),
            StructField("zip", StringType(), True),
        ]), True),
    ]), False),
    StructField("items", ArrayType(StructType([
        StructField("sku", StringType(), False),
        StructField("name", StringType(), True),
        StructField("quantity", IntegerType(), True),
        StructField("price", DoubleType(), True),
        StructField("attributes", MapType(StringType(), StringType()), True),
    ])), True),
    StructField("payment", StructType([
        StructField("method", StringType(), True),
        StructField("amount", DoubleType(), True),
        StructField("currency", StringType(), True),
    ]), True),
    StructField("metadata", MapType(StringType(), StringType()), True),
])

# Read JSON with explicit schema
json_data = [
    ('{"order_id":"ORD-001","timestamp":"2024-01-15T10:30:00Z","customer":{"id":"CUST-001","name":"Alice","email":"alice@email.com","address":{"street":"123 Main St","city":"New York","state":"NY","zip":"10001"}},"items":[{"sku":"A1","name":"Widget","quantity":5,"price":29.99,"attributes":{"color":"blue","size":"M"}},{"sku":"B2","name":"Gadget","quantity":3,"price":49.99,"attributes":{"color":"red","size":"L"}}],"payment":{"method":"credit","amount":249.93,"currency":"USD"},"metadata":{"source":"web","campaign":"summer"}}',),
    ('{"order_id":"ORD-002","timestamp":"2024-01-15T11:00:00Z","customer":{"id":"CUST-002","name":"Bob","email":"bob@email.com","address":{"street":"456 Oak Ave","city":"Boston","state":"MA","zip":"02101"}},"items":[{"sku":"C3","name":"Gizmo","quantity":2,"price":99.99,"attributes":{"color":"black"}}],"payment":{"method":"debit","amount":199.98,"currency":"USD"},"metadata":{"source":"mobile","campaign":"winter"}}',),
]

df = spark.read \
    .option("mode", "PERMISSIVE") \
    .option("columnNameOfCorruptRecord", "_corrupt_record") \
    .schema(order_schema) \
    .json(spark.sparkContext.parallelize([row[0] for row in json_data]))

# Flatten nested JSON structure
flattened_df = df \
    .select(
        col("order_id"),
        col("timestamp"),
        col("customer.id").alias("customer_id"),
        col("customer.name").alias("customer_name"),
        col("customer.email").alias("customer_email"),
        col("customer.address.city").alias("customer_city"),
        col("customer.address.state").alias("customer_state"),
        col("payment.method").alias("payment_method"),
        col("payment.amount").alias("payment_amount"),
        col("metadata").alias("order_metadata"),
    )

flattened_df.show(truncate=False)

# Explode items array
items_df = df \
    .select(
        col("order_id"),
        col("customer.id").alias("customer_id"),
        explode(col("items")).alias("item")
    ) \
    .select(
        col("order_id"),
        col("customer_id"),
        col("item.sku"),
        col("item.name"),
        col("item.quantity"),
        col("item.price"),
        (col("item.quantity") * col("item.price")).alias("line_total"),
    )

items_df.show(truncate=False)

Advanced JSON Operations

# Using from_json for string columns
json_strings_df = spark.createDataFrame([
    ('{"a": 1, "b": "hello"}',),
    ('{"a": 2, "b": "world", "c": 3.14}',),
], ["json_string"])

schema = StructType([
    StructField("a", IntegerType()),
    StructField("b", StringType()),
    StructField("c", DoubleType()),
])

parsed_df = json_strings_df \
    .withColumn("parsed", from_json(col("json_string"), schema)) \
    .select(
        col("parsed.a").alias("a"),
        col("parsed.b").alias("b"),
        col("parsed.c").alias("c"),
    )

parsed_df.show(truncate=False)

# Using get_json_object for JSON path queries
paths_df = df \
    .withColumn("customer_city", 
        get_json_object(col("_corrupt_record") if "_corrupt_record" in df.columns 
                       else to_json(struct("*")), "$.customer.address.city")
    )

# Handle nested arrays with posexplode
items_with_position = df \
    .select(
        col("order_id"),
        posexplode(col("items")).alias("position", "item")
    ) \
    .select(
        col("order_id"),
        col("position"),
        col("item.sku"),
        col("item.quantity"),
    )

items_with_position.show(truncate=False)

# Aggregate nested data
order_summary = df \
    .withColumn("item_count", size(col("items"))) \
    .withColumn("total_item_value", 
        expr("aggregate(items, 0D, (acc, item) -> acc + item.quantity * item.price)")
    ) \
    .withColumn("unique_skus", 
        size(array_distinct(expr("transform(items, x -> x.sku)")))
    ) \
    .select(
        col("order_id"),
        col("customer.name"),
        col("item_count"),
        col("total_item_value"),
        col("unique_skus"),
    )

order_summary.show(truncate=False)

XML Parsing

# Install spark-xml: spark.jars.packages=com.databricks:spark-xml_2.12:0.17.0

# Create sample XML data
xml_data = """<?xml version="1.0" encoding="UTF-8"?>
<catalog>
    <book id="bk101">
        <author>Gambardella, Matthew</author>
        <title>XML Developer's Guide</title>
        <genre>Computer</genre>
        <price>44.95</price>
        <publish_date>2000-10-01</publish_date>
        <description>An in-depth look at creating applications with XML.</description>
    </book>
    <book id="bk102">
        <author>Ralls, Kim</author>
        <title>Midnight Rain</title>
        <genre>Fantasy</genre>
        <price>5.95</price>
        <publish_date>2000-12-16</publish_date>
        <description>A former architect battles corporate zombies.</description>
    </book>
</catalog>"""

# Write XML to file for reading
xml_path = "/tmp/catalog.xml"
spark.sparkContext.parallelize([xml_data]).saveAsTextFile(xml_path)

# Read XML
xml_df = spark.read \
    .option("rowTag", "book") \
    .option("attributePrefix", "attr_") \
    .option("valueTag", "_value") \
    .xml(xml_path)

xml_df.show(truncate=False)

# Parse XML from string column
xml_strings = spark.createDataFrame([
    ('<person><name>Alice</name><age>30</age></person>',),
    ('<person><name>Bob</name><age>25</age></person>',),
], ["xml_string"])

# Using from_xml (requires spark-xml)
from pyspark.sql.types import schema_of_json

person_schema = StructType([
    StructField("name", StringType()),
    StructField("age", IntegerType()),
])

xml_parsed = xml_strings \
    .withColumn("parsed", from_xml(col("xml_string"), person_schema)) \
    .select(
        col("parsed.name").alias("name"),
        col("parsed.age").alias("age"),
    )

xml_parsed.show(truncate=False)

# Handle XML attributes
xml_with_attrs = """<order id="ORD-001" status="complete">
    <customer id="CUST-001">Alice</customer>
    <total currency="USD">249.95</total>
</order>"""

# Extract attributes using regex or UDF
@udf(returnType=MapType(StringType(), StringType()))
def extract_xml_attributes(xml_string):
    """Extract XML attributes from root element."""
    import re
    attrs = {}
    matches = re.findall(r'(\w+)="([^"]*)"', xml_string)
    for key, value in matches:
        attrs[key] = value
    return attrs

# Combine with element extraction
xml_df = spark.createDataFrame([(xml_with_attrs,)], ["xml"])
result = xml_df \
    .withColumn("attributes", extract_xml_attributes(col("xml"))) \
    .withColumn("order_id", col("attributes")["id"]) \
    .withColumn("status", col("attributes")["status"])

result.show(truncate=False)

Performance Metrics

Operation	1K Records	100K Records	10M Records	100M Records
JSON Read (schema inference)	< 1 sec	2-5 sec	30-60 sec	5-10 min
JSON Read (explicit schema)	< 1 sec	1-3 sec	10-20 sec	2-5 min
JSON String Parse (from_json)	< 1 sec	1-2 sec	8-15 sec	1-3 min
JSON Path Extract (get_json_object)	< 1 sec	2-5 sec	20-40 sec	3-8 min
Array Explode	< 1 sec	1-3 sec	10-20 sec	2-5 min
Deep Nesting Extraction (3+ levels)	< 1 sec	3-8 sec	45-90 sec	8-15 min
XML Read	< 1 sec	3-8 sec	45-90 sec	8-15 min
Schema Merge (mergeSchema)	< 1 sec	2-5 sec	20-40 sec	3-8 min
Malformed Record Handling	< 1 sec	1-3 sec	10-20 sec	2-5 min
Nested Aggregation	< 1 sec	2-5 sec	20-40 sec	3-8 min

Best Practices

Always define explicit schemas for production JSON/XML parsing to avoid schema inference overhead and ensure type consistency
Use PERMISSIVE mode with columnNameOfCorruptRecord to capture and investigate malformed records instead of losing data
Limit schema inference samples with spark.sql.columnNameOfCorruptRecord and spark.sql.json.filterNullRecords for large datasets
Flatten incrementally—extract top-level fields first, then explode arrays, to maintain data lineage and avoid row explosion
Cache parsed DataFrames when performing multiple operations on the same semi-structured data to avoid re-parsing
Use get_json_object for targeted extraction of specific fields instead of parsing the entire JSON structure
Enable mergeSchema cautiously—only use when schema evolution is expected and test with a small sample first
Partition XML/JSON files by a logical key (date, region) to enable predicate pushdown and reduce scan volume
Avoid explode on large arrays without filtering—use filter before explode to reduce the number of generated rows
Use from_json instead of get_json_object when extracting multiple fields from the same JSON to avoid repeated parsing
Implement custom UDFs for complex XML parsing when spark-xml doesn't support your specific XML structure
Monitor corrupt record counts in production—high corruption rates indicate upstream data quality issues that should be addressed at the source

Mathematical Foundation Summary

JSON/XML parsing complexity follows O(n × d) where n is document size and d is nesting depth. Schema inference uses statistical sampling where sample_size = min(N, max_samples) to infer types with confidence p = 1 - (1-c)^sample_size. The explode operator produces |output| = Σ|array_length(col_i)| rows, making pre-filtering critical. Column pruning reduces memory by Pruned Columns / Total Columns ratio, while predicate pushdown reduces I/O by filtering at the source level before parsing overhead.

See also: Geospatial Data (27), Bucketing Strategies (29), ML Feature Engineering (32)

JSON and XML Parsing in PySpark