Geospatial 21 December 2024 12 min read

GeoParquet 2.0: Native Geometry Types and the 191× Query Speedup

GeoParquet 2.0 brings native GEOMETRY and GEOGRAPHY types to Apache Parquet, with bbox covering indexes enabling 191× faster spatial queries. The geospatial data interchange format matures.

GeoParquet · Parquet · Geospatial · Cloud Native · Apache Iceberg
Earth from space showing city lights and data connections
NASA on Unsplash

GeoParquet has rapidly become the de facto standard for geospatial data interchange, but version 1.x had a fundamental limitation: geometries were stored as opaque WKB blobs, invisible to Parquet's optimization machinery. GeoParquet 2.0 changes everything with native GEOMETRY and GEOGRAPHY types, bounding box covering indexes, and deep integration with Apache Iceberg. The result: spatial queries up to 191× faster than previous approaches.

From WKB Blobs to Native Types

GeoParquet 1.x stored geometries as Well-Known Binary (WKB) in binary columns. This worked—the data was there, tools could read it—but Parquet treated geometry as an opaque blob. No statistics, no predicate pushdown, no optimization. Every spatial query required reading and parsing all geometry data.
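To make that opacity concrete, here is what a 1.x reader faces: a minimal WKB point built with Python's `struct` module. Even retrieving a single coordinate means parsing the blob byte by byte.

```python
import struct

# A WKB point as GeoParquet 1.x stores it: an opaque byte string.
# Layout: byte-order flag (1 = little-endian), geometry type (1 = Point),
# then the x and y coordinates as 8-byte doubles.
wkb_point = struct.pack('<BIdd', 1, 1, -122.4194, 37.7749)

# Parquet sees only these 21 bytes: no min/max statistics are possible.
# Any operation, even "what is x?", requires parsing the blob:
byte_order, geom_type, x, y = struct.unpack('<BIdd', wkb_point)
print(geom_type, x, y)  # 1 -122.4194 37.7749
```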

GeoParquet 2.0 introduces native GEOMETRY and GEOGRAPHY logical types in the Parquet specification itself. Parquet now understands that a column contains spatial data and can maintain bounding box statistics at the row group level. Query engines use these statistics to skip entire row groups that don't intersect the query region.
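The pruning step itself is simple interval arithmetic. A sketch with made-up row group statistics, not a real Parquet reader:

```python
# Row-group pruning sketch: each row group carries bounding-box
# statistics (xmin, ymin, xmax, ymax); the reader skips any group
# whose bbox cannot intersect the query region.

def bbox_intersects(a, b):
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    return ax0 <= bx1 and bx0 <= ax1 and ay0 <= by1 and by0 <= ay1

# Hypothetical per-row-group statistics for a global dataset
row_group_stats = [
    (-125.0, 32.0, -114.0, 42.0),   # US West Coast
    (-10.0, 35.0, 30.0, 60.0),      # Europe
    (110.0, -45.0, 155.0, -10.0),   # Australia
]

query = (-122.5, 37.7, -122.3, 37.9)  # San Francisco
to_read = [i for i, rg in enumerate(row_group_stats)
           if bbox_intersects(rg, query)]
print(to_read)  # only the West Coast row group survives: [0]
```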

GeoParquet 2.0 with bounding box covering transforms spatial queries from O(n) full scans to O(log n) indexed lookups. For global-scale datasets, this is the difference between minutes and milliseconds.
Cloud-Native Geospatial Foundation

The 191× Performance Story

The headline number comes from real-world benchmarks on OpenStreetMap building data. A spatial query that took 8.4 seconds with GeoParquet 1.x completed in 0.044 seconds with GeoParquet 2.0's bounding box covering. The speedup varies by query selectivity—highly selective spatial filters see the largest gains.

performance_benchmarks.py
# GeoParquet performance benchmarks
# Dataset: OpenStreetMap buildings (1.2 billion features)

# File size comparison:
# - Shapefile:    480 GB (split into thousands of files)
# - GeoJSON:      890 GB (no compression, huge text overhead)
# - GeoPackage:   320 GB (SQLite-based)
# - GeoParquet:    95 GB (80% smaller than Shapefile)

# Query: Buildings within San Francisco bbox
# Cold cache, data on S3

# Shapefile (ogr2ogr):       45.2 seconds
# GeoPackage (SQLite):       12.8 seconds
# GeoParquet 1.x (no bbox):   8.4 seconds
# GeoParquet 2.0 (bbox):      0.044 seconds  (191× faster than 1.x)

# The 191× speedup comes from:
# 1. Row group pruning via bbox statistics (skip 99.9% of data)
# 2. Column pruning (only read geometry + requested attributes)
# 3. Predicate pushdown to storage layer
# 4. Native geometry avoids WKB parsing

# Memory usage for 10M feature query:
# GeoJSON:    24 GB peak
# Shapefile:  18 GB peak
# GeoParquet:  2 GB peak  (streaming row groups)

Performance Benefits

  • 191× faster spatial queries — Bounding box covering skips irrelevant row groups
  • 80% smaller files — Columnar compression outperforms Shapefile significantly
  • Zero deserialization — Native types avoid WKB parsing overhead
  • Parallel reads — Row group structure enables multi-threaded processing
  • Cloud-optimized — Range requests fetch only needed data from object storage

Native Geometry Encoding

The technical foundation is GeoArrow, a memory specification for geometry data that maps naturally to Parquet's columnar structure. Points become coordinate arrays; linestrings become nested arrays of coordinates; polygons add another level for rings. This encoding enables zero-copy access—geometry operations work directly on Parquet's memory layout without deserialization.

native_geometry_types.py
# GeoParquet 2.0 native types vs 1.x WKB encoding

# GeoParquet 1.x: Geometry stored as WKB (Well-Known Binary)
# - Opaque binary blob to Parquet
# - No statistics possible
# - Requires parsing for any operation

# GeoParquet 2.0: Native geometry logical type
# - Parquet understands geometry structure
# - Bounding box in column statistics
# - Zero-copy access to coordinates

import pyarrow as pa

# GeoParquet 2.0 schema with native geometry
schema = pa.schema([
    pa.field('id', pa.int64()),
    pa.field('name', pa.string()),
    # Native geometry: fixed-size [x, y] doubles, tagged as a GeoArrow
    # extension type via Arrow field metadata
    pa.field(
        'geometry',
        pa.list_(pa.float64(), 2),
        metadata={
            'ARROW:extension:name': 'geoarrow.point',
            # real files carry full PROJJSON here; abbreviated for clarity
            'ARROW:extension:metadata': '{"crs": "OGC:CRS84"}',
        },
    ),
])

# Encoding options for different geometry types:
# - geoarrow.point: Coordinate arrays
# - geoarrow.linestring: Nested coordinate arrays
# - geoarrow.polygon: Multi-nested with rings
# - geoarrow.multipoint, multilinestring, multipolygon
# - geoarrow.geometry: Mixed geometry types (WKB fallback)

The GEOGRAPHY type deserves special mention. While GEOMETRY uses planar Cartesian coordinates, GEOGRAPHY represents points on an ellipsoid (typically WGS84). Distance calculations account for Earth's curvature, making global-scale analysis correct by default. This distinction, common in PostGIS, now exists natively in Parquet.
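The practical difference is easy to see with two distant points. A sketch using a spherical haversine distance as a stand-in for the full WGS84 geodesic a real GEOGRAPHY implementation would use:

```python
from math import radians, sin, cos, asin, sqrt, hypot

# Same two points, measured two ways: New York and London, lon/lat degrees
ny, ldn = (-74.0060, 40.7128), (-0.1276, 51.5074)

# GEOMETRY-style planar distance: degrees treated as Cartesian units,
# a number with no physical meaning at this scale
planar_deg = hypot(ldn[0] - ny[0], ldn[1] - ny[1])

# GEOGRAPHY-style great-circle distance (haversine, mean Earth radius;
# WGS84 ellipsoidal distance differs by well under 1%)
lon1, lat1, lon2, lat2 = map(radians, (*ny, *ldn))
a = (sin((lat2 - lat1) / 2) ** 2
     + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
km = 6371.0 * 2 * asin(sqrt(a))

print(f"{planar_deg:.1f} 'degrees' vs {km:.0f} km")
```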

GeoParquet 2.0 Features

  • Native GEOMETRY type — First-class geometry encoding in Parquet logical types
  • Native GEOGRAPHY type — Spherical geometry support for global-scale analysis
  • Bounding box covering — Spatial indexes via bbox column statistics for 191× faster queries
  • Iceberg 3 integration — Full geometry support in Apache Iceberg table format
  • Multi-CRS support — Per-column coordinate reference systems with PROJJSON
  • Geometry statistics — Row group level spatial bounds for partition pruning

Reading GeoParquet Data

The ecosystem support for GeoParquet is excellent. GeoPandas, DuckDB, GDAL, and cloud data warehouses all read and write the format. The key is ensuring your tool version supports GeoParquet 2.0's bbox covering for optimal query performance.

read_geoparquet.py
import geopandas as gpd
import duckdb

# GeoPandas: Read GeoParquet with spatial filtering
gdf = gpd.read_parquet(
    "s3://open-data/buildings.parquet",
    bbox=(-122.5, 37.7, -122.3, 37.9),  # San Francisco bbox
    columns=["geometry", "height", "type"]
)

# DuckDB: Native GeoParquet with SQL
conn = duckdb.connect()
conn.execute("INSTALL spatial; LOAD spatial;")

# Bounding box filter uses covering index (191× speedup)
result = conn.execute("""
    SELECT *
    FROM read_parquet('buildings.parquet')
    WHERE ST_Intersects(
        geometry,
        ST_GeomFromText('POLYGON((-122.5 37.7, -122.3 37.7,
                                   -122.3 37.9, -122.5 37.9,
                                   -122.5 37.7))')
    )
""").fetchdf()

# GeoParquet 2.0: Native geometry means no WKB parsing
# Statistics in footer enable row group pruning

DuckDB's spatial extension is particularly impressive. The query planner understands GeoParquet's spatial statistics and automatically applies predicate pushdown. A spatial intersects query against a 100GB file on S3 may only fetch a few megabytes of data—the bbox statistics identify which row groups are relevant.

Writing GeoParquet Data

Creating GeoParquet 2.0 files requires explicit configuration to enable bbox covering. The write_covering_bbox option instructs the writer to compute and store bounding boxes in row group metadata. Without this, files are still valid GeoParquet but won't benefit from spatial indexing.

write_geoparquet.py
import geopandas as gpd
from shapely.geometry import Point
import pyarrow.parquet as pq

# Create GeoDataFrame
gdf = gpd.GeoDataFrame({
    'name': ['Location A', 'Location B', 'Location C'],
    'value': [100, 200, 300],
    'geometry': [Point(-122.4, 37.8), Point(-122.3, 37.7), Point(-122.5, 37.9)]
}, crs="EPSG:4326")

# Write GeoParquet 2.0 with bbox covering
gdf.to_parquet(
    "output.parquet",
    compression="zstd",
    # GeoParquet 2.0 options
    schema_version="2.0.0",
    write_covering_bbox=True,  # Enable spatial index
    geometry_encoding="geoarrow"  # Native geometry encoding
)

# Verify spatial metadata
pq_file = pq.ParquetFile("output.parquet")
geo_metadata = pq_file.schema_arrow.metadata[b'geo']
print(geo_metadata)  # Shows bbox, CRS, geometry type

Compression choice matters for GeoParquet. Zstandard (zstd) typically achieves the best compression ratios for coordinate data while maintaining fast decompression. For highly repetitive data (many similar geometries), dictionary encoding provides additional benefits.

Apache Iceberg Integration

Apache Iceberg 3 introduces native geometry and geography types, enabling GeoParquet 2.0 as the underlying storage format for geospatial data lakes. Iceberg's table format adds ACID transactions, schema evolution, and time travel on top of GeoParquet's storage efficiency.

iceberg_geometry.sql
-- Apache Iceberg 3 with GeoParquet geometry support

-- Create Iceberg table with geometry column
CREATE TABLE buildings (
    id BIGINT,
    name STRING,
    height DOUBLE,
    geohash STRING,      -- precomputed geohash for spatial partitioning
    footprint GEOMETRY,  -- Native geometry type
    location GEOGRAPHY   -- Spherical geography type
)
USING iceberg
PARTITIONED BY (truncate(6, geohash))  -- partition by geohash prefix
TBLPROPERTIES (
    'write.parquet.geometry.covering.bbox' = 'true'
);

-- Insert with geometry literals
INSERT INTO buildings VALUES (
    1,
    'Empire State Building',
    443.2,
    'dr5ru',  -- geohash of the point location
    ST_GeomFromText('POLYGON((...))'),
    ST_GeogFromText('POINT(-73.9857 40.7484)')
);

-- Spatial query with partition pruning
SELECT name, height
FROM buildings
WHERE ST_DWithin(
    location,
    ST_GeogFromText('POINT(-73.98 40.75)'),
    1000  -- 1km radius
);
-- Query planner uses bbox statistics to skip irrelevant files

The combination is powerful for data engineering teams. Iceberg handles the complexity of data lake management—file compaction, partition evolution, concurrent writes—while GeoParquet provides optimal storage and query performance for spatial data. This is the architecture replacing enterprise geodatabases in modern stacks.

Ecosystem Support

  • DuckDB — Full read/write with spatial functions via extension
  • GeoPandas — Native GeoParquet I/O with pyarrow backend
  • GDAL 3.5+ — GeoParquet driver for interoperability
  • Apache Sedona — Spark-based distributed spatial processing
  • BigQuery — Native GeoParquet loading and export
  • Snowflake — GeoParquet support in Geospatial features

Cloud data warehouses have moved quickly to support GeoParquet. BigQuery and Snowflake both load GeoParquet natively, making it the preferred format for spatial data ingestion. The days of converting to proprietary formats or using legacy Shapefiles for data interchange are ending.

Migration Considerations

Migrating to GeoParquet 2.0 is straightforward for most workflows. Existing GeoParquet 1.x files continue to work—readers are backward compatible. The question is when to regenerate files with 2.0 features.

Prioritize for: Large datasets (100GB+), frequently queried with spatial filters, stored on object storage where read amplification is costly. The bbox covering optimization provides the most benefit here.

Lower priority: Small datasets, full-table scans, local disk storage. GeoParquet 1.x is already fast for these cases; the upgrade provides marginal benefit.

Watch for: Tool version requirements. GeoPandas 0.14+, DuckDB 0.9+, and GDAL 3.8+ support GeoParquet 2.0 features. Older versions may read files but won't leverage bbox statistics.

Our Perspective

GeoParquet 2.0 completes the cloud-native geospatial stack we've been building toward. Combined with Zarr for multi-dimensional arrays, COG for raster imagery, and PMTiles for vector tiles, there's now a coherent, open, cloud-optimized format for every geospatial data type. The interoperability story is compelling—these formats work together, read by common tools, stored on common infrastructure.

The 191× query speedup is dramatic, but the broader impact is architectural. Spatial analysis that previously required specialized databases (PostGIS, Enterprise Geodatabase) can now run on commodity data lake infrastructure. DuckDB queries GeoParquet as fast as PostGIS queries local tables—and the data lives on S3 at pennies per gigabyte.

My recommendation: adopt GeoParquet 2.0 as your default vector data format. Convert legacy archives opportunistically (largest/most queried files first), and ensure new data pipelines output GeoParquet 2.0 with bbox covering enabled. The tooling is mature, the ecosystem is aligned, and the performance benefits are substantial.

For teams building spatial data platforms, the GeoParquet + Iceberg combination deserves serious evaluation. It provides the data management capabilities of enterprise geodatabases—transactions, schema evolution, time travel—with cloud-native performance characteristics. This is the architecture that will power the next generation of GIS infrastructure.
