PostGIS Meets pgvector: Building Geospatial AI Applications at Scale
PostgreSQL with PostGIS and pgvector is becoming the unified database for location-aware AI. Learn how Earth Index manages 3 billion vectors for planetary-scale environmental monitoring.
PostgreSQL has always been the spatial data workhorse, but 2024 marked a turning point. With pgvector reaching production maturity, organizations can now run vector similarity search alongside PostGIS spatial operations—in the same database, in the same query. The result is a unified stack for location-aware AI applications that would otherwise require three separate systems.
The Vector Database Explosion—and PostgreSQL's Answer
The AI boom created demand for vector databases—systems optimized for storing and searching high-dimensional embeddings. Pinecone, Weaviate, Milvus, and others emerged as specialized solutions. But for teams already running PostgreSQL, adding another database means operational complexity and data synchronization headaches.
pgvector closes this gap. It's an open-source PostgreSQL extension that adds vector data types and approximate nearest neighbor (ANN) search. Install it alongside PostGIS, and suddenly your spatial database speaks the language of embeddings.
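As a minimal sketch of what "installing it alongside PostGIS" looks like in practice, the statements below enable both extensions and create one table that carries a geometry and an embedding in the same row. The `places` table, its columns, and the sample rows are hypothetical, and `vector(3)` is used only to keep the literals short.

```sql
-- Enable both extensions in the same database.
CREATE EXTENSION IF NOT EXISTS postgis;
CREATE EXTENSION IF NOT EXISTS vector;  -- pgvector registers itself as "vector"

-- Hypothetical table: each row holds a location and an embedding.
-- vector(3) keeps the example readable; real embeddings are typically
-- hundreds of dimensions wide.
CREATE TABLE places (
    id        bigserial PRIMARY KEY,
    name      text NOT NULL,
    geom      geometry(Point, 4326) NOT NULL,
    embedding vector(3) NOT NULL
);

INSERT INTO places (name, geom, embedding) VALUES
    ('Cafe A', ST_SetSRID(ST_MakePoint(-122.42, 37.77), 4326), '[0.1, 0.9, 0.2]'),
    ('Cafe B', ST_SetSRID(ST_MakePoint(-122.27, 37.87), 4326), '[0.8, 0.1, 0.3]');

-- A spatial function and a vector distance operator in one statement.
SELECT name,
       ST_AsText(geom) AS wkt,
       embedding <-> '[0.1, 0.8, 0.2]' AS l2_distance
FROM places
ORDER BY l2_distance;
```

The `<->` operator is pgvector's Euclidean (L2) distance; `<=>` (cosine distance) and `<#>` (negative inner product) are the other common choices.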
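Timescale's published benchmarks report that PostgreSQL with pgvector and pgvectorscale can be faster than Pinecone and about 75% cheaper when self-hosted, while remaining 100% open source. For datasets under 100 million vectors, general-purpose databases with vector extensions can match or outperform specialized alternatives.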
The Intersection of Spatial and Semantic
The power emerges when you combine spatial and semantic queries. Consider these use cases:
Hybrid Query Examples
- Find properties similar to this listing AND within 5km of a specific school
- Retrieve satellite imagery embeddings for areas that match this deforestation pattern
- Search for restaurants semantically similar to "cozy Italian bistro" within a polygon
- Find the nearest 10 sites with environmental conditions matching this reference
- Cluster geographic features by both location and semantic embedding
Each query combines ST_ spatial functions with vector similarity operators, executed in a single round-trip. No application-level joins between separate databases, no data consistency issues, no operational overhead of multiple systems.
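One of the use cases above, semantically similar restaurants near a point, might be sketched as follows. The `restaurants` table, its column names, the coordinates, and the three-dimensional query vector are all hypothetical; in a real application the vector would come from an embedding model applied to the phrase "cozy Italian bistro". The sketch assumes the `postgis` and `vector` extensions are already installed.

```sql
-- Hypothetical schema, for illustration only.
CREATE TABLE IF NOT EXISTS restaurants (
    id                    bigserial PRIMARY KEY,
    name                  text NOT NULL,
    location              geography(Point, 4326) NOT NULL,
    description_embedding vector(3) NOT NULL
);

-- Ten restaurants within 5 km of a point, ranked by semantic closeness.
SELECT name,
       description_embedding <=> '[0.9, 0.1, 0.2]' AS cosine_distance
FROM restaurants
WHERE ST_DWithin(
        location,
        ST_SetSRID(ST_MakePoint(-122.27, 37.87), 4326)::geography,
        5000  -- metres, because both arguments are geography
      )
ORDER BY description_embedding <=> '[0.9, 0.1, 0.2]'
LIMIT 10;
```

Storing the location as `geography` lets `ST_DWithin` take its radius in metres; with a `geometry` column in SRID 4326 the same argument would be interpreted in degrees.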
Case Study: Earth Index at 3 Billion Vectors
Earth Index provides the most impressive validation of this approach at scale. Their planetary environmental monitoring system manages 3 billion vectors in PostgreSQL, combining VectorChord (a high-performance vector search extension that is compatible with pgvector's data types) with PostGIS.
The cost comparison is stark: approximately $12,000 per month for their PostgreSQL setup versus an estimated $237,000 monthly for a comparable managed cloud vector service. They achieve this by keeping vector embeddings alongside geospatial metadata—tile locations, administrative boundaries, proximity to features like rivers.
-- Find similar satellite embeddings within a geographic region.
-- $1 is the query embedding, passed in as a bind parameter.
-- Note that <-> returns L2 distance, so smaller values mean closer matches.
SELECT tile_id,
       embedding <-> $1 AS distance
FROM satellite_embeddings
WHERE ST_Intersects(
        tile_geom,
        ST_MakeEnvelope(-122.5, 37.5, -122.0, 38.0, 4326)
      )
  AND embedding <-> $1 < 0.5
ORDER BY embedding <-> $1
LIMIT 100;
pgvector 0.8.0: The Performance Leap
The 0.8.0 release in late 2024 delivered dramatic improvements:
pgvector 0.8.0 Improvements
- Up to 9× faster query processing, with iterative index scans
- Up to 100× more relevant results, through improved recall
- 3-5× query throughput over previous versions
- Binary quantization (available since pgvector 0.7.0): 32× memory reduction with 95% accuracy
- 40-80% cost reduction compared to specialized vector databases
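The 32× figure for binary quantization follows from storing 1 bit per dimension instead of a 32-bit float. A minimal sketch of the usual two-stage pattern is shown here: a Hamming-distance search over the quantized index, followed by a re-rank on the full-precision vectors. The `tile_vectors` table and the four-dimensional literals are hypothetical, and the `vector` extension is assumed to be installed.

```sql
-- Hypothetical table; vector(4) keeps the literals short.
CREATE TABLE tile_vectors (
    tile_id   bigint PRIMARY KEY,
    embedding vector(4) NOT NULL
);

-- Index the 1-bit-per-dimension form of each embedding.
CREATE INDEX tile_vectors_bq_idx ON tile_vectors
    USING hnsw ((binary_quantize(embedding)::bit(4)) bit_hamming_ops);

-- Stage 1: Hamming distance over the binary index fetches 20 candidates.
-- Stage 2: the full-precision vectors re-rank them by cosine distance.
SELECT tile_id
FROM (
    SELECT tile_id, embedding
    FROM tile_vectors
    ORDER BY binary_quantize(embedding)::bit(4)
             <~> binary_quantize('[1, -2, 3, 0.5]'::vector)
    LIMIT 20
) AS candidates
ORDER BY embedding <=> '[1, -2, 3, 0.5]'
LIMIT 5;
```

The memory saving applies to the index; the table still stores the original vectors, which is what makes the re-ranking step possible.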
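Iterative index scans are particularly important for combined spatial-vector queries. With an approximate index, filters such as a spatial predicate are applied after the index scan, so previously a query could return fewer rows than its LIMIT when many of the nearest candidates failed the spatial filter. With iterative scans enabled, the index keeps scanning deeper until enough rows pass the filter or a configured scan limit is reached.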
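Iterative scans are off by default and are switched on per session or per transaction. The settings below are a sketch for an HNSW index on pgvector 0.8.0 or later; the values shown are illustrative rather than tuned recommendations.

```sql
-- Keep scanning the HNSW index until enough rows survive the WHERE clause.
-- strict_order preserves exact distance ordering; relaxed_order allows
-- slightly out-of-order results in exchange for better recall.
SET hnsw.iterative_scan = strict_order;

-- Upper bound on tuples visited before the scan stops (default 20000).
SET hnsw.max_scan_tuples = 20000;

-- IVFFlat has equivalent settings and supports relaxed ordering only.
SET ivfflat.iterative_scan = relaxed_order;
SET ivfflat.max_probes = 100;
```

With these in place, a filtered query such as the region search on `satellite_embeddings` shown earlier is far more likely to return its full `LIMIT` even when the bounding box is selective.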
Indexing Strategies
pgvector offers two indexing approaches, each with tradeoffs:
Index Comparison
- HNSW (Hierarchical Navigable Small World) — Better query performance, slower builds, more memory. Best for read-heavy workloads.
- IVFFlat (Inverted File Index) — Faster builds, less memory, requires periodic re-indexing. Better for frequently updated datasets.
For geospatial applications, HNSW is typically preferred. Satellite embeddings, location descriptions, and spatial feature vectors don't change frequently once computed. The build time investment pays off in query performance.
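As a sketch of how these choices translate into DDL, the statements below index the `satellite_embeddings` table from the earlier query. The index names are hypothetical, the HNSW parameters shown are pgvector's defaults, and the `lists` value is only a placeholder.

```sql
-- HNSW index for L2 distance, matching the <-> operator.
-- m and ef_construction are shown at their default values.
CREATE INDEX satellite_embeddings_embedding_hnsw
    ON satellite_embeddings
    USING hnsw (embedding vector_l2_ops)
    WITH (m = 16, ef_construction = 64);

-- GiST index so the spatial predicate can use an index as well.
CREATE INDEX satellite_embeddings_geom_gist
    ON satellite_embeddings
    USING gist (tile_geom);

-- IVFFlat alternative: build it after the table is populated, because
-- the cluster centroids are computed from the data present at build time.
CREATE INDEX satellite_embeddings_embedding_ivfflat
    ON satellite_embeddings
    USING ivfflat (embedding vector_l2_ops)
    WITH (lists = 1000);

-- Query-time recall knob for HNSW (default 40); higher is slower but
-- returns more accurate neighbours.
SET hnsw.ef_search = 100;
```

In practice only one of the two vector indexes would be created on a given column; both are shown here for comparison. The operator class must match the distance operator used in queries, so a query ordered by `<=>` would need `vector_cosine_ops` instead.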
pgvectorscale: Beyond pgvector
Timescale's pgvectorscale extension builds on pgvector with innovations for larger scale: StreamingDiskANN stores part of the index on disk rather than requiring everything in memory, making billion-vector deployments economically viable.
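A minimal sketch of enabling pgvectorscale and building a StreamingDiskANN index is shown below. It reuses the `satellite_embeddings` table from the earlier query and assumes cosine distance is an acceptable metric for the embeddings; the index name is hypothetical.

```sql
-- CASCADE also installs pgvector if it is not already present.
CREATE EXTENSION IF NOT EXISTS vectorscale CASCADE;

-- StreamingDiskANN index; queries against it order by the <=> operator.
CREATE INDEX satellite_embeddings_embedding_diskann
    ON satellite_embeddings
    USING diskann (embedding vector_cosine_ops);
```

Because the index type is just another access method, the PostGIS predicates in the `WHERE` clause stay unchanged when switching from `hnsw` to `diskann`.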
Our Perspective
For geospatial AI applications, the PostGIS + pgvector combination is now our default recommendation. The operational simplicity of a single database, combined with the ability to express complex spatial-semantic queries in SQL, outweighs the marginal performance benefits of specialized vector databases.
The exception is truly massive scale—hundreds of billions of vectors—where purpose-built systems still have the edge. But for the vast majority of geospatial applications, PostgreSQL is enough, and enough is better than over-engineered.
References & Further Reading
pgvector GitHub Repository
Official pgvector extension for PostgreSQL
https://github.com/pgvector/pgvector
3 Billion Vectors in PostgreSQL to Protect the Earth
Earth Index case study on planetary-scale vector search
https://blog.vectorchord.ai/3-billion-vectors-in-postgresql-to-protect-the-earth
PostgreSQL as a Vector Database: Complete Guide
Timescale's comprehensive pgvector tutorial
https://www.timescale.com/blog/postgresql-as-a-vector-database-create-store-and-query-openai-embeddings-with-pgvector
pgvector and PostGIS: Unlocking Advanced PostgreSQL Use Cases
Practical guide to combining pgvector with PostGIS
https://blogs.vultr.com/PG-Vector-PostGIS
pgvector with PostGIS Example Repository
Code examples for hybrid spatial-vector queries
https://github.com/scitus-ca/pgvector_with_postgis