Geospatial AI · 10 December 2024 · 12 min read

The Dawn of Geospatial Foundation Models: Why Clay Changes Everything

Foundation models are revolutionizing how we analyze Earth observation data. Clay and similar GeoFMs represent a paradigm shift from task-specific models to versatile, pre-trained systems that understand our planet.

Machine Learning · Remote Sensing · Foundation Models · Clay
[Image: Satellite view of Earth showing cloud patterns and landmasses. Photo by NASA on Unsplash]

The geospatial industry is experiencing its GPT moment. Just as large language models transformed natural language processing by learning general representations from vast text corpora, geospatial foundation models (GeoFMs) are now doing the same for Earth observation data. At the center of this revolution is Clay—an open-source foundation model that represents a fundamental shift in how we approach satellite imagery analysis.

What Makes Foundation Models Different

Traditional machine learning for remote sensing has followed a predictable pattern: collect labeled data for a specific task (building detection, land cover classification, crop type mapping), train a model from scratch, and hope it generalizes to new regions. This approach is expensive, time-consuming, and brittle. A model trained on European agricultural fields often fails spectacularly on African landscapes.

Foundation models flip this paradigm. By pre-training on massive amounts of unlabeled satellite imagery using self-supervised learning, these models develop rich internal representations of what Earth looks like across seasons, geographies, and spectral bands. The result is a model that can be fine-tuned for downstream tasks with far less labeled data—sometimes just a few dozen examples.
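To make "a few dozen examples" concrete, the sketch below fits a linear probe on top of frozen 768-dimensional embeddings. The embeddings here are random draws standing in for real encoder outputs, and the two-class setup is invented for illustration; only the workflow (frozen features plus a lightweight classifier) reflects the approach described above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-in for precomputed foundation-model embeddings (768-d, as Clay
# produces). In practice these come from the frozen encoder, not random
# draws; the mean shift simulates two separable land-cover classes.
n_per_class = 24  # "a few dozen examples"
crop = rng.normal(loc=0.5, scale=1.0, size=(n_per_class, 768))
forest = rng.normal(loc=-0.5, scale=1.0, size=(n_per_class, 768))

X = np.vstack([crop, forest])
y = np.array([0] * n_per_class + [1] * n_per_class)

# A lightweight linear probe is often all the "fine-tuning" needed when
# the upstream representations are strong.
clf = LogisticRegression(max_iter=1000).fit(X, y)
print(round(clf.score(X, y), 2))
```

The point of the sketch is economic, not algorithmic: the expensive representation learning happened once during pre-training, so the downstream task needs only a classifier small enough to train on a laptop.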

"GeoFMs offer immediate value without training. They represent an emerging research field and are a type of pre-trained vision transformer specifically adapted to geospatial data sources."
ACM SIGSPATIAL GeoAI 2024

Clay: Open Foundation Model for Earth

Clay emerged from the team behind Microsoft's Planetary Computer and operates as a fiscally sponsored project under Radiant Earth. Unlike proprietary alternatives, Clay is fully open-source, allowing researchers and practitioners to inspect, modify, and build upon the model.

The architecture is a Vision Transformer (ViT) adapted to understand geospatial and temporal relationships in Earth observation data. Clay uses a Masked Autoencoder (MAE) approach for self-supervised learning—the model learns by predicting masked portions of satellite images, developing robust feature representations in the process.
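The masking step at the heart of MAE pre-training can be sketched in a few lines. The numbers below (16 patches, 75% mask ratio) are illustrative toy values, not Clay's actual training configuration, and the zero "prediction" is a placeholder where the real decoder network would sit.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy tile split into 16 patches of 64 values each (e.g. 8x8 pixels,
# one band). Real inputs carry many spectral bands.
patches = rng.normal(size=(16, 64))

# MAE-style masking: hide most patches; the encoder sees only the rest.
mask_ratio = 0.75
n_masked = int(mask_ratio * len(patches))
perm = rng.permutation(len(patches))
masked_idx, visible_idx = perm[:n_masked], perm[n_masked:]

visible = patches[visible_idx]   # encoder input
targets = patches[masked_idx]    # what the decoder must reconstruct

# A real model predicts `targets` from `visible`; the training signal
# is mean-squared error computed on the masked patches only.
naive_pred = np.zeros_like(targets)  # placeholder for the decoder output
loss = np.mean((naive_pred - targets) ** 2)
print(visible.shape, targets.shape)
```

Because the reconstruction target is the imagery itself, no labels are needed, which is what lets these models train on planet-scale unlabeled archives.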

Key Clay Capabilities

  • Multi-spectral input — Works with all Sentinel-2 bands, though commonly uses RGB and NIR
  • Location-aware — Incorporates geographic coordinates as input features
  • Temporal understanding — Processes time series data to understand seasonal patterns
  • 768-dimensional embeddings — Rich representations for downstream tasks
  • Flexible inference — Can accept varying image sizes, resolutions, and band combinations
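A common way to make a model location-aware is to feed it a cyclical encoding of latitude and longitude, so that nearby places get nearby feature vectors and the longitude wrap-around at ±180° causes no discontinuity. The helper below is an illustrative sketch of that general idea; Clay's exact coordinate-encoding scheme may differ.

```python
import numpy as np

def encode_location(lat, lon):
    """Cyclical sin/cos encoding of coordinates. Illustrative only:
    this shows the general technique behind location-aware inputs,
    not Clay's specific implementation."""
    lat_r, lon_r = np.radians(lat), np.radians(lon)
    return np.array([np.sin(lat_r), np.cos(lat_r),
                     np.sin(lon_r), np.cos(lon_r)])

# Nearby cities map to nearby encodings, giving the model a smooth
# notion of "where on Earth" each tile comes from.
canberra = encode_location(-35.28, 149.13)
sydney = encode_location(-33.87, 151.21)
print(np.round(np.linalg.norm(canberra - sydney), 3))
```

Encodings like this are typically concatenated with (or added to) the patch tokens before they enter the transformer.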

The GeoFM Landscape in 2024

Clay isn't alone. The GeoFM space has exploded with alternatives, each with different design philosophies:

Notable Foundation Models

  • Prithvi-100M (IBM/NASA) — Trained on Harmonized Landsat-Sentinel data, strong on climate applications
  • SatMAE — Pioneering work on masked autoencoders for satellite imagery
  • SpectralGPT — Focuses on hyperspectral data with spectral-aware pretraining
  • DOFA — Dynamic One-For-All architecture for multi-sensor fusion
  • SatVision-Base — Microsoft's contribution optimized for high-resolution imagery

What sets Clay apart is its practical focus on deployment and its emphasis on similarity search. Clay-powered systems can detect emerging deforestation patterns before they expand into large-scale clearing operations—essentially enabling "reverse image search" for the planet.
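The "reverse image search" idea reduces to nearest-neighbour lookup in embedding space. The sketch below uses random vectors as stand-ins for tile embeddings, with one planted near-duplicate so the search has a known answer; a production system would swap in real Clay embeddings and an approximate-nearest-neighbour index.

```python
import numpy as np

rng = np.random.default_rng(7)

# Stand-in archive of 1,000 tile embeddings (768-d, matching Clay's
# embedding size), L2-normalized so cosine similarity is a dot product.
archive = rng.normal(size=(1000, 768))
archive /= np.linalg.norm(archive, axis=1, keepdims=True)

# Query: a lightly perturbed copy of archive entry 42, simulating a
# fresh observation of a scene already in the archive.
query = archive[42] + 0.01 * rng.normal(size=768)
query /= np.linalg.norm(query)

# Similarity search: score every tile, take the top matches.
scores = archive @ query
top5 = np.argsort(scores)[::-1][:5]
print(top5[0])  # the planted match, index 42
```

At archive scale the brute-force dot product is replaced by a vector index (FAISS, pgvector, and similar), but the embedding-space logic is identical.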

Current Limitations and Challenges

Foundation models aren't magic. Research from ACM SIGSPATIAL notes that on multimodal geospatial tasks—those requiring fusion of satellite imagery with POI data, street-level photos, or tabular attributes—existing FMs still underperform task-specific models.

Pixel-level precision remains challenging. Transformer architectures reduce feature resolution 4-5x, sacrificing the fine-grained spatial detail needed for precise segmentation or sub-meter change detection. This is where hybrid approaches that pair GeoFMs with specialized segmentation heads (such as SAM 2) become essential.
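To make the resolution gap concrete: patch-based encoders emit one feature vector per patch, so a hybrid pipeline must decode that coarse grid back to pixel resolution. The sketch below uses nearest-neighbour upsampling as a stand-in for a learned decoder or segmentation head, with an assumed 64x64 tile and 16x16 patches chosen purely for illustration.

```python
import numpy as np

# A patch-based encoder turns a 64x64 tile (16x16 patches) into a 4x4
# grid of tokens: one value per patch stands in for a feature vector.
coarse = np.arange(16, dtype=float).reshape(4, 4)

def upsample_nearest(grid, factor):
    """Minimal decoder step: nearest-neighbour upsampling. Real hybrid
    pipelines use learned decoders or promptable segmentation models
    instead of this blocky placeholder."""
    return np.repeat(np.repeat(grid, factor, axis=0), factor, axis=1)

fine = upsample_nearest(coarse, 16)
print(fine.shape)  # back to (64, 64), though fine detail is not recovered
```

The blocky output shows why the decoder matters: restoring the pixel grid is easy, but restoring the detail lost during tokenization requires a head trained for the segmentation task.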

What's Next: The LLM-GeoFM Convergence

The most exciting frontier is the integration of large language models with GeoFMs. Imagine querying a satellite archive with natural language: "Show me all locations where solar panel installations increased by more than 20% between 2020 and 2024" and receiving not just coordinates, but explanatory analysis grounded in the imagery.

This convergence is already happening. AWS's geospatial FM service combines Prithvi with Claude for natural language interaction. Development Seed's work on semantic search using Clay embeddings points toward a future where geospatial analysis is accessible to domain experts without ML expertise.

Our Perspective

Having worked with enterprise GIS systems for nearly a decade, I see GeoFMs as the most significant shift since cloud-native geospatial formats. The ability to extract meaningful features from imagery without manual labeling campaigns changes the economics of satellite analytics entirely.

However, I'm skeptical of claims that GeoFMs will replace domain expertise. The real value lies in augmentation—combining foundation model capabilities with deep understanding of specific geographies, sensor characteristics, and application requirements. The teams that will succeed are those building on Clay and similar models while maintaining rigorous validation against ground truth.
