HPC / Machine Learning 2021

UncoverML Geoscience Pipeline

Deployed a machine learning toolkit for geoscience datasets on the National Computational Infrastructure (NCI), enabling distributed multi-core, multi-node processing for compute-intensive geological analysis.

Client

Geoscience Australia

UncoverML Geoscience Pipeline - Data analytics visualization

Key Results

  • HPC Deployment: Multi-node
  • Analysis Time: Weeks → Hours
  • Team Usage: National
  • Contribution: Open Source

The Challenge

Geoscientists needed to apply machine learning to massive geological datasets but lacked infrastructure and tools optimized for high-performance computing environments.

Key challenges included:

  • Massive datasets too large for single-machine processing
  • Complex HPC environment with specific requirements
  • Need for distributed processing across multiple nodes
  • Integration with existing geoscience workflows
  • Legacy Python 2 codebase requiring migration

National Computational Infrastructure

NCI is Australia's national facility providing high-performance computing services to researchers. Optimizing workflows for NCI lets researchers tackle problems that were previously infeasible.

  • Gadi Supercomputer: Multi-petaflop computing power
  • Massive Datasets: Petabytes of geoscience data

Our Solution

Designed and deployed distributed processing infrastructure on NCI, implementing feature extraction, hyperparameter optimization, prediction mapping, and model exploration capabilities. Refactored the codebase to Python 3 for future compatibility.

MPI Distribution

Implemented MPI-based parallelization to distribute workloads efficiently across multiple compute nodes.
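
As a rough illustration of this kind of distribution, the sketch below splits a list of work items into contiguous blocks, one per MPI rank, and combines partial results with a reduction. It is a minimal sketch, not the project's actual code: the function names `chunk_bounds` and `sum_rows_distributed` are hypothetical, and the use of mpi4py is an assumption.

```python
def chunk_bounds(n_items, n_ranks, rank):
    """Return the half-open [start, stop) range of items owned by `rank`.

    Items are split into contiguous blocks whose sizes differ by at most one,
    so no rank carries more than one extra item.
    """
    base, extra = divmod(n_items, n_ranks)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop


def sum_rows_distributed(rows):
    """Hypothetical example: each rank sums its own block, then all ranks combine."""
    # Imported lazily so chunk_bounds stays usable on machines without MPI.
    from mpi4py import MPI  # assumption: mpi4py is the MPI binding in use

    comm = MPI.COMM_WORLD
    start, stop = chunk_bounds(len(rows), comm.Get_size(), comm.Get_rank())
    local_total = sum(rows[start:stop])
    # allreduce gives every rank the global total.
    return comm.allreduce(local_total, op=MPI.SUM)
```

A script built this way is typically started through an MPI launcher such as `mpirun`, so that every process runs the same code and differs only in its rank.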

Feature Extraction

Built scalable feature extraction pipelines for processing large geospatial raster datasets.
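
One common way to keep raster processing scalable is to read and process the data tile by tile rather than loading a whole raster into memory. The sketch below illustrates that idea with NumPy only. The names `iter_windows` and `tile_means` are hypothetical, the per-tile mean is a placeholder feature, and an in-memory array stands in for a raster that a library such as Rasterio would normally read window by window.

```python
import numpy as np


def iter_windows(height, width, tile):
    """Yield (row_off, col_off, n_rows, n_cols) tiles covering a height x width grid."""
    for row_off in range(0, height, tile):
        for col_off in range(0, width, tile):
            yield (row_off, col_off,
                   min(tile, height - row_off),
                   min(tile, width - col_off))


def tile_means(band, tile):
    """Compute one placeholder feature (the mean) per tile of a 2-D band."""
    features = []
    for row_off, col_off, n_rows, n_cols in iter_windows(*band.shape, tile):
        # Only this tile's pixels are touched at a time.
        block = band[row_off:row_off + n_rows, col_off:col_off + n_cols]
        features.append(float(block.mean()))
    return features
```

Edge tiles are clipped to the raster bounds, so the tile size does not need to divide the raster dimensions evenly.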

Hyperparameter Optimization

Automated hyperparameter tuning that uses HPC resources to run exhaustive searches.
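
An exhaustive search lends itself to parallel execution because every parameter combination can be evaluated independently. The sketch below enumerates a grid and deals the combinations out to workers in round-robin order. It is an illustration under stated assumptions: `expand_grid` and `candidates_for_rank` are hypothetical names, and the parameter names in the usage note are examples rather than the project's settings.

```python
from itertools import product


def expand_grid(grid):
    """Expand {param: [values...]} into a list of every parameter combination."""
    keys = sorted(grid)  # fixed key order so every worker sees the same list
    return [dict(zip(keys, values))
            for values in product(*(grid[key] for key in keys))]


def candidates_for_rank(grid, rank, n_ranks):
    """Return the combinations assigned to one worker (round-robin split)."""
    return expand_grid(grid)[rank::n_ranks]
```

For example, a grid of `{"max_depth": [3, 5, None], "n_estimators": [10, 100]}` expands to six combinations. With four workers, worker 0 receives the first and fifth, and the four subsets together cover the grid exactly once.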

Prediction Mapping

Generated prediction maps at scale, producing interpretable outputs for geoscientists.
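
Producing a prediction map usually means flattening a stack of covariate rasters into a pixels-by-features table, predicting for the valid pixels, and reshaping the result back onto the grid. The sketch below shows that pattern with NumPy. The function `predict_map` and the stand-in `SumModel` are hypothetical, and the only assumption about the model is that it exposes a scikit-learn style `predict(X)` method.

```python
import numpy as np


def predict_map(model, stack, nodata=np.nan):
    """Predict one value per pixel from a (bands, height, width) array.

    Pixels with a NaN in any band are skipped and filled with `nodata`.
    """
    bands, height, width = stack.shape
    X = stack.reshape(bands, -1).T            # shape: (pixels, bands)
    valid = ~np.isnan(X).any(axis=1)          # pixels with complete data
    out = np.full(height * width, nodata, dtype=float)
    if valid.any():
        out[valid] = model.predict(X[valid])
    return out.reshape(height, width)


class SumModel:
    """Stand-in for a fitted estimator; for illustration only."""

    def predict(self, X):
        return X.sum(axis=1)
```

Because the function relies only on `predict`, any fitted estimator that follows the same convention could be passed in place of `SumModel`.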

Python 3 Migration

Refactored entire codebase from Python 2 to Python 3 for long-term maintainability.

scikit-learn Integration

Deep integration with scikit-learn for access to a wide range of ML algorithms.

Technology Stack

Python
NCI/HPC
MPI
scikit-learn
GDAL
NumPy
Rasterio
PBS

Project Impact

Research Acceleration

  • Reduced analysis time from weeks to hours
  • Enabled previously infeasible large-scale analyses
  • Used by geoscience teams nationally
  • Supports mineral exploration and mapping

Open Source Contribution

  • Contributed improvements back to the open source project
  • Python 3 migration benefits the entire community
  • Documentation for reproducible research
  • Foundation for ongoing geoscience ML work