An important update from the CORNERSTONE team: Dr. Stefano Scardigli has completed a critical analysis to ensure the reliability of our solar flare prediction dataset.
The Challenge of “Clean” Data
Machine learning algorithms are only as good as the data they learn from. While we’ve assembled years of solar observations, not all measurements are created equal. Dr. Scardigli’s recent work focused on identifying and addressing quality issues that could compromise prediction accuracy.
What Was Discovered?
Through meticulous analysis of the magnetic field features (SHARP parameters, from the Space-Weather HMI Active Region Patches data series) extracted from NASA’s Solar Dynamics Observatory data, Dr. Scardigli identified several critical issues:
Redundant Information: Several of the 17 magnetic field parameters were strongly correlated with one another, meaning they carried largely overlapping information. Such redundancy inflates the feature space without adding signal and can destabilize training, reducing prediction accuracy.
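The de-duplication described above can be sketched as a pairwise-correlation filter: compute absolute Pearson correlations between all feature pairs and drop one member of each pair above a cut-off. The 0.95 threshold and the pandas-based implementation below are illustrative assumptions, not the project’s actual procedure:

```python
import numpy as np
import pandas as pd

def drop_correlated(df: pd.DataFrame, threshold: float = 0.95) -> pd.DataFrame:
    """Drop one feature from each pair whose absolute Pearson
    correlation exceeds `threshold`, keeping the first-listed column."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair is examined once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)

# Toy example: column 'b' is (almost) a rescaled copy of 'a'.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
df = pd.DataFrame({
    "a": a,
    "b": 2 * a + 0.01 * rng.normal(size=200),  # redundant with 'a'
    "c": rng.normal(size=200),                 # independent feature
})
print(list(drop_correlated(df).columns))  # 'b' is removed
```

Keeping the first-listed column of each correlated pair is an arbitrary tie-break; a real pipeline might instead keep the physically more interpretable parameter of the two.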
Problematic Parameters: Certain measurements, particularly those related to the number of reliable pixels in an observation (the CMASK parameter), varied in ways that could introduce spurious correlations. These required dedicated normalization procedures.
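The exact normalization procedure is not spelled out here, but one common approach for SHARP-like data is to divide size-dependent (“extensive”) parameters by CMASK, the count of reliable pixels, turning region totals into per-pixel quantities so that active-region size does not drive spurious correlations. A minimal sketch under that assumption (USFLUX, TOTUSJZ, and TOTPOT are real SHARP parameters, but their selection here is illustrative):

```python
import pandas as pd

# Hypothetical subset of extensive SHARP parameters that scale with
# active-region size; per-pixel values are obtained by dividing by CMASK.
EXTENSIVE = ["USFLUX", "TOTUSJZ", "TOTPOT"]

def normalize_by_cmask(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    valid = out["CMASK"] > 0  # guard against observations with no reliable pixels
    for col in EXTENSIVE:
        out.loc[valid, col] = out.loc[valid, col] / out.loc[valid, "CMASK"]
    out = out[valid]          # discard rows where normalization is undefined
    return out.drop(columns=["CMASK"])

df = pd.DataFrame({
    "USFLUX": [4.0e22, 1.0e22],   # total unsigned flux (Mx)
    "TOTUSJZ": [6.0e13, 2.0e13],  # total unsigned vertical current (A)
    "TOTPOT": [8.0e23, 4.0e23],   # total photospheric magnetic energy proxy
    "CMASK": [2.0e4, 1.0e4],      # number of reliable pixels
})
print(normalize_by_cmask(df))
```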
Data Quality Variations: Analysis revealed differences between “near real-time” (NRT) data—available quickly for operational forecasting—and “definitive” (DEF) data—the higher-quality version processed weeks later.
The Solution: Cleaned Datasets
Working closely with the CORNERSTONE team and project coordination, Dr. Scardigli developed refined, “clean” versions of both the near real-time and definitive datasets. This work involved:
- Removing redundant features that don’t add predictive value
- Implementing normalization strategies to handle problematic parameters
- Establishing quality filters to exclude unreliable observations
- Creating separate, optimized datasets for both operational (NRT) and research (DEF) applications
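Taken together, the steps above resemble a small cleaning pipeline. The sketch below illustrates only the quality-filter stage; the thresholds (`min_cmask`, `max_nan_frac`) are placeholder assumptions, not CORNERSTONE’s actual cut-offs:

```python
import pandas as pd

def quality_filter(df: pd.DataFrame,
                   min_cmask: float = 1.0e3,
                   max_nan_frac: float = 0.1) -> pd.DataFrame:
    """Illustrative quality filter: keep observations that have enough
    reliable pixels and few missing parameter values."""
    enough_pixels = df["CMASK"] >= min_cmask
    feature_cols = df.columns.drop("CMASK")
    # Fraction of missing feature values per observation (row).
    few_gaps = df[feature_cols].isna().mean(axis=1) <= max_nan_frac
    return df[enough_pixels & few_gaps].reset_index(drop=True)

df = pd.DataFrame({
    "USFLUX": [2.0e18, 1.5e18, None],    # per-pixel unsigned flux
    "MEANGBT": [50.0, None, 60.0],       # mean gradient of total field
    "CMASK": [5.0e3, 2.0e2, 8.0e3],      # reliable-pixel counts
})
# Row 1 fails the pixel-count cut, row 2 has too many gaps; row 0 survives.
print(quality_filter(df))
```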
Why This Matters
For machine learning applications in space weather forecasting, data quality is paramount. Predictions need to be both accurate and timely—forecasters use near real-time data even though it has lower quality than definitive data, because operational forecasting requires immediate information.
Dr. Scardigli’s quality control work ensures that:
- Machine learning algorithms train on reliable, non-redundant information
- Prediction models can work with both real-time and high-quality retrospective data
- The dataset provides physically meaningful features without spurious correlations
The Path Forward
These cleaned datasets now serve as the foundation for developing and testing machine learning algorithms within CORNERSTONE. By investing time in rigorous data preparation and quality control, we’re building prediction systems that forecasters can trust for protecting critical infrastructure from solar storms.
This meticulous attention to data quality exemplifies the importance of database handling expertise in modern scientific research—where the ability to prepare and validate data is just as crucial as the algorithms that analyze it.
CORNERSTONE is funded under MUR – PRIN 2022 PNRR (P2022RKXH9 – CUP: E53D23021410001)
