An important update from the CORNERSTONE team: Dr. Stefano Scardigli has completed a critical analysis to ensure the reliability of our solar flare prediction dataset.
The Challenge of “Clean” Data
Machine learning algorithms are only as good as the data they learn from. While we’ve assembled years of solar observations, not all measurements are created equal. Dr. Scardigli’s recent work focused on identifying and addressing quality issues that could compromise prediction accuracy.
What Was Discovered?
Through meticulous analysis of the magnetic field features (SHARP parameters, from the Space-Weather HMI Active Region Patches data series) extracted from NASA’s Solar Dynamics Observatory data, Dr. Scardigli identified several critical issues:
Redundant Information: Several of the 17 magnetic field parameters were strongly correlated with one another, meaning they carried largely overlapping information. Such redundancy inflates the feature space without adding signal and can destabilize training, reducing prediction accuracy.
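The de-duplication described above can be sketched as a pairwise-correlation filter: compute absolute Pearson correlations between all feature pairs and drop one member of each pair above a cut-off. The 0.95 threshold and the pandas-based implementation below are illustrative assumptions, not the project’s actual procedure:

```python
import numpy as np
import pandas as pd

def drop_correlated(df: pd.DataFrame, threshold: float = 0.95) -> pd.DataFrame:
    """Drop one feature from each pair whose absolute Pearson
    correlation exceeds `threshold`, keeping the first-listed column."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair is examined once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)

# Toy example: column 'b' is (almost) a rescaled copy of 'a'.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
df = pd.DataFrame({
    "a": a,
    "b": 2 * a + 0.01 * rng.normal(size=200),  # redundant with 'a'
    "c": rng.normal(size=200),                 # independent feature
})
print(list(drop_correlated(df).columns))  # 'b' is removed
```

Keeping the first-listed column of each correlated pair is an arbitrary tie-break; a real pipeline might instead keep the physically more interpretable parameter of the two.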
Problematic Parameters: Certain measurements, particularly those related to the number of reliable pixels in an observation (the CMASK parameter), varied in ways that could introduce spurious correlations. These required dedicated normalization procedures.
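The exact normalization procedure is not spelled out here, but one common approach for SHARP-like data is to divide size-dependent (“extensive”) parameters by CMASK, the count of reliable pixels, turning region totals into per-pixel quantities so that active-region size does not drive spurious correlations. A minimal sketch under that assumption (USFLUX, TOTUSJZ, and TOTPOT are real SHARP parameters, but their selection here is illustrative):

```python
import pandas as pd

# Hypothetical subset of extensive SHARP parameters that scale with
# active-region size; per-pixel values are obtained by dividing by CMASK.
EXTENSIVE = ["USFLUX", "TOTUSJZ", "TOTPOT"]

def normalize_by_cmask(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    valid = out["CMASK"] > 0  # guard against observations with no reliable pixels
    for col in EXTENSIVE:
        out.loc[valid, col] = out.loc[valid, col] / out.loc[valid, "CMASK"]
    out = out[valid]          # discard rows where normalization is undefined
    return out.drop(columns=["CMASK"])

df = pd.DataFrame({
    "USFLUX": [4.0e22, 1.0e22],   # total unsigned flux (Mx)
    "TOTUSJZ": [6.0e13, 2.0e13],  # total unsigned vertical current (A)
    "TOTPOT": [8.0e23, 4.0e23],   # total photospheric magnetic energy proxy
    "CMASK": [2.0e4, 1.0e4],      # number of reliable pixels
})
print(normalize_by_cmask(df))
```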
Data Quality Variations: Analysis revealed differences between “near real-time” (NRT) data—available quickly for operational forecasting—and “definitive” (DEF) data—the higher-quality version processed weeks later.
The Solution: Cleaned Datasets
Working closely with the CORNERSTONE team and project coordination, Dr. Scardigli developed refined, “clean” versions of both the near real-time and definitive datasets. This work involved:
- Removing redundant features that don’t add predictive value
- Implementing normalization strategies to handle problematic parameters
- Establishing quality filters to exclude unreliable observations
- Creating separate, optimized datasets for both operational (NRT) and research (DEF) applications
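Taken together, the steps above resemble a small cleaning pipeline. The sketch below illustrates only the quality-filter stage; the thresholds (`min_cmask`, `max_nan_frac`) are placeholder assumptions, not CORNERSTONE’s actual cut-offs:

```python
import pandas as pd

def quality_filter(df: pd.DataFrame,
                   min_cmask: float = 1.0e3,
                   max_nan_frac: float = 0.1) -> pd.DataFrame:
    """Illustrative quality filter: keep observations that have enough
    reliable pixels and few missing parameter values."""
    enough_pixels = df["CMASK"] >= min_cmask
    feature_cols = df.columns.drop("CMASK")
    # Fraction of missing feature values per observation (row).
    few_gaps = df[feature_cols].isna().mean(axis=1) <= max_nan_frac
    return df[enough_pixels & few_gaps].reset_index(drop=True)

df = pd.DataFrame({
    "USFLUX": [2.0e18, 1.5e18, None],    # per-pixel unsigned flux
    "MEANGBT": [50.0, None, 60.0],       # mean gradient of total field
    "CMASK": [5.0e3, 2.0e2, 8.0e3],      # reliable-pixel counts
})
# Row 1 fails the pixel-count cut, row 2 has too many gaps; row 0 survives.
print(quality_filter(df))
```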
Why This Matters
For machine learning applications in space weather forecasting, data quality is paramount. Predictions need to be both accurate and timely—forecasters use near real-time data even though it has lower quality than definitive data, because operational forecasting requires immediate information.
Dr. Scardigli’s quality control work ensures that:
- Machine learning algorithms train on reliable, non-redundant information
- Prediction models can work with both real-time and high-quality retrospective data
- The dataset provides physically meaningful features without spurious correlations
The Path Forward
These cleaned datasets now serve as the foundation for developing and testing machine learning algorithms within CORNERSTONE. By investing time in rigorous data preparation and quality control, we’re building prediction systems that forecasters can trust for protecting critical infrastructure from solar storms.
This meticulous attention to data quality exemplifies the importance of database handling expertise in modern scientific research—where the ability to prepare and validate data is just as crucial as the algorithms that analyze it.
CORNERSTONE is funded under MUR – PRIN 2022 PNRR (P2022RKXH9 – CUP: E53D23021410001)
