Monday, 18 August 2025

On Synthetic Dataset Creation

This story begins in Philadelphia in 2022. We were attending the IEEE Photovoltaic Specialists Conference (PVSC), where we gave a talk about one of our papers. At this event we got to know a few people, including a researcher at the University of Cyprus. We kept in touch after the event and eventually had the idea of co-authoring a paper, which we planned to present at the 2023 PVSC. The idea was to take sample measurement data from solar power plants and train a model to generate realistic synthetic data from it. We got a good chunk of the work done and submitted an abstract, which was accepted, but due to scheduling conflicts we were ultimately unable to attend. It's been some time since then, and we now feel like blogging about our approach and the preliminary results we got.


Photo of Philadelphia city center, close to the convention center where the 2022 PVSC was held. This is where the authors of the work first met.

The core idea was to generate synthetic data with the same characteristics and behavior as real measurements. Such data can be used for a variety of applications, like digital twinning, assessing the potential of power plant configurations, and more. We took publicly available measurement data from sites in Germany and the United States as our basis. The approach was to derive characteristics using automated time series feature selection [1] and stochastic Markov chains; here we can quote our abstract directly:

First, the process calculated a large set (several thousand) of features that are known to be relevant in time-dependent data; in a second step, the features were weighted according to their correlation with the data, which resulted in a new list of several hundred features that are correlated with the data. These two steps represent the automated feature selection. Instead of synthesizing irradiance directly, we computed the clear-sky indices (the ratio of the actual global horizontal irradiance to the theoretical global horizontal irradiance at ground level under cloudless conditions) using the clear-sky model by Ineichen [2] as implemented in Python pvlib [3]; in the third step, a Markov chain process was used to create a set of synthetic features for a given observation based on a Markov probability matrix; finally, a new series of clear-sky indices was generated using a random forest regression model. These artificial clear-sky indices, which have the same value and frequency range as the real measurements, can be converted into irradiance. The random forest hyperparameters and the number of features used in the Markov probability matrices were fine-tuned to minimize the difference between the features of the synthetic and measured data.
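To make the clear-sky index from the abstract more concrete, here is a minimal sketch of the conversion in both directions. It is not the code used for the abstract. It assumes the clear-sky GHI is already available (in our case it came from the Ineichen model in pvlib), and the function names, the `min_clear` cutoff of 50 W/m² and the `max_index` cap of 2.0 are illustrative assumptions.

```python
def clearsky_index(ghi, ghi_clear, min_clear=50.0, max_index=2.0):
    """Return measured GHI divided by clear-sky GHI for each sample.

    min_clear and max_index are assumed values for this sketch: samples with
    very low clear-sky GHI are set to 0.0, and the ratio is clipped to
    the range [0, max_index].
    """
    result = []
    for measured, clear in zip(ghi, ghi_clear):
        if clear < min_clear:
            # Near sunrise and sunset the ratio becomes numerically unstable.
            result.append(0.0)
        else:
            result.append(min(max(measured / clear, 0.0), max_index))
    return result


def index_to_ghi(index, ghi_clear):
    """Convert clear-sky indices (measured or synthetic) back to GHI in W/m^2."""
    return [k * clear for k, clear in zip(index, ghi_clear)]


# Made-up example values in W/m^2, not measurements from the study.
ghi_clear = [20.0, 400.0, 800.0, 400.0]
ghi = [15.0, 380.0, 400.0, 440.0]

kc = clearsky_index(ghi, ghi_clear)
ghi_back = index_to_ghi(kc, ghi_clear)
```

Working on the index rather than on irradiance removes most of the deterministic daily and seasonal shape, so the model only has to reproduce the cloud-driven variability. Values above 1 can occur in real data, which is why the sketch clips at a value larger than 1.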

Overview of the pipeline of the proposed methodology to obtain complete GHI synthetic timeseries with correlated independent features from the original data. Figure was taken from the abstract.
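The Markov chain step can be illustrated with a small, simplified sketch. In the abstract the Markov probability matrix is used to create synthetic features; the example below instead bins a short clear-sky index series into discrete states, purely to show how a transition matrix is estimated and sampled. The number of states, the bin range, the helper names and the sample values are all assumptions for illustration.

```python
import random


def to_states(values, n_states, max_value):
    """Map each value to a bin index in 0..n_states-1 (equal-width bins)."""
    width = max_value / n_states
    return [min(int(v / width), n_states - 1) for v in values]


def transition_matrix(states, n_states):
    """Estimate row-normalized transition probabilities from a state sequence."""
    counts = [[0] * n_states for _ in range(n_states)]
    for current, nxt in zip(states, states[1:]):
        counts[current][nxt] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        if total == 0:
            # State never left in the data: fall back to a uniform row.
            matrix.append([1.0 / n_states] * n_states)
        else:
            matrix.append([c / total for c in row])
    return matrix


def sample_chain(matrix, start, length, rng):
    """Draw a synthetic state sequence of the requested length."""
    sequence = [start]
    while len(sequence) < length:
        row = matrix[sequence[-1]]
        sequence.append(rng.choices(range(len(row)), weights=row)[0])
    return sequence


# Made-up clear-sky index values, not data from the study.
observed = [0.95, 0.97, 0.65, 0.35, 0.40, 0.92, 0.96, 0.31, 0.33, 0.85]
states = to_states(observed, n_states=4, max_value=1.2)
matrix = transition_matrix(states, n_states=4)

# 96 steps correspond to one day at 15-minute sampling.
synthetic = sample_chain(matrix, start=states[0], length=96, rng=random.Random(0))
```

The sampled sequence only contains transitions that were seen in the input, which is the property that keeps the synthetic series statistically close to the measurements.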

Our preliminary results looked promising, which is why we went ahead with submitting the abstract. The following figures show examples of synthetically generated data for clear-sky, mixed and overcast situations, which look quite realistic. A direct comparison, in which the training data was recreated with the generation method, shows reasonable agreement, with an RMSE of 7.5% relative to the mean measurement.
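For readers who want to reproduce this kind of comparison, one common reading of "RMSE relative to the mean measurement" is the root mean square error divided by the mean of the measured series. The sketch below follows that reading; the function name and the sample values are assumptions and not taken from our evaluation.

```python
import math


def relative_rmse(measured, synthetic):
    """Return the RMSE between two series divided by the mean of the measured one."""
    if not measured or len(measured) != len(synthetic):
        raise ValueError("series must be non-empty and of equal length")
    mse = sum((m - s) ** 2 for m, s in zip(measured, synthetic)) / len(measured)
    mean_measured = sum(measured) / len(measured)
    return math.sqrt(mse) / mean_measured


# Made-up GHI values in W/m^2, not results from the study.
measured = [100.0, 200.0, 300.0, 400.0]
synthetic = [110.0, 190.0, 310.0, 390.0]

nrmse = relative_rmse(measured, synthetic)
print(f"relative RMSE: {nrmse:.1%}")
```

Normalizing by the mean makes the number comparable between sites and seasons with different irradiance levels.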

 

Comparison of synthetically predicted and original timeseries for a selected day: (a) clear-sky index; and (b) GHI.


Sample of the synthetically generated daily GHI profiles at 15-minute sampling for (a) mostly clear-sky day, (b) mixed-skies day, and (c) overcast day.

The motivation for our work was manifold. There was scientific motivation: good synthetic data was something we could have used in many previous projects, and although other ideas about synthetic data creation had been published [4,5,6], none of them had used automated feature selection. There was the motivation to get published again and to have a reason to visit the next conference as speakers. And lastly, since the authors had gotten to know each other quite recently, there was the motivation to simply do something with this newfound connection.
Even though we ended up not presenting at 2023 PVSC, we still think the idea was good and maybe there will be another chance to do something about it in the future.

References

[1] Christ, M., Braun, N., Neuffer, J., and Kempa-Liehr, A. W. (2018). Time Series FeatuRe Extraction on basis of Scalable Hypothesis tests (tsfresh – A Python package). Neurocomputing 307, 72-77, doi: 10.1016/j.neucom.2018.03.067.

[2] Ineichen, P. "A broadband simplified version of the Solis clear sky model." Solar Energy 82.8 (2008): 758-762.

[3] Holmgren, W., Hansen, C., and Mikofski, M., “pvlib python: a python package for modeling solar energy systems.” Journal of Open Source Software, 3(29), 884, (2018). https://doi.org/10.21105/joss.00884

[4] Polo, J., Zarzalejo, L. F., Marchante, R., and Navarro, A. A., “A simple approach to the synthetic generation of solar irradiance time series with high temporal resolution,” Solar Energy, vol. 85, no. 5, pp. 1164–1170, 2011, doi: 10.1016/j.solener.2011.03.011.

[5] Rayati, M., De Falco, P., Proto, D., Bozorg, M., and Carpita, M. 2021. "Generation Data of Synthetic High Frequency Solar Irradiance for Data-Driven Decision-Making in Electrical Distribution Grids" Energies 14, no. 16: 4734.

[6] I. L. Carreño et al., "SoDa: An Irradiance-Based Synthetic Solar Data Generation Tool," 2020 IEEE International Conference on Communications, Control, and Computing Technologies for Smart Grids (SmartGridComm), Tempe, AZ, USA, 2020, pp. 1-6, doi: 10.1109/SmartGridComm47815.2020.9302941.

