Data standards for energy modeling for policy support

Introduction

This topic reviews the intertwined questions of data architectures, data specifications, and data sources in the context of energy modeling for policy support (EMoPS).

The issues surrounding data standards range from high‑level to specific. And please be warned in advance: most of the accompanying matters are poorly resolved, be they overarching objectives or implementation details. The view presented here is that analysts should aim for good practice and, whenever applicable, contribute to wider agendas, but that very often compromises are unavoidable and expedient solutions are necessary.

One important emerging theme is the development of data systems :arrow_heading_down: to create coherent datasets, facilitate information sharing between similarly conceived system models, and assist reproducibility.

Another important activity are communal efforts in improve data semantics :arrow_heading_down: and related high‑level agendas.

This posting attempts to indicate the type of data sources available and where data catalogs :arrow_heading_down: might be found. This posting does not seek to present or maintain a comprehensive list of information sources.

Genuine open data with suitable data‑capable open licenses should be favored whenever possible. More on adding open licenses and on license choice.

Data architectures

The term data architecture is used here to cover the structure of data without reference to other aspects. Data architectures range in purpose and sophistication. The following breakdown is applied:

Construct Example Comment URL or citation
spreadsheets ubiquitous but difficult to integrate
frictionless data packages semi‑formal standard for simple packaging Karev and Winfree (2020)
model‑specific nested datasets Flomb/Calliope‑Italy Calliope project data instance Lombardi (2020)
model‑specific databases
data systems PowerSystems.jl a framework, other projects curate data
community backends Open Energy Platform
linked open data (LOD) LOD‑GEOSS based on DBpedia Databus

 
The first two entries, ad‑hoc spreadsheets and frictionless data packages, are mostly used to collect, store, and exchange data. Model‑specific nested datasets and databases are used to support different modeling frameworks and are specific to each framework. The so‑called data systems attempt to provide more general solutions and can potentially interact with any number of semantically‑similar frameworks. Community backend solutions offer dedicated infrastructure that is specific to this domain. Linked open data projects are developing very broad functionality that operates across the web and accounts for the strengths and drawbacks of web‑sourced information.

This posting does not considered brokered data platforms like the United Kingdom Icebreaker One (IB1) Open Energy project or the Open Subsurface Data Universe (OSDU) being developed by oil majors and allied servicing companies. Brokered data platforms like these are useful when dealing with both data protected under personal privacy and data that is legally‑encumbered in some way.

Spreadsheets and frictionless data packages

Spreadsheets are widely used for general data management. More recently, online spreadsheets and file hosting services have become popular. Spreadsheets tend not to be used by energy system modelers as they are more difficult than other data architectures to place under version control and to process using scripts written in languages like Python, Julia, and R.

Frictionless data packages (FDP) are a concept promoted by the London‑based Open Knowledge Foundation. The FDP concept combines CSV files for collected data and JSON files for the accompanying metadata. CSV stands for comma‑separated values and stores numerical information as text strings (technically character encoded). As such, CSV files are human readable, space‑inefficient, and widely portable. CSV files and FDP packages are more amenable to version control and scripted processing than spreadsheets. FDP packages are preferred over CSV files however due to their better support for metadata, including type information.

Some data portals also serve SQLite files alongside spreadsheets and FDPs. One example is the OPSD portal :arrow_heading_down:.

Framework-specific solutions

The term “framework” is used by system modelers to distinguish individual models from the application software that supports them. Under this usage, the application software is a “model framework” and a particular model becomes a “framework instance”. Model frameworks include Calliope, oemof, and PyPSA — to cite three that are written in the Python language and share relatively similar underlying designs.

The data requirements for most models are naturally tree structured. Some modeling frameworks therefore deploy data files located within a predefined hierarchy of subdirectories and files. These sets of files are often hosted on git repositories, including public git portals like GitHub. The datasets from specific studies can then be zipped and archived on sites like zenodo. One such example from Calliope (screenshot below) is provided by Lombardi (2020). Other projects instead interface to a fully‑fledged relational database or similar.

Diagram: Zenodo interface to the Flomb/Calliope‑Italy study dataset (Lombardi 2020).

Data systems

The admittedly novel term “data system” is used here to describe prepackaged data and supporting software which is coherent, internally consistent, and complete within its stated scope. Data systems embrace the notion of a digital twin in that regard. In contrast to CSV solutions, data systems often prioritize the efficient storage and rapid recovery of information. They are designed to interact with suitable model frameworks via a readily customizable translation layer or overloaded APIs called by the framework itself. Data systems can also normally perform integrity checks, receive and store results, and offer standardized reporting. Some systems can additionally manage key parts of the data processing pipeline and thereby promote repeatability. These systems, taken overall, should improve both reproducibility and productivity. Examples include PowerSystems.jl, PowerGenome, and Spine Toolbox.

Community backend services

Community portals are portals that cater specifically to the needs of a particular research domain and normally offer API access. In the case of energy system modeling, the Open Energy Platform (OEP) provides such features. In addition, the OEP can manage scenarios for use across any number of suitable model frameworks. Cross‑framework comparisons based on identical scenario specifications are likely to play an important role in terms of EMoPS validation going forward.

Linked open data

Linked open data (LOD) is an internet architecture whereby a smart “bus” interfaces between the user and various web API‑exposed data sources. Work in this area is currently at research stage only. But the techniques being explored have the prospect of revolutionizing data handling and can support automatic data updating. One project is LOD‑GEOSS based on fundamental work by the DBpedia Databus project.

Data specifications

The term data specification refers collectively to the various standards, norms, protocols, guidelines, and conventions that a particular research community has adopted in relation to collecting, exchanging, processing, and verifying numerical and non‑numerical information. Data specifications fall into two camps. High‑level specifications cover concepts like semantics, metadata, collection protocols, and relationships. While low‑level specifications include prosaic matters like data representations, encodings, and timestamping and specialist formats for geodata and such like.

Semantics

Semantics refers to the meaning of data. Sometimes that meaning remains implicit and sometimes it is subject to formal treatments and ontologies. The Open Energy Ontology (OEO) is currently under development by energy system modelers to service the needs of EMoPS processes (Booshehri et al 2021).

Metadata, collection standards

Metadata provides information on the circumstances of collection, downstream usage, and legal context. Data collection standards inform metadata and the two concepts are tightly coupled. Metadata is sometimes described as data about data. Metadata. Again, like semantics, these concepts are not well resolved. The EERAdata metadata project is currently advancing these issues (Wierling et al under review).

Relationships between objects

Extending these concepts, the relationships between objects in a domain collectively create a labeled directed graph with objects mapped to nodes and relationships to edges. Such objects need not be tangible. This information is often captured as web‑based RDF triples. Hence, an informing ontology together with dataset‑specific metadata and semantic triples merge to form a larger canvas onto which automated reasoning can be applied. Such reasoning may include rule‑based validation, simple inference, and semantic search methods. These concepts combined also underpin the framework‑specific solutions, data systems, and linked open data architectures (all described earlier), albeit at varying degrees of formalism.

Low-level protocols

Geodata and geographic information more generally require special consideration. Hinz and Bill (2018) provide background.

Open licensing

Public licenses provide users with a set of permissions and obligations without the need to form bilateral agreements with data providers or their brokers. Open licenses are a subset of public licenses that meet community expectations for free use. More on adding open licenses and on license choice. In the absence of other considerations, collected or processed data and object attributes should be licensed Creative Commons CC‑BY‑4.0 and metadata and semantic triples should be licensed CC0‑1.0.

Sources of data

Sources of numerical data for EMoPS have historically focused on the electricity sector. This is largely because energy system models tend to start with the electricity sector and later bridge to other sectors. But also because a given electricity subsystem, in contrast to other subsystems, requires highly detailed engineering analysis to resolve its design.

As indicated, this posting does not attempt to provide a complete and current listing. Rather it gives a flavor of what exists and indicates where one might seek further information.

As the table below indicates, there are quite a range of primary and derived sources of data at hand. Clearly, sources that have been community‑curated are normally preferable to those that have not.

Construct Example Comment URL or citation
primary sources ENTSO‑E Transparency Platform statutory reporting Hirth et al (2018)
subject‑specific data repositories JRC ENSPRESO database Ruiz et al (2019)
community‑curated data portals OPSD, Open Energy Outlook
general data portals WRI Power Explorer
data system stock libraries PowerGenome
proprietary databases S&P Global Platts legally‑encumbered
citizen science projects mapping PV installations in UK Stowell et al (2020)
AI‑based projects predictive mapping of global power system Arderne et al (2020)

Primary sources

Primary sources are often mandated under statutory reporting, with the European ENTSO‑E Transparency Platform being one such portal. One problem with information made public under statutory reporting, in Europe at least, is that the underpinning legislation is often silent on licensing (Hirth 2020). The Energy Information Administration (EIA) provides primary information in the United States.

Subject-specific data repositories

Subject‑specific data repositories cover specific areas like wind energy. One such example is the European Commission JRC ENSPRESO dataset (Ruiz et al 2019).

Community-curated data portals

Community‑curated data portals are exactly that: data portals stocked and maintained by energy system modelers. Examples include the OPSD portal in Europe (Wiese et al 2019) and the Open Energy Outlook database (DeCarolis 2020) and PUDL projects in the United States.

Data portals more generally

There are a number of data portals which mostly assemble and publish primary data from other sources. Examples include energydata.info from the World Bank Group, OpenEI from the US Department of Energy, and Power Explorer from the World Resources Institute (WRI).

Data system stock libraries

Data systems often bundle stock libraries to service the more widespread needs of their userbases. These system will also output selected data in common text‑based interchange formats for the purposes of archiving and exchange. As mentioned earlier, any upstream data dependencies will need to be established by each user. Examples include PowerGenome and the stock library for PowerSystems.jl known as PowerSystemCaseBuilder.jl.

Proprietary databases

Several companies sell access to commercial databases, controlled by contract. This information is normally both legally‑encumbered and expensive and therefore a last choice for EMoPS processes. These databases also tend to be less well resolved for the global south.

Citizen science projects

There are a number of citizen science projects, often based on OpenStreetMap, that collect fine‑grained information. Examples include information on electricity transmission assets within Europe and solar‑PV installations within the United Kingdom (Stowell et al 2020). As of 2021, the latter project utilizes 300 volunteers and has mapped 250k installations.

AI-based projects

Information projects combining aerial imagery and artificial intelligence (AI) techniques represent an emerging field. Examples include the machine recognition of power lines (Arderne et al 2020) and of renewables installations.

Data catalogs

The following catalogs of relevance to EMoPS processes may be of help — otherwise the various data portals can be inspected in turn:

Validation and provenance

Validation, provenance tracking, and other forms of quality assurance remain vitally important. These processes have only mostly been undertaken informally to date. Looking forward, such features can be naturally built into both data systems and linked open data systems. Moreover as indicated, these two data architectures can potentially support recall‑and‑reissue functionality, thereby relieving researchers of tedious and typically error‑prone hand maintenance.

Closure

Data management for EMoPS is now a sophisticated area in its own right. That said, many high‑level matters remain works‑in‑progress — including data semantics and metadata standards. Those handling data for EMoPS processes will need to navigate their way around the landscape indicated in this posting and make suitable and ofttimes difficult choices and compromises.

One theme here is the degree of formalism required. Often less formalism seems sensible but, as the domain matures, more formalism is required. Another related theme is the natural fit between data models and ontologies, on the one hand, and model framework design, including object‑oriented analysis, on the other.

Much of the effort in obtaining data for EMoPS is simply replicating what the European Commission calls “privately held information [of] public interest”. It remains an open question as to whether strengthening statutory reporting provisions might offer a better approach, with suitable anonymization techniques applied where needed.

As indicated, this posting does not attempt to catalog data sources. The EMoPS field is too specific, too diverse, and too volatile to make the listing of sources here remotely useful or feasible.

Useful links

References

Arderne, Christopher, C Zorn, C Nicolas, and EE Koks (15 January 2020). “Predictive mapping of the global power system using open data”. Scientific Data. 7 (1): 19. ISSN 2052‑4463. doi:10.1038/s41597‑019‑0347‑4. Creative Commons CC‑BY‑4.0 license.

Booshehri, Meisam, Lukas Emele, Simon Flügel, Hannah Förster, Johannes Frey, Ulrich Frey, Martin Glauer, Janna Hastings, Christian Hofmann, Carsten Hoyer‑Klick, Ludwig Hülk, Anna Kleinau, Kevin Knosala, Leander Kotzur, Patrick Kuckertz, Till Mossakowski, Christoph Muschner, Fabian Neuhaus, Michaja Pehl, Martin Robinius, Vera Sehn, and Mirjam Stappel (1 September 2021). “Introducing the Open Energy Ontology: enhancing data interpretation and interfacing in energy systems analysis”. Energy and AI. 5: 100074. ISSN 2666-5468. doi:10.1016/j.egyai.2021.100074. Creative Commons CC‑BY‑4.0 license.

DeCarolis, Joe (24 December 2020). An Open Energy Outlook for the United States powered by TEMOA. Raleigh, North Carolina, USA: NC State University. YouTube video 15:16. Creative Commons CC‑BY‑4.0 license.

Hinz, Matthias and Ralf Bill (2018). Mapping the landscape of open geodata. 21th AGILE Conference on Geographic Information Science.

Hirth, Lion, Jonathan Mühlenpfordt, and Marisa Bulkeley (1 September 2018). “The ENTSO-E Transparency Platform: a review of Europe’s most ambitious electricity data platform”. Applied Energy. 225: 1054–1067. ISSN 0306-2619. doi:10.1016/j.apenergy.2018.04.048. Creative Commons CC-BY-4.0 license.

Hirth, Lion (1 January 2020). “Open data for electricity modeling: legal aspects”. Energy Strategy Reviews. 27: 100433. ISSN 2211-467X. doi:10.1016/j.esr.2019.100433. Creative Commons CC‑BY‑4.0 license.

Karev, Evgeny and Lilly Winfree (8 October 2020). Announcing the new frictionless framework. Open Knowledge Foundation blog.

Lombardi, Francesco (22 June 2020). Flomb/Calliope-Italy: v0.2.2 — Direct and indirect electrification options. Europe: Zenodo. doi:10.5281/zenodo.3903089. Zipped dataset. DOI resolves to the latest version. Creative Commons CC‑BY‑4.0 license.

Reder, Klara, Carsten Pape, Mirjam Stappel, Hannah Förster, Lukas Emele, Christian Winger, Ludwig Hülk, Christian Hofmann, Editha Kötter, Martin Glauer, and Till Mossakowski (2019). Scenario data on the Open Energy Platform (SzenarienDB on the OEP): a web-platform to improve transparency and reproducibility of energy system analyses — Poster. Kassel, Germany: Fraunhofer Institute for Energy Economics and Energy System Technology (IEE).

Reder, Klara, Mirjam Stappel, Christian Hofmann, Hannah Förster, Lukas Emele, Ludwig Hülk, and Martin Glauer (24 January 2020). “Identification of user requirements for an energy scenario database”. International Journal of Sustainable Energy Planning and Management. 25: 95–108. ISSN 2246-2929. doi:10.5278/ijsepm.3327.

Ruiz, Pablo, Wouter Nijs, Dalius Tarvydas, Alessandra Sgobbi, Andreas Zucker, Roberto Pilli, Klas Jonsson, Andrea Camia, Christian Thiel, Carsten Hoyer-Klick, Francesco Dalla Longa, Tom Kober, Jake Badger, Patrick Volker, Berien S Elbersen, Andre Brosowski, and Daniela Thrän (1 November 2019). “ENSPRESO — an open, EU‑28 wide, transparent and coherent database of wind, solar and biomass energy potentials”. Energy Strategy Reviews. 26: 100379. ISSN 2211-467X. doi:10.1016/j.esr.2019.100379. Creative Commons CC‑BY‑4.0 license.

Stowell, Dan, Jack Kelly, Damien Tanner, Jamie Taylor, Ethan Jones, James Geddes, and Ed Chalstrey (13 November 2020). “A harmonised, high-coverage, open dataset of solar photovoltaic installations in the UK”. Scientific Data. 7 (1): 394. ISSN 2052-4463. doi:10.1038/s41597-020-00739-0. Creative Commons CC‑BY‑4.0 license.

Wierling, August, Valeria Jana Schwanitz, Sebnem Altinci, Maria Balazińska, Michael J Barber, Mehmet Efe Biresselioglu, Christopher Burger-Scheidlin, Massimo Celino, Muhittin Hakan Demir, Richard Dennis, Nicolas Dintzner, Adel el Gammal, Carlos M Fernández-Peruchena, Winston Gilcrease, Pawel Gladysz, Carsten Hoyer-Klick, Kevin Josho, Mariusz Kruczek, David Lacroix, Malgorzata Markowska, Rafael Mayo-García, Robbie Morrison, Manfred Paier, Giuseppe Peronato, Mahendranath Ramakrishnan, Janeita Reid, Alessandro Sciullo, Berfu Solak, Demet Suna, Wolfgang Süß, Astrid Unger, Maria Luisa Fernandez_Vanoni, and Nikola Vasiljevic (under review). “Advancing FAIR metadata standards for low carbon energy research”. Energy Strategy Reviews. Code: ESR‑D‑21‑00036.

Wiese, Frauke, Ingmar Schlecht, Wolf-Dieter Bunke, Clemens Gerbaulet, Lion Hirth, Martin Jahn, Friedrich Kunz, Casimir Lorenz, Jonathan Mühlenpfordt, Juliane Reimann, and Wolf‑Peter Schill (15 February 2019). “Open Power System Data: frictionless data for electricity system modelling”. Applied Energy. 236: 401–409. ISSN 0306-2619. doi:10.1016/j.apenergy.2018.11.097. Postprint.