# Evaluation

We propose a homogeneous evaluation of the submitted solutions for predicting the load flow in a power grid, using the LIPS (Learning Industrial Physical Systems Benchmark suite) platform. The evaluation covers three categories that address several aspects of augmented physical simulations, namely:

  • ML-related: standard ML metrics (e.g. MAE, MAPE, etc.) and speed-up with respect to the reference solution computational time;
  • Physical compliance: respect of underlying physical laws (e.g. local and global energy conservation);
  • Application-based context: out-of-distribution (OOD) generalization to extrapolate over minimal variations of the problem depending on the application.

In the ideal case, one would expect a solution to perform equally well in all categories, but there is no guarantee of that. In particular, even if a solution performs well on standard machine-learning metrics, it is still necessary to assess whether it properly respects the underlying physics.

# Criteria

For each of the above-mentioned categories, specific criteria related to the power grid load flow prediction task are defined:

# ML criteria

The following metrics are used to evaluate the accuracy of the various output variables:

  • MAPE90: the MAPE criterion computed on the highest 10% quantile of the distribution (used for currents and powers);
  • MAE: mean absolute error (used for voltages).
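The two accuracy metrics above can be sketched in a few lines of plain Python. This is a minimal illustration, not the LIPS implementation: the function names `mae` and `mape90` are hypothetical, and the sketch assumes that the "highest 10% quantile" is selected on the absolute value of the ground truth, using a simple nearest-rank threshold.

```python
def mae(y_true, y_pred):
    """Mean absolute error between two equally sized sequences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mape90(y_true, y_pred, quantile=0.9):
    """MAPE restricted to the samples whose |true value| lies in the top 10%.

    Assumption (not stated in the text): the quantile is taken over the
    absolute ground-truth values, with a nearest-rank threshold.
    """
    magnitudes = sorted(abs(t) for t in y_true)
    threshold = magnitudes[int(quantile * (len(magnitudes) - 1))]
    # Keep only the largest-magnitude samples; skip zeros to avoid dividing by 0.
    pairs = [(t, p) for t, p in zip(y_true, y_pred)
             if abs(t) >= threshold and t != 0]
    if not pairs:
        raise ValueError("no non-zero sample above the quantile threshold")
    return sum(abs((t - p) / t) for t, p in pairs) / len(pairs)


# Illustrative data only: ten currents, each over-predicted by one unit.
currents_true = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
currents_pred = [t + 1.0 for t in currents_true]
print(mae(currents_true, currents_pred))
print(mape90(currents_true, currents_pred))
```

Restricting the MAPE to the largest values avoids the instability of relative errors near zero, which is presumably why it is used for currents and powers while the plain MAE is used for voltages.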

# Physics compliance criteria

Various metrics are provided to examine the physics compliance of proposed solutions:

  • Current Positivity: Proportion of negative currents
  • Voltage Positivity: Proportion of negative voltages
  • Losses Positivity: Proportion of negative energy losses
  • Disconnected lines: Proportion of non-null values for disconnected power lines
  • Energy Loss: energy losses range consistency
  • Global Conservation: Mean energy losses residual MAPE()
  • Local Conservation: Mean active power residual at nodes MAPE()
  • Joule Law: MAPE($\sum_{\ell=1}^L (\hat{p}^\ell_{ex} + \hat{p}^\ell_{or}) - R \times \frac{\sum_{\ell=1}^L (\hat{a}^\ell_{ex} + \hat{a}^\ell_{or})}{2} $)
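As an illustration of the proportion-based criteria in this list (positivity and disconnected lines), the following sketch counts violations over a flat list of predictions. The helper names, the boolean `line_connected` mask, and the tolerance `tol` are assumptions made for the example; they are not taken from the LIPS code base.

```python
def proportion_negative(values):
    """Share of entries that are strictly negative (0.0 means fully compliant)."""
    return sum(1 for v in values if v < 0) / len(values)


def proportion_nonnull_on_disconnected(predictions, line_connected, tol=1e-6):
    """Share of predictions on disconnected lines whose magnitude exceeds tol.

    `line_connected[i]` is True when line i is in service (hypothetical mask);
    `tol` is an assumed numerical tolerance for "null".
    """
    disconnected = [p for p, on in zip(predictions, line_connected) if not on]
    if not disconnected:
        return 0.0  # nothing to violate when every line is connected
    return sum(1 for p in disconnected if abs(p) > tol) / len(disconnected)


# Illustrative values only.
predicted_currents = [120.0, -0.5, 310.0, -0.1]
print(proportion_negative(predicted_currents))  # 0.5

flows = [10.0, 0.0, 3.0, 0.0]
connected = [True, False, False, True]
print(proportion_nonnull_on_disconnected(flows, connected))  # 0.5
```

For all of these proportions, a value of 0 corresponds to a prediction that never violates the corresponding physical constraint.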

# Practical computation of score

To properly evaluate the above-mentioned criteria categories, two test datasets are provided in this competition:

  • test_dataset: representing the same distribution as the training dataset (only one disconnected power line);
  • test_ood_dataset: representing a slightly different distribution from the training set (with two simultaneous disconnected power lines).

The ML-related (accuracy) and Physical compliance criteria are computed separately on each of these datasets. The speed-up is computed only on the test dataset, as the inference time is expected to remain the same across the different dataset configurations.

Hence, the global score is calculated as a linear combination of the above evaluation criteria categories on these datasets:

$$\text{Score} = \alpha_{test} \times \text{Score}_{test} + \alpha_{ood} \times \text{Score}_{ood} + \alpha_{speed-up} \times \text{Score}_{speed-up}$$

where $\alpha_{test}$, $\alpha_{ood}$, and $\alpha_{speed-up}$ are the coefficients that calibrate the relative importance of the metrics computed on the test dataset, on the out-of-distribution test dataset, and of the speed-up with respect to the physical solver (measured on the test dataset).
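The linear combination can be written as a small helper. This is only a sketch: the function name `global_score` is hypothetical, and the coefficient values in the usage example are placeholders chosen for illustration, not the competition's calibration.

```python
def global_score(score_test, score_ood, score_speedup,
                 alpha_test, alpha_ood, alpha_speedup):
    """Weighted sum of the three sub-scores, each expected to lie in [0, 1]."""
    return (alpha_test * score_test
            + alpha_ood * score_ood
            + alpha_speedup * score_speedup)


# Placeholder coefficients (assumed for illustration; they sum to 1 so that a
# solution with three perfect sub-scores reaches a global score of 1).
example = global_score(
    score_test=0.8, score_ood=0.6, score_speedup=1.0,
    alpha_test=0.5, alpha_ood=0.25, alpha_speedup=0.25,
)
print(example)
```

Choosing coefficients that sum to 1 keeps the global score in the same range as the sub-scores, which makes it easy to report as a percentage.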

We explain in the following how to calculate each of the three sub-scores.


The sub-score $\text{Score}_{test}$ is calculated as a linear combination of two categories, namely ML-related and Physics compliance:

$$\text{Score}_{test} = \alpha_{ML} \times \text{Score}_{ML} + \alpha_{Physics} \times \text{Score}_{Physics}$$
where $\alpha_{ML}$ and $\alpha_{Physics}$ are the coefficients that calibrate the relative importance of the ML-related and Physics compliance categories, respectively, for the test dataset.


For each quantity of interest, the ML-related sub-score is calculated based on two thresholds that are calibrated to indicate whether the metric evaluated on the given quantity gives an unacceptable, acceptable, or great result. These correspond to a score of 0 points, 1 point, and 2 points, respectively. Within the sub-category, let:

  • $Nr$, the number of unacceptable results overall (number of red circles)
  • $No$, the number of acceptable results overall (number of orange circles)
  • $Ng$, the number of great results overall (number of green circles)

Let also $N$ be the total number of results, given by $N = Nr + No + Ng$. The score expression is given by:

$$\text{Score}_{ML} = \frac{1}{2N} (2 \times Ng + 1 \times No + 0 \times Nr)$$

A perfect score is obtained if all the given quantities provide great results. Indeed, we would have $Ng = N$ and $Nr = No = 0$, which implies $\text{Score}_{ML} = 1$.
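The threshold-based scoring can be sketched as follows. The sketch assumes that each metric is an error (lower is better) and that threshold comparisons are inclusive; the quantity names and threshold values in the usage example are hypothetical and do not reproduce the competition's calibration.

```python
def points(value, great_threshold, acceptable_threshold):
    """Map a lower-is-better metric to 2 (great), 1 (acceptable) or 0 points."""
    if value <= great_threshold:
        return 2
    if value <= acceptable_threshold:
        return 1
    return 0


def threshold_subscore(metric_values, thresholds):
    """Compute (2*Ng + 1*No + 0*Nr) / (2*N) over all quantities of interest."""
    pts = [points(metric_values[name], *thresholds[name])
           for name in metric_values]
    n_great = pts.count(2)       # Ng, green circles
    n_acceptable = pts.count(1)  # No, orange circles
    return (2 * n_great + n_acceptable) / (2 * len(pts))


# Hypothetical metric values and (great, acceptable) thresholds.
values = {"a_or": 0.01, "p_or": 0.05, "v_or": 3.0}
limits = {"a_or": (0.02, 0.05), "p_or": (0.02, 0.10), "v_or": (0.2, 0.5)}
print(threshold_subscore(values, limits))  # one great, one acceptable, one unacceptable
```

With one great, one acceptable, and one unacceptable result, the sub-score is $(2 + 1 + 0) / 6 = 0.5$. The same helper applies to any category scored with two thresholds and 0/1/2 points.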


The Physics compliance score $\text{Score}_{Physics}$ is also calibrated based on two thresholds and gives 0/1/2 points, similarly to $\text{Score}_{ML}$, depending on the results of the various physics compliance metrics mentioned earlier.


Exactly the same procedure as for the computation of $\text{Score}_{test}$ is used to compute the score on the out-of-distribution dataset, using the same two evaluation categories: ML-related and Physics compliance. Hence, $\text{Score}_{ood}$ is obtained by:

$$\text{Score}_{ood} = \alpha_{ML} \times \text{Score}_{ML} + \alpha_{Physics} \times \text{Score}_{Physics}$$

where $\alpha_{ML}$ and $\alpha_{Physics}$ are the coefficients that calibrate the relative importance of the ML-related and Physics compliance categories, respectively, for the out-of-distribution test dataset.


For the speed-up criterion, we calibrate the score using the $\log_{10}$ function, with an adequate threshold of maximum speed-up to be reached for the task, meaning

$$\text{Score}_{speed-up} = \min \left(\frac{\log_{10}(SpeedUp)}{\log_{10}(SpeedUpMax)}, 1\right)$$

where

  • $SpeedUp$ is given by

$$SpeedUp = \frac{time_{ClassicalSolver}}{time_{Inference}}$$

  • $SpeedUpMax$ is the maximal speed-up allowed for the load flow prediction;
  • $time_{ClassicalSolver}$ is the elapsed time to solve the physical problem using the classical solver;
  • $time_{Inference}$ is the inference time obtained by the submitted solution.

In particular, there is no advantage in providing a solution whose speed-up exceeds $SpeedUpMax$, as one would get the same perfect score ($\text{Score}_{speed-up} = 1$) as a solution such that $SpeedUp = SpeedUpMax$.
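The speed-up score can be sketched directly from the formula above. The function name `speedup_score` and the timing values in the usage example are illustrative assumptions; in particular, the maximal speed-up used here is a placeholder, not the competition's threshold.

```python
import math


def speedup_score(time_classical_solver, time_inference, speedup_max):
    """Log-scaled speed-up score, capped at 1 once speedup_max is reached."""
    speedup = time_classical_solver / time_inference
    return min(math.log10(speedup) / math.log10(speedup_max), 1.0)


# Placeholder timings in seconds, with an assumed maximal speed-up of 10^4.
print(speedup_score(100.0, 1.0, 10_000.0))        # 0.5: speed-up of 10^2
print(speedup_score(1_000_000.0, 1.0, 10_000.0))  # 1.0: capped at the maximum
```

The logarithmic scale means that each additional order of magnitude in speed-up is rewarded equally. Note that, taken literally, the formula yields a negative value for a solution slower than the classical solver ($SpeedUp < 1$).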

Note that, while only the inference time appears explicitly in the score computation, this does not mean that the training time is of no concern. In particular, if the training time exceeds a given threshold, the proposed solution will be rejected, which is equivalent to a null global score.

# Practical example

Using the notation introduced in the previous subsection, let us consider the following configuration:

In order to illustrate even further how the score computation works, we provide in Table 2 examples for the load flow prediction task.

As it is the most straightforward to compute, we start with the global score for the solution obtained with 'Grid2Op', the physical solver used to produce the data. It is the reference physical solver, which implies that the accuracy is perfect but the speed-up is lower than expected. For illustration purposes, we use the speed-up obtained by security analysis (explained at the beginning of Notebook 5), which was faster than the Grid2Op solver. Therefore, we obtain the following subscores:

Then, by combining them, the global score is 77%.

The procedure is similar for the LeapNet architecture. The associated subscores are:

Then, by combining them, the global score is 45%.