# Evaluation
We propose a homogeneous evaluation of the submitted solutions to the airfoil design learning task using the LIPS (Learning Industrial Physical Systems Benchmark suite) platform. The evaluation is performed through 3 categories that cover several aspects of augmented physical simulations, namely:
- ML-related: standard ML metrics (e.g. MAE, RMSE, etc.) and speed-up with respect to the reference solution computational time;
- Physical compliance: respect of underlying physical laws (e.g. Navier-Stokes equations);
- Application-based context: out-of-distribution (OOD) generalization, i.e. the ability to extrapolate over minimal variations of the problem depending on the application.

In the ideal case, one would expect a solution to perform equally well in all categories, but there is no guarantee of that. In particular, even though a solution may perform well on standard machine-learning metrics, it must still be assessed whether it also properly respects the underlying physics.
For each category, specific criteria related to the airfoil design task are defined. The global score is calculated as a linear combination of the scores of the three evaluation-criteria categories, as sketched below.
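As an illustrative sketch only, such a linear combination can be written with symbolic weights $\alpha_{ML}$, $\alpha_{OOD}$ and $\alpha_{PH}$ (the actual coefficient values are fixed by the challenge configuration and are assumed here, for illustration, to sum to one):

$$
\text{Score}_{global} \;=\; \alpha_{ML}\,\text{Score}_{ML} \;+\; \alpha_{OOD}\,\text{Score}_{OOD} \;+\; \alpha_{PH}\,\text{Score}_{PH}, \qquad \alpha_{ML}+\alpha_{OOD}+\alpha_{PH}=1.
$$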
We explain in the following how to calculate each of the three category sub-scores.
# ML-related sub-score
This sub-score is calculated based on a linear combination of two sub-criteria, namely accuracy and speed-up.
For each quantity of interest, the accuracy sub-score is calculated based on two thresholds that are calibrated to indicate whether the metric evaluated on the given quantity gives an unacceptable / acceptable / great result, corresponding to a score of 0 points / 1 point / 2 points, respectively. Within this sub-category, let:
- $N_{great}$ be the number of quantities evaluated as great (2 points each);
- $N_{acceptable}$ be the number of quantities evaluated as acceptable (1 point each);
- $N_{unacceptable}$ be the number of quantities evaluated as unacceptable (0 points).

Let also $N = N_{great} + N_{acceptable} + N_{unacceptable}$ be the total number of evaluated quantities; the accuracy sub-score is the total number of points obtained divided by the maximum attainable, $2N$. A perfect score is obtained if all the given quantities provide great results: indeed, we would then have $2 N_{great} / (2N) = 2N / (2N) = 1$.
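As a hedged illustration of the computation described above (not the official evaluation code; the function names and threshold values are placeholders), the thresholding and normalization could look as follows:

```python
def quantity_points(metric_value, acceptable_threshold, great_threshold):
    """Map a metric value (e.g. the MAE on one quantity of interest) to 0/1/2 points.

    Assumes lower is better; the two thresholds are placeholders that would be
    calibrated per quantity in the actual benchmark configuration.
    """
    if metric_value <= great_threshold:
        return 2  # great
    if metric_value <= acceptable_threshold:
        return 1  # acceptable
    return 0      # unacceptable


def accuracy_subscore(points):
    """Normalize the points obtained over all quantities by the maximum 2*N."""
    return sum(points) / (2 * len(points))


# Example: three quantities rated great, acceptable and great.
print(accuracy_subscore([2, 1, 2]))  # 5/6 ~= 0.83
```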
For the speed-up criterion, we calibrate the score using the acceleration obtained with respect to the reference physical solver. Let:
- $time_{ClassicalSolver}$ be the elapsed time required by the reference physical solver to compute a solution;
- $time_{Inference}$ be the elapsed time required by the learned model to infer a solution;
- $\mathrm{SpeedUp} = time_{ClassicalSolver} / time_{Inference}$ be the resulting speed-up ratio;
- $\mathrm{SpeedUpMax}$ be the maximal speed-up considered for the task, at which the sub-score saturates.

In particular, there is no advantage in providing a solution whose speed-up exceeds $\mathrm{SpeedUpMax}$.
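One plausible realization of this calibration, consistent with the saturation described above, interpolates the speed-up on a logarithmic scale; the use of $\log_{10}$ below is an assumption of this sketch, not necessarily the exact formula used by the platform:

```python
import math

def speedup_subscore(time_classical_solver, time_inference, speedup_max):
    """Sub-score in [0, 1] that saturates once the speed-up reaches speedup_max.

    The log10 interpolation is an assumption of this sketch; only the
    saturation behaviour is stated in the text.
    """
    speedup = time_classical_solver / time_inference
    if speedup <= 1.0:
        return 0.0  # no acceleration, no credit
    return min(math.log10(speedup) / math.log10(speedup_max), 1.0)


# Example: a 100x acceleration with saturation at 10000x yields a sub-score of 0.5.
print(speedup_subscore(1500.0, 15.0, 10_000))
```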
Note that, while only the inference time appears explicitly in the score computation, this does not mean the training time is of no concern. In particular, if the training time exceeds a given threshold, the proposed solution will be rejected, which is equivalent to a null global score.
# OOD generalization sub-score
This sub-score evaluates the capability of the learned model to predict on an out-of-distribution (OOD) dataset. In the OOD test set, the input data are drawn from a different distribution than the one used for training. The computation of this sub-score is similar to that of the ML-related sub-score, evaluated on the OOD test set.
# Physics compliance sub-score
For the physics compliance sub-score, we evaluate the relative errors of physical variables. For each criterion, the score is also calibrated based on two thresholds and gives 0 / 1 / 2 points, similarly to the accuracy sub-criterion described above.
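A minimal sketch of such a relative-error criterion, reusing the same 0/1/2-point thresholding as the accuracy sub-criterion; the threshold values below are placeholders rather than the calibrated ones:

```python
import numpy as np

def relative_error(predicted, reference):
    """Relative L2 error between a predicted physical quantity and the solver reference."""
    predicted = np.asarray(predicted, dtype=float)
    reference = np.asarray(reference, dtype=float)
    return np.linalg.norm(predicted - reference) / np.linalg.norm(reference)


def physics_points(predicted, reference, acceptable_threshold=0.1, great_threshold=0.01):
    """Map the relative error to 0/1/2 points (placeholder thresholds)."""
    err = relative_error(predicted, reference)
    if err <= great_threshold:
        return 2
    if err <= acceptable_threshold:
        return 1
    return 0
```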
# Practical example
Using the notation introduced in the previous subsection, let us consider the following configuration:
•
•
•
•
•
•
To further illustrate how the score computation works, we provide in Table 2 examples for the airfoil task.
As it is the most straightforward to compute, we start with the global score of the solution obtained with 'OpenFOAM', the physical solver used to produce the data. Since it is the reference physical solver, its accuracy is perfect but its speed-up is only equal to 1 (no acceleration). Therefore, we obtain the following sub-scores:
•
•
•
Then, by combining them, the global score is
The procedure is similar for 'FC'; the associated sub-scores are:
•
•
•
Then, by combining them, the global score is
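To make the combination step concrete, here is a purely illustrative computation with hypothetical weights and sub-scores (these numbers are placeholders, not the challenge configuration nor the values reported in the table below):

```python
# Hypothetical weights and sub-scores, for illustration only.
alpha_ml, alpha_ood, alpha_ph = 0.4, 0.3, 0.3
score_ml, score_ood, score_ph = 0.5, 0.5, 1.0

global_score = alpha_ml * score_ml + alpha_ood * score_ood + alpha_ph * score_ph
print(global_score)  # 0.65
```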
Table 1: Scoring table for the 3 tasks under the 3 categories of evaluation criteria for the considered configuration. The performances are reported using three colors computed on the basis of two thresholds. Colors meaning: unacceptable / acceptable / great.