Towards Audit Requirements for AI-Based Systems in Mobility Applications
Devi Padmavathi Alagarswamy¹, Christian Berghoff², Vasilios Danos³, Fabian Langer³, Thora Markert³, Georg Schneider⁴, Arndt von Twickel² and Fabian Woitschek⁴
¹ZF Friedrichshafen AG, System House Autonomous Mobility Systems, Friedrichshafen, Germany
²German Federal Office for Information Security (BSI), Bonn, Germany
³TÜV Informationstechnik GmbH, IT Security Hardware Evaluation, Essen, Germany
⁴ZF Friedrichshafen AG, Artificial Intelligence Lab, Saarbrücken, Germany
Keywords: Artificial Intelligence, Neural Networks, Security, Safety, Trustworthiness, Regulation, AD, ADAS.
Abstract:
Various mobility applications like advanced driver assistance systems increasingly utilize artificial intelligence
(AI) based functionalities. Typically, deep neural networks (DNNs) are used as these provide the best perfor-
mance on the challenging perception, prediction or planning tasks that occur in real driving environments.
However, current regulations like UNECE R 155 or ISO 26262 do not consider AI-related aspects and are
only applied to traditional algorithm-based systems. The absence of AI-specific standards or norms hinders their practical application and can harm the level of user trust. Hence, it is important to extend existing
standardization for security and safety to consider AI-specific challenges and requirements. To take a step
towards a suitable regulation we propose 50 technical requirements or best practices that extend existing reg-
ulations and address the concrete needs for DNN-based systems. We show the applicability, usefulness and
meaningfulness of the proposed requirements by performing an exemplary audit of a DNN-based traffic sign
recognition system using three of the proposed requirements.
1 INTRODUCTION
Artificial intelligence (AI)-based systems are increasingly used as part of mobility applications like autonomous driving (AD) or advanced driver assistance systems (ADAS). In particular, deep neural networks (DNNs) achieve an impressive performance on most
tasks and are the most promising solution to achieve
higher levels of automated driving. At the same time,
different manufacturers already use DNN-based solu-
tions as part of ADASs with partial automation (SAE
L2 (SAE J3016, 2014)) (Karpathy, 2021) that are op-
erating on public roads or for highly automated shut-
tles (SAE L4) (Waymo, 2021) operating in limited
public areas. However, current DNNs introduce new
and specific vulnerabilities into the systems which
can impact the performance and trustworthiness of
AD/ADAS systems negatively. This requires a de-
tailed analysis of existing vulnerabilities and poten-
tial mitigation strategies. To still enable the usage of such DNN-based solutions for high-risk applications, like highly or fully automated driving (SAE L4/L5), clear guidelines and regulations are required. This ensures that systems with a high degree of autonomy are
trustworthy with respect to use case relevant aspects
like safety, security, robustness or explainability and
include mitigation strategies to known vulnerabilities.
However, currently no homologation regulations
or standards exist that are tailored towards the use
of AI-based systems in mobility applications and in-
clude AI-specific vulnerabilities (Radlak et al., 2020).
There are no uniformly acknowledged principles and
practices that the development, testing or deployment
of AI-based systems must fulfill. This limits the fu-
ture deployment of AI-based systems to low-risk ap-
plications. Furthermore, it represents a major chal-
lenge for industry, auditors and regulators and poten-
tially leads to a lower level of user trust.
To provide guidance for future regulations, in this
work we explore how auditing guidelines can ensure
the security and safety of AI-based systems in high-
risk applications. Thereby, we focus on the appli-
cation of AI-based systems for mobility applications
like functionalities for AD/ADAS. Hence, we base
our work on existing standards for road vehicles, like
the functional safety standard ISO 26262 (ISO 26262,
2018), which are relevant for mobility applications,
and provide the following contributions:
- We present an overview of existing and developing standards relevant for auditing AI-based systems in mobility applications.
- We introduce a list of generic requirements which focus on specific needs arising when auditing AI-based systems.
- We compare different use cases for AI-based systems in mobility applications to select the most suitable use case, which is used for testing and refining the introduced audit requirements in practice.
- We demonstrate the applicability of the most relevant audit requirements for the selected use case.
2 RELATED WORK
During an audit, compliance with industry standards mandated by regulators is evaluated. How-
ever, traditional standards for systems in mobility ap-
plications do not contain specific guidelines in case
AI-based systems are utilized instead of traditional
algorithm-based systems. This includes both safety
standards like (ISO 26262, 2018), (ISO 21448, 2022)
or (ANSI/UL 4600, 2022) and security standards like
(ISO/SAE 21434, 2021) or (UNECE R 155, 2021).
Since currently no standards for the auditing of
AI-based systems exist, there are approaches to de-
velop an appropriate standardization. Best known
here is the (EU AI Act, 2021) which tries to lay down
uniform regulations for AI-based systems. It presents
a horizontal regulatory approach with necessary re-
quirements to address different risks and challenges
when AI is used, without focusing on the needs for
specific application areas. Similarly, (ISO/IEC TR
24028, 2020) focuses on the trustworthiness of AI-
based systems without considering a concrete appli-
cation domain. It surveys different generic threats
and risks and also covers existing mitigation strate-
gies. Additionally, (ISO/IEC TR 24029-1, 2021) pro-
vides background information on existing methods to
assess the robustness of generic DNNs.
In addition to the already published drafts for stan-
dardization of AI-based systems mentioned above,
there are also ongoing standardization activities. This
includes (ISO/IEC DTR 5469, 2022), which covers
aspects of functional safety specific for AI-based sys-
tems. In addition, (ISO/IEC PRF TS 4213, 2022) fo-
cuses on methods to assess the performance of ML-
based classification systems. Furthermore, there are
standardization activities on a national level which in-
clude the (DIN Roadmap AI, 2022). Here, require-
ments and challenges as well as standardization needs
for seven topics around AI are discussed.
In contrast to these horizontal regulatory ap-
proaches, there are also vertical regulatory ap-
proaches. These aim to develop standards for con-
crete application areas and multiple standards are in
development for the usage of AI in (high-risk) mo-
bility applications. Here, (ISO/AWI TS 5083, 2022)
gives guidance on the steps for developing and val-
idating safe AD/ADAS systems. It covers the SAE
levels L3/L4 and the impact of using AI-based sys-
tems as part of larger AD/ADAS systems. Addition-
ally, (ISO/AWI PAS 8800, 2022) focuses specifically
on the interaction between safety and AI. It defines
risk factors for vulnerabilities in the behavior of AI
within mobility applications.
Finally, there are only very few publications avail-
able that focus on the auditing of AI systems in prac-
tice. Here, (Raji et al., 2020) introduces a framework
for auditing AI-based systems throughout the internal
development lifecycle. This results in a series of doc-
uments which form an overall audit report that can be
used by auditors for a formal audit.
After discussing available and developing stan-
dards for AI-based systems in general and specific
to mobility applications, we now shortly present an
overview of related publications which focus on spe-
cific vulnerabilities or challenges that exist for AI-
based systems. In (Li et al., 2022) the authors
present a summary of important aspects for trustwor-
thy systems which they deem necessary to be audited.
(Berghoff et al., 2020) focuses on discussing known
vulnerabilities of current AI-based systems and how
they compare with traditional algorithm-based sys-
tems. Similarly, (Mohseni et al., 2020) first discusses
current vulnerabilities specific to AI-based systems
and then presents possible mitigation strategies.
Concretely, there are multiple new security chal-
lenges for AI-based systems. These include model
extraction attacks (Papernot et al., 2017; Orekondy
et al., 2019), where an adversary attempts to copy
the functionality of a victim AI model. Next, evasion
attacks, also known as adversarial attacks (Szegedy
et al., 2014; Madry et al., 2018), are carefully
perturbed input samples (adversarial examples) that
change the prediction of AI-based systems according
to the will of an adversary. This poses a threat to
the integrity of the system. Lastly, data poisoning at-
tacks (Goldblum et al., 2020; Schwarzschild et al.,
2021) describe the injection of poisoned data samples
in the training dataset of an AI-based system. This de-
grades the behavior of the resulting system depending
on the specific goals of the adversary.
In addition to security related AI-specific vulnera-
bilities there are also challenges regarding the robust-
ness against natural perturbations (Hendrycks and Di-
etterich, 2019; Geirhos et al., 2020). Concretely, out-of-domain data describes the presence of data samples that deviate from the distribution used during training of an AI-based system. This effect occurs naturally when systems are deployed in the real world outside of a completely supervised environment and poses a challenge to the generalization of such systems.
Also, the black-box character of AI-based systems, combined with the complexity and number of parameters of DNNs, makes it difficult to explain the behavior of a system. It is largely un-
clear how a system arrives at its predictions and which
features of a data sample are most important for a con-
crete prediction. Therefore, the need arises for meth-
ods that can explain the decision or general behavior
of a system that is learned from data. These methods
are published under the term of explainable artificial
intelligence (Gilpin et al., 2018; Guidotti et al., 2018).
3 GENERIC AUDIT
REQUIREMENTS
In the following, generic requirements are derived
based on a detailed analysis of established security
and safety standards. As stated in section 2, there are currently no AI-specific certification
standards, norms or regulations for systems in mobil-
ity applications. The available standards and norms
are designed for traditional algorithm-based automo-
tive systems. We extract their AI-relevant aspects and
complement them with AI-specific formulations.
3.1 Requirements Elicitation
Due to the special characteristics of AI components
(e.g. high data complexity, non-linearity or lack of
interpretability), some of the existing standards may
not apply to such components or have to be adjusted
to also ensure the safety and security of AI-based sys-
tems. Hence, the generic requirements are formulated
to address the technical aspects of performance, robust-
ness, explainability, external monitoring and the doc-
umentation of the entire mobility system and its AI
subsystems. Furthermore, we consider requirements
along the entire lifecycle of such systems. Fairness
and privacy are out of scope for this work and should
be addressed in future research. In total we formulate
50 generic requirements, which are available in the
project report at (AIMobilityAuditPrep, 2022). In the
following, we present our general approach to derive
the requirements and discuss three exemplary require-
ments which can be applied to most AI-based systems
and are highly relevant.
The (ISO 26262, 2018) introduces the automotive safety integrity level (ASIL), a risk-level-based classification of recommendations for automotive systems. In this scheme, the system's risk is categorized into one of four levels based on the possible exposure to hazards, the controllability of a hazard and the severity of possible injuries to the driver and passengers stemming from the hazard. The four ASIL levels range from “ASIL A”, associated with the lowest degree of risk, to “ASIL D”, associated with the highest degree of risk. ISO 26262 maps safety requirements to recommendations for each risk level. These recommendations are described as “highly recommended” (++), “recommended” (+) and “no recommendation” (o), where “highly recommended” indicates a need for implementation of the associated requirement for an application at the corresponding risk level. Since the ASIL is a well-defined classification
scheme, we follow this risk level based categorization
approach to classify each of the requirements accord-
ing to their risk level definition. This allows for a risk
based selection of a set of requirements for each indi-
vidual mobility application.
During a homologation process, the integration of
vehicle components is evaluated at each integration
step. Therefore, the functional safety and security of
the entire mobility system must be addressed during
the requirements elicitation. Accordingly, we catego-
rize the requirements according to whether they apply to the entire
mobility system or the AI subsystem.
3.2 Entire System Requirements
To provide more insight into the requirements elic-
itation process, we show an example requirement
catered towards the entire system and its ASIL clas-
sification. REQ. 7 from Table 1 ensures that the per-
formance of the entire system is not affected under
worst-case conditions. This can either encompass
natural phenomena such as weather or lighting con-
ditions, but also security threats for example by ad-
versarial attacks or side-channel attacks. It is derived
from the ASIL recommendation which states that the
system shall be tested against worst-case errors. An
example definition of a worst-case error is provided
in subsection 5.3.
Table 1: Three exemplary generic requirements and their ASIL risk level classification.

Identifier | Requirement | ASIL A | ASIL B | ASIL C | ASIL D
Req. 7 | The performance shall be compliant to the allowed worst-case error. | ++ | ++ | ++ | ++
Req. 30 | The training, test and evaluation datasets shall be independent from each other. | ++ | ++ | ++ | ++
Req. 33 | The model’s decisions shall be explained to aid the comparison between the modelling of the system and the trained model. | ++ | ++ | ++ | ++
Table 2: Detailed analysis of the exemplary generic requirements from Table 1.

Identifier | Applicability | Concretization | Testability | Test Procedure
Req. 7 | Complex | Major | High | Metric-based
Req. 30 | Simple | Minor | High | Evidence-based
Req. 33 | Complex | Minor | Medium | Metric-based & Evidence-based
3.3 AI Subsystem Requirements
REQ. 33 from Table 1 is targeted towards the AI sub-
system within the mobility system. This requirement
is derived from an ASIL recommendation that states
that the modelling of the system shall be compared to
the resulting system. It is modified to fit the AI sub-
system by stating that the model’s decisions shall be
explained. This is because AI models behave like black boxes, offering little to no insight into how their decisions are made. Therefore, specific
explainability methods shall be used to gain insight
into the correct functionality.
Analogously, REQ. 30 states that the training,
evaluation and testing datasets used during the devel-
opment shall be independent from each other. This
ensures that performance or training issues can be de-
tected during the training phase. The testing proce-
dure for this requirement depends on the size and for-
mat of the datasets at hand. An example evaluation of
this requirement is explained in subsection 5.3.
3.4 Testability and Applicability
The above-formulated requirements must be specified
for each individual use case. To determine the effort
needed to transfer a requirement between different
mobility use cases we perform an analysis to deter-
mine the applicability and testability of each require-
ment. Additionally, we also provide indicators on the
concretization effort between use cases and the type
of test procedure. Table 2 presents the categorization
for the three example requirements from subsection 3.2
and subsection 3.3.
4 AI-BASED SYSTEMS IN
MOBILITY APPLICATIONS
To assess the suitability of the generic audit require-
ments proposed in section 3, we aim to perform prac-
tical tests for a concrete AI-based system. To achieve
universally applicable results, this system should be
representative for different use cases in mobility ap-
plications and at the same time enable efficient initial
tests. Thus, to select the most suitable exemplary use
case we first introduce categories, which help to as-
sess the suitability of different use cases. Subsequently,
we present a summary of possible AI-based use cases
in mobility applications. Based on the introduced cat-
egories we then analyze all collected use cases to de-
cide which use case is best suited for the initial prac-
tical tests of the audit requirements.
4.1 Analysis Categories
To assess the suitability of different AI-based use
cases in mobility applications for the practical audit
requirements tests we choose five categories. These
cover important aspects regarding the feasibility of
the tests and the meaningfulness of the achieved re-
sults. In Table 3 these are later applied to individual
use cases in mobility applications.
First, the relevance of each use case for the safety
of the entire mobility system is rated as high, medium,
low or none. To achieve results for the auditing of
the most critical use cases, it is preferable to select a use case where a certain amount of safety relevance is given; a high relevance is ideal. This allows developing audit criteria for critical tasks where an audit is important and required.
Next, the complexity and auditability of each use
case is categorized as complex, medium or simple.
This category covers the test effort required to derive
the residual risk of an AI-based system implement-
ing a use case. For the initial practical tests, it is
preferable to select a use case with rather low com-
plexity. This keeps the development feasible and enables more extensive tests.
Third, the applicability of potential (adversarial)
attacks on the AI component of each use case is rated
as unrealistic, complex, medium or simple. The most
important factors that influence the categorization are
the scalability of an attack, the availability of litera-
ture or demonstrations of an attack and the required
access interface of an adversary. Here, it is impor-
tant that a direct attack interface to the AI component
exists and it is best when attacks are comparatively
simple to execute in practice.
Additionally, the required resources for imple-
menting an exemplary system of each use case are
rated as high, medium or low. Multiple factors like the
availability of open-source datasets or representative
implementations, the model size of involved AI com-
ponents and the required computational resources for
training or inference are considered. It is most suit-
able when datasets and implementations are publicly
available and only low resources are required.
Lastly, the generalizability of the results of the
practical audit requirements tests to other use cases
is categorized as high, medium or low. Most impor-
tantly, this categorization considers which sensors are
used, which perception components are involved and
whether the use case impacts the planning or control
of a vehicle. Use cases are preferable when some of
these characteristics are shared with other use cases.
4.2 Potential Use Cases
Selecting a representative use case from the sheer
number of potential use cases in AI-based mobility
applications (Yurtsever et al., 2020; Ziebinski et al.,
2016) is a difficult task. We tackle this by first collect-
ing a list of ten high-level use cases which commonly
occur in the specific areas of AD/ADAS. For such use
cases the need for audit requirements and procedures
is highest, as associated systems are safety-relevant
and are increasingly tested on public roads.
Concretely, we start by considering AD/ADAS
use cases which have a direct impact on the control
of a vehicle. For all use cases we include the re-
quired perception of the respective road users or ob-
jects, meaning we do not only consider the final con-
trol algorithms. Here, the first use case is termed colli-
sion avoidance, which includes all functionalities that
react to potential obstacles in the driving path of a ve-
hicle, by initiating deceleration and/or steering mo-
tions. Notably, collision avoidance also contains (au-
tomatic) emergency braking. Next, we consider the
lane keeping use case, which includes all function-
alities that keep a vehicle in the current driving lane.
Here, mainly steering motions are performed to tackle
the given task. Further, we consider the lane changing
use case, which includes all functionalities that lead to
a change in the driving lane of the vehicle. Like lane
keeping, steering motions are most important but also
acceleration or deceleration motions are required to
be able to merge in between two vehicles. Fourth, the
adaptive cruise control use case includes functional-
ities that manage the distance to a vehicle driving in
front of the ego vehicle. Here, deceleration and accel-
eration motions are important to control the distance
to the leading vehicle adaptively based on its driving
maneuvers and speed.
After presenting use cases that directly affect the
control of a vehicle, we now discuss additional use
cases which have no direct control impact but are needed to obtain a list of the most important AI-based
AD/ADAS use cases. Therefore, the fifth use case
is global path planning. This includes functionali-
ties to plan the global route/path of a vehicle, which
consists of the rough path from the starting location
to the target location. This route is updated online
during driving depending on the current occupancy
of roads or the probability of traffic jams. Next, the
traffic sign assistant use case includes all functional-
ities that show currently relevant traffic signs to the
driver. However, this purely acts as an assistance fea-
ture and for example does not adapt the speed of a ve-
hicle to the detected speed limit automatically. Addi-
tionally, we consider the driver monitoring use case.
Here, the goal is to detect drowsiness or distraction
of a driver and provide a warning to the driver. This
can reduce the number and criticality of accidents and
is important when the driver must monitor assistance
functionalities and be able to intervene rapidly.
In addition to the use cases that describe a spe-
cific functionality, we also consider basic use cases
which provide information for various AD/ADAS-
related functionalities. Here, the map-based localiza-
tion use case includes functionalities to determine the
current position of a vehicle as it navigates through
the environment. Typically, a map of the environment
is used which can either be created dynamically or is
created a priori. Next, the road user detection use case
is considered, which includes functionalities to detect
dynamic traffic participants like pedestrians, vehicles
Table 3: Overview of considered AI-based use cases in mobility applications and their suitability for the practical testing of audit requirements. For each parameter, the symbol in brackets indicates whether this parameter value is suitable (✓), partially suitable (o) or unsuitable (✗) for the initial testing of the proposed audit requirements.

Use Case | Safety Relevance | Complexity/Auditability | Attack Applicability | Required Resources | Generalizability
Collision Avoidance | High (✓) | Complex (o) | Medium (o) | High (✗) | High (✓)
Lane Keeping | High (✓) | Medium (o) | Simple (✓) | Medium (o) | Medium (o)
Lane Changing | High (✓) | Complex (o) | Medium (o) | High (✗) | High (✓)
Adaptive Cruise Control | High (✓) | Medium (o) | Complex (o) | High (✗) | Medium (o)
Global Path Planning | None (✗) | Simple (✓) | Unrealistic (✗) | High (✗) | Low (o)
Traffic Sign Assistant | Low (o) | Simple (✓) | Simple (✓) | Low (✓) | Medium (o)
Driver Monitoring | Medium (o) | Medium (o) | Unrealistic (✗) | Medium (o) | Low (o)
Map-based Localization | High (✓) | Medium (o) | Complex (o) | High (✗) | Low (o)
Road User Detection | High (✓) | Complex (o) | Medium (o) | Medium (o) | Medium (o)
Behavior Prediction | High (✓) | Complex (o) | Unrealistic (✗) | Medium (o) | Low (o)
or cyclists. Lastly, the behavior prediction use case
includes functionalities to identify the behavior and
subsequently the trajectory of traffic participants. All
three use cases have no direct impact on the control
of a vehicle since they only provide information to
functionalities for specialized use cases like collision
avoidance or lane changing.
4.3 Use Case Selection
After collecting a list of use cases in AI-based mo-
bility applications we use the categories from subsec-
tion 4.1 to assess the suitability of each presented use
case in Table 3. It is important to note that the as-
signed values in each category must be seen relative to
each other. For example, the value low only indicates
that the use case is on the lower end when compared
to all other presented use cases.
For the final selection of a use case, we start by
dropping all use cases which do not fulfill the basic prerequisites in at least one category. This means that
global path planning is no longer considered since it
has no direct safety relevance and it is unrealistic to
apply attacks which target the AI component directly.
Similarly, driver monitoring and behavior prediction
are also no longer considered as it is very challeng-
ing or unrealistic for an adversary to apply attacks.
Next, the use cases collision avoidance, lane chang-
ing, adaptive cruise control and map-based localiza-
tion are considered as unsuitable because they typi-
cally require larger model sizes and because fewer use-case-specific datasets are available for training and testing.
After dropping the seven unsuitable use cases only
the three use cases lane keeping, traffic sign assis-
tant and road user detection are considered for fur-
ther analysis. All three are in principle suitable and
fulfill the basic prerequisites for all categories from
subsection 4.1. The remaining use cases differ with
respect to their safety relevance (higher for lane keep-
ing and road user detection) and their resource re-
quirements/complexity (lower for traffic sign assis-
tant). To be able to test more audit requirements with
the available resources, we prioritize feasibility, i.e. the lower complexity, and select the traffic sign
assistant use case for further practical investigations.
More complex and safety-relevant use cases can be
explored later, once it is shown that the audit require-
ments can be applied and provide useful results. De-
tails on the implementation of a system representing
the selected use case are given later in subsection 5.1.
5 PRACTICAL
IMPLEMENTATION
To test the applicability and expressiveness for real
applications, we implement the selected generic audit
requirements from section 3 for the traffic sign assis-
tant use case. Thus, in the following we first introduce
the detailed architecture of the exemplary ADAS sys-
tem which represents the traffic sign assistant func-
tionality. Then, we discuss the application of some in-
teresting audit requirements and describe results and
challenges.
5.1 Experimental Setup
For the traffic sign assistant use case, a single forward-facing RGB camera sensor is used. Based
on the data of this sensor, the classification (and pre-
ceding detection) of traffic signs is performed using a
DNN. This mimics the approach used for real traffic
sign assistants (Lim et al., 2017). For our initial prac-
tical experiments, we only consider a DNN that per-
forms a pure classification of traffic signs. The reason
is that typically there exists a common detector for all
kinds of road users and elements, like traffic signs, ve-
hicles, pedestrians, etc. Based on the detected objects,
the content of the detected bounding boxes is fed to
special classification modules that specialize in con-
cretely classifying the object in a box. For the case
of a traffic sign assistant this means that a preceding
road elements detector exists which outputs bounding
boxes around detected signs. Then, a classifier that
focuses on traffic signs is used to determine the ex-
act sign type based on the given subset of the entire
image selected by the preceding detector. The output
of this system is therefore the detected traffic sign in
the given image. In this work we use the classifier to
test the proposed audit requirements in practice.
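To make the described two-stage architecture concrete, the following sketch shows how detector and classifier could be wired together; `detector` and `classifier` are placeholders for arbitrary models, and the crop size is an assumption for illustration, not the implementation used in this work.

```python
import torch
import torch.nn.functional as F

def traffic_sign_assistant(image: torch.Tensor, detector, classifier) -> list:
    """Schematic two-stage pipeline: a generic road-element detector
    proposes bounding boxes, and a specialized classifier determines
    the exact sign type for each box. Both models are placeholders."""
    boxes = detector(image)  # assumed to return [(x1, y1, x2, y2), ...]
    predictions = []
    for x1, y1, x2, y2 in boxes:
        crop = image[:, y1:y2, x1:x2]  # cut out the detected sign (CHW tensor)
        crop = F.interpolate(crop.unsqueeze(0), size=(32, 32),
                             mode="bilinear", align_corners=False)
        logits = classifier(crop)                      # classify the sign crop
        predictions.append(int(logits.argmax(dim=1)))  # predicted class index
    return predictions
```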
To train the DNN that classifies traffic signs we
use the German Traffic Sign Recognition Benchmark
(GTSRB) (Stallkamp et al., 2011) as training dataset.
This dataset features 43 classes of German traffic
signs. Using this dataset, we train a ResNet-18 (He
et al., 2016) to represent the traffic sign assistant.
ResNet-18 is selected because this architecture is one
of the most successful architectures in DNN history
and is often used in literature as a sensible baseline
independent of the concrete task and use case. The
training is performed without any augmentations and
using only the standard GTSRB training dataset.
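A minimal training sketch under this setup could look as follows; it assumes a torchvision version that ships the GTSRB dataset class (0.12 or later), and the image size, optimizer and hyperparameters are illustrative choices rather than the exact experimental configuration.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms

# Plain preprocessing only -- the training uses no augmentations.
preprocess = transforms.Compose([
    transforms.Resize((32, 32)),
    transforms.ToTensor(),
])

# torchvision >= 0.12 ships GTSRB (43 German traffic sign classes).
train_set = datasets.GTSRB(root="data", split="train",
                           transform=preprocess, download=True)
loader = DataLoader(train_set, batch_size=128, shuffle=True)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = models.resnet18(num_classes=43).to(device)  # the sign classifier

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # illustrative values
criterion = nn.CrossEntropyLoss()

for epoch in range(10):
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```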
5.2 Generic Toolbox
To allow for an easy expansion of our tests to further
mobility use cases, we design a toolbox in a modu-
lar way. In this work the toolbox is implemented in
an exemplary fashion for the traffic sign assistant use
case and some audit requirements discussed in sub-
section 5.3. The goal is to continuously expand this toolbox and incorporate more audit requirements and use cases over time. An overview of the toolbox is available at www.bsi.bund.de/dok/1079914.
5.3 Application of Requirements
The generic catalogue of requirements that is elicited
in section 3 enables a simple selection of require-
ments based on the risk level of the specific use case.
We assume the traffic sign assistant to be an assistance
system which is not able to gain automated control of
the vehicle on its own. However, as input to an AD
system that for example regulates the vehicle’s speed
in a certain range based on the detected traffic signs
and additional parameters, the use case might be clas-
sified as ASIL A during homologation. Due to this
assumption, we select and specify requirements that
are “highly recommended” (++) for the ASIL A risk
level from our requirement catalogue (AIMobilityAu-
ditPrep, 2022).
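As a minimal sketch, this risk-based selection can be expressed as a simple filter over a machine-readable catalogue; the identifiers and recommendation entries below are illustrative placeholders, not the actual catalogue from (AIMobilityAuditPrep, 2022).

```python
# Illustrative catalogue excerpt: recommendation per ASIL level (A-D).
CATALOGUE = {
    "REQ-07": {"A": "++", "B": "++", "C": "++", "D": "++"},
    "REQ-30": {"A": "++", "B": "++", "C": "++", "D": "++"},
    "REQ-33": {"A": "++", "B": "++", "C": "++", "D": "++"},
}

def select_requirements(catalogue: dict, asil: str, level: str = "++") -> list:
    """Return all requirements carrying the given recommendation
    (e.g. "++" for highly recommended) at the given ASIL level."""
    return [req for req, rec in catalogue.items() if rec[asil] == level]

print(select_requirements(CATALOGUE, asil="A"))
# -> ['REQ-07', 'REQ-30', 'REQ-33']
```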
After the requirements are selected, the evaluation
process consists of the following three steps for each
requirement:
1. Parameter selection: If the requirement requires
the specification of parameters, the parameters
are chosen according to the use case (if nec-
essary by domain experts). Moreover, a ratio-
nale/justification for how these parameters are de-
rived is provided.
2. Description of the audit procedure: The audit
procedure for the requirement is described. For
“metric-based” requirements the technical evalu-
ation/tests that are performed shall be described.
For “evidence-based” requirements the procedu-
ral evaluation of evidence is described.
3. Verdict: The test results of “metric-based” tests or
findings of “evidence-based” evaluations are as-
sessed and a “pass” or “fail” verdict is given de-
scribing whether the requirement is fulfilled.
In the following, we schematically show these steps
for the exemplary requirements introduced in sec-
tion 3 with the traffic sign assistant use case described
in subsection 5.1.
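These three steps can be mirrored in a small audit harness; the sketch below assumes a metric-based requirement whose verdict compares a measured value against a specified parameter, with all names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class MetricRequirement:
    identifier: str
    description: str
    metric: Callable[[], float]  # step 2: audit procedure (technical test)
    threshold: float             # step 1: parameter chosen for the use case

    def verdict(self) -> str:
        """Step 3: pass/fail verdict against the specified parameter."""
        return "pass" if self.metric() >= self.threshold else "fail"

# Illustrative usage for REQ. 7, plugging in a placeholder metric:
req7 = MetricRequirement(
    identifier="REQ-07",
    description="Performance compliant to the allowed worst-case error",
    metric=lambda: 0.79,  # stands in for the heavy-rain accuracy evaluation
    threshold=0.90,
)
print(req7.identifier, req7.verdict())  # -> REQ-07 fail
```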
5.3.1 Requirement 7
The performance shall be compliant to the allowed
worst-case error.
To fulfill this requirement the “performance” and “al-
lowed worst-case error” have to be specified. In the
case of the traffic sign assistant use case, we choose
to measure the “performance” through the accuracy
of the system. The “allowed worst-case error” is cho-
sen as an accuracy greater than 90 %, and we take heavy rain as an example of a worst-case condition. In
the scope of this work, we schematically assume the
vehicle running the traffic sign assistant is operated
in Germany, where heavy rain is common and a 90 %
accuracy still offers sufficient reliability of the assis-
tance system. This selection for the two parameters
only serves as an example to demonstrate how the re-
quirement can be tested in practice. Depending on the
boundary conditions, operational design domain, mit-
igation strategies or level of automation of the overall
system it could be necessary to adapt the value for the
required accuracy. In a real-world use case, the selection of these two parameters requires a rationale from domain experts.
The audit procedure for this requirement is
“metric-based”, where a dataset (that the model was
not trained on) containing data samples of all classes
in heavy rain conditions is evaluated by the model.
It is possible to use data samples captured in heavy
rain conditions or transform data samples from clear
weather conditions with a heavy rain simulation. If
the accuracy of this evaluation is greater than 90 %,
the requirement is fulfilled.
Our toolbox implements a heavy rain transformation using the albumentations library (Buslaev et al., 2020). This allows testing the worst-case error on any suitable traffic sign dataset under heavy rain conditions. We transform images from GTSRB using this heavy rain transformation, which for example results in the images depicted in Figure 1.

Figure 1: Examples of the heavy rain transformation. (a) GTSRB class 21. (b) GTSRB class 42.

On 2580 transformed data samples the system reaches an accuracy of 79 %. Since this is below the required 90 %, the requirement is not fulfilled and fails this evaluation. For real-world use cases, an assessment by domain experts is required regarding the representativity of different parameters of the used heavy rain transformation, like the strength or structure of the rain drops.
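A sketch of how such an evaluation could be realized with albumentations is given below; `A.RandomRain` is the library's rain corruption, but the concrete parameters and the evaluation loop are illustrative assumptions, not the exact toolbox implementation.

```python
import albumentations as A
import numpy as np
import torch

# Heavy-rain corruption; all parameter values here are illustrative.
rain = A.RandomRain(rain_type="heavy", brightness_coefficient=0.7,
                    blur_value=3, p=1.0)

def accuracy_under_rain(model, dataset, device="cpu"):
    """Accuracy on rain-transformed samples; `dataset` is assumed to
    yield (HWC uint8 image, label) pairs."""
    model.eval()
    correct = 0
    with torch.no_grad():
        for image, label in dataset:
            rainy = rain(image=np.asarray(image))["image"]
            x = torch.from_numpy(rainy).permute(2, 0, 1).float().div(255)
            pred = model(x.unsqueeze(0).to(device)).argmax(dim=1).item()
            correct += int(pred == label)
    return correct / len(dataset)

# REQ. 7 verdict: fulfilled only if the accuracy exceeds the chosen 90 %.
# fulfilled = accuracy_under_rain(model, test_set) > 0.90
```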
Alternatively, a worst case could be represented by an adversarial attack on the system. As an example, we take a PGD attack (Madry et al., 2018) with a perturbation budget of 0.3. We repeat the outlined audit process but execute a PGD attack instead of applying a heavy rain transformation. Against this attack the system reaches an accuracy of 21 %, which means REQ. 7 is also not fulfilled using this second specification. Note that PGD represents an attack in the digital domain. In reality, physical attacks, which are applied in the environment itself, pose a larger threat, and testing against such attacks is more important.
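For reference, a standard L∞ PGD loop in the spirit of (Madry et al., 2018) can be sketched as follows; the step size and iteration count are illustrative, and the actual attack code of the toolbox may differ.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=0.3, alpha=0.01, steps=40):
    """L-infinity PGD: iterated gradient sign steps, projected back
    into the eps-ball around the clean input x (pixel range [0, 1]).
    alpha and steps are illustrative choices."""
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1).detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps)  # project
        x_adv = x_adv.clamp(0, 1)
    return x_adv.detach()

# Accuracy under attack: classify pgd_attack(model, images, labels)
# instead of the clean images and compare against the 90 % threshold.
```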
5.3.2 Requirement 30
The training, test and evaluation datasets shall be
independent from each other.
Since REQ. 30 has no parameters to be set, this step
is skipped. The testing procedure of this require-
ment is classified as “evidence-based”. Hence, the
dataset documentation, code and contents of each of
the datasets shall be consulted. The documentation
and code of the training procedure give insight into
how these datasets are generated. In this example, the
evidence showed that the datasets were split before
training the model into three disjoint datasets. Also,
the datasets follow the same underlying distribution
and are independent. Therefore, the requirement is
fulfilled. It is important to use independent splits of
the data to get a fair assessment of the quality of a
model. For example, images from a video recording
of a single scene should not be used in different splits.
Instead, images from a different recording like a dif-
ferent scene or in different weather must be used.
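One way to operationalize this scene-level independence is a group-aware split with an explicit disjointness check, as sketched below; the helper assumes every sample carries a recording or track identifier (in GTSRB, each physical sign is captured as a track of consecutive frames).

```python
from sklearn.model_selection import GroupShuffleSplit

def group_split(samples, groups, test_size=0.2, seed=0):
    """Split so that all images from one recording/track land in the
    same partition, preventing near-duplicate frames from leaking
    across splits. `groups` holds one scene/track id per sample."""
    splitter = GroupShuffleSplit(n_splits=1, test_size=test_size,
                                 random_state=seed)
    train_idx, test_idx = next(splitter.split(samples, groups=groups))
    # Evidence for the auditor: the recordings are provably disjoint.
    assert {groups[i] for i in train_idx}.isdisjoint(
        {groups[i] for i in test_idx}), "splits share a recording"
    return train_idx, test_idx
```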
5.3.3 Requirement 33
The model’s decisions shall be explained to aid the
comparison between the modelling of the system and
the trained model.
For this requirement, the method used for explaining
the decision and the system modelling the decisions
have to be determined. In the schematic traffic sign
assistant use case, we choose the following exem-
plary functional system requirement: The model de-
cision on a traffic sign image shall depend on the fig-
ure displayed by the traffic sign, the sign’s coloration and/or the shape of the sign. In real-world applications, depending on the chosen modelling, it would also be possible to implement automatic tests that detect whether a certain amount of background information is considered in the model’s decisions. In our
case, we choose the GradCam explainability method (Selvaraju et al., 2017) to explain a random set of 60 images per GTSRB class. Figure 2 presents examples of the GradCam explanation on images of the GTSRB dataset.

Figure 2: Examples of the GradCam explanation. The information with the highest influence on the model’s decision is highlighted in red. (a) GTSRB class 0. (b) GTSRB class 2.

The explanations clearly show that the most important information for the model’s decision (highlighted in red) lies in the center of the image, i.e. on the traffic sign itself. We analyze all 60 images for each class and they show similar results. Hence, this evaluation is passed and the requirement is fulfilled.
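A sketch of such a GradCam evaluation is shown below, using the third-party pytorch-grad-cam package rather than the toolbox's own code; targeting the last residual block is the usual convention for ResNets, and the input image here is a random placeholder.

```python
import numpy as np
import torch
from torchvision import models
from pytorch_grad_cam import GradCAM
from pytorch_grad_cam.utils.image import show_cam_on_image

model = models.resnet18(num_classes=43)  # in practice: the trained classifier
model.eval()

rgb = np.random.rand(32, 32, 3).astype(np.float32)           # placeholder image in [0, 1]
input_tensor = torch.from_numpy(rgb).permute(2, 0, 1)[None]  # (1, 3, H, W)

cam = GradCAM(model=model, target_layers=[model.layer4[-1]])
heatmap = cam(input_tensor=input_tensor)[0]  # (H, W) relevance map in [0, 1]
overlay = show_cam_on_image(rgb, heatmap, use_rgb=True)

# An auditor inspects whether the high-relevance regions (red) cover the
# sign itself (figure, coloration, shape) rather than the background.
```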
6 CONCLUSION
6.1 Summary
We introduce a list of generic audit requirements,
which are technically relevant to assure the trust-
worthiness, security, safety, robustness, explainabil-
ity, etc. of AI-based systems in mobility applica-
tions. These requirements were developed with attention
to existing regulations, norms and guidelines and an extensive literature review. Additionally, we implement
tools for exemplary audit requirements to demonstrate
the applicability using a selected mobility applica-
tion. For this, we perform a comparison of different
AD/ADAS use cases, based on various categories like
complexity, auditability and required resources. Using
this analysis, we determined the traffic sign assistant
use case to be best suited for the initial practical test-
ing of the audit requirements. Thus, we examine two
exemplary DNNs trained on German traffic signs us-
ing the implemented audit requirements. We find that
the generic audit requirements can be specified to pro-
vide meaningful results on the DNN-based traffic sign
assistants for different AI-specific properties.
6.2 Outlook
As discussed in subsection 5.3 we only use a subset
of all proposed audit requirements for the initial prac-
tical tests. A natural next step is to extend the practi-
cal tests to include all proposed requirements. Addi-
tionally, one can expand the extent of the already im-
plemented requirements. Some of these requirements
are quite extensive and can be implemented for prac-
tical tests in different ways. In a follow-up work the
exemplary implementation can be expanded to cover
further aspects of the associated audit requirements.
This enables more extensive audits and increases the
meaningfulness of the obtained results.
Furthermore, it is especially interesting to test
some audit requirements using actual hardware and
test facilities. Instead of performing all tests in a
simulation environment, the most interesting audit re-
quirements should also be tested in reality. Only these
tests enable to properly assess the feasibility and ex-
pressiveness of the proposed audit requirements.
Additionally, the complexity of the audited system
should be increased. Instead of using only a DNN-
based classifier, the system should be extended to be
more representative of systems used in reality. Ide-
ally, this is complemented by the application of the
audit requirements to industry systems operating in
practice. This allows judging the applicability under
real-world conditions and limitations.
Our goal is to continue this work and to consider
at least one further use case in addition to the traf-
fic sign assistant. We are working actively on the out-
lined next steps to further increase the meaningfulness
of our results and refine the proposed requirements
and best practices based on practical insights and lim-
itations. We want to move towards applying the audit
requirements in practice and create a formal techni-
cal guideline. The obtained results could then be used
as a blueprint for standardization activities and should
be introduced to the relevant committees.
ACKNOWLEDGEMENTS
This work was supported by the Federal Office for
Information Security (BSI), Germany, in project P538
AIMobilityAuditPrep.
REFERENCES
AIMobilityAuditPrep (2022). AIMobilityAuditPrep: Fi-
nal Results - Documentation. Technical report, Ger-
man Federal Office for Information Security. www.bsi.bund.de/dok/1079912.
ANSI/UL 4600 (2022). Standard for Safety for the Evalu-
ation of Autonomous Products. Standard, American
National Standards Institute, Underwriters Laborato-
ries.
Berghoff, C., Neu, M., and von Twickel, A. (2020). Vulner-
abilities of Connectionist AI Applications: Evaluation
and Defence. Frontiers in Big Data, 3.
Buslaev, A., Parinov, A., Khvedchenya, E., Iglovikov, V.,
and Kalinin, A. (2020). Albumentations: Fast and
Flexible Image Augmentations. Information, 11.
DIN Roadmap AI (2022). German Standardization
Roadmap on Artificial Intelligence. Draft, German In-
stitute for Standardization.
EU AI Act (2021). Regulation of the European Parlia-
ment and of the Council Laying Down Harmonised
Rules on Artificial Intelligence and Amending Certain
Union Legislative Acts. Draft, European Commission.
Geirhos, R., Jacobsen, J., Michaelis, C., Zemel, R., Brendel,
W., Bethge, M., and Wichmann, F. A. (2020). Shortcut
Learning in Deep Neural Networks. Nature Machine
Intelligence, 2:665–673.
Gilpin, L., Bau, D., Yuan, B., and Bajwa, A. (2018). Ex-
plaining Explanations: An Overview of Interpretabil-
ity of Machine Learning. In International Conference
on Data Science and Advanced Analytics, Turin, Italy.
Goldblum, M., Tsipras, D., Xie, C., and Chen, X. (2020).
Dataset Security for Machine Learning: Data Poison-
ing, Backdoor Attacks, and Defenses. IEEE Transac-
tions on Pattern Analysis & Machine Intelligence.
Guidotti, R., Monreale, A., Ruggieri, S., Turini, F., Gian-
notti, F., and Pedreschi, D. (2018). A Survey Of Meth-
ods for Explaining Black Box Models. ACM Comput-
ing Surveys, 51:1–42.
He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep Resid-
ual Learning for Image Recognition. In Conference on
Computer Vision and Pattern Recognition, Las Vegas,
USA.
Hendrycks, D. and Dietterich, T. (2019). Benchmarking
Neural Network Robustness to Common Corruptions
and Perturbations. In International Conference on
Learning Representations, New Orleans, USA.
ISO 21448 (2022). Road vehicles - Safety of the intended
functionality. Standard, International Organization for
Standardization.
ISO 26262 (2018). Road vehicles - Functional safety. Stan-
dard, International Organization for Standardization.
ISO/AWI PAS 8800 (2022). Road vehicles - Safety and
artificial intelligence. Standard, International Organi-
zation for Standardization.
ISO/AWI TS 5083 (2022). Road vehicles - Safety for auto-
mated driving systems - Design, verification and vali-
dation. Standard, International Organization for Stan-
dardization.
ISO/IEC DTR 5469 (2022). Artificial intelligence - Func-
tional safety and AI systems. Standard, International
Organization for Standardization, International Elec-
trotechnical Commission.
ISO/IEC PRF TS 4213 (2022). Artificial intelligence -
Assessment of machine learning classification perfor-
mance. Standard, International Organization for Stan-
dardization, International Electrotechnical Commis-
sion.
ISO/IEC TR 24028 (2020). Artificial intelligence -
Overview of trustworthiness in artificial intelligence.
Standard, International Organization for Standardiza-
tion, International Electrotechnical Commission.
ISO/IEC TR 24029-1 (2021). Artificial intelligence - As-
sessment of the robustness of neural networks - Part
1: Overview. Standard, International Organization for
Standardization, International Electrotechnical Com-
mission.
ISO/SAE 21434 (2021). Road vehicles - Cybersecurity en-
gineering. Standard, International Organization for
Standardization, SAE International.
Karpathy, A. (2021). Workshop on Autonomous Driving
- Tesla Keynote. In Conference on Computer Vision
and Pattern Recognition, Nashville, USA.
Li, B., Qi, P., Liu, B., Di, S., Liu, J., Pei, J., Yi, J., and
Zhou, B. (2022). Trustworthy AI: From Principles to
Practices. ACM Computing Surveys.
Lim, K., Hong, Y., Choi, Y., and Byun, H. (2017). Real-time
Traffic Sign Recognition based on a General Purpose
GPU and Deep Learning. PLOS ONE, 12:1–22.
Madry, A., Makelov, A., Schmidt, L., Tsipras, D., and
Vladu, A. (2018). Towards Deep Learning Models
Resistant to Adversarial Attacks. In International
Conference on Learning Representations, Vancouver,
Canada.
Mohseni, S., Pitale, M., Singh, V., and Wang, Z. (2020).
Practical Solutions for Machine Learning Safety in
Autonomous Vehicles. In Conference on Artificial In-
telligence: Workshop on Safe AI, New York, USA.
Orekondy, T., Schiele, B., and Fritz, M. (2019). Knockoff
Nets: Stealing Functionality of Black-Box Models. In
Conference on Computer Vision and Pattern Recogni-
tion, Long Beach, USA.
Papernot, N., McDaniel, P., Goodfellow, I., Jha, S., Celik,
B., and Swami, A. (2017). Practical Black-Box At-
tacks against Machine Learning. In ACM Asia Con-
ference on Computer and Communications Security,
Abu Dhabi, United Arab Emirates.
Radlak, K., Szczepankiewicz, M., Jones, T., and Serwa,
P. (2020). Organization of Machine Learning based
Product Development as per ISO 26262 and ISO/PAS
21448. In Pacific Rim International Symposium on
Dependable Computing, Perth, Australia.
Raji, I. D., Smart, A., White, R., Mitchell, M., Gebru,
T., Hutchinson, B., Smith-Loud, J., Theron, D., and
Barnes, P. (2020). Closing the AI Accountability Gap:
Defining an End-to-End Framework for Internal Al-
gorithmic Auditing. In ACM Conference on Fairness,
Accountability, and Transparency, Barcelona, Spain.
SAE J3016 (2014). Levels of Driving Automation. Stan-
dard, SAE International.
Schwarzschild, A., Goldblum, M., Gupta, A., Dickerson,
J., and Goldstein, T. (2021). Just How Toxic is Data
Poisoning? A Unified Benchmark for Backdoor and
Data Poisoning Attacks. In International Conference
on Machine Learning, Vienna, Austria.
Selvaraju, R., Cogswell, M., Das, A., Vedantam, R., Parikh,
D., and Batra, D. (2017). Grad-CAM: Visual Expla-
nations from Deep Networks via Gradient-based Lo-
calization. In International Conference on Computer
Vision, Venice, Italy.
Stallkamp, J., Schlipsing, M., Salmen, J., and Igel, C.
(2011). Man vs. Computer: Benchmarking Machine
Learning Algorithms for Traffic Sign Recognition. In
International Joint Conference on Neural Networks,
San Jose, USA.
Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan,
D., Goodfellow, I., and Fergus, R. (2014). Intriguing
Properties of Neural Networks. In International Con-
ference on Learning Representations, Banff, Canada.
UNECE R 155 (2021). Uniform provisions concerning the
approval of vehicles with regards to cyber security and
cyber security management system. Standard, United
Nations Economic Commission for Europe.
Waymo (2021). How we’ve built the World’s Most Experi-
enced Urban Driver. Waypoint.
Yurtsever, E., Lambert, J., Carballo, A., and Takeda, K.
(2020). A Survey of Autonomous Driving: Common
Practices and Emerging Technologies. IEEE Access,
8:58443–58469.
Ziebinski, A., Cupek, R., Erdogan, H., and Waechter, S.
(2016). A Survey of ADAS Technologies for the Fu-
ture Perspective of Sensor Fusion. In International
Conference on Computational Collective Intelligence,
Halkidiki, Greece.