TY - JOUR
T1 - AI apology
T2 - interactive multi-objective reinforcement learning for human-aligned AI
AU - Harland, Hadassah
AU - Dazeley, Richard
AU - Nakisa, Bahareh
AU - Cruz, Francisco
AU - Vamplew, Peter
N1 - Publisher Copyright:
© 2023, The Author(s).
PY - 2023/8
Y1 - 2023/8
N2 - For an Artificially Intelligent (AI) system to maintain alignment between human desires and its behaviour, it is important that the AI account for human preferences. This paper proposes and empirically evaluates the first approach to aligning agent behaviour to human preference via an apologetic framework. In practice, an apology may consist of an acknowledgement, an explanation and an intention to improve future behaviour. We propose that such an apology, provided in response to the recognition of undesirable behaviour, is one way in which an AI agent may be both transparent and trustworthy to a human user, and furthermore that behavioural adaptation as part of an apology is a viable approach to correcting undesirable behaviours. The Act-Assess-Apologise framework could potentially address both the practical and social needs of a human user: to recognise and make reparations for prior undesirable behaviour, and to adjust for the future. Applied to a dual-auxiliary impact minimisation problem, the apologetic agent determined when an apology was warranted and provided it with near-perfect accuracy in several non-trivial configurations. The agent subsequently demonstrated successful behaviour alignment, in some scenarios completely avoiding the impacts described by these objectives.
AB - For an Artificially Intelligent (AI) system to maintain alignment between human desires and its behaviour, it is important that the AI account for human preferences. This paper proposes and empirically evaluates the first approach to aligning agent behaviour to human preference via an apologetic framework. In practice, an apology may consist of an acknowledgement, an explanation and an intention to improve future behaviour. We propose that such an apology, provided in response to the recognition of undesirable behaviour, is one way in which an AI agent may be both transparent and trustworthy to a human user, and furthermore that behavioural adaptation as part of an apology is a viable approach to correcting undesirable behaviours. The Act-Assess-Apologise framework could potentially address both the practical and social needs of a human user: to recognise and make reparations for prior undesirable behaviour, and to adjust for the future. Applied to a dual-auxiliary impact minimisation problem, the apologetic agent determined when an apology was warranted and provided it with near-perfect accuracy in several non-trivial configurations. The agent subsequently demonstrated successful behaviour alignment, in some scenarios completely avoiding the impacts described by these objectives.
KW - AI apology
KW - AI safety
KW - Human alignment
KW - Impact minimisation
KW - Multi-objective reinforcement learning
UR - https://www.scopus.com/pages/publications/85153283220
U2 - 10.1007/s00521-023-08586-x
DO - 10.1007/s00521-023-08586-x
M3 - Article
AN - SCOPUS:85153283220
SN - 0941-0643
VL - 35
SP - 16917
EP - 16930
JO - Neural Computing and Applications
JF - Neural Computing and Applications
IS - 23
ER -