model retraining without specific data points, several 
ideas  have  emerged  on  how  to  legally  demonstrate 
that the information has been removed from the DNN. 
The strongest guarantees come from the mathematical field of differential privacy (DP). These techniques apply and track noise during the training process. This noise both restricts the amount of information the model learns from any single point and acts as a regularization term, allowing the model to generalize better to new data. Because the DP process is applied to every training point, the model often suffers a significant loss in performance, which can render it no longer useful.
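As a rough, hedged sketch of this mechanism (illustrative only, and not the accounting or API of any specific DP library), a single DP-SGD step clips each example's gradient and adds calibrated Gaussian noise before updating the weights; the model, batch, and hyperparameters below are placeholders:

# Illustrative DP-SGD step: per-example gradient clipping plus Gaussian noise.
# Model, data, and hyperparameters are placeholders, not values from this paper.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(20, 2)                      # toy classifier
loss_fn = nn.CrossEntropyLoss()
clip_norm, noise_multiplier, lr = 1.0, 1.1, 0.05

x = torch.randn(8, 20)                        # one mini-batch of 8 examples
y = torch.randint(0, 2, (8,))

# Accumulate clipped per-example gradients (looping per example for clarity).
summed = [torch.zeros_like(p) for p in model.parameters()]
for xi, yi in zip(x, y):
    loss = loss_fn(model(xi.unsqueeze(0)), yi.unsqueeze(0))
    grads = torch.autograd.grad(loss, list(model.parameters()))
    total_norm = torch.sqrt(sum(g.pow(2).sum() for g in grads)).item()
    scale = min(1.0, clip_norm / (total_norm + 1e-12))  # bound this example's influence
    for s, g in zip(summed, grads):
        s += g * scale

# Add noise calibrated to the clipping bound, then take an ordinary SGD step.
with torch.no_grad():
    for p, s in zip(model.parameters(), summed):
        noisy = s + noise_multiplier * clip_norm * torch.randn_like(s)
        p -= lr * noisy / len(x)

The clipping bound caps any single example's influence on each update, which is what limits how much the model can memorize about that point.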
Lui (Lui and Tsaftaris, 2020) introduces the concept of applying statistical distributional tests after model training to determine whether a model has forgotten information related to a set of points. The approach hinges on having enough new data to train another model to a similar task accuracy, after which similarity measures between the two models' output distributions can be computed. Such a test would be used by an independent auditor to assess compliance. While effective, the test more directly assesses whether the data was used in model training at all, rather than whether it has since been forgotten.
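As a hedged illustration of this style of audit, the sketch below runs a generic two-sample test on output-confidence distributions; it is a simplified stand-in for, not a reproduction of, the test in (Lui and Tsaftaris, 2020), and the models and helper are placeholders:

# Generic two-sample distributional test of the kind an independent auditor
# might run: compare the target model's output confidences on the queried
# points with those of a reference model trained without them.
import numpy as np
from scipy.stats import ks_2samp

def confidence_scores(model, points):
    """Placeholder: top-class softmax confidence the model assigns to each point."""
    return np.asarray([model(p).max() for p in points])

def appears_forgotten(target_model, reference_model, query_points, alpha=0.05):
    target_conf = confidence_scores(target_model, query_points)
    reference_conf = confidence_scores(reference_model, query_points)
    stat, p_value = ks_2samp(target_conf, reference_conf)
    # If the two confidence distributions are statistically indistinguishable,
    # the target model behaves as if it never saw the queried points.
    return p_value > alpha, stat, p_value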
Chen (Chen et al., 2020) proposes explicitly leveraging the MI attack itself to directly measure how much the privacy information about a point has been degraded, and introduces two privacy metrics that measure the difference in membership inference confidence for a target point between two models.
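A hedged sketch of this style of measurement compares an attack model's membership confidence on the target point before and after redaction; the helper and both metrics below are illustrative, not the exact definitions of Chen et al.:

# Compare an MI attack's confidence that a target point was a training member,
# before and after redaction. All models and the helper are placeholders.
def attack_confidence(attack_model, victim_model, point):
    """Placeholder: probability the attack assigns to 'point was a training member'."""
    return float(attack_model(victim_model(point)))

def confidence_degradation(attack_model, original_model, redacted_model, point):
    before = attack_confidence(attack_model, original_model, point)
    after = attack_confidence(attack_model, redacted_model, point)
    return before - after                         # absolute drop in membership confidence

def relative_degradation(attack_model, original_model, redacted_model, point):
    before = attack_confidence(attack_model, original_model, point)
    after = attack_confidence(attack_model, redacted_model, point)
    return (before - after) / max(before, 1e-12)  # drop relative to the original level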
We agree with this approach; however, they again rely on model retraining and shadow models to compute this statistic. In our work, we advance their approach
in  a  key  way  that  will  support  operational 
deployments  of  large,  distributed  DNNs.  Our 
approach leverages incremental retraining of a target 
model. It does not rely on full retraining of either the 
deployed  model  or  a  new  model  for  statistical 
comparisons. With this redaction technique, data owners can evolve a model and alter a point's attack confidence to a desired level within a ranked list of possible training points. It is also possible to make it
appear  with  high  confidence  that  the  point  was  not 
used  to  train  the  deployed  model,  when  evaluated 
against  many  other  membership  inference  attack 
models. 
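To make the ranked-list idea concrete, the minimal sketch below (placeholder confidences and a hypothetical helper, not code from this work) checks where the redacted point falls when candidate training points are ordered by membership-attack confidence:

# Where does a redacted point sit in a ranked list of candidate training
# points, ordered by membership-attack confidence? Values are placeholders.
import numpy as np

def rank_of_point(attack_confidences, target_index):
    """Rank 1 = the point the attack is most confident was used in training."""
    order = np.argsort(-np.asarray(attack_confidences))   # descending confidence
    return int(np.where(order == target_index)[0][0]) + 1

# After redaction, the target point should sit low in the ranking, i.e. look
# no more 'member-like' than typical non-members.
confidences = [0.91, 0.35, 0.88, 0.12, 0.47]
print(rank_of_point(confidences, target_index=3))          # -> 5 (least member-like)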
Note that we do not use the MI attack models other than as a compliance mechanism. That is, we do not use the loss or any other information from the attack models during our retraining optimization. The advantage of this is that the redactions are less dependent upon the specific attack model and more resilient to other types of attacks.
Also, we train evaluation attack models only to determine the effectiveness of the Class Clown technique. Our results show that reducing attack confidence against one attack model reduces confidence across all of the attack models we evaluated. However, such a step is not necessary in an operational setting.
2  CLASS CLOWN: SURGICAL DATA EXCISION THROUGH LABEL POISONING DURING INCREMENTAL RETRAINING
It is an open question exactly how deep neural networks store and leak private information about specific data points. However, all of the attacks rely upon observing shifts in the output that result from known shifts in the input. For the vast majority of attacks, this means exploiting shifts in the output confidence vectors. The easiest case to attack is one in which there is no overlap between the outputs on training data and the outputs on new data, for instance a highly overfit model, as the two can be readily differentiated. Even Shokri's original paper indicated that restricting the model output to the predicted label is not enough to prevent this attack: mislabelled predictions, and the differences among these misclassifications, can be exploited as well. This is highlighted in a recent label-only attack (Choquette Choo et al., 2020).
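As a hedged illustration of the simplest confidence-shift attack described above (a generic thresholding scheme, not a reimplementation of any cited attack), an attacker who can query the model picks the confidence threshold that best separates known members from non-members; the helper and model below are placeholders:

# Minimal confidence-thresholding membership inference attack. A highly
# overfit model assigns systematically higher top-class confidence to its
# training points, so a single threshold can separate them from new points.
import numpy as np

def top_confidence(model, points):
    """Placeholder: top-class softmax confidence for each queried point."""
    return np.asarray([model(p).max() for p in points])

def fit_threshold(member_conf, nonmember_conf):
    """Pick the threshold that best separates known members from non-members."""
    candidates = np.sort(np.concatenate([member_conf, nonmember_conf]))
    balanced_acc = [((member_conf >= t).mean() + (nonmember_conf < t).mean()) / 2
                    for t in candidates]
    return candidates[int(np.argmax(balanced_acc))]

def predict_membership(model, points, threshold):
    # Points whose confidence meets the threshold are flagged as training members.
    return top_confidence(model, points) >= threshold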
These shifts in output are the result of many aggregated computations across the network's layers that ultimately define the class decision boundaries in the embedded loss space. In the vicinity of a point, however, there is a relationship between the model's confidence and the point's distance to its decision boundary. We leverage this relationship and seek to alter the embedded loss space of the target model only in the vicinity of the points that we need to redact. By altering the point's local decision boundary, we can shift the target model's confidence outputs, thereby tricking any membership inference attack model into believing that the point was not used in training. We use a mechanism that does so gently, without substantially affecting the model's accuracy or network weights.
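The relationship is easiest to see in a toy linear/logistic setting (an illustration of the intuition only, not the geometry of a deep network): a point's predicted confidence is a monotone function of its signed distance to the decision boundary, so nudging the boundary near the point shifts the confidence the model reports for it.

# Toy linear/logistic example: confidence is a monotone function of the
# signed distance to the decision boundary w.x + b = 0 (placeholder values).
import numpy as np

w, b = np.array([2.0, -1.0]), 0.5

def signed_distance(x):
    return (w @ x + b) / np.linalg.norm(w)

def confidence(x):
    return 1.0 / (1.0 + np.exp(-(w @ x + b)))   # sigmoid of the margin

for x in [np.array([1.0, 0.0]), np.array([0.1, 0.4]), np.array([-1.0, 0.0])]:
    print(f"distance {signed_distance(x):+.2f} -> confidence {confidence(x):.2f}")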
We achieve this in an incremental manner, starting from the existing deployed (target) model. For simplicity, we hone the technique in the context of a single point and then extend it to multiple redaction points via an arrival queue representing irregular data redaction requests.
2.1  Class Label Poisoning 
In our approach, we intentionally poison the label of the point to be redacted during the ensuing retraining epochs, as sketched below. In our experiments, we randomly choose the poisoned label once and then use that same label in every epoch.
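A minimal sketch of this step is given below, assuming a PyTorch-style model and data loader; the helper names, schedule, and hyperparameters are placeholders for illustration, not the paper's exact training configuration:

# Hedged sketch of label poisoning during incremental retraining: the redaction
# point receives one randomly chosen incorrect label, and that poisoned pair is
# included in every subsequent fine-tuning epoch starting from the deployed
# model's weights. Names and hyperparameters are illustrative placeholders.
import random
import torch
import torch.nn as nn

def poison_label(true_label, num_classes, seed=0):
    """Pick one incorrect label at random; it is reused unchanged in every epoch."""
    rng = random.Random(seed)
    return rng.choice([c for c in range(num_classes) if c != true_label])

def redact_point(model, redact_x, redact_y, retain_loader, num_classes,
                 epochs=5, lr=1e-4):
    poisoned_y = poison_label(redact_y, num_classes)
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):                      # incremental epochs, not training from scratch
        for batch_x, batch_y in retain_loader:   # ordinary (non-redacted) training data
            # Append the redaction point with its poisoned label to the batch.
            x = torch.cat([batch_x, redact_x.unsqueeze(0)])
            y = torch.cat([batch_y, torch.tensor([poisoned_y])])
            optimizer.zero_grad()
            loss_fn(model(x), y).backward()
            optimizer.step()
    return model

Mixing the poisoned pair into otherwise ordinary incremental batches is what keeps the change gentle: the retained data anchors the decision surface globally, while the poisoned label perturbs it only near the redaction point.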
Intuitively, this mislabelling decreases the