Calculating summary statistics, such as a *count* of data points, a *sum*, or an *average* of values within a dataset,
can't be that bad from a privacy perspective, right?
These statistics are calculated over a whole dataset at the same time and not on a specific individual,
so what could go wrong?
In today’s blog post I am going to show you how summary statistics can disclose individual privacy and how the risk can be
mitigated with a mathematical framework called Differential Privacy (DP) [1] that formulates privacy guarantees for data analysis methods.
For a start, imagine you have a medical dataset that contains 100 data points about diabetes and non-diabetes patients.
Since such medical data is highly sensitive, you need to be sure that none of the individuals will suffer any privacy breach
while you perform any kind of analysis on the data. Bearing that in mind, you want to calculate the percentage of each group.
Yet even this simple analysis is not as harmless as it seems when it comes to privacy.

Suppose you start with a patient dataset $D_1$ that contains 100 data points about healthy patients. Then, you replace exactly one of the data points with a data point $x$ of an individual who has diabetes. Let’s call this new dataset $D_2$. $D_1$ and $D_2$ are identical for the 99 other data points, but differ in that one data point $x$ that you’ve replaced. When you calculate the percentage of diabetes patients on both datasets now, you will observe the value 0.0 on $D_1$ and 0.01 on $D_2$.
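For concreteness, here is the percentage query run on both datasets in plain Python (encoding each record as 1 for diabetes and 0 for healthy is my own illustrative choice):

```python
# Encode each record as 1 (diabetes) or 0 (healthy); the encoding is
# just for illustration.
D1 = [0] * 100         # 100 healthy patients
D2 = [1] + [0] * 99    # the same dataset with one record replaced by x

pct1 = sum(D1) / len(D1)   # 0.0
pct2 = sum(D2) / len(D2)   # 0.01
# The gap between the two results reveals x's diagnosis exactly.
```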

This difference between the results over the two datasets allows you to learn that individual $x$ has diabetes. Once somebody knows the original dataset and the new one, she can reliably infer the health status of the added or exchanged individual. Hence, your analysis method cannot be considered privacy-preserving. If you want to read a more thorough explanation of this intuition and notation, you could check out Chapter 3 in my Master Thesis.

This is where DP comes into play. It addresses the goal of learning nothing about an individual while learning useful information about a whole population or dataset. Loosely formulated, DP expresses the following:

Given two datasets that differ in exactly one data point: In order to preserve privacy, the probability for any outcome of your analysis should be roughly the same over both datasets.

Obviously, the percentage calculation from the previous example does not meet this requirement: if $x$ is a diabetes patient, the probability for the result 0.01 is 100% on $D_2$ but 0% on $D_1$ (because $D_1$ yields the result 0.00 with 100% probability).

This suggests that deterministic analysis methods (i.e., methods that always yield the same result on the same data) cannot achieve DP. Instead, what we need to do to protect the individual’s privacy is to add a controlled amount of probabilistic noise to the results of our analysis. That is, we need a probabilistic analysis. This noise, which can be drawn from a mathematical distribution such as the Laplace distribution, can mask the difference between the results on both datasets.
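As a minimal sketch of this idea, here is the percentage query released through the Laplace mechanism. The function name `noisy_fraction`, the choice $\epsilon = 0.5$, and the seed are my own illustrative assumptions:

```python
import numpy as np

def noisy_fraction(records, epsilon, rng):
    """Release the fraction of 1-records with Laplace noise.

    Replacing one record changes the count by at most 1, so the
    fraction changes by at most 1/n; scaling the noise by this
    sensitivity divided by epsilon is the classic Laplace mechanism.
    """
    sensitivity = 1.0 / len(records)
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return sum(records) / len(records) + noise

rng = np.random.default_rng(42)
D1 = [0] * 100          # all healthy
D2 = [1] + [0] * 99     # one record replaced by diabetes patient x
r1 = noisy_fraction(D1, epsilon=0.5, rng=rng)
r2 = noisy_fraction(D2, epsilon=0.5, rng=rng)
# r1 and r2 are now noisy, so observing a single result no longer
# pins down whether x is in the dataset.
```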

What we see is that, depending on the scale of the added noise, the areas below the graphs overlap more. This means that the distributions of results on $D_1$ and $D_2$ become more similar, and learning private information about $x$ becomes more difficult.

This intuition leads to the following definition of DP:

A randomized algorithm $\mathcal{K}$ with domain $\mathbb{N}^{| \mathcal{X}| }$ gives $\epsilon$-DP, if for all neighboring databases $ D_1, D_2 \in \mathbb{N}^{| \mathcal{X}| }$ and all $ S \subseteq Image(\mathcal{K}) $

**Definition 1**: $\Pr[\mathcal{K}(D_1)\in S] \leq e^\epsilon \cdot \Pr[\mathcal{K}(D_2)\in S] $

To understand the definition, let’s look into its different parts. The randomized algorithm in our example would be the noisy percentage function. Neighboring databases refer to databases that differ in only one data point, in our example in $x$.

The definition states that the probability to obtain any specific analysis result on $D_1$ should be roughly the same as the probability to obtain that same result on $D_2$. The *roughly the same* is expressed by the factor $e^\epsilon$. Alternative formulations refer to $1+\epsilon$, which is a bit more intuitive as it explicitly states that both analysis results should be similar within a factor close to one. However, $e^\epsilon$ has nicer mathematical properties when calculating, for example, with the Laplace distribution.
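As a quick numeric sanity check (not a proof), the snippet below verifies that a Laplace mechanism calibrated for the percentage query keeps the ratio of its output densities on $D_1$ and $D_2$ within $e^\epsilon$. The parameter values are my own choices:

```python
import numpy as np

eps = 0.5
sensitivity = 1.0 / 100       # replacing one record moves the fraction by at most 0.01
scale = sensitivity / eps     # Laplace scale of the mechanism

def laplace_pdf(t, mu):
    # density of the Laplace distribution centered at mu
    return np.exp(-np.abs(t - mu) / scale) / (2 * scale)

# Output densities of the mechanism on D1 (true value 0.0) and D2 (0.01)
ts = np.linspace(-0.1, 0.1, 2001)
ratio = laplace_pdf(ts, 0.01) / laplace_pdf(ts, 0.0)
# The maximal ratio is exp(|0.01 - 0.0| / scale) = exp(eps):
# Definition 1 holds, with equality in the worst case.
```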

There also exists a relaxation for $\epsilon$-DP, called $(\epsilon, \delta)$-DP. It is similar to the equation above, but includes a small constant $\delta$ into the formula.

**Definition 2**: $\Pr[\mathcal{K}(D_1)\in S] \leq e^\epsilon \cdot \Pr[\mathcal{K}(D_2)\in S] + \delta$

The constant $\delta$ can be interpreted as the probability with which the noisy algorithm is allowed to violate the inequality between the probabilities. Literature usually suggests choosing $\delta$ smaller than 1 divided by the number of data samples that you are working with.

Now that you know about the concept of DP, you will be ready for its application in the context of ML that I am going to present next week. If you have questions, comments or remarks, feel free to reach out to me.

[1] Dwork, Cynthia, and Aaron Roth. “The algorithmic foundations of differential privacy.” Foundations and Trends in Theoretical Computer Science 9, no. 3-4 (2014): 211-407.

Welcome to the second part of my series about attacks against machine learning (ML) privacy. If you haven’t checked out my last post about model inversion attacks yet, feel free to do so. It might also serve you as a short introduction to the topic of ML privacy in general. For today’s post, I am going to assume that you have some understanding of the most common concepts in ML.

All right, let’s get started. The privacy risk I am going to present today is called “membership inference”. We’ll first have a look at what membership inference actually means and how it can be used to violate individual privacy. Afterwards, we’ll go into some more detail, exploring how those attacks work. Then, we’ll use the TensorFlow Privacy library to conduct an attack on an ML classifier ourselves. You will see that this powerful tool makes it pretty easy. I will also briefly mention some factors that increase a model’s vulnerability to membership inference attacks, and protective measures.

**This blogpost has been updated in December 2021 for TensorFlow Privacy version 0.7.3.**

Membership inference attacks were first described by Shokri et al. [1] in 2017. Since then, a lot of research has been conducted in order to make these attacks more efficient, to measure the membership risk of a given model, and to mitigate the risks.

Let’s first take a high-level look at the topic before diving deep into the algorithmic background and attack structure.

The aim of a membership inference attack is quite straightforward: Given a trained ML model and some data point, decide whether this point was part of the model’s training sample or not. You might not see the privacy risk right away, but think of the following situation:

Imagine you are in a clinical context. There, you may have an ML model that is supposed to predict an adequate medical treatment for cancer patients. This model, naturally, needs to be trained on the data of cancer patients. Hence, given a data point, if you are able to determine that it was indeed part of the model’s training data, you will know that the corresponding patient must have cancer. As a consequence, this patient’s privacy would be disclosed.

I hope that with this example I could convince you of the importance of membership privacy. Basically, in any context where the sheer fact of being included in a sample can be privacy disclosing, membership inference attacks pose a severe risk.

Most membership inference attacks work similarly to the original attack described by Shokri et al. [1], namely by building a binary meta-classifier $f_{attack}$ that, given a model $f$ and a data point $x_i$, decides whether or not $x_i$ was part of the model’s training sample $X$.

In order to train this meta-classifier $f_{attack}$, $k$ *shadow models* $f_{shadow}^j$ are constructed.
Those models are supposed to imitate the behaviour of the original ML model $f$.
However, unlike for $f$, their training data $X’$ is known to the attacker, and with it the ground-truth membership label $y_i’$ of every data point.

By using the knowledge about the shadow models’ training data, input-output pairs $(f_{shadow}^j(x_i’), y_i’)$ for the meta-classifier can be constructed, such that it learns the task of distinguishing between members and non-members based on an ML model’s behavior on them.

Shokri et al. [1] claim that the more shadow models one uses, the more accurate the attack will be. The authors also describe several methods for creating $X’$, e.g. based on noisy real-world data that is similar to the original $X$, or based on data synthetization with the help of $f$ or statistics over $X$. Additionally, they showed that a membership inference attack can even be trained with only black-box access to the target model and without any prior knowledge about its training data.
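To make the shadow-model recipe concrete, here is a deliberately simplified sketch: synthetic confidence values stand in for the shadow models' outputs (members tend to receive higher confidence from a model that saw them during training), and a logistic-regression meta-classifier is trained on them. All distributions and numbers are illustrative assumptions, not the original attack pipeline:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Stand-ins for shadow-model confidences: on their own training data
# (members) the models are overconfident, on held-out data less so.
conf_members = rng.beta(8, 2, size=1000)
conf_nonmembers = rng.beta(4, 4, size=1000)

X_attack = np.concatenate([conf_members, conf_nonmembers]).reshape(-1, 1)
y_attack = np.concatenate([np.ones(1000), np.zeros(1000)])  # 1 = member

# The binary meta-classifier f_attack
f_attack = LogisticRegression().fit(X_attack, y_attack)
accuracy = f_attack.score(X_attack, y_attack)  # well above the 0.5 chance level
```

In the real attack, the feature vector per point would be the full prediction vector of a shadow model rather than a single synthetic confidence score.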

There are several tools for implementing membership inference attacks. The two that I am most familiar with are the IBM-ART framework, which I used in my last blogpost to implement model inversion attacks, and TensorFlow Privacy’s Membership Inference. For my purposes (mainly trying to compare privacy between different models), the TensorFlow version has so far proven more useful, since the attacks were more successful. Therefore, in the following, we are going to take a look at the implementation of membership inference attacks with TensorFlow Privacy. Similar to last time, I’ve uploaded a notebook for you containing my entire code. The code was updated in December 2021 for Python 3.7, TensorFlow 2.7, and TensorFlow Privacy 0.7.3.

Internally, TensorFlow Privacy’s Membership Inference attack differs slightly from the original attack described by Shokri et al. [1]. Instead of training several shadow models, it relies on the observation by Salem et al. [5] that using the original model’s predictions on the target points is sufficient to deduce their membership. Thereby, no training of any shadow model that approximates the original model’s behavior is required, which makes the attack more efficient. In this setting, the original model under attack in a sense serves as its own “shadow model” that approximates its own behavior perfectly.

In my opinion, the usability of TensorFlow Privacy’s Membership Inference attack has had its ups and downs over the last months. For a long time, TensorFlow Privacy used to work with TensorFlow version 1 only. For me, this meant a lot of hassle, continuously switching between virtual environments with TensorFlow versions 1 and 2 in order to get the most out of both. Then, by the end of 2020, TensorFlow Privacy was successfully updated to work with TensorFlow 2, and I was all excited about it. However, in my opinion, there are still some ongoing problems if you want to include the membership inference attacks in longer-lasting projects: for example, if I am not mistaken, the interface of TensorFlow Privacy’s membership inference has been updated and changed completely WITHOUT a version number increase so far (it stayed 0.5.1 all along). So, don’t be confused if your package 0.5.1 has entirely different code than the one that you find in the online repo, or if you can’t get your old code to run. In order to always stay up to date, the helpful community suggested using the following

```
pip install -U git+https://github.com/tensorflow/privacy
```

and it works.

Update December 2021: Things seem to be getting more and more stable. After a major refactoring in a previous version, the interface in version 0.7.3 (the one that I am referring to in the updated version of this blogpost) has not changed and can be used as described here.

When evaluating membership inference risks, I prefer to work with the CIFAR10 dataset instead of MNIST because, in my experience, the membership privacy risk of simple models trained on MNIST is usually already quite low. Therefore, one barely sees any changes when trying to mitigate membership privacy risks by implementing countermeasures.

For those of you who are not familiar with the CIFAR10 dataset: it’s another dataset that you can directly load through keras:

```
train, test = tf.keras.datasets.cifar10.load_data()
```

It consists of 60000 32x32 colour images in 10 classes, with 6000 images per class (50000 training images and 10000 test images). The classes are airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck.
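Here is a sketch of the preprocessing I apply before training: pixels scaled to $[0, 1]$ and the label arrays flattened. To keep it self-contained, small random placeholder arrays with CIFAR10’s shapes stand in for the real `load_data()` result:

```python
import numpy as np

# Placeholder arrays with CIFAR10's shapes; in the notebook they come
# from tf.keras.datasets.cifar10.load_data() instead.
rng = np.random.default_rng(3)
images = rng.integers(0, 256, size=(1000, 32, 32, 3), dtype=np.uint8)
labels = rng.integers(0, 10, size=(1000, 1))

train_data = images.astype("float32") / 255.0  # scale pixels to [0, 1]
train_labels = labels.squeeze()                # (n, 1) -> (n,) integer labels
```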

With its three color channels and details, this dataset is more complex than MNIST. In order to see some privacy risks, we will use a fairly simple architecture here without a lot of regularization (which might mitigate membership privacy risks, see below).

```
from tensorflow.keras.layers import Input, Conv2D, MaxPooling2D, Flatten, Dense
from tensorflow.keras.models import Model

def make_simple_model():
    """Define a Keras model without much regularization.
    Such a model is prone to overfitting."""
    shape = (32, 32, 3)
    i = Input(shape=shape)
    x = Conv2D(32, (3, 3), activation='relu')(i)
    x = MaxPooling2D()(x)
    x = Conv2D(64, (3, 3), activation='relu')(x)
    x = MaxPooling2D()(x)
    x = Conv2D(64, (3, 3), activation='relu')(x)
    x = MaxPooling2D()(x)
    x = Flatten()(x)
    x = Dense(128, activation='relu')(x)
    x = Dense(10)(x)
    model = Model(i, x)
    return model
```

Based on this architecture, we can build our model.

```
model = make_simple_model()

# specify parameters
optimizer = tf.keras.optimizers.Adam()
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

# compile and fit the model
model.compile(optimizer=optimizer, loss=loss, metrics=['accuracy'])
history = model.fit(train_data, train_labels,
                    validation_data=(test_data, test_labels),
                    batch_size=128,
                    epochs=10)
```

In order to use TensorFlow Privacy’s membership inference attack, we need to import:

```
import tensorflow_privacy.privacy.privacy_tests.membership_inference_attack.membership_inference_attack as mia
from tensorflow_privacy.privacy.privacy_tests.membership_inference_attack.data_structures import AttackInputData
from tensorflow_privacy.privacy.privacy_tests.membership_inference_attack.data_structures import SlicingSpec
from tensorflow_privacy.privacy.privacy_tests.membership_inference_attack.data_structures import AttackType
```

The first line imports the membership inference attack itself. The following lines import data structures required in the course of the attack.

The library offers so many different functionalities that - for non-experts - it might be a little difficult to identify the right setting even though the README is very helpful.

However, to find some useful information (e.g., what data you must provide and which is optional, or what type this data should have), one needs to dive deep into the code. Therefore, I will give you a summary of my findings here.

**AttackInputData** is used to specify what information $f_{attack}$ will receive. As we have clarified above, $f_{attack}$ is the binary classifier trained to predict, based on a target model $f$’s output on a data point, whether this data point was part of the training data or not. And exactly this output of our model $f$ can be provided through the `AttackInputData` data structure. You can specify

- `train and test loss`
- `train and test labels` (as integer arrays)
- `train and test entropy`
- `train and test logits`
- or `train and test probabilities`

over all given classes. As the probabilities are derived from the logits, they can only be specified if no logits are provided.

Either labels, logits, losses, or entropy should be set to be able to perform the attack. I have never worked with the entropy option so far, but the other values can be obtained from your trained model with the following code:

```
from tensorflow.keras.utils import to_categorical

print('Predict on train...')
logits_train = model.predict(train_data)
print('Predict on test...')
logits_test = model.predict(test_data)
print('Apply softmax to get probabilities from logits...')
prob_train = tf.nn.softmax(logits_train, axis=-1)
prob_test = tf.nn.softmax(logits_test, axis=-1)
print('Compute losses...')
cce = tf.keras.backend.categorical_crossentropy
constant = tf.keras.backend.constant
y_train_onehot = to_categorical(train_labels)
y_test_onehot = to_categorical(test_labels)
loss_train = cce(constant(y_train_onehot), constant(prob_train), from_logits=False).numpy()
loss_test = cce(constant(y_test_onehot), constant(prob_test), from_logits=False).numpy()
```

**SlicingSpec** offers you a possibility to slice your dataset. This makes sense if you want to determine the success of the membership inference attack over specific data groups or classes. According to the code, you have the following options that can be set to True:

- entire_dataset: one of the slices will be the entire dataset
- by_class: one slice per class is generated
- by_percentiles: generates 10 slices for percentiles of the loss - 0-10%, 10-20%, … 90-100%
- by_classification_correctness: creates one slice for correctly classified data points, and one for misclassified data points.

**AttackType** gives you different options on how your membership inference attack should be conducted.

- LOGISTIC_REGRESSION = ‘lr’
- MULTI_LAYERED_PERCEPTRON = ‘mlp’
- RANDOM_FOREST = ‘rf’
- K_NEAREST_NEIGHBORS = ‘knn’
- THRESHOLD_ATTACK = ‘threshold’
- THRESHOLD_ENTROPY_ATTACK = ‘threshold-entropy’

The first four options require training a shadow model; the last two don’t.

Again, as explanations in the library are provided mainly within the code, I’d like to summarize them here:

- In *threshold attacks*, for a given threshold value, the function counts how many training and how many testing samples have membership probabilities larger than this threshold. The value usually lies between 0.5 (random guessing between the two options member and non-member) and 1 (100% certainty of membership). Furthermore, precision and recall values are computed. Based on these values, an interpretable ROC curve can be produced, representing how accurately the attacker can predict whether or not a data point was used in the training data. This idea mainly relies on [2], as stated in TensorFlow Privacy.
- For *trained attacks*, the attack flow involves training a shadow model as described above. A comment in the library’s code states that it is currently not possible to calculate membership privacy for all samples, as some are used for training the attacker model. Here, the results display an *attacker advantage* based on the idea in [3].
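To put numbers on the threshold idea, the sketch below runs a loss-based threshold attack on synthetic per-example losses (members receive lower loss because the model fits them) and derives the AUC and the attacker advantage from the ROC curve. The loss distributions are my own stand-ins, not output of TensorFlow Privacy:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

rng = np.random.default_rng(1)
loss_train = rng.exponential(scale=0.5, size=5000)  # members: low loss
loss_test = rng.exponential(scale=1.5, size=5000)   # non-members: higher loss

# Higher membership score = more likely a member, so negate the loss.
scores = np.concatenate([-loss_train, -loss_test])
members = np.concatenate([np.ones(5000), np.zeros(5000)])

auc = roc_auc_score(members, scores)   # ~0.75 for these distributions
fpr, tpr, _ = roc_curve(members, scores)
advantage = (tpr - fpr).max()          # best threshold's TPR minus FPR
```

Sweeping all thresholds at once via the ROC curve is what makes the resulting AUC an interpretable summary of the attack's strength.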

Concerning the interpretability of the results, a library code comment states the following:

```
# Membership score is some measure of confidence of this attacker that
# a particular sample is a member of the training set.
#
# This is NOT necessarily probability. The nature of this score depends on
# the type of attacker. Scores from different attacker types are not directly
# comparable, but can be compared in relative terms (e.g. considering order
# imposed by this measure).
#
# For a perfect attacker,
# all training samples will have higher scores than test samples.
```

Now that we know what to feed into the attack, we can run it on our model. I decided to use a simple threshold attack and a trained attack based on logistic regression for the example. Furthermore, I decided to do some slicing:

```
attack_input = AttackInputData(
    logits_train = logits_train,
    logits_test = logits_test,
    loss_train = loss_train,
    loss_test = loss_test,
    labels_train = train_labels,
    labels_test = test_labels
)

slicing_spec = SlicingSpec(
    entire_dataset = True,
    by_class = True,
    by_percentiles = False,
    by_classification_correctness = True)

attack_types = [
    AttackType.THRESHOLD_ATTACK,
    AttackType.LOGISTIC_REGRESSION
]

attacks_result = mia.run_attacks(attack_input=attack_input,
                                 slicing_spec=slicing_spec,
                                 attack_types=attack_types)
```

The `attacks_result` object can provide us some thorough insight into the attack results by calling:

```
print(attacks_result.summary(by_slices=True))
```

This yields a listing of the most successful attacks on each slice in the form of AUC scores and attacker advantage.

```
Best-performing attacks over all slices
LOGISTIC_REGRESSION (with 2889 training and 2889 test examples) achieved an AUC of 0.69 on slice CORRECTLY_CLASSIFIED=False
LOGISTIC_REGRESSION (with 2889 training and 2889 test examples) achieved an advantage of 0.31 on slice CORRECTLY_CLASSIFIED=False
Best-performing attacks over slice: "Entire dataset"
LOGISTIC_REGRESSION (with 10000 training and 10000 test examples) achieved an AUC of 0.59
LOGISTIC_REGRESSION (with 10000 training and 10000 test examples) achieved an advantage of 0.16
...
```

What is interesting to observe is that the highest advantage is achieved on misclassified instances. This is a phenomenon that can be observed often in membership inference attacks, especially when overfitting the training data as much as we do with our simple model. It results from the fact that the model classifies training instances quite well, whereas it struggles on previously unseen data. Therefore, the attacker model might learn the simple rule: “if an instance is misclassified, it is likely that the model has not seen it before”, hence, the data point was not part of the training data.
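That rule is easy to quantify: predicting “member” exactly when a sample is classified correctly yields an advantage of roughly train accuracy minus test accuracy. A toy calculation with made-up accuracies resembling an overfit model:

```python
import numpy as np

rng = np.random.default_rng(7)
# Made-up classification outcomes of an overfit model:
member_correct = rng.random(10000) < 0.95      # ~95% accuracy on training data
nonmember_correct = rng.random(10000) < 0.65   # ~65% accuracy on unseen data

# Naive attack: predict "member" iff the sample is classified correctly.
tpr = member_correct.mean()      # fraction of members flagged as members
fpr = nonmember_correct.mean()   # fraction of non-members flagged as members
advantage = tpr - fpr            # roughly 0.95 - 0.65 = 0.30
```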

In addition to the written summary, you can also plot the ROC curve of the most successful attack:

```
import tensorflow_privacy.privacy.privacy_tests.membership_inference_attack.plotting as plotting
plotting.plot_roc_curve(attacks_result.get_result_with_max_auc().roc_curve)
```

In the following weeks, I am planning to write another blogpost about the papers [2] and [3] explaining in more detail how the results can be interpreted. For the time being, you can just take the results as they are and use them to compare between classifiers trained with different parameters and methods. I strongly encourage you to use my notebook and play around with some parameters (training epochs, batch size, model architecture etc.) in order to get a feeling how those parameters might influence membership privacy. You might also want to rebuild the entire example for another dataset, such as MNIST, in order to compare magnitudes of the attack results.

There has been quite some research on factors that make membership inference attacks easier. For example, [3] and [4] deal with the question of identifying factors that influence membership inference risks in ML models. Those factors are:

- overfitting,
- classification problem complexity,
- in-class standard deviation, and
- type of ML model targeted.

For those of you interested in the topic, I suggest going through the papers linked below for a deeper understanding of why these factors might have an influence.

Apart from training models that do not overfit the training data too much, a method that helps prevent membership inference risks is Differential Privacy. I would definitely like to introduce you to this concept in one of my future blog posts.

I hope that my post was helpful for you. If you have any further questions, remarks, or suggestions, please get in touch!

[1] Shokri, Reza, Marco Stronati, Congzheng Song, and Vitaly Shmatikov. “Membership inference attacks against machine learning models.” In 2017 IEEE Symposium on Security and Privacy (SP), pp. 3-18. IEEE, 2017.

[2] Song, Liwei, and Prateek Mittal. “Systematic evaluation of privacy risks of machine learning models.” arXiv preprint arXiv:2003.10595 (2020).

[3] Yeom, Samuel, Irene Giacomelli, Matt Fredrikson, and Somesh Jha. “Privacy risk in machine learning: Analyzing the connection to overfitting.” In 2018 IEEE 31st Computer Security Foundations Symposium (CSF), pp. 268-282. IEEE, 2018.

[4] Truex, Stacey, Ling Liu, Mehmet Emre Gursoy, Wenqi Wei, and Lei Yu. “Effects of differential privacy and data skewness on membership inference vulnerability.” In 2019 First IEEE International Conference on Trust, Privacy and Security in Intelligent Systems and Applications (TPS-ISA), pp. 82-91. IEEE, 2019.

[5] Salem, Ahmed, Yang Zhang, Mathias Humbert, Pascal Berrang, Mario Fritz, and Michael Backes. “Ml-leaks: Model and data independent membership inference attacks and defenses on machine learning models.” arXiv preprint arXiv:1806.01246 (2018).

Machine learning (ML) is one of the fastest evolving fields out there, and it feels like every day there are new and very useful tools emerging. For many scenarios, creating functional and expressive ML models might be the most important reason to use such tools. However, a large number of tools offer functionalities that go beyond simply building new and more powerful models. Instead, they offer new perspectives by focusing on different aspects of ML.

One such aspect is model privacy. The general topic of data privacy has received a lot of attention in recent years, also outside the tech community, especially after the introduction of legal frameworks such as the GDPR. Yet, the topic of privacy in ML seems to be more of a niche.
This blind spot is a pity, and also risky, since breaking the privacy of ML models can be really simple. In today’s blog post I would like to show you how.
Therefore, I’ll first give a short introduction to privacy risks in ML, then I’ll present a specific attack, namely the *model inversion attack*, and finally I’ll show you how to implement model inversion attacks with the help of IBM’s Adversarial Robustness Toolbox.

Just to briefly provide an understanding of privacy in ML, let’s have a look at general ML workflows. This section only serves to give a very short and informal introduction in order to motivate the main topic of this blog post. For a formal and thorough introduction to ML, you may want to check out other resources.

Imagine you would like to train a classifier. Usually, you start with some (potentially sensitive) training data, i.e. some data features $X$ and corresponding class labels $y$. You pick an algorithm, e.g. a neural network (NN), and then you use your training data to make the model learn the mapping from $X$ to $y$. This mapping should generalize well, such that your model is also able to predict the correct labels for so far unseen data $X’$.

What is less frequently addressed is the fact that the process of turning training data into a good model is not necessarily a one-way street. Think about it: In order to learn a mapping from specific features to corresponding labels, the model needs to “remember” in its parameters some information about the data it was trained on. Otherwise, how would it come to correct conclusions about new and unseen data?

The fact that some information about the training data is stored in the model parameters might, however, cause privacy problems. This is because it enables someone with access to the ML model to deduce different kinds of information about the training data.

A very popular attack is the so-called *model inversion attack* that was first proposed by Fredrikson et al. [1] in 2015.
The attack uses a trained classifier in order to extract representations of the training data.

Fredrikson et al. use this method, among others, on a face classifier trained on black and white images of 40 different individuals’ faces. The data features $X$ in this example correspond to the individual image pixels that can take continuous values in the range of $[0,1]$. With this large number of different pixel-value combinations over an image, it is inefficient to brute-force a reconstruction over all possible images in order to identify the most likely one(s). Therefore, the authors proposed a different approach. Given $m$ different faces and $n$ pixels per image, the face recognition classifier can be expressed as a function $f: [0,1]^n \longmapsto [0,1]^m$. The output of the classifier is a vector that represents the probabilities of the image belonging to each of the $m$ classes.

The authors define a cost function $c(x)$ based on $f$ in order to perform the model inversion. Starting with a candidate solution image $x_0$, the gradients of the cost function are calculated. Then, *gradient descent* is applied, and $x_0$ is transformed iteratively in each epoch $i$ according to the gradients in order to minimize the cost function. Once a minimum is found, the transformed $x_i$ is returned as the solution of the model inversion.

Let’s have a concrete look at the algorithm specified in the Fredrikson paper with the parameters being: $label$: label of the class we want to invert, $\alpha$: number of iterations to calculate, $\beta$: a form of patience (if the cost does not decrease within this number of iterations, the algorithm stops), $\gamma$: minimum cost threshold, $\lambda$: learning rate for the gradient descent, and $AUXTERM$: a function that uses any available auxiliary information to inform the cost function. (In the case of simple inversion, there exists no auxiliary information, i.e. $AUXTERM=0$ for all $x$.)

As we can see, in the algorithm the cost function is defined based on $f$’s prediction on $x$. $x_0$ is initialized (e.g. here as a zero vector); then, for the given number of iterations, the gradient descent step is performed and the new cost is calculated. There are two stopping conditions that interrupt the algorithm: (1) the cost has not improved during the last $\beta$ epochs, or (2) the cost is smaller than the predefined threshold $\gamma$.
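As a self-contained illustration of this loop (not the paper’s implementation), here is a toy version in NumPy: a hand-built softmax classifier is inverted by gradient descent on the cost $c(x) = 1 - f(x)_{label}$ with $AUXTERM = 0$. The model, the parameter values, and the function names are all my own illustrative choices:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def invert(W, b, label, alpha=500, beta=25, gamma=1e-4, lam=1.0):
    """Invert one class of a softmax classifier f(x) = softmax(Wx + b).

    Follows the structure of the Fredrikson loop: gradient descent on
    c(x) = 1 - f(x)[label], stopping after alpha iterations, when the
    cost has not improved for beta iterations, or when it drops below gamma.
    """
    x = np.zeros(W.shape[1])             # x_0: start from a zero vector
    best_x, best_cost, stalled = x.copy(), np.inf, 0
    for _ in range(alpha):
        p = softmax(W @ x + b)
        cost = 1.0 - p[label]
        if cost < best_cost:
            best_x, best_cost, stalled = x.copy(), cost, 0
        else:
            stalled += 1
        if stalled >= beta or cost < gamma:
            break
        # analytic gradient of c(x) for the softmax model
        one_hot = (np.arange(len(p)) == label).astype(float)
        grad = -W.T @ (p[label] * (one_hot - p))
        x = np.clip(x - lam * grad, 0.0, 1.0)   # keep "pixels" in [0, 1]
    return best_x, best_cost

# Toy 2-class model over 4 "pixels": class 0 prefers bright pixels.
W = np.array([[1.0, 1.0, 1.0, 1.0],
              [-1.0, -1.0, -1.0, -1.0]])
x_rec, cost = invert(W, b=np.zeros(2), label=0)
```

On this toy model, the recovered $x$ converges to the all-ones “image”, i.e. the prototype of class 0, mirroring how the attack recovers an average representation per class.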

At the end of the algorithm, the minimal costs and the corresponding $x_i$ are returned. As, in the case of Fredrikson, each individual corresponds to a separate class label, the model inversion can be used in order to reconstruct concrete faces, such as the following:

Now let’s see how to put this theory into practice.

As stated before, nowadays, there exist several great programming libraries for machine learning and the IBM Adversarial Robustness Toolbox (IBM-ART) is definitely one of them.

Unlike what the name suggests, IBM-ART is by no means limited to functionality concerning adversarial robustness; it also contains methods for data poisoning, model extraction, and model privacy.

As most of my research is centred around model privacy, I was very keen on trying out the broad range of functionalities offered for the latter one. Next to membership inference attacks, and attribute inference attacks, the framework also offers an implementation of model inversion attacks from the Fredrikson paper.

IBM-ART offers a broad range of example notebooks to illustrate different functionalities. However, there are no examples of model inversion attacks. Therefore, I thought it might be useful to share my experience on how to use IBM-ART’s model inversion in combination with tensorflow. You can download the entire notebook with my code here.

I decided to use the MNIST dataset, which consists of 60,000 training and 10,000 test grayscale images of the digits from 0 to 9 with the corresponding labels, such as:

Let’s get started with building our own model inversion: When you are working with TensorFlow 2, the first thing to make sure is that you disable eager execution, because otherwise IBM-ART will not work. Therefore, just add the following line after your TensorFlow import:

```
import tensorflow as tf
tf.compat.v1.disable_eager_execution()
```

From IBM-ART, you mainly need to import the two following things:

```
from art.attacks.inference import model_inversion
from art.estimators.classification import KerasClassifier
```

The `KerasClassifier` is a wrapper for your `tf.keras` model so that it can be used by the attacks. Apart from this, the workflow is pretty similar to every other ML workflow.

First, we need to specify a model architecture. I chose a simple ConvNet architecture, but you are, of course, not limited to that.

```
def make_model():
    """Define a Keras model"""
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(16, 8,
                               strides=2,
                               padding='same',
                               activation='relu',
                               input_shape=(28, 28, 1)),
        tf.keras.layers.MaxPool2D(2, 1),
        tf.keras.layers.Conv2D(32, 4,
                               strides=2,
                               padding='valid',
                               activation='relu'),
        tf.keras.layers.MaxPool2D(2, 1),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(32, activation='relu'),
        tf.keras.layers.Dense(10, activation='softmax')
    ])
    return model
```

After you’ve built the model, you can compile it with parameters of your choice:

```
model = make_model()
optimizer = tf.keras.optimizers.SGD(learning_rate=0.1)
# the model's last layer already applies a softmax, so the loss receives probabilities
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False)
model.compile(optimizer=optimizer, loss=loss, metrics=['accuracy'])
```

Then, before training on the MNIST training data and labels that I stored in the variables `train_data` and `train_labels` after preprocessing, you need to put the model into the wrapper:

```
classifier = KerasClassifier(model=model, clip_values=(0, 1), use_logits=False)
classifier.fit(train_data, train_labels,
               batch_size=264, nb_epochs=10)
```

Once the model is trained, we can start attacking it.

```
my_attack = model_inversion.MIFace(classifier)
```

The model inversion attack in IBM-ART lets you specify two arrays, $x$ and $y$. $y$ contains the class labels for the classes to be attacked. $x$ contains the initial input to the classifier under attack for each class label. (This corresponds to $x_0$ in our algorithm above.) When $x$ is not specified, a zero array is used as the initial input for the inversion.

I ran the experiment with different numbers of training epochs and this is the result:

*Note:* The data returned by the model inversion attack is essentially an average representation of the data that belongs to the specific classes. In the presented setting, it does not allow for an inversion of individual training data points. In the Fredrikson example, however, every individual within the face classifier represents their own class. Therefore, the attack can be used to retrieve information about individuals and break their privacy.

It turns out that, especially when the classifier is trained for only one epoch, many of the inverted digit representations are quite recognizable. I hope that this example was useful for you to understand how easily your ML models can be attacked nowadays, and how easily you can implement privacy attacks with the help of the right tools.

If you have any questions, comments, suggestions, or corrections, please feel free to get in touch.

[1] Fredrikson, Matt, Somesh Jha, and Thomas Ristenpart. “Model inversion attacks that exploit confidence information and basic countermeasures.” In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, pp. 1322-1333. 2015.

[2] Nicolae, Maria-Irina, Mathieu Sinn, Minh Ngoc Tran, Beat Buesser, Ambrish Rawat, Martin Wistuba, Valentina Zantedeschi et al. “Adversarial Robustness Toolbox v1. 0.0.” arXiv preprint arXiv:1807.01069(2018).
