Practical Hash-based Anonymity for MAC Addresses

Junade Ali

1 a

and Vladimir Dyo

2 b

Cloudﬂare Inc, London, U.K.

University of Bedfordshire, Luton, U.K.

Keywords:

Intelligent Transportation Systems, Privacy, Tracking, Anonymization, k-anonymity.

Abstract:

Given that a MAC address can uniquely identify a person or a vehicle, continuous tracking over a large

geographical scale has raised serious privacy concerns amongst governments and the general public. Prior

work has demonstrated that simple hash-based approaches to anonymization can be easily inverted due to the

small search space of MAC addresses. In particular, it is possible to represent the entire allocated MAC address

space in 39 bits and that frequency-based attacks allow for 50% of MAC addresses to be enumerated in 31

bits. We present a practical approach to MAC address anonymization using both computationally expensive

hash functions and truncating the resulting hashes to allow for k-anonymity. We provide an expression for

computing the percentage of expected collisions, demonstrating that for digests of 24 bits it is possible to store

up to 168,617 MAC addresses with the rate of collisions less than 1%. We experimentally demonstrate that

a rate of collision of 1% or less can be achieved by storing data sets of 100 MAC addresses in 13 bits, 1, 000

MAC addresses in 17 bits and 10,000 MAC addresses in 20 bits.

1 INTRODUCTION

The ubiquitous use of Wi-Fi and Bluetooth among

drivers, passengers and pedestrians has enabled moni-

toring transportation ﬂows on a massive scale through

deployments of wireless scanners along the roads that

passively scan, track and analyze nearby wireless

devices. In particular, wireless scanner equipment

has been deployed for monitoring pedestrian and cy-

clist journey time (Abedi et al., 2015), monitoring

road conditions in real-time (Abberley et al., 2017)

and passenger movements throughout train stations

(Fearn, 2019). Given the fact that a MAC address can

uniquely identify a person or a vehicle, continuous

tracking over a large geographical scale has raised se-

rious privacy concerns amongst governments and the

general public (Minch, 2015; Fearn, 2019).

The current practice adopted by many operators

and dataset providers is to anonymize MAC addresses

using hashing algorithms such as SHA256. However

(Marx et al., 2018) notes that due to a small pre-image

space for MAC addresses, all SHA256 hashed MAC

addresses could be recovered in 13 minutes 22 sec-

onds. Despite MAC addresses containing a 48-bit

search space, with only 0.1% of OUI manufacturer

https://orcid.org/0000-0002-0180-070X

https://orcid.org/0000-0003-4747-0870

preﬁxes for MAC addresses being allocated (Fux-

jaeger et al., 2016) it is possible to represent all MAC

addresses in a digest of 39 bits. Furthermore, due

to the unequal usage of various OUI preﬁxes (Demir

et al., 2014), it is possible to represent 50% coverage

in 31 bits, 90% coverage in 33 bits and 99% cover-

age in 34 bits respectively. Hence, (Fuxjaeger et al.,

2016) proposes simply removing the OUI manufac-

turer preﬁx from MAC address before storage. How-

ever, (Martin et al., 2016) ﬁnds that contiguous ad-

dress blocks are allocated to speciﬁc devices mean-

ing the residual NIC sufﬁx of the MAC address can

similarly be used for identiﬁcation attacks. (Demir

et al., 2017) contend that it is difﬁcult to control the

anonymity set size when using approximations of the

Birthday Paradox. As shown in our analysis, through

reformulating a Birthday Paradox in (Demir et al.,

2017), we ﬁnd that the probability of there being at

least one collision exceeds 0.5 when the number of

MAC addresses in the dataset exceeds 4,822 and the

hash is stored in a digest of 24 bits or less.

The evaluation of MAC schemes has been mostly

limited to analysing the ”at least one collision” se-

mantics, representing an all-or-nothing approach. In

this paper we argue that this metric alone is not sufﬁ-

cient for a decision maker to choose an anonymizing

solution as there is a large class of applications, such

as popular route analysis or building a driver’s proﬁle,

572

Ali, J. and Dyo, V.

Practical Hash-based Anonymity for MAC Addresses.

DOI: 10.5220/0009825105720579

In Proceedings of the 17th International Joint Conference on e-Business and Telecommunications (ICETE 2020) - SECRYPT, pages 572-579

ISBN: 978-989-758-446-6

where a small fraction of hash collisions is acceptable

as long as does not affect large-scale patterns. In these

situations, a decision maker is often concerned about

minimising the rate of collisions having accepted the

fact that the collisions are inevitable. In other words,

for these applications, a more practical question is

to select an anonymizing solution that provides an

upper bound on the fraction of MAC addresses that

have collisions rather than minimise the probability

of them happening at all.

In this paper we propose a practical approach to

anonymization which seeks to explore the trade-off

between the fraction of collisions and the hashing

mechanism. We ﬁrst develop an expression for cal-

culating the overall rate of collisions given the num-

ber of MAC addresses and a digest size, instead of

merely calculating the approximate probability of be-

ing at least one collision. We demonstrate that with a

database of up to 168,617 unique MAC addresses and

a digest size of up to 24 bits, the proportion of MAC

addresses that have collisions remains below 1%. The

proposed approach builds upon recent work on k-

anonymity concept (Li et al., 2019; Thomas et al.,

2019), whereby hashes are truncated such that queries

can be made to k-anonymous buckets for reducing in-

formation leak when searching for data. We experi-

mentally verify this approach by randomly selecting

MAC Addresses from a search space of 8.4 million

MAC addresses generated by a RedHat script for vir-

tual machines.

By considering an acceptable probability of ad-

dress collisions (instead of the acceptable probabil-

ity of one collision), we provide an improved model

for calculating how many bits a hash digest should be

truncated for anonymization. We demonstrate such

calculations are appropriate using empirical experi-

mentation. Furthermore, through utilising computa-

tionally expensive key derivation functions (Thomas

et al., 2019) instead of simple hash algorithms, we are

able to prevent background knowledge attacks doc-

umented in (Demir et al., 2014) and (Martin et al.,

2016) whilst requiring greater effort to identify the

candidate of pre-image MAC addresses. For a proba-

bility of collision less than 1%, we ﬁnd truncating the

hash digest to 17 bits is suitable for 1,000 MAC ad-

dresses and truncation to 20 bits is suitable for 10,000

MAC addresses. For up to 168,617 MAC addresses,

our approach provides for the number of bits in the di-

gest to be less than or equal to the number of bits ex-

posed in the (Fuxjaeger et al., 2016) approach. Unlike

(Demir et al., 2014), our approach has no dependency

on withholding a secret key and no post-collection ag-

gregation to produce a single identiﬁer.

2 RELATED WORK

Traditionally, the vehicles have been tracked using

Automatic Number Plate Recognition (ANPR) sys-

tems, which require expensive camera infrastructure

and computationally intensive image recognition sys-

tems (Blogg et al., 2010). The trafﬁc volume can also

be measured by inductive loops or magnetic sensors

built-into the road (Velisavljevic et al., 2016), how-

ever the technology is limited to counting and clas-

sifying vehicles and lacks vehicular identiﬁcation ca-

pability. The ubiquitous adoption of Bluetooth and

Wi-Fi devices among drivers, pedestrians and vehi-

cles themselves have enabled low cost tracking and

monitoring pedestrian and motor trafﬁc on a mas-

sive scale. The data has been used for pedestrian

and cyclist journey time (Abedi et al., 2015), mod-

elling road congestion alongside tracking real-time

road trafﬁc conditions (Abberley et al., 2017) and pro-

viding journey time and usage data for public trans-

port (Hidayata et al., 2018). Real world deployments

of such technology include roads in the Highways

England network, whereby digital signage indicates

journey time based on capturing Bluetooth data from

point-to-point. One engineering ﬁrm, Clearview In-

telligence, claims Bluetooth-based vehicular tracking

systems are approximately 10% of the cost of equiv-

alent ANPR solutions (Clearview Intelligence Ltd,

2020). Research for Transport Scotland found Blue-

tooth based sensing was very cost effective due to

avoiding digging costs associated with induction loop

counters (Cragg, 2013). The costs of Bluetooth data

collection are signiﬁcantly cheaper than other data

collection methods (Blogg et al., 2010), as such these

systems are likely to remain as an effective way of

measuring transportation ﬂows.

At the same time, massive deployments of such

technology have raised privacy concerns; (Minch,

2015) notes of a deployment in 2012 where a com-

pany called Renew trialled installation of 12 recycling

bins in the City of London with the capability of cap-

turing wireless MAC address information; on a single

day on the 6th July, 946,016 presences were detected,

which included 106,629 unique MAC addresses. The

company even proposed installing such sensors in re-

strooms, to allow the gender of the user to be inferred.

By August 2013, the City of London Corporation had

requested the trial be halted although the company

claimed they only captured ”extremely limited, en-

crypted, aggregated, and anonymized data.”

Similar privacy concerns have appeared more re-

cently, concerns were raised in May 2019 where

Transport for London (TfL) announced they would

collect MAC address data to understand how passen-

Practical Hash-based Anonymity for MAC Addresses

573

gers move throughout stations (Fearn, 2019). Whilst

TfL claimed the MAC addresses were hashed (with a

salt), privacy advocates argued the approach was not

suitable as there was still a need to link the same MAC

address across observations.

Although hash-based anonymization schemes

have been proposed before, its pitfalls are well docu-

mented in (Demir et al., 2014). It is shown that due to

a small address space, it is possible to reverse hashing

of a MAC address with simple brute force attacks in a

relatively small time frame which the authors identify

as a matter of minutes. The ﬁrst 24 bits of a MAC

address represent the OUI preﬁx allocated to the ven-

dor, with only a small space being allocated and dis-

parity in usage allowing for the search space to be re-

duced. The authors compare the hashing approaches

used by 15 vendors and ﬁnd whilst the majority use

simple hash based approaches, in two instances ran-

dom ”salt” values are added prior to hashing. The

authors propose a hash-chain approach whereby the

next security key is derived from a hash of the previ-

ous key, when the previous entry is discarded it is not

possible to identify the key. Unfortunately, this solu-

tion still causes there to be a key in the system, rely-

ing on the fact the key is stored in a more secure form

than the data itself - an improved solution should pro-

vide for anonymization without any such key. Such a

chaining approach also raises problems for storage in

a decentralised environment.

Such salts do not function like salts in password

hashing algorithms, in password hashing algorithms

unique salts are stored alongside the passwords and

in the event of a data breach, it means dictionaries of

passwords need to be computed for every salt value,

i.e. unique to each stored password, instead of be-

ing able to compare pre-computed hashes with every

value in the database (Ertaul et al., 2016). In this con-

text, the salt represents more of a key, some vendors

may use the same ”salt” for all captures over the pe-

riod of a day, allowing previous records to be found

for comparison over that time period. If the ”salt” is

disclosed, all records can be de-anonymized over the

period of time that key is used. (Demir et al., 2014)

concludes by proposing an improved trade-off be-

tween searching historical records and securing data,

by using an iterative key derived through insertions

as the salt of the hash function. Unfortunately this

still poses a number of challenges; whilst it provides

greater security of a storage system, the key can still

be obtained if the server computing the statistics is

compromised. Furthermore, the chained approach in-

troduces a search difﬁculty of O(n) whereas databases

can be optimised using tree structures with B-trees al-

lowing search in O(logn) time.

(Fuxjaeger et al., 2016) notes hashing algorithms

are also unsuitable: ”the claims concerning non-

reversibility have already been shown to be mislead-

ing since the actually allocated part of the 48bit MAC

address space can be reverse-mapped in a brute-force

way with low computing power.” The authors pro-

pose that the MAC address is truncated to provide

anonymity by removing the OUI (the manufacturer

preﬁx) prior to hashing. Unfortunately, this approach

does not appear to be a suitable anonymization ap-

proach; as (Demir et al., 2014) notes, only 0.1%

of preﬁxes are allocated (when excluded brute force

space can be reduced 1024x) and the OUIs are not

uniformly distributed with some manufacturer used

dramatically more than others. The experimental re-

sults in (Fuxjaeger et al., 2016) by failing to ﬁnd any

collisions in their dataset using this model.

(Demir et al., 2017) notes that limits exists to sim-

ple implementations of hashing algorithms in a vari-

ety of applications and consider the use-case of one-

to-many anonymization of MAC addresses. (Marx

et al., 2018) notes that pre-image space is bounded

for a lot of personally identiﬁable information types,

in the case of MAC addresses by the ﬁrst 24 bits be-

ing the OUI, reducing the search space to 4.1 · 10

With the ability to calculate 6 Giga MD5 hashes and

844 Mega SHA-256 hashes per second the authors are

able to recover 100% of 1 million hashes in 4 min-

utes 1 second for MD5 and 13 minutes 22 seconds for

SHA-256 respectively. This work demonstrates that

even when searching the entire OUI space, it is not

sufﬁcient to use relatively computationally inexpen-

sive hashing functions as adversaries will have signif-

icant capacity.

It is worth noting that both Android and Apple

iOS operating systems allow for devices in a disas-

sociated state to randomise the MAC addresses used

for active scans, allowing for a degree of anonymity

until they connect to an access point. However, the

results of a wide-scale study (Martin et al., 2017), in-

dicate that adoption rates of such technology are sur-

prisingly low and re-identiﬁcation attacks can be used

in 96% of Android phones. The growth in Bluetooth

hands-free equipment and headphones can further re-

duce the effectiveness of anonymization technology

as the device will be in a paired state. In road traf-

ﬁc situations, the presence of Bluetooth hands-free

devices, GPS devices and Bluetooth systems inbuilt

into vehicle provides for other devices to be used for

tracking purposes, even where mobile phones provide

a degree of anonymization.

Recently, a number of advancements have ap-

peared in the realm of identiﬁcation of breached pass-

words in an anonymized way. (Li et al., 2019) pro-

SECRYPT 2020 - 17th International Conference on Security and Cryptography

574

OUI ID

24-bits 24-bits

Figure 1: Structure of a MAC Address.

vides an overview of protocols for compromised cre-

dential checking (C3) services and a mechanism for

empirically evaluating such protocols. Such protocols

are based off the HIBP protocol whereby passwords

are bucketed based on a hash preﬁx, allowing for k-

anonymous veriﬁcation of a breached password as im-

plemented in (Ali, 2017). This bucket is then served

to the user to identify if the password is breached

or not. Other protocols utilising two-way guessing

games are also described, however do require addi-

tional two-way steps in the protocol to achieve such

communication. The optimal approach presented is

known as IDB (identiﬁer based bucketization) where

the preﬁx of the bucket used is based on a different

identiﬁer, like the hash of a username. The authors

provide formalisation of such C3 protocols, discus-

sion on threat models and empirical analysis of such

protocols. (Thomas et al., 2019) adds additional lay-

ers to such an approach, using computationally ex-

pensive hashing and private set intersection alongside

k-anonymity.

3 THREAT MODEL

The MAC address is a unique identiﬁer for Wi-Fi,

Ethernet and Bluetooth devices and is deﬁned by

IEEE 802 standard. The 24 most signiﬁcant bits of a

MAC address form an Organizationally Unique Iden-

tiﬁer (OUI) preﬁx which is registered with the IEEE.

The remaining 24 bits represent the Network Interface

Controller ID, which uniquely identiﬁes a particular

device within the OUI preﬁx, as shown in Figure 1.

The MAC address search space can crudely be

considered to be 48 bits corresponding to 2

combi-

nations. However as (Demir et al., 2014) notes, given

only 0.1% of OUI preﬁxes are allocated, the effec-

tive search space is reduced to 2

× (2

× 0.001)

combinations, which can be expressed in at most



log

× (2

× 0.001)



= 39 bits.

Furthermore, guesswork approaches allow for ei-

ther prioritisation or limiting the search space further

based on the most popular hardware manufacturers.

(Demir et al., 2014) found in an experimental dataset

that 50% coverage was obtained with 87 OUI pre-

ﬁxes, 90% coverage with 361 preﬁxes and 709 pre-

ﬁxes covering 99%. We can accordingly calculate that

50% coverage can be represented in



log

× 87



31 bits, 90% coverage in 33 bits and 99% coverage in

34 bits respectively.

This demonstrates that due to the limited en-

tropy provided by the OUI preﬁx of a MAC address,

anonymity is not possible for all records where a

MAC address is hashed with a digest size of 39 bits or

more. Experimental data from (Demir et al., 2014) ac-

cordingly demonstrates that frequency-based attacks

become possible on 50% or more of the dataset when

the hash is truncated to 31 bits or more.

This attack can also be inverted, due to the small

search space when the OUI is removed as the de-

scribed in (Fuxjaeger et al., 2016). Instead of seek-

ing to identify an individual MAC address from a set,

it also allows for data to extract devices of a spe-

ciﬁc type. This approach may be combined with the

work in (Martin et al., 2016) where contiguous ad-

dress blocks being allocated to speciﬁc devices allows

for ﬁne-grained mapping of the devices in use - al-

lowing the device type to be identiﬁed. This reaches

to the fundamental dogma of anonymization of MAC

address information; attacks may be be deductive in

nature, where seek to identify the existence of a MAC

address from a given dataset or inductive, which seek

to extract enriched data from the dataset, such as in

(Martin et al., 2016). We therefore must consider

these two threats in our anonymity model.

4 ANONYMITY MODEL

4.1 Choice of Hashing Algorithm

It is essential to ensure that when MAC addresses are

placed to anonymized buckets, they are unrelated to

other records contained into those buckets. Whilst

(Fuxjaeger et al., 2016) proposes simply removing the

OUI manufacturer preﬁx from MAC address before

storage thus reducing the search space to 24 bits, such

an approach is still vulnerable to individuation attacks

given (Martin et al., 2016) ﬁnds that contiguous ad-

dress blocks are allocated to speciﬁc devices meaning

the residual NIC sufﬁx of the MAC address can sim-

ilarly be used for identiﬁcation attacks and as (Demir

et al., 2014) notes that as only 0.1% of OUI preﬁxes

are allocated.

Fortunately, hash functions counter this issue. The

Avalanche effect means a small change in the input re-

sults in a completely different output (Ali, 2017), this

has the effect of distributing devices with the same

OUI preﬁx throughout the dataset of resulting hashes

(and contiguous NIC blocks).

(Marx et al., 2018) notes that due to the small

pre-image space of MAC addresses, it is possible

to brute force such values in a matter of minutes

Practical Hash-based Anonymity for MAC Addresses

575

when using the SHA256 algorithm. To compli-

ment our anonymity approach and deter mapping of

anonymized hashes with possible values, we elect

to use computationally expensive hashing functions

to increase the computational complexity required

for brute force attacks, as used with passwords in

(Thomas et al., 2019).

Key derivation functions like Argon2, PBKDF2,

BCrypt or SCrypt may be used with a conﬁgurable

work factor and a salt input. Such that the result-

ing digests can be used for simple comparisons of

the resulting digest, we propose that the salt is used

as a static value which acts as a secret; (Alimpia

et al., 2018) provides more information on how such

a conﬁguration can be used with BCrypt in Expen-

sive Key Setup to generate Message Authentication

Codes. Such a salt can be rotated as desired, (Demir

et al., 2014) provides an overview of upper-bound re-

tention time by some manufacturers which may be

relevant to determining key rotation policies. By us-

ing a computationally expensive key derivation func-

tion instead of a more simplistic hash, data is pro-

tected by means of both a secret and (in the event

the key is compromised) exhaustive search becomes

more difﬁcult dependant on a work factor. Deploy-

ments where a user identiﬁer is available (such as a

registration plate number or an email address) provide

for additional entropy to be added into the salt value

where applicable.

As noted in (Cragg, 2013), it is critical that hash-

ing is performed at the time of detection prior to the

MAC Address being recorded. This approach means

that there is no post-processing work required to form

a hashed identiﬁer for the MAC Addresses as the re-

sulting digest is consistent. This provides the addi-

tional beneﬁt that the work function can be set to a

high value to deter brute force attacks without the

need to compute hashes in bulk for post collection

analysis, as required by (Demir et al., 2014).

The maximum work factor of a hash function is

limited by the resources available compute the hash

and the trafﬁc rate. (Blocki et al., 2018) describes

that it is increasingly trivial for an attacker to gain

vast amounts of computational resources for crack-

ing hashes and accordingly recommends use of mem-

ory hard functions (MHFs) such as SCrypt or Ar-

gon2 to reduce the damage of ofﬂine attacks (in con-

trast to non-memory hard functions such as BCrypt or

PBKDF2). The Argon2 hash algorithm won the Pass-

word Hashing Competition (PHC) run from 2013 to

2015 (Wetzels, 2016) and is described in (Biryukov

et al., 2016). Argon2 exists in two variants, Argon2i

(optimised for resisting side-channel attacks) and Ar-

gon2d (instead optimised for resisting GPU acceler-

ated cracking attacks). Whilst side-channel attacks

are a risk for authentication systems, such threats do

not exist ordinarily for journey time monitoring sys-

tems with no user interaction. Given the limited en-

tropy of MAC Addresses poses a vastly greater risk,

we accordingly select the Argon2d algorithm.

4.2 Bucketization of Hashes

We seek to devise an approach whereby hashes are

placed into anonymous buckets, with an acceptable

upper bound on acceptable collision rate. Whilst there

is a low probability a MAC address is unique in a

given segment of a dataset, over a larger timescale or

greater geographic area there is ambiguity as to the

identity of the user as the probability of collision in-

creases.

(Demir et al., 2017) provides a detailed explana-

tion of the Birthday Paradox and notes an approxima-

tion to identify how many m messages are approxi-

mately required for there to be at least 1 collision with

p probability given n possible digests, this is shown

in Eq. (1). This indicates that with a digest of 30

bits (one less than the 31 bits needed to represent the

87 OUIs that provided 50% coverage in (Demir et al.,

2014)), an approximate 5% probability of at least one

collision is obtained after just 10, 495 unique MAC

addresses and a 50% probability after 38, 581 unique

MAC addresses.

m ≈

2n × ln(

1 − p

) (1)

We reformulate this approximation to provide the

number of n possible digests for a given number of

messages m and the probability p that one digest cor-

responds to 2 or more input values as shown in Eq (2).

The number of bits required to store n digest combi-

nations can be found as bits ≥

log

n ≈

ln(

1−p

)

(2)

For convenience, a set of pre-computed approx-

imations can be found in Table 1. The approxima-

tion demonstrates that 1000 inputs with an acceptable

probability of at least one collision being 0.05 can be

represented in 24 bits and 21 bits where the maximum

acceptable probability of at least one collision is 0.25.

It is worth noting that in instances where m >

1,312 and the desired probability remains 5%, the

number of bits required can exceed the 24 bit size

of the NIC sufﬁx of the MAC address, which can be

more problematic in situations where a small group of

manufacturers produce devices in the sample set. It is

SECRYPT 2020 - 17th International Conference on Security and Cryptography

576

also worth noting that digest sizes of 24 bits or less

still appear to be suitable when the maximum desired

probability is 25% where m ≤ 3,106 and m ≤ 4,822

where the probability is raised to 50%.

Table 1: Approximate number of bits required in the output

of a hash function for ≤ p probability of there being at least

1 collision with m inputs.

p / 0.05 p / 0.25 p / 0.5 p / 0.75

m ≈ 100 17 bits 15 bits 13 bits 12 bits

m ≈ 1, 000 24 bits 21 bits 20 bits 19 bits

m ≈ 10, 000 30 bits 28 bits 27 bits 26 bits

m ≈ 100, 000 37 bits 35 bits 33 bits 32 bits

m ≈ 1, 000,000 44 bits 41 bits 40 bits 39 bits

It is critical to note the approximation approach

described in Eq. (1) and Eq. (2) simply calculate the

probability of there being at least 1 collision but do

not estimate the rate of collisions.

A more crudely modiﬁed Birthday Paradox ap-

proximation was applied for calculating a digest size

for one-to-many hash based anonymization in MAC

addresses in (Demir et al., 2017) where it is also noted

that this can quickly lead to anonymity as the num-

ber of messages increases. (Demir et al., 2017) con-

tend that it is difﬁcult to control the anonymity set

size when using such approximations. As can be seen

in Table 1, this approach rapidly leads to digest sizes

that offer no anonymity. Recall (Minch, 2015) men-

tions a deployment of 12 sensors which, on a single

day on the 6th July, detected 106, 629 unique MAC

addresses, which represents approximately 11.3% of

all (non-unique) detections.

We take a different approach and allow for an ac-

ceptable proportion of MAC addresses to experience

collision, instead of trying to minimise the probabil-

ity of at least one collision in the anonymized dataset.

Below we derive an expression for the rate of colli-

sions p for m messages hashed across n digests (or

buckets). The probability that a message collides

with another, i.e. when there is only one other di-

gest stored, can be computed as 1/n. The probability

that no such collision exists is therefore 1 − 1/n. To

ﬁnd the probability that m−1 addresses do not collide

with a given digest, we can compute (1 − 1/n)

m−1

Simply subtracting such a result from 1 will therefore

provide the probability that a message will have a col-

lision, given a total of m messages and a digest size of

p = 1 − (1 − 1/n)

m−1

(3)

The expected number of collisions within m mes-

sages is thus C = p × m. Eq. (3) provides that with a

24 bit digest (m = 2

), the rate p exceeds 0.01 when

m > 168617. Where m ≤ 2

, we are able to provide

at least, or better, anonymization as the approach de-

tailed in (Fuxjaeger et al., 2016) of removing the the

OUI manufacturer preﬁx from MAC address. How-

ever; due to the hashing approach, we similarly are

able to mitigate risks of attacks described in (Martin

et al., 2016) due to the NIC sufﬁx of the MAC address

being allocated on a contiguous basis. A computed ta-

ble of minimum digest sizes required to provide for a

maximum p rate of collision given m MAC addresses

is provided in Table 2.

In the context of MAC addresses, the value of m

should be set appropriate for the detection volume

over a given time period and the value p should be

set to the desired rate of collision. For example, with

10,000 unique detections between two points in a 40

minute interval (250 vehicles per minute) and accept-

able collision rate of 1%, a hash digest of 20 bits may

be suitable. However, with 360, 000 unique detections

over a 24 hour period, assuming a consistent detection

rate of unique MAC addresses, the collision rate rises

to 29.1%.

Note that the volume of MAC addresses collected

may also vary by sampling rate. A study in Delaware

(Shariﬁ et al., 2011) found the detection rate for Blue-

tooth trafﬁc sensors to vary by hour from a minimum

of 4.3% to a maximum of 6.9%. This corresponded

to a sampling rate (for a journey time calculation) be-

tween 2.02% and 5.39%.

Table 2: Minimum number of bits required in the output of

a hash function for ≤ p rate of collisions in the dataset with

m inputs.

p ≤ 0.01 p ≤ 0.05 p ≤ 0.5 p ≤ 0.75

m = 100 14 bits 9 bits 8 bits 7 bits

m = 1, 000 17 bits 15 bits 11 bits 10 bits

m = 10, 000 20 bits 18 bits 14 bits 13 bits

m = 100, 000

24 bits 21 bits 33 bits 18 bits

m = 1, 000,000 27 bits 25 bits 21 bits 20 bits

5 EVALUATION

5.1 Experiment Setup

The goal of evaluation is to validate the expression for

collision rate using realistic MAC addresses as well as

obtain hash conﬁgurations where the rate of collision

remains less than or equal to 1% at intervals of 100,

1,000 and 10,000 unique MAC addresses. We for-

mulate an experiment to gain empirical metrics on the

rate of collision for given number of MAC addresses

and a given digest size. Such hashes are anonymized

in the manner aforesaid in the previous section; the

MAC address is concatenated with a randomly gener-

ated salt, a hash is computed and truncated to obtain

Practical Hash-based Anonymity for MAC Addresses

577

anonymity.

We randomly generate m unique MAC addresses

in 00:16:3e:00:00:00 and 00:16:3e:7f:ff:ff range (≈

8.4 × 10

possible values), using a script recom-

mended by RedHat for generating randomised MAC

addresses for virtual machines (Red Hat, Inc., 2020).

Such MAC addresses are then hashed together with

a randomly-generated static 68-bit salt. The hash is

then truncated to log

n bits, where n is the maximum

number of digests that can be stored in a given hash.

As a MAC address is hashed, in the event the trun-

cated digest collides with a previous value, the num-

ber of collisions is incremented. A rate of collisions is

calculated as the number of collisions divided by m.

This experiment is repeated 100 times on each

quantity of MAC addresses, for each digest size. On

each round a new set of random MAC addresses is de-

vised and a new random salt generated. The median

value is recorded once all the rounds are completed.

The experiment was run on digest sizes from 2

and quantities of MAC addresses from m = 100 to

m = 100,000. Due to the high amount of hashes that

needed to be computed for this experiment, it was run

overnight on a cloud computing environment.

5.2 Results and Discussion

Table 3 shows the results of these experiments.

The experiment demonstrates that for 100 MAC ad-

dresses, a digest size of 13 bits yields 1% collisions

but rises to 11.1% in a dataset of 1,000 hashes, 62.6%

in a dataset of 10,000 hashes and 95.9% in a dataset

of 100, 000 hashes. Similarly, a digest size of 2

suitable for acceptable 0.7% collisions in a dataset of

10,000 hashes but increases to 7.1% in 10,000 ad-

dresses and 48.7% in 100,000 addresses.

The experimental results closely match the ap-

proximation given by an analytical approximation.

For datasets of 1,000 and 10, 000 MAC addresses

both the probabilistic calculation and the experimen-

tal results show digest sizes of 17 bits and 20 yield

less than 1% collisions. For datasets of 100 MAC

addresses, whilst the probabilistic calculation showed

13 bits would yield a 1.2% rate of collisions, we ex-

perimentally found that 13 bits delivered the accept-

able 1% rate of collision. As the dataset was only

formed of 100 MAC Addresses, the collision rate in-

crements by 1% for each collision. As a comparison,

Eq (3) indicates that for 100 MAC addresses and a di-

gest size of 2

, the rate of collision should be 0.6%.

For 1,000 MAC addresses, a digest size of 2

should

yield a rate of collisions of 0.8%. Finally, for 10, 000

MAC addresses, a digest size of 2

results in a rate

of 1.0%.

We accordingly demonstrate that the obtained ex-

pression is practically suitable for determining the

probability of collision in a dataset of hashes, given

the size of a hash digest and the number of records

stored. The increase in the number of records

yields a higher rate of collisions, demonstrating the

anonymization effect.

Table 3: Median % of collisions (rounded to 1 decimal

place) for m MAC addresses hashed into n buckets.

m = 100 m = 1, 000 m = 10, 000 m = 100,000

n = 2

1.0% 11.1% 62.6% 95.9%

n = 2

0.0% 5.7% 42.2% 91.8%

n = 2

0.0% 3.0% 25.2% 83.7%

n = 2

0.0% 1.4% 13.8% 68.7%

n = 2

0.0% 0.7% 7.1% 48.7%

n = 2

0.0% 0.4% 3.7% 30.0%

n = 2

0.0% 0.2% 1.9% 16.9%

n = 2

0.0% 0.1% 1.0% 9.0%

n = 2

0.0% 0.0% 0.5% 4.6%

6 CONCLUSION AND FUTURE

WORK

In this paper we study MAC address anonymization

strategies for Bluetooth and Wi-Fi based tracking sys-

tems. Existing approaches to anonymization using

hash-based approaches have been found to be eas-

ily invertible due to a small search space and lim-

ited number of well-known OUI manufacturer pre-

ﬁxes which makes frequency-based attacks possible.

Whilst MAC addresses are 48 bits in length, we ﬁnd

that the entire allocated space of MAC addresses can

be represented in no more than 39 bits and 50% of

the most common manufacturers can be represented

in just 31 bits.

Contrary to prior work on anonymization of hash-

based datasets that sought to minimise the probability

of ”at least one” collision, we have instead formulated

the problem in terms of deﬁning a tolerable rate of

collisions in the overall dataset. We present a practical

approach to anonymizing MAC addresses by hash-

ing MAC addresses using computationally expensive

algorithms and truncating the digests to achieve k-

anonymity. We provide an approach for calculating

the rate of collisions in a dataset and experimentally

demonstrate this approach in practice, showing it to

be suitable to obtain hash conﬁgurations with an ac-

ceptable rate of collisions below 1% (for example,

using a digest size of 20 bits for 10, 000 MAC ad-

dresses).

The proposed approach opens the door for hash-

based anonymization to be applied to low-entropy

datasets, which is a potential further work. Other ar-

SECRYPT 2020 - 17th International Conference on Security and Cryptography

578

eas for potential investigation include an investigation

of privacy-preserving opt-out mechanisms. Our work

also opens the door for novel approaches in anony-

mous contact tracing during disease pandemics with-

out the need to deploy new software to user devices.

Further research on de-anonymization attacks of the

approach we have outlined would be of interest, par-

ticularly amongst practical datasets. Finally, multi-

dimensional de-anonymization attacks would prove to

be interesting, where such trackers capture additional

information such as walking or cycling speed, or de-

vice characteristics.

REFERENCES

Abberley, L., Gould, N., Crockett, K., and Cheng, J. (2017).

Modelling road congestion using ontologies for big

data analytics in smart cities. In 2017 International

Smart Cities Conference (ISC2), pages 1–6. IEEE.

Abedi, N., Bhaskar, A., Chung, E., and Miska, M. (2015).

Assessment of antenna characteristic effects on pedes-

trian and cyclists travel-time estimation based on blue-

tooth and wiﬁ mac addresses. Transportation Re-

search Part C: Emerging Technologies, 60:124–141.

Ali, J. (2017). Mechanism for the prevention of password

reuse through anonymized hashes. PeerJ PrePrints,

5:e3322v1.

Alimpia, J. B., Sison, A. M., and Medina, R. P. (2018).

An enhanced hash-based message authentication code

using bcrypt. Proceedings of the International Jour-

nal for Research in Applied Science and Engineering

Technology, 6(4).

Biryukov, A., Dinu, D., and Khovratovich, D. (2016). Ar-

gon2: new generation of memory-hard functions for

password hashing and other applications. In 2016

IEEE European Symposium on Security and Privacy

(EuroS&P), pages 292–302. IEEE.

Blocki, J., Harsha, B., and Zhou, S. (2018). On the eco-

nomics of ofﬂine password cracking. In 2018 IEEE

Symposium on Security and Privacy (SP), pages 853–

871. IEEE.

Blogg, M., Semler, C., Hingorani, M., and Troutbeck, R.

(2010). Travel time and origin-destination data col-

lection using bluetooth mac address readers. In Aus-

tralasian transport research forum, volume 36.

Clearview Intelligence Ltd (2020). Product speciﬁcation

m830 trafﬁc monitoring.

Cragg, S. (2013). Bluetooth detection–cheap but challeng-

ing. In Scottish Transport Applications and Research

Conference (STAR).

Demir, L., Cunche, M., and Lauradoux, C. (2014).

Analysing the privacy policies of wi-ﬁ trackers. In

Proceedings of the 2014 workshop on physical ana-

lytics, pages 39–44.

Demir, L., Kumar, A., Cunche, M., and Lauradoux, C.

(2017). The pitfalls of hashing for privacy. IEEE Com-

munications Surveys & Tutorials, 20(1):551–565.

Ertaul, L., Kaur, M., and Gudise, V. A. K. R. (2016).

Implementation and performance analysis of pbkdf2,

bcrypt, scrypt algorithms. In Proceedings of the Inter-

national Conference on Wireless Networks (ICWN),

page 66. The Steering Committee of The World

Congress in Computer Science, Computer . . . .

Fearn, N. (2019). Privacy warning over tﬂ plans to track

tube passengers via wiﬁ — computing.

Fuxjaeger, P., Ruehrup, S., Paulin, T., and Rainer, B. (2016).

Towards privacy-preserving wi-ﬁ monitoring for road

trafﬁc analysis. IEEE Intelligent Transportation Sys-

tems Magazine, 8(3):63–74.

Hidayata, A., Terabea, S., and Yaginumaa, H. (2018). Time

travel estimations using mac addresses of bus, pas-

sengers: A point to path-qgis analysis. Geoplanning:

Journal of Geomatics and Planning, 5(2):259–268.

Li, L., Pal, B., Ali, J., Sullivan, N., Chatterjee, R., and Ris-

tenpart, T. (2019). Protocols for checking compro-

mised credentials. In Proceedings of the 2019 ACM

SIGSAC Conference on Computer and Communica-

tions Security, pages 1387–1403.

Martin, J., Mayberry, T., Donahue, C., Foppe, L., Brown,

L., Riggins, C., Rye, E. C., and Brown, D. (2017). A

study of mac address randomization in mobile devices

and when it fails. Proceedings on Privacy Enhancing

Technologies, 2017(4):365–383.

Martin, J., Rye, E., and Beverly, R. (2016). Decomposi-

tion of mac address structure for granular device infer-

ence. In Proceedings of the 32nd Annual Conference

on Computer Security Applications, pages 78–88.

Marx, M., Zimmer, E., Mueller, T., Blochberger, M., and

Federrath, H. (2018). Hashing of personally identiﬁ-

able information is not sufﬁcient. SICHERHEIT 2018.

Minch, R. P. (2015). Location privacy in the era of the in-

ternet of things and big data analytics. In 2015 48th

Hawaii International Conference on System Sciences,

pages 1521–1530. IEEE.

Red Hat, Inc. (2020). Generating a new unique mac address.

Shariﬁ, E., Hamedi, M., Haghani, A., and Sadrsadat, H.

(2011). Analysis of vehicle detection rate for blue-

tooth trafﬁc sensors: A case study in maryland and

delaware. In 18th World Congress on on Intelligent

Transport Systems.

Thomas, K., Pullman, J., Yeo, K., Raghunathan, A., Kelley,

P. G., Invernizzi, L., Benko, B., Pietraszek, T., Patel,

S., Boneh, D., et al. (2019). Protecting accounts from

credential stufﬁng with password breach alerting. In

28th USENIX Security Symposium USENIX Security

19), pages 1556–1571.

Velisavljevic, V., Cano, E., Dyo, V., and Allen, B. (2016).

Wireless magnetic sensor network for road trafﬁc

monitoring and vehicle classiﬁcation. Transport and

Telecommunication Journal, 17(4).

Wetzels, J. (2016). Open sesame: The password

hashing competition and argon2. arXiv preprint

arXiv:1602.03097.

Practical Hash-based Anonymity for MAC Addresses

579