PRIVATE SEARCHING FOR SENSITIVE FILE SIGNATURES

John Solis

Scalable and Secure Systems Research, Sandia National Labs, Livermore, CA, U.S.A.

Keywords:

Private searching, Private matching, Homomorphic encryption applications.

Abstract:

We consider the problem of privately searching for sensitive or classiﬁed ﬁle signatures on an untrusted server.

Inspired by the private stream searching system of Ostrovsky and Skeith, we propose a new scheme optimized

for matching individual ﬁle signatures (versus keyword matching in documents). Our optimization stems

from the simple observation that a complete list of matching ﬁle signatures can be replaced by a much smaller

encrypted bitmask. This approach reduces a server’s response overhead from being linear in the number of

matched documents to linear with respect to a system robustness parameter.

1 INTRODUCTION AND MOTIVATION

Government organizations are responsible for protecting sensitive and classified information from unauthorized disclosure. A potential solution to this problem

is to augment existing virus/malware scanners with

classiﬁed ﬁle signatures. During a routine scan, any

computer discovered to contain classiﬁed ﬁles can

be immediately conﬁscated. Unfortunately, this ap-

proach leaves the augmented databases vulnerable to

local exploits: databases may be leaked by new malware that compromises the computer. This is problematic since an adversary can use the database to verify

classiﬁcation status of arbitrary ﬁles.

The ideal solution would be a scanner, capable of

executing on untrusted computers, that searches for

classiﬁed signatures without revealing any informa-

tion. In particular, the host itself should not learn

about the signatures being searched or when they have

been located.

We propose a new method for privately detecting

classiﬁed ﬁle signatures on untrusted systems. In-

spired by existing private stream searching systems,

we use Paillier encryption to construct a simple bit-

mask identifying all classiﬁed signatures present on a

particular host. No information can be leaked since

all operations are over encrypted ciphertexts – even in

compromised or untrusted contexts.

2 RELATED WORK

The closest related work is the Ostrovsky-Skeith pri-

vate stream searching system (Ostrovsky and Skeith,

2007). It allows a client to privately search through a

stream of documents, located on a separate server, and retain copies of any document containing any combination of secret keywords. The server-to-client communication complexity is bounded by O(m log m),

where m is the maximum number of documents that

can be retrieved. Subsequent work (Bethencourt et al., 2009) improves communication and storage complexity to O(m).

In our scenario, we are primarily interested in

the existence of a ﬁle, not necessarily its content.

Although testing for existence can be performed by

matching ﬁle contents, this requires a high communi-

cation overhead (especially for large ﬁles).

Private Set Intersection (PSI) (Freedman et al.,

2004) allows two parties, each containing a private set

of inputs, to jointly calculate their intersection with-

out leaking extra information about either set. In our

scenario, the sets would be (1) sensitive/classiﬁed ﬁle

signatures and (2) public ﬁle signatures. The intersec-

tion, i.e., all identiﬁed classiﬁed ﬁles, can be sent to a

central server for processing. Our goal is to develop

a scheme with lower communication complexity than

the approaches discussed here.


Solis, J. (2011). Private Searching for Sensitive File Signatures. In Proceedings of the International Conference on Security and Cryptography (SECRYPT-2011), pages 341-344. DOI: 10.5220/0003466703410344. ISBN: 978-989-8425-71-3. © 2011 SCITEPRESS (Science and Technology Publications, Lda.)

3 PRIVATE SIGNATURE SEARCHING SCHEME

Problem Statement. The administrator of a large

organization wants to scan its computers, referred to

hereafter as servers, in search of sensitive ﬁles. An ef-

ﬁcient private signature searching scheme should not

reveal any information to the server about the signa-

tures being searched or when they have been located.

Solution Overview. Building such a scheme re-

quires a solution with (1) minimal communication

complexity, and (2) a privacy-preserving method for

identifying matching ﬁle signatures.

We accomplish the ﬁrst requirement by observing

that, for each ﬁle signature in the administrator’s clas-

siﬁed database, we are only interested in communi-

cating a single bit of information: does this signature

exist on the server? To query for the entire database,

we construct a simple bitmask where individual bits

correspond to speciﬁc signatures. The exact one-to-

one mapping from sensitive ﬁle signatures to speciﬁc

bitmask indices is known only to the administrator.

For the second requirement, we apply the homo-

morphic properties of the semantically secure Pail-

lier encryption system (Paillier, 1999) to our bitmask.

The server can manipulate the bitmask because, in the Paillier system, multiplying two ciphertexts together results in an encryption of the sum of the plaintexts: $E(\alpha) \cdot E(\beta) = E(\alpha + \beta)$. Plaintexts are represented as elements of $\mathbb{Z}_n$ and ciphertexts as elements of $\mathbb{Z}_{n^2}^*$, where $n = pq$ is an RSA number with $p < q$ and $p \nmid (q - 1)$.

Now assume the administrator provides the server with a set of ciphertexts of the form $E(2^i)$, and for a given set of ciphertexts, each value of $i$ is used only once. To discover sensitive files, the server computes the signature of all files it contains and multiplies (in an oblivious manner) into our encrypted bitmask: $E(2^i)$ when the signature is in the classified set, or $E(0)$ when it is not.

Since each ciphertext is of the form $E(2^i)$ and each value of $i$ is used only once, the plaintext sum never carries, so the product of all ciphertexts is essentially computing the binary XOR (equivalently, the bitwise OR) over the original plaintexts. As a simple example: $E(8) \cdot E(4) = E(12)$. In binary: $E(1000) \cdot E(0100) = E(1100)$, i.e., binary XOR.

The administrator decrypts the ﬁnal encrypted bit-

mask and uses the private one-to-one mapping to de-

termine matching signatures.
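The additive property used above can be checked with a toy Paillier instance. The following is only a minimal sketch: the function names (`keygen`, `encrypt`, `decrypt`), the tiny primes, and the choice of generator g = n + 1 are assumptions made for illustration, and such small parameters offer no security.

```python
import math
import random

def keygen(p, q):
    """Toy Paillier key generation from two small primes (illustration only)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    mu = pow(lam, -1, n)  # valid because the generator is fixed to g = n + 1
    return n, (lam, mu)

def encrypt(n, m):
    """E(m) = (1 + n)^m * r^n mod n^2 for a fresh random unit r."""
    n2 = n * n
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(1 + n, m, n2) * pow(r, n, n2)) % n2

def decrypt(n, sk, c):
    """Recover m via L(c^lam mod n^2) * mu mod n, where L(u) = (u - 1) / n."""
    lam, mu = sk
    u = pow(c, lam, n * n)
    return ((u - 1) // n * mu) % n

n, sk = keygen(101, 113)
# Multiplying ciphertexts adds plaintexts: E(8) * E(4) decrypts to 12.
product = (encrypt(n, 0b1000) * encrypt(n, 0b0100)) % (n * n)
combined = decrypt(n, sk, product)
print(bin(combined))  # 0b1100
```

Because the two plaintexts occupy different bit positions, the sum 12 has exactly the bits of 8 and 4 set, which is the bitmask behaviour the scheme relies on.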

3.1 Formal Construction

Let $E(\cdot)$ denote the Paillier encryption function, $C$ denote an ordered set of classified signatures, $F$ denote the set of file signatures on a server, $Q$ denote a set of encrypted bitmasks, and $H_i : \{0,1\}^* \to \{0,1\}^h$ denote a one-way cryptographic hash function that maps arbitrary-length input strings to strings of bit-length $h$.

A private signature searching scheme is a tuple of

algorithms:

Administrator:KeyGen(τ). The administrator executes the key generation algorithm of the Paillier cryptosystem with security parameter τ to find an appropriate RSA number, $n = pq$. To guarantee that elements of $Q$ are correctly represented (i.e., a unit mod $n$), select an $m$ such that $2^m < \min\{p, q\}$. Output the Paillier public key $PK = n$, the corresponding secret key $SK = \{p, q\}$, and the maximum supported classified signature set size $m$.

Administrator:Setup(SK, C, F, k). On input of a classified signature set and a file signature set, the administrator verifies $|C| \le m$ and $m < \sqrt{|F|}$ and aborts if either test fails. Otherwise, construct a set of k one-to-one mappings from elements in $C$ to specific bit positions in our bitmask as follows:

Since $C$ is an ordered set, we take the existing position of $c_j \in C$ and use it as the corresponding bit position in our bitmask, e.g., the third element maps to $2^3$. Now let $K$ be a set of k values selected uniformly at random from $\mathbb{Z}_n$. Each key $k' \in K$, along with a (keyed) pseudo-random permutation function $\mathrm{PRP}_{k'}(C)$, generates a unique permutation of $C$ and unique mappings from elements to bit positions. Update the secret key to include the set of permutation keys, i.e., $SK = \{\{p, q\}, K\}$.

Next, compute a set of tables with encrypted values representing the individual bitmask bits. For each $t \in \{1, \ldots, k\}$ and each $k' \in K$, initialize each table, $D_t$, with $s = 2|F|$ entries of $E(0)$. Select an index $i$ uniformly at random and, for the $j$-th element $c_j \in \mathrm{PRP}_{k'}(C)$, set $D_t[H_i(c_j) \bmod s] := E(2^j)$. If a collision occurs between any two elements of $C$, the entire table is discarded and a new index is chosen for $H$. Repeat the process until no collisions have occurred and store the index in the set $I$. Note that the $m < \sqrt{|F|}$ requirement guarantees that the probability of a collision will always be less than $\frac{1}{2}$ (see the Choice of m discussion in Section 4 below).
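The retry loop for finding a collision-free hash index can be sketched as follows. This is a plaintext view only: a real table would hold Paillier encryptions of these values, and the helper names (`slot`, `find_collision_free_key`, `build_plain_table`), the 16-byte random key, and the sample signatures are hypothetical. The hash is keyed by prepending the key to the data, as the implementation section describes for SHA-1.

```python
import hashlib
import secrets

def slot(key: bytes, signature: bytes, s: int) -> int:
    """Keyed hash of a signature, reduced modulo the table size s."""
    digest = hashlib.sha1(key + signature).digest()
    return int.from_bytes(digest, "big") % s

def find_collision_free_key(classified, s, max_tries=1000):
    """Retry random hash keys until all classified signatures hit distinct slots."""
    for _ in range(max_tries):
        key = secrets.token_bytes(16)
        slots = [slot(key, c, s) for c in classified]
        if len(set(slots)) == len(slots):
            return key, slots
    raise RuntimeError("no collision-free key found")

def build_plain_table(classified, s):
    """Plaintext view of one data table: 2^j in the slot of the j-th signature, 0 elsewhere."""
    key, slots = find_collision_free_key(classified, s)
    table = [0] * s
    for j, pos in enumerate(slots, start=1):
        table[pos] = 1 << j
    return key, table

# Hypothetical example: three classified signatures, a server with 20 files.
classified = [b"sig-alpha", b"sig-beta", b"sig-gamma"]
key, table = build_plain_table(classified, s=2 * 20)
```

In the scheme itself every entry, including the zeros, would be a fresh encryption, so the server cannot tell occupied slots from empty ones.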

The ﬁnal step is to compute k tables, W , with s

entries of E (0) in each table. These tables are used

by the Scan algorithm as “working” tables to store

intermediary results.

Server:Scan(PK, {D , W , I }, F ). This algorithm

outputs a single encryption element representing all

classiﬁed signatures present on a server.

For each $f \in F$ and each $i \in I$, let $t$ be the index of $i$ in $I$ and set $W_t[H_i(f)] := D_t[H_i(f)]$, with indices reduced modulo $s$ as in Setup. After all signatures in the system have been processed, each working table is compressed into a single encryption element:

$$\prod_{j=1}^{s} w_{t,j}$$

where $w_{t,j}$ denotes the $j$-th element in working table $t$. The result for each table represents the bitmask corresponding to all classified signatures found in $F$.
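A single-table run of Scan might look like the sketch below. It assumes a toy Paillier instance with tiny primes and g = n + 1, a one-byte hash key, and made-up signatures; `enc`, `dec`, `slot`, and `scan` are illustrative names, not the paper's code.

```python
import hashlib
import math
import random

# Toy Paillier parameters: tiny primes, for illustration only.
P, Q = 101, 113
N, N2 = P * Q, (P * Q) ** 2
LAM = (P - 1) * (Q - 1) // math.gcd(P - 1, Q - 1)
MU = pow(LAM, -1, N)  # modular inverse; needs Python 3.8+

def enc(m):
    """E(m) = (1 + n)^m * r^n mod n^2 with a fresh random unit r."""
    r = random.randrange(1, N)
    while math.gcd(r, N) != 1:
        r = random.randrange(1, N)
    return pow(1 + N, m, N2) * pow(r, N, N2) % N2

def dec(c):
    return (pow(c, LAM, N2) - 1) // N * MU % N

def slot(key, sig, s):
    """Keyed hash reduced modulo the table size (key prepended to the data)."""
    return int.from_bytes(hashlib.sha1(key + sig).digest(), "big") % s

def scan(data, working, key, file_sigs):
    """Copy D[h] into W[h] for every local file, then multiply W together."""
    s = len(data)
    working = list(working)
    for f in file_sigs:
        h = slot(key, f, s)
        working[h] = data[h]  # the server cannot tell E(0) from E(2^j)
    product = 1
    for w in working:
        product = product * w % N2
    return product

# Administrator side (hypothetical data): pick a key with no collisions in C.
classified = [b"sig-alpha", b"sig-beta", b"sig-gamma"]
s = 64
key = next(bytes([b]) for b in range(256)
           if len({slot(bytes([b]), c, s) for c in classified}) == len(classified))
data = [enc(0) for _ in range(s)]
for j, c in enumerate(classified, start=1):
    data[slot(key, c, s)] = enc(1 << j)
working = [enc(0) for _ in range(s)]

# Server side: two classified signatures are present among ordinary files.
files = [b"sig-alpha", b"sig-gamma", b"notes.txt", b"kernel.img"]
mask = dec(scan(data, working, key, files))
expected = (1 << 1) | (1 << 3)
```

Because entries are assigned rather than multiplied in, a signature that appears several times on the server still contributes its bit only once. An unrelated file may hash to an occupied slot and set a spurious bit, which is one reason the scheme repeats the process over k independent tables.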

Administrator:Verify(SK, c, C, S). On input of a secret key, each element in $S$ is decrypted and stored in the set of plaintexts, $P$.

For the input classified signature $c$, we check if the correct bits (based on the permutations of $C$) are set in the bitmask plaintexts. For each $t \in \{1, \ldots, k\}$, let $j_t$ be the index of $c$ in $\mathrm{PRP}_{k_t}(C)$, where $k_t$ is the $t$-th key in $K$. Check if bit $j_t$ is set in each plaintext $p_t \in P$. Return 1 if all the bits are set correctly, otherwise return 0.
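The bit check performed by Verify can be illustrated on already-decrypted bitmasks. In this sketch the keyed permutation is replaced by a seeded shuffle from Python's `random` module, which is only a stand-in for a real pseudo-random permutation; the keys, signatures, and function names are hypothetical.

```python
import random

def permuted_positions(classified, key):
    """Stand-in keyed permutation: 1-based position of each signature under `key`."""
    order = list(classified)
    random.Random(key).shuffle(order)
    return {sig: j for j, sig in enumerate(order, start=1)}

def verify(c, classified, keys, plaintexts):
    """Return 1 if the bit assigned to c is set in every decrypted bitmask, else 0."""
    for key, p in zip(keys, plaintexts):
        j = permuted_positions(classified, key)[c]
        if not (p >> j) & 1:
            return 0
    return 1

# Hypothetical run with k = 3 tables; signatures "a" and "c" are on the server.
classified = ["a", "b", "c"]
keys = [11, 22, 33]
present = {"a", "c"}
plaintexts = [
    sum(1 << permuted_positions(classified, key)[sig] for sig in present)
    for key in keys
]
found_a = verify("a", classified, keys, plaintexts)
found_b = verify("b", classified, keys, plaintexts)
```

Requiring the bit to be set in all k bitmasks is what lets a spurious bit in one table be ruled out by the others.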

3.2 System Properties

A private signature searching scheme must have the

following properties:

• Correctness. Verify is a probabilistic function

such that for robustness parameter k:

∀c ∈ C , Pr[Verify(c, P) = 1|c ∈ F ] ≥ 1 −neg(k)

• Privacy. Informally, an adversary should not

learn the signatures being searched or even when

they have been located.

4 ANALYSIS AND DISCUSSION

Choice of m. In this section, we briefly discuss our motivation for the two initial tests in the Setup(·, ·) algorithm: $|C| \le m$ and $m < \sqrt{|F|}$. The first test simply ensures that we have complete coverage in the mapping from signatures in $C$ to bits in $Q$. Without this one-to-one mapping, we cannot make any statements about the correctness of the server.

The second test ensures that we can quickly find a valid hash index (one that produces no collisions) for $H(\cdot)$. A well-known probability result, known as the birthday paradox, tells us that given $n$ bins and $\sqrt{n}$ balls, the probability of having a single collision is at most $\frac{1}{2}$. By requiring $m < \sqrt{|F|}$, the probability of collision will always be less than $\frac{1}{2}$. Thus, the probability of finding a valid hash index in k separate trials is greater than $1 - (\frac{1}{2})^k$.
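The bound can be checked numerically. The sketch below computes the exact collision probability for uniformly random throws, assuming an idealized hash; the file count of 10,000 and the function names are illustrative choices.

```python
import math

def collision_probability(balls, bins):
    """Exact probability that at least two of `balls` uniform throws share a bin."""
    p_no_collision = 1.0
    for i in range(balls):
        p_no_collision *= (bins - i) / bins
    return 1.0 - p_no_collision

def success_within(trials, p_collision):
    """Probability that at least one of `trials` independent attempts is collision-free."""
    return 1.0 - p_collision ** trials

files = 10_000                 # |F|, an illustrative value
s = 2 * files                  # table size used by Setup
m = math.isqrt(files) - 1      # largest m with m < sqrt(|F|), here 99
p = collision_probability(m, s)
p_sqrt = collision_probability(math.isqrt(files), files)  # sqrt(n) balls, n bins
```

Both `p` and `p_sqrt` stay below one half, so `success_within(k, p)` exceeds $1 - (\frac{1}{2})^k$ for any number of trials k.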

Communication Complexity. The communication

complexity of our scheme is asymptotically identical

to the PIR schemes discussed earlier. However, we

argue that administrator initialization is a one-time

setup cost that can be amortized over several execu-

tions. This makes our approach preferable in situa-

tions where frequent scanning is expected.

4.1 Security Analysis

Adversarial Model. We assume the honest-but-

curious adversarial model. In this model, servers

execute the Scan(·, ·) algorithm honestly and do not

intentionally or maliciously tamper with any output.

They may, however, observe or record any intermedi-

ary algorithm state in an attempt to learn any ﬁnal out-

put behavior. We argue that this is reasonable since,

within our context, the administrator is likely to have

some form of authority over the servers it queries.

We argue that a stronger adversarial model does

not make sense in our context. A malicious adver-

sary, for instance, would either refuse to execute the

Scan(·, ·) algorithm or skip over ﬁles during the scan-

ning operation. It could also simply replace the ﬁnal

output with encryptions of random elements (which

is possible given the Paillier public key).

Correctness and Privacy Proofs. Proof details

have been omitted due to space constraints. However,

we comment that the correctness proof follows from

the Ostrovsky-Skeith proof and correctly verifies classified signatures with high probability. The privacy

proof is simply a reduction to the semantic security

property of the Paillier cryptosystem.

5 IMPLEMENTATION

We implemented our scheme in C++ and used multiple open source libraries. The Paillier cryptosystem implementation was based on the GNU Multiple Precision Arithmetic Library, and cryptographic hashing used the SHA-1 implementation from the OpenSSL library. SHA-1 was "keyed" by prepending the key to any data being hashed.

One optimization technique used was to perform multi-threaded table initialization for the data and working tables. Since both tables are initialized with encryptions of zero, E(0), each thread performs its encryption operations independently. All encryptions were done using a 1024-bit Paillier public key (resulting in 2048-bit ciphertexts).

All tests were performed on a 64-bit Intel Core i7 960 / 3.2 GHz CPU (4 CPU cores) running the Ubuntu 10.10 Linux distribution. We fixed the system robustness parameter (somewhat arbitrarily) at k = 5 and varied the number of files stored on the server, |F|, to be scanned. Both the administrator and server algorithms were benchmarked to gain an understanding of practical performance issues.

5.1 Administrator Benchmarks

Table 1: Administrator Benchmarks. Setup(·,·) [k = 5]

         Experimental                      Extrapolated
|F|      Init (min)   Storage (MB)   |F|      Init (min)   Storage (MB)
10K      2.22         51             100K     22.34        512
20K      4.47         102            150K     33.51        768
30K      6.70         153            200K     44.68        1024
40K      8.97         204            250K     55.85        1280
50K      11.20        256            300K     67.02        1536

The initial results of the administrator bench-

marks, reported in Table 1, indicate that the proposed

scheme is a reasonable and practical approach for

querying servers with both small and large datasets,

|F |. However, as the number of ﬁles increases, the

administrator must decide when the data tables are too

large to distribute. This will likely depend on whether

the tables are transferred via the network or storage

devices, e.g., USB memory stick.

The most expensive administrator operation is the

data and working table initialization. However, be-

cause Paillier is a public key system, it is possible to

shift some of this burden to the server. In particular,

servers can compute their own working tables inde-

pendently for each scan operation and reduce the re-

quired communication overhead by half.

Extrapolating to Large File Sets. Regardless of server file count, we average 1492 Paillier encryptions per second, or $6.7 \times 10^{-4}$ seconds per encryption. We extrapolate the expected initialization processing times and storage costs for larger file counts (right column of Table 1).
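The extrapolated figures appear to follow from a simple cost model, sketched below. The model is an assumption inferred from the reported numbers rather than something the text states: it counts k data tables plus k working tables, each with 2|F| entries, one encryption per entry, 2048-bit ciphertexts, and decimal megabytes. The function name `setup_cost` is illustrative.

```python
ENCRYPTIONS_PER_SECOND = 1492      # measured rate reported in the text
CIPHERTEXT_BYTES = 2048 // 8       # 2048-bit ciphertexts

def setup_cost(num_files, k=5):
    """Estimate Setup time (minutes) and table storage (MB) for |F| = num_files."""
    entries_per_table = 2 * num_files
    tables = 2 * k                 # assumed: k data tables plus k working tables
    encryptions = entries_per_table * tables
    minutes = encryptions / ENCRYPTIONS_PER_SECOND / 60
    megabytes = encryptions * CIPHERTEXT_BYTES / 1e6
    return minutes, megabytes

for num_files in (100_000, 150_000, 200_000, 250_000, 300_000):
    minutes, megabytes = setup_cost(num_files)
    print(f"{num_files:>7}  {minutes:6.2f} min  {megabytes:6.0f} MB")
```

Under these assumptions the model reproduces the extrapolated column of Table 1, for example about 22.34 minutes and 512 MB at 100K files.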

5.2 Server Benchmarks

For server benchmarks, we scanned two large local system directories whose size closely approximated the table size generated by the administrator. The results, recorded in Table 2, show the time it took to (1) perform hashes of all files in the directory, and (2) perform all multiplications required to compress each working table into a single encryption element.

Table 2: Server Benchmarks. Scan(·, ·) [k = 5]

Table Size   |F|      Local Directory   Running Time (sec)
20K          18279    /usr/lib/         59.31
40K          39197    /usr/src/         539.33

Our results indicate that both operations can be

performed efficiently. In general, the time taken to perform all hashing operations drastically exceeds

the time taken to perform all multiplications. This is

especially true when the ﬁles being hashed are large

and require several hard disk fetches.

Note that because of file size variability, extrapolating these results to larger data sets is not insightful.

Total execution time is more accurately measured as

a function of ﬁle size than as number of scanned ﬁles.

6 FUTURE WORK AND CONCLUSIONS

The scheme proposed represents the ﬁrst steps to-

wards an efﬁcient and scalable solution for private

searching of sensitive ﬁle signatures on untrusted

servers. However, there are many potential areas for

improvements and future work:

One possible direction is to consider reducing

communication overhead in large networked envi-

ronments. In our scheme, communication overhead

grows linearly with the robustness parameter k. How-

ever, scanning all servers in a network simultaneously

may overwhelm the administrator. It would be prefer-

able to reduce overhead by supporting an in-network

aggregation of all responses.

Another potential direction is to investigate tech-

niques supporting alternate query forms, e.g., OR,

AND, CNF. This would allow administrators to per-

form ﬁner granularity queries for a speciﬁc situation.

In conclusion, we proposed a novel construction for private searching of sensitive file signatures, discussed implementation results, and showed that our approach is efficient for administrators and servers.

REFERENCES

Bethencourt, J., Song, D., and Waters, B. (2009). New techniques for private stream searching. ACM Trans. Inf. Syst. Secur., 12:16:1–16:32.

Freedman, M. J., Nissim, K., and Pinkas, B. (2004). Efficient private matching and set intersection, pages 1–19. Springer-Verlag.

Ostrovsky, R. and Skeith, III, W. E. (2007). Private searching on streaming data. J. Cryptol., 20:397–430.

Paillier, P. (1999). Public-key cryptosystems based on composite degree residuosity classes. In Proceedings of the 17th international conference on Theory and application of cryptographic techniques, EUROCRYPT'99, pages 223–238, Berlin, Heidelberg. Springer-Verlag.
