A Comparison of Statistical Linkage Keys with Bloom Filter-based Encryptions for Privacy-preserving Record Linkage using Real-world Mammography Data

Rainer Schnell, Anke Richter, Christian Borgs


New EU regulations on the need to encrypt personal identifiers for linking data will increase the importance of Privacy-Preserving Record Linkage (PPRL) techniques in the coming years. Currently, the use of Anonymous Linkage Codes (ALCs) is the standard procedure for PPRL of medical databases. Recently, Bloom filter-based encodings of pseudo-identifiers such as names have received increasing attention for PPRL tasks. In contrast to most previous research in PPRL, which is based on simulated data, we compare the performance of ALCs and Bloom filter-based linkage keys using real data from a large regional breast cancer screening program. This mammography database contains nearly 200,000 records. We compare precision and recall for linking the data set existing at time point t0 with new incident cases occurring after t0, using different encoding and matching strategies for the personal identifiers. Enhancing ALCs with an additional identifier (place of birth) yields better recall than standard ALCs. Bloom filters built from the same information with recommended parameter settings exceed ALCs in recall while preserving precision.
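
As a rough illustration of the Bloom filter-based encoding of names mentioned above, the following minimal Python sketch maps the character bigrams of a name into a bit array using keyed hashes, compares two encodings with the Dice coefficient, and computes precision and recall for a set of predicted links. The filter length (1000), the number of hash functions (10), the double-hashing construction with HMAC-SHA1 and HMAC-MD5, the secret key, and all function names are assumptions made for this example; they are not the parameter settings or the code used in the paper.

```python
import hashlib
import hmac


def bigrams(name):
    """Split a name into padded character bigrams ('ann' -> _a, an, nn, n_)."""
    padded = "_" + name.strip().lower() + "_"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def bloom_encode(name, key, length=1000, k=10):
    """Return the set of bit positions set to 1 in the Bloom filter of a name.

    Assumed scheme: double hashing, position_i = (h1 + i * h2) mod length,
    with two keyed hashes (HMAC-SHA1 and HMAC-MD5) per bigram.
    """
    positions = set()
    for gram in bigrams(name):
        data = gram.encode("utf-8")
        h1 = int.from_bytes(hmac.new(key, data, hashlib.sha1).digest(), "big")
        h2 = int.from_bytes(hmac.new(key, data, hashlib.md5).digest(), "big")
        for i in range(k):
            positions.add((h1 + i * h2) % length)
    return frozenset(positions)


def dice(a, b):
    """Dice coefficient of two Bloom filters given as sets of set bit positions."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


def precision_recall(predicted_links, true_links):
    """Precision and recall of predicted record pairs against the true pairs."""
    true_positives = len(predicted_links & true_links)
    precision = true_positives / len(predicted_links) if predicted_links else 0.0
    recall = true_positives / len(true_links) if true_links else 0.0
    return precision, recall


if __name__ == "__main__":
    secret = b"hypothetical-shared-secret"  # assumed key, for illustration only
    a = bloom_encode("schnell", secret)
    b = bloom_encode("schnel", secret)    # spelling variant of the same name
    c = bloom_encode("richter", secret)   # unrelated name
    print("similar names:  ", round(dice(a, b), 3))
    print("different names:", round(dice(a, c), 3))
```

Because similar names share most of their bigrams, their encodings share most of their set bits, so a similarity threshold on the Dice coefficient can tolerate spelling variation. An exact-match linkage code would treat the two spellings as different persons.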



Paper Citation

in Harvard Style

Schnell R., Richter A. and Borgs C. (2017). A Comparison of Statistical Linkage Keys with Bloom Filter-based Encryptions for Privacy-preserving Record Linkage using Real-world Mammography Data. In Proceedings of the 10th International Joint Conference on Biomedical Engineering Systems and Technologies - Volume 5: HEALTHINF, (BIOSTEC 2017), ISBN 978-989-758-213-4, pages 276-283. DOI: 10.5220/0006140302760283
