A Fusion Approach to Computing Distance for Heterogeneous Data

Aalaa Mojahed, Beatriz de la Iglesia

2014

Abstract

In this paper, we introduce heterogeneous data as data about objects that are described by different data types, for example structured data, text, time series and images. We provide an initial definition of a heterogeneous object using two basic data types, structured data and time series, and make the definition extensible so that further data types and more complex objects can be introduced. Methods to analyse and, in particular, to cluster such data are currently lacking. We therefore propose an intermediate fusion approach to calculate the distance between objects in such datasets. Our approach deals with uncertainty in the distance calculation and provides a representation of it that can later be used to fine-tune clustering algorithms. We provide some initial examples of our approach using a real dataset of prostate cancer patients, including visualisation of both distances and uncertainty. Our approach is a preliminary step towards clustering such heterogeneous objects, since the distances between objects produced by the fusion approach can be fed to any standard clustering algorithm. Although further experimental evaluation will be required to fully validate the Fused Distance Matrix approach, this paper presents the concept through an example and shows its feasibility. The approach is extensible to other problems in which objects are represented by different data types, e.g. text or images.
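The abstract does not spell out how the fused distance is computed, so the following Python sketch is only one plausible reading of the idea and not the authors' method. It assumes one distance matrix per element of the object (Euclidean distance for the structured part, dynamic time warping for the time series part), fuses the normalised matrices by a plain average over the elements available for each pair, and reports two hypothetical uncertainty measures: the fraction of elements missing for a pair and the spread of the per-element distances. All function names, the toy records and both uncertainty measures are assumptions made for illustration.

```python
import numpy as np

def pairwise(objs, dist):
    """Symmetric distance matrix; NaN marks pairs that cannot be compared."""
    n = len(objs)
    dm = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            a, b = objs[i], objs[j]
            d = np.nan if a is None or b is None else dist(a, b)
            dm[i, j] = dm[j, i] = d
    return dm

def euclidean(a, b):
    """Euclidean distance between two numeric attribute vectors."""
    return float(np.linalg.norm(np.asarray(a, float) - np.asarray(b, float)))

def dtw(a, b):
    """Dynamic time warping with an absolute-difference local cost."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i, j] = step + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(cost[n, m])

def normalise(dm):
    """Scale a matrix to [0, 1] by its largest known distance (assumed choice)."""
    top = np.nanmax(dm)
    return dm / top if top > 0 else dm

def fuse(dms):
    """Average the normalised matrices over the elements known for each pair.

    Returns (fused, missing, spread), where `missing` is the fraction of
    elements unavailable for a pair and `spread` is the standard deviation
    of the available per-element distances. Both are illustrative
    stand-ins for uncertainty, not measures taken from the paper.
    """
    stack = np.stack([normalise(dm) for dm in dms])  # shape: (elements, n, n)
    known = ~np.isnan(stack)
    count = known.sum(axis=0)
    safe = np.maximum(count, 1)  # avoid division by zero
    fused = np.where(known, stack, 0.0).sum(axis=0) / safe
    spread = np.sqrt(np.where(known, (stack - fused) ** 2, 0.0).sum(axis=0) / safe)
    missing = 1.0 - count / len(dms)
    fused = np.where(count > 0, fused, np.nan)  # no element known: undefined
    return fused, missing, spread

# Hypothetical toy records: two structured attributes and one time series
# per object; the third object has no time series recorded.
structured = [[60, 1.0], [62, 1.5], [75, 4.0]]
series = [[1, 2, 3, 4], [1, 2, 2, 3, 4], None]

fused, missing, spread = fuse([pairwise(structured, euclidean),
                               pairwise(series, dtw)])
print(np.round(fused, 3))
print(missing)
```

The fused matrix could then be passed to any clustering routine that accepts precomputed distances, while `missing` and `spread` flag the pairs whose fused distance rests on less evidence or on disagreeing elements.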



Paper Citation


in Harvard Style

Mojahed A. and de la Iglesia B. (2014). A Fusion Approach to Computing Distance for Heterogeneous Data. In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2014) ISBN 978-989-758-048-2, pages 269-276. DOI: 10.5220/0005083702690276


in Bibtex Style

@conference{kdir14,
author={Aalaa Mojahed and Beatriz de la Iglesia},
title={A Fusion Approach to Computing Distance for Heterogeneous Data},
booktitle={Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2014)},
year={2014},
pages={269-276},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0005083702690276},
isbn={978-989-758-048-2},
}


in EndNote Style

TY - CONF
JO - Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2014)
TI - A Fusion Approach to Computing Distance for Heterogeneous Data
SN - 978-989-758-048-2
AU - Mojahed A.
AU - de la Iglesia B.
PY - 2014
SP - 269
EP - 276
DO - 10.5220/0005083702690276