LINK INTEGRATOR - A Link-based Data Integration Architecture

Pedro Lopes, Joel Arrais, José Luís Oliveira



The evolution of the World Wide Web has created a great opportunity for data production and for the construction of public repositories that can be accessed from anywhere in the world. However, as our ability to generate new data grows, so does the need to integrate that data efficiently and to access it across dispersed sources. In specific fields such as biology and biomedicine, data integration challenges are even more complex. The amount of raw data, the possible data associations, the diversity of concepts and data formats, and the demand for information quality assurance are just a few of the issues that hinder the development of general and robust solutions. In this article we describe a lightweight information integration architecture that unifies several heterogeneous bioinformatics data sources in a single access point. The model is based on web crawling that automatically collects keywords related to biological concepts, which are defined beforehand in a navigation protocol. This crawling phase allows the construction of a link-based integration mechanism that directs users to the right source of information, keeping the original interfaces of the available sources and preserving credit to the original data providers.
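The crawling idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Step` structure, the `PROTOCOL` entries, the example.org URLs, the regular expressions, and the canned `FAKE_WEB` pages are all assumptions made for illustration. Each step of the hypothetical navigation protocol names a biological concept, says which earlier concept seeds it, and gives a pattern for extracting keywords; the result is a set of links into the original sources rather than a copy of their data.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One entry of a hypothetical navigation protocol."""
    concept: str    # biological concept this step resolves
    seed: str       # concept whose keywords seed this step ("query" = user input)
    crawl_url: str  # page to crawl, with a {key} placeholder
    pattern: str    # regex whose single group captures a keyword
    link_url: str   # user-facing page at the original source, with {key}


# Hypothetical protocol: disease -> genes -> proteins. URLs are placeholders.
PROTOCOL = [
    Step("gene", "query", "https://example.org/disease/{key}",
         r"gene:(\w+)", "https://genes.example.org/{key}"),
    Step("protein", "gene", "https://example.org/gene/{key}",
         r"protein:(\w+)", "https://proteins.example.org/{key}"),
]

# Canned pages so the sketch runs offline; a real crawler would issue HTTP requests.
FAKE_WEB = {
    "https://example.org/disease/D001": "gene:G1 gene:G2",
    "https://example.org/gene/G1": "protein:P10",
    "https://example.org/gene/G2": "protein:P20 protein:P21",
}


def fetch(url):
    """Return the page body for a URL, or an empty string if unknown."""
    return FAKE_WEB.get(url, "")


def crawl(query, protocol=PROTOCOL, fetch=fetch):
    """Follow the protocol from a query and return {concept: [links]}."""
    keywords = {"query": [query]}
    links = {}
    for step in protocol:
        found = []
        for key in keywords.get(step.seed, []):
            page = fetch(step.crawl_url.format(key=key))
            for keyword in re.findall(step.pattern, page):
                if keyword not in found:  # keep first-seen order, drop repeats
                    found.append(keyword)
        keywords[step.concept] = found
        # Store only links to the original providers, never the data itself.
        links[step.concept] = [step.link_url.format(key=k) for k in found]
    return links
```

Calling `crawl("D001")` on the canned pages yields two gene links and three protein links, each pointing at the (placeholder) provider page. Passing a different `fetch` function would let the same loop run against real sources.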



Paper Citation

in Harvard Style

Lopes P., Arrais J. and Oliveira J. L. (2009). LINK INTEGRATOR - A Link-based Data Integration Architecture. In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2009), ISBN 978-989-674-011-5, pages 274-277. DOI: 10.5220/0002291702740277
