TOPIC-SPECIFIC WEB SEARCHING BASED ON A REAL-TEXT DICTIONARY

Ari Pirkola

2012

Abstract

The contributions of this paper are twofold. First, we present a new type of dictionary intended as a search aid in topic-specific Web searching. The method used to construct the dictionary is general and can be applied to any reasonable topic; the first implementation deals with climate change. Compared to standard dictionaries and thesauri, the dictionary has the following new features: (A) It contains real-text phrases (e.g., rising sea levels) in addition to the standard dictionary forms (sea-level rise). The phrases were extracted automatically from pages dealing with climate change and are thus known to appear in pages discussing climate change issues when used as search terms. (B) Synonyms, i.e., different spelling, syntactic, and short-form variants of a phrase, are grouped together into the same entry (synonym set) using approximate string matching. (C) Each phrase is assigned an importance score (IS), which is calculated from the frequencies of the phrase in relevant pages (i.e., pages on climate change) and non-relevant pages. Second, we investigate how effectively the IS indicates the best phrase among synonymous phrases and, more generally, phrases that are effective from the viewpoint of search results. The experimental results showed that the best phrases have higher ISs than the other phrases of a synonym set, and that the higher the IS, the better the search results. The paper also describes the crawler used to fetch the source data for the climate change dictionary and discusses the benefits of using the dictionary in Web searching.
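As an illustration of features (B) and (C), the following minimal Python sketch groups phrase variants by approximate string matching and scores phrases by their frequencies in relevant and non-relevant pages. The abstract does not give the exact similarity measure or the IS formula, so everything here is an assumption made for illustration: the character-bigram Dice coefficient, the threshold of 0.6, the add-one smoothing, the ratio form of the score, and the function names `dice`, `group_synonyms`, and `importance_score`.

```python
from collections import Counter


def bigrams(text):
    """Character bigrams of a phrase, lower-cased, hyphens treated as spaces."""
    s = text.lower().replace("-", " ")
    return Counter(s[i:i + 2] for i in range(len(s) - 1))


def dice(a, b):
    """Dice coefficient over character bigrams (assumed similarity measure)."""
    ba, bb = bigrams(a), bigrams(b)
    overlap = sum((ba & bb).values())  # multiset intersection
    total = sum(ba.values()) + sum(bb.values())
    return 2 * overlap / total if total else 0.0


def group_synonyms(phrases, threshold=0.6):
    """Greedily put each phrase into the first set whose head phrase is similar enough."""
    groups = []
    for phrase in phrases:
        for group in groups:
            if dice(phrase, group[0]) >= threshold:
                group.append(phrase)
                break
        else:
            groups.append([phrase])
    return groups


def importance_score(phrase, relevant_counts, nonrelevant_counts, smoothing=1.0):
    """Hypothetical IS: smoothed relative frequency in relevant pages
    divided by smoothed relative frequency in non-relevant pages."""
    rel_total = sum(relevant_counts.values())
    non_total = sum(nonrelevant_counts.values())
    rel = (relevant_counts.get(phrase, 0) + smoothing) / (rel_total + smoothing)
    non = (nonrelevant_counts.get(phrase, 0) + smoothing) / (non_total + smoothing)
    return rel / non


# Toy counts, invented for the example only.
relevant = {"sea level rise": 9, "home page": 1}
nonrelevant = {"sea level rise": 1, "home page": 9}
synonym_sets = group_synonyms(["sea-level rise", "sea level rise", "carbon tax"])
```

With these toy counts, a phrase that is frequent in relevant pages and rare elsewhere receives a higher score than a generic phrase, which is the behaviour the IS is meant to capture; a greedy single-pass grouping is chosen only for brevity.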



Paper Citation


in Harvard Style

Pirkola A. (2012). TOPIC-SPECIFIC WEB SEARCHING BASED ON A REAL-TEXT DICTIONARY. In Proceedings of the 8th International Conference on Web Information Systems and Technologies - Volume 1: WEBIST, ISBN 978-989-8565-08-2, pages 289-298. DOI: 10.5220/0003895602890298


in Bibtex Style

@conference{webist12,
author={Ari Pirkola},
title={TOPIC-SPECIFIC WEB SEARCHING BASED ON A REAL-TEXT DICTIONARY},
booktitle={Proceedings of the 8th International Conference on Web Information Systems and Technologies - Volume 1: WEBIST},
year={2012},
pages={289-298},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0003895602890298},
isbn={978-989-8565-08-2},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 8th International Conference on Web Information Systems and Technologies - Volume 1: WEBIST
TI - TOPIC-SPECIFIC WEB SEARCHING BASED ON A REAL-TEXT DICTIONARY
SN - 978-989-8565-08-2
AU - Pirkola A.
PY - 2012
SP - 289
EP - 298
DO - 10.5220/0003895602890298

