A Survey of Collaborative Web Search

Through Collaboration among Search Engine Users to More Relevant Results

Pavel Surynek

Faculty of Mathematics and Physics, Charles University in Prague, Malostranské náměstí 25, Prague, Czech Republic

Keywords: Collaborative Web Search, Social Search, Search Engine, Search Results, Collaborative Filtering,

Recommender Systems, System Integration.

Abstract: A survey on collaborative aspects of web search is presented in this paper. Current state in full-text web

search engines with regards on users collaboration is given. The position of the paper is that it is becoming

increasingly important to learn from other users searches in a collaborative way in order to provide more

relevant results and increase benefit from web search sessions. Recommender systems represent a rich

source of concepts that could be employed to enable collaboration in web search. A discussion of techniques

used in recommender systems is followed by a suggestion of integration web search with recommender sys-

tems. An initial experience with web search powering small academic site is reported finally.

1 INTRODUCTION AND

MOTIVATION

Web search is an area of the information technology

industry where artificial intelligence and particularly

knowledge engineering techniques can be applied

with potentially significant impacts. Currently users

face a still increasing amount of data of many kinds

that can be accessed through web (textual data, mul-

timedia data, automatically collected data – CCTV

records, etc...). Organizing and making these infor-

mation accessible represents a continuous challenge

despite the existence of powerful mainstream search

engines like Bing (Microsoft Corp., 2013), Google

(Google Inc., 2013), and Yahoo! (Yahoo! Inc.,

2013), and also national ones like Yandex (Yandex

Corp., 2013), Baidu (Baidu, Inc., 2013), Naver

(NHN Corp., 2013), Seznam (Seznam.cz and

Lukačovič, 2013), and Jyxo (CET21 and Illich,

2013). When looking at these search engines more

closely it can be observed that from the point of

view of software engineering aspect they represent a

pinnacle of technological achievement in software

industry. However, if the point of view of artificial

intelligence is adopted then it cannot be certainly

said if these search engines are intelligent. For ex-

ample little is utilized by search engines from series

of user’s search queries and little support is provided

for cooperation among users. It is quite reasonable

assumption that a series of queries characterize the

effort of what the user want to find better than the

single query. The typical search engine however

does not help in this effort – users are put into isola-

tion typically which precludes any cooperation and

recommendation from other users based on past

queries. To be honest, for instance the Bing search

engine (more correctly the decision engine) uses

certain technology that provide search results based

on user’s search history and geographical location.

This cannot be always said about other search en-

gines. Another relatively weak point of contempo-

rary search engines is that little is done in under-

standing the search query from the perspective of

natural language processing and understanding

(Chakrabarti, 2003).

1.1 Through Collaboration to better

Experience: A Suggestion

A brief survey of techniques used in contemporary

search engines and suggestion of improvements that

can enable a better user’s experience from the search

session are the main parts of the paper. It is suggest-

ed to employ cooperation among users in their

search effort to achieve the goal. The complete us-

er’s interaction with the search engine will be con-

sidered – that is whole user’s search history as well

as the sequence of latest user’s queries will be taken

into account (Pasca and van Durme, 2007).

331

Surynek P..

A Survey of Collaborative Web Search - Through Collaboration among Search Engine Users to More Relevant Results.

DOI: 10.5220/0004621803310336

In Proceedings of the International Conference on Knowledge Engineering and Ontology Development (KEOD-2013), pages 331-336

ISBN: 978-989-8565-81-5

 2013 SCITEPRESS (Science and Technology Publications, Lda.)

1.2 Web Search and Recommender

Systems

Recent advances in the area of recommender sys-

tems (Ricci et al., 2011) indicate that great im-

provement of user’s benefit in finding desirable

items can be achieved through collaboration among

users. Hence it is suggested to integrate techniques

of recommender systems with web search.

Until the recent announcement of Facebook’s

(Facebook Inc., 2013) social search engine (Graph

Search) in January 2013 (Straley, 2013), recom-

mender systems and search engines have been con-

sidered mostly separately. Since social networks

possesses profiles of millions of users it had a great

opportunity to carry out cooperation among users by

recommending results to users (or items, products,

events, etc...) that have been searched by users simi-

lar to them in terms of the profile similarity. It is

expected that the attempt will initiate interest in

integrations of recommender systems and search

engines.

The focus of this work regarding the integration

of web search and recommender systems is different

than that of used in social networks. It is planned to

target research on operators that do not possess any

user profiles which is quite rare anyway. The data

suggest to be exploited is the user’s interaction and

behavior. So there is a technical limitation in this

sense but on the other hand the limitation represents

a research opportunity and also implies wider range

of potential operators of such integrated cooperative

search service. Certain attempts have been already

done in recommending search results using case

based reasoning on user’s search history (Ross and

Wolfram, 2000; Smyth et al., 2011).

2 SURVEY OF THE CURRENT

STATE IN WEB SEARCH

The search engine (Büttcher et al., 2010) can be

regarded from two different aspects – the software

engineering one and the computational intelligence

one. From the perspective of software engineering

the top level design of traditional search engine is

relatively simple; difficulties appear after looking at

individual building blocks.

The top level operational design consists of the

phase of crawling the web in which textual or mul-

timedia documents are retrieved from the web and

stored into the internal storage. Then a phase of

indexing follows immediately after. A so called

index is build in the indexing phase. The index asso-

ciates searched items, that is words in most cases,

with their occurrence in web documents. After the

index is built users can post their textual queries

through a web interface which are then processed on

the server side and searched in the index. Multiple

iterations of crawling and indexing are made typical-

ly to keep the index up to date while the search ser-

vice is provided without any interruption.

2.1 Algorithmic and Software

Engineering Challenges

One of the most important software engineering

challenges appears in building the index. It is very

challenging to design the index structure capable of

holding the associations of searched items for mil-

lions of web documents (consider that Google has

the index for more than 40e+9 pages; in other major

search providers this number is around 20e+9 pages

(de Kunder, 2013)) and to keep the speed of im-

portant operations such as insertion or search in the

index at the acceptable level. This challenge is ad-

dressed mostly by efficient programming and by

distributing the task. The requirements on the disk

capacity to store the index is also a considerable

issue. Various techniques of compression are usually

employed to make the size of the index manageable

(consider that uncompressed index needs many

times larger space than the indexed documents

themselves).

If no kind of load balancing is considered then

the search phase is relatively easy. First, a certain

kind of parsing of the user’s textual query is made.

Then direct searches into the index are performed

and results are presented to the user. If too many

user’s are expected to post queries then the task

needs to be distributed on several machines while

each of them needs to have access to an up to date

index – so certain kind of distribution and/or syn-

chronization of local indexes is imposed by this

phase as well.

Let us note that modern algorithmic techniques

play a significant role in making the search engines

possible. For example modern data structures de-

rived from trie (Baeza-Yates and Gonnet, 1996;

Fredkin, 1960) such as suffix trees (Ukkonen, 1995;

Mansour et al., 2011) or suffix arrays (Dementiev et

al., 2008; Manber and Myers, 1990), are used as

building blocks of index structures. Important bene-

fits for user’s experience during the search engine

session is directly rooted in these data structures –

let us mention approximate string matching (Navar-

KEOD2013-InternationalConferenceonKnowledgeEngineeringandOntologyDevelopment

332

ro, 2001; Navarro et al., 2001) supported by suffix

trees that allows users make typos in their queries.

Another important algorithmic issue that should

be in focus of the search engine designer is the

cache awareness/obliviousness of data structures

(Bender et al., 2005) and algorithms (Frigo et al.,

2012). As the storage for index must be inherently

built in hierarchical manner – from external storage

to CPU cache – this issue is a key for the overall

performance of the search engine. Moreover classi-

cal consideration of cache is reduced on buffering

between the main memory and CPU; in the case of

index storage the situation is more complicated as

there is at least one more layer between main

memory and disk. One more layer is represented by

the distribution in the case of distributed index.

2.2 Computational Intelligence in

Web Search

Into the computational intelligence aspect it is ac-

counted ordering of search results and understanding

user’s search query. It is well known that a break-

through in ordering of search results was made by

the PageRank algorithm (Brin and Page, 1998) that

ranks web page by simulating a so called random

surfers who randomly follows links in web pages or

randomly skips to new web pages. A web page

where many random surfers gather are considered to

be important and results on them appear on the top.

Many other supporting proprietary techniques are

used to improve orderings given by PageRank. It is

necessary to take into account position of searched

term within web documents. Another important

issue which is however out of scope of this study is

security and protection against biasing search results

by generating artificial links – the well known

spamdexing represents one of these infestations.

The understanding of search query has been ad-

dressed in multiple ways. One approach is to analyze

the given textual query by techniques of natural

language processing which includes syntactic and

semantic analysis of the query at the level of sen-

tences (Zhou et al., 2007). The search in the index is

performed not only for terms entered by the user but

also for terms that are semantically related to them.

Semantically related terms are often search in ontol-

ogies.

An interesting example of syntactic analysis of

user queries is represented by the Sherlock Holmes

search engine (commercially known as Morfeo)

(Mareš and Špalek, 2009) which performs Czech

language stemming of searched terms. Similar tech-

niques are implemented in major search engines

mostly for English.

The difficulty of making research in search en-

gines is that it is little publicly known about internal

techniques used by major search providers since this

represents a proprietary know-how and one of the

most valued secrets of each company.

3 A BRIEF SURVEY OF

RECOMMENDER SYSTEMS

The second technology in focus of this paper is rep-

resented by recommender systems (Resnick and

Varian, 1997; Ricci et al., 2011). Great advances

have been made in automated recommending of

various items such as music and movies recently.

The most successful technique for making automat-

ed recommendations is collaborative filtering in

which preferences of many users are processed in

order to make good recommendations for other us-

ers. Knowing what items the user is interested in and

preferences of other users, the system is able to

recommend other items the user may be also inter-

ested in. Contrary to item-based recommendation

(Sarwar et al., 2001), the collaborative filtering is

able to bring novelty from the perspective of the

active user since not only similar items to those

already preferred can be recommended (Melville et

al., 2002) (if items are represented as vectors then

similarity between two items can be calculated as

correlation or as their angle, that is, scalar product).

Recommender systems became widely known

thanks to commercially successful recommender

system of Amazon.com (Linden et al., 2003) and

thanks to Netflix prize (Bell and Koren, 2007). One

of the most successful approaches in recommending

is a so called matrix factorization (Koren et al.,

2009.).

If user preferences are represented in a matrix

where items corresponds to columns and rows to

users then the immediate observation is that this

matrix very sparse. The given sparse matrix can be

submitted to dimensionality reduction methods

(Rennie and Srebro, 2005) that project explicit user

preferences into a different space (of smaller dimen-

sion) where user preferences are represented in more

general categories. If for instance the original matrix

represents preferences of particular movies then the

transformed one can represent preferences of movie

genres. Thus algebraic methods serve here very

efficiently for knowledge mining and knowledge

discovery.

ASurveyofCollaborativeWebSearch-ThroughCollaborationamongSearchEngineUserstoMoreRelevantResults

333

4 COLLABORATIVE WEB

In this preliminary research, it is suggested to inte-

grate techniques from web search with techniques

from recommender systems to enable collaborative

web search implicitly. Integrating of both approach-

es also require to develop new techniques in both

areas inspired by each other. The implicit coopera-

tion means that the user is not required to do any-

thing else than entering search queries into the

search box and picking some of provided results. No

acting in sense of the social networking is expected.

4.1 Research Topics in Collaborative

Web Search

The top level research question is how to make co-

operative recommendations for the user of the web

search engine. This means to suggest what the user

may be interested in from knowing her/his and other

users past short-term and long-term interaction with

the search engine. The aspect of novelty in suggest-

ing interesting items should be also investigated

since the user may not be sure in what she/he is

trying to find and novelty in recommendation could

provide her/him new search directions.

It is expected that combination of feature based

and semantic methods will be used to make efficient

recommendations. Feature based methods include

methods that regard the user’s preference of

searched items as feature vectors and matrices re-

spectively if groups of users are considered, and

these collections of features are processed by alge-

braic and numeric methods such as matrix factoriza-

tion. The semantic approach includes understanding

of searched terms by their meaning in natural lan-

guage and ability to imply what terms are semanti-

cally related. Consider that in the case of search

engine there is a weak association between searched

terms and items (web documents) the user actually

want to find. Hence it is needed to develop tech-

niques that discover these associations automatical-

ly. The source of techniques for this task is machine

learning (Witten and Frank, 2011).

4.2 Technical Issues in Evaluation

Except this top level research direction there are

several minor issues that should be addressed. One

of them is evaluation of suggested techniques. Cur-

rently it is unclear how to evaluate the search engine

in a real-life scenario while scientific attributes of

such evaluation are ensured; that is, repeatability and

reproducibility of results, since interaction with live

users is hard to reproduce. Hence suggestion of

relevant test scenarios and acquisition of real life

data will be also the task within this research topic.

On the other hand web with millions of users repre-

sents a huge pool of opportunities to obtain testing

data and to make experiments.

Another problematic point that is planned to be

addressed is a so called cold start of collaborative

recommendation. If no or few data is collected from

users of particular recommender system then it is

hard to cooperate in suggesting recommendations.

One option plan to overcome this problem is using

semantic information instead of cooperative one and

to gradually increase the cooperative component as

the amount of collected data increases.

5 CONCLUSIONS AND FUTURE

WORK

Two areas are in focus of this paper – web search

and recommender systems. A survey of contempo-

rary mainstream search engines is given while cer-

tain deficiencies rooted in little cooperation among

search engine users are identified. The main sugges-

tion is to improve user’s benefit from search engine

session by enabling collaboration among users to

provide more relevant results. Current techniques

used in recommender systems are surveyed and

integration of web search with suitable recommend-

er techniques is suggested.

It is planned to integrate recommender tech-

niques into our experimental search engine yeRCH.

The search engine is implemented in C++, PHP, and

JavaScript and all the standard search engine fea-

tures have been already implemented. A short pre-

liminary demo of the yeRCH search engine power-

ing full-text search in an academic web site is shown

in the appendix. Data regarding user’s behavior are

being collected and will be utilized in the future

research and development.

ACKNOWLEDGEMENTS

This work is supported by the Charles University in

Prague, Czech Republic within the PRVOUK pro-

ject (section P46) and by the Czech Science Founda-

tion under the contract number GAP103/10/1287.

KEOD2013-InternationalConferenceonKnowledgeEngineeringandOntologyDevelopment

334

I would like to thank anonymous reviewers for

their critical comments and suggestions that will

help me in future research of the topic.

REFERENCES

Linden, G., Smith, B., York, J., 2003. Amazon.com Rec-

ommendations: Item-to-Item Collaborative Filtering.

IEEE Internet Computing, Volume 7 (1), pp. 76-80,

http://www.amazon.com/, IEEE Press.

Baeza-Yates, R. A.; Gonnet, G. H., 1996. Fast text search-

ing for regular expressions or automaton searching on

tries. Journal of the ACM, Volume 43 (6), pp. 915–

936, ACM.

Baidu, Inc., 2013. Baidu Search. http://www.baidu.com/,

China, (Accessed on March 2013).

Bell, R. M., Koren, Y., 2007. Lessons from the Netflix

Prize Challenge. SIGKDD Explorations, Volume 9,

pp. 75-79, ACM.

Bender, M. A., Demaine, E. D., Farach-Colton, M., 2005.

Cache-Oblivious B-Trees. SIAM Journal of Compu-

ting, Volume 35(2), pp. 341-358, ACM.

Brin, S., Page, L., 1998. The anatomy of a large-scale

hypertextual Web search engine. Computer Networks

and ISDN Systems, Volume 30, pp. 107–117, Else-

vier.

Büttcher, S., Clarke, C. L. A., Cormack, G., V., 2010.

Information Retrieval: Implementing and Evaluating

Search Engines. MIT Press.

CET21, Illich, M., 2013. Jyxo search / Yoopy.

http://sluzby.yoopy.cz/, Czech Republic, (Accessed on

March 2013).

Chakrabarti, S., 2003. Mining the web - discovering

knowledge from hypertext data, pp. I-XVIII, 1-345,

Morgan Kaufmann.

Dementiev, R., Kärkkäinen, J., Mehnert, J., Sanders, P.,

2008. Better external memory suffix array construc-

tion. ACM Journal of Experimental Algorithmics,

Volume 12, ACM.

Facebook Inc., 2013. facebook - Connect with friends and

the world around you on Facebook. http://

www.facebook.com, USA, (Accessed on March

2013).

Fredkin, E., 1960. Trie Memory. Communications of the

ACM, Volume 3 (9), pp. 490–499, ACM.

Frigo, M., Leiserson, C. E., Prokop, H., Ramachandran,

S., 2012. Cache-Oblivious Algorithms. ACM Transac-

tions on Algorithms, Volume 8(1), ACM.

Google Inc., 2013. Google Search. http://

www.google.com/, USA, (Accessed on March 2013).

Koren, Y., Bell, R. M., Volinsky, C., 2009. Matrix Factor-

ization Techniques for Recommender Systems. IEEE

Computer, Volume 42 (8), pp. 30-37, IEEE Press.

de Kunder, M., 2013. The size of the World Wide Web.

http://www.worldwidewebsize.com/, Netherlands,

(Accessed on March 2013).

Manber, U., Myers, G., 1990. Suffix arrays: a new method

for on-line string searches. Proceedings of the first

annual ACM-SIAM symposium on Discrete algo-

rithms, pp. 319-327, ACM.

Mansour, E., Allam, A., Skiadopoulos, S., Kalnis, P.,

2011. ERA: Efficient Serial and Parallel Suffix Tree

Construction for Very Long Strings. Proceedings of

the VLDB Endowment, Volume 5 (1), pp. 49–60,

University of Michigan.

Mareš, M., Špalek, R., 2009. Sherlock Holmes Search

Engine. http://www.ucw.cz/holmes/, Czech Republic,

(Accessed on March 2013).

Melville, P., Mooney, R. J., Nagarajan, R., 2002. Content-

Boosted Collaborative Filtering for Improved Recom-

mendations. Proceedings of the 18th National Confer-

ence on Artificial Intelligence (AAAI), pp. 187-192,

AAAI Press.

Microsoft Corp., 2013. Bing Search. http://

www.bing.com, USA, (Accessed on March 2013).

Navarro, G., 2001. A guided tour to approximate string

matching. ACM Computing Surveys, Volume 33 (1),

pp. 31–88, ACM, 2001.

Navarro, G., Baeza-Yates, R. A., Sutinen, E., Tarhio, J.,

2001. Indexing Methods for Approximate String

Matching. IEEE Data Engineering Bulletin 24 (4): pp.

19–27, IEEE Press.

NHN Corp., 2013. Naver Search. http://www.naver.com/,

South Korea, (Accessed on March 2013).

Pasca, M., van Durme, B., 2007. What you seek is what

you get: Extraction of class attributes from query logs.

Proceedings of the 20th International Joint Conference

on Artificial Intelligence (IJCAI), pp. 2832-2837,

IJCAI, 2007.

Resnick, P., Varian, H., 1997. Recommender systems.

Communications of the ACM, Volume 40 (3), pp. 56–

58, ACM.

Rennie, J. D. M., Srebro, N., 2005. Fast Maximum Margin

Matrix Factorization for Collaborative Prediction.

Machine Learning, Proceedings of the 22nd Interna-

tional Conference (ICML 2005), pp. 713-719, ACM

International Conference Proceeding Series 119,

ACM.

Ricci, F., Rokach, L., Shapira, B., Kantor, P.B. (Editors),

2011. Recommender Systems Handbook. Springer

Verlag.

Ross, N., Wolfram, D., 2000. End user searching on the

Internet: An analysis of term pair topics submitted to

the Excite search engine. Journal of the American So-

ciety for Information Science, Volume 51 (10), pp.

949–958, JASIST.

Sarwar, B. M., Karypis, G., Konstan, J. A., Riedl, J., 2001.

Item-based collaborative filtering recommendation al-

gorithms. Proceedings of the 10th International World

Wide Web Conference (WWW 2001), pp. 285-295,

ACM.

Seznam.cz, a. s., Lukačovič, I., 2013. Seznam search.

http://www.seznam.cz/, Czech Republic, (Accessed on

March 2013).

Smyth, B., Freyne, J., Coyle, M., Briggs, P., 2011.

Rec-

ommendation as Collaboration in Web Search. AI

Magazine, Volume 32(3), pp. 35-45, AAAI Press.

ASurveyofCollaborativeWebSearch-ThroughCollaborationamongSearchEngineUserstoMoreRelevantResults

335

Smyth, B., Coyle, M., Briggs, P., 2012. HeyStaks: a real-

world deployment of social search. Proceedings of

Sixth ACM Conference on Recommender Systems

(RecSys 2012), http://www.heystaks.com/, pp. 289-

292, ACM, (Accessed on March 2013).

Straley, B., 2013. Facebook’s Graph Search: the Ultimate

Personalized Discovery Engine? http://

searchenginewatch.com/article/2238590/Facebooks-

Graph-Search-the-Ultimate-Personalized-Discovery-

Engine, Search Engine Watch, January 23, 2013, (Ac-

cessed on March 2013).

Ukkonen, E., 1995. On-line construction of suffix trees.

Algorithmica, Volume 14 (3), pp. 249–260, Springer

Verlag.

Witten, I. H., Frank, E., 2011. Data Mining: Practical

machine learning tools and techniques. Morgan

Kaufmann.

Yahoo! Inc., 2013. Yahoo! Search. http://

www.yahoo.com/, USA, (Accessed on March 2013).

Yandex Corp., 2013. Yandex Search. http://

www.yandex.ru/, Russia, (Accessed on March 2013).

Zhou, Q., Wang, C., Xiong, M., Wang, H., Yu, Y., 2007.

Spark: adapting keyword query to semantic search.

The Semantic Web, pp. 694-707, Springer Verlag,

2007.

APPENDIX

yeRCH Search Engine Experiments

A prototype search engine called yeRCH has been

developing to be able to conduct experiments with

cooperation in web search. yeRCH search engine is

now powering full-text search on the academic web

site of the author’s institute. Data collected from the

search engine operation will be further processed in

order to improve users experience (entered search

terms, clicked through results, etc. are collected).

An initial experiment has been done with the

search engine. If results are summarized, the most

important observation from collected users behav-

iour is that users often use the search to find con-

crete term on the web while few related searches are

invoked. Hence it is considered that adding recom-

mendation of related searches would increase user’s

benefit from the search.

Figure 1: Illustration of using yeRCH search engine pow-

ering full text search on the author’s academic web site.

KEOD2013-InternationalConferenceonKnowledgeEngineeringandOntologyDevelopment

336