Forwarding has the disadvantage that papers are occasionally no longer available at the time of the recommendation because they were removed from the original web server. (Figure 1: academic PDFs, annotations, and references.) Due to space restrictions, the following sections provide only an overview of the most important data, particularly with regard to the randomly chosen variables. Datasets are available in several recommendation domains, including movies, music, and baby names. From Lucene’s top 50 search results, a set of ten papers is randomly selected as recommendations. The click-through rate (CTR) expresses the ratio of clicked to received recommendations. These papers are recommended with the stereotype approach, which is later explained in detail.
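The CTR definition above can be written out as a small calculation. This is a minimal sketch, not code from Docear: the function name `click_through_rate` and its parameters are hypothetical, and returning 0.0 when nothing was delivered is an assumption made for illustration.

```python
def click_through_rate(delivered: int, clicked: int) -> float:
    """Return the share of delivered recommendations that were clicked.

    Assumption: a CTR of 0.0 is reported when no recommendations were delivered.
    """
    if delivered < 0 or clicked < 0:
        raise ValueError("counts must be non-negative")
    if clicked > delivered:
        raise ValueError("clicked cannot exceed delivered")
    if delivered == 0:
        return 0.0
    return clicked / delivered


if __name__ == "__main__":
    # Hypothetical numbers: 1,000 recommendations delivered, 60 of them clicked.
    print(f"CTR = {click_through_rate(1000, 60):.1%}")
```

Comparing the CTR of two algorithms is only meaningful when both were measured on delivered recommendations in the same way.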

Table 1: Requests for Docear’s web service. Every five minutes, or when Docear starts, Docear sends all mind-maps located in the [...]. The citation extraction is also conducted with ParsCit, which we modified to identify the citation position within a text. Users can rate each recommendation set on a scale of one [...] create several categories [...]. Most of the previously published architectures are rather brief, and architectures such as those of bX and BibTip focus on co-occurrence-based recommendations. Third, we want to provide real-world data to researchers who have no access to such data.

However, caching PDFs and offering them directly from Docear’s servers might have led to problems with the papers’ copyright holders.

The Architecture and Datasets of Docear’s Research Paper Recommender System

[...] recommendation candidates. Due to copyright reasons, the full texts of the articles are not included in the dataset. The paper IDs in mindmaps-papers [...].

Introducing Docear’s research paper recommender system

In this case, Docear automatically creates a user account with a randomly selected user name that is tied to the user’s computer. [...] PDF processing [...], and this model is sent as a search query to Lucene. These limitations were made to ensure the privacy of our users. Docear’s architecture and datasets ease the process of designing one’s own system, estimating the required development times, determining the required hardware resources to run the system, and crawling full-text papers to use as recommendation candidates.
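To illustrate how a user model might be sent to Lucene as a search query, the sketch below builds a simple term-frequency model from mind-map node texts and serializes it using the `term^boost` syntax of Lucene's classic query parser. It is not Docear's implementation: the function names, the plain term-frequency weighting, the `max_terms` cut-off, and the sample node texts are all assumptions, and stop-word removal is omitted for brevity.

```python
import re
from collections import Counter
from typing import List, Tuple


def build_user_model(node_texts: List[str], max_terms: int = 5) -> List[Tuple[str, int]]:
    """Return the most frequent terms in the given mind-map node texts.

    Assumption: plain term frequency is used as the weight.
    """
    tokens: List[str] = []
    for text in node_texts:
        # Keep lower-cased alphanumeric tokens only, so no query escaping is needed.
        tokens.extend(re.findall(r"[a-z0-9]+", text.lower()))
    return Counter(tokens).most_common(max_terms)


def to_lucene_query(model: List[Tuple[str, int]]) -> str:
    """Serialize the model as a boosted query string (term^boost term^boost ...)."""
    return " ".join(f"{term}^{weight}" for term, weight in model)


if __name__ == "__main__":
    # Hypothetical node texts taken from a user's mind-map.
    nodes = ["recommender systems", "research paper recommender", "paper recommender"]
    print(to_lucene_query(build_user_model(nodes, max_terms=2)))
```

Lower-casing the tokens also keeps them from being read as the query parser's upper-case operators (`AND`, `OR`, `NOT`).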


This includes a list of all the mind-[...] hours for the recommender system. There is a large variety in the algorithms. Choosing papers randomly from the top 50 results decreases the overall relevance of the delivered recommendations, yet it increases the variety of recommendations and allows analyzing how relevant Lucene’s search results are at different ranks.
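The random selection from the top 50 results can be sketched as follows. This is an illustrative sketch rather than Docear's code: the function name `pick_recommendations`, its parameters, and the sample paper IDs are hypothetical. Each selected paper keeps its original rank so that relevance can later be analyzed per rank.

```python
import random
from typing import List, Optional, Tuple


def pick_recommendations(
    ranked_ids: List[str],
    pool_size: int = 50,
    set_size: int = 10,
    rng: Optional[random.Random] = None,
) -> List[Tuple[int, str]]:
    """Randomly pick `set_size` papers from the top `pool_size` search results.

    Returns (original_rank, paper_id) pairs; ranks start at 1.
    """
    rng = rng or random.Random()
    pool = list(enumerate(ranked_ids[:pool_size], start=1))
    # Sample without replacement; shrink the set if fewer results are available.
    return rng.sample(pool, min(set_size, len(pool)))


if __name__ == "__main__":
    # Hypothetical ranked result list with 100 entries.
    results = [f"paper-{i}" for i in range(1, 101)]
    for rank, paper_id in pick_recommendations(results, rng=random.Random(42)):
        print(rank, paper_id)
```

Passing a seeded `random.Random` makes a selection reproducible, which is convenient for testing; in production an unseeded generator would normally be used.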

These 50, libraries contain 4. CiteULike and Bibsonomy published datasets containing the social tags that their users added to research articles.


He is interested in literature recommender systems, search engines, and human-computer interaction. Hence, the architecture should provide a good introduction for new researchers and developers on how to build a research paper recommender system.

Converting in-text citations to Docear-IDs enables searching with Lucene for documents that cite a certain paper. This is of particular importance, since the majority of researchers in the field of research paper recommender systems have no access to real-world recommender systems [11]. This architecture focuses on recording, processing, and exchanging scholarly usage data.

Information about revisions of [...]. The file papers[...]. The mind-map dataset is smaller than the dataset [...].

Each article has a unique document id, a title, a cleantitle, and [...]. Docear’s recommender system applies two recommendation approaches, namely stereotype recommendations and content-based filtering (CBF). The datasets are also unique. Downloading the full text is easily possible, since [...] the spider found on the web (see 5).


This means, on average, each user has linked or [...] 92 documents in his [...]. The developers of BibTip [28] also published an architecture that is similar to the architecture of bX (both bX and BibTip utilize usage data to generate recommendations). Third parties could use the Web Service, for instance, to request recommendations for a particular Docear user and to use the [...]. The recommender dataset splits into two files.

Other labels such as “Research papers (Sponsored)” indicate [...] modeling algorithm each time recommendations are generated.


Another algorithm might utilize all the terms from the two most recently created mind-maps, weight terms based on [...]. [...] statistics, such as the time when the user clicked the [...]. All datasets are available here.


The system stores for which user the recommendations were generated, by which algorithm, as well as some statistical information such as the time required to generate recommendations and the original Lucene ranking.
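One way to picture the information stored per recommendation set is a small record type. This is a sketch under stated assumptions: the class and field names (`RecommendationSetLog`, `generation_time_ms`, `original_lucene_rank`, and so on) and the sample values are hypothetical and do not reproduce Docear's actual database schema.

```python
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class DeliveredRecommendation:
    paper_id: str
    original_lucene_rank: int  # rank of the paper in Lucene's result list


@dataclass
class RecommendationSetLog:
    user_id: str             # user for whom the set was generated
    algorithm: str           # label of the algorithm that produced the set
    generation_time_ms: int  # time required to generate the recommendations
    recommendations: List[DeliveredRecommendation] = field(default_factory=list)


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    log = RecommendationSetLog(
        user_id="user-123",
        algorithm="cbf-terms",
        generation_time_ms=840,
        recommendations=[DeliveredRecommendation("paper-17", 17)],
    )
    print(asdict(log))
```

Keeping the original Lucene rank next to each delivered paper is what makes a later per-rank analysis possible.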

Bela Gipp is currently pursuing a post-doctoral fellowship at the National Institute of Informatics in Tokyo. The dataset also allows analyses of the use of reference managers, for instance, how intensively researchers use Docear.
