Digital Tools and Uses Set

coordinated by
Imad Saleh

Volume 5

Digital Libraries and Crowdsourcing

Mathieu Andro

Wiley

Preface

Instead of outsourcing certain tasks to service providers in countries where labor is cheap, libraries throughout the world are relying more and more on groups of Internet users, making their relationship with users more collaborative. After a conceptual chapter about the consequences of this new economic model for society and for libraries, this book presents an overview of projects in the areas of on-demand digitization, participative correction of OCR, especially in the form of games (gamification), and folksonomy. This panorama leads to an overview of crowdsourcing applied to digitization and digital libraries, and to analyses from the perspective of information and communication sciences.

Acknowledgments

I would like to thank Imad Saleh, Professor at the Paragraphe Laboratory of Paris 8 University, for having agreed to supervise my thesis project, for his kindness and for his advice throughout the entire project; Samuel Szoniecky, Senior Lecturer at the Paragraphe Laboratory of Paris 8 University, for having agreed to be the co-director of my thesis and for having invited me to speak with his students; Ghislaine Chartron, Professor at the National Conservatory of Arts and Crafts, Stéphane Chaudiron, Professor at Charles-de-Gaulle University, Lille 3, Céline Paganelli, Senior Lecturer – HDR (accreditation to supervise research) at Paul Valéry University, Montpellier, and Alain Garnier, CEO of Jamespot and crowdsourcing advisor at the Groupement Français des Industries de l’Information (GFII), for having agreed to be examiners of my thesis; François Houllier, Institut National de la Recherche Agronomique (INRA), for letting me participate alongside him in a task force on citizen science in order to submit a report on the subject at the request of the appropriate ministers; Odile Hologne, from the department of promoting scientific and technical information of the Institut National de la Recherche Agronomique (INRA), for having encouraged experimentation around INRA’s Numalire project within the framework of my work; Filippo Gropallo and Denis Maingreaud, from the company Orange and the company Yabé, for their project Numalire in which they allowed me to participate, and for their collaboration throughout this research project; Marc Maisonneuve and Emmanuelle Asselin, from the consulting firm TOSCA, for their collaboration on the book that we published together on software and platforms for developing digital libraries; Gaëtan Tröger, Ecole des Ponts ParisTech, for his collaboration in the study that we carried out on the visibility and statistics of the consultation of digital libraries; Pauline Rivière, Sainte-Geneviève Library, and Anaïs Dupuy-Olivier, Académie de Médecine, for their collaboration in the feedback on the Numalire experiment that we wrote together; Robert Miller, Internet Archive, for the collaboration that we had at Sainte-Geneviève Library, which became the first library in France to participate in the Internet Archive; Stéphane Ipert, Centre de Conservation du Livre, for the collaborations and interesting discussions that we had; Pierre Beaudoin and Rémi Mathis, previous and current presidents of Wikimedia France, an association with which collaborations with Wikisource were achieved (National Veterinary School of Toulouse in 2008) or only envisioned (Sainte-Geneviève Library); Valérie Chansigaud, science historian and Wikipedia contributor, with whom first contact was established at the museum followed by a pilot experiment in digitization and participative correction of OCR, which was conducted in 2008 at the National Veterinary School of Toulouse; Gilonne d’Origny, from the company ondemandbooks.com, with whom a collaboration on the first installation of an Espresso Book Machine in France was unsuccessful; Daniel Teeter, from the company Amazon, for the interesting opportunity for partnership that was nearly established; Juan Pirlot de Corbion, founder of chapitre.com and YouScribe, for the passionate discussions that we had over the course of our meetings; Daniel Benoilid, founder of the paid crowdsourcing company Foule Factory, for the discussions that we had; Jean-Pierre Gerault, CEO of the company I2S, leader in the area of manufacturing scanners for the digitization of heritage, president of the Comité Richelieu and CEO of Publishroom, for the interesting discussions that we had; Arnaud Beaufort, National Library of France, whom I met during the Wikimedia days at the National Assembly and with whom I then had an interesting conversation; Silvia Gstrein and Veronika Gründhammer, University of Innsbruck, for having invited me to speak at the Ebooks on Demand 2014 conference; Yves Desrichard and Armelle de Boisse, Ecole Nationale Supérieure des Sciences de l’Information et des Bibliothèques (ENSSIB), for having allowed me to speak during the “Quoi de neuf en bibliothèques ?” days over the last five years; Thierry Claerr, Ministry of Culture and Communication, who allowed me to speak regularly at the ENSSIB and sought me out to write a collaborative work, and with whom I had some very enriching discussions; Jean-Marie Feurtet, Agence Bibliographique de l’Enseignement Supérieur (ABES), for our collaboration on a mutualization project of a digital library and for having invited me to speak at the 2011 ABES; Nicolas Turenne, Institut National de la Recherche Agronomique (INRA), for having invited me to show the preliminary results of this work at the seminar entitled “Digital Traces” (Cortext group, Institute for Research and Innovation in Society); Pierre-Benoît Joly, director of the Institute for Research and Innovation in Society (IFRIS), for having invited me to give a master’s level course in Digital Studies and Innovation (NUMI); SNCF for the comfort of the train trips I took while writing this thesis; Google for the Google Drive service, which was used to write the thesis while providing real-time access to it for the director, my collaborators and my contacts who then had the opportunity to add comments; my wife Véronique and my three children Terence, Orégane and Eloïse.

I also want to thank the following people for the constructive comments that they added to the text of the thesis made available in its first draft on Google Drive: Christine Young (proofreading the article in English), Wilfrid Niobet (one idea, eight comments, six corrections), Célya Gruson-Daniel (three comments, four corrections), Olivia Dejean (nine corrections), Michaël Jeulin (seven corrections), Catherine Thiolon (ten comments), Caroline Dandurand (five comments), Diane Le Hénaff (three comments), Sophie Aubin (two comments), Nicolas Ricci (one comment), Pauline Rivière (one comment), Frédérique Bordignon (one comment), Sylvie Cocaud (one comment), Marjolaine Hamelin (one comment), Silvère Hanguehard (one comment), Christine Sireyjol (one comment), Odile Viseux (one comment), Véronique Decognet (one comment), Dominique Fournier (two corrections) and all of the “unknown soldiers” who remained anonymous in their comments (82 corrections).

Mathieu ANDRO
November 2017

Introduction

Libraries already outsource certain tasks involved in entering bibliographic records, cataloguing, indexing or OCR correction to service providers in countries where labor is inexpensive. This outsourcing has remained within a contractual and limited framework and has not profoundly overturned the underlying ways in which libraries work. However, with the development of crowdsourcing, it is possible to imagine outsourcing some of these tasks not to service providers but to “crowds” of Internet users, and therefore having amateurs carry out some of the professionals’ work. Crowdsourcing thus changes the paradigm upon which libraries are based, which now largely centers around the creation and conservation of collections. It also changes the relationship between the service providers, namely the librarians, and their consumers, namely the users, who are also becoming active producers of services. Crowdsourcing could also call into question the collection management policies of libraries, which anticipate need based on a supply that is not directly or immediately determined by demand. This is especially the case with on-demand digitization by crowdfunding, a form of crowdsourcing that calls not on the work of crowds but on their financial resources, and with print on demand, which is inseparable from it. With these on-demand economic models, the collection management policy is ultimately shared with users, who decide what will be digitized and/or printed. In this way, the collections become the work of the users.

This book aims to answer the questions raised by relying on crowdsourcing, for library professionals as well as for students, researchers in information and communication sciences and, more generally, people interested in collective intelligence projects. It is the result of a thesis in information and communication sciences that combines action research, an experiment and an analysis of the literature [AND 16]. The main contributions of this thesis have previously been presented in an article [AND 17].

Beyond the questions of costs/benefits and advantages/disadvantages, the question of how the librarian’s profession might evolve and refocus on its distinctive skills will be addressed. This work also has the scientific goal of contributing to knowledge of crowdsourcing at the theoretical and conceptual level, particularly around economic models.

This work is limited to the application of crowdsourcing in the area of digitization and digital libraries. Since the 1990s, the digitization of documents has been widespread in libraries. Today, with mass digitization and the development of gigantic digital libraries such as Google Books, which has crossed the threshold of 30 million books, Internet Archive, HathiTrust or Europeana, the “harvester” of European digital libraries, it is becoming more and more difficult to identify printed matter that has not been digitized and still deserves to be, among the 130 million¹ existing titles printed since the invention of printing.

A significant part of what has been digitized by libraries has never been put online. This leads to duplicate digitization, and the files are “sleeping” on CD-ROMs, DVDs or external hard drives whose lifetime is limited. The development of a digital library can, in fact, be expensive in terms of software administration and servers, and the result can be disappointing in terms of functionalities, durability, costs and visibility. In 2012, we published a study dedicated to the software programs YooLib (Polinum), Invenio (CERN), ORI-OAI (universities), DSpace (DuraSpace), DigiTool (Ex Libris), Mnesys (Naoned), ContentDM (OCLC), EPrints (University of Southampton), Greenstone (University of Waikato) and Omeka (George Mason University) [AND 12]. In this study, we found that it was more advantageous for libraries to participate in a shared digital library such as Internet Archive, as much from the point of view of costs (free), functions (optical character recognition and conversion into EPUB and MOBI for e-readers, directly implemented on archive.org) and permanent archiving (multiple mirror servers around the world) as from that of visibility. Indeed, the position of a website in the list of Google search results depends on its PageRank, which in turn depends largely on the number of links that point to its domain name. Under these conditions, a digital library with a large amount of content will generally have a better PageRank and better visibility on the web, and will therefore generate much more web traffic than a small digital library with very little content.
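To illustrate the visibility argument, the following is a minimal sketch of a simplified PageRank computed by power iteration. It is not Google’s actual ranking, which combines many other signals. The damping factor of 0.85, the function name and the small link graph of five sites are all assumptions chosen for illustration only.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration (illustrative sketch).

    links maps each site to the list of sites it links to.
    """
    # Collect every site that appears as a source or as a target.
    sites = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(sites)
    rank = {site: 1.0 / n for site in sites}

    for _ in range(iterations):
        # Every site receives a small baseline score.
        new_rank = {site: (1.0 - damping) / n for site in sites}
        for source in sites:
            targets = links.get(source, [])
            if targets:
                # A site shares its score equally among the sites it links to.
                share = damping * rank[source] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A site without outgoing links spreads its score over all sites.
                share = damping * rank[source] / n
                for site in sites:
                    new_rank[site] += share
        rank = new_rank
    return rank


# Hypothetical link graph: the shared library receives many inbound links,
# the small institutional library receives only one.
links = {
    "blog_a": ["shared_library"],
    "blog_b": ["shared_library"],
    "university": ["shared_library", "small_library"],
    "shared_library": ["university"],
    "small_library": [],
}

ranks = pagerank(links)
for site, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{site}: {score:.3f}")
```

In this toy graph, the shared library, which receives more inbound links, ends up with a higher score than the small library, which is the mechanism the paragraph above relies on.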

As Waibel [WAI 08] maintains, two schools of thought exist: an old school that believes that each library needs to create its own digital library and attempt to attract Internet users to it, and a new school that instead believes that, in order to go beyond institutional communication and better satisfy the needs of Internet users, libraries would be better off participating in the collective digital libraries already visited by Internet users, such as Internet Archive or even Flickr. This is also our point of view. With enough web traffic, libraries may prompt the participation of Internet users.

The introductory part of the book attempts to articulate its context and the methodology that was used.

Chapter 1 addresses the philosophical, political and economic representations of crowdsourcing and its consequences regarding the way in which libraries function. This conceptual chapter contains, in particular:

Chapter 2 contains a selection of projects through types of tasks including:

This chapter contains data and information collected from the literature for each project.

Original analyses for each major type of project are given in the conclusion of Chapter 2.

In Chapter 3, analyses from the point of view of information and communication sciences and a state of the art are offered with, notably: