Size Matters - Web and Book Archiving

ID: Lesk 2003 PDF: (afstuderen:Lesk 2003 - Size Matters - Web and Book Archiving.pdf|PDF)

===Intro
*Gathering material is cheap; selection is expensive. Search is cheap; manual organization is expensive; time and distance hardly matter. The Internet Archive is pursuing projects based on size, including its full sweeps of the Web back to 1996 and its help with the Million Book project that will put a million older books online.
*Oliver Wendell Holmes wrote: “Every library should try to be complete on something, if it were only the history of pinheads.”
*'''The Web is somewhere around 20-30 terabytes of text, with images probably 4 times as much and the “deep web” (databases found behind web pages, often restricted in access) at perhaps 400 times more [2].'''
*The topics of interest to students are broader than they once were. History in United States colleges once focused on European and American political and economic history; today universities routinely offer (and students enroll in) courses covering all areas of the world and extending across social history, the history of minority groups within countries, and other special topics.
*Santayana’s remark that it doesn’t matter what students read as long as they all read the same thing has vanished from the modern university (although students have more TV programs in common than earlier generations had books in common).
*The modern equivalent of the letters and manuscripts that researchers once sought in archives is perhaps the newsgroup postings and emails that one can find on the Web. How badly off is an undergraduate doing all searches on the Web?
*The publications in the standard online search systems are very specialized. By contrast, the Google ranking tends to promote the more general and introductory items to the top of the list; more detail is available in the thousands of items further down.
*For most of the queries the author tried, an undergraduate looking for introductory material would be better off with the Google results than with the commercial abstracting & indexing services.
*Looking for words like “Kurdistan”, “Tibet”, or “Macedonia” often yields pages posted by political organizations with a clear agenda; similarly, a search for “Creationism” retrieves many one-sided web pages. A naive reader might need some help sorting out what to believe. But on average, it is better to have a very large collection than a carefully selected small one.
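The scale of the size estimates quoted above can be made concrete with a quick calculation. This is a minimal sketch of my own arithmetic on the article's figures; the 25 TB midpoint of the 20-30 TB range is an assumption for illustration.

```python
# Rough scale of the Web circa 2003, using the figures quoted in the article.
# The 25 TB midpoint of the quoted 20-30 TB range is an assumed value.
TEXT_TB = 25                   # surface-web text, in terabytes
images_tb = TEXT_TB * 4        # images: roughly 4x the text
deep_web_tb = TEXT_TB * 400    # "deep web": perhaps 400x the text

print(f"text:     {TEXT_TB} TB")
print(f"images:   {images_tb} TB")
print(f"deep web: {deep_web_tb} TB (~{deep_web_tb / 1000:.0f} PB)")
```

Even at these 2003 figures, the deep web alone would run to roughly 10 petabytes, dwarfing what any crawler-based archive could collect.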

===IMAGE COLLECTIONS
*The “image” features of Web search engines usually use words they find near an image, hoping that these words are descriptive.
*Content-based retrieval relies on features such as color histograms or texture, which can be computed quickly but often don’t let the user express what really matters in the search. Algorithms to analyze shape and extract objects exist but are still being improved and have not in general been made into effective search engines; see Jitendra Malik’s work [4] for some examples of progress in this area.
*'''But there are many more images than words.'''
*Large digitized collections offer the possibility of comparative research that has never before been possible.
*Gutenberg Bibles: the HUMI institute at Keio University is preparing digital versions of each copy of Gutenberg’s work; when complete, this will permit anyone to compare the pages from different copies without traveling at all.
*The Dunhuang team is scanning the cave paintings in Dunhuang; an enormous amount of travel turns into a few mouse clicks. “Virtual reality” is valuable for Dunhuang, since it is difficult to study the paintings in the original caves.
*The Beowulf manuscript in the British Library was damaged by fire in the 18th century; some parts are more readable in computer images made with infrared or UV light. Digital versions of fossils scanned in Texas can be reproduced in larger-than-life size to help students view small fossils, and reproductions can be cut in half to let students see the inside.
*The wide range of material online, in areas where copyright permits, often exceeds what is available from any individual research library.
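The color-histogram matching mentioned above can be sketched in a few lines. This is a toy illustration, not the article's or any search engine's actual implementation: images are represented here as plain lists of (r, g, b) pixels, where a real system would load files with an imaging library.

```python
# Toy sketch of color-histogram image matching, the kind of quickly
# computable feature the article mentions. Images here are just lists
# of (r, g, b) tuples; all names and parameters are illustrative.

def color_histogram(pixels, bins=4):
    """Coarse RGB histogram: each channel quantized into `bins` buckets."""
    hist = [0.0] * (bins ** 3)
    step = 256 // bins
    for r, g, b in pixels:
        idx = (r // step) * bins * bins + (g // step) * bins + (b // step)
        hist[idx] += 1
    total = len(pixels)
    return [h / total for h in hist]   # normalize so image size doesn't matter

def histogram_intersection(h1, h2):
    """Similarity in [0, 1]; 1.0 means identical color distributions."""
    return sum(min(a, b) for a, b in zip(h1, h2))

# Two mostly-red "images" match better than a red/blue pair.
red   = [(200, 10, 10)] * 90 + [(10, 10, 200)] * 10
red2  = [(210, 20, 5)] * 85 + [(10, 200, 10)] * 15
blue  = [(10, 10, 220)] * 100

assert histogram_intersection(color_histogram(red), color_histogram(red2)) > \
       histogram_intersection(color_histogram(red), color_histogram(blue))
```

The sketch also shows the limitation the article points out: the feature is fast, but a user searching for "a red car" cannot express "car" with it, only "mostly red".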

===STORAGE
*Organizational and legal issues: when only one copy of something can supply users all over the world, who should provide it? How do we deal with copyright issues for objects which may have no commercial value but whose copyright holder cannot be found without great administrative expense?
*“Why should our university pay for a resource whose users are outside our community?”
*The [[JSTOR]] project provides libraries with access to back issues of more than 320 scholarly journals, and is now continuing with [[ARTSTOR]] to provide access to over 200,000 images of artworks.
*Wiley’s “Interscience” offers more than 300 online journals, Springer’s “LINK” 400, and Elsevier’s “Science Direct” service 1100. Each of these projects, however, involves carefully selected material (Ian Irvine, a previous head of Elsevier, once said that what wound up on the Web was the stuff rejected by his journals). Perhaps the most important projects in the longer run will be those that do not try to select, but emphasize quantity of material.

===LAW AND ECONOMICS
*The most serious difficulty in many digitization projects is obtaining permission for the conversion or collection of objects under copyright.
*The law protects content for a long time: 95 years in the United States for older books and “life of the author plus 70 years” in Europe.
*The administrative costs of clearing copyright for a large collection may well far exceed the actual payments made to the creators or rights holders.
*The [[Napster]] controversy has made publishers of sound and video extremely reluctant to allow public access to anything, no matter how long past its time of commercial availability. For many old audio works, it is difficult even to find the rights holders. Traditional expectations of libraries conflict with the commercial hopes of the entertainment industry, and negotiations have been difficult.
*If someone proposes helping the Afghan economy by locating a major digital archive there, one reason might be that among the many laws Afghanistan does not have is a copyright law.
*In scholarly circles, funding may be less important than credit; academic authors are not accustomed to being paid for scholarly articles, for example, but do need to consider their tenure and promotion cases. The point is to get credit for the work being done in a world where payment is not likely.

===THE INTERNET ARCHIVE & THE MILLION BOOK PROJECT
*Two projects: the Web archive and the Million Book Project. The first is the basis of the Archive; it is the only source for Web pages back to 1996. The Million Book Project, originated by Prof. Raj Reddy of Carnegie-Mellon University, is an effort to place 1 million books online (books that are out of copyright, or for which permission can be obtained), so that the historic content of libraries will be available to today’s users. Other Internet Archive projects include music, moving images, and children’s books.
*The Web archive is now over 100 terabytes and growing 12 terabytes a month; it includes about 2 billion pages.
*The Internet Archive makes a Web sweep about every two months, taking as many pages as it can. Authors of pages have the right to refuse to have them archived.
*Technical issues for the Archive are dynamic pages, having a mirror site, and speed of sweeping. Dynamic pages are those where copying the static content is not good enough; you have to have a Java interpreter, a Flash plug-in, or other software current as of the date of the page. At the moment, the Archive does not make an effort to preserve browsers for each date.
*The Archive typically gets a new copy of the Web every sixty days, while the average website life is about 100 days, so pages can appear and disappear between sweeps.
*The Archive also does not find text behind internal databases on sites, unless the site has provided an alternate route to those pages. Even for databases that may be kept away from the Archive by accident rather than design, there is no real feeling of obligation to see that web pages are preserved.
*Many public web pages have been withdrawn from the Archive, especially after [[September 11th]].
*We need to appreciate the importance of online resources for our future, with some sort of public expectation that this material will be available. In the past we’ve often missed the permanent value of new technologies or new genres. Only about half the movies made before 1950 survive. We have only snatches of early radio or television. The Web is today’s new creative medium; we need to see that it is saved. We know from experience since 1995 that creators won’t save their own pages. We cannot rely on human judgment to choose a few pages worth saving: (a) it’s too expensive to examine 2 billion web pages by hand, and (b) we can’t know what will be important in the future.
*Libraries must be careful about what they buy; single journals today charge over $16,000 per year.
*A study at Berkeley estimated the total amount of information in the world at two exabytes [13], and much is on analog paper, film, or tape. The Internet Archive has made efforts at other new media, including a Television Archive and music files.
*[[The Million Book Project]] is in the process of scanning one million out-of-copyright books, to add materials of high quality to our online resources. The Web is frequently criticized for the low quality of much of its content; this project would add a million books that once were worth publishing and buying in a library. The scanning is done in India, using equipment and software largely supplied by the US, with books from both countries.
*The scanning is done with Minolta look-down scanners, so that the books are not damaged in the process. Librarians consider this essential if they are to loan books to the project; they don’t welcome suggestions that books could be replaced with reprints.
*Traditional microfilming projects in the US a decade ago sometimes cost $100/book, with one third spent on filming, one third on selection, and one third on administrative costs. The effective cost of scanning in the India project, even valuing the equipment and labor costs not perceived by those providing the books, is below $10/book.
*The size of the project also explains why it is inclusive rather than selective. If we think of US public domain books as those published before 1923, at that time even the largest research libraries would only have had perhaps 2 million books. The project also hopes, using the copyright renewal records, to include many books published before 1964.
*All the books are run through [[OCR software]] ([[Abbyy Finereader]]).
*Again, we need a public understanding of the value of preserving our creative history, and a legal environment to make it possible.
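The figures in this section lend themselves to a back-of-envelope check. The derived quantities below (average archived page size, total cost of a million books at each rate) are my own arithmetic on the values quoted in the article, not numbers the article states.

```python
# Back-of-envelope arithmetic on the figures quoted in the article.
# Derived quantities are illustrative calculations, not quoted values.

# Web archive: ~100 TB holding ~2 billion pages
ARCHIVE_TB = 100
PAGES = 2_000_000_000
avg_page_kb = ARCHIVE_TB * 1e9 / PAGES        # 1 TB = 1e9 KB
print(f"average archived page: ~{avg_page_kb:.0f} KB")

# Million Book Project: microfilming at ~$100/book vs. scanning
# in India at under ~$10/book, over one million books.
BOOKS = 1_000_000
microfilm_total = 100 * BOOKS
scanning_total = 10 * BOOKS                   # upper bound: "below $10/book"
print(f"microfilm: ${microfilm_total:,}  scanning: <${scanning_total:,}")
```

At roughly 50 KB per page and an order-of-magnitude cost difference per book, the numbers support the article's point: at this scale, blanket inclusion is cheaper than paying people to select.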

*We blame our predecessors for not saving things they didn’t consider of sufficient value.

===References
#HOLMES, OLIVER WENDELL, SR. The Poet at the Breakfast Table, VIII. Houghton Mifflin, Boston, 1892 (quoted from the online Project Gutenberg edition).
#BERGMAN, MICHAEL K. The Deep Web: Surfacing Hidden Value. White paper; see also the Journal of Electronic Publishing, August 2001.
*….