PhD Thesis

The Web is increasingly important for all aspects of our society, culture and economy. Web archiving is the process of gathering digital materials from the Web, ingesting them, ensuring that they are preserved in an archive, and making them available for future use and research. Web archiving is a difficult problem for both organizational and technical reasons. We focus on the technical aspects of Web archiving.

In this dissertation, we concentrate on improving the data acquisition aspect of the Web archiving process. We establish the notion of Website Archivability (WA) and introduce the Credible Live Evaluation of Archive Readiness Plus (CLEAR+) method to measure WA for any website. We propose new algorithms to optimise Web crawling using near-duplicate detection and webgraph cycle detection, which also resolve the problem of web spider traps. We then argue that different types of websites demand different Web archiving approaches. We focus on social media, and more specifically on weblogs. We introduce weblog archiving as a special type of Web archiving and present our findings and developments in this area: a technical survey of the blogosphere, a scalable approach to harvesting modern weblogs, and an integrated approach to preserving weblogs using a digital repository system.

Keywords: Web Archiving, Web Crawling, Web Analytics, Webgraphs, Weblogs, Digital Repositories.

The full text of the PhD thesis "Web Crawling, Analysis and Archiving" is available from the National Archive of PhD Theses.