How to Crawl the Web

About the series

Lecturer: Hector Garcia-Molina

A crawler collects large numbers of web pages, to be used for building an index or for data mining. Crawlers consume significant network and computing resources, both at the visited web servers and at the site(s) collecting the pages, and thus it is critical to make them efficient and well behaved. In this talk I will discuss how to build a "good" crawler, addressing questions such as:

How can a crawler gather "important" pages only?

How can a crawler efficiently keep its collection "fresh"?

How can a crawler be parallelized?

I will also summarize results from an experiment conducted on more than half a million web pages over 4 months, to estimate how web pages evolve over time.
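As a rough illustration of the first question above (gathering "important" pages first), the sketch below orders a crawl frontier by the number of in-links discovered so far. It is not taken from the talk: the backlink-count heuristic, the function name `crawl_by_backlinks`, and the in-memory `TOY_WEB` graph are all assumptions made for this example, and a real crawler would fetch pages over the network and respect each server's access rules.

```python
import heapq
from collections import defaultdict

# Hypothetical toy web: each page maps to the pages it links to.
TOY_WEB = {
    "a": ["b", "c", "d"],
    "b": ["d"],
    "c": ["d"],
    "d": ["e"],
    "e": [],
}

def crawl_by_backlinks(link_graph, seed, max_pages):
    """Visit up to max_pages pages, preferring pages with more known in-links.

    Importance here is an assumed heuristic: the count of links to a page
    found on pages crawled so far.
    """
    backlinks = defaultdict(int)  # in-links seen so far, per page
    crawled = []                  # pages in the order they were visited
    done = set()
    frontier = [(0, seed)]        # min-heap of (-backlink_count, page)

    while frontier and len(crawled) < max_pages:
        _, page = heapq.heappop(frontier)
        if page in done:
            continue              # stale heap entry for an already crawled page
        done.add(page)
        crawled.append(page)
        for target in link_graph.get(page, []):
            if target in done:
                continue
            backlinks[target] += 1
            # Push a fresh entry; older entries with lower counts go stale.
            heapq.heappush(frontier, (-backlinks[target], target))
    return crawled

first_three = crawl_by_backlinks(TOY_WEB, "a", 3)
everything = crawl_by_backlinks(TOY_WEB, "a", 10)
```

With a budget of three pages, the sketch reaches "d" (linked from both "a" and "b" by that point) before "c", which is the intended effect of ranking the frontier rather than crawling in plain breadth-first order.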

Hector Garcia-Molina (http://www-db.stanford.edu/people/hector.html) is the Leonard Bosack and Sandra Lerner Professor in the Departments of Computer Science and Electrical Engineering at Stanford University, Stanford, California. From August 1994 to December 1997 he was the Director of the Computer Systems Laboratory at Stanford. From 1979 to 1991 he was on the faculty of the Computer Science Department at Princeton University, Princeton, New Jersey. His research interests include distributed computing systems and database systems. He received a BS in electrical engineering from the Instituto Tecnologico de Monterrey, Mexico, in 1974. From Stanford University he received an MS in electrical engineering in 1975 and a PhD in computer science in 1979. Garcia-Molina is a Fellow of the ACM, received the 1999 ACM SIGMOD Innovations Award, and is a member of the President's Information Technology Advisory Committee (PITAC).

Organization

Directorate for Computer and Information Science and Engineering (CISE)
