Friday, May 6, 2016

Implementing a web crawler


First of all, I want to separate my thoughts about the problems a crawler developer will face from parser implementation details. So this article will be language agnostic, and I want to talk about the cornerstones of web crawling.

From a distance, crawler programming seems to be a very simple task. Indeed, at first glance you only have to perform the following simple steps:

1. download a page from some URL,
2. analyze its content (payload),
3. extract the links,
4. repeat steps 1 to 3 for each extracted link.
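
Put together naively in F#, these four steps might look like the sketch below. The regex-based link extraction and the blocking download are deliberately crude, and the sketch assumes every extracted link is an absolute URL – all of these problems are discussed in the rest of the article.

```fsharp
open System.Collections.Generic
open System.Net.Http
open System.Text.RegularExpressions

let client = new HttpClient()

// Crude link extraction; a real crawler would use an HTML parser (see section 3).
let extractLinks (html: string) =
    Regex.Matches(html, "href=\"([^\"]+)\"")
    |> Seq.cast<Match>
    |> Seq.map (fun m -> m.Groups.[1].Value)

let rec crawl (visited: HashSet<string>) (url: string) =
    if visited.Add(url) then
        let html = client.GetStringAsync(url).Result   // 1. download the page
        // 2. analyze the payload here
        extractLinks html                              // 3. extract the links
        |> Seq.iter (crawl visited)                    // 4. repeat for each link
```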

However, this simplicity is illusory, and I want to explain why. There are a lot of nuances, and I will try to cover the most important ones.

1. Asynchronous processing.

If you handle each link sequentially, your parser will be very slow, because it will have to wait until each page's content is loaded. We should design for asynchrony from scratch – handle other pages while the current page is being loaded. Note that processing should be asynchronous even if link processing is parallel and executed by multiple threads – we should not block the execution flow by waiting for a server response. Moreover, if you implement asynchronous link processing, you may not even need to complicate the architecture further with parallel processing (depending on your particular case, of course).
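
As an illustration, here is a minimal F# sketch of non-blocking downloads built on async workflows; HttpClient is just one possible choice of HTTP client, not a requirement.

```fsharp
open System.Net.Http

let client = new HttpClient()

// Download a page without blocking the calling thread.
let downloadAsync (url: string) = async {
    // Await the response asynchronously instead of blocking on .Result.
    let! html = client.GetStringAsync(url) |> Async.AwaitTask
    return url, html
}

// Start a batch of downloads concurrently: while one response is in flight,
// the other downloads (and the link processing) keep making progress.
let downloadAllAsync (urls: string list) =
    urls
    |> List.map downloadAsync
    |> Async.Parallel

// downloadAllAsync ["http://example.com"; "http://example.org"]
// |> Async.RunSynchronously
```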

2. Delays between requests.

A lot of servers have defense mechanisms against undesirable activity like DDoS, and in many places visits by bots are simply forbidden. Also, with asynchronous crawling the number of requests per second increases drastically compared to the synchronous version. Your parser should not heavily load the servers you crawl, so we need to introduce some delay between requests to prevent being banned.
However, you should keep in mind that a page can contain links both to the same domain and to other domains. Obviously, we do not need a delay when loading a link from another domain – the crawler will work faster without it. Remember that we crawl pages asynchronously. Imagine that two pages from domain 'a' have been downloaded and handled simultaneously, and each of these pages contains a link to domain 'b'. If we just load these two links without a delay, it would amount to simultaneous requests to domain 'b' – the very situation we are trying to avoid. So we should invent a smart way of tracking requests to different domains.
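
One possible way to do such tracking is to reserve a time slot per domain before every request. The sketch below keeps a shared dictionary of "next allowed time" per host; the one-second delay is just an assumed value.

```fsharp
open System
open System.Collections.Generic

// Minimum pause between two requests to the same domain (an assumed value).
let delayPerDomain = TimeSpan.FromSeconds(1.0)

// The next moment at which each domain may be requested again, keyed by host.
let nextAllowed = Dictionary<string, DateTime>()
let gate = obj ()

// Reserve a time slot for the URL's host and sleep until that slot arrives.
// Because the reservation is made under a lock, two pages that discover links
// to the same domain at the same time will still be spaced out.
let waitForTurn (url: string) = async {
    let host = Uri(url).Host
    let slot =
        lock gate (fun () ->
            let now = DateTime.UtcNow
            let earliest =
                match nextAllowed.TryGetValue(host) with
                | true, t when t > now -> t
                | _ -> now
            nextAllowed.[host] <- earliest + delayPerDomain
            earliest)
    let pause = slot - DateTime.UtcNow
    if pause > TimeSpan.Zero then
        do! Async.Sleep(int pause.TotalMilliseconds)
}
```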

3. Links extraction and handling.

The task of link extraction and handling is also non-trivial. First of all, we need a good HTML parsing library to extract the links – that is the first task. The second task is to prepare the links for visiting. Let me describe it in detail.

In general, extracted links can be divided into four categories:

a. absolute URLs
b. relative URLs
c. protocol relative URLs
d. invalid URLs

We can handle the first category simply – just load the page by the link's URL without changing it. To handle the second type of links, we should take the parent URL (the URL of the page currently being handled), build an absolute URL from the relative and parent URLs, and then load the page by the resulting URL. For protocol-relative URLs (URLs starting with two slashes, //) we should determine which protocol is currently in use and construct the absolute URL from it. As for invalid URLs – we will consider a URL invalid if we cannot assign it to any of the three groups above. For example, it can be an anchor (<a href="#paragraph1">Test link</a>). Of course, we will not visit such URLs.
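
Here is a small F# sketch of this classification built on System.Uri; restricting the result to http/https is my own assumption, adjust it to your needs.

```fsharp
open System

// Classify an extracted href and turn it into an absolute URL we can visit.
// Returns None for links we should not follow at all.
let resolveLink (parentUrl: string) (href: string) : string option =
    let parent = Uri(parentUrl)
    if String.IsNullOrWhiteSpace(href) || href.StartsWith("#") then
        None                                              // d. invalid (e.g. a bare anchor)
    elif href.StartsWith("//") then
        Some (parent.Scheme + ":" + href)                 // c. protocol-relative: reuse the parent's scheme
    else
        // Uri.TryCreate with a base URI covers both absolute hrefs (a) and
        // relative ones (b): an absolute href simply ignores the base.
        match Uri.TryCreate(parent, href) with
        | true, uri when uri.Scheme = Uri.UriSchemeHttp || uri.Scheme = Uri.UriSchemeHttps ->
            Some (string uri)
        | _ -> None                                       // d. invalid, or a scheme we do not crawl

// resolveLink "https://example.com/docs/" "page.html"
// evaluates to Some "https://example.com/docs/page.html"
```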

4. Extracting text to analyze.

Obviously, we crawl in order to analyze some text on the pages, so we should find an effective way to separate the text from the HTML tags. There are libraries for many languages/technologies that allow you to traverse the HTML tree and extract the inner text of each node – for example, the HTML Agility Pack for all .NET languages, or the F# Data library for F#. You can see the implementation of this example in my previous topic.
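
As a minimal illustration (not the code from the previous topic), text and links can be extracted with the F# Data HTML parser roughly like this:

```fsharp
// In an F# script: reference the FSharp.Data package first.
#r "nuget: FSharp.Data"
open FSharp.Data

let doc = HtmlDocument.Parse """<html><body><p>Hello <a href="/next">world</a></p></body></html>"""

// Visible text of the page with the tags stripped.
let text = doc.Body().InnerText()

// Raw href values, ready to be resolved as described in the previous section.
let links =
    doc.Descendants ["a"]
    |> Seq.choose (fun a -> a.TryGetAttribute("href"))
    |> Seq.map (fun attr -> attr.Value())
```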

5. URL filtering and cancellation.

We should define which URLs we do or do not want to analyze. We should also define the cancellation criteria – the conditions under which crawling should stop.
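
For example, the filter can be a simple predicate and cancellation can be driven by a CancellationTokenSource; the allowed hosts and the page limit below are placeholders.

```fsharp
open System
open System.Threading

// Example filter: only follow links inside the domains we care about
// and skip obvious binary resources (both rules are placeholders).
let allowedHosts = set ["example.com"; "blog.example.com"]

let shouldVisit (url: string) =
    let uri = Uri(url)
    allowedHosts.Contains(uri.Host)
    && not (uri.AbsolutePath.EndsWith(".jpg") || uri.AbsolutePath.EndsWith(".pdf"))

// Example cancellation criterion: stop after a fixed number of visited pages
// by cancelling the token that the crawling workflows were started with.
let cts = new CancellationTokenSource()
let maxPages = 1000
let counterLock = obj ()
let mutable visitedCount = 0

let registerVisit () =
    lock counterLock (fun () ->
        visitedCount <- visitedCount + 1
        if visitedCount >= maxPages then cts.Cancel())

// Async.Start(crawlWorkflow, cancellationToken = cts.Token)  // crawlWorkflow is hypothetical
```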

6. Exception handling.

It is very hard to foresee all corner cases, so we should carefully catch exceptions and analyze them to improve the parser. The simplest way to handle exceptions is to add logging to your crawler. Of course, it is better if exceptions are handled asynchronously, without causing delays to the crawler's work.
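
As one possible approach in F#, failures can be caught per download and posted to a logging agent, so writing the log never delays the crawl itself.

```fsharp
open System
open System.Net.Http

let client = new HttpClient()

// A very small logging agent: exceptions are posted to it and written out
// on a separate logical thread, so logging does not block the crawl.
let logger = MailboxProcessor<string>.Start(fun inbox -> async {
    while true do
        let! message = inbox.Receive()
        printfn "[%O] %s" DateTime.UtcNow message
})

// Wrap a single download so that any failure is logged and the crawl goes on.
let tryDownload (url: string) = async {
    try
        let! html = client.GetStringAsync(url) |> Async.AwaitTask
        return Some html
    with ex ->
        logger.Post(sprintf "Failed to load %s: %s" url ex.Message)
        return None
}
```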

7. User agent.

Some servers block requests that come without a user agent, so it is better to provide a user agent for the crawler's requests.
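
With HttpClient this is a one-time setup; the bot name and URL below are, of course, placeholders.

```fsharp
open System.Net.Http

// A single shared HttpClient whose requests identify the crawler.
let client = new HttpClient()
client.DefaultRequestHeaders.UserAgent.ParseAdd("MyCrawler/1.0 (+http://example.com/bot-info)")
```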
