The previous article was dedicated to describing the important aspects of crawler programming. In this article we will implement a web crawler in F# that takes these aspects into account.
First, let's implement an auxiliary module that will help us handle links. It contains two functions that we will need later.

1. let prepareUrl url (parentUrl: string) — this function prepares a link URL for navigation. As I mentioned in the previous article, there are different kinds of URLs (absolute, relative, protocol-relative and invalid), so it is convenient to define a special active pattern for these cases for later use.
let (|AbsoluteUrl|RelativeUrl|ProtocolRelativeUrl|InvalidUrl|) (input: string) =
    if input.StartsWith("http://") || input.StartsWith("https://") then AbsoluteUrl
    else if input.StartsWith("//") then ProtocolRelativeUrl
    else if not (String.IsNullOrWhiteSpace input) then RelativeUrl
    else InvalidUrl
let (|StartsWithSlash|_|) (input: string) =
    if input.StartsWith("/") then Some StartsWithSlash else None

let (|EndsWithSlash|_|) (input: string) =
    if input.EndsWith("/") then Some EndsWithSlash else None
let prepareUrl url (parentUrl: string) =
    match url with
    | AbsoluteUrl -> Some url
    | ProtocolRelativeUrl ->
        let protocol = parentUrl.Substring(0, parentUrl.IndexOf("//"))
        Some (protocol + url)
    | RelativeUrl ->
        let preparedParentUrl =
            match parentUrl with
            | EndsWithSlash -> parentUrl.Substring(0, parentUrl.Length - 1)
            | _ -> parentUrl
        let preparedUrl =
            match url with
            | StartsWithSlash -> url
            | _ -> "/" + url
        Some (preparedParentUrl + preparedUrl)
    | InvalidUrl -> None
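To make the behavior concrete, here are a few illustrative calls (the URLs are hypothetical; the results assume the module defined in this article):

// Absolute URL is returned as-is.
LinksHandler.prepareUrl "https://example.com/a" "https://example.com"
// Some "https://example.com/a"

// Protocol-relative URL: the protocol is taken from the parent URL.
LinksHandler.prepareUrl "//cdn.example.com/img.png" "https://example.com"
// Some "https://cdn.example.com/img.png"

// Relative URL is appended to the parent URL.
LinksHandler.prepareUrl "wiki/F_Sharp" "https://en.wikipedia.org/"
// Some "https://en.wikipedia.org/wiki/F_Sharp"

// Empty or whitespace link is invalid.
LinksHandler.prepareUrl "" "https://example.com"
// None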
2. let getDomainName url — this function extracts the domain name from a URL and strips the leading "www." prefix, so that links to the same site with and without "www." are treated as one domain.

let getDomainName url: string =
    let uri = new Uri(url)
    let host = uri.Host
    if host.StartsWith("www.") then host.Substring(4) else host
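For example (hypothetical URLs):

LinksHandler.getDomainName "https://www.example.com/some/page"    // "example.com"
LinksHandler.getDomainName "https://en.wikipedia.org/wiki/F_Sharp" // "en.wikipedia.org"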
Here is the full source code of the LinksHandler module.

namespace Crawler

open System

module LinksHandler =

    let (|AbsoluteUrl|RelativeUrl|ProtocolRelativeUrl|InvalidUrl|) (input: string) =
        if input.StartsWith("http://") || input.StartsWith("https://") then AbsoluteUrl
        else if input.StartsWith("//") then ProtocolRelativeUrl
        else if not (String.IsNullOrWhiteSpace input) then RelativeUrl
        else InvalidUrl

    let (|StartsWithSlash|_|) (input: string) =
        if input.StartsWith("/") then Some StartsWithSlash else None

    let (|EndsWithSlash|_|) (input: string) =
        if input.EndsWith("/") then Some EndsWithSlash else None

    let prepareUrl url (parentUrl: string) =
        match url with
        | AbsoluteUrl -> Some url
        | ProtocolRelativeUrl ->
            let protocol = parentUrl.Substring(0, parentUrl.IndexOf("//"))
            Some (protocol + url)
        | RelativeUrl ->
            let preparedParentUrl =
                match parentUrl with
                | EndsWithSlash -> parentUrl.Substring(0, parentUrl.Length - 1)
                | _ -> parentUrl
            let preparedUrl =
                match url with
                | StartsWithSlash -> url
                | _ -> "/" + url
            Some (preparedParentUrl + preparedUrl)
        | InvalidUrl -> None

    let getDomainName url: string =
        let uri = new Uri(url)
        let host = uri.Host
        if host.StartsWith("www.") then host.Substring(4) else host
Now let's define the parser options.

[<Measure>] type ms

type ParserOptions =
    interface
        abstract SleepTimeInsideDomain: int<ms> with get, set
        abstract StartUrl : string with get, set
        abstract CancellationTokenSource : CancellationTokenSource with get, set
        abstract ReturnPureText: bool with get, set
        abstract GetStartNode: HtmlDocument -> HtmlNode
        abstract HandleUrl : url: string * content: string -> unit
        abstract HandleException : ex: Exception * url: string -> unit
        abstract FilterUrl: url: string -> bool
    end

The ParserOptions type is used to customize the crawler's behavior. Below is a description of the available options.
1. SleepTimeInsideDomain — the delay in milliseconds between requests to the same domain. The measure type ms is used to express the value in milliseconds.
2. StartUrl — the URL we start crawling from.
3. CancellationTokenSource — used to cancel crawling. We pass it to the crawler through the options, and when we decide to stop the crawl we simply call the CancellationTokenSource.Cancel() method (a small sketch of external cancellation follows this list).
4. ReturnPureText — depending on the specifics of our analysis we may need either the full HTML code or the extracted plain text. If we set it to true, the returned text will be stripped of HTML markup.
5. GetStartNode — usually used when you crawl a specific site and want to analyze only a specific block of the page. For example, you crawl Wikipedia and want to analyze only the content, without headers and sidebars.
6. HandleUrl — the payload function that is called after a page is downloaded; for example, you can save the text content to a database. It receives the page URL and content as parameters.
7. HandleException — an exception handler.
8. FilterUrl — a predicate that decides which URLs may or may not be visited.
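As a small illustration of the cancellation option, the crawl can also be stopped from outside the crawler, for example after a fixed time. This is a hypothetical sketch, not part of the crawler itself; it assumes the ParserOptions type defined above is in scope:

// Hypothetical helper: cancel the crawl after timeoutMs milliseconds.
// `options` is any ParserOptions implementation; Cancel() stops the crawler's agent.
let cancelAfter (options: ParserOptions) (timeoutMs: int) =
    async {
        do! Async.Sleep timeoutMs
        options.CancellationTokenSource.Cancel()
    }
    |> Async.Start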
We will use F# agents (MailboxProcessor) as the foundation of the crawler. Agents let us manage shared state easily and efficiently: we need to keep track of visited links, so that the same page is not visited multiple times, and of the last access time per domain, so that delays between requests to the same domain are respected.
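For readers unfamiliar with agents, here is a minimal, self-contained sketch of the pattern the crawler relies on: the state lives only inside the recursive receive loop and is passed to the next iteration, so no locks are needed. The names and messages are illustrative, not part of the crawler:

// Minimal agent sketch: the set of already seen messages is threaded through
// the recursive loop instead of being stored in a mutable shared variable.
let agent =
    MailboxProcessor.Start(fun inbox ->
        let rec loop (seen: Set<string>) =
            async {
                let! msg = inbox.Receive()
                if not (seen.Contains msg) then
                    printfn "first time seeing %s" msg
                return! loop (seen.Add msg)
            }
        loop Set.empty)

agent.Post "https://example.com"
agent.Post "https://example.com"   // the second post is recognized as already seen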
First, when a message is received by the agent, the link from the message is added to the set of visited links (since F# sets are immutable, the updated set is the one passed to the recursive call at the end of processing):
visitedLinks.Add(preparedUrl) |> ignore

Second, we should determine the delay time, based on the link's domain and the recorded access time for that domain. If the domain has not been registered yet, we set the delay to 0, because we have not visited this domain before, and register it with the current time. If we have already visited this domain, we calculate the time span between the current time and the domain's last access time. If it is greater than the SleepTimeInsideDomain option value, the delay is also set to 0. If it is less, the delay is calculated as (last domain access time + SleepTimeInsideDomain - current time). For example, with SleepTimeInsideDomain = 3000 ms, if the domain was last accessed 1200 ms ago, the next request to it is delayed by 1800 ms.
let (accessTimeByDomains, sleepTime) =
    if not (accessTimeByDomains.ContainsKey(currentLinkDomain)) then
        accessTimeByDomains.Add(currentLinkDomain, DateTime.Now), 0
    else
        let lastDomainAccessTime = accessTimeByDomains.[currentLinkDomain]
        let minAllowedAccessTime = lastDomainAccessTime.AddMilliseconds(float parserArguments.SleepTimeInsideDomain)
        let now = DateTime.Now
        let delay =
            if minAllowedAccessTime > now then (minAllowedAccessTime - now).TotalMilliseconds
            else float 0
        let accessTimeByDomains = accessTimeByDomains.Remove(currentLinkDomain)
        accessTimeByDomains.Add(currentLinkDomain, (if minAllowedAccessTime > now then minAllowedAccessTime else now)), int delay
Third, the page is downloaded and processed in a child async workflow, so the agent can keep receiving messages while pages are being fetched. The page content is passed to HandleUrl, and the extracted links are posted back to the agent.

do! Async.StartChild (async {
        if sleepTime > 0 then do! Async.Sleep sleepTime
        printfn "***********************************************************"
        printfn "query to url %s at %s" preparedUrl (DateTime.Now.ToString())
        printfn "***********************************************************"
        let! html = HtmlDocument.AsyncLoad(preparedUrl)
        let startNode = parserArguments.GetStartNode(html)
        parserArguments.HandleUrl(preparedUrl, (if parserArguments.ReturnPureText then HtmlUtils.extractText startNode else html.ToString()))
        let loadCompleteTime = DateTime.Now
        let links =
            html.Descendants ["a"]
            |> Seq.choose (fun x ->
                x.TryGetAttribute("href")
                |> Option.map (fun a -> x.InnerText(), a.Value()))
            |> Seq.filter (fun (text, link) ->
                not (visitedLinks.Contains(link)) && parserArguments.FilterUrl(link))
        links |> Seq.iter (fun (text, link) -> self.Post (link, preparedUrl))
    }) |> Async.Ignore

Note that the MailboxProcessor.Start method receives a CancellationToken as a parameter; this is what allows us to cancel the crawl. On exception, the HandleException handler from the parser options is called. After a message has been processed, the agent recurses with the updated collections of visited links and per-domain access times. Here is the full source code of this module.
namespace Crawler

open FSharp.Data
open System
open System.Threading

module Parser =

    [<Measure>] type ms

    type ParserOptions =
        interface
            abstract SleepTimeInsideDomain: int<ms> with get, set
            abstract StartUrl : string with get, set
            abstract CancellationTokenSource : CancellationTokenSource with get, set
            abstract ReturnPureText: bool with get, set
            abstract GetStartNode: HtmlDocument -> HtmlNode
            abstract HandleUrl : url: string * content: string -> unit
            abstract HandleException : ex: Exception * url: string -> unit
            abstract FilterUrl: url: string -> bool
        end

    let startCrawling (parserArguments: ParserOptions) =
        MailboxProcessor.Start((fun self ->
            let rec handleUrl(visitedLinks: Set<string>, accessTimeByDomains: Map<string, DateTime>) =
                async {
                    let! (targetUrl, parentUrl) = self.Receive()
                    try
                        let url = LinksHandler.prepareUrl targetUrl parentUrl
                        match url with
                        | Some preparedUrl ->
                            let currentLinkDomain = LinksHandler.getDomainName preparedUrl
                            visitedLinks.Add(preparedUrl) |> ignore
                            let (accessTimeByDomains, sleepTime) =
                                if not (accessTimeByDomains.ContainsKey(currentLinkDomain)) then
                                    accessTimeByDomains.Add(currentLinkDomain, DateTime.Now), 0
                                else
                                    let lastDomainAccessTime = accessTimeByDomains.[currentLinkDomain]
                                    let minAllowedAccessTime = lastDomainAccessTime.AddMilliseconds(float parserArguments.SleepTimeInsideDomain)
                                    let now = DateTime.Now
                                    let delay =
                                        if minAllowedAccessTime > now then (minAllowedAccessTime - now).TotalMilliseconds
                                        else float 0
                                    let accessTimeByDomains = accessTimeByDomains.Remove(currentLinkDomain)
                                    accessTimeByDomains.Add(currentLinkDomain, (if minAllowedAccessTime > now then minAllowedAccessTime else now)), int delay
                            do! Async.StartChild (async {
                                    if sleepTime > 0 then do! Async.Sleep sleepTime
                                    printfn "***********************************************************"
                                    printfn "query to url %s at %s" preparedUrl (DateTime.Now.ToString())
                                    printfn "***********************************************************"
                                    let! html = HtmlDocument.AsyncLoad(preparedUrl)
                                    let startNode = parserArguments.GetStartNode(html)
                                    parserArguments.HandleUrl(preparedUrl, (if parserArguments.ReturnPureText then HtmlUtils.extractText startNode else html.ToString()))
                                    let loadCompleteTime = DateTime.Now
                                    let links =
                                        html.Descendants ["a"]
                                        |> Seq.choose (fun x ->
                                            x.TryGetAttribute("href")
                                            |> Option.map (fun a -> x.InnerText(), a.Value()))
                                        |> Seq.filter (fun (text, link) ->
                                            not (visitedLinks.Contains(link)) && parserArguments.FilterUrl(link))
                                    links |> Seq.iter (fun (text, link) -> self.Post (link, preparedUrl))
                                }) |> Async.Ignore
                            return! handleUrl(visitedLinks.Add(preparedUrl), accessTimeByDomains)
                        | None -> ()
                    with
                    | ex -> parserArguments.HandleException (ex, targetUrl)
                }
            handleUrl (Set.empty, Map.empty)), parserArguments.CancellationTokenSource.Token).Post (parserArguments.StartUrl, "")
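The module above references HtmlUtils.extractText, whose source is not listed in this article. A minimal sketch of what such a helper could look like (this is my assumption, not the original implementation): it returns the inner text of the given node with whitespace normalized, using FSharp.Data's InnerText extension.

namespace Crawler

open System
open FSharp.Data

module HtmlUtils =
    // Minimal sketch: take the inner text of the node and collapse whitespace runs.
    let extractText (node: HtmlNode) =
        node.InnerText().Split([| ' '; '\t'; '\r'; '\n' |], StringSplitOptions.RemoveEmptyEntries)
        |> String.concat " "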
Finally, let's look at an example of the parser options. These are the options for parsing Wikipedia. Note that we will parse only the content block, without headers and sidebars. In this example crawling is stopped after 100 URLs have been processed.
open Crawler
open System
open System.Threading
open Parser
open TextStore
open FSharp.Data

type ParserArguments() =
    member val ProcessedUrlCount = 0 with get, set
    interface ParserOptions with
        member val SleepTimeInsideDomain = 3000<ms> with get, set
        member val StartUrl = "https://en.wikipedia.org" with get, set
        member val CancellationTokenSource = new CancellationTokenSource() with get, set
        member val ReturnPureText = true with get, set
        member this.HandleException(ex: Exception, url: string) =
            printfn "%s EXCEPTION for url %s" (ex.Message + "---" + ex.StackTrace) url
            ()
        member this.HandleUrl(url: string, content: string) =
            if this.ProcessedUrlCount = 100 then
                (this :> ParserOptions).CancellationTokenSource.Cancel()
            else
                this.ProcessedUrlCount <- this.ProcessedUrlCount + 1
                UrlHandler.handleUrl(url, content)
        member this.FilterUrl(url: string) =
            url.StartsWith("/w")
        member this.GetStartNode(document: HtmlDocument) =
            let contentDiv =
                document.Descendants(fun node ->
                    node.HasId("content") && node.HasName("div") && node.HasAttribute("role", "main"))
                |> Seq.tryHead
            match contentDiv with
            | Some(div) -> div
            | None -> document.Body()

[<EntryPoint>]
let main argv =
    let parserArguments = new ParserArguments()
    Parser.startCrawling parserArguments
    System.Console.ReadKey() |> ignore
    0