Sunday, May 29, 2016

Generating permutations in F# and C#

Although this article is about generating permutations, what I really want to emphasize is the use of functional recursive patterns that help decompose a complex problem into simpler ones. The permutation algorithm is just an illustration of how we can implement an algorithm using recursion. Here is the example. We have the list [1; 2; 3; 4] and want to get all its permutations in lexicographic order. This is what we should get: [
[1; 2; 3; 4]; [1; 2; 4; 3]; [1; 3; 2; 4]; [1; 3; 4; 2]; [1; 4; 2; 3]; [1; 4; 3; 2];
[2; 1; 3; 4]; [2; 1; 4; 3]; [2; 3; 1; 4]; [2; 3; 4; 1]; [2; 4; 1; 3]; [2; 4; 3; 1];
[3; 1; 2; 4]; [3; 1; 4; 2]; [3; 2; 1; 4]; [3; 2; 4; 1]; [3; 4; 1; 2]; [3; 4; 2; 1];
[4; 1; 2; 3]; [4; 1; 3; 2]; [4; 2; 1; 3]; [4; 2; 3; 1]; [4; 3; 1; 2]; [4; 3; 2; 1]]
This output, sorted in lexicographic order, gives us the key to solving the task: I intentionally formatted it so that the lists in each row start with the same digit. First, we split the source list into a head and a tail, 1 and [2; 3; 4], respectively. Now we can see that the first row of the output can be produced by building lists from the current head (1) and every possible permutation of the current tail ([2; 3; 4]). So we get 1 and [2; 3; 4], 1 and [2; 4; 3], 1 and [3; 2; 4], and so on. Note, however, that this split into head and tail is not the ordinary one from pattern matching, where the first element is the head and the rest form the tail; it is just a concept in our case, and the implementation will be a little bit tricky. After we have handled the first row with 1 as the head, we move on to the second row: the head is now 2 and the tail is [1; 3; 4]. Here the tail is constructed by concatenating the elements before the head and after it, [1] and [3; 4], respectively. That is all the theory we need to start implementing the algorithm.
let generatePermutations source = 
    let rec generatePermutationsFrom prevElements tailElements =
        let rec nextStep source =
            match source with 
            | [x; y] -> [[x; y]]
            | x :: xs -> (generatePermutationsFrom [] xs |> List.map (fun perm -> x :: perm))
            | _ -> []
        
        match tailElements with 
        | [] -> []
        | x :: xs ->
            let fullList = x :: (prevElements @ xs)
            (nextStep fullList) @ (generatePermutationsFrom (prevElements @ [x]) xs)
            
    generatePermutationsFrom [] source
The key aspects of this code are:
1. Constructing lists from the current head (1) and all possible permutations of the current tail ([2; 3; 4]).
let rec nextStep source =
    match source with 
    | [x; y] -> [[x; y]]
    | x :: xs -> (generatePermutationsFrom [] xs |> List.map (fun perm -> x :: perm))
    | _ -> []

2. Selecting the new head and the new tail.
let fullList = x :: (prevElements @ xs)
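Before moving on to C#, here is a quick check of the F# function above in F# Interactive; the expected result is shown in the comment.
let permutations = generatePermutations [1; 2; 3]
// expected: [[1; 2; 3]; [1; 3; 2]; [2; 1; 3]; [2; 3; 1]; [3; 1; 2]; [3; 2; 1]]
permutations |> List.iter (printfn "%A")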
We can also take the main idea of this algorithm and easily rewrite it in C#.
static List<List<T>> GetPermutations<T>(List<T> source)
{
    if (source.Count() < 2) return new List<List<T>> { source };

    var prevElements = new List<T>();
    var tailElements = source.Skip(1).ToList();
    var result = new List<List<T>>();
    foreach (var item in source)
    {
        var newList = Enumerable.Concat(prevElements, tailElements).ToList();
        var tailPermutations = GetPermutations(newList);
        foreach (var tailPermutation in tailPermutations)
            result.Add(Enumerable.Concat(new T[] { item }, tailPermutation).ToList());
        prevElements.Add(item);
        tailElements = tailElements.Skip(1).ToList();
    }
    return result;
}

Tuesday, May 24, 2016

Using Neo4j with F#. Preparing for web page text analysis.


In my previous blog posts I wrote about crawling and HTML text extraction. Now it is time to save the data we crawled, and it would also be good to be able to retrieve the saved data later for analysis. Let's introduce two types: Page and SymbolGroup. The Page type will represent an HTML page. As for the SymbolGroup type, for simplicity you can treat it as a representation of a word; however, for languages of the CJK (Chinese/Japanese/Korean) group that would not be accurate. Here is the F# code with these types.
module Types
 
open Newtonsoft.Json
 
    [<CLIMutable>]
    type SymbolGroup = { Guid: string; Name: string; TailSeparator: string }
 
    [<CLIMutable>]
    type Page = { Guid: string; Url: string; }
Logically, Page and SymbolGroup entities are linked by a «Contains» relation. We will use Neo4j as the data store and reflect this relation there. First, let's write the code for saving data.
module PageRepository
 
open Types
open Neo4jClient.Cypher
 
    let savePage(page: Page, symbolGroups: seq<SymbolGroup>) = 
        
        let graphClient = ConnectionProvider.getGraphClient()
        let query = graphClient.Cypher
                                .Create("(page:Page {newPage})")
                                .WithParam("newPage", page)
        let query, _ = symbolGroups |> Seq.fold (fun (q: ICypherFluentQuery, i) symbolGroup -> 
                                                       q
                                                        .Create("(symbolGroup" + i.ToString() + ":SymbolGroup {newSymbolGroup" + i.ToString() + "})")
                                                        .Create("(page)-[:CONTAINS]->(symbolGroup" + i.ToString() + ")")
                                                        .WithParam("newSymbolGroup" + i.ToString(), symbolGroup), i + 1
                                        
                            ) (query, 0)
        query.ExecuteWithoutResults()
        ()
As you can see from the code, a page and its symbol groups are saved by a single Cypher query. Note that for each symbol group we create its own parameter in the query:
q.Create("(symbolGroup" + i.ToString() + ":SymbolGroup {newSymbolGroup" + i.ToString() + "})")
This query consists of three parts:
1. Page creation:
let query = graphClient.Cypher
                        .Create("(page:Page {newPage})")
                        .WithParam("newPage", page)
2. Symbol group creation:
q.Create("(symbolGroup" + i.ToString() + ":SymbolGroup {newSymbolGroup" + i.ToString() + "})")
3. Relation creation:
.Create("(page)-[:CONTAINS]->(symbolGroup" + i.ToString() + ")")
To retrieve pages with their symbol groups I wrote the following code:
let getAllPages() =
   let graphClient = ConnectionProvider.getGraphClient()
   let result = graphClient.Cypher
                           .OptionalMatch("(page:Page)-[Contains]-(symbolGroup:SymbolGroup)")
                           .Return(fun (page: ICypherResultItem) (symbolGroup: ICypherResultItem) -> page.As<Page>(), symbolGroup.CollectAs<SymbolGroup>())
                           .Results
   result
It is important to explicitly specify the ICypherResultItem parameter types in the
.Return(fun (page: ICypherResultItem) (symbolGroup: ICypherResultItem) -> page.As<Page>(), symbolGroup.CollectAs<SymbolGroup>())
expression. The result is a tuple of the page and the collection of its symbol groups (a sketch of how it can be consumed is shown after the UrlHandler code below). Once we have the persistence logic, we can store our HTML data like this:
namespace TextStore
 
open System
open Types
 
module UrlHandler = 
    
    let handleUrl(url:string, content:string) =
        let symbolGroups = TextUtils.extractWords(content) |> Seq.toList
        let guid = Guid.NewGuid()
        let page = { Guid = guid.ToString(); Url = url; }
        PageRepository.savePage(page, symbolGroups)
        ()
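For completeness, here is a hedged sketch of how the retrieval side could be consumed, assuming getAllPages lives in the same PageRepository module; each result item is a tuple of a page and the collection of its symbol groups.
let printAllPages() = 
    PageRepository.getAllPages()
    |> Seq.iter (fun (page, symbolGroups) -> 
        printfn "%s contains %d symbol groups" page.Url (Seq.length symbolGroups))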
We can see the resulting graph in the Neo4j browser. For this example I crawled the Chinese Wikipedia, to illustrate the point that the SymbolGroup type can represent more than just words.



Thursday, May 12, 2016

Suffix Arrays in F#

Let's consider the task of finding the longest repeating subsequence in some sequence, or simply analyzing whether there are any repeating subsequences at all.
One way to solve it is to use a suffix array. A suffix array is an array that contains all suffixes of the given sequence in sorted order. In this article I want to describe the key points about suffix arrays and implement the most important parts in F#.
Below is a unit test that illustrates what a suffix array is.
[<Test>]
member this.CreateSuffixArrayTest() = 
    let source = seq [ 'A'; 'F'; 'U'; 'B'; 'D'; 'B'; 'D' ]
    let result = SuffixArray.create source
    Assert.AreEqual(result.Length, 7)
    Assert.AreEqual(result.[0], [| 'A'; 'F'; 'U'; 'B'; 'D'; 'B'; 'D' |])
    Assert.AreEqual(result.[1], [| 'B'; 'D' |])
    Assert.AreEqual(result.[2], [| 'B'; 'D'; 'B'; 'D' |])
    Assert.AreEqual(result.[3], [| 'D' |])
    Assert.AreEqual(result.[4], [| 'D'; 'B'; 'D' |])
    Assert.AreEqual(result.[5], [| 'F'; 'U'; 'B'; 'D'; 'B'; 'D' |])
    Assert.AreEqual(result.[6], [| 'U'; 'B'; 'D'; 'B'; 'D' |])

It is a simple data structure, and the implementation is quite simple too, especially in F#.
First we need to create the array of suffixes, which can be done in just one line:

let result = Array.init maxSuffix.Length (fun i -> maxSuffix |> Array.skip i) 
Then we need to sort the suffixes; we use the Array.sortWith function with a custom comparer, which is written in an imperative style for performance reasons.
namespace Analytics
 
open System
 
module SuffixArray =
 
    let create (data: seq<'a>) = 
        let maxSuffix = data |> Seq.toArray
        let result = 
            Array.init maxSuffix.Length (fun i -> maxSuffix |> Array.skip i) 
            |> Array.sortWith (fun (x: 'a[]) (y: 'a[]) ->
                let minLength = Math.Min(x.Length, y.Length)
                let mutable index = 0
                // advance while the elements match and the shorter suffix is not exhausted
                while (index < minLength) && (x.[index] = y.[index]) do index <- index + 1
                if index < minLength then
                    if x.[index] > y.[index] then 1 else -1
                else
                    // one suffix is a prefix of the other: the shorter one comes first
                    if x.Length > y.Length then 1 else -1)
        result
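To see the result on a concrete example, we can run something like the following in F# Interactive. A string is a seq<char>, so it can be passed to SuffixArray.create directly; the expected output is shown in the comments.
let suffixes = SuffixArray.create "BANANA"
suffixes |> Array.iter (fun s -> printfn "%s" (System.String(s)))
// prints the sorted suffixes, one per line:
// A
// ANA
// ANANA
// BANANA
// NA
// NANA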
After the suffix array has been built, we can start analyzing it.
We will now find all repeating suffixes. Here is a unit test demonstrating how it should work.
[<Test>]
member this.FindRepeatingSuffixesTest() = 
    let source = seq [ 'A'; 'F'; 'U'; 'B'; 'D'; 'B'; 'D'; 'A'; 'F'; 'U' ]
    let suffixArray = SuffixArray.create source
    let result = SuffixArray.findRepeatingSuffixes suffixArray |> Seq.toArray
    Assert.AreEqual(result.Length, 3)
    Assert.AreEqual(['A'; 'F'; 'U'], result.[0])
    Assert.AreEqual(['B'; 'D'], result.[1])
    Assert.AreEqual(['F'; 'U'], result.[2])
To find the repeating suffixes, we just look at adjacent suffixes in the array and take their largest common prefix (if there is one). Here is the code.
let findRepeatingSuffixes (data: 'a[][]) = seq {
    for i = 1 to data.Length - 1 do
        let largestCommonPrefix = findLargestCommonPrefix data.[i - 1] data.[i]
        if largestCommonPrefix.Length > 1 then yield largestCommonPrefix
    }
Here is the unit test showing how the findLargestCommonPrefix function should work.
[<Test>]
member this.FindLargestCommonPrefixTest() = 
    let first = [| 'A'; 'F'; 'U'; 'B'; 'D'; 'B'; 'D' |]
    let second = [| 'A'; 'F'; 'Z'; 'B'; 'D'; 'B'; 'D' |]
    let result = SuffixArray.findLargestCommonPrefix first second
    Assert.AreEqual(['A'; 'F'], result)
And here is the implementation. First we determine the shorter of the two suffixes: obviously, the common prefix cannot be longer than the shorter suffix. Then we simply compare the two suffixes element by element from the beginning while the elements are equal. Once we reach the end of the shorter suffix or hit a mismatch, we stop the search.
let findLargestCommonPrefix (first: 'a[]) (second: 'a[]) =
    let minLength = Math.Min(first.Length, second.Length) 
    let mutable index = 0
    while (index < minLength) && (first.[index] = second.[index]) do index <- index + 1
    first.[0 .. Math.Max(0, index - 1)]
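Returning to the task mentioned at the beginning of the post, the longest repeating subsequence can now be obtained by simply taking the longest of the repeating prefixes. A small sketch (the function name is mine; it would live in the same SuffixArray module):
let findLongestRepeatingSuffix (data: 'a[][]) = 
    let repeating = findRepeatingSuffixes data |> Seq.toArray
    if repeating.Length = 0 then None
    else Some (repeating |> Array.maxBy (fun prefix -> prefix.Length))
For example, for the source sequence from the FindRepeatingSuffixesTest above, it returns Some [| 'A'; 'F'; 'U' |].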
As you can see, the analysis is quite simple, and you can easily extend or modify it for your needs. Also, thanks to F# type inference, all the algorithms described above are generic: you can use them not only with strings (sequences of chars) but with sequences of any other type.

Finally, a few notes about .NET environment configuration. As you might suspect, the suffix array consumes a lot of memory. You should open the project properties and uncheck the “Prefer 32-bit” option on the Build tab; otherwise you will not be able to create arrays larger than 4 GB.

Tuesday, May 10, 2016

Implementing web crawler. Part 2. Using F#.

The previous article was dedicated to describing the important aspects of crawler programming. In this article we will implement a web crawler in F#, taking these aspects into account.

First, let's implement an auxiliary module that will help us handle links. It has two functions that will be useful later.
1. let prepareUrl url (parentUrl: string): this function prepares a link URL for navigation. As I mentioned in the previous article, there are different kinds of URLs (absolute, relative, protocol-relative and invalid), so it is convenient to create an active pattern for these categories for later use.
let (|AbsoluteUrl|RelativeUrl|ProtocolRelativeUrl|InvalidUrl|) (input: string) = 
            if input.StartsWith("http://") || input.StartsWith("https://"then AbsoluteUrl
            else if input.StartsWith("//"then ProtocolRelativeUrl
            else if not (String.IsNullOrWhiteSpace input) then RelativeUrl
            else InvalidUrl
Also, to simplify the code, let's introduce active patterns for slash detection.
let (|StartsWithSlash|_|) (input: string) = if input.StartsWith("/") then Some StartsWithSlash else None
let (|EndsWithSlash|_|) (input: string) = if input.EndsWith("/") then Some EndsWithSlash else None 
Below is the full source code of this function. If the URL is not invalid, it will be converted to an absolute URL.
let prepareUrl url (parentUrl: string) =
    match url with
    | AbsoluteUrl -> Some url
    | ProtocolRelativeUrl -> 
        let protocol = parentUrl.Substring(0, parentUrl.IndexOf("//"))
        Some (protocol + url)
    | RelativeUrl -> 
        let preparedParentUrl = 
            match parentUrl with
            | EndsWithSlash -> parentUrl.Substring(0, parentUrl.Length - 1)
            | _ -> parentUrl
        let preparedUrl = 
            match url with
            | StartsWithSlash -> url
            | _ -> "/" + url
        Some (preparedParentUrl + preparedUrl)
    | InvalidUrl -> None
2. let getDomainName url: string: this function extracts the domain name from a link.
let getDomainName url: string = 
    let uri = new Uri(url)
    let host = uri.Host
    if host.StartsWith("www."then host.Substring(4)
    else host
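To make the conversion rules concrete, here is roughly what the two helpers return for each URL category when evaluated in F# Interactive (all the URLs are placeholder data; the expected values are shown in the comments).
LinksHandler.prepareUrl "http://other.com/a" "http://example.com"
// Some "http://other.com/a" (an absolute URL is kept as is)
LinksHandler.prepareUrl "//cdn.example.com/x" "https://example.com"
// Some "https://cdn.example.com/x" (the protocol is taken from the parent URL)
LinksHandler.prepareUrl "news/today" "http://example.com"
// Some "http://example.com/news/today" (a relative URL is resolved against the parent)
LinksHandler.prepareUrl "" "http://example.com"
// None (an invalid URL)
LinksHandler.getDomainName "http://www.example.com/page"
// "example.com"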
Here is the full source code of this module.
namespace Crawler
 
open System
 
module LinksHandler =
    let (|AbsoluteUrl|RelativeUrl|ProtocolRelativeUrl|InvalidUrl|) (input: string) = 
            if input.StartsWith("http://") || input.StartsWith("https://"then AbsoluteUrl
            else if input.StartsWith("//"then ProtocolRelativeUrl
            else if not (String.IsNullOrWhiteSpace input) then RelativeUrl
            else InvalidUrl
 
    let (|StartsWithSlash|_|) (input: string) = if input.StartsWith("/") then Some StartsWithSlash else None
    let (|EndsWithSlash|_|) (input: string) = if input.EndsWith("/") then Some EndsWithSlash else None
 
    let prepareUrl url (parentUrl: string) =
        match url with
        | AbsoluteUrl -> Some url
        | ProtocolRelativeUrl -> 
            let protocol = parentUrl.Substring(0, parentUrl.IndexOf("//"))
            Some (protocol + url)
        | RelativeUrl -> 
            let preparedParentUrl = 
                match parentUrl with
                | EndsWithSlash -> parentUrl.Substring(0, parentUrl.Length - 1)
                | _ -> parentUrl
            let preparedUrl = 
                match url with
                | StartsWithSlash -> url
                | _ -> "/" + url
            Some (preparedParentUrl + preparedUrl)
        | InvalidUrl -> None
 
    let getDomainName url: string = 
        let uri = new Uri(url)
        let host = uri.Host
        if host.StartsWith("www."then host.Substring(4)
        else host
The next module contains the crawling logic. It also defines some additional types.
    [<Measure>] type ms
 
type ParserOptions = interface
    abstract SleepTimeInsideDomain: int<ms> with get, set
    abstract StartUrl : string with get, set
    abstract CancellationTokenSource : CancellationTokenSource with get, set
    abstract ReturnPureText: bool with get, set
    abstract GetStartNode: HtmlDocument -> HtmlNode
    abstract HandleUrl : url: string * content: string -> unit
    abstract HandleException : ex: Exception * url: string -> unit
    abstract FilterUrl: url: string -> bool
end
The ParserOptions type is used to customize the crawler's behavior. Below is a description of the available options.
1. SleepTimeInsideDomain: the delay in milliseconds between requests to the same domain. The measure type 'ms' is used to specify the value in milliseconds.
2. StartUrl: the URL we start crawling from.
3. CancellationTokenSource: used to cancel crawling. We pass it to the crawler through the options, and when we decide to cancel crawling we just call its Cancel() method.
4. ReturnPureText: depending on the specifics of our analysis, we may need either the full HTML code or the extracted plain text. If set to true, the returned text will be stripped of its HTML envelope.
5. GetStartNode: typically used when you crawl a specific site and want to analyze only a specific block of the page. For example, you crawl Wikipedia and want to analyze only the content, without headers and sidebars.
6. HandleUrl: the payload function that is called after a page is downloaded; it receives the page content and the URL as parameters. For example, you can save the text content to a database here.
7. HandleException: the exception handler.
8. FilterUrl: lets you customize which URLs may or may not be visited.

We will use an F# agent (MailboxProcessor) as the foundation of the crawler. It helps us manage shared state easily and efficiently: we need to keep information about visited links, to avoid visiting the same page multiple times, and about the last visit time for each visited domain, to organize delays between requests to the same domain correctly.
First, when a message is received by the agent, the link from the message is added to the visited-links collection:
visitedLinks.Add(preparedUrl) |> ignore
Second, we determine the delay time based on the link's domain and the stored information about when that domain was last accessed. If the domain is not registered yet, we set the delay to 0, because we have not visited this domain before, and we record the current time for it. Otherwise, we calculate the time span between the current time and the domain's last access time. If it is greater than the SleepTimeInsideDomain option value, the delay is also 0; if it is less, the delay is calculated as (last domain access time + SleepTimeInsideDomain - current time). For example, with SleepTimeInsideDomain set to 3000 ms and a domain last accessed 1200 ms ago, the delay is 1800 ms.
let (accessTimeByDomains, sleepTime) = 
                                if not (accessTimeByDomains.ContainsKey(currentLinkDomain)) then
                                    accessTimeByDomains.Add(currentLinkDomain, DateTime.Now), 0
                                else 
                                    let lastDomainAccessTime = accessTimeByDomains.[currentLinkDomain]
                                    let minAllowedAccessTime = lastDomainAccessTime.AddMilliseconds(float parserArguments.SleepTimeInsideDomain)
                                    let now = DateTime.Now
                                    let delay = if minAllowedAccessTime > now then (minAllowedAccessTime - now).TotalMilliseconds else float 0
                                    let accessTimeByDomains = accessTimeByDomains.Remove(currentLinkDomain)
                                    accessTimeByDomains.Add(currentLinkDomain, (if minAllowedAccessTime > now then minAllowedAccessTime else now)), int delay
After the delay is calculated, we handle the URL asynchronously; while the URL is being processed, the agent is able to handle other URLs (or receive other messages). We use the F# Data library for HTML parsing and link extraction. As soon as the links are extracted, we post a processing request to the message queue for each link that has not been visited yet. Only links satisfying the filter function (FilterUrl) from the parser options will be handled.
do! Async.StartChild
        (async {
            if sleepTime > 0 then do! Async.Sleep sleepTime
 
            printfn "***********************************************************"
            printfn "query to url %s at %s" preparedUrl (DateTime.Now.ToString())
            printfn "***********************************************************"
            
            let! html = HtmlDocument.AsyncLoad(preparedUrl)
            let startNode = parserArguments.GetStartNode(html)
            parserArguments.HandleUrl(preparedUrl, if parserArguments.ReturnPureText then HtmlUtils.extractText startNode else html.ToString())
            let loadCompleteTime = DateTime.Now
            let links = html.Descendants ["a"]
                        |> Seq.choose (fun x ->
                        x.TryGetAttribute("href")
                        |> Option.map (fun a -> x.InnerText(), a.Value()))
                        |> Seq.filter (fun (text, link) -> not (visitedLinks.Contains(link)) && parserArguments.FilterUrl(link))
            links |> Seq.iter (fun (text, link) -> self.Post (link, preparedUrl))
        }) |> Async.Ignore
Note that the MailboxProcessor.Start method receives a CancellationToken as a parameter; this is what lets us cancel the crawling. On an exception, the HandleException function from the parser options is called. When the agent finishes processing a message, it recurses with the updated collections of visited links and domain access times. Here is the full source code of this module.
namespace Crawler
 
open FSharp.Data
open System
open System.Threading
 
module Parser =
 
    [<Measure>] type ms
 
    type ParserOptions = interface
        abstract SleepTimeInsideDomain: int<ms> with get, set
        abstract StartUrl : string with get, set
        abstract CancellationTokenSource : CancellationTokenSource with get, set
        abstract ReturnPureText: bool with get, set
        abstract GetStartNode: HtmlDocument -> HtmlNode
        abstract HandleUrl : url: string * content: string -> unit
        abstract HandleException : ex: Exception * url: string -> unit
        abstract FilterUrl: url: string -> bool
    end
 
    let startCrawling (parserArguments: ParserOptions) =
        MailboxProcessor.Start((fun self -> 
            let rec handleUrl(visitedLinks: Set<string>, accessTimeByDomains: Map<stringDateTime>) =
                async {
                    let! (targetUrl, parentUrl) = self.Receive()
                    try
                        let url = LinksHandler.prepareUrl targetUrl parentUrl
                        
                        match url with
                        | Some preparedUrl ->
                            let currentLinkDomain = LinksHandler.getDomainName preparedUrl
                            
                            visitedLinks.Add(preparedUrl) |> ignore
 
                            let (accessTimeByDomains, sleepTime) = 
                                if not (accessTimeByDomains.ContainsKey(currentLinkDomain)) then
                                    accessTimeByDomains.Add(currentLinkDomain, DateTime.Now), 0
                                else 
                                    let lastDomainAccessTime = accessTimeByDomains.[currentLinkDomain]
                                    let minAllowedAccessTime = lastDomainAccessTime.AddMilliseconds(float parserArguments.SleepTimeInsideDomain)
                                    let now = DateTime.Now
                                    let delay = if minAllowedAccessTime > now then (minAllowedAccessTime - now).TotalMilliseconds else float 0
                                    let accessTimeByDomains = accessTimeByDomains.Remove(currentLinkDomain)
                                    accessTimeByDomains.Add(currentLinkDomain, (if minAllowedAccessTime > now then minAllowedAccessTime else now)), int delay
 
                            do! Async.StartChild
                                    (async {
                                        if sleepTime > 0 then do! Async.Sleep sleepTime
 
                                        printfn "***********************************************************"
                                        printfn "query to url %s at %s" preparedUrl (DateTime.Now.ToString())
                                        printfn "***********************************************************"
                                        
                                        let! html = HtmlDocument.AsyncLoad(preparedUrl)
                                        let startNode = parserArguments.GetStartNode(html)
                                        parserArguments.HandleUrl(preparedUrl, if parserArguments.ReturnPureText then HtmlUtils.extractText startNode else html.ToString())
                                        let loadCompleteTime = DateTime.Now
                                        let links = html.Descendants ["a"]
                                                    |> Seq.choose (fun x ->
                                                    x.TryGetAttribute("href")
                                                    |> Option.map (fun a -> x.InnerText(), a.Value()))
                                                    |> Seq.filter (fun (text, link) -> not (visitedLinks.Contains(link)) && parserArguments.FilterUrl(link))
                                        links |> Seq.iter (fun (text, link) -> self.Post (link, preparedUrl))
                                    }) |> Async.Ignore
                            return! handleUrl(visitedLinks.Add(preparedUrl), accessTimeByDomains)
                        | None -> ()
                    with
                        | ex -> parserArguments.HandleException (ex, targetUrl)
                }
            handleUrl (Set.empty, Map.empty)), parserArguments.CancellationTokenSource.Token).Post (parserArguments.StartUrl, "")
 
Finally, let's look at an example of parser options. These options are for parsing Wikipedia. Note that we will parse only the content block, without headers and sidebars. In this example, parsing stops after 100 URLs have been processed.

open Crawler
open System
open System.Threading
open Parser
open TextStore
open FSharp.Data
 
type ParserArguments() = 
    member val ProcessedUrlCount = 0 with get, set
    interface ParserOptions with
            member val SleepTimeInsideDomain = 3000<ms> with get, set
            member val StartUrl = "https://en.wikipedia.org" with get, set
            member val CancellationTokenSource = new CancellationTokenSource() with get, set
            member val ReturnPureText = true with get, set
            member this.HandleException(ex: Exception, url: string) =
                printfn "%s EXCEPTION for url %s" (ex.Message + "---" + ex.StackTrace) url
                ()
            member this.HandleUrl(url: string, content: string) = 
                if this.ProcessedUrlCount = 100 then (this :> ParserOptions).CancellationTokenSource.Cancel()
                else
                    this.ProcessedUrlCount <- this.ProcessedUrlCount + 1
                    UrlHandler.handleUrl(url, content)
            member this.FilterUrl(url: string) = 
                url.StartsWith("/w")
            member this.GetStartNode(document: HtmlDocument) =
                let contentDiv = document.Descendants(fun node -> node.HasId("content") && node.HasName("div") && node.HasAttribute("role", "main")) |> Seq.tryHead
                match contentDiv with
                | Some(div) -> div
                | None -> document.Body()
 
[<EntryPoint>]
let main argv = 
    let parserArguments = new ParserArguments()
    Parser.startCrawling parserArguments
    System.Console.ReadKey() |> ignore
    0 

Friday, May 6, 2016

Implementing web crawler


First of all, I want to separate my thoughts about the problems a crawler developer will face from parser implementation details. So this article will be language agnostic, and I want to talk about the cornerstones of web crawling.

From a distance, crawler programming seems to be a very simple task. Indeed, at a glance you only have to perform the following simple steps:

1. download a page from some URL;
2. analyze its content (the payload);
3. extract the links;
4. repeat steps 1 to 4 for each extracted link.

However, this simplicity is illusory, and I want to explain why. There are a lot of nuances, and I will try to cover the most important ones.

1. Asynchronous processing.

If you handle each link sequentially, your parser will be very slow, because you will have to wait until each page's content is loaded. We should design for asynchrony from scratch: handle other pages while the current page is being loaded. Note that processing should be asynchronous even if links are processed in parallel on multiple threads; we should not block the execution flow waiting for a server response. Moreover, if you implement asynchronous link processing, you may not even need to further complicate the architecture with parallel processing (depending on your particular case, of course).

2. Delays between requests.

Many servers have defenses against undesirable activity such as DDoS, and in many places visits by bots are simply forbidden. Also, with asynchronous crawling the number of requests per second increases drastically compared to the synchronous version. Your parser should not put a heavy load on the servers it crawls, so we need to introduce a delay between requests to avoid being banned.
However, you should keep in mind that a page can contain links both to the same domain and to other domains. Obviously, we do not need a delay when loading a link from another domain, so the crawler will work faster without one. Remember that we crawl pages asynchronously. Imagine that two pages from domain 'a' have been downloaded and handled simultaneously, and each of them contains a link to domain 'b'. If we just load these two links from domain 'b' without a delay, it would amount to simultaneous requests to domain 'b', which is exactly the situation we are trying to avoid. So we need a smart way of tracking requests to different domains.

3. Links extraction and handling.

The task of link extraction and handling is also non-trivial. First of all, we need a good HTML parsing library to extract links; that is the first part. The second part is preparing the links for visiting. Let me describe it in detail.

In general, extracted links can be divided into 4 categories:

a. absolute URLs
b. relative URLs
c. protocol relative URLs
d. invalid URLs

The first category is simple to handle: we just load the page by the link URL without changing it. To handle the second type, we take the parent URL (the URL of the page currently being handled), build an absolute URL from the relative and parent URLs, and then load the page by the resulting URL. For protocol-relative URLs (URLs starting with two slashes, //) we determine which protocol is currently in use and then construct the absolute URL. As for invalid URLs, we consider a URL invalid if it cannot be assigned to any of the 3 groups above; for example, it can be an anchor (<a href="#paragraph1">Test link</a>). Of course, we will not visit such URLs.

4. Extracting text to analyze.

Obviously, we crawl in order to analyze the text on pages, so we should find an effective way to separate the text from the HTML tags. There are libraries for many languages and technologies that let you traverse the HTML tree and extract the inner text of each node, for example, the HTML Agility Pack for .NET languages or the F# Data library for F#. You can see an example implementation in my previous post.

5. URL filtration and cancellation.
We should define which URLs we do and do not want to analyze. We should also define cancellation criteria, i.e. the conditions under which crawling should finish.

6. Exception handling.

It is very hard to foresee all corner cases. We should carefully catch exceptions and analyze them to improve the parser. The simplest way to handle exceptions is to add logging to your crawler. Of course, it is better if exceptions are handled asynchronously, without delaying the crawler's work.

7. User agent.

Some servers block requests that do not include a user agent. It is better to provide a user agent to be used in the crawler's requests.