Saturday, July 23, 2016

CQRS and F# SQLProvider

There are situations when a project has very complicated business logic and needs to display a lot of information to users. If this is not properly implemented and structured, serious problems can appear. Usually people focus on the domain, and this focus is reflected in the data storage organization, which is usually not intended for presentation purposes. In practice, attempting to build a presentation layer on top of a domain-focused storage organization can have some upsetting consequences:
1. dozens of joins in data queries, where a single query occupies several tens of lines of code – a huge performance impact;
2. difficulties modifying such code – even the people who wrote it have to spend time recalling how it works;
3. the possibility of easily introducing bugs while modifying it;
4. trouble getting new team members up to speed on existing projects due to excessive code complexity.
There can also be attempts to reuse such queries for both domain logic and view representation – in many cases this will fail. I want to talk about how to avoid this kind of problem. The approach is already known and is called Command Query Responsibility Segregation (CQRS). I will not pay much attention to the theoretical side – there is a lot of information about this pattern on the Internet. I want to tell about my practical experience and about how easily this pattern can be implemented using the F# SQLProvider. Briefly, according to this pattern, all logic is divided into two parts – commands and queries. In practice, commands are related to domain actions and queries are related to presentation. It can be convenient to divide not only the code by purpose, but the data stores too: one store for the domain and one store for presentation. However, this is not necessary. You can use DBMS features such as views and indexed views to get the same result. In some cases you may not even need to think about data storage at all – the gain will come just from separating domain logic from view logic, which makes your code clearer – it all depends on your particular case. Of course, if you maintain two storages, you have to make an effort to synchronize data between the domain and presentation storage. You also have to keep the synchronization aspect in mind when you perform support tasks or make changes in the database manually. However, in my experience such things are nothing compared to the harmful consequences described above.

And now let's consider the practical aspect. After separating domain logic from presentation logic, we can treat our application as consisting of two parts, which makes the initial task easier, because we can handle each part almost independently and each part has a limited scope of responsibility. That means every part becomes simpler from the architectural point of view. I found it very convenient to use the F# SQLProvider, especially for the query side. In ORMs like NHibernate you have to define mappings between tables and classes and keep track of database changes to reflect them in the mappings. F# SQLProvider works in a different way: it always gives you the actual database schema representation, which is checked at compile time. For example, if you change a field name in some table and there are queries using this field in the project, the project will not compile. It should also be mentioned that development speed increases significantly with the F# SQLProvider, because you do not need to write mappings and you always get the actual DB representation through autocomplete!
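The query-side code below relies on a DataProvider module that wraps the type provider; it is not shown in the original snippets. Here is a minimal sketch of what it could look like, assuming a PostgreSQL presentation database (the connection string is a placeholder; adjust it to your environment):
module DataProvider
 
open FSharp.Data.Sql
 
// Placeholder connection string for the presentation (query-side) database.
[<Literal>]
let ConnectionString = "Host=localhost;Database=presentation;Username=app;Password=secret"
 
// The type provider builds the schema representation at compile time.
type Sql = SqlDataProvider<DatabaseVendor = Common.DatabaseProviderTypes.POSTGRESQL,
                           ConnectionString = ConnectionString>
 
let getDataContext() = Sql.GetDataContext()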

You can object that powerful ORMs allow complex mappings to objects. But in our case the idea is to reduce complexity: the presentation data storage should be prepared in such a way that interaction with it is fast and easy, so mapping functionality would just be redundant! Better problem decomposition matters too: we develop the query side and think about the query side only. To emphasize this separation I even move the query side into a separate project. We do not touch the domain logic while working on the presentation. This is also convenient when you are still afraid to use F# for the entire project – you can implement just the query side with it. Here are the code examples. They are pretty simple, but that is the main idea of this post. To organize the query side we just need to describe the view models like this.
[<CLIMutable>]
type ReportModel =
    {
        ReportId : int
        TaskId : int
        Header : string
        Content : string
        AuthorName : string
        Date: DateTime
    }
and create queries like this
open FSharp.Data.Sql
open DataProvider
open System
 
type ReportPresentationService() =
     member this.GetNewestReports(from: int, reportsCount: int) =
        let db = DataProvider.getDataContext()
        let targetQuery = 
            query { for report in db.Public.VReports do 
                    skip from
                    take reportsCount
                    select { ReportId = report.ReportId; Header = report.Header; Content = report.Description; 
                             Date = report.CreationDate; AuthorName = report.Login; TaskId = report.TaskId }}
        targetQuery |> Seq.toList
In my example there is a view – v_reports – from which I query the data. For my particular case it is enough. Note that the ReportModel type has the CLIMutable attribute. That is because I use it in ASP.NET MVC, and the default ASP.NET MVC model binder requires models to have mutable properties. That's all – the presentation layer came out very thin and simple. It contains only the logic responsible for presentation, and it is easy to read and to change.
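For completeness, here is a hypothetical sketch of how such a service could be consumed from an ASP.NET MVC controller (the controller name and action are illustrative and not part of the original project):
open System.Web.Mvc
 
// Hypothetical controller: returns the requested page of reports as JSON.
type ReportsController() =
    inherit Controller()
 
    member this.Newest(from: int, count: int) =
        let service = ReportPresentationService()
        let reports = service.GetNewestReports(from, count)
        this.Json(reports, JsonRequestBehavior.AllowGet)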
 

Monday, June 27, 2016

Apriori algorithm implementation in F#

Because I now have a blog, I want to share one of my old pieces of code – an Apriori algorithm implementation. Apriori is probably the most famous algorithm for frequent itemset search and association rule inference. Let's look at the definitions that will be used in the algorithm description and implementation.
1. The support of an itemset is the ratio of the number of transactions containing this itemset to the total number of transactions.
2. The confidence of an association rule X → Y is the ratio of the number of transactions containing both X and Y to the number of transactions containing X.
3. A frequent itemset is an itemset whose support is equal to or greater than a defined support threshold.
4. Anti-monotonicity property – if an itemset X is not frequent, then adding a new item xn to it will not make it frequent. This property will be used to reduce the set of candidate itemsets.
I'll briefly explain the algorithm by example. There are transactions: [3; 1; 4; 6]; [1; 2; 4; 6]; [1; 3; 4; 9]; [2; 5; 4; 9]; [6; 7; 8; 9]. Let the support threshold be 40%. First we find all one-item itemsets satisfying the given support. Here is the result: [[1]; [2]; [3]; [4]; [6]; [9]]. After that we generate all possible two-item itemsets, prune them using the support value and get the following result: [[1;6]; [1;4]; [1;3]; [2;4]; [3;4]; [4;9]; [4;6]]. Itemsets of three and more items are generated by linking the set above with itself. For example, if we link [1;6] and [1;4] the result will be [1; 4; 6]. After pruning only two itemsets are left: [1;4;6] and [1;3;4]. Since these itemsets cannot be linked – they do not have k - 1 common elements (where k is the itemset length, so to be linked they should have 2 common elements, but they have only 1) – we finish the candidate generation process. Now we have the frequent itemsets and should infer the rules. We need to generate subsets of all sizes for the remaining itemsets. After filtering by the support and confidence values we get the following results.
Reason=[1; 4] Consequence=[6] Support=0.6 Confidence=0.6666666667
Reason=[1; 6] Consequence=[4] Support=0.4 Confidence=1.0
Reason=[4; 6] Consequence=[1] Support=0.4 Confidence=1.0
Reason=[1] Consequence=[4; 6] Support=0.6 Confidence=0.6666666667
Reason=[1] Consequence=[4] Support=0.6 Confidence=0.6666666667
Reason=[1] Consequence=[6] Support=0.6 Confidence=0.6666666667
Reason=[6] Consequence=[1; 4] Support=0.6 Confidence=0.6666666667
Reason=[6] Consequence=[1] Support=0.6 Confidence=0.6666666667
Reason=[6] Consequence=[4] Support=0.6 Confidence=0.6666666667
Reason=[1; 3] Consequence=[4] Support=0.4 Confidence=1.0
Reason=[1; 4] Consequence=[3] Support=0.6 Confidence=0.6666666667
Reason=[3; 4] Consequence=[1] Support=0.4 Confidence=1.0
Reason=[1] Consequence=[3; 4] Support=0.6 Confidence=0.6666666667
Reason=[1] Consequence=[3] Support=0.6 Confidence=0.6666666667
Reason=[1] Consequence=[4] Support=0.6 Confidence=0.6666666667
Reason=[3] Consequence=[1; 4] Support=0.4 Confidence=1.0
Reason=[3] Consequence=[1] Support=0.4 Confidence=1.0
Reason=[3] Consequence=[4] Support=0.4 Confidence=1.0
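Before looking at the implementation, here is a tiny standalone sketch (not part of the algorithm module) that double-checks the support values from the walkthrough above:
// Transactions from the example above.
let transactions =
    [ Set.ofList [3; 1; 4; 6]; Set.ofList [1; 2; 4; 6]; Set.ofList [1; 3; 4; 9]
      Set.ofList [2; 5; 4; 9]; Set.ofList [6; 7; 8; 9] ]
 
// Support = (transactions containing the itemset) / (total transactions).
let support itemset =
    let hits = transactions |> List.filter (fun t -> itemset |> List.forall (fun item -> Set.contains item t))
    float hits.Length / float transactions.Length
 
printfn "%f" (support [1; 4; 6]) // 0.4, meets the 40% threshold
printfn "%f" (support [1; 3; 4]) // 0.4, meets the 40% threshold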

Below is the code of Apriori algorithm module with descriptions.
module AprioriModule
 
open System.Collections.Generic
open System
Let's define the type for an Apriori rule the following way:
type Rule<'T> = { Reason: 'T list; Consequence: 'T list; Support: double; Confidence: double }
The function that calculates the absolute support count of an itemset is defined the following way: we increment the support count for the itemset if the currently visited transaction contains every item of the itemset.
let GetItemsetSupportCount itemset transactions =
    Seq.fold (fun sum transaction -> if List.forall (fun item -> Set.contains item transaction) itemset then sum + 1 else sum) 0 transactions
Once we know how to calculate support, we can separate frequent itemsets from infrequent ones. We also fill in the support data for the itemsets (supportData : Dictionary<_, int>).
let ExtractFrequentItemsets minSupport itemsets transactions (supportData : Dictionary<_, int>) =
    let transactionsCount = Seq.length transactions
    let minSupportRequiredCount = int(Math.Ceiling((float transactionsCount) * minSupport))
    itemsets |> List.filter (fun x -> 
        let supportCount = GetItemsetSupportCount x transactions
        if minSupportRequiredCount <= supportCount then
            supportData.Add(x, supportCount)
            true
        else false)
As mentioned above, new candidates are generated by linking the set with itself. Here is a unit test to illustrate what we should get.
    [<Test>]
member this.MergeListsTest() = 
    let list1 = [1; 5; 6; 9]
    let list2 = [1; 5; 6; 12]
    let mergeResult = MergeLists list1 list2
    Assert.AreEqual([1; 5; 6; 9; 12], mergeResult)
 
    let list3 = [1; 8]
    let list4 = [8; 9]
    let mergeResult2 = MergeLists list3 list4
    Assert.AreEqual([], mergeResult2)
This specific kind of merge is implemented the following way.
let MergeLists list1 list2 = 
    let rec MergeListsInner list1 list2 mergeResult =
        match list1, list2 with
        |   [],_ -> []
        |   _,[] -> []
        |   head1 :: tail1, head2 :: tail2 when head1 <> head2 -> []
        |   head1 :: [_], head2 :: tail2 -> mergeResult@list1@tail2 |> List.sort
        |   head1 :: tail1, head2 :: tail2 -> MergeListsInner tail1 tail2 ([head1]@mergeResult)
    let result = MergeListsInner list1 list2 []
    result
The one exception is the generation of all possible two-item itemsets from one-item itemsets. The GenerateCombinations function reflects this.
let rec GenerateCombinations item list combinations =
    match item, list with
    |   _ , [] -> List.filter (not << List.isEmpty) combinations
    |   [h], [x] :: xs -> GenerateCombinations item xs ([[h; x]] @ combinations)
    |   _ , x :: xs -> GenerateCombinations item xs (([MergeLists item x ]) @ combinations)
So again: to generate combinations of one-item itemsets we just link them together;
|   [h], [x] :: xs -> GenerateCombinations item xs ([[h; x]] @ combinations)
for n-item (n > 1) itemsets we use the MergeLists function:
|   _ , x :: xs -> GenerateCombinations item xs (([MergeLists item x ]) @ combinations)
Here is the unit test for the combination generation functionality.
    [<Test>]
member this.GenerateCombinationsTest() =
    let tail1 = [[1; 3]; [1; 4]; [1; 5]]
    let generateResult1 = GenerateCombinations [1; 2] tail1 []
    Assert.AreEqual([ [1; 2; 5]; [1; 2; 4]; [1; 2; 3] ], generateResult1)
 
    let tail2 = [[2]; [3]]
    let generateResult2 = GenerateCombinations [1] tail2 []
    Assert.AreEqual([ [1; 3]; [1; 2;]], generateResult2)
Finally we can implement the function for frequent itemset search. Unit test first.
[<Test>]
    member this.GetFrequentItemsetsTest() =
        let supportData = new Dictionary<_, _>(HashIdentity.Structural)
        let items = [[1]; [2]; [3]; [4]; [5]; [6]; [7]; [8]; [9]]
        let transactions = [Set.ofList[3; 1; 4; 6;]; Set.ofList[1; 2; 4; 6]; Set.ofList[1; 3; 4; 9]; Set.ofList[2; 5; 4; 9]; Set.ofList[6; 7; 8; 9]]
        let frequentItems = GetFrequentItemsets items transactions 0.4 supportData
        Assert.AreEqual([[1; 4; 6]; [1; 3; 4]], frequentItems);
And here is the implementation.
let GetFrequentItemsets candidates transactions minSupport supportData = 
    let rec GetFrequentItemsetsInner candidates transactions lastStepFrequentItemsets minSupport supportData = 
        let rec GenerateCandidates itemsets candidates =
            match itemsets with 
            |   [] -> candidates
            |   x :: xs -> GenerateCandidates xs (candidates @ (GenerateCombinations x xs []))
        let currentStepFrequentItemsets = (ExtractFrequentItemsets minSupport candidates transactions supportData) 
        match currentStepFrequentItemsets with 
        |   [] -> lastStepFrequentItemsets
        |   _  -> GetFrequentItemsetsInner (GenerateCandidates candidates []) transactions currentStepFrequentItemsets minSupport supportData
    GetFrequentItemsetsInner candidates transactions [] minSupport supportData
Once we know the frequent itemsets we can generate rules. Here is an auxiliary function that generates combinations of all sizes.
let rec GenerateRulesCombinationsOfAllSizes acc size set = 
    let rec GenerateRulesCombinationsOfSize acc size set = seq {
      match size, set with 
      | n, x::xs -> 
          if n > 0 then yield! GenerateRulesCombinationsOfSize (x::acc) (n - 1) xs
          if n >= 0 then yield! GenerateRulesCombinationsOfSize acc n xs 
      | 0, [] -> yield List.sort acc 
      | _, [] -> () }
    seq { for i = size downto 1 do yield! GenerateRulesCombinationsOfSize [] i set }
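For example, for the frequent itemset [1; 4; 6] this function yields all non-empty proper subsets, starting from the largest size:
GenerateRulesCombinationsOfAllSizes [] 2 [1; 4; 6] |> Seq.toList
// [[1; 4]; [1; 6]; [4; 6]; [1]; [4]; [6]]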
To generate rules we generate item combinations of all sizes for the given itemset. Each combination is treated as the reason, and the difference between the whole itemset and this combination is treated as the consequence.
let GetRules itemsets supportData minSupport minConfidence transactionsCount = 
    let GetItemsetRules itemset (supportData : Dictionary<_, int>) minSupport minConfidence allItemsetsLength= seq {
        let set = Set.ofList itemset
        let itemsetLength = set |> Set.count
        let combinations = GenerateRulesCombinationsOfAllSizes [] (itemsetLength - 1) itemset
        for reasonCombination in combinations do
            let rest = Set.difference set (Set.ofList reasonCombination)
            let consequenceCombinations = GenerateRulesCombinationsOfAllSizes [] (Set.count rest) (Set.toList rest)
            for consequenceCombinaion in consequenceCombinations do
                let rule = { Reason = reasonCombination; Consequence = consequenceCombinaion; 
                    Support = (double supportData.[reasonCombination] / double allItemsetsLength); 
                    Confidence =  (double supportData.[itemset] / double supportData.[reasonCombination])}
                if rule.Confidence >= minConfidence && rule.Support >= minSupport then yield rule
            }
    seq {
        for itemset in itemsets do yield! GetItemsetRules itemset supportData minSupport minConfidence transactionsCount }
That's all. Now we can try it by running the following program.
module AprioriProgram
 
open AprioriModule
open System.Collections.Generic
 
[<EntryPoint>]
let main argv = 
    let items = [[1]; [2]; [3]; [4]; [5]; [6]; [7]; [8]; [9]]
    let transactions = seq [ Set.ofList[3; 1; 4; 6;]; Set.ofList[1; 2; 4; 6]; 
        Set.ofList[1; 3; 4; 9]; Set.ofList[2; 5; 4; 9]; Set.ofList[6; 7; 8; 9] ]
    let supportData = new Dictionary<_, _>(HashIdentity.Structural)
    let minSupport = 0.4
    let frequentItems = GetFrequentItemsets items transactions minSupport supportData
    let rules = GetRules frequentItems supportData minSupport 0.6 (Seq.length transactions)
    
    System.Console.ForegroundColor <- System.ConsoleColor.Green
    
    frequentItems |> List.iter (fun x -> printf "%A " x)
    printfn ""
    printfn "-------------"
    rules |> Seq.iter (fun rule -> printfn "Reason=%A Consequence=%A Support=%A Confidence=%A" rule.Reason rule.Consequence rule.Support rule.Confidence)
    System.Console.ReadKey() |> ignore
    printfn "%A" argv
    0 

Sunday, May 29, 2016

Generating permutations in F# and C#

Despite the fact that this article is called «Permutations Generation», I would rather emphasize the use of functional recursive patterns that help decompose a complex problem into simple ones. The permutation algorithm is just an illustration of how we can implement an algorithm using recursion. Here is the example. We have the list [1; 2; 3; 4] and want to get all permutations in lexicographic order. This is what we should get. [
[1; 2; 3; 4]; [1; 2; 4; 3]; [1; 3; 2; 4]; [1; 3; 4; 2]; [1; 4; 2; 3]; [1; 4; 3; 2];
[2; 1; 3; 4]; [2; 1; 4; 3]; [2; 3; 1; 4]; [2; 3; 4; 1]; [2; 4; 1; 3]; [2; 4; 3; 1];
[3; 1; 2; 4]; [3; 1; 4; 2]; [3; 2; 1; 4]; [3; 2; 4; 1]; [3; 4; 1; 2]; [3; 4; 2; 1];
[4; 1; 2; 3]; [4; 1; 3; 2]; [4; 2; 1; 3]; [4; 2; 3; 1]; [4; 3; 1; 2]; [4; 3; 2; 1]]
This example data, sorted in lexicographic order, gives us the key to solving the task: I intentionally formatted the output so that the lists inside each row start with the same digit. First, we should divide the source list into a head and a tail: 1 and [2; 3; 4], respectively. After that we can see that the first row of the output can be produced by constructing lists the following way: the current head (1) followed by each possible permutation of the current tail ([2; 3; 4]). So we get 1 and [2; 3; 4], 1 and [2; 4; 3], 1 and [3; 4; 2], and so on. However, dividing into head and tail here is not the ordinary split from pattern matching, where the first element is the head and the rest compose the tail. It is just a concept in our case, and the implementation will be a little bit tricky. After we have handled the first row with 1 as the head, we go to the second row – the head in this case is 2 and the tail is [1; 3; 4]. Here the tail is constructed by concatenating the elements before the head and after the head – [1] and [3; 4], respectively. That's all the theory we need to start implementing the algorithm.
let generatePermutations source = 
    let rec generatePermutationsFrom prevElements tailElements =
        let rec nextStep source =
                match source with 
                            | [x; y] -> [[x; y]]
                            | x :: xs -> (generatePermutationsFrom [] xs |> List.map (fun perm -> x :: perm))
                            | _ -> []
        
        match tailElements with 
        | [] -> []
        | x :: xs ->
            let fullList = x :: (prevElements @ xs)
            (nextStep fullList) @ (generatePermutationsFrom (prevElements @ [x]) xs)
            
    generatePermutationsFrom [] source
The key aspects of this code are:
1. Constructing a list the following way: the current head (1) followed by all possible permutations of the current tail ([2; 3; 4]).
let rec nextStep source =
                match source with 
                            | [x; y] -> [[x; y]]
                            | x :: xs -> (generatePermutationsFrom [] xs |> List.map (fun perm -> x :: perm))
                            | _ -> []

2. Selecting the new head and the new tail.
let fullList = x :: (prevElements @ xs)
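As a quick check, calling the function on the example list reproduces the permutations listed at the beginning of the post:
generatePermutations [1; 2; 3; 4] |> List.iter (printfn "%A")
// prints [1; 2; 3; 4], [1; 2; 4; 3], ... , [4; 3; 2; 1] in lexicographic order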
We can also take the main idea of this algorithm and easily rewrite it in C#.
static List<List<T>> GetPermutations<T>(List<T> source)
        {
            if (source.Count() < 2) return new List<List<T>> { source };
 
            var prevElements = new List<T>();
            var tailElements = source.Skip(1).ToList();
            var result = new List<List<T>>();
            foreach (var item in source)
            {
                var newList = Enumerable.Concat(prevElements, tailElements).ToList();
                var tailPermutations = GetPermutations(newList);
                foreach (var tailPermutation in tailPermutations)
                    result.Add(Enumerable.Concat(new T[] { item }, tailPermutation).ToList());
                prevElements.Add(item);
                tailElements = tailElements.Skip(1).ToList();
            }
            return result;
        }

Tuesday, May 24, 2016

Using Neo4j with F#. Preparing for web page text analysis.


In my previous blog posts I wrote about crawling and HTML text extraction. Now it is time to save the data we crawled. It is also good to be able to retrieve the saved data to perform some analysis later. Let's introduce two types: Page and SymbolGroup. The Page type will be used to represent an HTML page. As for the SymbolGroup type – for simplicity you can treat it just as a word representation; however, for languages of the CJK (Chinese/Japanese/Korean) group this would not be correct. Here is the F# code with these types.
module Types
 
open Newtonsoft.Json
 
    [<CLIMutable>]
    type SymbolGroup = { Guid: string; Name: string; TailSeparator: string }
 
    [<CLIMutable>]
    type Page = { Guid: string; Url: string; }
Logically, the Page and SymbolGroup entities are linked by a «Contains» relation. We will use Neo4j as the data store and will reflect this relation there. First let's write the code for saving data.
module PageRepository
 
open Types
open Neo4jClient.Cypher
 
    let savePage(page: Page, symbolGroups: seq<SymbolGroup>) = 
        
        let graphClient = ConnectionProvider.getGraphClient()
        let query = graphClient.Cypher
                                .Create("(page:Page {newPage})")
                                .WithParam("newPage", page)
        let query, _ = symbolGroups |> Seq.fold (fun (q: ICypherFluentQuery, i) symbolGroup -> 
                                                       q
                                                        .Create("(symbolGroup" + i.ToString() + ":SymbolGroup {newSymbolGroup" + i.ToString() + "})")
                                                        .Create("(page)-[:CONTAINS]->(symbolGroup" + i.ToString() + ")")
                                                        .WithParam("newSymbolGroup" + i.ToString(), symbolGroup), i + 1
                                        
                            ) (query, 0)
        query.ExecuteWithoutResults()
        ()
As you can see from the code, a page and its symbol groups are saved by a single Cypher query. Note that for each symbol group we create its own parameter in the query:
q.Create("(symbolGroup" + i.ToString() + ":SymbolGroup {newSymbolGroup" + i.ToString() + "})")
This query consists of three parts:
1. Pages creation:
let query = graphClient.Cypher
                        .Create("(page:Page {newPage})")
                        .WithParam("newPage", page)
2. Symbol groups creation
q.Create("(symbolGroup" + i.ToString() + ":SymbolGroup {newSymbolGroup" + i.ToString() + "})")
3. Relations creation
.Create("(page)-[:CONTAINS]->(symbolGroup" + i.ToString() + ")")
To retrieve pages with their symbol groups I wrote the following code:
let getAllPages() =
   let graphClient = ConnectionProvider.getGraphClient()
   let result = graphClient.Cypher
                           .OptionalMatch("(page:Page)-[Contains]-(symbolGroup:SymbolGroup)")
                           .Return(fun (page: ICypherResultItem) (symbolGroup: ICypherResultItem) -> page.As<Page>(), symbolGroup.CollectAs<SymbolGroup>())
                           .Results
   result
It is important to explicitly specify the ICypherResultItem variable types in the
.Return(fun (page: ICypherResultItem) (symbolGroup: ICypherResultItem) -> page.As<Page>(), symbolGroup.CollectAs<SymbolGroup>())
expression. The result is presented as a tuple of a page and the collection of its symbol groups. Once we have the persistence logic we can store our HTML data like this
namespace TextStore
 
open System
open Types
 
module UrlHandler = 
    
    let handleUrl(url:string, content:string) =
        let symbolGroups = TextUtils.extractWords(content) |> Seq.toList
        let guid = Guid.NewGuid()
        let page = { Guid = guid.ToString(); Url = url; }
        PageRepository.savePage(page, symbolGroups)
        ()
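The repository code above assumes a ConnectionProvider module that creates the Neo4j client; it is not shown in the snippets. A minimal sketch on top of Neo4jClient could look like this (the URI and credentials are placeholders for your Neo4j instance):
module ConnectionProvider
 
open System
open Neo4jClient
 
// Placeholder connection settings; point them at your own Neo4j instance.
let getGraphClient() =
    let client = new GraphClient(new Uri("http://localhost:7474/db/data"), "neo4j", "password")
    client.Connect()
    client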
We can see the graph in the Neo4j browser. For this example I crawled the Chinese Wikipedia to reflect the idea that the SymbolGroup type can be used for more than just word representation.



Thursday, May 12, 2016

Suffix Arrays in F#

Let's consider the task where you need to find the longest repeating subsequence in some sequence, or just analyze whether there are any repeating subsequences.
One way to solve it is to use a suffix array. A suffix array is an array which contains all suffixes (the subsequences running to the end) of the given sequence, in sorted order. In this article I want to describe the key points about suffix arrays and implement the most important parts in F#.
Below is a unit test which illustrates what a suffix array is.
[<Test>]
member this.CreateSuffixArrayTest() = 
    let source = seq [ 'A'; 'F'; 'U'; 'B'; 'D'; 'B'; 'D' ]
    let result = SuffixArray.create source
    Assert.AreEqual(result.Length, 7)
    Assert.AreEqual(result.[0], [| 'A'; 'F'; 'U'; 'B'; 'D'; 'B'; 'D' |])
    Assert.AreEqual(result.[1], [| 'B'; 'D' |])
    Assert.AreEqual(result.[2], [| 'B'; 'D'; 'B'; 'D' |])
    Assert.AreEqual(result.[3], [| 'D' |])
    Assert.AreEqual(result.[4], [| 'D'; 'B'; 'D' |])
    Assert.AreEqual(result.[5], [| 'F'; 'U'; 'B'; 'D'; 'B'; 'D' |])
    Assert.AreEqual(result.[6], [| 'U'; 'B'; 'D'; 'B'; 'D' |])

It is a simple data structure and the implementation is quite simple too, especially in F#.
First we need to create the array of suffixes — we can do it in just one line:

let result = Array.init maxSuffix.Length (fun i -> maxSuffix |> Array.skip i) 
Then we need to sort the suffixes — we use the Array.sortWith function with a custom comparer. For performance reasons it is written in an imperative way.
namespace Analytics
 
open System
 
module SuffixArray =
 
    let create (data: seq<'a>) = 
        let maxSuffix = data |> Seq.toArray
        let result = Array.init maxSuffix.Length (fun i -> maxSuffix |> Array.skip i) 
                    |> Array.sortWith (fun (x: 'a[]) (y: 'a[]) ->
                                            let minLength = Math.Min(x.Length, y.Length)
                                            let mutable index = 0
                                            while (index < minLength) && (x.[index] = y.[index]) do index <- index + 1
                                            if index < minLength then
                                                if x.[index] > y.[index] then 1 else -1
                                            else
                                                // one suffix is a prefix of the other: the shorter one comes first
                                                compare x.Length y.Length
                                            )
        result
After the suffix array has been built we can start to analyze it.
Now we will find all repeating suffixes. Here is a unit test demonstrating how it should work.
[<Test>]
member this.FindRepeatingSuffixesTest() = 
    let source = seq [ 'A'; 'F'; 'U'; 'B'; 'D'; 'B'; 'D'; 'A'; 'F'; 'U' ]
    let suffixArray = SuffixArray.create source
    let result = SuffixArray.findRepeatingSuffixes suffixArray |> Seq.toArray
    Assert.AreEqual(result.Length, 3)
    Assert.AreEqual(['A'; 'F'; 'U'], result.[0])
    Assert.AreEqual(['B'; 'D'], result.[1])
    Assert.AreEqual(['F'; 'U'], result.[2])
To find repeating suffixes we just consider adjacent suffixes and take their largest common prefix (if it is present, of course). Here is the code.
let findRepeatingSuffixes (data: 'a[][]) = seq {
    for i = 1 to data.Length - 1 do
        let largestCommonPrefix = findLargestCommonPrefix data.[i - 1] data.[i]
        if largestCommonPrefix.Length > 1 then yield largestCommonPrefix
    }
Here is a unit test showing how the findLargestCommonPrefix function should work.
[<Test>]
member this.FindLargestCommonPrefixTest() = 
    let first = [| 'A'; 'F'; 'U'; 'B'; 'D'; 'B'; 'D' |]
    let second = [| 'A'; 'F'; 'Z'; 'B'; 'D'; 'B'; 'D' |]
    let result = SuffixArray.findLargestCommonPrefix first second
    Assert.AreEqual(['A'; 'F'], result)
And here is the implementation. First we determine the shorter suffix: obviously, the common prefix cannot be longer than the shorter suffix. Second, we compare the two suffixes element by element from the beginning while their elements are equal. Once we reach the end of the shorter suffix or meet unequal elements, we stop the search.
let findLargestCommonPrefix (first: 'a[]) (second: 'a[]) =
    let minLength = Math.Min(first.Length, second.Length) 
    let mutable index = 0
    while (index < minLength) && (first.[index] = second.[index]) do index <- index + 1
    first.[0 .. index - 1] // empty when there is no common prefix
As you can see, the analysis is quite simple and you can easily extend or modify it for your needs. Also, thanks to the F# type inference system, all the algorithms described above are generic: you can use them not only with strings (sequences of chars) but also with sequences of any other type.
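For example, applying the module to a string (which is just a sequence of chars) looks like this; the expected output follows from the unit test above:
let text = "AFUBDBDAFU"
let repeating =
    text
    |> SuffixArray.create
    |> SuffixArray.findRepeatingSuffixes
    |> Seq.map (fun chars -> System.String(chars))
    |> Seq.toList
// ["AFU"; "BD"; "FU"]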

Finally, a few notes about .NET environment configuration. As you may suspect, a suffix array consumes a lot of memory. We should open the project properties and UNCHECK the “Prefer 32-bit” option on the Build tab. Otherwise you will not be able to create arrays above 4 GB.

Tuesday, May 10, 2016

Implementing web crawler. Part 2. Using F#.

The previous article was dedicated to the description of important aspects of crawler programming. In this article we will implement a web crawler in F#, taking these aspects into account.

First let's implement an auxiliary module which will help us handle links. There are two functions that will be useful for us later.
1. let prepareUrl url (parentUrl: string) — this function prepares a link URL for the transition. As I mentioned in the previous article, there are different kinds of URLs (absolute, relative, protocol-relative and invalid), so it is convenient to create a special active pattern for these kinds for later use.
let (|AbsoluteUrl|RelativeUrl|ProtocolRelativeUrl|InvalidUrl|) (input: string) = 
            if input.StartsWith("http://") || input.StartsWith("https://"then AbsoluteUrl
            else if input.StartsWith("//"then ProtocolRelativeUrl
            else if not (String.IsNullOrWhiteSpace input) then RelativeUrl
            else InvalidUrl
Also to simplify code let's introduce active patterns for slash detection.
let (|StartsWithSlash|_|) (input: string) = if input.StartsWith("/") then Some StartsWithSlash else None
let (|EndsWithSlash|_|) (input: string) = if input.EndsWith("/") then Some EndsWithSlash else None
Below is the full source code of this function. If a URL is not invalid, it will be converted to an absolute one.
let prepareUrl url (parentUrl: string) =
    match url with
    | AbsoluteUrl -> Some url
    | ProtocolRelativeUrl -> 
        let protocol = parentUrl.Substring(0, parentUrl.IndexOf("//"))
        Some (protocol + url)
    | RelativeUrl -> 
        let preparedParentUrl = 
            match parentUrl with
            | EndsWithSlash -> parentUrl.Substring(0, parentUrl.Length - 1)
            | _ -> parentUrl
        let preparedUrl = 
            match url with
            | StartsWithSlash -> url
            | _ -> "/" + url
        Some (preparedParentUrl + preparedUrl)
    | InvalidUrl -> None
2. let getDomainName url: string — this function extracts the domain name from a link.
let getDomainName url: string = 
    let uri = new Uri(url)
    let host = uri.Host
    if host.StartsWith("www."then host.Substring(4)
    else host
Here is the full source code of this module.
namespace Crawler
 
open System
 
module LinksHandler =
    let (|AbsoluteUrl|RelativeUrl|ProtocolRelativeUrl|InvalidUrl|) (input: string) = 
            if input.StartsWith("http://") || input.StartsWith("https://"then AbsoluteUrl
            else if input.StartsWith("//"then ProtocolRelativeUrl
            else if not (String.IsNullOrWhiteSpace input) then RelativeUrl
            else InvalidUrl
 
    let (|StartsWithSlash|_|) (input: string) = if input.StartsWith("/") then Some StartsWithSlash else None
    let (|EndsWithSlash|_|) (input: string) = if input.EndsWith("/") then Some EndsWithSlash else None
 
    let prepareUrl url (parentUrl: string) =
        match url with
        | AbsoluteUrl -> Some url
        | ProtocolRelativeUrl -> 
            let protocol = parentUrl.Substring(0, parentUrl.IndexOf("//"))
            Some (protocol + url)
        | RelativeUrl -> 
            let preparedParentUrl = 
                match parentUrl with
                | EndsWithSlash -> parentUrl.Substring(0, parentUrl.Length - 1)
                | _ -> parentUrl
            let preparedUrl = 
                match url with
                | StartsWithSlash -> url
                | _ -> "/" + url
            Some (preparedParentUrl + preparedUrl)
        | InvalidUrl -> None
 
    let getDomainName url: string = 
        let uri = new Uri(url)
        let host = uri.Host
        if host.StartsWith("www."then host.Substring(4)
        else host
The next module contains the crawling logic. It defines some additional types.
    [<Measure>] type ms
 
type ParserOptions = interface
    abstract SleepTimeInsideDomain: int<ms> with get, set
    abstract StartUrl : string with get, set
    abstract CancellationTokenSource : CancellationTokenSource with get, set
    abstract ReturnPureText: bool with get, set
    abstract GetStartNode: HtmlDocument -> HtmlNode
    abstract HandleUrl : url: string * content: string -> unit
    abstract HandleException : ex: Exception * url: string -> unit
    abstract FilterUrl: url: string -> bool
end
The ParserOptions type is used for crawler behavior customization. Below is a description of the available options.
1. SleepTimeInsideDomain — the delay in milliseconds between requests to the same domain. The measure type 'ms' is used to specify the value in milliseconds.
2. StartUrl — the URL we start crawling from.
3. CancellationTokenSource — used for cancelling the crawl; we pass it to the crawler through the options, and when we decide to cancel crawling we just call the CancellationTokenSource.Cancel() method.
4. ReturnPureText — depending on the specifics of our analysis we may need either the full HTML code or the extracted plain text. If we set it to true, the returned text will be cleaned from the HTML envelope.
5. GetStartNode — usually used when you crawl a specific site and want to analyze only a specific block of the page. For example, you crawl Wikipedia and want to analyze only the content, without headers and sidebars.
6. HandleUrl — the payload function that will be called after a page is downloaded. For example, you can save the text content to a database. This function receives the page content and URL as parameters.
7. HandleException — the exception handler.
8. FilterUrl — lets you customize which URLs can or cannot be visited.

We will use an F# agent (MailboxProcessor) as the basis of the crawler. It will help us efficiently and easily manage shared state (we need to keep information about visited links to prevent visiting the same page multiple times, and information about the last visit time for each visited domain to correctly organize delays between per-domain requests).
First, when a message is received by the agent, the link from the message is added to the visited links collection:
visitedLinks.Add(preparedUrl) |> ignore
Second, we determine the delay time, based on the link's domain and the stored information about domain access times. If the domain is not registered yet, we set the delay to 0, because we have not visited this domain yet, and we register it as visited at the current time. Otherwise, if we have already visited this domain, we calculate the time span between the current time and the domain's last access time. If it is greater than the SleepTimeInsideDomain option value, the delay is set to 0 as well. However, if it is less, the delay is calculated as (last domain access time + SleepTimeInsideDomain option value - current time).
let (accessTimeByDomains, sleepTime) = 
                                if not (accessTimeByDomains.ContainsKey(currentLinkDomain)) then
                                    accessTimeByDomains.Add(currentLinkDomain, DateTime.Now), 0
                                else 
                                    let lastDomainAccessTime = accessTimeByDomains.[currentLinkDomain]
                                    let minAllowedAccessTime = lastDomainAccessTime.AddMilliseconds(float parserArguments.SleepTimeInsideDomain)
                                    let now = DateTime.Now
                                    let delay = if minAllowedAccessTime > now then (minAllowedAccessTime - now).TotalMilliseconds else float 0
                                    let accessTimeByDomains = accessTimeByDomains.Remove(currentLinkDomain)
                                    accessTimeByDomains.Add(currentLinkDomain, (if minAllowedAccessTime > now then minAllowedAccessTime else now)), int delay
After the delay calculation we handle the URL asynchronously. While this URL is being processed, the agent is able to handle other URLs (or receive other messages). We use the F# Data library for HTML parsing and link extraction. As soon as links are extracted, we post a processing request to the message queue for each not-yet-visited link. Only links satisfying the filter function (FilterUrl) from the parser options will be handled.
do! Async.StartChild
        (async {
            if sleepTime > 0 then do! Async.Sleep sleepTime
 
            printfn "***********************************************************"
            printfn "query to url %s at %s" preparedUrl (DateTime.Now.ToString())
            printfn "***********************************************************"
            
            let! html = HtmlDocument.AsyncLoad(preparedUrl)
            let startNode = parserArguments.GetStartNode(html)
            parserArguments.HandleUrl(preparedUrl, if parserArguments.ReturnPureText then HtmlUtils.extractText startNode else html.ToString())
            let loadCompleteTime = DateTime.Now
            let links = html.Descendants ["a"]
                        |> Seq.choose (fun x ->
                        x.TryGetAttribute("href")
                        |> Option.map (fun a -> x.InnerText(), a.Value()))
                        |> Seq.filter (fun (text, link) -> not (visitedLinks.Contains(link)) && parserArguments.FilterUrl(link))
            links |> Seq.iter (fun (text, link) -> self.Post (link, preparedUrl))
        }) |> Async.Ignore
Note that the MailboxProcessor.Start method receives a CancellationToken as a parameter; this helps us cancel the crawling. On exception, the HandleException handler from the parser options is called. When the agent finishes processing a message, it recurses with the updated collections of visited links and domain visit times. Here is the full source code of this module.
namespace Crawler
 
open FSharp.Data
open System
open System.Threading
 
module Parser =
 
    [<Measure>] type ms
 
    type ParserOptions = interface
        abstract SleepTimeInsideDomain: int<ms> with get, set
        abstract StartUrl : string with get, set
        abstract CancellationTokenSource : CancellationTokenSource with get, set
        abstract ReturnPureText: bool with get, set
        abstract GetStartNode: HtmlDocument -> HtmlNode
        abstract HandleUrl : url: string * content: string -> unit
        abstract HandleException : ex: Exception * url: string -> unit
        abstract FilterUrl: url: string -> bool
    end
 
    let startCrawling (parserArguments: ParserOptions) =
        MailboxProcessor.Start((fun self -> 
            let rec handleUrl(visitedLinks: Set<string>, accessTimeByDomains: Map<string, DateTime>) =
                async {
                    let! (targetUrl, parentUrl) = self.Receive()
                    try
                        let url = LinksHandler.prepareUrl targetUrl parentUrl
                        
                        match url with
                        | Some preparedUrl ->
                            let currentLinkDomain = LinksHandler.getDomainName preparedUrl
                            
                            visitedLinks.Add(preparedUrl) |> ignore
 
                            let (accessTimeByDomains, sleepTime) = 
                                if not (accessTimeByDomains.ContainsKey(currentLinkDomain)) then
                                    accessTimeByDomains.Add(currentLinkDomain, DateTime.Now), 0
                                else 
                                    let lastDomainAccessTime = accessTimeByDomains.[currentLinkDomain]
                                    let minAllowedAccessTime = lastDomainAccessTime.AddMilliseconds(float parserArguments.SleepTimeInsideDomain)
                                    let now = DateTime.Now
                                    let delay = if minAllowedAccessTime > now then (minAllowedAccessTime - now).TotalMilliseconds else float 0
                                    let accessTimeByDomains = accessTimeByDomains.Remove(currentLinkDomain)
                                    accessTimeByDomains.Add(currentLinkDomain, (if minAllowedAccessTime > now then minAllowedAccessTime else now)), int delay
 
                            do! Async.StartChild
                                    (async {
                                        if sleepTime > 0 then do! Async.Sleep sleepTime
 
                                        printfn "***********************************************************"
                                        printfn "query to url %s at %s" preparedUrl (DateTime.Now.ToString())
                                        printfn "***********************************************************"
                                        
                                        let! html = HtmlDocument.AsyncLoad(preparedUrl)
                                        let startNode = parserArguments.GetStartNode(html)
                                        parserArguments.HandleUrl(preparedUrl, if parserArguments.ReturnPureText then HtmlUtils.extractText startNode else html.ToString())
                                        let loadCompleteTime = DateTime.Now
                                        let links = html.Descendants ["a"]
                                                    |> Seq.choose (fun x ->
                                                    x.TryGetAttribute("href")
                                                    |> Option.map (fun a -> x.InnerText(), a.Value()))
                                                    |> Seq.filter (fun (text, link) -> not (visitedLinks.Contains(link)) && parserArguments.FilterUrl(link))
                                        links |> Seq.iter (fun (text, link) -> self.Post (link, preparedUrl))
                                    }) |> Async.Ignore
                            return! handleUrl(visitedLinks.Add(preparedUrl), accessTimeByDomains)
                        | None -> ()
                    with
                        | ex -> parserArguments.HandleException (ex, targetUrl)
                }
            handleUrl (Set.empty, Map.empty)), parserArguments.CancellationTokenSource.Token).Post (parserArguments.StartUrl, "")
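The Parser module above calls HtmlUtils.extractText, which comes from the earlier article on HTML text extraction and is not repeated here. For completeness, a simplified sketch of such a helper on top of FSharp.Data could look like this (the real implementation from the previous article may differ):
namespace Crawler
 
open FSharp.Data
 
module HtmlUtils =
 
    // Simplified sketch: walk the element tree and concatenate the text of leaf
    // elements, skipping script and style blocks.
    let rec extractText (node: HtmlNode) =
        match node.Name() with
        | "script" | "style" -> ""
        | _ ->
            let children = node.Elements() |> Seq.toList
            if List.isEmpty children then node.InnerText().Trim()
            else
                children
                |> List.map extractText
                |> List.filter (fun text -> text <> "")
                |> String.concat " "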
 
Finally, let's look at an example of the parser options. These are the options for parsing Wikipedia. Note that we will parse only the content block, without headers and sidebars. In this example, parsing will be stopped after 100 URLs have been processed.

open Crawler
open System
open System.Threading
open Parser
open TextStore
open FSharp.Data
 
type ParserArguments() = 
    member val ProcessedUrlCount = 0 with get, set
    interface ParserOptions with
            member val SleepTimeInsideDomain = 3000<ms> with get, set
            member val StartUrl = "https://en.wikipedia.org" with get, set
            member val CancellationTokenSource = new CancellationTokenSource() with get, set
            member val ReturnPureText = true with get, set
            member this.HandleException(ex: Exception, url: string) =
                printfn "%s EXCEPTION for url %s" (ex.Message + "---" + ex.StackTrace) url
                ()
            member this.HandleUrl(url: string, content: string) = 
                if this.ProcessedUrlCount = 100 then (this :> ParserOptions).CancellationTokenSource.Cancel()
                else
                    this.ProcessedUrlCount <- this.ProcessedUrlCount + 1
                    UrlHandler.handleUrl(url, content)
            member this.FilterUrl(url: string) = 
                url.StartsWith("/w")
            member this.GetStartNode(document: HtmlDocument) =
                let contentDiv = document.Descendants(fun node -> node.HasId("content") && node.HasName("div") && node.HasAttribute("role", "main")) |> Seq.tryHead
                match contentDiv with
                | Some(div) -> div
                | None -> document.Body()
 
[<EntryPoint>]
let main argv = 
    let parserArguments = new ParserArguments()
    Parser.startCrawling parserArguments
    System.Console.ReadKey() |> ignore
    0