понедельник, 27 июня 2016 г.

Apriori algorithm implementation in F#

Because I now have a blog I want to share one of my old code pieces – Apriori algorithm implementation. Apriori algorithm is probably the most famous algorithm for frequent itemsets search and association rules implication. Let's look on definitions that will be used in algorithm description and implementation.
1. Support for an itemset is calculated as a ratio of transaction count containing this itemset to total transaction count.
2. Confidence for an association rule X → Y is calculated as ratio of transaction count containing X and Y to transaction count containing X.
3. Frequent itemset is the itemset with support equal or more defined support threshold.
4. Anti-monotony property – if itemset X is not frequent itemset that addition of new item xn to this itemset will not turn this itemset into frequent. This rule will be used to reduce the set of all possible association rules.
I'll briefly explain the algorithm by example. There are transactions: [3; 1; 4; 6;]; [1; 2; 4; 6]; [1; 3; 4; 9]; [2; 5; 4; 9]; [6; 7; 8; 9]. Let support will be 40%. First we should find all one-item itemsets satisfying the given support. Here is the result: [[1], [2], [3], [4], [6], [9]]. After that we will generate all possible two-item itemsets, prune them using the support value and get the following result. T [[1;6]; [1;4]; [1;3]; [2;4]; [3;4]; [4;9]; [4;6]]. Three- and more itemsets will be generated by the linking of the set above with itself. For example, if we link [1;6] and [1; 4] the result will be [1; 6; 4]. After pruning only two itemsets left: [1;4;6] and [1;3;4]. As these itemsets cannot be linked because they don't have k – 1 common element. (where k is the itemset length, so to be linked these itemsets should have 2 common elements, but they have only 1) we finish the candidates generation process. Now we have frequent itemsets and should infer itemset rules. We need to generate subsets of all sizes for remained itemsets. After filtering by support and confidence values we get the following results.
Reason=[1; 4] Consequence=[6] Support=0.6 Confidence=0.6666666667
Reason=[1; 6] Consequence=[4] Support=0.4 Confidence=1.0
Reason=[4; 6] Consequence=[1] Support=0.4 Confidence=1.0
Reason=[1] Consequence=[4; 6] Support=0.6 Confidence=0.6666666667
Reason=[1] Consequence=[4] Support=0.6 Confidence=0.6666666667
Reason=[1] Consequence=[6] Support=0.6 Confidence=0.6666666667
Reason=[6] Consequence=[1; 4] Support=0.6 Confidence=0.6666666667
Reason=[6] Consequence=[1] Support=0.6 Confidence=0.6666666667
Reason=[6] Consequence=[4] Support=0.6 Confidence=0.6666666667
Reason=[1; 3] Consequence=[4] Support=0.4 Confidence=1.0
Reason=[1; 4] Consequence=[3] Support=0.6 Confidence=0.6666666667
Reason=[3; 4] Consequence=[1] Support=0.4 Confidence=1.0
Reason=[1] Consequence=[3; 4] Support=0.6 Confidence=0.6666666667
Reason=[1] Consequence=[3] Support=0.6 Confidence=0.6666666667
Reason=[1] Consequence=[4] Support=0.6 Confidence=0.6666666667
Reason=[3] Consequence=[1; 4] Support=0.4 Confidence=1.0
Reason=[3] Consequence=[1] Support=0.4 Confidence=1.0
Reason=[3] Consequence=[4] Support=0.4 Confidence=1.0

Below is the code of Apriori algorithm module with descriptions.
module AprioriModule
 
open System.Collections.Generic
open System
Let's define the type for Apriori rule the following way:
type Rule<'T> = { Reason: 'T list; Consequence: 'T list; Support: double; Confidence: double }
The function for calculation of itemset support absolute value is defined the following way: we increment support count for itemset if currently visited transaction contains every itemset item.
let GetItemsetSupportCount itemset transactions =
    Seq.fold (fun sum transaction -> if List.forall (fun item -> Set.contains item transaction) itemset then sum + 1 else sum) 0 transactions
After we know how to calculate support we can separate frequent itemsets from not frequent ones. Also we fill data about itemsets support (supportData : Dictionary<_ int="">).
let ExtractFrequentItemsets minSupport itemsets transactions (supportData : Dictionary<_, int>) =
    let transactionsCount = Seq.length transactions
    let minSupportRequiredCount = int(Math.Ceiling((float transactionsCount) * minSupport))
    itemsets |> List.filter (fun x -> 
        let supportCount = GetItemsetSupportCount x transactions
        if minSupportRequiredCount <= supportCount then
            supportData.Add(x, supportCount)
            true
        else false)
As mentioned above, new candidates are generated by linking of the set with itself. Here is the unit test to illustrate what should we get.
    [<Test>]
member this.MergeListsTest() = 
    let list1 = [1; 5; 6; 9]
    let list2 = [1; 5; 6; 12]
    let mergeResult = MergeLists list1 list2
    Assert.AreEqual([1; 5; 6; 9; 12], mergeResult)
 
    let list3 = [1; 8]
    let list4 = [8; 9]
    let mergeResult2 = MergeLists list3 list4
    Assert.AreEqual([], mergeResult2)
This specific kind of merge is implemented by the following way.
let MergeLists list1 list2 = 
    let rec MergeListsInner list1 list2 mergeResult =
        match list1, list2 with
        |   [],_ -> []
        |   _,[] -> []
        |   head1 :: tail1, head2 :: tail2 when head1 <> head2 -> []
        |   head1 :: [_], head2 :: tail2 -> mergeResult@list1@tail2 |> List.sort
        |   head1 :: tail1, head2 :: tail2 -> MergeListsInner tail1 tail2 ([head1]@mergeResult)
    let result = MergeListsInner list1 list2 []
    result
The one exclusion of this case is when we will generate all possible two-item itemsets. The GenerateCombinations function reflects this.
let rec GenerateCombinations item list combinations =
    match item, list with
    |   _ , [] -> List.filter (not << List.isEmpty) combinations
    |   [h], [x] :: xs -> GenerateCombinations item xs ([[h; x]] @ combinations)
    |   _ , x :: xs -> GenerateCombinations item xs (([MergeLists item x ]) @ combinations)
So again: for generating combinations of one-item itemsets we just linking them together;
|   [h], [x] :: xs -> GenerateCombinations item xs ([[h; x]] @ combinations)
for n-item (n > 1) itemsets we use MergeLists function
|   _ , x :: xs -> GenerateCombinations item xs (([MergeLists item x ]) @ combinations)
Here is the unit test for combinations generation functionality.
    [<Test>]
member this.GenerateCombinationsTest() =
    let tail1 = [[1; 3]; [1; 4]; [1; 5]]
    let generateResult1 = GenerateCombinations [1; 2] tail1 []
    Assert.AreEqual([ [1; 2; 5]; [1; 2; 4]; [1; 2; 3] ], generateResult1)
 
    let tail2 = [[2]; [3]]
    let generateResult2 = GenerateCombinations [1] tail2 []
    Assert.AreEqual([ [1; 3]; [1; 2;]], generateResult2)
Finally we can implement function for frequent itemset search. Unit test first.
[<Test>]
    member this.GetFrequentItemsetsTest() =
        let supportData = new Dictionary<_, _>(HashIdentity.Structural)
        let items = [[1]; [2]; [3]; [4]; [5]; [6]; [7]; [8]; [9]]
        let transactions = [Set.ofList[3; 1; 4; 6;]; Set.ofList[1; 2; 4; 6]; Set.ofList[1; 3; 4; 9]; Set.ofList[2; 5; 4; 9]; Set.ofList[6; 7; 8; 9]]
        let frequentItems = GetFrequentItemsets items transactions 0.4 supportData
        Assert.AreEqual([[1; 4; 6]; [1; 3; 4]], frequentItems);
And here is the implementation.
let GetFrequentItemsets candidates transactions minSupport supportData = 
    let rec GetFrequentItemsetsInner candidates transactions lastStepFrequentItemsets minSupport supportData = 
        let rec GenerateCandidates itemsets candidates =
            match itemsets with 
            |   [] -> candidates
            |   x :: xs -> GenerateCandidates xs (candidates @ (GenerateCombinations x xs []))
        let currentStepFrequentItemsets = (ExtractFrequentItemsets minSupport candidates transactions supportData) 
        match currentStepFrequentItemsets with 
        |   [] -> lastStepFrequentItemsets
        |   _  -> GetFrequentItemsetsInner (GenerateCandidates candidates []) transactions currentStepFrequentItemsets minSupport supportData
    GetFrequentItemsetsInner candidates transactions [] minSupport supportData
After we know frequent itemsets we can generate rules. There is auxiliary function for generation of combinations of all sizes.
let rec GenerateRulesCombinationsOfAllSizes acc size set = 
    let rec GenerateRulesCombinationsOfSize acc size set = seq {
      match size, set with 
      | n, x::xs -> 
          if n > 0 then yield! GenerateRulesCombinationsOfSize (x::acc) (n - 1) xs
          if n >= 0 then yield! GenerateRulesCombinationsOfSize acc n xs 
      | 0, [] -> yield List.sort acc 
      | _, [] -> () }
    seq { for i = size downto 1 do yield! GenerateRulesCombinationsOfSize [] i set }
To generate rules we should generate item combinations of all sizes for the given itemset. Each combination will be treated as reason and difference between the whole itemset and this combination will be treated as consequence.
let GetRules itemsets supportData minSupport minConfidence transactionsCount = 
    let GetItemsetRules itemset (supportData : Dictionary<_, int>) minSupport minConfidence allItemsetsLength= seq {
        let set = Set.ofList itemset
        let itemsetLength = set |> Set.count
        let combinations = GenerateRulesCombinationsOfAllSizes [] (itemsetLength - 1) itemset
        for reasonCombination in combinations do
            let rest = Set.difference set (Set.ofList reasonCombination)
            let consequenceCombinations = GenerateRulesCombinationsOfAllSizes [] (Set.count rest) (Set.toList rest)
            for consequenceCombinaion in consequenceCombinations do
                let rule = { Reason = reasonCombination; Consequence = consequenceCombinaion; 
                    Support = (double supportData.[reasonCombination] / double allItemsetsLength); 
                    Confidence =  (double supportData.[itemset] / double supportData.[reasonCombination])}
                if rule.Confidence >= minConfidence && rule.Support >= minSupport then yield rule
            }
    seq {
        for itemset in itemsets do yield! GetItemsetRules itemset supportData minSupport minConfidence transactionsCount }
That's all. Now we can try it by running the following way.
module AprioriProgram
 
open AprioriModule
open System.Collections.Generic
 
[<EntryPoint>]
let main argv = 
    let items = [[1]; [2]; [3]; [4]; [5]; [6]; [7]; [8]; [9]]
    let transactions = seq [ Set.ofList[3; 1; 4; 6;]; Set.ofList[1; 2; 4; 6]; 
        Set.ofList[1; 3; 4; 9]; Set.ofList[2; 5; 4; 9]; Set.ofList[6; 7; 8; 9] ]
    let supportData = new Dictionary<_, _>(HashIdentity.Structural)
    let minSupport = 0.4
    let frequentItems = GetFrequentItemsets items transactions minSupport supportData
    let rules = GetRules frequentItems supportData minSupport 0.6 (Seq.length transactions)
    
    System.Console.ForegroundColor <- System.ConsoleColor.Green
    
    frequentItems |> List.iter (fun x -> printf "%A " x)
    printfn ""
    printfn "-------------"
    rules |> Seq.iter (fun rule -> printfn "Reason=%A Consequence=%A Support=%A Confidence=%A" rule.Reason rule.Consequence rule.Support rule.Confidence)
    System.Console.ReadKey() |> ignore
    printfn "%A" argv
    0