четверг, 12 мая 2016 г.

Suffix Arrays in F#

Let's consider the task when you need to find the longest repeating subsequence in some sequence or just analyze if there are some repeating sequences.
One of the ways to solve it is to use suffix array. Suffix array is the array which contains all sorted subsequences (suffixes) of the given sequence. In this article I want to describe key points about suffix arrays and implement most important aspects in F#.
Below you can see the unit test which illustrates what suffix array is.
[<Test>]
member this.CreateSuffixArrayTest() = 
    let source = seq [ 'A''F''U''B''D''B''D' ]
    let result = SuffixArray.create source
    Assert.AreEqual(result.Length, 7)
    Assert.AreEqual(result.[0], [| 'A''F''U''B''D''B''D' |])
    Assert.AreEqual(result.[1], [| 'B''D' |])
    Assert.AreEqual(result.[2], [| 'B''D''B''D' |])
    Assert.AreEqual(result.[3], [| 'D' |])
    Assert.AreEqual(result.[4], [| 'D''B''D' |])
    Assert.AreEqual(result.[5], [| 'F''U''B''D''B''D' |])
    Assert.AreEqual(result.[6], [| 'U''B''D''B''D' |])

It is simple data structure and implementation is quite simple too, especially in F#.
First we need to create array of subsequences (suffixes) — we can do it by just one line

let result = Array.init maxSuffix.Length (fun i -> maxSuffix |> Array.skip i) 
Then we need to sort subsequences — we use Array.sortWith function with custom comparer. For the point of performance it is written in imperative way.
namespace Analytics
 
open System
 
module SuffixArray =
 
    let create (data: seq<'a>) = 
        let maxSuffix = data |> Seq.toArray
        let result = Array.init maxSuffix.Length (fun i -> maxSuffix |> Array.skip i) 
                    |> Array.sortWith (fun (x: 'a[]) (y: 'a[]) ->
                                            let minLength = Math.Min(x.Length, y.Length)
                                            let mutable index = 0
                                            while (index < minLength - 1) && (x.[index] = y.[index]) do index <- index + 1
                                            if index < minLength then
                                                if x.[index] > y.[index] then 1 else -1
                                            else
                                                if x.Length > y.Length then 1 else -1
                                            )
        result
After suffix array has been built we can start to analyze it.
Now we will find all repeating suffixes. Here is the unit test demonstrating how it should work.
[<Test>]
member this.FindRepeatingSuffixesTest() = 
    let source = seq [ 'A''F''U''B''D''B''D''A''F''U' ]
    let suffixArray = SuffixArray.create source
    let result = SuffixArray.findRepeatingSuffixes suffixArray |> Seq.toArray
    Assert.AreEqual(result.Length, 3)
    Assert.AreEqual(['A''F''U'], result.[0])
    Assert.AreEqual(['B''D'], result.[1])
    Assert.AreEqual(['F''U'], result.[2])
To find repeating suffixes we should just consider adjacent suffixes and take their largest common prefix (of course if it presents). Here is the code.
let findRepeatingSuffixes (data: 'a[][]) = seq {
    for i = 1 to data.Length - 1 do
        let largestCommonPrefix = findLargestCommonPrefix data.[i - 1] data.[i]
        if largestCommonPrefix.Length > 1 then yield largestCommonPrefix
    }
Here is the unit test showing how the findLargestCommonPrefix function should work.
[<Test>]
member this.FindLargestCommonPrefixTest() = 
    let first = [| 'A''F''U''B''D''B''D' |]
    let second = [| 'A''F''Z''B''D''B''D' |]
    let result = SuffixArray.findLargestCommonPrefix first second
    Assert.AreEqual(['A''F'], result)
And here is the implementation. At first we determine the smallest suffix. Apparently, the common suffix cannot be longer then the smallest suffix. Second – we just compare two suffixes element by element from the beginning until they have equal elements. After we turned on the end of the smallest suffix or there are no equal elements we should stop search.
let findLargestCommonPrefix (first: 'a[]) (second: 'a[]) =
    let minLength = Math.Min(first.Length, second.Length) 
    let mutable index = 0
    while (index < minLength) && (first.[index] = second.[index]) do index <- index + 1
    first.[0 .. Math.Max(0, index - 1)]
As you can see the analysis is quite simple and you can easily expand/modify it for your needs. Also because of F# type inference system all algorithms described above are generic. You can use it not only with strings (sequences of chars) but also with sequences of any other types.

Finally I would tell some notes about .NET environment configuration. As you may suspect the suffix array consumes a lot of memory. We should open project properties and UNCHECK “Prefer 32-bit” option on the build tab. Otherwise you could not create array above 4GB.

Комментариев нет:

Отправить комментарий