Let's consider the task when you need to find the longest repeating subsequence in some sequence or just analyze if there are some repeating sequences.
One of the ways to solve it is to use suffix array. Suffix array is the array which contains all sorted subsequences (suffixes) of the given sequence. In this article I want to describe key points about suffix arrays and implement most important aspects in F#.
Below you can see the unit test which illustrates what suffix array is.
It is simple data structure and implementation is quite simple too, especially in F#.
First we need to create array of subsequences (suffixes) — we can do it by just one line
Now we will find all repeating suffixes. Here is the unit test demonstrating how it should work.
Finally I would tell some notes about .NET environment configuration. As you may suspect the suffix array consumes a lot of memory. We should open project properties and UNCHECK “Prefer 32-bit” option on the build tab. Otherwise you could not create array above 4GB.
One of the ways to solve it is to use suffix array. Suffix array is the array which contains all sorted subsequences (suffixes) of the given sequence. In this article I want to describe key points about suffix arrays and implement most important aspects in F#.
Below you can see the unit test which illustrates what suffix array is.
[<Test>] member this.CreateSuffixArrayTest() = let source = seq [ 'A'; 'F'; 'U'; 'B'; 'D'; 'B'; 'D' ] let result = SuffixArray.create source Assert.AreEqual(result.Length, 7) Assert.AreEqual(result.[0], [| 'A'; 'F'; 'U'; 'B'; 'D'; 'B'; 'D' |]) Assert.AreEqual(result.[1], [| 'B'; 'D' |]) Assert.AreEqual(result.[2], [| 'B'; 'D'; 'B'; 'D' |]) Assert.AreEqual(result.[3], [| 'D' |]) Assert.AreEqual(result.[4], [| 'D'; 'B'; 'D' |]) Assert.AreEqual(result.[5], [| 'F'; 'U'; 'B'; 'D'; 'B'; 'D' |]) Assert.AreEqual(result.[6], [| 'U'; 'B'; 'D'; 'B'; 'D' |])
It is simple data structure and implementation is quite simple too, especially in F#.
First we need to create array of subsequences (suffixes) — we can do it by just one line
let result = Array.init maxSuffix.Length (fun i -> maxSuffix |> Array.skip i)Then we need to sort subsequences — we use Array.sortWith function with custom comparer. For the point of performance it is written in imperative way.
namespace Analytics open System module SuffixArray = let create (data: seq<'a>) = let maxSuffix = data |> Seq.toArray let result = Array.init maxSuffix.Length (fun i -> maxSuffix |> Array.skip i) |> Array.sortWith (fun (x: 'a[]) (y: 'a[]) -> let minLength = Math.Min(x.Length, y.Length) let mutable index = 0 while (index < minLength - 1) && (x.[index] = y.[index]) do index <- index + 1 if index < minLength then if x.[index] > y.[index] then 1 else -1 else if x.Length > y.Length then 1 else -1 ) resultAfter suffix array has been built we can start to analyze it.
Now we will find all repeating suffixes. Here is the unit test demonstrating how it should work.
[<Test>] member this.FindRepeatingSuffixesTest() = let source = seq [ 'A'; 'F'; 'U'; 'B'; 'D'; 'B'; 'D'; 'A'; 'F'; 'U' ] let suffixArray = SuffixArray.create source let result = SuffixArray.findRepeatingSuffixes suffixArray |> Seq.toArray Assert.AreEqual(result.Length, 3) Assert.AreEqual(['A'; 'F'; 'U'], result.[0]) Assert.AreEqual(['B'; 'D'], result.[1]) Assert.AreEqual(['F'; 'U'], result.[2])To find repeating suffixes we should just consider adjacent suffixes and take their largest common prefix (of course if it presents). Here is the code.
let findRepeatingSuffixes (data: 'a[][]) = seq { for i = 1 to data.Length - 1 do let largestCommonPrefix = findLargestCommonPrefix data.[i - 1] data.[i] if largestCommonPrefix.Length > 1 then yield largestCommonPrefix }Here is the unit test showing how the findLargestCommonPrefix function should work.
[<Test>] member this.FindLargestCommonPrefixTest() = let first = [| 'A'; 'F'; 'U'; 'B'; 'D'; 'B'; 'D' |] let second = [| 'A'; 'F'; 'Z'; 'B'; 'D'; 'B'; 'D' |] let result = SuffixArray.findLargestCommonPrefix first second Assert.AreEqual(['A'; 'F'], result)And here is the implementation. At first we determine the smallest suffix. Apparently, the common suffix cannot be longer then the smallest suffix. Second – we just compare two suffixes element by element from the beginning until they have equal elements. After we turned on the end of the smallest suffix or there are no equal elements we should stop search.
let findLargestCommonPrefix (first: 'a[]) (second: 'a[]) = let minLength = Math.Min(first.Length, second.Length) let mutable index = 0 while (index < minLength) && (first.[index] = second.[index]) do index <- index + 1 first.[0 .. Math.Max(0, index - 1)]As you can see the analysis is quite simple and you can easily expand/modify it for your needs. Also because of F# type inference system all algorithms described above are generic. You can use it not only with strings (sequences of chars) but also with sequences of any other types.
Finally I would tell some notes about .NET environment configuration. As you may suspect the suffix array consumes a lot of memory. We should open project properties and UNCHECK “Prefer 32-bit” option on the build tab. Otherwise you could not create array above 4GB.
Комментариев нет:
Отправить комментарий