Thursday, April 28, 2016

Extracting text from HTML using F#

There are different kinds of tasks related to HTML parsing; one of them is extracting the text, without tags, from an HTML document. You may face this task when you want to save the text content of a downloaded web page or search through the text while ignoring HTML tags. In my case, I needed to extract the text to perform lexical analysis on it.
Often people use the Html Agility Pack for this. It is a good choice if you use C#. However, for F# there is the great and powerful F# Data library, which includes functionality for working with HTML. Moreover, this library has many features for handling different kinds of data, so it is very often used in F# projects. Here is how to do it using the HTML parser from the F# Data library.
Let's create a unit test first to illustrate the task.
[<Test>]
member this.ExtractTextTest() = 
    let htmlString = @"<!DOCTYPE html>
                            <html>
                                <body>
                                    <script type='text/javascript'>alert('test')</script>
                                    <h1>Header</h1>
                                    Some text
                                    <div>
                                        Just text
                                        <div>
                                            Inner text
                                            <a href='#'>Link text</a>
                                        </div>
                                        <p>Paragraph.</p>
                                    </div>
                                </body>
                            </html>
                            "
    let expectedText = @"Header Some text Just text Inner text Link text Paragraph. "
    let actualText = HtmlUtils.extractTextFromHtmlString htmlString
    Assert.AreEqual(expectedText.Trim(), actualText.Trim())

The test shows the input data and the expected result.

There are different ways to extract text from HTML. The simplest way is to use the InnerText() method of the HtmlNode class. Here is the output of that call for the body node.

Header Some text  Just text  Inner text Link textParagraph.

That may be fine for some cases, but in my case the text of different nodes has to be separated by a delimiter such as a space (" "). Note how "Link text" and "Paragraph." are glued together in the output above. To achieve this, we should traverse the HTML tree manually.
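For reference, the InnerText() call described above can be reproduced with a short snippet along these lines. It is only a sketch: the sample markup and the names `sampleHtml`, `doc` and `bodyText` are made up for illustration, and it assumes the FSharp.Data package is referenced.

```fsharp
// Assumes the FSharp.Data package is referenced
// (in an .fsx script: #r "nuget: FSharp.Data").
open FSharp.Data

// Hypothetical sample markup, much smaller than the one in the unit test.
let sampleHtml = "<html><body><h1>Header</h1><p>Paragraph.</p></body></html>"

// Parse the string and take the text of the whole <body> node at once.
let doc = HtmlDocument.Parse sampleHtml
let bodyText = doc.Body().InnerText()

printfn "%s" bodyText
```

This only shows the built-in behavior; it does not insert any delimiters between the text of adjacent nodes.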
Here is the code for extracting the text.

open FSharp.Data
open System
open System.IO
open System.Text
 
let extractText (startNode: HtmlNode) =
    // Trim the text and append one space as a delimiter between nodes.
    let nodeTextHandler (text: string) = text.Trim() + " "
    let rec getNodeText (node: HtmlNode, builder: StringBuilder) =
        // Skip <script> elements so their content never reaches the result.
        let childNodes = node.Elements() |> List.filter (fun x -> x.Name() <> "script")
        if List.isEmpty childNodes then
            // Leaf node: take its text.
            builder.Append(nodeTextHandler (node.InnerText()))
        else
            // Visit the children in order, threading the builder through.
            childNodes |> List.fold (fun acc elem -> getNodeText (elem, acc)) builder
    let result = getNodeText (startNode, new StringBuilder())
    result.ToString()

Note that we should explicitly exclude the content of script blocks; it should not be included in the result. Otherwise the parsing code is trivial: we just visit all nodes recursively and concatenate the extracted text. We should also keep in mind the peculiarity of the InnerText() method mentioned above and call it only for nodes without children.
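The unit test calls HtmlUtils.extractTextFromHtmlString, which is not listed in this post. A minimal sketch of how such a wrapper might look is given here, assuming it simply parses the string with HtmlDocument.Parse and passes the body node to extractText. The module layout is an assumption, and the FSharp.Data package has to be referenced.

```fsharp
// Assumes the FSharp.Data package is referenced
// (in an .fsx script: #r "nuget: FSharp.Data").
open System.Text
open FSharp.Data

module HtmlUtils =
    // Recursive traversal: collect the text of leaf nodes, skipping <script>.
    let extractText (startNode: HtmlNode) =
        let nodeTextHandler (text: string) = text.Trim() + " "
        let rec getNodeText (node: HtmlNode, builder: StringBuilder) =
            let childNodes = node.Elements() |> List.filter (fun x -> x.Name() <> "script")
            if List.isEmpty childNodes then
                builder.Append(nodeTextHandler (node.InnerText()))
            else
                childNodes |> List.fold (fun acc elem -> getNodeText (elem, acc)) builder
        (getNodeText (startNode, StringBuilder())).ToString()

    // Assumed shape of the wrapper used by the test:
    // parse the string, then extract the text starting from <body>.
    let extractTextFromHtmlString (html: string) =
        let doc = HtmlDocument.Parse html
        extractText (doc.Body())

// Usage on a small, made-up document.
let sampleText =
    HtmlUtils.extractTextFromHtmlString "<html><body><h1>Header</h1><p>Paragraph.</p></body></html>"
printfn "%s" sampleText
```

Note that Body() is expected to throw if the document has no body element, so real code may want to handle that case.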

We can rewrite this code to make it explicitly tail recursive. It is a little tricky because the continuations have to process both the collection of child nodes and the remaining nodes on the same level.
let extractText (startNode: HtmlNode) =
    let builder = new StringBuilder()
    // Trim the text and append one space as a delimiter between nodes.
    let nodeTextHandler (text: string) = text.Trim() + " "
    let rec getText (nodes: HtmlNode list) (cont: unit -> StringBuilder) =
        match nodes with
        | [] -> cont ()
        | node :: xs ->
            // Skip <script> elements so their content never reaches the result.
            let childNodes = node.Elements() |> List.filter (fun x -> x.Name() <> "script")
            if List.isEmpty childNodes then builder.Append(nodeTextHandler (node.InnerText())) |> ignore
            // Process the children first, then the remaining siblings,
            // and finally call the continuation of the caller.
            getText childNodes (fun () -> getText xs cont)
    (getText [startNode] (fun () -> builder)).ToString()
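To look at the continuation pattern in isolation, here is a small sketch of the same idea on a made-up tree type. Tree, collectLeaves and sample are hypothetical names used only for illustration; they are not part of F# Data.

```fsharp
open System.Text

// Hypothetical tree type used only for this illustration.
type Tree =
    | Leaf of string
    | Node of Tree list

let collectLeaves (root: Tree) =
    let builder = StringBuilder()
    // Every recursive call is in tail position; the work that is still
    // pending is captured in the continuation instead of the call stack.
    let rec visit (nodes: Tree list) (cont: unit -> unit) =
        match nodes with
        | [] -> cont ()
        | Leaf text :: rest ->
            builder.Append(text).Append(' ') |> ignore
            visit rest cont
        | Node children :: rest ->
            // Children first, then the remaining siblings, then the caller.
            visit children (fun () -> visit rest cont)
    visit [ root ] (fun () -> ())
    builder.ToString()

let sample = Node [ Leaf "a"; Node [ Leaf "b"; Leaf "c" ]; Leaf "d" ]
printfn "%s" (collectLeaves sample) // prints "a b c d "
```

The shape is the same as in getText: when a node has children, the rest of the current level is postponed inside a closure and resumed once the children are done.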

Of course, Release mode should be selected in the Configuration Manager to benefit from tail call optimization.
However, the first implementation does not fail with a StackOverflowException in Release mode either. It is optimized to tail calls too.

We can load test both implementations with the following unit test.

[<Test>]
member this.ExtractTextTest2() =
    let htmlStringBuilder = new StringBuilder("<div>test</div>")
    for i = 0 to 10000 do
        htmlStringBuilder.Insert(0, "<div>lol" + i.ToString()) |> ignore
        htmlStringBuilder.Append("</div>") |> ignore
    let htmlString = "<body>" + htmlStringBuilder.ToString() + "</body>"
    let actualText = HtmlUtils.extractTextFromHtmlString htmlString
    Assert.True(actualText.Length > 0)

Both implementations fail in Debug mode (when tail call optimization is disabled), and both work fine in Release mode. Of course, different IL is generated for Debug and Release.