Tuesday, May 24, 2016

Using Neo4j with F#. Preparing for web page text analysis.


In my previous blog posts I wrote about crawling and HTML text extraction. Now it is time to save the data we crawled, and to be able to retrieve it later for analysis. Let's introduce two types: Page and SymbolGroup. The Page type represents an HTML page. As for the SymbolGroup type, for simplicity you can treat it as the representation of a word. For languages of the CJK (Chinese/Japanese/Korean) group, however, that is not quite accurate. Here is the F# code for these types.
module Types

open Newtonsoft.Json

[<CLIMutable>]
type SymbolGroup = { Guid: string; Name: string; TailSeparator: string }

[<CLIMutable>]
type Page = { Guid: string; Url: string }
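To make the shape of the data concrete, here is a minimal sketch of how a short piece of text might map onto these records. The `toSymbolGroups` helper and its split-on-spaces rule are assumptions made for illustration (it is not the `TextUtils.extractWords` function used later), and so is the reading of `TailSeparator` as the text that follows a group.

```fsharp
open System

// Same record shapes as in the Types module.
[<CLIMutable>]
type SymbolGroup = { Guid: string; Name: string; TailSeparator: string }

[<CLIMutable>]
type Page = { Guid: string; Url: string }

// Hypothetical helper: splits on spaces and records the separator that
// followed each word ("" for the last word). Assumption, not the real extractor.
let toSymbolGroups (text: string) : SymbolGroup list =
    let words = text.Split([| ' ' |], StringSplitOptions.RemoveEmptyEntries)
    words
    |> Array.mapi (fun i word ->
        let id = Guid.NewGuid().ToString()
        let separator = if i < words.Length - 1 then " " else ""
        { SymbolGroup.Guid = id; Name = word; TailSeparator = separator })
    |> Array.toList

let pageId = Guid.NewGuid().ToString()
let page = { Page.Guid = pageId; Url = "http://example.com/" }
let groups = toSymbolGroups "hello graph world"
```

The field names are qualified (`SymbolGroup.Guid`, `Page.Guid`) because both records declare a `Guid` field, which keeps the record type unambiguous.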
Logically, Page and SymbolGroup entities are linked by a "Contains" relation. We will use Neo4j as the data store and model this relation there as a CONTAINS relationship. First, let's write the code that saves the data.
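module PageRepository

open Types
open Neo4jClient.Cypher

// Saves a page and all of its symbol groups with a single Cypher query.
let savePage(page: Page, symbolGroups: seq<SymbolGroup>) =
    let graphClient = ConnectionProvider.getGraphClient()
    // Create the page node from the "newPage" parameter.
    // Note: the {param} placeholder syntax is the one Neo4j accepted when this
    // was written; Neo4j 4.0 and later expect $param instead.
    let query = graphClient.Cypher
                           .Create("(page:Page {newPage})")
                           .WithParam("newPage", page)
    // For each symbol group, append a node, a CONTAINS relationship from the
    // page, and a parameter whose name carries the group's index.
    let query, _ =
        symbolGroups
        |> Seq.fold
            (fun (q: ICypherFluentQuery, i: int) symbolGroup ->
                q.Create("(symbolGroup" + i.ToString() + ":SymbolGroup {newSymbolGroup" + i.ToString() + "})")
                 .Create("(page)-[:CONTAINS]->(symbolGroup" + i.ToString() + ")")
                 .WithParam("newSymbolGroup" + i.ToString(), symbolGroup), i + 1)
            (query, 0)
    query.ExecuteWithoutResults()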
As you can see from the code, the page and its symbol groups are saved in a single Cypher query. Note that each symbol group gets its own query parameter:
q.Create("(symbolGroup" + i.ToString() + ":SymbolGroup {newSymbolGroup" + i.ToString() + "})")
This query consists of three parts:
1. Page creation:
let query = graphClient.Cypher
                        .Create("(page:Page {newPage})")
                        .WithParam("newPage", page)
2. Symbol group creation:
q.Create("(symbolGroup" + i.ToString() + ":SymbolGroup {newSymbolGroup" + i.ToString() + "})")
3. Relationship creation:
.Create("(page)-[:CONTAINS]->(symbolGroup" + i.ToString() + ")")
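To see how the three parts fit together, here is a sketch that builds only the Cypher text for a page with a given number of symbol groups. The `buildCreateQueryText` function is hypothetical and just concatenates strings in the same clause order as the fold in savePage; the text that Neo4jClient actually sends may be formatted differently.

```fsharp
// Hypothetical helper: reproduces the clause order of savePage as plain text.
// It does not talk to Neo4j and does not use Neo4jClient.
let buildCreateQueryText (symbolGroupCount: int) : string =
    // Part 1: the page node.
    let pageClause = "CREATE (page:Page {newPage})"
    // Parts 2 and 3: one node and one relationship per symbol group,
    // each named after the group's index.
    let groupClauses =
        [ for i in 0 .. symbolGroupCount - 1 do
            yield sprintf "CREATE (symbolGroup%d:SymbolGroup {newSymbolGroup%d})" i i
            yield sprintf "CREATE (page)-[:CONTAINS]->(symbolGroup%d)" i ]
    String.concat "\n" (pageClause :: groupClauses)

printfn "%s" (buildCreateQueryText 2)
```

For two symbol groups this prints five CREATE clauses: one for the page, then a node clause and a relationship clause for each group. The query text and the parameter count both grow linearly with the number of symbol groups on the page.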
To retrieve pages together with their symbol groups, I wrote the following code:
let getAllPages() =
   let graphClient = ConnectionProvider.getGraphClient()
   // Match each page with the symbol groups it CONTAINS and collect them per page.
   let result = graphClient.Cypher
                           .OptionalMatch("(page:Page)-[:CONTAINS]->(symbolGroup:SymbolGroup)")
                           .Return(fun (page: ICypherResultItem) (symbolGroup: ICypherResultItem) -> page.As<Page>(), symbolGroup.CollectAs<SymbolGroup>())
                           .Results
   result
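The query yields pairs of a page and the collection of its symbol groups. As a sketch of what later analysis over that shape could look like, the code below counts symbol group names per page. It uses hand-built sample data instead of a live database, and the `countNamesPerPage` helper and the sample values are assumptions made for illustration.

```fsharp
open System

// Same record shapes as in the Types module.
[<CLIMutable>]
type SymbolGroup = { Guid: string; Name: string; TailSeparator: string }

[<CLIMutable>]
type Page = { Guid: string; Url: string }

// Hypothetical helper: for every page, count how often each name occurs,
// most frequent first.
let countNamesPerPage (pages: seq<Page * seq<SymbolGroup>>) : (string * (string * int) list) list =
    pages
    |> Seq.map (fun (page, symbolGroups) ->
        let counts =
            symbolGroups
            |> Seq.countBy (fun symbolGroup -> symbolGroup.Name)
            |> Seq.sortByDescending snd
            |> Seq.toList
        page.Url, counts)
    |> Seq.toList

// Sample data standing in for the value returned by getAllPages().
let samplePages : seq<Page * seq<SymbolGroup>> =
    let makeGroup (name: string) =
        let id = Guid.NewGuid().ToString()
        { SymbolGroup.Guid = id; Name = name; TailSeparator = " " }
    let pageId = Guid.NewGuid().ToString()
    let page = { Page.Guid = pageId; Url = "http://example.com/" }
    Seq.singleton (page, Seq.ofList [ makeGroup "graph"; makeGroup "data"; makeGroup "graph" ])

let counts = countNamesPerPage samplePages
```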
It is important to explicitly specify the ICypherResultItem type for the lambda parameters in the
.Return(fun (page: ICypherResultItem) (symbolGroup: ICypherResultItem) -> page.As<Page>(), symbolGroup.CollectAs<SymbolGroup>())
expression. The result is a sequence of tuples, each holding a page and the collection of its symbol groups. Now that we have the persistence logic, we can store our HTML data like this:
namespace TextStore
 
open System
open Types
 
module UrlHandler = 
    
    let handleUrl(url:string, content:string) =
        let symbolGroups = TextUtils.extractWords(content) |> Seq.toList
        let guid = Guid.NewGuid()
        let page = { Guid = guid.ToString(); Url = url; }
        PageRepository.savePage(page, symbolGroups)
        ()
We can see the resulting graph in the Neo4j browser. For this example I crawled the Chinese Wikipedia to illustrate that the SymbolGroup type can represent more than just words.


