At the end of this codelab, you will be able to write a simple full-text search website using Wukong engine.
If you do not know Go yet, here is a tutorial.
Engine handles user requests, segmentation, indexing and sorting in their own goroutines.
Indexing pipeline
When a document is added, the main goroutine sends the doc through a channel to a segmenter goroutine, which segments the text and passes it to a indexer. Indexer build doc index from search keyword. Inverted index table is stored in memory for quick lookups.
Search pipeline
Main goroutine receives a user request, segments the query and passes it to an indexer goroutine. The indexer looks up corresponding documents for each search keyword and applies logic operation (list intersection) to get a set of docs that have all keywords. The list is then passed to a ranker to be scored, filtered and ranked. Finally the ranked docs are passed to the main goroutine through callback channel and returned to user .
In order to improve concurrency at search time, documents are sharded based on content and docid (number of shards can be specified by user) -- indexing and search requests are sent to all shards in parallel, and ranked docs are merged in the main goroutine before returning to user.
Above gives you basic idea on how Wukong works. A search system often consists of four parts, documents crawling, indexing, search and rendering. I will explain them in following sections.
Document capture technology a lot more to be able to separate out to write an article. Fortunately microblogging grab relatively simple API provided by Sina implemented and already have Go language SDK can concurrently fetch and quite fast.
I've caught about one hundred thousand micro-blog on the testdata / weibo_data.txt years, so you do not need to do. Each line of the file stored in a micro-Bo, the following format
<Weibo id>||||<timestamp>||||<User id>||||<user name>||||<share counts>||||<comment counts>||||<fav counts>||||<thumbnail URL>||||<large image URL>||||<body>
Microblogging saved in the following struct easy access, just loaded the data we need:
type Weibo struct {
Id uint64
Timestamp uint64
UserName string
RepostsCount uint64
Text string
}
If you are interested in the details, see crawling crawling process testdata / crawl_weibo_data.go.
Use Wukong engine you need import two packages
import (
"Github.com/huichen/wukong/engine"
"Github.com/huichen/wukong/types"
)
The first package defines the engine function, and the second package defines the common structure. Before using the engine needs to be initialized, for example,
var search engine.Engine
searcher.Init(types.EngineInitOptions{
SegmenterDictionaries: "../../data/dictionary.txt",
StopTokenFile: "../../data/stop_tokens.txt",
IndexerInitOptions: & types.IndexerInitOptions {
IndexType: types.LocationsIndex,
},
})
Types.EngineInitOptions defines the initialization engines need to set parameters, such as where the loaded word from the dictionary file, stop word lists, indexes, type, BM25 parameters, and the default rating rules (see "Search" a) and output pagination options. Please read the details of the code structure of the Notes.
What must be emphasized is that please choose carefully IndexerInitOptions.IndexType type, there are three different types of index table:
These three indexes from top to bottom at the same time provide more computing power also consumes more memory, especially LocationsIndex, when the document is very long memory intensive. According to the need to balance choice.
After a good initialization can add an index, the following example will add a microblogging engine
searcher.IndexDocument(docId, types.DocumentIndexData{
Content: weibo.Text, // Weibo struct is defined above. Content must be in UTF-8.
Fields: WeiboScoringFields {
Timestamp: weibo.Timestamp,
RepostsCount: weibo.RepostsCount,
},
})
DocId document must be unique, for it can be directly used microblogging microblogging ID. Wukong engine allows you to join three kinds of index data:
Special attention is ** **, keyword (tokens) and labels (labels) formed the indexer in the search key (keywords), documentation and code will be repeated three concepts, please do not be confused. Search for text in the search key is a logical query, such as a body of the document appears in the "bicycle" the key words there are "fitness" This category labels, but the "fitness" of the word does not directly appear in the text, When the query "bicycle" + "fitness" This search key combination, this article will be queried. Design label is intended to facilitate the dimension from the non-literal meaning quickly narrow scope of the query.
Engine uses the index of non-synchronous mode, that is when IndexDocument returns the index may not yet be added to the index table, which facilitate you cycle concurrently added to the index. If you need to wait before you start adding up the index operation, please call the following function
searcher.FlushIndex ()
Search process in two steps, the first step is to look in the index table containing the search key documents, which has been introduced in the last one before. The second step is an index of all documents to be sorted.
Sort the core of the document score. Wukong engine allows you to customize any of the scoring rules (scoring criteria). Search the microblogging example, we define scoring rules are as follows:
Such rules need to save some of the score for each document data, such as microblogging Published, microblogging forwarding number and so on. The data is stored in the following structure body
type WeiboScoringFields struct {
Timestamp uint64
RepostsCount uint64
}
You may have noticed, this is the last one when the document is added to the index passed to the function call IndexDocument parameter type (in fact, that argument is the interface {} type, so you can pass any type of structure). The data stored in the memory sequencer waits for the call.
With these data, we can score, the code is as follows:
type WeiboScoringCriteria struct {
}
func (criteria WeiboScoringCriteria) Score (
doc types.IndexedDocument, fields interface {}) [] float32 {
if reflect.TypeOf (fields)! = reflect.TypeOf (WeiboScoringFields {}) {
return [] float32 {}
}
wsf: = fields. (WeiboScoringFields)
output: = make ([] float32, 3)
if doc.TokenProximity> MaxTokenProximity {/ / Step
output [0] = 1.0 / float32 (doc.TokenProximity)
} Else {
output [0] = 1.0
}
output [1] = float32 (wsf.Timestamp / (SecondsInADay * 3)) / / Step
output [2] = float32 (doc.BM25 * (1 + float32 (wsf.RepostsCount) / 10000)) / / The third step
return output
}
WeiboScoringCriteria actually inherited types.ScoringCriteria interface that implements Score function. This function takes two parameters:
With custom scoring data and custom scoring rules, we will be able to search, and see the code below
response: = searcher.Search (types.SearchRequest {
Text: "cycling"
RankOptions: & types.RankOptions {
ScoringCriteria: & WeiboScoringCriteria {},
OutputOffset: 0,
MaxOutputs: 100,
},
})
Which, Text is entered search phrase (must be UTF-8 format), will be sub-word as a keyword. And the same index, Wukong engine allows to bypass the built-in word documents directly enter keywords and labels, see types.SearchRequest structure annotation. RankOptions defines the sorting options. WeiboScoringCriteria is our scoring rules defined above. In addition, you can also OutputOffset and MaxOutputs parameters control the paging output. Search results in response variable, the specific content types / search_response.go SearchResponse defined in the file structure, such as the structure returned keywords appear in the document location, you can used to generate the document summary.
The final step in completing the user search is the search results to the user. The usual practice is to make a background service search engine, and then let the front end of the JSON-RPC way to call it. Front engine itself does not belong to Goku is not so much inked.
Read here, you should use Wukong microblogging search engine have a basic understanding, I suggest you do it yourself to complete it. If you do not have patience, you can see the code has been completed, see examples / codelab / search_server.go, total of less than 200 lines. Run this example is very simple, enter the examples / codelab directory input
go run search_server.go
Waiting terminal in a "indexes, xxx microblogging" after the output in the browser to open http://localhost:8080 to enter the search page, which implements a simplified version http://soooweibo.com
If you want to learn more about Wukong engine, I suggest you read the code directly. Code directory structure is as follows:
/core core components, including the index and sorter
/data dictionary files and stop word file
/docs documentation
/engine engine, including the main coroutine, word coroutine, indexers, coroutines, and sorter implementation of coroutines
/examples examples and performance testing procedures
/testdata test data
/types commonly used structure
/utils commonly used functions