12 vuotta sitten · 316212fe6e
--- a/docs/en/codelab.md
+++ b/docs/en/codelab.md
@@ -1,46 +1,44 @@
 
				 Wukong Engine Codelab
			
 
				 ====
			
 
				 
			
 
				-**don't read. translation in progress**
			
 
				-
			
 
				 At the end of this codelab, you will be able to write a simple full-text search website using Wukong engine.
			
 
				 
			
 
				-If you do not know Go yet, [here](http://tour.golang.org/#1) is a tutorial.
			
 
				+If you do not know how to code in Go yet, [here](http://tour.golang.org/#1) is a tutorial.
			
 
				 
			
 
				 ## Engine basics
			
 
				 
			
 
				-Engine handles user requests, segmentation, indexing and sorting in their own goroutines.
			
 
				+Wukong handles user requests, segmentation, indexing and ranking in different goroutines.
			
 
				 
			
 
				-1. main goroutines, are responsible for sending and receiving user requests
			
 
				-2. segmenter goroutines
			
 
				-3. indexer goroutines, for building inverted index and lookup
			
 
				+1. main goroutine, responsible for user interfacing
			
 
				+2. segmenter goroutines, for tokenization and Chinese word segmentation
			
 
				+3. indexer goroutines, for building inverted index and performing lookup
			
 
				 4. ranker goroutines, for scoring and ranking documents
			
 
				 
			
 
				 ![](https://raw.github.com/huichen/wukong/master/docs/en/wukong_en.png)
			
 
				 
			
 
				 **Indexing pipeline**
			
 
				 
			
 
				-When a document is added, the main goroutine sends the doc through a channel to a segmenter goroutine, which segments the text and passes it to a indexer. Indexer build doc index from search keyword. Inverted index table is stored in memory for quick lookups.
			
 
				+When a document indexing request is received from the user, the main goroutine sends the doc via a channel to a segmenter goroutine, which tokenizes/segments the text and passes it to a indexer. Indexer builds search keyword index from the doc. Custom data fields for ranking are also saved in a ranker.
			
 
				 
			
 
				 **Search pipeline**
			
 
				 
			
 
				-Main goroutine receives a user request, segments the query and passes it to an indexer goroutine. The indexer looks up corresponding documents for each search keyword and applies logic operation (list intersection) to get a set of docs that have all keywords. The list is then passed to a ranker to be scored, filtered and ranked. Finally the ranked docs are passed to the main goroutine through callback channel and returned to user .
			
 
				+Main goroutine receives a search request from the user, segments the query and passes it to an indexer goroutine. The indexer looks up a list of documents that satisfies the search condition. The list is then passed to a ranker to be scored, filtered and ranked. Finally the ranked docs are passed to the main goroutine through callback channel and returned to the user.
			
 
				 
			
 
				-In order to improve concurrency at search time, documents are sharded based on content and docid (number of shards can be specified by user) -- indexing and search requests are sent to all shards in parallel, and ranked docs are merged in the main goroutine before returning to user.
			
 
				+In order to improve concurrency at search time, documents are sharded based on content and docids (number of shards can be specified by the user) -- indexing and search requests are sent to all shards in parallel, and ranked docs are merged in the main goroutine before returning to the user.
			
 
				 
			
 
				-Above gives you basic idea on how Wukong works. A search system often consists of four parts, **document crawling**, **indexing**, **search** and **rendering**. I will explain them in following sections.
			
 
				+Above gives you the basic idea on how Wukong works. A search system often consists of four parts, **document crawling**, **indexing**, **search** and **rendering**. I will explain them in following sections.
			
 
				 
			
 
				 ## Doc crawling
			
 
				 
			
 
				-The technology of doc crawling deserve a full article. Fortunately for Weibo fetching, Sina provides simple APIs to get Weibo data and there's an easy-to-use [Go language SDK](http://github.com/huichen/gobo) which allows fetching with high concurrency.
			
 
				+The docs to be crawled are [Weibo](http://en.wikipedia.org/wiki/Sina_Weibo) posts. Doc crawling itself deserves a full article, but fortunately for Weibo crawling, there's an easy-to-use [Weibo API](http://github.com/huichen/gobo) written in Go which allows fetching data with high concurrency.
			
 
				 
			
 
				 I've downloaded about one hundred thousand weibo posts and stored them in testdata/weibo_data.txt, so you do not need to do it yourself. One line of the file has following format:
			
 
				 
			
 
				     <Weibo id>||||<timestamp>||||<User id>||||<user name>||||<share counts>||||<comment counts>||||<fav counts>||||<thumbnail URL>||||<large image URL>||||<body>
			
 
				 
			
 
				-Weibos are saved in following struct (only stored fields used by this codelab)
			
 
				+Weibos are saved in this struct (only stored fields used by the codelab)
			
 
				 
			
 
				-```Go
			
 
				+```go
			
 
				 type Weibo struct {
			
 
				     Id           uint64
			
 
				     Timestamp    uint64
			
@@ -50,44 +48,44 @@ type Weibo struct {
 
				 }
			
 
				 ```
			
 
				 
			
 
				-However, if you are interested in the crawling details, see this script [testdata/crawl_weibo_data.go](/testdata/crawl_weibo_data.go).
			
 
				+If you are interested in how Weibo crawling works, here's the script [testdata/crawl_weibo_data.go](/testdata/crawl_weibo_data.go).
			
 
				 
			
 
				 ## Indexing
			
 
				 
			
 
				 To use Wukong engine you need to import two packages
			
 
				 
			
 
				-```Go
			
 
				+```go
			
 
				 import (
			
 
				-    "Github.com/huichen/wukong/engine"
			
 
				-    "Github.com/huichen/wukong/types"
			
 
				+    "github.com/huichen/wukong/engine"
			
 
				+    "github.com/huichen/wukong/types"
			
 
				 )
			
 
				 ```
			
 
				 The first one implements the engine, and the second package defines common structs. The engine has to be initialized before use, for example,
			
 
				 
			
 
				-```Go
			
 
				+```go
			
 
				 var search engine.Engine
			
 
				 searcher.Init(types.EngineInitOptions{
			
 
				     SegmenterDictionaries: "../../data/dictionary.txt",
			
 
				     StopTokenFile: "../../data/stop_tokens.txt",
			
 
				-    IndexerInitOptions: & types.IndexerInitOptions {
			
 
				+    IndexerInitOptions: &types.IndexerInitOptions {
			
 
				         IndexType: types.LocationsIndex,
			
 
				     },
			
 
				 })
			
 
				 ```
			
 
				 
			
 
				-[Types.EngineInitOptions](/types/engine_init_options.go) defines options in engine initialization, such as where the to load dictionary file for Chinese word segmentation, stop word lists, type of inverted index table, BM25 parameters, and the default search and pagination options.
			
 
				+[Types.EngineInitOptions](/types/engine_init_options.go) defines options in engine initialization, such as where to load dictionary files for segmentation, stopwords, type of inverted index table, BM25 parameters, default search options, pagination options, etc.
			
 
				 
			
 
				-You must choose IndexerInitOptions.IndexType carefully. There are three different types of index table:
			
 
				+You must choose IndexerInitOptions.IndexType carefully. There are three types of index table:
			
 
				 
			
 
				 1. DocIdsIndex, provides the most basic index where only docids are stored.
			
 
				-2. FrequenciesIndex, in addition to DocIdsIndex, frequency of each tokens are stored to compute BM25 relevance.
			
 
				-3. LocationsIndex, which additionally stores each token's locations in a document, in order to compute token proximity.
			
 
				+2. FrequenciesIndex, in addition to DocIdsIndex, frequencies of tokens are stored so we can compute BM25 relevance.
			
 
				+3. LocationsIndex, which also stores each token's locations in a document, in order to compute token proximity.
			
 
				 
			
 
				-From top to bottom, at the same time of providing more computing power, it also consumes more memory, especially for LocationsIndex when documents are long. Please choose accordingly.
			
 
				+From top to bottom, with more computing power, it also consumes more memory, especially for LocationsIndex when documents are long. Please choose according to your need.
			
 
				 
			
 
				-After initialization you can add documents to the index now. Following example shows you how to do so,
			
 
				+Now you can add documents to the index, like following,
			
 
				 
			
 
				-```Go
			
 
				+```go
			
 
				 searcher.IndexDocument(docId, types.DocumentIndexData{
			
 
				     Content: weibo.Text, // Weibo struct is defined above. Content must be in UTF-8.
			
 
				     Fields: WeiboScoringFields {
			
@@ -97,42 +95,42 @@ searcher.IndexDocument(docId, types.DocumentIndexData{
 
				 })
			
 
				 ```
			
 
				 
			
 
				-DocId must be unique for each doc. Wukong engine allows you to use three types of index data:
			
 
				+DocId must be unique for each doc. Wukong engine allows you to add three types of index data:
			
 
				 
			
 
				-1. Body of the document (Content field), which will be tokenized and added to the index.
			
 
				-2. You can also add tokens (Tokens field)directly when the body is empty, it allows user to bypass the built-in tokenzier and use any tokenization scheme he wants.
			
 
				-3. Document labels (Labels field), such as Weibo author, category, etc. Labels can be words that don't appear in the body of the doc.
			
 
				-4. Custom scoring fields (Fields), which allows you to add any type of data for ranking purpose. The 'Search' section will talk more on this.
			
 
				+1. Content of the document (Content field), which will be tokenized and added to the index.
			
 
				+2. You can also add tokens (Tokens field) directly when the Content is empty. This allows you to bypass the built-in tokenzier and use any tokenization scheme you want.
			
 
				+3. Document labels (Labels field), such as Weibo author, category, etc. Labels can be phrases that don't appear in the content.
			
 
				+4. Custom scoring fields (Fields), which allows you to add any type of data for ranking purpose. 'Search' section below will talk more on this.
			
 
				 
			
 
				-Note that **tokens** and **labels** make up search **keywords**. Please remember the difference of the three concepts since documentation and code will mention them many times. Searching in the index table is a logical operation on docs. For example, say a document has "bicycle" the token and a "fitness" label, but "fitness" doesn't appear in text, when the query "bicycle" + "fitness" combination is searched, the article will be looked up. The purpose of having document labels is to quickly reduce number of documents from non-literal scopes.
			
 
				+Note that **tokens** and **labels** make up search **keywords**. Please remember the difference of the three concepts since documentation and code will mention them many times. For example, say a document has "bicycle" as a token and a "fitness" label, but "fitness" doesn't appear in content, when the query "bicycle" token + "fitness" label combination is searched, the article will be looked up. The purpose of having document labels is to quickly reduce number of documents using non-literal scopes.
			
 
				 
			
 
				-The Engine adds indices asynchronously, so when IndexDocument returns the doc may not yet be actually indexed. If you need to wait before futher operation, call following function
			
 
				+The Engine adds indices asynchronously, so when IndexDocument returns the doc may not yet be actually indexed. If you need to wait before further operation, call following function
			
 
				 
			
 
				-```Go
			
 
				-searcher.FlushIndex ()
			
 
				+```go
			
 
				+searcher.FlushIndex()
			
 
				 ```
			
 
				 
			
 
				 ## Search
			
 
				 
			
 
				-Search is done in two steps. The first step is to look up documents in inverted index table which contains the keywords. This part has been explained in details before. The second step is to score and to rank the documents returned from indexer.
			
 
				+Search is done in two steps. The first step is to look up documents in inverted index table which contains the keywords. This part has been explained in details before. The second step is to score and to rank the indexed documents.
			
 
				 
			
 
				 Wukong engine allows you to use any scoring criteria you want in ranking. Taking the Weibo search as an example, we define scoring rules as follow,
			
 
				 
			
 
				 1. first sort by token proximity, for example for search query "bicycle sports", the phrase will be cut into two tokens, "bicycle" and "sports". A document with the two tokens next to each other in the same order will appear above other docs.
			
 
				-2. and rank by Weibo's creation timestamp roughly (3 days per bucket).
			
 
				+2. then rank roughly by Weibo's creation timestamp (3 days per batch).
			
 
				 3. finally sort by Weibo's BM25 * (1 + <shard count> / 10000)
			
 
				 
			
 
				-To use such rule, we need to store scoring fields in ranker. Following structure is defined
			
 
				-```Go
			
 
				+To use such rule, we need to store scoring fields in the ranker. Following structure is defined
			
 
				+```go
			
 
				 type WeiboScoringFields struct {
			
 
				     Timestamp uint64
			
 
				     RepostsCount uint64
			
 
				 }
			
 
				 ```
			
 
				-You may have noticed, this is exactly the same type of argument passed to the indexer when calling IndexDocument function (in fact, the function argument has the type of interface{}, so you can pass in anything you want).
			
 
				+You may have noticed, this is exactly the same type passed to the indexer when calling IndexDocument function (in fact, the function argument has the type of interface{}, so you can pass in anything you want).
			
 
				 
			
 
				-Now you can define the scoring criteria in code,
			
 
				-```Go
			
 
				+Now define the scoring criteria,
			
 
				+```go
			
 
				 type WeiboScoringCriteria struct {
			
 
				 }
			
 
				 
			
@@ -155,23 +153,23 @@ func (criteria WeiboScoringCriteria) Score (
 
				 ```
			
 
				 WeiboScoringCriteria inherits types.ScoringCriteria interface and implements the Score function. This function takes two arguments:
			
 
				 
			
 
				-1. A types.IndexedDocument slice with information on indexed docs, for example, token frequencies, token locations, BM25 score, token proximity, etc.
			
 
				-2. The second argument has the type of interface{}. You can think this type as the void pointer in C language, which allows it to point to any data type. In our example, the data is a WeiboScoringFields struct, whose type was checked in the Score function using reflection mechanism.
			
 
				+1. A types.IndexedDocument slice with information about indexed docs, such as token frequencies, token locations, BM25 score, token proximity score, etc.
			
 
				+2. The second argument has the type of interface{}. You can think it as the void pointer in C language, which allows it to point to any data type. In our example, the data is a WeiboScoringFields struct, whose type was checked in the Score function using reflection mechanism.
			
 
				 
			
 
				-With custom scoring criteria, search is made easy and customizable. Code example for doing searching,
			
 
				+With custom scoring criteria, search is made easy and customizable. Code example for searching,
			
 
				 
			
 
				-```Go
			
 
				+```go
			
 
				 response: = searcher.Search(types.SearchRequest {
			
 
				     Text: "cycling sports",
			
 
				     RankOptions: &types.RankOptions{
			
 
				-        ScoringCriteria: & WeiboScoringCriteria {},
			
 
				+        ScoringCriteria: & WeiboScoringCriteria{},
			
 
				         OutputOffset: 0,
			
 
				         MaxOutputs: 100,
			
 
				     },
			
 
				 })
			
 
				 ```
			
 
				 
			
 
				-In which, Text is the search phrase (must be in UTF-8). And the same as in indexing, Wukong engine allows you to bypass the built-in segmenter and to enter tokens and labels directly. See types.SearchRequest struct's comment. RankOptions defines options used in scoring and ranking. WeiboScoringCriteria is our scoring criteria defined above. OutputOffset and MaxOutputs parameters are set to control output pagination. Search results are returned in a types.SearchResponse type, which contains query's segmented tokens and their locations in each doc, which can be used to generate search snippets.
			
 
				+Text is the search phrase (must be in UTF-8). As the same as in indexing, Wukong engine allows you to bypass the built-in segmenter and to enter tokens and labels directly. See types.SearchRequest struct in the code. RankOptions defines options used in scoring and ranking. WeiboScoringCriteria is our scoring criteria defined above. OutputOffset and MaxOutputs parameters are set to control output pagination. Search results are returned in a types.SearchResponse struct that contains query's segmented tokens and their locations in each doc, which can be used to generate search snippets.
			
 
				 
			
 
				 ## Rendering
			
 
				 
			
@@ -179,7 +177,7 @@ The final step to complete the search is rendering the ranked docs in front of t
 
				 
			
 
				 ## Summary
			
 
				 
			
 
				-Now you should have gained a basic understanding on how Wukong engine works. I suggest you finishing the example by yourself. However if you do not have patience, here's the completed code [examples/codelab/search_server.go](/examples/codelab/search_server.go), which has less than 200 lines. Running this example is simple. Just enter examples/codelab directory and type
			
 
				+Now you should have all the knowledge to finish the example, and I strongly encourage to do so by yourself. However if you don't have patience, here's the completed code [examples/codelab/search_server.go](/examples/codelab/search_server.go), which has less than 200 lines. Running it is simple. Just enter examples/codelab directory and type
			
 
				 
			
 
				     go run search_server.go
			
 
				 
			
@@ -187,7 +185,7 @@ Wait for a couple seconds until indexing is done and then open [http://localhost
 
				 
			
 
				 http://soooweibo.com
			
 
				 
			
 
				-If you want to learn more about Wukong engine, I suggest you read the code:
			
 
				+If you want to learn more about Wukong engine, I suggest you read the code directly,
			
 
				 
			
 
				     /core       core components, including the indexer and ranker
			
 
				     /data       tokenization dictionary and stopword file