Ver código fonte

Update codelab.md

Hui Chen 12 anos atrás
pai
commit
b01361880d
1 arquivos alterados com 64 adições e 63 exclusões
  1. 64 63
      docs/en/codelab.md

+ 64 - 63
docs/en/codelab.md

@@ -22,57 +22,58 @@ When a document is added, the main goroutine sends the doc through a channel to
 
 **Search pipeline**
 
-Coroutine main user requests received, the request is within the phrase in the main coroutine word, and then sent through the channel to the indexer, indexers find each search key corresponding to the document and logical operations (merge for intersection) to get a streamlined document list, this list through the channel passed to sort, Sequencers the document rating (scoring), filtering and sorting, and then sorted documents by specifying the channel sent to the main coroutine main coroutine returns the results to the user .
+Main goroutine receives a user request, segments the query and passes it to an indexer goroutine. The indexer looks up corresponding documents for each search keyword and applies logic operation (list intersection) to get a set of docs that have all keywords. The list is then passed to a ranker to be scored, filtered and ranked. Finally the ranked docs are passed to the main goroutine through callback channel and returned to user .
 
-Segmentation, indexing and sorting have multiple coroutines completed, the intermediate results are stored in the channel buffer queue in order to avoid congestion. In order to improve search concurrency reduce latency, the document did Wukong engine splitting (splitting number can be specified by the user), indexing and sorting the request will be sent to all splitting on parallel processing, which results in a secondary merge main coroutine sort.
+In order to improve concurrency at search time, documents are sharded based on content and docid (number of shards can be specified by user) -- indexing and search requests are sent to all shards in parallel, and ranked docs are merged in the main goroutine before returning to user.
 
-The above is the general principle Wukong engine. Any complete search system consists of four parts, documents crawled ** ** ** Index **, ** Search ** and ** display **. Separately below explain how to achieve these sections.
+Above gives you basic idea on how Wukong works. A search system often consists of four parts, **documents crawling**, **indexing**, **search** and **rendering**. I will explain them in following sections.
 
-# # Documentation crawl
+## Doc crawling
 
 Document capture technology a lot more to be able to separate out to write an article. Fortunately microblogging grab relatively simple API provided by Sina implemented and already have [Go language SDK] (http://github.com/huichen/gobo) can concurrently fetch and quite fast.
 
 I've caught about one hundred thousand micro-blog on the testdata / weibo_data.txt years, so you do not need to do. Each line of the file stored in a micro-Bo, the following format
 
-    <Weibo id> | | | | <timestamp> | | | | <User id> | | | | <user name> | | | | <posted several> | | | | <Comments> | | | | < Like several> | | | | <small image URL> | | | | <large image URL> | | | | <body>
+    <Weibo id>||||<timestamp>||||<User id>||||<user name>||||<share counts>||||<comment counts>||||<fav counts>||||<thumbnail URL>||||<large image URL>||||<body>
 
 Microblogging saved in the following struct easy access, just loaded the data we need:
 
-`` `Go
+```Go
 type Weibo struct {
-        Id uint64
-        Timestamp uint64
-        UserName string
-        RepostsCount uint64
-        Text string
+    Id           uint64
+    Timestamp    uint64
+    UserName     string
+    RepostsCount uint64
+    Text         string
 }
-`` `
+```
 
 If you are interested in the details, see crawling crawling process [testdata / crawl_weibo_data.go] (/ testdata / crawl_weibo_data.go).
 
-# # Index
+## Indexing
 
 Use Wukong engine you need import two packages
 
-`` `Go
+```Go
 import (
-"Github.com / huichen / wukong / engine"
-"Github.com / huichen / wukong / types"
+    "Github.com/huichen/wukong/engine"
+    "Github.com/huichen/wukong/types"
 )
-`` `
+```
 The first package defines the engine function, and the second package defines the common structure. Before using the engine needs to be initialized, for example,
 
-`` `Go
+```Go
 var search engine.Engine
-searcher.Init (types.EngineInitOptions {
-SegmenterDictionaries: ".. /.. / Data / dictionary.txt",
-StopTokenFile: ".. /.. / Data / stop_tokens.txt",
-IndexerInitOptions: & types.IndexerInitOptions {
-IndexType: types.LocationsIndex,
-},
+searcher.Init(types.EngineInitOptions{
+    SegmenterDictionaries: "../../data/dictionary.txt",
+    StopTokenFile: "../../data/stop_tokens.txt",
+    IndexerInitOptions: & types.IndexerInitOptions {
+        IndexType: types.LocationsIndex,
+    },
 })
-`` `
-[Types.EngineInitOptions] (/ types / engine_init_options.go) defines the initialization engines need to set parameters, such as where the loaded word from the dictionary file, stop word lists, indexes, type, BM25 parameters, and the default rating rules (see "Search" a) and output pagination options. Please read the details of the code structure of the Notes.
+```
+
+[Types.EngineInitOptions] (/types/engine_init_options.go) defines the initialization engines need to set parameters, such as where the loaded word from the dictionary file, stop word lists, indexes, type, BM25 parameters, and the default rating rules (see "Search" a) and output pagination options. Please read the details of the code structure of the Notes.
 
 What must be emphasized is that please choose carefully IndexerInitOptions.IndexType type, there are three different types of index table:
 
@@ -84,52 +85,52 @@ These three indexes from top to bottom at the same time provide more computing p
 
 After a good initialization can add an index, the following example will add a microblogging engine
 
-`` `Go
-searcher.IndexDocument (docId, types.DocumentIndexData {
-Content: weibo.Text, / / ​​Weibo structure see above definition. Must be UTF-8 format.
-Fields: WeiboScoringFields {
-Timestamp: weibo.Timestamp,
-RepostsCount: weibo.RepostsCount,
-},
+```Go
+searcher.IndexDocument(docId, types.DocumentIndexData{
+    Content: weibo.Text, // Weibo struct is defined above. Content must be in UTF-8.
+    Fields: WeiboScoringFields {
+        Timestamp: weibo.Timestamp,
+        RepostsCount: weibo.RepostsCount,
+    },
 })
-`` `
+```
 
 DocId document must be unique, for it can be directly used microblogging microblogging ID. Wukong engine allows you to join three kinds of index data:
 
-1 body of the document (content), will be sub-word as a keyword (tokens) added to the index.
-(2) Documentation of keywords (tokens). When the body is empty, it allows the user to bypass the built-in word Wukong directly input document keywords, which makes the engine outside the document segmentation possible.
-3 Document Properties tab (labels), such as micro-blog author, category, etc. Tag does not appear in the text.
-4 custom score field (scoring fields), which allows you to add documents of any type ** ** ** ** arbitrary data structure used for sorting. "Search" will further introduce a custom score field usage.
+1. body of the document (content), will be sub-word as a keyword (tokens) added to the index.
+2. Documentation of keywords (tokens). When the body is empty, it allows the user to bypass the built-in word Wukong directly input document keywords, which makes the engine outside the document segmentation possible.
+3. Document Properties tab (labels), such as micro-blog author, category, etc. Tag does not appear in the text.
+4. custom score field (scoring fields), which allows you to add documents of any type ** ** ** ** arbitrary data structure used for sorting. "Search" will further introduce a custom score field usage.
 
 Special attention is ** **, keyword (tokens) and labels (labels) formed the indexer in the search key (keywords), documentation and code will be repeated three concepts, please do not be confused. Search for text in the search key is a logical query, such as a body of the document appears in the "bicycle" the key words there are "fitness" This category labels, but the "fitness" of the word does not directly appear in the text, When the query "bicycle" + "fitness" This search key combination, this article will be queried. Design label is intended to facilitate the dimension from the non-literal meaning quickly narrow scope of the query.
 
 Engine uses the index of non-synchronous mode, that is when IndexDocument returns the index may not yet be added to the index table, which facilitate you cycle concurrently added to the index. If you need to wait before you start adding up the index operation, please call the following function
 
-`` `Go
+```Go
 searcher.FlushIndex ()
-`` `
+```
 
-# # Search
+## Search
 
 Search process in two steps, the first step is to look in the index table containing the search key documents, which has been introduced in the last one before. The second step is an index of all documents to be sorted.
 
 Sort the core of the document score. Wukong engine allows you to customize any of the scoring rules (scoring criteria). Search the microblogging example, we define scoring rules are as follows:
 
-1 first sort by keywords close distance, for example search for "cycling", the phrase will be cut into two words, "bicycle" and "movement", there are two words next to the article should be at two key separate article in front of the word.
-(2) and then follow the microblogging Published roughly sort, and every three days as a team, later articles echelon top surface.
-3 score was finally given microblogging BM25 * (1 + forwarding number / 10000)
+1. first sort by keywords close distance, for example search for "cycling", the phrase will be cut into two words, "bicycle" and "movement", there are two words next to the article should be at two key separate article in front of the word.
+2. and then follow the microblogging Published roughly sort, and every three days as a team, later articles echelon top surface.
+3. score was finally given microblogging BM25 * (1 + forwarding number / 10000)
 
 Such rules need to save some of the score for each document data, such as microblogging Published, microblogging forwarding number and so on. The data is stored in the following structure body
-`` `Go
+```Go
 type WeiboScoringFields struct {
-        Timestamp uint64
-        RepostsCount uint64
+    Timestamp uint64
+    RepostsCount uint64
 }
-`` `
+```
 You may have noticed, this is the last one when the document is added to the index passed to the function call IndexDocument parameter type (in fact, that argument is the interface {} type, so you can pass any type of structure). The data stored in the memory sequencer waits for the call.
 
 With these data, we can score, the code is as follows:
-`` `Go
+```Go
 type WeiboScoringCriteria struct {
 }
 
@@ -149,7 +150,7 @@ func (criteria WeiboScoringCriteria) Score (
         output [2] = float32 (doc.BM25 * (1 + float32 (wsf.RepostsCount) / 10000)) / / The third step
         return output
 }
-`` `
+```
 WeiboScoringCriteria actually inherited types.ScoringCriteria interface that implements Score function. This function takes two parameters:
 
 1. Types.IndexedDocument indexer parameters passed from the data obtained, for example, word frequency, word specific location, BM25 value, close to the degrees and other information, see specific [types / index.go] (/ types / index.go) of comments.
@@ -157,7 +158,7 @@ WeiboScoringCriteria actually inherited types.ScoringCriteria interface that imp
 
 With custom scoring data and custom scoring rules, we will be able to search, and see the code below
 
-`` `Go
+```Go
 response: = searcher.Search (types.SearchRequest {
 Text: "cycling"
 RankOptions: & types.RankOptions {
@@ -166,30 +167,30 @@ OutputOffset: 0,
 MaxOutputs: 100,
 },
 })
-`` `
+```
 
 Which, Text is entered search phrase (must be UTF-8 format), will be sub-word as a keyword. And the same index, Wukong engine allows to bypass the built-in word documents directly enter keywords and labels, see types.SearchRequest structure annotation. RankOptions defines the sorting options. WeiboScoringCriteria is our scoring rules defined above. In addition, you can also OutputOffset and MaxOutputs parameters control the paging output. Search results in response variable, the specific content [types / search_response.go] (/ types / search_response.go) SearchResponse defined in the file structure, such as the structure returned keywords appear in the document location, you can used to generate the document summary.
 
-# # Display
+## Rendering
 
 The final step in completing the user search is the search results to the user. The usual practice is to make a background service search engine, and then let the front end of the JSON-RPC way to call it. Front engine itself does not belong to Goku is not so much inked.
 
-# # Summary
+## Summary
 
 Read here, you should use Wukong microblogging search engine have a basic understanding, I suggest you do it yourself to complete it. If you do not have patience, you can see the code has been completed, see [examples / codelab / search_server.go] (/ examples / codelab / search_server.go), total of less than 200 lines. Run this example is very simple, enter the examples / codelab directory input
 
-go run search_server.go
+    go run search_server.go
 
 Waiting terminal in a "indexes, xxx microblogging" after the output in the browser to open [http://localhost:8080] (http://localhost:8080) to enter the search page, which implements a simplified version
 http://soooweibo.com
 
 If you want to learn more about Wukong engine, I suggest you read the code directly. Code directory structure is as follows:
 
-    / Core core components, including the index and sorter
-    / Data dictionary files and stop word file
-    / Docs Documentation
-    / Engine engine, including the main coroutine, word coroutine, indexers, coroutines, and sorter implementation of coroutines
-    / Examples examples and performance testing procedures
-    / Testdata test data
-    / Types commonly used structure
-    / Utils commonly used functions
+    /core       core components, including the index and sorter
+    /data       dictionary files and stop word file
+    /docs       documentation
+    /engine     engine, including the main coroutine, word coroutine, indexers, coroutines, and sorter implementation of coroutines
+    /examples   examples and performance testing procedures
+    /testdata   test data
+    /types      commonly used structure
+    /utils      commonly used functions