"結巴"中文分詞:做最好的 PHP 中文分詞、中文斷詞組件,原始版本翻譯自 fxsjy/jieba,目前已經為一個獨立分支,請有興趣的開發者一起加入開發!若想使用 Python 版本請前往 fxsjy/jieba
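Traditional Chinese is now supported! Just switch the dictionary to big mode.
LLMs currently give better Chinese word segmentation results, but when you need something fast and cheap, this package still has its uses.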
"Jieba" (Chinese for "to stutter") Chinese text segmentation: built to be the best PHP Chinese word segmentation module.
Scroll down for English documentation.
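- Supports three segmentation modes:
- 1) Default accurate mode: tries to cut the sentence as precisely as possible; suited to text analysis.
- 2) Full mode: scans out every possible word in the sentence, but cannot resolve ambiguity (needs a sufficiently large dictionary).
- 3) Search engine mode: on top of accurate mode, splits long words again to improve recall; suited to search engine segmentation.
- Supports Traditional Chinese segmentation
- Supports custom dictionaries
- Supports multi-language CJK text processing (Chinese, Japanese, Korean)
- Supports TF-IDF integration and part-of-speech (POS) tagging
- Supports memory management and cache optimization
- Supports custom POS tags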
- Automatic installation: install with composer, then load the library through autoload
Code example
composer require fukuball/jieba-php
Code example
require_once "/path/to/your/vendor/autoload.php";
- Manual installation: place jieba-php in a suitable directory, then load it with require_once
Code example
require_once "/path/to/your/vendor/multi-array/MultiArray.php";
require_once "/path/to/your/vendor/multi-array/Factory/MultiArrayFactory.php";
require_once "/path/to/your/class/Jieba.php";
require_once "/path/to/your/class/Finalseg.php";
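- Uses a Trie-based structure for efficient word-graph scanning, building a directed acyclic graph (DAG) of every possible word formed by the Chinese characters in a sentence
- Uses dynamic programming to find the maximum-probability path, which gives the best segmentation based on word frequency
- For unknown words, uses an HMM based on the word-forming capability of Chinese characters, decoded with the Viterbi algorithm
- Explanation of BEMS: fxsjy/jieba#7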
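The DAG and dynamic-programming steps above can be sketched in a few lines of PHP. This is an illustrative toy, not the jieba-php implementation: the dictionary `$freq`, its frequencies, and the helper names `chars`, `buildDag`, `bestPath`, and `segment` are all made up for the example, and the HMM fallback for unknown words is left out.

```php
<?php
// Toy dictionary: word => frequency. All values are invented for illustration.
$freq = [
    '北' => 20, '京' => 20, '大' => 50, '学' => 50,
    '北京' => 300, '大学' => 200, '北京大学' => 100,
];
$total = array_sum($freq);

// Split a UTF-8 string into an array of characters.
function chars(string $s): array
{
    return preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY);
}

// Build the DAG: for each start index, list every end index such that
// chars[start..end] is a dictionary word. Single characters are always allowed.
function buildDag(array $chars, array $freq): array
{
    $n = count($chars);
    $dag = [];
    for ($i = 0; $i < $n; $i++) {
        $dag[$i] = [$i];
        $word = $chars[$i];
        for ($j = $i + 1; $j < $n; $j++) {
            $word .= $chars[$j];
            if (isset($freq[$word])) {
                $dag[$i][] = $j;
            }
        }
    }
    return $dag;
}

// Dynamic programming from right to left: $route[$i] holds the best
// log-probability for chars[$i..] and the end index of the first word on that path.
function bestPath(array $chars, array $dag, array $freq, int $total): array
{
    $n = count($chars);
    $route = [$n => [0.0, 0]];
    for ($i = $n - 1; $i >= 0; $i--) {
        $best = null;
        foreach ($dag[$i] as $end) {
            $word = implode('', array_slice($chars, $i, $end - $i + 1));
            $score = log(($freq[$word] ?? 1) / $total) + $route[$end + 1][0];
            if ($best === null || $score > $best[0]) {
                $best = [$score, $end];
            }
        }
        $route[$i] = $best;
    }
    return $route;
}

// Walk the best route from the left and collect the words.
function segment(string $sentence, array $freq, int $total): array
{
    $chars = chars($sentence);
    $route = bestPath($chars, buildDag($chars, $freq), $freq, $total);
    $words = [];
    for ($i = 0; $i < count($chars); $i = $route[$i][1] + 1) {
        $words[] = implode('', array_slice($chars, $i, $route[$i][1] - $i + 1));
    }
    return $words;
}

$words = segment('北京大学', $freq, $total);
echo implode(' / ', $words), "\n";
```

With these toy frequencies the single word 北京大学 scores higher than 北京 followed by 大学, so the path through the DAG that covers the whole string wins.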
- The cut method accepts two parameters: 1) the string to segment; 2) cut_all, which controls the segmentation mode
- The cutForSearch method accepts one parameter, the string to segment. It is suited to building inverted indexes for search engines and segments at a finer granularity
- Note: the string to segment must be a UTF-8 string
- cut and cutForSearch both return an iterable array
Code example (Tutorial)
ini_set('memory_limit', '1024M');
require_once "/path/to/your/vendor/multi-array/MultiArray.php";
require_once "/path/to/your/vendor/multi-array/Factory/MultiArrayFactory.php";
require_once "/path/to/your/class/Jieba.php";
require_once "/path/to/your/class/Finalseg.php";
use Fukuball\Jieba\Jieba;
use Fukuball\Jieba\Finalseg;
Jieba::init();
Finalseg::init();
$seg_list = Jieba::cut("怜香惜玉也得要看对象啊!");
var_dump($seg_list);
$seg_list = Jieba::cut("我来到北京清华大学", true);
var_dump($seg_list); # Full mode
$seg_list = Jieba::cut("我来到北京清华大学", false);
var_dump($seg_list); # Default accurate mode
$seg_list = Jieba::cut("他来到了网易杭研大厦");
var_dump($seg_list);
$seg_list = Jieba::cutForSearch("小明硕士毕业于中国科学院计算所,后在日本京都大学深造"); # Search engine mode
var_dump($seg_list);
Output:
array(7) {
[0]=>
string(12) "怜香惜玉"
[1]=>
string(3) "也"
[2]=>
string(3) "得"
[3]=>
string(3) "要"
[4]=>
string(3) "看"
[5]=>
string(6) "对象"
[6]=>
string(3) "啊"
}
Full Mode:
array(15) {
[0]=>
string(3) "我"
[1]=>
string(3) "来"
[2]=>
string(6) "来到"
[3]=>
string(3) "到"
[4]=>
string(3) "北"
[5]=>
string(6) "北京"
[6]=>
string(3) "京"
[7]=>
string(3) "清"
[8]=>
string(6) "清华"
[9]=>
string(12) "清华大学"
[10]=>
string(3) "华"
[11]=>
string(6) "华大"
[12]=>
string(3) "大"
[13]=>
string(6) "大学"
[14]=>
string(3) "学"
}
Default Mode:
array(4) {
[0]=>
string(3) "我"
[1]=>
string(6) "来到"
[2]=>
string(6) "北京"
[3]=>
string(12) "清华大学"
}
array(6) {
[0]=>
string(3) "他"
[1]=>
string(6) "来到"
[2]=>
string(3) "了"
[3]=>
string(6) "网易"
[4]=>
string(6) "杭研"
[5]=>
string(6) "大厦"
}
(Here, "杭研" is not in the dictionary, but the Viterbi algorithm still recognized it.)
Search Engine Mode:
array(18) {
[0]=>
string(6) "小明"
[1]=>
string(6) "硕士"
[2]=>
string(6) "毕业"
[3]=>
string(3) "于"
[4]=>
string(6) "中国"
[5]=>
string(6) "科学"
[6]=>
string(6) "学院"
[7]=>
string(9) "科学院"
[8]=>
string(15) "中国科学院"
[9]=>
string(6) "计算"
[10]=>
string(9) "计算所"
[11]=>
string(3) "后"
[12]=>
string(3) "在"
[13]=>
string(6) "日本"
[14]=>
string(6) "京都"
[15]=>
string(6) "大学"
[16]=>
string(18) "日本京都大学"
[17]=>
string(6) "深造"
}
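The search engine mode output above shows the long word 中国科学院 preceded by its shorter dictionary words. One way to get that behaviour, sketched here as an assumption rather than as the library's actual code, is to emit every 2-character and 3-character substring that is itself a dictionary word before emitting the long word. The mini dictionary `$dict` and the helper names `utf8Chars` and `resplitForSearch` are hypothetical.

```php
<?php
// Hypothetical mini dictionary; a real run would consult the loaded jieba dictionary.
$dict = array_flip(['中国', '科学', '学院', '科学院', '中国科学院']);

// Split a UTF-8 string into characters without requiring the mbstring extension.
function utf8Chars(string $s): array
{
    return preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY);
}

// Emit every dictionary 2-gram, then every dictionary 3-gram, then the word itself.
function resplitForSearch(string $word, array $dict): array
{
    $chars = utf8Chars($word);
    $n = count($chars);
    $out = [];
    foreach ([2, 3] as $size) {
        if ($n <= $size) {
            continue; // the word is too short to contain a shorter n-gram
        }
        for ($i = 0; $i + $size <= $n; $i++) {
            $gram = implode('', array_slice($chars, $i, $size));
            if (isset($dict[$gram])) {
                $out[] = $gram;
            }
        }
    }
    $out[] = $word;
    return $out;
}

$result = resplitForSearch('中国科学院', $dict);
echo implode(' / ', $result), "\n";
```

For 中国科学院 this yields 中国, 科学, 学院, 科学院, and then 中国科学院, the same order as entries 4 to 8 of the output above.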
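- Developers can supply their own custom dictionary to cover words that are missing from the jieba dictionary. jieba can recognize new words on its own, but adding them yourself ensures higher accuracy
- Usage: Jieba::loadUserDict(file_name) # file_name is the absolute path of the custom dictionary
- The dictionary format is the same as dict.txt: one word per line, and each line has three parts separated by spaces: the word, its frequency, and its part-of-speech tag
- Example:
云计算 5 n
李小福 2 n
创新办 3 n
Before: 李小福 / 是 / 创新 / 办 / 主任 / 也 / 是 / 云 / 计算 / 方面 / 的 / 专家 /
After loading the custom dictionary: 李小福 / 是 / 创新办 / 主任 / 也 / 是 / 云计算 / 方面 / 的 / 专家 /
Note: "Using a user-defined dictionary to strengthen ambiguity correction" --- fxsjy/jieba#14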
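To make the dictionary line format concrete, here is a minimal parsing sketch. It does not reproduce how Jieba::loadUserDict reads the file; the function name `parseDictLine`, the returned array keys, and the choice to treat the tag as optional are assumptions made for illustration.

```php
<?php
// Parse one dictionary line of the form "word frequency tag".
// Treating the tag as optional is an assumption of this sketch.
function parseDictLine(string $line): ?array
{
    $parts = preg_split('/\s+/u', trim($line), -1, PREG_SPLIT_NO_EMPTY);
    if (count($parts) < 2 || !preg_match('/^\d+$/', $parts[1])) {
        return null; // skip blank or malformed lines
    }
    return [
        'word' => $parts[0],
        'freq' => (int) $parts[1],
        'tag'  => $parts[2] ?? null,
    ];
}

$lines = ["云计算 5 n", "李小福 2 n", "创新办 3 n", ""];
$entries = array_values(array_filter(array_map('parseDictLine', $lines)));

foreach ($entries as $entry) {
    printf("%s\t%d\t%s\n", $entry['word'], $entry['freq'], $entry['tag'] ?? '-');
}
```

Validating lines like this before calling loadUserDict can catch stray tabs or missing frequencies early, which are easy mistakes to make in a hand-edited dictionary file.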
- JiebaAnalyse::extractTags($content, $top_k)
- content: the text to extract keywords from
- top_k: how many keywords with the highest TF-IDF weights to return; the default is 20
- Use setStopWords to add custom stop words
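The ranking idea behind extractTags can be sketched as plain TF-IDF over an already segmented word list. This is a simplified illustration and not the JiebaAnalyse code: the function `topTags`, the IDF values, and the fallback `$defaultIdf` for words without an IDF entry are hypothetical.

```php
<?php
// Rank words by TF-IDF. $idf maps word => inverse document frequency;
// words missing from $idf fall back to $defaultIdf (an assumption of this sketch).
function topTags(array $words, array $idf, float $defaultIdf, int $topK = 20, array $stopWords = []): array
{
    $stop = array_flip($stopWords);
    $counts = [];
    foreach ($words as $word) {
        if (isset($stop[$word])) {
            continue; // stop words are dropped before counting
        }
        $counts[$word] = ($counts[$word] ?? 0) + 1;
    }
    $total = array_sum($counts);
    $scores = [];
    foreach ($counts as $word => $count) {
        $tf = $count / $total; // term frequency among the kept words
        $scores[$word] = $tf * ($idf[$word] ?? $defaultIdf);
    }
    arsort($scores); // highest score first, keys preserved
    return array_slice($scores, 0, $topK, true);
}

// Invented segmentation result and IDF table.
$words = ['所謂', '一般', '所謂', '的', '的', '的'];
$idf = ['所謂' => 6.0, '一般' => 4.0, '的' => 0.5];

$plain = topTags($words, $idf, 5.0, 2);
$filtered = topTags($words, $idf, 5.0, 2, ['的']);
var_dump($plain, $filtered);
```

In this sketch, removing a stop word shrinks the total word count, so the remaining words get a higher term frequency and therefore a higher score.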
Code example (keyword extraction)
ini_set('memory_limit', '600M');
require_once "/path/to/your/vendor/multi-array/MultiArray.php";
require_once "/path/to/your/vendor/multi-array/Factory/MultiArrayFactory.php";
require_once "/path/to/your/class/Jieba.php";
require_once "/path/to/your/class/Finalseg.php";
require_once "/path/to/your/class/JiebaAnalyse.php";
use Fukuball\Jieba\Jieba;
use Fukuball\Jieba\Finalseg;
use Fukuball\Jieba\JiebaAnalyse;
Jieba::init(array('mode'=>'test','dict'=>'small'));
Finalseg::init();
JiebaAnalyse::init();
$top_k = 10;
$content = file_get_contents("/path/to/your/dict/lyric.txt");
$tags = JiebaAnalyse::extractTags($content, $top_k);
var_dump($tags);
JiebaAnalyse::setStopWords('/path/to/your/dict/stop_words.txt');
$tags = JiebaAnalyse::extractTags($content, $top_k);
var_dump($tags);
Output:
array(10) {
'沒有' =>
double(1.0592831964595)
'所謂' =>
double(0.90795702553671)
'是否' =>
double(0.66385043195443)
'一般' =>
double(0.54607060161899)
'雖然' =>
double(0.30265234184557)
'來說' =>
double(0.30265234184557)
'肌迫' =>
double(0.30265234184557)
'退縮' =>
double(0.30265234184557)
'矯作' =>
double(0.30265234184557)
'怯懦' =>
double(0.24364586159392)
}
array(10) {
'所謂' =>
double(1.1569129841516)
'一般' =>
double(0.69579963754677)
'矯作' =>
double(0.38563766138387)
'來說' =>
double(0.38563766138387)
'退縮' =>
double(0.38563766138387)
'雖然' =>
double(0.38563766138387)
'肌迫' =>
double(0.38563766138387)
'怯懦' =>
double(0.31045198493419)
'隨便說說' =>
double(0.19281883069194)
'一場' =>
double(0.19281883069194)
}
Code example (Tutorial)
ini_set('memory_limit', '600M');
require_once dirname(dirname(__FILE__))."/vendor/multi-array/MultiArray.php";
require_once dirname(dirname(__FILE__))."/vendor/multi-array/Factory/MultiArrayFactory.php";
require_once dirname(dirname(__FILE__))."/class/Jieba.php";
require_once dirname(dirname(__FILE__))."/class/Finalseg.php";
require_once dirname(dirname(__FILE__))."/class/Posseg.php";
use Fukuball\Jieba\Jieba;
use Fukuball\Jieba\Finalseg;
use Fukuball\Jieba\Posseg;
Jieba::init();
Finalseg::init();
Posseg::init();
$seg_list = Posseg::cut("这是一个伸手不见五指的黑夜。我叫孙悟空,我爱北京,我爱Python和C++。");
var_dump($seg_list);
// Posseg::cut() with TF-IDF scores
$scored_result = Posseg::cut("我愛吃蘋果", array('with_scores' => true));
foreach ($scored_result as $item) {
echo sprintf("%-10s [%s] TF: %.4f, TF-IDF: %.4f\n",
$item['word'], $item['tag'], $item['tf'], $item['tfidf']);
}
// Jieba::cut() with POS tagging
$pos_result = Jieba::cut("我愛吃蘋果", false, array('with_pos' => true));
foreach ($pos_result as $item) {
echo sprintf("%-10s [%s]\n", $item['word'], $item['tag']);
}
// Jieba::cut() with POS tagging and TF-IDF scores
$full_result = Jieba::cut("我愛吃蘋果", false, array(
'with_pos' => true,
'with_scores' => true
));
foreach ($full_result as $item) {
echo sprintf("%-10s [%s] TF: %.4f, TF-IDF: %.4f\n",
$item['word'], $item['tag'], $item['tf'], $item['tfidf']);
}
Output:
array(21) {
[0]=>
array(2) {
["word"]=>
string(3) "这"
["tag"]=>
string(1) "r"
}
[1]=>
array(2) {
["word"]=>
string(3) "是"
["tag"]=>
string(1) "v"
}
[2]=>
array(2) {
["word"]=>
string(6) "一个"
["tag"]=>
string(1) "m"
}
[3]=>
array(2) {
["word"]=>
string(18) "伸手不见五指"
["tag"]=>
string(1) "i"
}
[4]=>
array(2) {
["word"]=>
string(3) "的"
["tag"]=>
string(2) "uj"
}
[5]=>
array(2) {
["word"]=>
string(6) "黑夜"
["tag"]=>
string(1) "n"
}
[6]=>
array(2) {
["word"]=>
string(3) "。"
["tag"]=>
string(1) "x"
}
[7]=>
array(2) {
["word"]=>
string(3) "我"
["tag"]=>
string(1) "r"
}
[8]=>
array(2) {
["word"]=>
string(3) "叫"
["tag"]=>
string(1) "v"
}
[9]=>
array(2) {
["word"]=>
string(9) "孙悟空"
["tag"]=>
string(2) "nr"
}
[10]=>
array(2) {
["word"]=>
string(3) ","
["tag"]=>
string(1) "x"
}
[11]=>
array(2) {
["word"]=>
string(3) "我"
["tag"]=>
string(1) "r"
}
[12]=>
array(2) {
["word"]=>
string(3) "爱"
["tag"]=>
string(1) "v"
}
[13]=>
array(2) {
["word"]=>
string(6) "北京"
["tag"]=>
string(2) "ns"
}
[14]=>
array(2) {
["word"]=>
string(3) ","
["tag"]=>
string(1) "x"
}
[15]=>
array(2) {
["word"]=>
string(3) "我"
["tag"]=>
string(1) "r"
}
[16]=>
array(2) {
["word"]=>
string(3) "爱"
["tag"]=>
string(1) "v"
}
[17]=>
array(2) {
["word"]=>
string(6) "Python"
["tag"]=>
string(3) "eng"
}
[18]=>
array(2) {
["word"]=>
string(3) "和"
["tag"]=>
string(1) "c"
}
[19]=>
array(2) {
["word"]=>
string(3) "C++"
["tag"]=>
string(3) "eng"
}
[20]=>
array(2) {
["word"]=>
string(3) "。"
["tag"]=>
string(1) "x"
}
}
Code example (Tutorial)
ini_set('memory_limit', '1024M');
require_once dirname(dirname(__FILE__))."/vendor/multi-array/MultiArray.php";
require_once dirname(dirname(__FILE__))."/vendor/multi-array/Factory/MultiArrayFactory.php";
require_once dirname(dirname(__FILE__))."/class/Jieba.php";
require_once dirname(dirname(__FILE__))."/class/Finalseg.php";
use Fukuball\Jieba\Jieba;
use Fukuball\Jieba\Finalseg;
Jieba::init(array('mode'=>'default','dict'=>'big'));
Finalseg::init();
$seg_list = Jieba::cut("怜香惜玉也得要看对象啊!");
var_dump($seg_list);
$seg_list = Jieba::cut("憐香惜玉也得要看對象啊!");
var_dump($seg_list);
Output:
array(7) {
[0]=>
string(12) "怜香惜玉"
[1]=>
string(3) "也"
[2]=>
string(3) "得"
[3]=>
string(3) "要"
[4]=>
string(3) "看"
[5]=>
string(6) "对象"
[6]=>
string(3) "啊"
}
array(7) {
[0]=>
string(12) "憐香惜玉"
[1]=>
string(3) "也"
[2]=>
string(3) "得"
[3]=>
string(3) "要"
[4]=>
string(3) "看"
[5]=>
string(6) "對象"
[6]=>
string(3) "啊"
}
jieba-php now offers better multi-language CJK (Chinese, Japanese, Korean) text processing, including mixed-language text.
Code example (Tutorial)
ini_set('memory_limit', '1024M');
require_once dirname(dirname(__FILE__))."/vendor/multi-array/MultiArray.php";
require_once dirname(dirname(__FILE__))."/vendor/multi-array/Factory/MultiArrayFactory.php";
require_once dirname(dirname(__FILE__))."/class/Jieba.php";
require_once dirname(dirname(__FILE__))."/class/Finalseg.php";
use Fukuball\Jieba\Jieba;
use Fukuball\Jieba\Finalseg;
// Initialize with support for all CJK languages
Jieba::init(array('cjk'=>'all'));
Finalseg::init();
// Korean text
$seg_list = Jieba::cut("한국어 또는 조선말은 제주특별자치도를 제외한 한반도 및 그 부속 도서와 한민족 거주 지역에서 쓰이는 언어로");
var_dump($seg_list);
// Japanese text
$seg_list = Jieba::cut("日本語は、主に日本国内や日本人同士の間で使われている言語である。");
var_dump($seg_list);
// Mixed-language text
$mixed_text = "我喜欢这个世界。私は日本に住んでいます。안녕하세요 세계입니다.";
$seg_list = Jieba::cut($mixed_text);
var_dump($seg_list);
// More complex mixed text
$complex_mixed = "今天weather很好,私たちは공원에 갔습니다。";
$seg_list = Jieba::cut($complex_mixed);
var_dump($seg_list);
// Loading a Japanese dictionary enables simple Japanese segmentation
Jieba::loadUserDict("/path/to/your/japanese/dict.txt");
$seg_list = Jieba::cut("日本語は、主に日本国内や日本人同士の間で使われている言語である。");
var_dump($seg_list);
- Basic multi-language processing:
php src/cmd/demo_mixed_cjk.php
- TF-IDF and POS tagging integration:
php src/cmd/demo_tf_idf_pos.php
Output:
array(15) {
[0]=>
string(9) "한국어"
[1]=>
string(6) "또는"
[2]=>
string(12) "조선말은"
[3]=>
string(24) "제주특별자치도를"
[4]=>
string(9) "제외한"
[5]=>
string(9) "한반도"
[6]=>
string(3) "및"
[7]=>
string(3) "그"
[8]=>
string(6) "부속"
[9]=>
string(9) "도서와"
[10]=>
string(9) "한민족"
[11]=>
string(6) "거주"
[12]=>
string(12) "지역에서"
[13]=>
string(9) "쓰이는"
[14]=>
string(9) "언어로"
}
array(21) {
[0]=>
string(6) "日本"
[1]=>
string(3) "語"
[2]=>
string(3) "は"
[3]=>
string(3) "主"
[4]=>
string(3) "に"
[5]=>
string(6) "日本"
[6]=>
string(6) "国内"
[7]=>
string(3) "や"
[8]=>
string(6) "日本"
[9]=>
string(3) "人"
[10]=>
string(6) "同士"
[11]=>
string(3) "の"
[12]=>
string(3) "間"
[13]=>
string(3) "で"
[14]=>
string(3) "使"
[15]=>
string(3) "わ"
[16]=>
string(6) "れて"
[17]=>
string(6) "いる"
[18]=>
string(6) "言語"
[19]=>
string(3) "で"
[20]=>
string(6) "ある"
}
array(17) {
[0]=>
string(9) "日本語"
[1]=>
string(3) "は"
[2]=>
string(6) "主に"
[3]=>
string(9) "日本国"
[4]=>
string(3) "内"
[5]=>
string(3) "や"
[6]=>
string(9) "日本人"
[7]=>
string(6) "同士"
[8]=>
string(3) "の"
[9]=>
string(3) "間"
[10]=>
string(3) "で"
[11]=>
string(3) "使"
[12]=>
string(3) "わ"
[13]=>
string(6) "れて"
[14]=>
string(6) "いる"
[15]=>
string(6) "言語"
[16]=>
string(9) "である"
}
Code example (Tutorial)
ini_set('memory_limit', '1024M');
require_once dirname(dirname(__FILE__))."/vendor/multi-array/MultiArray.php";
require_once dirname(dirname(__FILE__))."/vendor/multi-array/Factory/MultiArrayFactory.php";
require_once dirname(dirname(__FILE__))."/class/Jieba.php";
require_once dirname(dirname(__FILE__))."/class/Finalseg.php";
use Fukuball\Jieba\Jieba;
use Fukuball\Jieba\Finalseg;
Jieba::init(array('mode'=>'test','dict'=>'big'));
Finalseg::init();
$seg_list = Jieba::tokenize("永和服装饰品有限公司");
var_dump($seg_list);
Output:
array(4) {
[0] =>
array(3) {
'word' =>
string(6) "永和"
'start' =>
int(0)
'end' =>
int(2)
}
[1] =>
array(3) {
'word' =>
string(6) "服装"
'start' =>
int(2)
'end' =>
int(4)
}
[2] =>
array(3) {
'word' =>
string(6) "饰品"
'start' =>
int(4)
'end' =>
int(6)
}
[3] =>
array(3) {
'word' =>
string(12) "有限公司"
'start' =>
int(6)
'end' =>
int(10)
}
}
- Dictionary with a smaller memory footprint: https://github.com/fukuball/jieba-php/blob/master/src/dict/dict.small.txt
- Dictionary that supports Traditional Chinese segmentation: https://github.com/fukuball/jieba-php/blob/master/src/dict/dict.big.txt
- How was the model data generated? fxsjy/jieba#7
- What is this library's license? fxsjy/jieba#2
- Demo Site Repo: https://github.com/fukuball/jieba-php.fukuball.com
- Supports three segmentation modes:
- Accurate Mode: attempts to cut the sentence into the most accurate segmentation; suitable for text analysis.
- Full Mode: scans out every possible word in the sentence, but cannot resolve ambiguity (needs a sufficiently large dictionary).
- Search Engine Mode: based on Accurate Mode, cuts long words into several shorter words to improve recall.
- Supports Traditional Chinese word segmentation
- Supports custom dictionaries
- Supports multi-language CJK text processing (Chinese, Japanese, Korean)
- Supports TF-IDF integration and POS tagging
- Supports memory management and cache optimization
- Supports custom POS tags
- Installation: Use composer to install jieba-php, then require the autoload file to use jieba-php.
- Uses a Trie-based structure for efficient word-graph scanning, building a directed acyclic graph (DAG) of every possible word formed by the Chinese characters in a sentence.
- Uses dynamic programming to find the maximum-probability path, which gives the best segmentation based on word frequency.
- For unknown words, uses an HMM based on the word-forming capability of Chinese characters, decoded with the Viterbi algorithm.
- The meaning of BEMS fxsjy/jieba#7.
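The HMM step labels each character as B (begin), M (middle), E (end), or S (single-character word) and picks the most likely label sequence with the Viterbi algorithm. The sketch below shows that decoding on a two-character input. Every probability in it is invented for illustration (a real model is trained on a corpus), and the names `viterbi`, `tagsToWords`, `MIN_LOG`, and the default emission value are assumptions, not jieba-php internals.

```php
<?php
const MIN_LOG = -1.0e9; // stand-in for log(0)

// All numbers below are made up; a word can only start with B or S.
$start = ['B' => log(0.6), 'M' => MIN_LOG, 'E' => MIN_LOG, 'S' => log(0.4)];
// Allowed transitions only: B->M/E, M->M/E, E->B/S, S->B/S.
$trans = [
    'B' => ['M' => log(0.3), 'E' => log(0.7)],
    'M' => ['M' => log(0.4), 'E' => log(0.6)],
    'E' => ['B' => log(0.5), 'S' => log(0.5)],
    'S' => ['B' => log(0.5), 'S' => log(0.5)],
];
// Emission log-probabilities for a few (state, character) pairs.
$emit = [
    'B' => ['杭' => log(0.5)],
    'E' => ['研' => log(0.5)],
    'S' => ['杭' => log(0.1), '研' => log(0.1)],
];
$default = log(0.05); // used for any (state, character) pair not listed above

function viterbi(array $obs, array $start, array $trans, array $emit, float $default): array
{
    $states = ['B', 'M', 'E', 'S'];
    $V = [];    // $V[$t][$state] = best log-probability of a path ending in $state at step $t
    $back = []; // $back[$t][$state] = previous state on that best path
    foreach ($states as $y) {
        $V[0][$y] = $start[$y] + ($emit[$y][$obs[0]] ?? $default);
    }
    for ($t = 1; $t < count($obs); $t++) {
        foreach ($states as $y) {
            $bestPrev = null;
            $bestScore = -INF;
            foreach ($states as $y0) {
                if (!isset($trans[$y0][$y])) {
                    continue; // transition not allowed
                }
                $score = $V[$t - 1][$y0] + $trans[$y0][$y];
                if ($score > $bestScore) {
                    $bestScore = $score;
                    $bestPrev = $y0;
                }
            }
            $V[$t][$y] = $bestScore + ($emit[$y][$obs[$t]] ?? $default);
            $back[$t][$y] = $bestPrev;
        }
    }
    // A sentence has to finish on E or S, so only those two are compared.
    $last = count($obs) - 1;
    $state = $V[$last]['E'] >= $V[$last]['S'] ? 'E' : 'S';
    $path = [$state];
    for ($t = $last; $t > 0; $t--) {
        $state = $back[$t][$state];
        array_unshift($path, $state);
    }
    return $path;
}

// Join characters into words: a word closes on every E or S label.
function tagsToWords(array $chars, array $path): array
{
    $words = [];
    $buffer = '';
    foreach ($chars as $i => $char) {
        $buffer .= $char;
        if ($path[$i] === 'E' || $path[$i] === 'S') {
            $words[] = $buffer;
            $buffer = '';
        }
    }
    if ($buffer !== '') {
        $words[] = $buffer;
    }
    return $words;
}

$chars = ['杭', '研'];
$path = viterbi($chars, $start, $trans, $emit, $default);
$hmmWords = tagsToWords($chars, $path);
echo implode('', $path), ' => ', implode(' / ', $hmmWords), "\n";
```

With these toy numbers the labels come out as B followed by E, so the two characters are joined into the single word 杭研 even though it is in no dictionary.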
- The cut method accepts two parameters: 1) the string to segment; 2) cut_all, which controls the segmentation mode.
- The string to segment must be UTF-8 encoded.
- cutForSearch accepts only one parameter, the string to segment, and cuts the sentence into shorter words.
- cut and cutForSearch both return an array of segments.
Example (Tutorial)
ini_set('memory_limit', '1024M');
require_once "/path/to/your/vendor/multi-array/MultiArray.php";
require_once "/path/to/your/vendor/multi-array/Factory/MultiArrayFactory.php";
require_once "/path/to/your/class/Jieba.php";
require_once "/path/to/your/class/Finalseg.php";
use Fukuball\Jieba\Jieba;
use Fukuball\Jieba\Finalseg;
Jieba::init();
Finalseg::init();
$seg_list = Jieba::cut("怜香惜玉也得要看对象啊!");
var_dump($seg_list);
$seg_list = Jieba::cut("我来到北京清华大学", true);
var_dump($seg_list); # Full mode
$seg_list = Jieba::cut("我来到北京清华大学", false);
var_dump($seg_list); # Default accurate mode
$seg_list = Jieba::cut("他来到了网易杭研大厦");
var_dump($seg_list);
$seg_list = Jieba::cutForSearch("小明硕士毕业于中国科学院计算所,后在日本京都大学深造"); # Search engine mode
var_dump($seg_list);
Output:
array(7) {
[0]=>
string(12) "怜香惜玉"
[1]=>
string(3) "也"
[2]=>
string(3) "得"
[3]=>
string(3) "要"
[4]=>
string(3) "看"
[5]=>
string(6) "对象"
[6]=>
string(3) "啊"
}
Full Mode:
array(15) {
[0]=>
string(3) "我"
[1]=>
string(3) "来"
[2]=>
string(6) "来到"
[3]=>
string(3) "到"
[4]=>
string(3) "北"
[5]=>
string(6) "北京"
[6]=>
string(3) "京"
[7]=>
string(3) "清"
[8]=>
string(6) "清华"
[9]=>
string(12) "清华大学"
[10]=>
string(3) "华"
[11]=>
string(6) "华大"
[12]=>
string(3) "大"
[13]=>
string(6) "大学"
[14]=>
string(3) "学"
}
Default Mode:
array(4) {
[0]=>
string(3) "我"
[1]=>
string(6) "来到"
[2]=>
string(6) "北京"
[3]=>
string(12) "清华大学"
}
array(6) {
[0]=>
string(3) "他"
[1]=>
string(6) "来到"
[2]=>
string(3) "了"
[3]=>
string(6) "网易"
[4]=>
string(6) "杭研"
[5]=>
string(6) "大厦"
}
(Here, "杭研" is not in the dictionary, but the Viterbi algorithm still recognized it.)
Search Engine Mode:
array(18) {
[0]=>
string(6) "小明"
[1]=>
string(6) "硕士"
[2]=>
string(6) "毕业"
[3]=>
string(3) "于"
[4]=>
string(6) "中国"
[5]=>
string(6) "科学"
[6]=>
string(6) "学院"
[7]=>
string(9) "科学院"
[8]=>
string(15) "中国科学院"
[9]=>
string(6) "计算"
[10]=>
string(9) "计算所"
[11]=>
string(3) "后"
[12]=>
string(3) "在"
[13]=>
string(6) "日本"
[14]=>
string(6) "京都"
[15]=>
string(6) "大学"
[16]=>
string(18) "日本京都大学"
[17]=>
string(6) "深造"
}
- Developers can specify their own custom dictionary to supplement the jieba dictionary. jieba can identify new words on its own, but adding your own ensures more accurate segmentation.
- Usage: Jieba::loadUserDict(file_name) # file_name is the path of the custom dictionary
- The dictionary format is the same as that of dict.txt: one word per line; each line is divided into two parts, the word itself and the word frequency, separated by a space.
- Example:
云计算 5
李小福 2
创新办 3
Before: 李小福 / 是 / 创新 / 办 / 主任 / 也 / 是 / 云 / 计算 / 方面 / 的 / 专家 /
After loading the custom dictionary: 李小福 / 是 / 创新办 / 主任 / 也 / 是 / 云计算 / 方面 / 的 / 专家 /
- JiebaAnalyse::extractTags($content, $top_k)
- content: the text to be extracted
- top_k: how many keywords with the highest TF-IDF weights to return; the default is 20
Example (keyword extraction)
ini_set('memory_limit', '600M');
require_once "/path/to/your/vendor/multi-array/MultiArray.php";
require_once "/path/to/your/vendor/multi-array/Factory/MultiArrayFactory.php";
require_once "/path/to/your/class/Jieba.php";
require_once "/path/to/your/class/Finalseg.php";
require_once "/path/to/your/class/JiebaAnalyse.php";
use Fukuball\Jieba\Jieba;
use Fukuball\Jieba\Finalseg;
use Fukuball\Jieba\JiebaAnalyse;
Jieba::init(array('mode'=>'test','dict'=>'small'));
Finalseg::init();
JiebaAnalyse::init();
$top_k = 10;
$content = file_get_contents("/path/to/your/dict/lyric.txt");
$tags = JiebaAnalyse::extractTags($content, $top_k);
var_dump($tags);
Output:
array(10) {
["是否"]=>
float(1.2196321889395)
["一般"]=>
float(1.0032459890209)
["肌迫"]=>
float(0.64654314660465)
["怯懦"]=>
float(0.44762844339349)
["藉口"]=>
float(0.32327157330233)
["逼不得已"]=>
float(0.32327157330233)
["不安全感"]=>
float(0.26548304656279)
["同感"]=>
float(0.23929673812326)
["有把握"]=>
float(0.21043366018744)
["空洞"]=>
float(0.20598261709442)
}
- Word Tagging Meaning:https://gist.github.com/luw2007/6016931
Example (word tagging)
ini_set('memory_limit', '600M');
require_once dirname(dirname(__FILE__))."/vendor/multi-array/MultiArray.php";
require_once dirname(dirname(__FILE__))."/vendor/multi-array/Factory/MultiArrayFactory.php";
require_once dirname(dirname(__FILE__))."/class/Jieba.php";
require_once dirname(dirname(__FILE__))."/class/Finalseg.php";
require_once dirname(dirname(__FILE__))."/class/Posseg.php";
use Fukuball\Jieba\Jieba;
use Fukuball\Jieba\Finalseg;
use Fukuball\Jieba\Posseg;
Jieba::init();
Finalseg::init();
Posseg::init();
$seg_list = Posseg::cut("这是一个伸手不见五指的黑夜。我叫孙悟空,我爱北京,我爱Python和C++。");
var_dump($seg_list);
Output:
array(21) {
[0]=>
array(2) {
["word"]=>
string(3) "这"
["tag"]=>
string(1) "r"
}
[1]=>
array(2) {
["word"]=>
string(3) "是"
["tag"]=>
string(1) "v"
}
[2]=>
array(2) {
["word"]=>
string(6) "一个"
["tag"]=>
string(1) "m"
}
[3]=>
array(2) {
["word"]=>
string(18) "伸手不见五指"
["tag"]=>
string(1) "i"
}
[4]=>
array(2) {
["word"]=>
string(3) "的"
["tag"]=>
string(2) "uj"
}
[5]=>
array(2) {
["word"]=>
string(6) "黑夜"
["tag"]=>
string(1) "n"
}
[6]=>
array(2) {
["word"]=>
string(3) "。"
["tag"]=>
string(1) "w"
}
[7]=>
array(2) {
["word"]=>
string(3) "我"
["tag"]=>
string(1) "r"
}
[8]=>
array(2) {
["word"]=>
string(3) "叫"
["tag"]=>
string(1) "v"
}
[9]=>
array(2) {
["word"]=>
string(9) "孙悟空"
["tag"]=>
string(2) "nr"
}
[10]=>
array(2) {
["word"]=>
string(3) ","
["tag"]=>
string(1) "w"
}
[11]=>
array(2) {
["word"]=>
string(3) "我"
["tag"]=>
string(1) "r"
}
[12]=>
array(2) {
["word"]=>
string(3) "爱"
["tag"]=>
string(1) "v"
}
[13]=>
array(2) {
["word"]=>
string(6) "北京"
["tag"]=>
string(2) "ns"
}
[14]=>
array(2) {
["word"]=>
string(3) ","
["tag"]=>
string(1) "w"
}
[15]=>
array(2) {
["word"]=>
string(3) "我"
["tag"]=>
string(1) "r"
}
[16]=>
array(2) {
["word"]=>
string(3) "爱"
["tag"]=>
string(1) "v"
}
[17]=>
array(2) {
["word"]=>
string(6) "Python"
["tag"]=>
string(3) "eng"
}
[18]=>
array(2) {
["word"]=>
string(3) "和"
["tag"]=>
string(1) "c"
}
[19]=>
array(2) {
["word"]=>
string(3) "C++"
["tag"]=>
string(3) "eng"
}
[20]=>
array(2) {
["word"]=>
string(3) "。"
["tag"]=>
string(1) "w"
}
}
Example (Tutorial)
ini_set('memory_limit', '1024M');
require_once dirname(dirname(__FILE__))."/vendor/multi-array/MultiArray.php";
require_once dirname(dirname(__FILE__))."/vendor/multi-array/Factory/MultiArrayFactory.php";
require_once dirname(dirname(__FILE__))."/class/Jieba.php";
require_once dirname(dirname(__FILE__))."/class/Finalseg.php";
use Fukuball\Jieba\Jieba;
use Fukuball\Jieba\Finalseg;
Jieba::init(array('mode'=>'default','dict'=>'big'));
Finalseg::init();
$seg_list = Jieba::cut("怜香惜玉也得要看对象啊!");
var_dump($seg_list);
$seg_list = Jieba::cut("憐香惜玉也得要看對象啊!");
var_dump($seg_list);
Output:
array(7) {
[0]=>
string(12) "怜香惜玉"
[1]=>
string(3) "也"
[2]=>
string(3) "得"
[3]=>
string(3) "要"
[4]=>
string(3) "看"
[5]=>
string(6) "对象"
[6]=>
string(3) "啊"
}
array(7) {
[0]=>
string(12) "憐香惜玉"
[1]=>
string(3) "也"
[2]=>
string(3) "得"
[3]=>
string(3) "要"
[4]=>
string(3) "看"
[5]=>
string(6) "對象"
[6]=>
string(3) "啊"
}
Example (Tutorial)
ini_set('memory_limit', '1024M');
require_once dirname(dirname(__FILE__))."/vendor/multi-array/MultiArray.php";
require_once dirname(dirname(__FILE__))."/vendor/multi-array/Factory/MultiArrayFactory.php";
require_once dirname(dirname(__FILE__))."/class/Jieba.php";
require_once dirname(dirname(__FILE__))."/class/Finalseg.php";
use Fukuball\Jieba\Jieba;
use Fukuball\Jieba\Finalseg;
Jieba::init(array('cjk'=>'all'));
Finalseg::init();
$seg_list = Jieba::cut("한국어 또는 조선말은 제주특별자치도를 제외한 한반도 및 그 부속 도서와 한민족 거주 지역에서 쓰이는 언어로");
var_dump($seg_list);
$seg_list = Jieba::cut("日本語は、主に日本国内や日本人同士の間で使われている言語である。");
var_dump($seg_list);
// Loading custom Japanese dictionary can do a simple word segmentation
Jieba::loadUserDict("/path/to/your/japanese/dict.txt");
$seg_list = Jieba::cut("日本語は、主に日本国内や日本人同士の間で使われている言語である。");
var_dump($seg_list);
Output:
array(15) {
[0]=>
string(9) "한국어"
[1]=>
string(6) "또는"
[2]=>
string(12) "조선말은"
[3]=>
string(24) "제주특별자치도를"
[4]=>
string(9) "제외한"
[5]=>
string(9) "한반도"
[6]=>
string(3) "및"
[7]=>
string(3) "그"
[8]=>
string(6) "부속"
[9]=>
string(9) "도서와"
[10]=>
string(9) "한민족"
[11]=>
string(6) "거주"
[12]=>
string(12) "지역에서"
[13]=>
string(9) "쓰이는"
[14]=>
string(9) "언어로"
}
array(21) {
[0]=>
string(6) "日本"
[1]=>
string(3) "語"
[2]=>
string(3) "は"
[3]=>
string(3) "主"
[4]=>
string(3) "に"
[5]=>
string(6) "日本"
[6]=>
string(6) "国内"
[7]=>
string(3) "や"
[8]=>
string(6) "日本"
[9]=>
string(3) "人"
[10]=>
string(6) "同士"
[11]=>
string(3) "の"
[12]=>
string(3) "間"
[13]=>
string(3) "で"
[14]=>
string(3) "使"
[15]=>
string(3) "わ"
[16]=>
string(6) "れて"
[17]=>
string(6) "いる"
[18]=>
string(6) "言語"
[19]=>
string(3) "で"
[20]=>
string(6) "ある"
}
array(17) {
[0]=>
string(9) "日本語"
[1]=>
string(3) "は"
[2]=>
string(6) "主に"
[3]=>
string(9) "日本国"
[4]=>
string(3) "内"
[5]=>
string(3) "や"
[6]=>
string(9) "日本人"
[7]=>
string(6) "同士"
[8]=>
string(3) "の"
[9]=>
string(3) "間"
[10]=>
string(3) "で"
[11]=>
string(3) "使"
[12]=>
string(3) "わ"
[13]=>
string(6) "れて"
[14]=>
string(6) "いる"
[15]=>
string(6) "言語"
[16]=>
string(9) "である"
}
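a adjective (first letter of the English word adjective.)
ad adverbial adjective (an adjective used directly as an adverbial; adjective code a combined with adverb code d.)
ag adjectival morpheme (adjective code a placed before morpheme code g.)
an nominal adjective (an adjective that functions as a noun; adjective code a combined with noun code n.)
b distinguishing word (from the initial of the Chinese character 「别」 (bie).)
c conjunction (first letter of the English word conjunction.)
d adverb (second letter of adverb, since the first letter is already used for adjectives.)
df adverb*
dg adverbial morpheme (adverb code d placed before morpheme code g.)
e interjection (first letter of the English word exclamation.)
eng foreign word
f locality word (from the initial of the Chinese character 「方」 (fang).)
g morpheme (most morphemes can serve as the "root" of a compound word; from the initial of the Chinese character 「根」 (gen).)
h prefix (first letter of the English word head.)
i idiom (first letter of the English word idiom.)
j abbreviation (from the initial of the Chinese character 「简」 (jian).)
k suffix
l set phrase (a set phrase that has not yet become an idiom and is somewhat "temporary"; from the initial of 「临」 (lin).)
m numeral (third letter of the English word numeral, since n and u are already taken.)
mg numeral morpheme
mq numeral*
n noun (first letter of the English word noun.)
ng nominal morpheme (noun code n placed before morpheme code g.)
nr person name (noun code n combined with the initial of 「人」 (ren).)
nrfg noun*
nrt noun*
ns place name (noun code n combined with place-word code s.)
nt organization (the initial of 「团」 (tuan) is t; noun code n combined with t.)
nz other proper noun (the first letter of the initial of 「专」 (zhuan) is z; noun code n combined with z.)
o onomatopoeia (first letter of the English word onomatopoeia.)
p preposition (first letter of the English word prepositional.)
q measure word (first letter of the English word quantity.)
r pronoun (second letter of the English word pronoun, since p is already used for prepositions.)
rg pronominal morpheme
rr pronoun*
rz pronoun*
s place word (first letter of the English word space.)
t time word (first letter of the English word time.)
tg time morpheme (time-word code t placed before morpheme code g.)
u auxiliary (second letter of the English word auxiliary, since a is already used for adjectives.)
ud auxiliary*
ug auxiliary*
uj auxiliary*
ul auxiliary*
uv auxiliary*
uz auxiliary*
v verb (first letter of the English word verb.)
vd adverbial verb (a verb used directly as an adverbial; verb and adverb codes combined.)
vg verbal morpheme
vi verb*
vn nominal verb (a verb that functions as a noun; verb and noun codes combined.)
vq verb*
w punctuation
x non-morpheme character (a non-morpheme character is just a symbol; the letter x commonly stands for unknowns and symbols.)
y modal particle (from the initial of the Chinese character 「语」 (yu).)
z status word (first letter of the initial of the Chinese character 「状」 (zhuang).)
zg status word*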
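To address memory usage when processing large amounts of text, jieba-php provides a new JiebaMemory class that manages memory usage for all classes in one place.
ini_set('memory_limit', '1024M');
use Fukuball\Jieba\JiebaMemory;
// Initialize all classes
JiebaMemory::initAll();
// Check initialization status
$status = JiebaMemory::getInitializationStatus();
var_dump($status);
// Get memory usage statistics
$stats = JiebaMemory::getMemoryStats();
echo "Current memory usage: " . $stats['current_memory_usage_formatted'] . "\n";
echo "Peak memory usage: " . $stats['peak_memory_usage_formatted'] . "\n";
// Clear all caches but keep the classes initialized
JiebaMemory::clearAllCaches();
// Destroy all classes to release memory
JiebaMemory::destroyAll();
// Get cache statistics for all classes
$cacheStats = JiebaMemory::getAllCacheStats();
echo "Jieba DAG cache size: " . $cacheStats['jieba']['dag_cache_size'] . "\n";
echo "Posseg word tag count: " . $cacheStats['posseg']['word_tag_size'] . "\n";
echo "JiebaAnalyse IDF frequency count: " . $cacheStats['jieba_analyse']['idf_freq_size'] . "\n";
When processing large amounts of text, jieba-php uses internal caches to improve performance. The following functions help manage cache memory usage:
Jieba::clearCache() clears all internal caches to free memory. It is useful when processing several large text files.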
ini_set('memory_limit', '1024M');
use Fukuball\Jieba\Jieba;
use Fukuball\Jieba\Finalseg;
Jieba::init();
Finalseg::init();
// Process the first file
$text1 = file_get_contents('large_file1.txt');
$seg_list1 = Jieba::cut($text1);
// Clear the cache before processing the next file
Jieba::clearCache();
// Process the second file
$text2 = file_get_contents('large_file2.txt');
$seg_list2 = Jieba::cut($text2);
Jieba::getCacheStats() returns the current cache usage for monitoring.
$stats = Jieba::getCacheStats();
echo "DAG cache size: " . $stats['dag_cache_size'] . "\n";
echo "Trie cache size: " . $stats['trie_cache_size'] . "\n";
echo "Memory usage: " . round($stats['total_memory_usage'] / 1024 / 1024, 2) . "M\n";
echo "Peak memory: " . round($stats['peak_memory_usage'] / 1024 / 1024, 2) . "M\n";
Jieba::clearCacheIfNeeded() clears the cache automatically when it exceeds the given size limits.
// Clear the cache if the DAG cache exceeds 50,000 entries or the trie cache exceeds 50,000 entries
$cleared = Jieba::clearCacheIfNeeded(50000, 50000);
if ($cleared) {
echo "Cache was cleared because it exceeded the size limits\n";
}
// Custom limits
$cleared = Jieba::clearCacheIfNeeded(10000, 10000);
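- For CLI applications that process multiple files, call clearCache() or JiebaMemory::clearAllCaches() after each file
- Use getCacheStats() or JiebaMemory::getMemoryStats() to monitor memory usage
- Consider clearCacheIfNeeded() for automatic cache management
- Note that clearing the cache resets performance optimizations until the cache is rebuilt
- Use JiebaMemory::destroyAll() to release memory completely; the classes must be re-initialized before they can be used again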
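The size-threshold idea behind clearCacheIfNeeded() can be illustrated with a small stand-alone cache. The class `BoundedCache` below is hypothetical and is not part of jieba-php; it only shows the trade-off described above, where clearing frees memory at the cost of recomputing entries later.

```php
<?php
// A toy memo cache that empties itself once it grows past a limit.
// This class is hypothetical and is not part of jieba-php.
final class BoundedCache
{
    private array $items = [];

    // Return the cached value for $key, computing and storing it on a miss.
    public function get(string $key, callable $compute)
    {
        if (!array_key_exists($key, $this->items)) {
            $this->items[$key] = $compute($key);
        }
        return $this->items[$key];
    }

    public function size(): int
    {
        return count($this->items);
    }

    public function clear(): void
    {
        $this->items = [];
    }

    // Returns true when the cache was over the limit and got cleared.
    public function clearIfNeeded(int $maxItems): bool
    {
        if ($this->size() > $maxItems) {
            $this->clear();
            return true;
        }
        return false;
    }
}

$cache = new BoundedCache();
$cleared = 0;
foreach (['aa', 'bb', 'aa', 'cc', 'dd'] as $sentence) {
    $cache->get($sentence, 'strtoupper'); // stand-in for an expensive computation
    if ($cache->clearIfNeeded(2)) {
        $cleared++;
    }
}
echo "cleared {$cleared} time(s), {$cache->size()} item(s) cached\n";
```

A lower limit keeps memory flatter but clears more often, so the best value depends on how repetitive the input text is.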
If you find jieba-php useful, please consider a donation. Thank you!
- bitcoin: 1BbihQU3CzSdyLSP9bvQq7Pi1z1jTdAaq9
- eth: 0x92DA3F837bf2F79D422bb8CEAC632208F94cdE33
The MIT License (MIT)
Copyright (c) 2015 fukuball
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.