TermExtract::BrillsTagger -- Technical term extraction module ("Brill's Tagger" version)
use TermExtract::BrillsTagger;
A program that runs the input text through "Brill's Tagger" (a part-of-speech tagging program for English) and, based on the result, extracts technical terms from the input text. It also supports Monty Tagger, which was built on Brill's Tagger.
When tagging with Brill's Tagger, we recommend first running the text through Tokenizer.pl, the Perl script bundled with Brill's Tagger.
For how to use this module, see the parent class (TermExtract::Calc_Imp) or the sample script below.
    #!/usr/local/bin/perl -w
    #
    #  ex_BT.pl
    #
    #  Reads the output of Brill's Tagger from a file and writes the
    #  technical terms and their importance scores to standard output.
    #
    #  version 0.14
    #
    #
    use TermExtract::BrillsTagger;
    #use strict;
    my $data = TermExtract::BrillsTagger->new;
    my $InputFile = "BT_out.txt";    # input file

    # Handling of abnormal process termination
    # (only relevant when a lock directory is used)
    # NOTE: a sub named sigexit must be defined for this handler to take effect
    $SIG{INT} = $SIG{QUIT} = $SIG{TERM} = 'sigexit';

    # Output mode
    # 1 → technical term + importance, 2 → technical term only
    # 3 → comma separated
    my $output_mode = 1;

    #
    # Choose whether the importance calculation uses the "total count",
    # the "distinct count", or the "perplexity" of adjacent words.
    # Perplexity cannot be combined with the learning function.
    # It is also possible not to use adjacent-word information at all; in
    # that case importance is calculated from the term occurrence count
    # (combined with IDF, if that is configured).
    # (Default is the total count: $obj->use_total)
    #
    #$data->use_total;      # use the total count
    #$data->use_uniq;       # use the distinct count
    #$data->use_Perplexity; # use perplexity (TermExtract 3.04 or later)
    #$data->no_LR;          # do not use adjacency information (TermExtract 4.02 or later)

    #
    # Choose the term frequency information that is multiplied with the
    # adjacency information in the importance calculation.
    # Combined with $data->no_LR, importance can be computed from term
    # frequency alone.
    # (Default is "Frequency": $data->use_frq)
    # TF also counts a term when it appears as part of another term.
    # Frequency does not count a term when it appears as part of another term.
    #
    #$data->use_TF;   # TF (Term Frequency) (TermExtract 4.02 or later)
    #$data->use_frq;  # term frequency by Frequency
    #$data->no_frq;   # do not use frequency information

    #
    # Choose whether the importance calculation uses the learning function
    # (Default is not to use it: $obj->no_stat)
    #
    #$data->use_stat;  # use the learning function
    #$data->no_stat;   # do not use the learning function

    #
    # Set whether the importance calculation puts more weight on "term
    # frequency in the document" or on "importance of adjacent words".
    # The default value is 1.
    # The larger the value, the more weight "term frequency in the
    # document" receives.
    #
    #$data->average_rate(0.5);

    #
    # Choose whether to accumulate data in the learning DB.
    # When the learning function is used in the importance calculation, it
    # is safer to enable this. The calculation does not work correctly if
    # the input contains words that are not registered in the learning DB.
    # (Default is not to accumulate: $obj->no_storage)
    #
    #$data->use_storage;  # accumulate
    #$data->no_storage;   # do not accumulate

    #
    # Use SDBM_File as the DBM for the learning DB
    # (Default is DB_File in BTREE mode)
    #
    #$data->use_SDBM;

    #
    # Set the database file names used when cumulative statistics from
    # past documents are used
    # (Defaults are "stat.db" and "comb.db")
    #
    #$data->stat_db("stat.db");
    #$data->comb_db("comb.db");

    #
    # Specify a temporary directory for exclusive locking of the database.
    # If the directory name is the empty string (the default), no locking
    # is done.
    #
    #$data->lock_dir("lock_dir");

    #
    # Read data from part-of-speech tagged text and return the list of
    # technical terms as an array
    # (set to use the cumulative statistics DB and in-document frequency)
    #
    #my @noun_list = $data->get_imp_word($str, 'var');  # input is a variable
    my @noun_list = $data->get_imp_word($InputFile);     # input is a file

    #
    # Change the mode and return the list of technical terms again, based
    # on the tagged text file that was read last time
    #$data->use_stat->no_frq;
    #my @noun_list2 = $data->get_imp_word();
    # The result can also be combined with the result of another mode
    #@noun_list = $data->result_filter (\@noun_list, \@noun_list2, 30, 1000);

    #
    # Write the technical terms and their calculated importance to
    # standard output
    #
    foreach (@noun_list) {
        # do not print entries that are numbers only
        next if $_->[0] =~ /^\d+$/;

        # print the result
        printf "%-60s %16.2f\n", $_->[0], $_->[1] if $output_mode == 1;
        printf "%s\n",           $_->[0]          if $output_mode == 2;
        printf "%s,",            $_->[0]          if $output_mode == 3;
    }
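The commented-out get_imp_word($str, 'var') call in the script above takes the tagged text from a variable instead of a file. The following is a minimal sketch of that variant. It assumes the tagged text uses the word/TAG form written by Brill's Tagger; the sample sentence is made up for illustration, and the call is skipped when the module is not installed.

```perl
use strict;
use warnings;

# Tagged text in word/TAG form (illustrative sentence, not real tagger output).
my $tagged = "The/DT tagger/NN assigns/VBZ part-of-speech/JJ tags/NNS ./.";

if (eval { require TermExtract::BrillsTagger; 1 }) {
    my $data = TermExtract::BrillsTagger->new;

    # Second argument 'var': the first argument is the data itself,
    # not a file name.
    my @noun_list = $data->get_imp_word($tagged, 'var');

    # Each element is an array reference: [term, importance].
    printf "%-40s %10.2f\n", $_->[0], $_->[1] for @noun_list;
}
else {
    print "TermExtract::BrillsTagger is not installed; skipping.\n";
}
```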
This module implements only get_imp_word; all other methods are implemented in the parent module TermExtract::Calc_Imp. get_imp_word takes the words obtained from part-of-speech tagging and builds compound terms from them, based on the order of the individual words and their part-of-speech information. For the other methods, see the POD documentation of TermExtract::Calc_Imp.
Builds compound terms from the part-of-speech tagging result for English text according to the following rules. The first argument is the data to process, and the second argument gives the type of the first argument. By default, the first argument is a part-of-speech tagged text file. When the second argument is set to the string 'var', the first argument is interpreted as a scalar variable holding part-of-speech tagged text data.
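1. Each part of speech is joined as follows
   (1) Noun (NN)                  → joins to a noun, adjective, cardinal number, or past-participle verb. Can start a compound.
   (2) Foreign word (FW)          → treated as a single word
   (3) Cardinal number (CD)       → allowed only at the start of a compound
   (4) Adjective (JJ)             → joins to an adjective, possessive ending, or cardinal number. Can start a compound.
   (5) Possessive ending (POS)    → joins to a noun
   (6) of                         → joins to a noun
   (7) Past-participle verb (VBN) → allowed only at the start of a compound
2. A line break ends the compound at that point
3. A word that begins with a number or one of the following symbols ends the compound at that point
+-%\&\$*#^|
4. A compound must end with a noun or a foreign word; anything after that is discarded
5. A noun other than a proper noun is converted to lowercase if it begins with a capital letter
6. Plural nouns (NNS) in a compound are changed to the singular form
7. A word delimited by ' (single quotation mark) is treated as a single word
8. A trailing , or . at the end of a compound is removed
9. The following words are ignored in the importance calculation: of Of OF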
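The joining rules above can be sketched in a few lines of Perl. This is a simplified illustration written for this document, not the module's actual code: the names may_follow, tag_class, singular and extract_compounds are hypothetical, all tags starting with NN are treated as nouns, rule 5 is read as lowercasing only the first letter, the singular conversion is deliberately naive, and "of", possessive endings, single quotes (rule 7) and the importance calculation (rule 9) are left out.

```perl
use strict;
use warnings;

# Rule 1 (subset): which previous class a word class may attach to.
# HEAD means "may start a compound". Assumed table, derived from the list above.
my %may_follow = (
    NN  => { HEAD => 1, NN => 1, JJ => 1, CD => 1, VBN => 1 },
    JJ  => { HEAD => 1, JJ => 1, CD => 1 },
    CD  => { HEAD => 1 },
    VBN => { HEAD => 1 },
);

# Map a tag to one of the classes handled here; '' means "not part of a term".
sub tag_class {
    my ($tag) = @_;
    return 'NN' if $tag =~ /^NN/;                     # NN, NNS, NNP, NNPS
    return $tag if $tag =~ /^(?:JJ|CD|VBN|FW)$/;
    return '';
}

# Naive singular form (rule 6); real English needs a proper inflection table.
sub singular {
    my ($w) = @_;
    return $w if $w =~ s/ies$/y/;
    return $w if $w =~ s/(s|x|ch|sh)es$/$1/;
    $w =~ s/(?<!s)s$//;
    return $w;
}

sub extract_compounds {
    my ($tagged) = @_;
    my @terms;
    for my $line (split /\n/, $tagged) {              # rule 2: line break ends a compound
        my @cur;                                      # [word, class] pairs
        my $flush = sub {
            pop @cur while @cur && $cur[-1][1] ne 'NN';   # rule 4: must end in a noun
            push @terms, join(' ', map { $_->[0] } @cur) if @cur;
            @cur = ();
        };
        for my $token (split ' ', $line) {
            my ($word, $tag) = $token =~ m{^(.+)/([^/]+)$};
            my $class = defined $tag ? tag_class($tag) : '';
            $word = '' unless defined $word;
            $word =~ s/[,.]+$//;                      # rule 8: drop trailing , and .
            # rule 3: a leading symbol or digit ends the compound
            if ($class eq '' || $word eq '' || $word =~ /^[-+%&\$*#^|\d]/) {
                $flush->();
                next;
            }
            if ($class eq 'FW') {                     # rule 1(2): stands alone
                $flush->();
                push @terms, $word;
                next;
            }
            $word = lcfirst $word   if $tag eq 'NN' || $tag eq 'NNS';   # rule 5
            $word = singular($word) if $tag eq 'NNS';                   # rule 6
            my $prev = @cur ? $cur[-1][1] : 'HEAD';
            $flush->() unless $may_follow{$class}{$prev};
            push @cur, [$word, $class];
        }
        $flush->();
    }
    return @terms;
}

my $sample = "The/DT statistical/JJ taggers/NNS use/VBP two/CD lexical/JJ rules/NNS ./.\n"
           . "Output/NN\n"
           . "files/NNS hold/VBP tagged/VBN text/NN ,/, 100/CD words/NNS";
print "$_\n" for extract_compounds($sample);
```

The sample input is invented. The second and third lines show that a noun at the end of one line is not joined to a noun at the start of the next, and that a token beginning with a digit splits the surrounding words into separate terms.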
TermExtract::Calc_Imp
TermExtract::Chasen
TermExtract::MeCab
TermExtract::EnglishPlainText
TermExtract::ChainesPlainTextUC
TermExtract::ChainesPlainTextGB
TermExtract::ICTCLAS
TermExtract::JapanesePlainTextEUC
TermExtract::JapanesePlainTextSJIS
This program is a rewrite, as a TermExtract module, of termex_e.pl from the "Automatic Technical Term Extraction System" created by Professor Hiroshi Nakagawa (University of Tokyo) and Associate Professor Tatsunori Mori (Yokohama National University). The rewrite was done by Akira Maeda of the University of Tokyo (maeda@lib.u-tokyo.ac.jp).
We accept no responsibility whatsoever for any results arising from the use of this program.