Lingua::LanguageGuesser - An implementation of the ``TextCat'' language guesser as a Perl module rather than a Perl script, with added support for UTF-8 encoded text.
use Lingua::LanguageGuesser;

$text = 'some strings';
@lang_list_sorted_similarity = Lingua::LanguageGuesser
    ->guess($text)
    ->eliminate()
    ->suspect('english', 'japanese-euc_jp')
    ->result_list();
print "Input is perhaps $lang_list_sorted_similarity[0]";
The ``guess'' method is the constructor of a Lingua::LanguageGuesser object.
The first parameter (a hash reference) holds the options and may be omitted. The ``utf8'' option controls how UTF-8 encoded text is handled. With ``auto'', if the Encode::Guess module guesses that the text is UTF-8 encoded, only the UTF-8 language models are used. With ``omit'', the UTF-8 language models are left out and the module works like the original ``TextCat'' program. With ``include'', both the original and the additional UTF-8 language models are used.
'MyModel' sets the path of the directory that holds your own language models. 'MyModel_utf8' does the same for your own UTF-8 language models.
'MaxLine' is the upper limit of the input text size. If you have created your own language model with the create_model() method, parameter ...
The second parameter is the text string whose language is to be guessed. You can call the ``new'' method instead of ``guess''; ``new'' is an alias of ``guess''.
use Lingua::LanguageGuesser;
my $guesser = Lingua::LanguageGuesser->guess(
    {
        utf8         => 'include',    # select "auto", "omit" or "include"
        MyModel      => './my_language_model_directory',
        MyModel_utf8 => './my_utf8_language_model_directory',
        MaxLine      => 1000
    },
    $textstring
);
or
my $guesser = Lingua::LanguageGuesser->new(
    {
        utf8         => 'include',    # select "auto", "omit" or "include"
        MyModel      => './my_language_model_directory',
        MyModel_utf8 => './my_utf8_language_model_directory',
        MaxLine      => 1000
    },
    $textstring
);
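The ``auto'' setting depends on Encode::Guess deciding whether the input is UTF-8. The following stand-alone sketch shows one way such a check can look. It is only an illustration, not the module's own code: the subroutine name looks_like_utf8 and the sample strings are hypothetical, and how the module itself calls Encode::Guess is not stated in this document.

```perl
use strict;
use warnings;
use Encode::Guess;    # exports guess_encoding()

# Hypothetical helper, not part of Lingua::LanguageGuesser.
# Returns 1 when Encode::Guess decides the byte string is UTF-8, else 0.
sub looks_like_utf8 {
    my ($octets) = @_;

    # With no extra suspects, Encode::Guess checks ascii, utf8 and
    # BOM-marked UTF-16/32. On failure it returns an error string
    # instead of an encoding object.
    my $enc = guess_encoding($octets);
    return 0 unless ref $enc;
    return $enc->name =~ /^utf-?8/i ? 1 : 0;
}

# Made-up sample inputs (byte strings).
my $japanese = "\xE6\x97\xA5\xE6\x9C\xAC\xE8\xAA\x9E";    # UTF-8 bytes
my $latin1   = "caf\xE9";                                  # not valid UTF-8

for my $sample ($japanese, $latin1, 'plain ascii') {
    print looks_like_utf8($sample) ? "guessed as UTF-8\n" : "not guessed as UTF-8\n";
}
```

Pure ASCII text is reported as ascii by Encode::Guess, so this sketch treats it as not UTF-8.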
You can also construct the object first and set the text string to be guessed later, as follows.
my $guesser = Lingua::LanguageGuesser->new(
    {
        utf8         => 'include',    # select "auto", "omit" or "include"
        MyModel      => './my_language_model_directory',
        MyModel_utf8 => './my_utf8_language_model_directory',
        MaxLine      => 1000
    }
);
print "$textstring1 is ", $guesser->set_text($textstring1)->best_scoring(), "\n";
print "$textstring2 is ", $guesser->set_text($textstring2)->best_scoring(), "\n";
Three output methods are supported, as follows.
1. The ``score_of_lang()'' method returns the languages with their similarity scores (as a Perl hash).
%score_of_lang = $guesser->score_of_lang();
$score_of_english = $score_of_lang{'english'};
2. The ``result_list()'' method returns a list of languages sorted by their similarity to the text (as a Perl array).
@lang_list_sorted_similarity = $guesser->result_list();
3. The ``best_scoring()'' method returns only the name of the most similar language (as a Perl scalar). If you use the ``eliminate()'' method and more than one candidate remains, it returns the string ``Two or more suspects remain''.
$most_similar_lang = $guesser->best_scoring();
or
$most_similar_lang = $guesser->eliminate()->best_scoring();
if ($most_similar_lang eq 'Two or more suspects remain') {
    # any code
}
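How the three return shapes relate to each other can be sketched in plain Perl without the module. This is an illustration under assumptions: it assumes that a lower score means a closer match (as with the rank distances of the original TextCat), and the subroutine names rank_languages and single_winner as well as the scores are hypothetical and not part of Lingua::LanguageGuesser.

```perl
use strict;
use warnings;

# Hypothetical helpers; not part of Lingua::LanguageGuesser.

# Turn a language => score hash into a list of names, best first.
# Assumption: a lower score means a closer match.
sub rank_languages {
    my (%score_of_lang) = @_;
    return sort { $score_of_lang{$a} <=> $score_of_lang{$b} or $a cmp $b }
           keys %score_of_lang;
}

# Map the "no single winner" string named in the text to undef so that
# callers can test the result with defined().
sub single_winner {
    my ($best) = @_;
    return if $best eq 'Two or more suspects remain';
    return $best;
}

# Made-up scores for illustration only.
my %score_of_lang = (english => 1000, french => 1500, dutch => 1080);
my @ranked = rank_languages(%score_of_lang);
print "ranked: @ranked\n";

for my $answer ('english', 'Two or more suspects remain') {
    my $lang = single_winner($answer);
    print defined $lang ? "winner: $lang\n" : "ambiguous result\n";
}
```

Converting the sentinel string to undef keeps the string comparison in one place; callers that prefer the raw string can keep comparing against it directly.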
This module supports two methods for filtering the candidate languages: ``eliminate()'' and ``suspect()''. The order in which the methods are called influences the final candidates, because the method called first is applied first. The ``eliminate()'' method removes languages from the candidates according to their similarity score for the input text. You can pass a parameter that sets how much worse than the best score a language may score before it is removed. The default value is 1.05.
$guesser->eliminate( 1.1 ) ;
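The effect of the threshold can be sketched in plain Perl without the module. This is only an illustration under assumptions: it assumes that a lower score means a closer match (as with the rank distances of the original TextCat) and that a language is kept when its score is at most the best score multiplied by the threshold. The subroutine name eliminate_by_ratio and the scores are hypothetical and not part of Lingua::LanguageGuesser.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Lingua::LanguageGuesser.
# Assumes lower scores are better and keeps every language whose
# score is at most (best score * $ratio).
sub eliminate_by_ratio {
    my ($scores, $ratio) = @_;
    $ratio = 1.05 unless defined $ratio;    # default value named in the text

    # Sort language names from best (lowest) to worst score.
    my @ranked = sort { $scores->{$a} <=> $scores->{$b} or $a cmp $b }
                 keys %$scores;
    return () unless @ranked;

    my $limit = $scores->{ $ranked[0] } * $ratio;
    return grep { $scores->{$_} <= $limit } @ranked;
}

# Made-up scores for illustration only.
my %scores = (english => 1000, scots => 1040, dutch => 1080, french => 1500);

print join(', ', eliminate_by_ratio(\%scores)), "\n";         # default 1.05
print join(', ', eliminate_by_ratio(\%scores, 1.1)), "\n";    # looser threshold
```

With these made-up scores, the default keeps english and scots, while 1.1 also keeps dutch.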
The ``suspect()'' method restricts the candidate languages to those named in the given list.
$guesser->suspect('english', 'japanese-euc_jp');
You can chain several methods as follows.
@lang_list_sorted_similarity = Lingua::LanguageGuesser
    ->guess($textstring)
    ->eliminate()
    ->suspect('english', 'japanese-euc_jp')
    ->result_list();
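Why the order of the two filters matters can be seen in a small stand-alone sketch. It does not use the module. The subroutines by_ratio and by_list are hypothetical stand-ins for ``eliminate()'' and ``suspect()'', the scores are made up, and the sketch assumes that a lower score is better and that the threshold is applied relative to the best score among the candidates still remaining.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for the two filters; not the module's code.
# Assumption: lower scores are better.

# Keep languages scoring at most (best remaining score * $ratio).
sub by_ratio {
    my ($scores, $ratio, @langs) = @_;
    return () unless @langs;
    my ($best) = sort { $a <=> $b } map { $scores->{$_} } @langs;
    return grep { $scores->{$_} <= $best * $ratio } @langs;
}

# Keep only languages named in the allowed list.
sub by_list {
    my ($allowed, @langs) = @_;
    my %ok = map { $_ => 1 } @$allowed;
    return grep { $ok{$_} } @langs;
}

# Made-up scores for illustration only.
my %scores  = (german => 1000, dutch => 1030, english => 1200, french => 1230);
my @all     = sort { $scores{$a} <=> $scores{$b} } keys %scores;
my @allowed = ('english', 'french');

# Eliminate first: only german and dutch survive the threshold, so the
# list filter then leaves nothing.
my @eliminate_first = by_list(\@allowed, by_ratio(\%scores, 1.05, @all));

# Suspect first: english becomes the best remaining score, and french
# is within 5% of it.
my @suspect_first = by_ratio(\%scores, 1.05, by_list(\@allowed, @all));

print "eliminate, then suspect: @eliminate_first\n";
print "suspect, then eliminate: @suspect_first\n";
```

Under these assumptions the first order ends with no candidates and the second with english and french, so the chain should be written in the order that matches the intended question.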
You can create or delete your own language models. If a language of the same name exists in the default language models, your language model takes priority.
The ``create_model()'' method creates or rebuilds a language model. Pass your language model directory path, the source file from which to create the language model, and the language name, as follows.
use Lingua::LanguageGuesser qw(create_model);
create_model('./my_language_model_directory', 'source_file', $language_name);
The ``delete_model()'' method deletes a language model. Pass your language model directory path and the name of the language you want to delete.
use Lingua::LanguageGuesser qw(delete_model);
delete_model('./my_language_model_directory', $language_name);
The ``list_my_models()'' method returns the list of your own language models.
use Lingua::LanguageGuesser qw(list_my_models);
my @mymodel = list_my_models('./my_language_model_directory');
The module exports nothing by default. Only ``create_model'', ``delete_model'' and ``list_my_models'' can be exported on demand, as in
use Lingua::LanguageGuesser qw(create_model delete_model list_my_models);
Other language guessers on CPAN
Lingua::Identify, Language::Guess, Text::Language::Guess, Text::Ngram::LanguageDetermine
``TextCat'' information
http://www.let.rug.nl/~vannoord/TextCat/
Home page of our ``Gensen'' project
http://gensen.dl.itc.u-tokyo.ac.jp/
Akira Maeda <maeda@lib.u-tokyo.ac.jp>
Copyright (C) 2006 by Akira Maeda (maeda@lib.u-tokyo.ac.jp). The original source code of ``TextCat'' was written by Gertjan van Noord.
This library is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License (GPL) as published by the Free Software Foundation (http://www.fsf.org/); either version 2 of the License, or (at your option) any later version.