NAME

Lingua::LanguageGuesser - Implementation of the ``TextCat'' language guesser as a Perl module rather than a Perl script, with added support for utf8 encoded text.


SYNOPSIS

  use Lingua::LanguageGuesser;
  $text = 'some strings';
  @lang_list_sorted_similarity =
      Lingua::LanguageGuesser
           ->guess($text)
           ->eliminate()
           ->suspect('english', 'japanese-euc_jp')
           ->result_list();
  print "Input is perhaps $lang_list_sorted_similarity[0]";


DESCRIPTION

The ``guess'' method is the constructor of a Lingua::LanguageGuesser object. ``new'' is an alias of ``guess'', so you can call either. The first parameter is an optional hash reference of options. The second parameter is the text string whose language is to be guessed.

The ``utf8'' option decides how utf8 encoded text is treated. With ``auto'', the module switches to the utf8 language models only when Encode::Guess judges the text to be utf8 encoded. With ``omit'', the utf8 language models are left out and the module works like the original ``TextCat'' program. With ``include'', both the original and the additional utf8 language models are used.

'MyModel' sets the path of the directory that holds your own language models, such as those created with create_model(). 'MyModel_utf8' does the same for your own utf8 language models. 'MaxLine' is the upper limit on the input text size.

  use Lingua::LanguageGuesser;
  my $guesser = Lingua::LanguageGuesser->guess(
                    {
                      utf8         => 'include', # select "auto", "omit" or "include"
                      MyModel      => './my_language_model_directory',
                      MyModel_utf8 => './my_utf8_language_model_directory',
                      MaxLine      => 1000
                    },
                    $textstring
                );

or

  my $guesser = Lingua::LanguageGuesser->new(
                    {
                      utf8         => 'include', # select "auto", "omit" or "include"
                      MyModel      => './my_language_model_directory',
                      MyModel_utf8 => './my_utf8_language_model_directory',
                      MaxLine      => 1000
                    },
                    $textstring
                );

You can also create the object without a text string and set the text to guess later with ``set_text'', as follows.

  my $guesser = Lingua::LanguageGuesser->new(
                    {
                      utf8         => 'include', # select "auto", "omit" or "include"
                      MyModel      => './my_language_model_directory',
                      MyModel_utf8 => './my_utf8_language_model_directory',
                      MaxLine      => 1000
                    }
                );
  print "$textstring1 is ", $guesser->set_text($textstring1)->best_scoring(), "\n";
  print "$textstring2 is ", $guesser->set_text($textstring2)->best_scoring(), "\n";

Three output methods are supported.

1. The ``score_of_lang()'' method returns the languages with their similarity scores (as a Perl hash).

  %score_of_lang = $guesser->score_of_lang();
  $score_of_english = $score_of_lang{'english'};

2. The ``result_list()'' method returns the list of languages sorted by similarity to the text (as a Perl array).

  @lang_list_sorted_similarity = $guesser->result_list();

3. The ``best_scoring()'' method returns only the name of the most similar language (as a Perl scalar). If you use the ``eliminate()'' method and two or more candidates remain, it returns the string ``Two or more suspects remain''.

  $most_similar_lang = $guesser->best_scoring();

or

  $most_similar_lang = $guesser->eliminate()->best_scoring();
  if ($most_similar_lang eq 'Two or more suspects remain') {
      # handle the ambiguous result here
  }

This module supports two methods for filtering language candidates: ``eliminate()'' and ``suspect()''. The order in which you call them influences the final candidates, because the methods are applied in the order they are called. The ``eliminate()'' method removes languages from the candidates according to their similarity score for the input text. You can pass a parameter that sets how much worse than the best score a language may be before it is removed. The default value is 1.05.

  $guesser->eliminate( 1.1 ) ;
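
As a rough illustration of what the threshold means, the following sketch does not use the module at all. It assumes, as in the original ``TextCat'' script, that a lower score means a closer match and that a language is kept when its score is within the best score multiplied by the threshold. The scores and the helper name filter_by_threshold are hypothetical, and the module's exact comparison may differ.

```perl
use strict;
use warnings;

# Hypothetical scores; lower is assumed to mean "more similar".
my %score_of_lang = (
    english => 1000,
    scots   => 1040,
    dutch   => 1200,
);

# Keep the languages whose score is within $threshold times the best score.
sub filter_by_threshold {
    my ($scores, $threshold) = @_;
    my @sorted = sort { $scores->{$a} <=> $scores->{$b} || $a cmp $b }
                 keys %$scores;
    return unless @sorted;
    my $best = $scores->{ $sorted[0] };
    return grep { $scores->{$_} <= $best * $threshold } @sorted;
}

my @kept = filter_by_threshold(\%score_of_lang, 1.05);
print join(', ', @kept), "\n";
```

With these made-up numbers the cut-off is about 1050, so english and scots stay and dutch is removed. A larger threshold such as 1.1 keeps more candidates.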

The ``suspect()'' method restricts the candidates to the languages in the given list.

  $guesser->suspect('english', 'japanese-euc_jp');

You can chain these methods as follows.

  @lang_list_sorted_similarity =
       Lingua::LanguageGuesser
           ->guess($textstring)
           ->eliminate()
           ->suspect('english', 'japanese-euc_jp')
           ->result_list();
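
Because the filters are applied in call order, swapping them can change the result. The sketch below imitates the two filters in plain Perl to show why. It is not the module's code: the scores, the helper names eliminate_sketch and suspect_sketch, and the assumption that the best score is taken from the candidates still remaining are all hypothetical.

```perl
use strict;
use warnings;

# Hypothetical scores; lower is assumed to mean "more similar".
my %score = (french => 1000, english => 1100, 'japanese-euc_jp' => 1500);

# Keep candidates within $threshold times the best remaining score.
sub eliminate_sketch {
    my ($threshold, @langs) = @_;
    return unless @langs;
    my ($best) = sort { $a <=> $b } map { $score{$_} } @langs;
    return grep { $score{$_} <= $best * $threshold } @langs;
}

# Keep only the candidates that appear in the wanted list.
sub suspect_sketch {
    my ($wanted, @langs) = @_;
    my %wanted = map { $_ => 1 } @$wanted;
    return grep { $wanted{$_} } @langs;
}

my @all    = sort { $score{$a} <=> $score{$b} } keys %score;
my @wanted = ('english', 'japanese-euc_jp');

my @a = suspect_sketch(\@wanted, eliminate_sketch(1.05, @all));
my @b = eliminate_sketch(1.05, suspect_sketch(\@wanted, @all));

print "eliminate then suspect: @a\n";
print "suspect then eliminate: @b\n";
```

With these numbers, eliminating first leaves only french, which the suspect list then discards, so nothing remains. Suspecting first leaves english and japanese-euc_jp, and the elimination step then keeps english.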

You can create or delete your own language models. If a language with the same name exists in the default language models, your language model takes priority.

The ``create_model()'' function creates or rebuilds a language model. Pass your language model directory path, the source file from which to create the language model, and the language name, as follows.

  use Lingua::LanguageGuesser qw(create_model);
  create_model('./my_language_model_directory', 'source_file', $language_name );

The ``delete_model()'' function deletes a language model. Pass your language model directory path and the name of the language you want to delete.

  use Lingua::LanguageGuesser qw(delete_model);
  delete_model('./my_language_model_directory', $language_name);

The ``list_my_models()'' function returns the list of your own language models.

  use Lingua::LanguageGuesser qw(list_my_models);
  my @mymodel = list_my_models('./my_language_model_directory');

SUPPORTED LANGUAGES (non-utf8 mode)

afrikaans
albanian
amharic-utf
arabic-iso8859_6
arabic-windows1256
armenian
basque
belarus-windows1251
bosnian
breton
bulgarian-iso8859_5
catalan
chinese-big5
chinese-gb2312
croatian-ascii
czech-iso8859_2
danish
dutch
english
esperanto
estonian
finnish
french
frisian
georgian
german
greek-iso8859-7
hebrew-iso8859_8
hindi
hungarian
icelandic
indonesian
irish
italian
japanese-euc_jp
japanese-shift_jis
korean
latin
latvian
lithuanian
malay
manx
marathi
middle_frisian
mingo
nepali
norwegian
persian
polish
portuguese
quechua
romanian
rumantsch
russian-iso8859_5
russian-koi8_r
russian-windows1251
sanskrit
scots
scots_gaelic
serbian-ascii
slovak-ascii
slovak-windows1250
slovenian-ascii
slovenian-iso8859_2
spanish
swahili
swedish
tagalog
tamil
thai
turkish
ukrainian-koi8_u
vietnamese
welsh
yiddish-utf

SUPPORTED LANGUAGES (utf8 mode)

amharic-utf
basque
bosnian
chinese_simple-utf8
croatian-ascii
english
finnish-utf8
french-utf8
german-utf8
indonesian
italian-utf8
japanese-utf8
latin
malay
manx
norwegian-utf8
romanian
russian-iso8859_5
sanskrit
scots
serbian-ascii
slovak-ascii
slovenian-ascii
spanish-utf8
swahili
swedish-utf8
tagalog
welsh
yiddish-utf

EXPORT

The module exports nothing by default. Only ``create_model'', ``delete_model'' and ``list_my_models'' can be exported on demand, as in

  use Lingua::LanguageGuesser qw(create_model delete_model list_my_models);


SEE ALSO

Other language guessers on CPAN

  Lingua::Identify
  Language::Guess
  Text::Language::Guess
  Text::Ngram::LanguageDetermine

``TextCat'' information

  http://www.let.rug.nl/~vannoord/TextCat/

Our ``Gensen'' project home page

  http://gensen.dl.itc.u-tokyo.ac.jp/


AUTHOR

Akira Maeda <maeda@lib.u-tokyo.ac.jp>


COPYRIGHT AND LICENSE

Copyright (C) 2006 by Akira Maeda (maeda@lib.u-tokyo.ac.jp). The original source code of ``TextCat'' was written by Gertjan van Noord.

This library is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License (GPL) as published by the Free Software Foundation (http://www.fsf.org/); either version 2 of the License, or (at your option) any later version.