Update

2025-04-10 12:49:05 +02:00
parent 4aff98e77b
commit 1cdd00920f
9151 changed files with 1800913 additions and 0 deletions
@@ -0,0 +1,17 @@
+Library of statistical profiles for language recognition
+--------------------------------------------------------
+
+The sample texts for different languages have been taken from:
+Perl module: Lingua::LanguageGuesser - http://gensen.dl.itc.u-tokyo.ac.jp/LanguageGuesser/LanguageGuesser_demo.html
+Statistical Text Analysis - http://boxoffice.ch/pseudo/
+Some random sample texts have been taken from Wikipedia - http://wikipedia.org/
+
+All the sample texts should be UTF-8 encoded!
+
+To understand how language recognition works, you need to read the following remarkable work:
+W. B. Cavnar and J. M. Trenkle. N-gram-based text categorization. Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval, 1994.
+http://citeseer.ist.psu.edu/cache/papers/cs/810/http:zSzzSzwww.info.unicaen.frzSz~giguetzSzclassifzSzcavnar_trenkle_ngram.pdf/n-gram-based-text.pdf
+
+License: GNU General Public License, version 3, as published by the Free Software Foundation (http://www.fsf.org/).
+Assembled by Ivan Tcholakov, <ivantcholakov@gmail.com>
+November, 2009