Short:        DBACL - digramic Bayesian classifier
Author:       Laird A. Breyer
Uploader:     Diego Casorran
Type:         text/misc
Version:      1.3
Architecture: m68k-amigaos

PURPOSE

dbacl is a command line program which can be used to categorize several
types of text documents. Each document category is constructed as a maximum
entropy language model, with respect to a reference measure based on digrams
(character pairs).

Before recognition can take place, a number of text corpora must be
"learned". For example, an English category could be based on a text file
containing the collected works of Shakespeare. The Gutenberg project
(http://promo.net/pg/) makes freely available many public domain works in
electronic form.

After learning, any number of text files can be compared, in terms of
Bayesian posterior probabilities, with up to 128 learned categories. The
actual number of categories is limited only by available memory.

dbacl is bundled with a few other utilities:

- bayesol is a postprocessor which takes the dbacl output and computes an
  optimal decision based on costs of misclassification. Together with dbacl,
  this allows the construction of sophisticated, multilingual classification
  scripts, if you're not afraid of some shell scripting.

- mailcross performs email classification cross validation. It can be used
  to assess the performance of custom email classification scripts based on
  dbacl and bayesol.

- mailinspect reads an mbox style mail folder and displays the emails in
  sorted order, based on similarity to any given category.

DOCUMENTATION

See the bundled manpage. Generic instructions can be found in the file
INSTALL. A tutorial is to be found in the file tutorial.html, and an
exposition of the algorithms is in dbacl.ps.

LICENSE

DBACL is distributed under the terms of the GNU General Public License (GPL)
which can be found in the file COPYING. The hash function code used in the
file jenkins.c is public domain, by Bob Jenkins.

BUILDING

There are several configuration options you can change in the file dbacl.h,
if you want to increase the maximum number of categories or optimize hash
table overhead.

To build and install the program, you can execute the following steps from
within the source DBACL directory:

    ./configure
    make
    make install

The last part should be executed with superuser privileges for system wide
installation. Alternatively,

    ./configure --prefix=/home/xyzzy
    make
    make install

builds and installs in user xyzzy's home directory, without the need for
root privileges. In this case, the following environment variables should
be set permanently (e.g. in the file .profile):

    PATH=$PATH:/home/xyzzy/bin
    MANPATH=$MANPATH:/home/xyzzy/man

INTERNATIONALIZATION

dbacl uses the current locale for processing. 8-bit clean multibyte
character sets (such as UTF-8) are supported in the default mode, and
arbitrary multibyte character sets require the -i command line option.

If you intend to use the -i option together with regular expressions, you
must build with a wide character POSIX regex library: ensure that the BOOST
library is present on the system and type

    ./configure WIDE_REGEX=1
    make
    make install

Warning: there is a large performance penalty if you build dbacl this way,
which shows up whenever you use regular expressions. Only build this way if
you need correct regular expressions in a multibyte environment which isn't
8-bit clean.

OTHER DEPENDENCIES

The main filter programs dbacl and bayesol have no special dependencies, and
can always be compiled.

mailinspect uses the readline and slang libraries for screen management in
interactive mode. The configure script will check for these libraries and if
it can't find them, mailinspect will be compiled without interactive
support.

mailcross is a bash shell script which calls awk and formail at various
points. It will test for the existence of these programs in your path and
refuse to run if it can't find them.
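
If you write your own scripts around dbacl and bayesol, the same kind of
dependency test can be done up front. The following is only a minimal sketch
of the idea, not the code used by mailcross; the function names
missing_tools and require_tools are made up for this illustration.

```shell
# Illustrative only: these helper names are hypothetical and are not part
# of dbacl or mailcross.

# Print the name of every listed tool that cannot be found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Return non-zero (after reporting on stderr) if any listed tool is missing.
require_tools() {
    missing=$(missing_tools "$@")
    if [ -n "$missing" ]; then
        # $missing is left unquoted on purpose so the names print on one line.
        echo "error: required programs not found in PATH:" $missing >&2
        return 1
    fi
}

# Possible use at the top of a script:
#   require_tools awk formail || exit 1
```

Checking everything before doing any work means a missing awk or formail is
reported once, at start-up, rather than halfway through a long run.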

RUNNING

There is a tutorial which you can read with any web browser by pointing it
to the file tutorial.html. For command line options and examples of possible
use, type after installation:

    man dbacl
    man bayesol
    man mailcross
    man mailinspect

You can also find a technical description of the algorithms and statistics
in the postscript file dbacl.ps.

TUTORIAL SAMPLES

The tutorial.html document comes with several sample text files:

- sample1.txt and sample4.txt are extracts from Mark Twain, Huckleberry Finn

- sample2.txt, sample3.txt, sample5.txt are extracts from Douglas Adams,
  The Hitchhiker's Guide to the Galaxy

AUTHOR

Laird A. Breyer

Latest update of this package can be found at http://amiga.sourceforge.net/

ARCHIVE CONTENTS

LhA Freeware Version 2.2
Copyright © 1991-94 by Stefan Boberg.
Copyright © 1998-2000 by Jim Cooper and David Tritscher.

Listing of archive 'dbacl-1.3.lha':

Original  Packed Ratio    Date     Time    Name
-------- ------- ----- --------- --------  -------------
   77560   37453 51.7% 15-Jan-03 21:19:12  +bayesol.040
  162288   60196 62.9% 15-Jan-03 21:17:22  +dbacl.040
    1566    1023 34.6% 05-Dec-02 01:34:44  +japanese.txt
    5977    2189 63.3% 15-Dec-02 06:51:14  +mailcross
    5757    2368 58.8% 29-Dec-02 01:32:38  +mailcross.1
  228180   86699 62.0% 15-Jan-03 21:07:48  +mailinspect
  226292   85822 62.0% 15-Jan-03 21:27:36  +mailinspect.040
    6199    2593 58.1% 29-Dec-02 01:36:30  +mailinspect.1
       0       0  0.0% 21-Nov-02 11:36:26  +NEWS
     168     124 26.1% 07-Dec-02 13:08:36  +prop.pl
    4525    2150 52.4% 29-Dec-02 01:25:44  +README
    3318    1674 49.5% 06-Dec-02 00:05:42  +sample1.txt
    2605    1318 49.4% 06-Dec-02 00:02:34  +sample2.txt
    3073    1535 50.0% 05-Dec-02 23:57:12  +sample3.txt
    3283    1653 49.6% 06-Dec-02 00:53:20  +sample4.txt
    3757    1851 50.7% 08-Dec-02 08:38:06  +sample5.txt
    4055    1869 53.9% 08-Dec-02 05:15:14  +sample6.txt
     136      96 29.4% 06-Dec-02 10:52:08  +toy.risk
   29582   10326 65.0% 15-Dec-02 07:01:20  +tutorial.html
    3274    1580 51.7% 12-Aug-02 04:16:10  +ylwrap
      31      31  0.0% 17-Oct-02 13:41:40  +AUTHORS
   77816   37613 51.6% 15-Jan-03 21:02:54  +bayesol
    4202    1844 56.1% 29-Dec-02 01:32:06  +bayesol.1
    1267     655 48.3% 29-Dec-02 01:26:08  +ChangeLog
   17992    7014 61.0% 12-Aug-02 04:16:10  +COPYING
  161124   59441 63.1% 15-Jan-03 20:59:38  +dbacl
   14851    5694 61.6% 29-Dec-02 01:31:36  +dbacl.1
  435463  182427 58.1% 29-Nov-02 01:13:30  +dbacl.ps
     318     166 47.7% 08-Dec-02 09:06:46  +example1.risk
     452     236 47.7% 08-Dec-02 09:06:46  +example2.risk
     492     258 47.5% 08-Dec-02 09:06:46  +example3.risk
-------- ------- ----- --------- --------
 1485603  597898 59.7%

Operation successful.

Readme created with: MRea

==============================================================================
Some additional info about this archive:

Source:   http://prdownloads.sf.net/amiga/dbacl-1.3.lha?download
FileSize: 599252 Bytes
CRC:      EBCC5E0C
MD5:      7D06389E578478190ECF577E3B6F7F1E
SHA:      17F53C8D799561B112250241692A392E72135851
==============================================================================
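
The MD5 value listed above can be compared against a downloaded copy of the
archive. The following is a rough sketch only: it assumes the md5sum program
from GNU coreutils is installed, and the function name verify_md5 is invented
for this example.

```shell
# Illustrative only: verify_md5 is a hypothetical helper, and md5sum
# (GNU coreutils) is assumed to be available on the system.

# Succeed only if the MD5 digest of FILE equals EXPECTED.
# EXPECTED may be given in upper or lower case hex.
verify_md5() {
    file=$1
    expected=$(printf '%s' "$2" | tr 'A-F' 'a-f')
    actual=$(md5sum "$file" | cut -d' ' -f1)
    [ "$actual" = "$expected" ]
}

# Possible use:
#   verify_md5 dbacl-1.3.lha 7D06389E578478190ECF577E3B6F7F1E \
#       && echo "MD5 matches"
```

A mismatch usually means an incomplete or altered download, in which case
the archive should be fetched again.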