******************************************************************************
*                                                                            *
*                                                                            *
*              Eichstaett-to-Beta-Code-Text-Conversion-Program               *
*                                                                            *
*                         (c) B.Meissner, 1992-2010                          *
*                                                                            *
******************************************************************************

Preface: New Version
====================

mc has been ported to C-SNOBOL (Phil Budne), because of this system's wide
acceptance. mc has been augmented to include all the extended codes which
have been introduced into the Eichstaett text files, long line conventions
and other oddities.

All I/O routines in mc have been changed from SPITBOL conventions to
SNOBOL4. It would be easy to revert this at some stage (or introduce
system-aware alternatives as we did in betautf8.sno, which is completely
transparent to all variants of the SNOBOL4 language), but given the rarity
with which mc is probably run, we did not want to invest time in this. Thus
you should use

   snobol4 -b -d 4096k -S 500k -P 500k ./mc.sno

instead of mc.exe. Snobol4 can be retrieved in source format from:

   http://www.snobol4.org/csnobol4/curr

We provide a couple of SNOBOL4 versions in the snobol4.dir directory. There
are calling examples in snobol4.bat, mc.bat and mc.

MC and Errors in the Text Files
===============================

mc cannot, however, correct errors which are present in the Eichstaett text
files themselves. In the most recent versions (2008 and 2009/2010), there
are three such errors which should be corrected at some stage:

1) ILS_III2 contains erroneous lines and line lengths in text no. 9259a.
   These errors make the text completely incomprehensible, even to the
   ConcEyst program itself. The ConcEyst program's rendering of the text is
   documented in the screenshot "fehler.png" and shows that something must
   have gone very wrong with the codes. In beta code, this text looks like:

98.9259a.1.1.1.1.1 $:E)\A(|U)\& undoc. `0 $*SHA)=*|T& undoc. `0 $*R&r[1centuria]1$I)HU)\;[1centuria]1$*T& undoc. 
`0 $*E&r$E)\A(| H&[1milliaria]1$*T:&[1milliaria]1$*A*E&g$*SI*)T& undoc. `0 $*S*T&[1milliaria]1O$*T:&[1milliaria]1$*AHA*(| T&[1milliaria]1x$*E&[1milliaria]1$U)\E)\A*(|R&dSVO AT EXORNATIONEM$A)=| &BALINEI$H\|&DONO DEDIT

   Note that mc inserts an " undoc. " tag, followed by the numerical value
   of the code in error (in this case: zero=0), to make identification of
   such errors easier.

2) SEG_52 looks as if there were an error after inscription number 1021,
   since ConcEyst does not display any inscriptions after that. The reason
   for this seems to be a weird ordering of the texts in the .NDX file for
   this file: The texts are not ordered according to their numbers, but
   according to their respective dates. This inconsistency is corrected
   on-the-fly by mc, which sorts the texts according to their numbers. mc
   uses a complicated queue structure, a multiply linked list, to
   re-arrange the texts.

3) There are a couple of files which are referred to differently internally
   and externally by ConcEyst. These are the McCrum, Meiggs&Lewis and
   Rhodes_Osborne files which have differing uppercase/lowercase renderings
   of their file names in the corresponding table files. On case-aware
   operating systems (all modern systems are!) this may lead to errors. A
   small unix shell script ("rename_files") provides a remedy by externally
   linking the files to symbolic names which more precisely correspond to
   the names these files have in the internal data bases.

Thus, mc can be used not only to make the ConcEyst text files accessible to
beta-code-aware programs (like V&F), but also to check and debug the
ConcEyst text files themselves.

Purpose
=======

The Catholic University of Eichstaett, Germany (Prof. Dr. J.Malitz)
provides a set of Greek and Latin epigraphical corpora on disk. They come
with concordance routines and browser facilities, but the files use a
special format. To adapt them to the V&F program for beta-coded texts and
gain more liberty in using the texts, a conversion routine is therefore
necessary.
For this purpose, Burkhard Meissner (University of Halle) has written the
MC conversion program. MC converts the Eichstaett files into beta code
format to be read by Burkhard Meissner's V&F program. Other programs which
use beta-coded text files (as distributed by the Thesaurus Linguae Graecae
and the Packard Humanities Institute) might also be used to view and search
the resulting text files, but the program has been specifically designed
for the V&F environment.

Since there is no printed documentation for MC.EXE, you should print out
this file on your standard line printer to get a reading copy:

   COPY MC.DOC LPT1

Legal Conditions
================

MC.EXE forms an integral part of V&F. All legal conditions that apply to
the other software items of the V&F package apply to MC.EXE, too:

1) Conditions of Use

MC is a program that its author, Burkhard Meissner, has written privately
for his own purposes. However, the author of the MC software grants the
right to use that software to any legitimate user of MC, under the
following terms and conditions:

A legitimate user is defined as any natural or legal person that is already
a legitimate user of the V&F software. This entails that the user may have
paid the license fee of 1000 DM (institutions) or 800 DM (individuals) to
Burkhard Meissner for the use of the V&F software or obtained the software
by downloading from the Ancient History home page at the
Helmut-Schmidt-University web site.

The legitimate user does not obtain the software as his property; instead,
all rights of copying and distributing the software remain with the author.
However, the legitimate user acquires the right to use the software on any
single machine at a time and to make as many backup copies for his own
purposes as he thinks appropriate.

The legitimate user being a non-private institutional (i.e.: legal) person,
it may use the software on any machine that is physically contained within
the institution's building.
Therefore, the right to use the MC software is transferred under the
condition that the aforementioned fee is paid to the author, and the right
to use the MC software is transferred only to the person that has paid the
fee. Using MC in a networking environment or on different machines within
one physically coherent institution does conform to these conditions of
legitimate use of MC; copying and distributing MC to more than one separate
machine to use the software on more than one machine by different persons
at the same time does not.

The author of the MC software, Burkhard Meissner, holds and will hold the
copyright of the software, and he is its material, intellectual and
spiritual owner. This entails that users have to give due credit to the
author and his MC software when they publish results produced from the use
of MC. Therefore, the right to use MC is transferred under the explicit
condition that the user mentions the MC software and its author in any
publication that results from serious use of the MC software.

Any use of the MC software that is not rendered legitimate according to
these conditions is considered illegitimate and illegal.

2) Warranty Conditions

a) Limited Warranty

The MC software is provided "as is" without warranty of any kind, either
expressed or implied, including, but not limited to the implied warranties
of merchantability and fitness for a particular purpose. The entire risk as
to the quality and performance of the program is with the user. Should the
program prove defective, the user assumes the entire cost of all necessary
servicing, repair or correction.

The author, Burkhard Meissner, does not warrant that the functions
contained in the program will meet the user's requirements or that the
operation of the program will be uninterrupted or error-free.
However, he warrants the diskette(s) on which the program is furnished, if
it is so furnished, to be free from defects in materials and workmanship
under normal use for a period of ninety (90) days from the date of delivery
to the user.

b) Limitations of Remedies

The author's entire liability and the user's exclusive remedy shall be:

1) The replacement of any diskette not meeting our "Limited Warranty" and
   which is returned to the author together with a copy of the user's
   receipt, or

2) if the author, Burkhard Meissner, is unable to deliver a replacement
   diskette which is free of defects in materials or workmanship, the user
   may terminate the Agreement between author and user by returning the
   software, upon which his money shall be refunded.

In no event will the author, Burkhard Meissner, be liable to the user for
any damages, including any lost profits, lost savings or other incidental
or consequential damages arising out of the use or inability to use the
program even if the author has been advised of the possibility of such
damages, or for any claim by any other party.

Using MC
========

To use this program, you should know something about the organization of
the Eichstaett text files: Each text corpus is organized around a corpus
(.CPS) file which contains pointers to a set of index (.NDX) and text
(.TXT) files. In order to collect all texts of such a corpus into one beta
code file, one has to resolve the file pointers in the corpus file and the
pointers to the single texts in the .NDX files.

Our program does all this for single corpora. This means: You must re-run
it for each .CPS file to obtain a full conversion of all existing text
files. It assumes that the .TXT, .NDX and .CPS files which belong together
are located in the present directory and writes its output into the same
directory. It automatically creates AUTHTAB.DIR, the beta code author list.

Suppose you have DFG.CPS with its adjacent files.
To create a beta code file set CV_001, start our conversion program MC:

   MC

Then enter the name of the input file (e.g.: DFG) and the output file
(e.g.: CV_001). For Latin texts, enter Y at the third question. The Y
overrides the default Greek font of the Eichstaett files. This is necessary
with texts which are mainly in Latin. As a new feature, you can then enter
a string of up to six reference level names which will be used instead of
the default strings (Inscription.....line). E.g.:

   Inscription....section line

Confirm with Y, or else re-enter all information. The program will create
CV_001.txt, CV_001.idt and AUTHTAB.dir. Be sure there are no files with the
same names which you do not want to be overwritten.

An ongoing MC conversion process can always be interrupted using the
Ctrl-Break key. In this case you should allow the MC.EXE virtual memory
manager some time to restore the memory state to DOS standards, close the
swap file and close the files written so far.

When the process finishes, you can start V&F, make the directory where the
new files are your CDdrive directory (Options menu, number 6; do not forget
the back slash), Re-Read the author list and work on this file. To append
the new files to existing sets of beta code texts, you may use our AUTHCONV
utility with the /A switch and the DOS APPEND command (see below). There
are no restrictions as to the use of the resulting files; they can be read,
searched and indexed like any other TLG or PHI beta code text.

The program sorts the texts by names. It uses a special algorithm which
compares the text names with respect to the positional value of their first
numeric component, rather than comparing the text names purely
alphabetically. Thus it is ensured that 400a sorts after 400 but before
4000. A text name like 2347-9 sorts after 2347 but before 2349. The
complicated string and list conversion operations during the sorting
process make it quite slow with large text collections.
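
The ordering rule just described can be illustrated with a short Python
sketch. This is not MC's SNOBOL4 code and makes no claim about how MC
implements the comparison; the function name text_name_key and the handling
of names without any digits are assumptions made for the illustration only.

```python
import re

def text_name_key(name):
    """Hypothetical sort key: order by the numeric value of the first
    run of digits, then by whatever follows that number."""
    match = re.search(r"\d+", name)
    if match is None:
        # Assumption: names without digits are compared as plain strings.
        return (name, -1, "")
    prefix = name[:match.start()]      # text before the first number
    number = int(match.group())        # positional (numeric) value
    suffix = name[match.end():]        # e.g. "a" in 400a, "-9" in 2347-9
    return (prefix, number, suffix)

names = ["4000", "400a", "2349", "400", "2347-9", "2347"]
print(sorted(names, key=text_name_key))
# prints ['400', '400a', '2347', '2347-9', '2349', '4000']
```

A plain alphabetical sort would place 2347 before 400; comparing the
numeric value first avoids this, and the suffix only breaks ties between
names that share the same number.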
In addition, the memory requirements are gigantic. There is no way of
running it on a 1 MByte machine; a machine with only 2 or 3 MBytes may run
into serious difficulties with excessive swapping. For practical
applications, 4 MBytes seem the minimum.

MC - An Example Session
=======================

Suppose you have a set of Eichstaett files. These come as text files
(.TXT), index files (.NDX) and corpus files (.CPS). The dictionary files
(.DCT and the like) are of no interest to the conversion process and are
not needed for MC (but you must not delete them: They are necessary for the
Eichstaett concordance routines themselves).

You should copy all these files into one sub directory (for security
reasons, you should perform the conversions using copies of the files, and
not the original ones). Now you should determine how many and which .CPS
files there are:

   DIR *.CPS

For security reasons, create a sub directory for each single .CPS file with
its distinct name (but empty as yet). Suppose you have .CPS files

   ann_epig.cps       73 14.12.91  11:09
   dfg     .cps      109 10.09.92  16:54
   germania.cps      118 22.01.88  17:04
   ils     .cps       82 23.04.91  14:56

Now create sub directories:

   MD ann_epig
   MD dfg
   MD germania
   MD ils

Now start the conversion process. Call MC from the DOS prompt and enter
"ann_epig" when being asked for a .CPS file name. Enter a distinct name for
the output files when asked (e.g.: "cv_001"). You should know whether most
of the texts are in Greek or in Latin. If they are in Greek, you should
simply press <ENTER> when being asked whether the texts should be forced to
have Latin default font. If most of them are in Latin, enter "Y" or "y" as
the answer to this third question. This makes the program enter the "l"
language code in the resulting files instead of "g". This, in turn, allows
for far fewer language codes within the texts themselves, with the
consequence of making them more compact and all V&F processes upon them
faster.
To start the process, enter "Y" or "y" at the fourth question, or any other
letter if you want to make corrections. In the latter case, you must start
MC anew.

In the above example, MC converts the texts and writes three files:

   CV_001.TXT
   CV_001.IDT
   AUTHTAB.DIR

For security reasons, you should copy (or move, if you have a directory
entry mover like the UNIX mv) these files to your ANN_EPIG sub directory:

   COPY CV_001.TXT ANN_EPIG
   COPY CV_001.IDT ANN_EPIG
   COPY AUTHTAB.DIR ANN_EPIG

The same processes should be performed for all four .CPS files separately.
Of course, you should select different file names for the output files
(e.g.: CV_002 CV_003 CV_004 and so forth).

After the last conversion has been performed (CV_004 being the last output
in our above example), you should have all resulting files in the sub
directories (four in our case). Now copy the files of the first sub
directory into your present directory:

   COPY ANN_EPIG\*.*

Now you have AUTHTAB.DIR (with one entry) in your present directory. Call
AUTHCONV with this file specified AS THE OUTPUT FILE and the single
AUTHTAB.DIR files within the other sub directories as INPUT FILES:

   AUTHCONV c:\DFG\AUTHTAB.DIR AUTHTAB.DIR /A
   AUTHCONV c:\GERMANIA\AUTHTAB.DIR AUTHTAB.DIR /A
   AUTHCONV c:\ILS\AUTHTAB.DIR AUTHTAB.DIR /A

Finally, you should move or copy all .TXT and .IDT files as well as the
AUTHTAB.DIR file to one separate sub directory where they can be accessed
by V&F:

   MD D:\MALITZ
   XCOPY ANN_EPIG\*.IDT D:\MALITZ
   XCOPY ANN_EPIG\*.TXT D:\MALITZ
   XCOPY DFG\*.IDT D:\MALITZ
   XCOPY DFG\*.TXT D:\MALITZ
   XCOPY GERMANIA\*.IDT D:\MALITZ
   XCOPY GERMANIA\*.TXT D:\MALITZ
   XCOPY ILS\*.IDT D:\MALITZ
   XCOPY ILS\*.TXT D:\MALITZ
   COPY AUTHTAB.DIR D:\MALITZ

The old files in the four sub directories can be deleted:

   DEL ANN_EPIG
   DEL ILS
   DEL DFG
   DEL GERMANIA
   DEL CV_*.*
   DEL AUTHTAB.DIR

If you want to have these files available together with one of your CD-ROM
texts, you should use the DOS command APPEND. Insert your CD into the CD
drive.
Let us suppose the drive is G:. Change to the drive where your new
inscription files are located:

   CD C:\MALITZ

Now create a common AUTHTAB.DIR for both sets of texts. Keep the old ones
for reference:

   AUTHCONV G:\AUTHTAB.DIR AUTHTAB.DIR /A

Now use the DOS command APPEND to make both directories accessible:

   APPEND G:\

You can append multiple directories, separated by ";". In your V&F profile
PROFILE.V&F you should make 'C:\MALITZ\' your CDdrive text file directory.
In this case you will have both sets of texts available at one time. If you
insert the DOS APPEND command in your V&F.BAT file to start the program,
you can automate the process for the future. To revoke the APPEND
assignments, give the command with only one ";":

   APPEND ;

This should be given in your V&F.BAT batch file before returning you to the
DOS prompt. Using this strategy, we have connected all TLG and PHI texts
with the Malitz texts at Erlangen in one large text file inventory.

MC - Technical Details and Possible Problems
============================================

MC uses complicated lists with a multiple-pointer structure to store work
lists, text lists and single texts in memory. To sort the texts, these
structures must temporarily be converted to linear arrays to be sorted and
re-created afterwards. This uses large amounts of memory and takes much
time. Once the text lists are sorted and the conversion process is under
way (i.e. when the CONVERTING ... lines flicker on the screen), your
computer's memory is enough for the process. If, however, you get the
SYSTEM ERROR #204 message, informing you of insufficient memory, your
present configuration cannot be used for the job. In this case, you should
use MODXCONF as described in the V&F documentation. Just apply it to MC
(MODXCONF MC.EXE) and follow the menu instructions. Increase the region
size and/or virtual memory size.

The swapping process of the virtual memory manager uses large amounts of
time and hard disk space.
Therefore we have decided to set the MC.EXE region size to only 3 MBytes.
This should be enough for most processes. As yet we have found that all
conversion processes could be fitted in a memory region that was larger
than 1456 KBytes and smaller than 1584 KBytes. With MC.EXE, you have twice
that amount of free space which should be enough by all standards.

Acknowledgements
================

The development of MC has been made possible by Prof. Dr. Juergen Malitz,
Eichstaett Catholic University, who disclosed his file formats and provided
ample documentation on the internal structure of his texts and of the
programs to handle them. He also invested much time in beta-testing MC.
Some errors have been found during this activity, and the program owes many
features to Professor Malitz' suggestions. Professor Malitz' work on the
computerization of Greek inscriptions of Asia is to be commended for the
positive results it has yielded.

Mark Emmer of Catspaw, Inc. has been exceptionally cooperative in
responding to demands for improving his already excellent programming tools
(SPITBOL-386, SPITBOL-8088).

MC Application Error Messages and Numbers
=========================================

    1  INPUT .CPS FILE CANNOT BE OPENED
    2  OUTPUT FILE CANNOT BE OPENED
    3  INPUT .CPS FILE DOES NOT HAVE CORRECT FORMAT
    4  INPUT .CPS FILE CORRUPT: NON-MATCHING NUMBER OF ENTRIES
    5  INPUT .NDX FILE CANNOT BE OPENED
    6  INPUT .NDX FILE CORRUPT: ERRONEOUS FILE FORMAT
    7  INPUT .NDX FILE CORRUPT: NON-MATCHING NUMBER OF ENTRIES
    8  INPUT .TXT FILE CANNOT BE OPENED
    9  INPUT .TXT FILE CORRUPT: TEXT OFFSET CANNOT BE REACHED
   10  INPUT .TXT FILE CORRUPT: MISSING END MARKER (FFh)
   11  INPUT .TXT FILE CORRUPT: ERRONEOUS LINE LENGTH
   12  OUTPUT .TXT FILE CANNOT BE WRITTEN TO
   13  OUTPUT .NDX FILE CANNOT BE WRITTEN TO
   14  AUTHTAB.DIR CANNOT BE OPENED FOR WRITING
   15  CONFLICTING I/O FILE NAMES

Technical Details
=================

MC used to be built with the INTEL DOS extender (DPMI-compatible).
This meant that it could be run on 32-bit DOS machines only, and not under
DESQView. Since MC uses large amounts of memory, this seemed reasonable.
Nowadays, machines are much faster and much larger: An interpreter
implementation is feasible under these circumstances.