Slovo Project: Towards a Digital Library of South Slavic Manuscripts

Slavic Manuscripts in Electronic Form (the Repertorium Initiative and the Slovo Project)

Anissava Miltenova (Institute of Literature, Bulgarian Academy of Sciences)

The application of computer technologies to store, publish and, most importantly, investigate written sources is among the most promising tasks at the boundary between the technical sciences and the humanities. The Repertorium Initiative was founded in 1994 at the Department of Old Bulgarian Literature of the Institute of Literature of the Bulgarian Academy of Sciences in collaboration with the University of Pittsburgh (US). The Repertorium is a universal database that incorporates archeographic, paleographic, codicological, textological, and literary-historical data concerning the original and translated medieval texts distributed through Slavic manuscripts between the eleventh and the seventeenth centuries. These data include both parts of actual texts and the results of their scientific investigation, with particular attention to the study of manuscript typology, a traditional aspect of philological scholarship that has been reinvigorated by the introduction, through the Repertorium Initiative, of computational methodologies.

The Bulgarian-American project "Computer Supported Processing of Old Slavic Manuscripts" began in 1994, sponsored by IREX, Washington (1994–1995). A new type of software was built, based on SGML (Standard Generalized Markup Language) in its TEI (Text Encoding Initiative) implementation. The goal of the project was to create a sophisticated system for processing Slavonic manuscripts in a universal format suitable for multiple uses.

The system for computer-based analytical description of medieval Slavic manuscripts at the level of modern archeography, palaeography, codicology and textology (hereafter TSM, Template for Slavonic Manuscripts) was developed through the teamwork of David Birnbaum; Beirend van Dijk, who was then a post-graduate student in Groningen (The Netherlands); Milena Dobreva of the Institute of Mathematics and Computing in BAS; and Harry Gaylord, who taught computer systems in Groningen. The description used here is specifically intended for developing a Repertory of Old Bulgarian literature and letters and is adapted for medieval Slavic texts. The fonts for writing the original texts in medieval Cyrillic were developed by research associate Rumyan Lazov of the Institute of Mathematics and Computing in BAS. The search program at this stage of the project was created by Stanimir Velev.

The move from a database framework to SGML marked a significant reorientation in the conceptualisation of computer-assisted manuscript description. More importantly, our SGML-based undertaking was not oriented solely toward preparing manuscript descriptions suitable for printing, electronic rendering, and searching, as was the case with the database approach. Rather, we anticipated even at that stage that the manuscript description files would be suitable for direct analysis, so that we would be able, for example, to identify patterns of structural similarity within a corpus of manuscripts on the basis of the same raw data files that we would also use to generate traditional printed manuscript descriptions.
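To make the idea of direct analysis concrete, the following minimal sketch compares manuscripts by the texts they contain. It is an illustration only, not the project's actual method: the manuscript identifiers, the text labels, and the choice of the Jaccard coefficient as a similarity measure are all assumptions introduced here.

```python
from itertools import combinations

# Hypothetical data: each manuscript is reduced to the set of texts it contains.
# In practice these sets would be extracted from the description files themselves.
corpus = {
    "MS_A": {"text1", "text2", "text3", "text4"},
    "MS_B": {"text2", "text3", "text4", "text5"},
    "MS_C": {"text6", "text7"},
}


def jaccard(a, b):
    """Share of texts common to both manuscripts among all texts found in either."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def rank_pairs(corpus):
    """Return all manuscript pairs, sorted from most to least similar."""
    pairs = [
        (jaccard(corpus[x], corpus[y]), x, y)
        for x, y in combinations(sorted(corpus), 2)
    ]
    return sorted(pairs, key=lambda p: (-p[0], p[1], p[2]))


ranking = rank_pairs(corpus)
for score, x, y in ranking:
    print(f"{x} ~ {y}: {score:.2f}")
```

Pairs with a high score share much of their content and are candidates for closer typological comparison; a set-based measure ignores the order of texts, which a fuller analysis would probably also take into account.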

The team has followed five main principles, formulated by David J. Birnbaum (see http://www.slavic.pitt.edu/~djb/): 1. Standardization of document file formats; 2. Multiple use (data should be separated from processing); 3. Portability of electronic texts (independence from local platforms); 4. The need to preserve manuscripts in electronic form; 5. A well-structured division of data according to the Slavic traditions of codicology, orthography, paleography, textology, etc.

During the period from 1996 through 1999, a team of scholars under my supervision, based primarily at the Institute of Literature at the Bulgarian Academy of Sciences, produced SGML descriptions of some 200 medieval Slavic manuscripts of all types. They were processed using the TSM system in an SGML environment, with the Author/Editor (A/E; SoftQuad, Canada) software package as the interface.

At the same time, the Institute of Literature entered into a project with Ralph Cleminson at the Central European University entitled “Computer-Supported Processing of Slavonic Manuscripts and Early Printed Books,” which led to the encoding of additional manuscript descriptions and the publication of several articles addressing the technology underlying the project. Ralph Cleminson, David Birnbaum, and others presented the results of their research at the Twelfth International Congress of Slavists in Kraków in 1998, where the International Committee of Slavists established a Special Commission to the Executive Council of the Committee for the Computer-Supported Processing of Slavic Manuscripts and Early Printed Books, with David Birnbaum, Ralph Cleminson, Andrej Bojadžiev, and Anissava Miltenova as officers. The Commission’s authorization was renewed at the Thirteenth International Congress of Slavists in Ljubljana in 2003.

The number of manuscripts described at this stage increased to three hundred. Members of the team were: Anna Stojkova, Nina Georgieva, Elena Tomova, Adelina Angusheva, Andrej Bojadžiev, Margaret Dimitrova, Dimitrinka Dimitrova, and Diljana Radoslavova. The book “Medieval Slavic Manuscripts and SGML: Problems and Perspectives” (Sofia, 2000) was sponsored by IREX and the Central European University. The articles in the book not only bring into scholarly circulation the results achieved from the analysis of the manuscripts, but also identify the problems that remain to be solved.

In recognition of the ground-breaking achievements of the Repertorium Initiative, its directors and principal researchers were appointed in 1998 by the International Committee of Slavists (the most important such international association) to head a special Commission for the Computer Processing of Slavic Manuscripts and Early Printed Books. Other evidence of the achievements of this project includes the organization of three international conferences (Blagoevgrad 1994, Pomorie 2002, Sofia 2005) and the publication by the Bulgarian Academy of Sciences of three anthologies (1995, 2000, 2003).

A current continuation of the original project, “Electronic description and edition of Slavic sources” (2002–2003, sponsored by UNESCO), is in a transitional stage of migrating from SGML to XML technology. In 1994–1995, when the SGML DTD for the project was first constructed, the Extensible Markup Language had not yet been conceived. Since then electronic and web technologies have changed very rapidly, and we now have tools that are very convenient for directly browsing and editing markup files. Direct access to XML documents from such popular browsers as Internet Explorer, Opera, or those powered by the Gecko engine, such as Mozilla, Doczilla, and Netscape, provides more control and efficiency. This fact, together with the development of special recommendations for markup languages produced under the auspices of the World Wide Web Consortium (W3C), Unicode, and other institutions and international initiatives, has led to a rapid growth of academic applications based on XML technology. This stage is thus characterized not only by the accumulation of still more manuscript descriptions, but also by the conversion of our materials from SGML to XML.

Repertorium today in outline:

Model for highly structured description of manuscript materials based on XML format (designed in its latest version by Andrej Bojadžiev)
Corpus of over 350 Slavic manuscripts (11th–18th centuries)
Model for electronic edition of Old Slavonic texts in XML format
Model for comparison of the content of miscellanies in SVG format
Model for linguistic analysis (in preparation)

The notion ‘Repertorium’ means not only recording, arranging, and storing facts and phenomena, but also discovering, analysing and synthesizing the data. More specifically, in the present case, ‘Repertorium’ should be understood as “a place, environment, where the scientific descriptions of medieval manuscripts and texts are stored”. These books and texts are not merely enumerated or copied; the data concerning them reflect the results of analysis, which makes it possible to structure them (arrange them or combine them with other data) in a particular format.

The template is the most important part of the Repertorium. The descriptions and examples of real texts are based on XML (Extensible Markup Language), an informatics standard that incorporates special “markup” characters within natural language texts. The markup tags demarcate certain parts of the texts (elements) and signal what the data represents, simplifying the identification and extraction of data from the text not just during conversion for rendering (the most common procedure in humanities projects), but also during data-mining for analysis. The most recent model of description of manuscripts in an XML format derived from the TEI (Text Encoding Initiative) guidelines has been developed by Andrej Bojadžiev (Sofia University).
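A small sketch may help show how markup simplifies extraction. The fragment below is hypothetical: its element names are loosely modeled on TEI conventions and are not taken from the project's actual template, and the titles and folio ranges are invented. The code uses only Python's standard library.

```python
import xml.etree.ElementTree as ET

# Hypothetical manuscript description; element names and values are illustrative
# assumptions and do not reproduce the Repertorium template.
sample = """<msDescription id="MS_A">
  <msContents>
    <msItem n="1"><title>Text one</title><locus>1r-12v</locus></msItem>
    <msItem n="2"><title>Text two</title><locus>13r-40v</locus></msItem>
  </msContents>
</msDescription>"""


def list_items(xml_text):
    """Return (number, title, folio range) for every item in the description."""
    root = ET.fromstring(xml_text)
    return [
        (item.get("n"), item.findtext("title"), item.findtext("locus"))
        for item in root.iter("msItem")
    ]


for number, title, locus in list_items(sample):
    print(f"{number}. {title} (ff. {locus})")
```

Because the tags state what each piece of data is, the same file can feed a printed catalogue entry, a search index, or an analytical routine without being re-entered.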

The working team in the Institute of Literature has already developed a digital library of over 350 electronic documents. Since its inception as a joint Bulgarian-US project over ten years ago, the Repertorium Initiative has expanded to include a joint Bulgarian-British project describing Slavic manuscripts in the collection of the British Library (London), as well as a project with the University of Gothenburg (Sweden) concerning the study of late medieval Slavic manuscripts with computer tools. The Repertorium Initiative has grown not only in terms of its geography and its participants; it has also come to include a unique set of possibilities for linking the primary data to a standardized terminological apparatus for the description, study, edition, and translation of medieval texts, as well as to key words and terms used in the bibliographic descriptions. This combination of structured descriptions of primary sources with a sophisticated network of descriptive materials permits, for example, the extraction of different types of indices that go well beyond traditional field-based querying. The Internet presence of the Repertorium Initiative is located at http://clover.slavic.pitt.edu/~repertorium/ .

Because the Repertorium Initiative goes beyond manuscript studies in seeking to provide a broad and encyclopedic source of information about the Slavic medieval heritage, it also incorporates such auxiliary materials as bibliographic information and other authority files. In this capacity the Repertorium Initiative is closely coordinated with three other projects: the project for Authority Files, which defines the terms and ontology necessary for medieval Slavic manuscript studies; Libri Slavici, a joint undertaking of the Bulgarian Academy of Sciences and the University of Sofia in the field of bibliography on the medieval written heritage; and the Repertorium Workstation, which identifies the typology of the content of manuscripts and texts with the aid of computational tools. All three share the common structure of the TEI documents and use a common XSLT (Extensible Stylesheet Language Transformations) library for transforming documents to a variety of formats (including XML, HTML [Hypertext Markup Language], and SVG [Scalable Vector Graphics]), thus providing a sound base for the exchange of information and for electronic publishing.
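The principle of one source and several output formats can be sketched as follows. This is not the project's XSLT library: plain Python from the standard library stands in for the stylesheets here, and the source fragment, element names, and layout values are assumptions made purely for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical source document; element names are illustrative assumptions.
source = """<msDescription id="MS_A">
  <msItem><title>Text one</title></msItem>
  <msItem><title>Text two</title></msItem>
</msDescription>"""


def titles(xml_text):
    """Collect the item titles from the source description, in document order."""
    return [t.text for t in ET.fromstring(xml_text).iter("title")]


def to_html(xml_text):
    """Render the titles as an HTML bulleted list."""
    ul = ET.Element("ul")
    for title in titles(xml_text):
        ET.SubElement(ul, "li").text = title
    return ET.tostring(ul, encoding="unicode")


def to_svg(xml_text, row_height=20):
    """Render the same titles as stacked text labels in an SVG drawing."""
    items = titles(xml_text)
    svg = ET.Element("svg", {
        "xmlns": "http://www.w3.org/2000/svg",
        "width": "200",
        "height": str(row_height * len(items)),
    })
    for i, title in enumerate(items):
        label = ET.SubElement(svg, "text", {"x": "5", "y": str(row_height * i + 15)})
        label.text = title
    return ET.tostring(svg, encoding="unicode")


print(to_html(source))
print(to_svg(source))
```

Both outputs are derived from the same unchanged source, which is the point of separating data from processing; an XSLT stylesheet expresses the same mapping declaratively rather than in program code.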

The relationship among these projects can be described as follows:

The Repertorium Initiative is innovative from both philological and technological perspectives in its approach to the description and edition of medieval texts. It takes its metadata for description from its Authority Files and its bibliographic references from Libri Slavici.
The glossaries and thesauri of terminology are very important for filling in and using the metadata.
The Authority Files project gathers its preliminary information on the basis of descriptions and prepares guidelines in the form of authority lists for the use of the metadata by researchers.
Libri Slavici accumulates its data from various sources, including descriptions and authority files, and shares common metadata with both of them.
Visualization of typology provides radically new non-textual representations of manuscript structures. This development demonstrates that computers have done more than provide a new way of performing such traditional tasks as producing manuscript descriptions. Rather, the production of electronic manuscript descriptions has enabled new and innovative philological perspectives on the data.

The current realization of the principles of all these projects is conceived as part of European initiatives for preserving the cultural heritage held in European libraries and archives, providing search and retrieval of data and metadata on the basis of paleographic, linguistic, textological, historical, and other cultural characteristics. The connections among the different subprojects thus lead to a digital library that is suitable for use by a wide community of specialists and, at the same time, continues to inspire related new projects and initiatives. The future of the Repertorium Initiative lies in continuing to integrate into a single network full-text databases of medieval Slavic manuscripts, electronic descriptions of codices, and electronic reference books of terminology. These topics are included in the Slovo-ASO project.
