Project Log
Incorrectly Placed Quotations
One of the first problems we encountered was that some of the quotations in the texts contained words that were not part of the speech being quoted. For archeographic reasons we reproduced the texts exactly as they appeared in the source. However, the punctuation conventions employed in that source often failed to distinguish speech from descriptions of speech acts. For example:
“hello, said Ivan. I’ve come to slay your dragon,”
includes the phrase “said Ivan” within the quotation marks, even though it is not part of the quoted text. Initially we removed the quotation marks from the text entirely and replaced them with a speech element to denote instances of direct speech. We later decided not to tamper with any of the original text, including punctuation, which meant that we needed an alternative strategy for identifying the text that was actually spoken as part of a quotation. We dealt with the issue by tagging the quote as two separate speech fragments and linking them by assigning both the same value in a speech-number attribute, @sn. For example:
]]>"hello]]>, said Ivan. ]]>I’ve come to slay your dragon."]]>.
This markup strategy uses the shared @sn attribute value to formalize the fact that the two instances of speech are parts of a single speech act. The use of quotation marks in the original source erroneously suggests that “said Ivan” was part of the speech; our strategy ensures that when we retrieve the contents of a speech for analysis, we do not also retrieve that descriptive phrase. One of the reasons we wanted to exclude phrases such as “said Ivan” from the content of a speech element was to enable searches to return more accurate results. As with similar problems, the solution does not compromise the integrity of the original text and keeps the quotation marks in their original context.
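The retrieval step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the element name `speech` and the wrapper element `p` are hypothetical, since only the @sn attribute is named here, and the function name `speeches_by_number` is invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the element names "speech" and "p" are assumptions.
# Only the @sn attribute comes from the project log.
SAMPLE = (
    '<p>'
    '<speech sn="1">“hello</speech>, said Ivan. '
    '<speech sn="1">I’ve come to slay your dragon.”</speech>'
    '</p>'
)

def speeches_by_number(xml_text):
    """Join the text of all speech fragments that share an @sn value."""
    root = ET.fromstring(xml_text)
    grouped = {}
    for fragment in root.iter("speech"):
        # itertext() yields only text inside the fragment, so the
        # narrator's ", said Ivan. " (stored as tail text) is skipped.
        text = "".join(fragment.itertext())
        grouped.setdefault(fragment.get("sn"), []).append(text)
    return {sn: " ".join(parts) for sn, parts in grouped.items()}

result = speeches_by_number(SAMPLE)
print(result["1"])
```

Because the narrator's phrase sits between the two fragments rather than inside either of them, it never enters the joined speech, while the source text and its quotation marks remain untouched in the document.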
Categorizing Speech Verbs
One of the trickiest and most pervasive problems we experienced in marking up the tales was determining which words should be considered verbs of speech. We quickly discovered that there were more types of speech verbs in the text than we had originally expected. This raised the question of how we should go about determining the classification system for verbs. For example, does “They agreed” count as an act of indirect speech? Should it be assumed that the agreement was vocalized, or, since vocalization is not explicit, should the act be considered non-verbal? Here is another example of the type of verb in question:
“присудил” (past tense of “присудить,” “to award”; tale #365)
Even more than accuracy, the main concern surrounding verb classification was maintaining consistency, which is especially important given the number of tales. Context plays a large role: the same verb will be considered a speech verb in some instances and not in others, depending on how it is used in the text.
Linking Verbs with Speeches
The first major philosophical issue we encountered concerned how the verbs and speech acts should be tagged in relation to one another. This question arose in response to the appearance of increasingly complex instances of speech within the texts. Our original method for marking up the tales treated the speeches and verbs independently. For example, the quote,
Baba Yaga shouted: “hand over the watermelons and no one gets hurt,”
was originally to be tagged the following way:
Baba Yaga ]]>shouted]]>: ]]>“hand over the watermelons and no one gets hurt.”]]>
In this example, the verb is tagged separately from the speech: it is tagged as a verb element whose @type attribute holds the verb's infinitive form. There are several problems with this. Since there is no connection between the verb element and the speech element, the system has no direct way of recognizing the connection between the verb and the speaker. Each character has a single gender, which is identified in a character list at the beginning of the XML document; each character element contains a @gender attribute (m, f, or mx for mixed-gender groups) and a unique @id attribute that distinguishes the character from all others in the corpus. The @id attribute matches the @speaker attribute on a speech element, thus linking each speaker with a gender, so gender is already tied to speech. We needed a way to tie the appearance of verbs to characters as well, so that we could automatically correlate verb type and frequency with gender. One way to approach this problem would be to include a speaker attribute on the verb element, as we did on the speech element. However, we ultimately dealt with the problem by tagging each verb as its own element and assigning it an @infinitive attribute, whose value is the same as the value of the @verb attribute on the associated speech element.
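To make the chain from verb to speaker to gender concrete, here is a minimal Python sketch. The attribute names (@id, @gender, @speaker, @verb, @infinitive) are the ones described above; the element names (`tale`, `characters`, `character`, `speech`, `verb`), the identifier values, and the function name `verbs_by_gender` are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical markup: all element names and id values are assumptions.
# The attribute names are those mentioned in the project log.
SAMPLE = """
<tale>
  <characters>
    <character id="baba_yaga" gender="f"/>
    <character id="ivan" gender="m"/>
  </characters>
  <p>Baba Yaga <verb infinitive="shout">shouted</verb>:
    <speech speaker="baba_yaga" verb="shout">“hand over the watermelons and no one gets hurt.”</speech></p>
</tale>
""".strip()

def verbs_by_gender(xml_text):
    """Count (gender, verb) pairs by following @speaker to the character list."""
    root = ET.fromstring(xml_text)
    # Map each character's @id to its @gender.
    gender_of = {c.get("id"): c.get("gender") for c in root.iter("character")}
    counts = Counter()
    for speech in root.iter("speech"):
        gender = gender_of.get(speech.get("speaker"))
        verb = speech.get("verb")
        if gender and verb:
            counts[(gender, verb)] += 1
    return counts

counts = verbs_by_gender(SAMPLE)
print(counts)
```

The point of the design is that the verb never needs its own speaker attribute: the @verb value on the speech carries the verb's infinitive, and the @speaker value on the same element leads to the character's gender.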
Multiple Verbs
After we began marking up the verbs in the text, we noticed that in some situations multiple verbs were linked with the same speech. This can occur in two slightly different ways. In the first, two different verbs refer to the same speech act:
Baba Yaga said, “go over there and sit down!” “You always order me around,” whined Ivan.
In this example, both the verb “said” and the verb “to order” are associated with Baba Yaga’s initial speech act. Because of this it would be incorrect to tag the example as if it contained two speeches by Baba Yaga, one associated with each verb. It would also be incorrect to associate only one of the two verbs with the speech, since each verb has its own connotations and neither is more relevant than the other for our analysis. Instead, we decided to allow the @verb attribute on a speech element to contain multiple verbs and to tag each verb as its own verb element.
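A multi-valued @verb attribute might be read as follows. This sketch assumes that the verbs are separated by whitespace, which is not stated in the log; the element names, identifier values, and the function name `verbs_per_speech` are likewise hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: element names, id values, and the whitespace-separated
# format of @verb are assumptions made for this illustration.
SAMPLE = (
    '<p>Baba Yaga <verb infinitive="say">said</verb>, '
    '<speech speaker="baba_yaga" verb="say order">“go over there and sit down!”</speech> '
    '<speech speaker="ivan" verb="whine">“You always '
    '<verb infinitive="order">order</verb> me around,”</speech> '
    '<verb infinitive="whine">whined</verb> Ivan.</p>'
)

def verbs_per_speech(xml_text):
    """Return (speaker, [verbs]) for each speech, splitting @verb on whitespace."""
    root = ET.fromstring(xml_text)
    return [
        (speech.get("speaker"), speech.get("verb", "").split())
        for speech in root.iter("speech")
    ]

pairs = verbs_per_speech(SAMPLE)
for speaker, verbs in pairs:
    print(speaker, verbs)
```

Under these assumptions Baba Yaga's single speech carries both verbs, so it is counted once as a speech while still contributing to the tallies for both “say” and “order”.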
The second way in which multiple verbs can connect back to the same speech involves instances of the same lexeme referring back to a single instance of speech multiple times. For example:
The girl sat down and started crying. “Why are you crying?” asked the bird. “How can I help but cry?” said the little girl.
Here, the verb “to cry” is associated with the same speech act three separate times. Again, since the speech act, which in this case is indirect, occurs once, it should not be tagged as if it occurred three times. However, in this case the verbs in question do not have different connotations. The verb “to cry” is not associated with the girl three times because the girl was crying on three separate occasions. Rather, it is associated with her once, and the characters then refer back to that association. Because of this we decided that a verb repeated in reference to a single speech should be listed only once in the @verb attribute. However, each instance of the verb is still tagged as its own verb element.