1 - Data Preparation

Preparation of the pre-cleaned and speaker/genre-tagged play text.

Lars Hinrichs https://larshinrichs.site (The University of Texas at Austin) https://liberalarts.utexas.edu/english
10-18-2020

The data: gold-standard preparation

We have pre-cleaned the data and hand-coded it for genre. That is the gold standard. In addition, speech prefixes were retained and systematically set to all-caps. Stage directions were removed. Below is a screenshot of the current version.


Figure 1: Current version of the text: pre-cleaned and hand-tagged for genre.

Loading the file into R

Once the file has been read in, an excerpt of the resulting object looks as follows.

Table 1: The text as initial R object.
linenumber text
65 And when he says he is, say that he dreams,
66 For he is nothing but a mighty lord.
67 This do, and do it kindly, gentle sirs;
68 It will be pastime passing excellent,
69 If it be husbanded with modesty. < /verse >
70 < 1. HUNTSMAN > < verse > My lord, I warrant you we will play our part
71 As he shall think by our true diligence
72 He is no less than what we say he is. < /verse >
73 < LORD > < verse > Take him up gently and to bed with him,
74 And each one to his office when he wakes.
75 Sirrah, go see what trumpet ’tis that sounds.

Current size

The text has been read line by line and currently has 2,799 lines. The number of words will emerge as we process it.
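As a rough illustration of the line-by-line read, here is a minimal sketch. The file name and the three sample lines are stand-ins written to a temporary file so that the example runs on its own; the object name `shrew` is likewise an assumption, not necessarily the name used in the original analysis.

```r
library(tibble)

# Hypothetical stand-in for the pre-cleaned play file (assumption: plain text,
# one line of the play per line of the file).
play_path <- tempfile(fileext = ".txt")
writeLines(
  c("<LORD> <verse> Take him up gently and to bed with him,",
    "And each one to his office when he wakes.",
    "Sirrah, go see what trumpet 'tis that sounds."),
  play_path
)

# Read the file line by line and number the lines.
raw_lines <- readLines(play_path, encoding = "UTF-8")
shrew <- tibble(linenumber = seq_along(raw_lines), text = raw_lines)

# The number of rows is the number of lines read.
nrow(shrew)
```

Keeping `linenumber` as its own column means that every later row, including the one-word-per-row format, can still be traced back to its line in the source text.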

Further processing

The following things need to happen before any analysis.

  1. We’ll want to convert the text to one-word-per-row format. This step typically (a) removes all punctuation and (b) changes upper-case to lower-case letters. We don’t want either effect across the board, but getting rid of punctuation is very useful. So we will remove punctuation, but first we need to protect the valuable genre tagging, which is encoded in punctuation. To address (a), then, I’ll change our XML format tags (e.g. <prose>...</prose>) to lexical tags, i.e. made-up words I can be sure Shakespeare didn’t use in the play: prosestart ... proseend, versestart ... verseend. This lets me remove all punctuation while retaining the genre tagging. Point (b) is handled at the tokenization step by switching off lower-casing, since the all-caps speech prefixes must stay recognizable.
  2. Edit those speech prefixes that, like the huntsmen, consist of more than one token.
  3. Mark each line for current act.
  4. At this point, “tokenize”, i.e. change format to one-word-per-row.
  5. Identify all words in all-caps spelling; we know these are the speech prefixes. Use them to mark each row for the current speaker.
  6. Identify all genre tags. Use them to mark each row for prose or verse as genre.

I’ll now work down this list and show the state of the text after each conversion.

Change genre tags to lexical tags

Table 2: Data after genre tags have been changed from XML format to lexical forms.
linenumber text
65 And when he says he is, say that he dreams,
66 For he is nothing but a mighty lord.
67 This do, and do it kindly, gentle sirs;
68 It will be pastime passing excellent,
69 If it be husbanded with modesty. < /verse >
70 < 1. HUNTSMAN > < verse > My lord, I warrant you we will play our part
71 As he shall think by our true diligence
72 He is no less than what we say he is. < /verse >
73 < LORD > < verse > Take him up gently and to bed with him,
74 And each one to his office when he wakes.
75 Sirrah, go see what trumpet ’tis that sounds.
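The tag conversion can be sketched as a set of regular-expression replacements. This is only an illustration under assumptions: the sample lines are modeled on the tables, the patterns tolerate optional spaces inside the angle brackets, and the object names `text`, `lexical_tags`, and `text_lex` are made up for the example.

```r
library(stringr)

# Sample lines modeled on the tables in this document.
text <- c(
  "If it be husbanded with modesty. </verse>",
  "<1. HUNTSMAN> <verse> My lord, I warrant you we will play our part",
  "< LORD > < verse > Take him up gently and to bed with him,"
)

# Regex pattern -> lexical tag. Closing tags are listed first for readability;
# the opening-tag patterns cannot match a closing tag because of the "/".
lexical_tags <- c(
  "<\\s*/\\s*prose\\s*>" = " proseend ",
  "<\\s*/\\s*verse\\s*>" = " verseend ",
  "<\\s*prose\\s*>"      = " prosestart ",
  "<\\s*verse\\s*>"      = " versestart "
)

# Replace all tags, then collapse the extra whitespace left behind.
text_lex <- str_squish(str_replace_all(text, lexical_tags))
text_lex
```

Because the lexical tags are ordinary letter sequences, they survive the punctuation stripping that happens during tokenization.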

Edit speech prefixes

I went back to the DOCX file, which is easier to read, and looked for any character name that consists of more than one token. Here are the ones I came up with.

No need to display the result here.
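One way to carry out this edit is a small recoding table that maps each multi-token prefix to a single all-caps token. The sketch below is hypothetical throughout: the single-token labels (`FIRSTHUNTSMAN`, `SECONDHUNTSMAN`) and the object names are illustrative assumptions and not the labels used in the actual data.

```r
# Sample lines modeled on the tables in this document.
text <- c(
  "<1. HUNTSMAN> versestart My lord, I warrant you we will play our part",
  "<LORD> versestart Take him up gently and to bed with him,"
)

# Hypothetical recoding table: multi-token prefix -> single-token prefix.
prefix_recodes <- c(
  "1. HUNTSMAN" = "FIRSTHUNTSMAN",
  "2. HUNTSMAN" = "SECONDHUNTSMAN"
)

# fixed = TRUE treats each prefix as a literal string, so the "." in
# "1. HUNTSMAN" is not interpreted as a regex wildcard.
text_fixed <- text
for (old in names(prefix_recodes)) {
  text_fixed <- gsub(old, prefix_recodes[[old]], text_fixed, fixed = TRUE)
}
text_fixed
```

A prefix that stays in one piece after tokenization can later be detected with a simple all-caps test.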

Mark acts

Based on the tagging of the current text, we will recognize act beginnings by the sequence |A at the beginning of a line.

Table 3: Data after act has been assigned.
linenumber text act
65 And when he says he is, say that he dreams, 1
66 For he is nothing but a mighty lord. 1
67 This do, and do it kindly, gentle sirs; 1
68 It will be pastime passing excellent, 1
69 If it be husbanded with modesty. < /verse > 1
70 < 1. HUNTSMAN > < verse > My lord, I warrant you we will play our part 1
71 As he shall think by our true diligence 1
72 He is no less than what we say he is. < /verse > 1
73 < LORD > < verse > Take him up gently and to bed with him, 1
74 And each one to his office when he wakes. 1
75 Sirrah, go see what trumpet ’tis that sounds. 1
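Act assignment of this kind can be expressed as a running count of act markers. The following is a minimal sketch with invented sample lines; only the `|A` sequence at the start of a line comes from the text above, while the rest of each marker line is an assumption.

```r
library(dplyr)

# Invented sample lines; marker lines begin with "|A".
shrew <- tibble(
  linenumber = 1:6,
  text = c(
    "front matter",
    "|A first act marker",
    "a line in the first act",
    "another line in the first act",
    "|A second act marker",
    "a line in the second act"
  )
)

# Each marker line increments the counter; all following lines inherit it.
shrew <- shrew %>%
  mutate(act = cumsum(startsWith(text, "|A")))

shrew$act
```

In this sketch any lines before the first marker receive act 0, and the marker lines themselves carry the number of the act they open.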

Tokenize

With these precautions in place, we can rely on the default behavior of the unnest_tokens() function for the removal of punctuation. We do not use its default lower-casing, however, because the all-caps speech prefixes must stay recognizable.

Table 4: Data after tokenization, with option to_lower = FALSE.
linenumber act word
9 1 a
9 1 denier
9 1 Go
9 1 by
9 1 Saint
9 1 Jeronimy
9 1 go
10 1 to
10 1 thy
10 1 cold
10 1 bed

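We can now determine the size of the play: the text object has 23,242 rows after tokenization, one per word. Note that at this stage the count still includes the speech prefixes and the lexical genre tags, which are removed in the steps below.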
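The tokenization step might look roughly like this, using `unnest_tokens()` from the tidytext package with `to_lower = FALSE`. The two sample lines are reconstructed from the table above (their punctuation is an assumption), and the object names are illustrative.

```r
library(dplyr)
library(tidytext)

# Two sample lines; the punctuation is reconstructed for illustration.
shrew <- tibble(
  linenumber = c(9L, 10L),
  act = 1L,
  text = c(
    "a denier. Go by, Saint Jeronimy, go",
    "to thy cold bed and warm thee. proseend"
  )
)

# One word per row: punctuation is stripped by default, and
# to_lower = FALSE keeps the original capitalization.
tokens <- shrew %>%
  unnest_tokens(word, text, to_lower = FALSE)

tokens
nrow(tokens)  # number of rows, i.e. tokens
```

The remaining columns (`linenumber`, `act`) are repeated for every token of a line, which is what allows each word to keep its line and act information.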

Mark each row for current speaker

We’re adding a new column and marking for each word who the speaker is.

Table 5: Data after speaker has been assigned.
linenumber act word speaker
9 1 a SLY
9 1 denier SLY
9 1 Go SLY
9 1 by SLY
9 1 Saint SLY
9 1 Jeronimy SLY
9 1 go SLY
10 1 to SLY
10 1 thy SLY
10 1 cold SLY
10 1 bed SLY
10 1 and SLY
10 1 warm SLY
10 1 thee SLY
10 1 proseend SLY
11 1 HOSTESS HOSTESS
11 1 prosestart HOSTESS
11 1 I HOSTESS
11 1 know HOSTESS
11 1 my HOSTESS
11 1 remedy HOSTESS

Now that the speaker tagging in a dedicated column is in place, we will also remove those rows that contain only a speech prefix.
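Speaker assignment and the subsequent removal of prefix rows can be sketched as a "mark, fill down, filter" sequence. The sample data are abbreviated from the table above. One assumption is built in and worth stating: the all-caps test here requires at least two capital letters, so that the pronoun "I" is not mistaken for a speech prefix and dropped.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Abbreviated sample in one-word-per-row format.
tokens <- tibble(
  act = 1L,
  word = c("SLY", "prosestart", "go", "to", "thy", "cold", "bed", "proseend",
           "HOSTESS", "prosestart", "I", "know", "my", "remedy")
)

tagged <- tokens %>%
  mutate(
    # Assumption: a speech prefix is a word of two or more capital letters.
    is_prefix = str_detect(word, "^[A-Z]{2,}$"),
    speaker = if_else(is_prefix, word, NA_character_)
  ) %>%
  # Carry each prefix down to the words that follow it.
  fill(speaker, .direction = "down") %>%
  # Drop the rows that contain only a speech prefix.
  filter(!is_prefix) %>%
  select(-is_prefix)

tagged
```

An alternative to the length condition would be to match against an explicit list of known speech prefixes, which is stricter but has to be maintained by hand.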

Mark each row for current genre

Table 6: Data after genre has been assigned.
linenumber act word speaker genre
10 1 to SLY prose
10 1 thy SLY prose
10 1 cold SLY prose
10 1 bed SLY prose
10 1 and SLY prose
10 1 warm SLY prose
10 1 thee SLY prose
10 1 proseend SLY prose
11 1 prosestart HOSTESS prose
11 1 know HOSTESS prose
11 1 my HOSTESS prose
11 1 remedy HOSTESS prose
11 1 must HOSTESS prose
11 1 go HOSTESS prose
11 1 fetch HOSTESS prose
11 1 the HOSTESS prose
12 1 thirdborough HOSTESS prose
12 1 proseend HOSTESS prose
13 1 prosestart SLY prose
13 1 Third SLY prose
13 1 or SLY prose

Now that genres are tagged as well, we will also remove those rows that contain only a (lexical) genre tag.
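Genre assignment follows the same pattern as speaker assignment. The sketch below uses abbreviated sample rows and illustrative object names, and it assumes that every passage opens with a start tag, so that filling down from the start tags is sufficient.

```r
library(dplyr)
library(tidyr)

# Abbreviated sample with speakers already assigned.
tagged <- tibble(
  word = c("prosestart", "I", "know", "my", "remedy", "proseend",
           "versestart", "Huntsman", "I", "charge", "thee", "verseend"),
  speaker = c(rep("HOSTESS", 6), rep("LORD", 6))
)

genre_tags <- c("prosestart", "proseend", "versestart", "verseend")

final <- tagged %>%
  mutate(genre = case_when(
    word == "prosestart" ~ "prose",
    word == "versestart" ~ "verse",
    TRUE ~ NA_character_
  )) %>%
  # Carry each start tag's genre down to the words that follow it.
  fill(genre, .direction = "down") %>%
  # Drop the rows that contain only a lexical genre tag.
  filter(!word %in% genre_tags)

final
```

With this approach the end tags are not needed to determine the genre; any untagged material between an end tag and the next start tag would simply inherit the preceding genre, so the result depends on the tagging being complete.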

The final file

Here is a last view of the data, this time a bit longer.

Table 7: Data after processing.
linenumber act word speaker genre
11 1 my HOSTESS prose
11 1 remedy HOSTESS prose
11 1 must HOSTESS prose
11 1 go HOSTESS prose
11 1 fetch HOSTESS prose
11 1 the HOSTESS prose
12 1 thirdborough HOSTESS prose
13 1 Third SLY prose
13 1 or SLY prose
13 1 fourth SLY prose
13 1 or SLY prose
13 1 fift SLY prose
13 1 borough SLY prose
13 1 I’ll SLY prose
13 1 answer SLY prose
14 1 him SLY prose
14 1 by SLY prose
14 1 law SLY prose
14 1 I’ll SLY prose
14 1 not SLY prose
14 1 budge SLY prose
14 1 an SLY prose
14 1 inch SLY prose
14 1 boy SLY prose
14 1 let SLY prose
14 1 him SLY prose
14 1 come SLY prose
15 1 and SLY prose
15 1 kindly SLY prose
16 1 Huntsman LORD verse
16 1 charge LORD verse
16 1 thee LORD verse
16 1 tender LORD verse
16 1 well LORD verse
16 1 my LORD verse
16 1 hounds LORD verse
17 1 Brach LORD verse
17 1 Merriman LORD verse
17 1 the LORD verse
17 1 poor LORD verse
17 1 cur LORD verse
17 1 is LORD verse
17 1 emboss’d LORD verse
18 1 And LORD verse
18 1 couple LORD verse
18 1 Clowder LORD verse
18 1 with LORD verse
18 1 the LORD verse
18 1 deep LORD verse
18 1 mouth’d LORD verse
18 1 brach LORD verse
19 1 Saw’st LORD verse
19 1 thou LORD verse
19 1 not LORD verse
19 1 boy LORD verse
19 1 how LORD verse
19 1 Silver LORD verse
19 1 made LORD verse
19 1 it LORD verse
19 1 good LORD verse
20 1 At LORD verse

Citation

For attribution, please cite this work as

Hinrichs (2020, Oct. 18). Genre and Character in The Taming of the Shrew: 1 - Data Preparation. Retrieved from https://titus-and-shrew.netlify.app

BibTeX citation

@misc{hinrichs2020shrew-1,
  author = {Hinrichs, Lars},
  title = {Genre and Character in The Taming of the Shrew: 1 - Data Preparation},
  url = {https://titus-and-shrew.netlify.app},
  year = {2020}
}