Preparation of the pre-cleaned and speaker/genre-tagged play text.
We have pre-cleaned the data and hand-coded it for genre, which is the gold standard. In addition, speech prefixes were retained and consistently set in all caps, and stage directions were removed. Below is a screenshot of the current version.
Once the file has been read in, the data look as follows.
linenumber | text |
---|---|
65 | And when he says he is, say that he dreams, |
66 | For he is nothing but a mighty lord. |
67 | This do, and do it kindly, gentle sirs; |
68 | It will be pastime passing excellent, |
69 | If it be husbanded with modesty. < /verse > |
70 | < 1. HUNTSMAN > < verse > My lord, I warrant you we will play our part |
71 | As he shall think by our true diligence |
72 | He is no less than what we say he is. < /verse > |
73 | < LORD > < verse > Take him up gently and to bed with him, |
74 | And each one to his office when he wakes. |
75 | Sirrah, go see what trumpet ’tis that sounds. |
The text has been read line-by-line. The current length is 2 799 lines. As we process it, we’ll find out the number of words.
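A minimal sketch of the line-by-line read, assuming a plain-text export of the play. The file name is not given here, so the sketch writes three sample lines to a temporary file and reads that instead; the object name `shrew` is also an assumption.

```r
library(tibble)

# Hypothetical stand-in for the play file: a few lines in a temporary file.
path <- tempfile(fileext = ".txt")
writeLines(
  c("<LORD> <verse> Take him up gently and to bed with him,",
    "And each one to his office when he wakes.",
    "Sirrah, go see what trumpet it is that sounds."),
  path
)

# One row per line of the file, numbered in reading order.
lines <- readLines(path, encoding = "UTF-8")
shrew <- tibble(linenumber = seq_along(lines), text = lines)

nrow(shrew)  # the number of lines read
```

With the real file, `nrow(shrew)` would give the line count reported above.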
The following things need to happen before any analysis.
- Convert the genre tags (e.g. `<prose>...</prose>`) to lexical tags, i.e. made-up words I can be sure Shakespeare didn’t use in the play: `prosestart ... proseend`, `versestart ... verseend`. This will allow me to still remove any punctuation while retaining the genre tagging.
- Tag each word with `prose` or `verse` as genre.

I’ll now work down this list and show the state of the text after each conversion.
linenumber | text |
---|---|
65 | And when he says he is, say that he dreams, |
66 | For he is nothing but a mighty lord. |
67 | This do, and do it kindly, gentle sirs; |
68 | It will be pastime passing excellent, |
69 | If it be husbanded with modesty. < /verse > |
70 | < 1. HUNTSMAN > < verse > My lord, I warrant you we will play our part |
71 | As he shall think by our true diligence |
72 | He is no less than what we say he is. < /verse > |
73 | < LORD > < verse > Take him up gently and to bed with him, |
74 | And each one to his office when he wakes. |
75 | Sirrah, go see what trumpet ’tis that sounds. |
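The conversion of genre tags to lexical tags could be sketched as follows. This is an illustration under assumptions, not the code used for the post: the regular expressions tolerate optional spaces inside the angle brackets, and the two sample lines are shortened.

```r
library(stringr)

text <- c(
  "If it be husbanded with modesty. </verse>",
  "<LORD> <verse> Take him up gently and to bed with him,"
)

# Each pattern (name) is replaced by its lexical tag (value).
# Padding with spaces keeps the new tag separate from neighbouring words.
lexical_tags <- c(
  "<\\s*prose\\s*>"      = " prosestart ",
  "<\\s*/\\s*prose\\s*>" = " proseend ",
  "<\\s*verse\\s*>"      = " versestart ",
  "<\\s*/\\s*verse\\s*>" = " verseend "
)

# Replace the tags, then collapse the extra whitespace the padding introduced.
tagged <- str_squish(str_replace_all(text, lexical_tags))
tagged
```

Because the patterns are case-sensitive and spelled out, speech prefixes such as `<LORD>` are left untouched at this stage.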
I went back to the DOCX file, which is easier to read, and looked for character names that consist of more than one token. Here are the ones I found.
No need to display the result here.
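One way to deal with multi-token names is to collapse each into a single all-caps token before tokenizing. The sketch below assumes this approach. `1. HUNTSMAN` appears in the data shown above, while the `2. HUNTSMAN` entry and both replacement spellings are hypothetical.

```r
library(stringr)

text <- c(
  "<1. HUNTSMAN> <verse> My lord, I warrant you we will play our part",
  "<LORD> <verse> Take him up gently and to bed with him,"
)

# Multi-token speech prefix (regex, dot escaped) -> single-token replacement.
# The replacement spellings are made up for this illustration.
name_map <- c(
  "1\\. HUNTSMAN" = "FIRSTHUNTSMAN",
  "2\\. HUNTSMAN" = "SECONDHUNTSMAN"
)

single_token <- str_replace_all(text, name_map)
single_token
```

Letters-only replacements are used here so that the tokenizer cannot split the name at a digit or a full stop.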
Based on the tagging of the current text, we will recognize act beginnings by the sequence `|A` at the beginning of a line.
linenumber | text | act |
---|---|---|
65 | And when he says he is, say that he dreams, | 1 |
66 | For he is nothing but a mighty lord. | 1 |
67 | This do, and do it kindly, gentle sirs; | 1 |
68 | It will be pastime passing excellent, | 1 |
69 | If it be husbanded with modesty. < /verse > | 1 |
70 | < 1. HUNTSMAN > < verse > My lord, I warrant you we will play our part | 1 |
71 | As he shall think by our true diligence | 1 |
72 | He is no less than what we say he is. < /verse > | 1 |
73 | < LORD > < verse > Take him up gently and to bed with him, | 1 |
74 | And each one to his office when he wakes. | 1 |
75 | Sirrah, go see what trumpet ’tis that sounds. | 1 |
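A running act counter can be derived from the `|A` markers with a cumulative sum. This is a sketch under assumptions: the marker lines below are made up beyond their first two characters, and lines before the first marker would receive 0 here.

```r
library(dplyr)
library(stringr)

shrew <- tibble(
  linenumber = 1:5,
  text = c("|A1", "first line of dialogue", "second line of dialogue",
           "|A2", "third line of dialogue")
)

# str_detect() flags lines that start with "|A"; cumsum() turns the flags
# into a counter that increases by one at every act beginning.
shrew <- shrew %>%
  mutate(act = cumsum(str_detect(text, "^\\|A")))

shrew
```

The pipe character is escaped in the pattern because an unescaped `|` means alternation in a regular expression.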
We have now covered our bases and can rely on the default behavior of `unnest_tokens()` for removing punctuation, but not for lower-casing across the board: case has to be preserved so that the all-caps speech prefixes remain recognizable.
linenumber | act | word |
---|---|---|
9 | 1 | a |
9 | 1 | denier |
9 | 1 | Go |
9 | 1 | by |
9 | 1 | Saint |
9 | 1 | Jeronimy |
9 | 1 | go |
10 | 1 | to |
10 | 1 | thy |
10 | 1 | cold |
10 | 1 | bed |
We can now determine that the play has 23 242 words, as that is the number of rows in the text object after tokenization.
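A sketch of the tokenization step, assuming the tidytext package and a data frame with the columns shown above. The two sample lines are reconstructed from the table and the object names are assumptions. `to_lower = FALSE` switches off lower-casing, while punctuation is still stripped by the default word tokenizer.

```r
library(dplyr)
library(tidytext)

shrew <- tibble(
  linenumber = c(9L, 10L),
  act = 1L,
  text = c("a denier. Go by, Saint Jeronimy, go",
           "to thy cold bed and warm thee. proseend")
)

# One row per word; the text column is replaced by the word column.
words <- shrew %>%
  unnest_tokens(word, text, to_lower = FALSE)

nrow(words)  # the row count is the token count
```

Note that at this stage the row count still includes speech prefixes and lexical genre tags, since each of them is a token of its own.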
We now add a new column that records, for each word, who the speaker is.
linenumber | act | word | speaker |
---|---|---|---|
9 | 1 | a | SLY |
9 | 1 | denier | SLY |
9 | 1 | Go | SLY |
9 | 1 | by | SLY |
9 | 1 | Saint | SLY |
9 | 1 | Jeronimy | SLY |
9 | 1 | go | SLY |
10 | 1 | to | SLY |
10 | 1 | thy | SLY |
10 | 1 | cold | SLY |
10 | 1 | bed | SLY |
10 | 1 | and | SLY |
10 | 1 | warm | SLY |
10 | 1 | thee | SLY |
10 | 1 | proseend | SLY |
11 | 1 | HOSTESS | HOSTESS |
11 | 1 | prosestart | HOSTESS |
11 | 1 | I | HOSTESS |
11 | 1 | know | HOSTESS |
11 | 1 | my | HOSTESS |
11 | 1 | remedy | HOSTESS |
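The speaker column can be filled by copying each speech prefix downward until the next prefix appears. The sketch below assumes a known list of speaker names, which is hypothetical and shortened here, and uses `tidyr::fill()`. The sample rows are a stand-in rather than the real data.

```r
library(dplyr)
library(tidyr)

words <- tibble(
  linenumber = c(10L, 10L, 10L, 10L, 11L, 11L, 11L, 11L),
  act = 1L,
  word = c("SLY", "warm", "thee", "proseend",
           "HOSTESS", "prosestart", "I", "know")
)

# Hypothetical, shortened list of speech prefixes.
speakers <- c("SLY", "HOSTESS", "LORD")

words <- words %>%
  # A prefix row names its own speaker; all other rows start out as NA ...
  mutate(speaker = if_else(word %in% speakers, word, NA_character_)) %>%
  # ... and inherit the most recent prefix above them.
  fill(speaker, .direction = "down")

words
```

Matching against a list rather than testing for all-caps spelling keeps the pronoun `I`, which is also upper-case, from being mistaken for a speech prefix.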
Now that speaker tagging is in place in a dedicated column, we remove the rows that contain only a speech prefix.
linenumber | act | word | speaker | genre |
---|---|---|---|---|
10 | 1 | to | SLY | prose |
10 | 1 | thy | SLY | prose |
10 | 1 | cold | SLY | prose |
10 | 1 | bed | SLY | prose |
10 | 1 | and | SLY | prose |
10 | 1 | warm | SLY | prose |
10 | 1 | thee | SLY | prose |
10 | 1 | proseend | SLY | prose |
11 | 1 | prosestart | HOSTESS | prose |
11 | 1 | know | HOSTESS | prose |
11 | 1 | my | HOSTESS | prose |
11 | 1 | remedy | HOSTESS | prose |
11 | 1 | must | HOSTESS | prose |
11 | 1 | go | HOSTESS | prose |
11 | 1 | fetch | HOSTESS | prose |
11 | 1 | the | HOSTESS | prose |
12 | 1 | thirdborough | HOSTESS | prose |
12 | 1 | proseend | HOSTESS | prose |
13 | 1 | prosestart | SLY | prose |
13 | 1 | Third | SLY | prose |
13 | 1 | or | SLY | prose |
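The table above already carries a genre column. One way to arrive at it, sketched under assumptions, is to drop the speech-prefix rows and then let each opening lexical tag set the genre for all rows that follow it. The speaker list and sample rows are hypothetical.

```r
library(dplyr)
library(tidyr)

words <- tibble(
  word = c("SLY", "prosestart", "warm", "thee", "proseend",
           "LORD", "versestart", "Huntsman", "I", "charge", "verseend"),
  speaker = c(rep("SLY", 5), rep("LORD", 6))
)

# Hypothetical, shortened list of speech prefixes.
speakers <- c("SLY", "LORD")

words <- words %>%
  # Remove rows that consist only of a speech prefix.
  filter(!word %in% speakers) %>%
  # An opening tag sets the genre; every other row starts out as NA ...
  mutate(genre = case_when(
    word == "prosestart" ~ "prose",
    word == "versestart" ~ "verse",
    TRUE ~ NA_character_
  )) %>%
  # ... and inherits the genre of the most recent opening tag.
  fill(genre, .direction = "down")

words
```

The closing tags need no special handling here, because the next opening tag overrides the genre anyway.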
Now that genres are tagged as well, we will also remove those rows that contain only a (lexical) genre tag.
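Removing the lexical tag rows amounts to a single filter. This is a minimal sketch with made-up sample rows, and the object names are assumptions.

```r
library(dplyr)

words <- tibble(
  word = c("prosestart", "know", "my", "remedy", "proseend"),
  speaker = "HOSTESS",
  genre = "prose"
)

# The four lexical tags introduced earlier.
genre_tags <- c("prosestart", "proseend", "versestart", "verseend")

# Keep only rows whose word is not one of the tags.
words <- words %>%
  filter(!word %in% genre_tags)

words
```

After this step every remaining row is a word of dialogue, with its speaker and genre recorded in the same row.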
Here is a last view of the data, this time a bit longer.
linenumber | act | word | speaker | genre |
---|---|---|---|---|
11 | 1 | my | HOSTESS | prose |
11 | 1 | remedy | HOSTESS | prose |
11 | 1 | must | HOSTESS | prose |
11 | 1 | go | HOSTESS | prose |
11 | 1 | fetch | HOSTESS | prose |
11 | 1 | the | HOSTESS | prose |
12 | 1 | thirdborough | HOSTESS | prose |
13 | 1 | Third | SLY | prose |
13 | 1 | or | SLY | prose |
13 | 1 | fourth | SLY | prose |
13 | 1 | or | SLY | prose |
13 | 1 | fift | SLY | prose |
13 | 1 | borough | SLY | prose |
13 | 1 | I’ll | SLY | prose |
13 | 1 | answer | SLY | prose |
14 | 1 | him | SLY | prose |
14 | 1 | by | SLY | prose |
14 | 1 | law | SLY | prose |
14 | 1 | I’ll | SLY | prose |
14 | 1 | not | SLY | prose |
14 | 1 | budge | SLY | prose |
14 | 1 | an | SLY | prose |
14 | 1 | inch | SLY | prose |
14 | 1 | boy | SLY | prose |
14 | 1 | let | SLY | prose |
14 | 1 | him | SLY | prose |
14 | 1 | come | SLY | prose |
15 | 1 | and | SLY | prose |
15 | 1 | kindly | SLY | prose |
16 | 1 | Huntsman | LORD | verse |
16 | 1 | charge | LORD | verse |
16 | 1 | thee | LORD | verse |
16 | 1 | tender | LORD | verse |
16 | 1 | well | LORD | verse |
16 | 1 | my | LORD | verse |
16 | 1 | hounds | LORD | verse |
17 | 1 | Brach | LORD | verse |
17 | 1 | Merriman | LORD | verse |
17 | 1 | the | LORD | verse |
17 | 1 | poor | LORD | verse |
17 | 1 | cur | LORD | verse |
17 | 1 | is | LORD | verse |
17 | 1 | emboss’d | LORD | verse |
18 | 1 | And | LORD | verse |
18 | 1 | couple | LORD | verse |
18 | 1 | Clowder | LORD | verse |
18 | 1 | with | LORD | verse |
18 | 1 | the | LORD | verse |
18 | 1 | deep | LORD | verse |
18 | 1 | mouth’d | LORD | verse |
18 | 1 | brach | LORD | verse |
19 | 1 | Saw’st | LORD | verse |
19 | 1 | thou | LORD | verse |
19 | 1 | not | LORD | verse |
19 | 1 | boy | LORD | verse |
19 | 1 | how | LORD | verse |
19 | 1 | Silver | LORD | verse |
19 | 1 | made | LORD | verse |
19 | 1 | it | LORD | verse |
19 | 1 | good | LORD | verse |
20 | 1 | At | LORD | verse |
For attribution, please cite this work as
Hinrichs (2020, Oct. 18). Genre and Character in The Taming of the Shrew: 1 - Data Preparation. Retrieved from https://titus-and-shrew.netlify.app
BibTeX citation
@misc{hinrichs2020shrew-1,
  author = {Hinrichs, Lars},
  title = {Genre and Character in The Taming of the Shrew: 1 - Data Preparation},
  url = {https://titus-and-shrew.netlify.app},
  year = {2020}
}