4 - Part-of-Speech Distribution

Analysis of POS frequencies by character and genre.

Lars Hinrichs (https://larshinrichs.site), The University of Texas at Austin (https://liberalarts.utexas.edu/english)
10-15-2020

Applying POS-tags

The Shrew text was annotated using the functionality of the cleanNLP package for R (Arnold 2017). Tagging for speaker and genre at the word level was included as in previous steps of the analysis. Below is a preview of the data after part-of-speech tagging.

doc_id sid tid token token_with_ws lemma upos xpos tid_source relation linenumber speaker genre
1 1 1 I I -PRON- PRON PRP 3 nsubj 1 SLY prose
1 1 2 ’ll ’ll will VERB MD 3 aux 1 SLY prose
1 1 3 pheeze pheeze pheeze VERB VB 0 root 1 SLY prose
1 1 4 you you -PRON- PRON PRP 3 dobj 1 SLY prose
1 1 6 in in in ADP IN 3 prep 1 SLY prose
1 1 7 faith faith faith NOUN NN 6 pobj 1 SLY prose
2 1 1 A A a DET DT 2 det 2 HOSTESS prose
2 1 2 pair pair pair NOUN NN 0 root 2 HOSTESS prose
2 1 3 of of of ADP IN 2 prep 2 HOSTESS prose
2 1 4 stocks stocks stock NOUN NNS 3 pobj 2 HOSTESS prose
2 1 6 you you -PRON- PRON PRP 7 nsubj 2 HOSTESS prose
2 1 7 rogue rogue rogue VERB VBP 2 appos 2 HOSTESS prose
3 1 1 Y Y Y PROPN NNP 3 nsubj 3 SLY prose
3 1 3 are are be AUX VBP 9 ccomp 3 SLY prose
3 1 4 a a a DET DT 5 det 3 SLY prose
3 1 5 baggage baggage baggage NOUN NN 3 attr 3 SLY prose
3 1 7 the the the DET DT 8 det 3 SLY prose
3 1 8 Slys Slys Slys PROPN NNP 9 nsubj 3 SLY prose
3 1 9 are are be AUX VBP 0 root 3 SLY prose
3 1 10 no no no DET DT 11 det 3 SLY prose

Data selection

We’ll extract the 14 speakers with the most words from the data. They are:

speaker n
PETRUCHIO 4557
TRANIO 2361
KATHARINA 1832
HORTENSIO 1773
GRUMIO 1657
LUCENTIO 1443
BAPTISTA 1271
GREMIO 1187
LORD 1080
BIONDELLO 813
SLY 539
BIANCA 518
PEDANT 393
VINCENTIO 341
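
A selection like this can be sketched in base R as follows. This is only an illustration, not necessarily the code behind the table above: the toy `tokens` data frame, its speaker labels, and the `n_top` variable are hypothetical stand-ins for the annotated play.

```r
# Toy token table standing in for the annotated play (hypothetical data).
tokens <- data.frame(
  speaker = c("A", "A", "A", "B", "B", "C"),
  token   = c("good", "morrow", "friend", "pardon", "me", "nay"),
  stringsAsFactors = FALSE
)

# Count tokens per speaker and sort in decreasing order.
counts <- sort(table(tokens$speaker), decreasing = TRUE)

# Keep the n speakers with the most tokens (14 in the article, 2 here).
n_top <- 2
top_speakers <- names(counts)[seq_len(min(n_top, length(counts)))]

# Restrict the token table to those speakers.
selected <- tokens[tokens$speaker %in% top_speakers, ]
```

Here `selected` keeps only the rows spoken by the two most frequent toy speakers.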

Primer: why POS frequencies are interesting

Verbal style

Higher frequencies of verbs (and attendant parts of speech) indicate dynamic communication, social intelligence, action-focused modes of thought, and relational psychology, i.e. a focus on relating to other characters (Pennebaker et al. 2014; Pennebaker, Mehl, and Niederhoffer 2003; Biber 1991).

Nominal style

Higher frequencies of nouns (and attendant parts of speech) indicate conceptual thinking, declarative intelligence, epistemological interest, fact-oriented modes of thought, and investigative/academic psychology.

Note on significance

Because token counts in an analysis of POS-tags are high, even small differences in proportions can reach statistical significance (Hinrichs, Smith, and Waibel 2010).
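
The effect of sample size can be illustrated with a two-sample test of proportions. The counts below are invented for illustration, and `prop.test()` is just one convenient choice of test, not necessarily the one used in the cited work. The same difference (30% vs. 32%) is tested once with 500 tokens per group and once with 50,000.

```r
# Hypothetical counts: the same 30% vs. 32% difference at two sample sizes.
small <- prop.test(x = c(150, 160), n = c(500, 500))
large <- prop.test(x = c(15000, 16000), n = c(50000, 50000))

# Compare the p-values: the difference in proportions is identical,
# only the number of tokens behind it changes.
p_small <- small$p.value
p_large <- large$p.value
```

With the smaller groups the difference does not reach the conventional 0.05 threshold, while with the larger groups it does.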

POS frequencies in the data

The data has been part-of-speech tagged in the background. The tagger we’re using assigns a set of 36 different tags.

 [1] "ADD"  "CC"   "CD"   "DT"   "EX"   "FW"   "IN"   "JJ"   "JJR" 
[10] "JJS"  "MD"   "NN"   "NNP"  "NNPS" "NNS"  "PDT"  "POS"  "PRP" 
[19] "PRP$" "RB"   "RBR"  "RBS"  "RP"   "TO"   "UH"   "VB"   "VBD" 
[28] "VBG"  "VBN"  "VBP"  "VBZ"  "WDT"  "WP"   "WP$"  "WRB"  "XX"  

Thankfully, it also has a set of meta-categories, so I won’t need to define any myself. They are:

 [1] "ADJ"   "ADP"   "ADV"   "AUX"   "CCONJ" "DET"   "INTJ"  "NOUN" 
 [9] "NUM"   "PART"  "PRON"  "PROPN" "SCONJ" "VERB"  "X"    

There are 15 of them.

Definition of POS indices

We are interested in “verbal” vs. “nominal” style. These can be measured in the frequencies of the “VERB” and the “NOUN” tags, respectively, but I also want to include the attendant POS groups that co-vary with those two (Mair 1997; Hinrichs, Smith, and Waibel 2010).

So I’ll form two index groups:

verb_index <- c("VERB", "AUX", "ADV")
noun_index <- c("NOUN", "PROPN", "ADP", "DET", "ADJ")
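
As a minimal sketch of how these two vectors might be applied, the example below assigns each token to a "noun", "verb", or "other" group and computes a noun:verb ratio per speaker. The toy `tokens` data frame reuses the speaker and upos values of the first twelve rows of the preview above; the column name `index_group` and the other helper names are assumptions, not taken from the original analysis.

```r
verb_index <- c("VERB", "AUX", "ADV")
noun_index <- c("NOUN", "PROPN", "ADP", "DET", "ADJ")

# Speaker and upos values from the first twelve rows of the preview.
tokens <- data.frame(
  speaker = rep(c("SLY", "HOSTESS"), each = 6),
  upos    = c("PRON", "VERB", "VERB", "PRON", "ADP", "NOUN",
              "DET", "NOUN", "ADP", "NOUN", "PRON", "VERB"),
  stringsAsFactors = FALSE
)

# Assign every token to one of the three index groups.
tokens$index_group <- ifelse(
  tokens$upos %in% verb_index, "verb",
  ifelse(tokens$upos %in% noun_index, "noun", "other")
)

# Cross-tabulate speakers against index groups.
group_counts <- table(tokens$speaker, tokens$index_group)

# Drop "other" and express noun vs. verb tokens as shares and as a ratio.
nv       <- group_counts[, c("noun", "verb")]
nv_share <- prop.table(nv, margin = 1)
nv_ratio <- nv[, "noun"] / nv[, "verb"]
```

With only six tokens per speaker these numbers mean nothing in themselves; the point is the shape of the computation.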

Here are the relationships between the verbal and nominal indices for the top 14 characters.

Figure 1: Tokens in the index groups for nouns, verbs, and “other.”

This graph does not communicate very clearly what we actually want to know, which is the ratio between frequency of tags in the noun group and those in the verb group. So let’s eliminate “other” and focus only on the verb and noun tags.

Figure 2: Relative frequency of POS tags in the noun and verb index groups, shown as the noun:verb ratio.

Finally, we can break up the noun:verb ratio for each speaker by genre.

Figure 3: Relative frequency of tags in n/v index groups, by genre.
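
A by-genre breakdown of this kind could be computed roughly as shown below. The data are invented, the speaker labels are placeholders, and the genre label "verse" is an assumption (only "prose" appears in the preview); the sketch shows one possible approach rather than the code behind Figure 3.

```r
verb_index <- c("VERB", "AUX", "ADV")
noun_index <- c("NOUN", "PROPN", "ADP", "DET", "ADJ")

# Hypothetical tokens for two placeholder speakers in two genres.
tokens <- data.frame(
  speaker = rep(c("SPEAKER_A", "SPEAKER_B"), times = c(7, 8)),
  genre   = c(rep("prose", 3), rep("verse", 4),
              rep("prose", 4), rep("verse", 4)),
  upos    = c("NOUN", "VERB", "VERB",
              "NOUN", "DET", "ADJ", "VERB",
              "VERB", "AUX", "PRON", "NOUN",
              "NOUN", "ADP", "ADV", "PRON"),
  stringsAsFactors = FALSE
)

# Assign index groups, then drop everything outside the two indices.
tokens$index_group <- ifelse(
  tokens$upos %in% verb_index, "verb",
  ifelse(tokens$upos %in% noun_index, "noun", "other")
)
nv_tokens <- tokens[tokens$index_group != "other", ]

# Three-way table: speaker x genre x index group.
counts <- xtabs(~ speaker + genre + index_group, data = nv_tokens)

# Within each speaker-genre cell, the noun and verb shares sum to 1.
shares <- prop.table(counts, margin = c(1, 2))
```

Each speaker-genre combination in the toy data contains at least one noun-group or verb-group token, so no share is undefined; with real data, empty cells would need handling.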

References

Arnold, Taylor. 2017. “A Tidy Data Model for Natural Language Processing Using cleanNLP.” The R Journal 9 (2): 1–20. https://journal.r-project.org/archive/2017/RJ-2017-035/index.html.
Biber, Douglas. 1991. Variation Across Speech and Writing. Cambridge University Press.
Hinrichs, Lars, Nicholas Smith, and Birgit Waibel. 2010. “Manual of Information for the Part-of-Speech-Tagged, Post-Edited ‘Brown’ Corpora.” ICAME Journal 34: 189–231. https://pdfs.semanticscholar.org/9f36/194ba486ea9f785da5a9a1bc5ac9198932c1.pdf.
Mair, Christian. 1997. “Parallel Corpora: A Real-Time Approach to the Study of Language Change in Progress.” In, edited by M. Ljung, 195–209. Amsterdam: Rodopi.
Pennebaker, James W., Cindy K. Chung, Joey Frazee, Gary M. Lavergne, and David I. Beaver. 2014. “When Small Words Foretell Academic Success: The Case of College Admissions Essays.” PloS One 9 (12): e115844. https://doi.org/10.1371/journal.pone.0115844.
Pennebaker, James W., Matthias R. Mehl, and Kate G. Niederhoffer. 2003. “Psychological Aspects of Natural Language Use: Our Words, Our Selves.” Annual Review of Psychology 54 (1): 547–77. https://doi.org/10.1146/annurev.psych.54.101601.145041.

Citation

For attribution, please cite this work as

Hinrichs (2020, Oct. 15). Genre and Character in The Taming of the Shrew: 4 - Part-of-Speech Distribution. Retrieved from https://titus-and-shrew.netlify.app

BibTeX citation

@misc{hinrichs2020shrew-4,
  author = {Hinrichs, Lars},
  title = {Genre and Character in The Taming of the Shrew: 4 - Part-of-Speech Distribution},
  url = {https://titus-and-shrew.netlify.app},
  year = {2020}
}