Genre and Character in The Taming of the Shrew: 4 - Part-of-Speech Distribution

Lars Hinrichs

Applying POS-tags

The Shrew text was annotated with the cleanNLP package for R (Arnold 2017). As in previous steps of the analysis, each word is also tagged for speaker and genre. Below is a preview of the data after part-of-speech tagging.

doc_id	sid	tid	token	token_with_ws	lemma	upos	xpos	tid_source	relation	linenumber	speaker	genre
1	1	1	I	I	-PRON-	PRON	PRP	3	nsubj	1	SLY	prose
1	1	2	’ll	’ll	will	VERB	MD	3	aux	1	SLY	prose
1	1	3	pheeze	pheeze	pheeze	VERB	VB	0	root	1	SLY	prose
1	1	4	you	you	-PRON-	PRON	PRP	3	dobj	1	SLY	prose
1	1	6	in	in	in	ADP	IN	3	prep	1	SLY	prose
1	1	7	faith	faith	faith	NOUN	NN	6	pobj	1	SLY	prose
2	1	1	A	A	a	DET	DT	2	det	2	HOSTESS	prose
2	1	2	pair	pair	pair	NOUN	NN	0	root	2	HOSTESS	prose
2	1	3	of	of	of	ADP	IN	2	prep	2	HOSTESS	prose
2	1	4	stocks	stocks	stock	NOUN	NNS	3	pobj	2	HOSTESS	prose
2	1	6	you	you	-PRON-	PRON	PRP	7	nsubj	2	HOSTESS	prose
2	1	7	rogue	rogue	rogue	VERB	VBP	2	appos	2	HOSTESS	prose
3	1	1	Y	Y	Y	PROPN	NNP	3	nsubj	3	SLY	prose
3	1	3	are	are	be	AUX	VBP	9	ccomp	3	SLY	prose
3	1	4	a	a	a	DET	DT	5	det	3	SLY	prose
3	1	5	baggage	baggage	baggage	NOUN	NN	3	attr	3	SLY	prose
3	1	7	the	the	the	DET	DT	8	det	3	SLY	prose
3	1	8	Slys	Slys	Slys	PROPN	NNP	9	nsubj	3	SLY	prose
3	1	9	are	are	be	AUX	VBP	0	root	3	SLY	prose
3	1	10	no	no	no	DET	DT	11	det	3	SLY	prose
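
The preview combines tagger output (token, lemma, upos, xpos, and so on) with line-level metadata (speaker, genre). The following is a minimal sketch of how such a combination could be produced in base R. It is not the pipeline actually used here: the table names `tokens` and `lines_meta` are hypothetical, and the toy values are copied from the preview above.

```r
# Toy token table with a few of the tagger's output columns (values from the preview).
tokens <- data.frame(
  doc_id = c(1, 1, 1, 2, 2),
  tid    = c(1, 3, 4, 1, 2),
  token  = c("I", "pheeze", "you", "A", "pair"),
  upos   = c("PRON", "VERB", "PRON", "DET", "NOUN"),
  stringsAsFactors = FALSE
)

# Hypothetical line-level metadata: one row per line of the play.
lines_meta <- data.frame(
  doc_id  = c(1, 2),
  speaker = c("SLY", "HOSTESS"),
  genre   = c("prose", "prose"),
  stringsAsFactors = FALSE
)

# Left join: every token inherits the speaker and genre of its line.
tagged <- merge(tokens, lines_meta, by = "doc_id", all.x = TRUE)

# Restore reading order (line, then token position within the line).
tagged <- tagged[order(tagged$doc_id, tagged$tid), ]
print(tagged)
```

Keeping the metadata in a separate line-level table means speaker and genre are stored once per line and only copied onto tokens at join time.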

Data selection

We’ll extract the 14 speakers with the most words from the data. They are:

speaker	n
PETRUCHIO	4557
TRANIO	2361
KATHARINA	1832
HORTENSIO	1773
GRUMIO	1657
LUCENTIO	1443
BAPTISTA	1271
GREMIO	1187
LORD	1080
BIONDELLO	813
SLY	539
BIANCA	518
PEDANT	393
VINCENTIO	341
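
A selection like this can be sketched in base R as follows. The toy `tokens` table and the cutoff of 2 (instead of 14) are assumptions for illustration only; the real data has one row per word of the play.

```r
# Hypothetical token table: one row per word, with its speaker.
tokens <- data.frame(
  speaker = c("PETRUCHIO", "PETRUCHIO", "PETRUCHIO",
              "TRANIO", "TRANIO", "SLY"),
  token   = c("come", "kiss", "me", "sir", "yes", "faith"),
  stringsAsFactors = FALSE
)

# Number of speakers to keep (14 in the analysis, 2 in this toy example).
n_speakers <- 2

# Count tokens per speaker and sort from most to fewest words.
speaker_counts <- sort(table(tokens$speaker), decreasing = TRUE)

# Names of the top speakers, then the subset of tokens spoken by them.
top_speakers <- names(head(speaker_counts, n_speakers))
selected <- tokens[tokens$speaker %in% top_speakers, ]
print(speaker_counts)
```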

Primer: why POS frequencies are interesting

Verbal style

Higher frequencies of verbs (and attendant parts of speech) have been associated with dynamic communication, social intelligence, action-focused modes of thought, and a relational psychology, i.e. one oriented toward other characters (Pennebaker et al. 2014; Pennebaker, Mehl, and Niederhoffer 2003; Biber 1991).

Nominal style

Higher frequencies of nouns (and attendant parts of speech) have been associated with conceptual thinking, declarative intelligence, epistemological interest, fact-oriented modes of thought, and an investigative or academic psychology.

Note on significance

Because token counts in an analysis of POS-tags are high, even small differences in proportions can be statistically significant (Hinrichs, Smith, and Waibel 2010).
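
To illustrate the point, here is a small sketch using base R's `prop.test()`. The counts are invented for the example and are not taken from the Shrew data: in both comparisons the share of verb-index tokens is 20% for one speaker and 21% for the other, and only the sample size differs.

```r
# Hypothetical counts: verb-index tokens (x) out of all tokens (n) for two speakers.
# Same proportions (20% vs. 21%) at two very different sample sizes.
small <- prop.test(x = c(200, 210), n = c(1000, 1000))
large <- prop.test(x = c(20000, 21000), n = c(100000, 100000))

# The one-point difference is far from significant with 1,000 tokens per speaker ...
print(small$p.value)

# ... but clearly significant with 100,000 tokens per speaker.
print(large$p.value)
```

The same difference in proportions thus falls on either side of the conventional 0.05 threshold depending only on how many tokens are counted.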

POS frequencies in the data

The data has been part-of-speech tagged in the background. The tagger we’re using assigns a set of 36 different fine-grained tags (the xpos column):

 [1] "ADD"  "CC"   "CD"   "DT"   "EX"   "FW"   "IN"   "JJ"   "JJR" 
[10] "JJS"  "MD"   "NN"   "NNP"  "NNPS" "NNS"  "PDT"  "POS"  "PRP" 
[19] "PRP$" "RB"   "RBR"  "RBS"  "RP"   "TO"   "UH"   "VB"   "VBD" 
[28] "VBG"  "VBN"  "VBP"  "VBZ"  "WDT"  "WP"   "WP$"  "WRB"  "XX"

Thankfully, it also has a set of meta-categories (the upos column), so I won’t need to define any myself. They are:

 [1] "ADJ"   "ADP"   "ADV"   "AUX"   "CCONJ" "DET"   "INTJ"  "NOUN" 
 [9] "NUM"   "PART"  "PRON"  "PROPN" "SCONJ" "VERB"  "X"

There are 15 of them.

Definition of POS indices

We are interested in “verbal” vs. “nominal” style. These can be measured in the frequencies of the “VERB” and the “NOUN” tags, respectively, but I want to also include the attendant POS groups that co-vary with those two (Mair 1997; Hinrichs, Smith, and Waibel 2010):

verbs co-vary with auxiliary verbs and adverbs; and
nouns co-vary with determiners, adjectives, and prepositions.

So I’ll form two index groups:

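# Verbal index: lexical verbs plus the auxiliaries and adverbs that co-vary with them
verb_index <- c("VERB", "AUX", "ADV")
# Nominal index: common and proper nouns plus adpositions, determiners, and adjectives
noun_index <- c("NOUN", "PROPN", "ADP", "DET", "ADJ")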
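
With the two index vectors defined, each token can be assigned to the noun group, the verb group, or "other", and the groups can be counted per speaker. The sketch below shows one way to do this in base R. The helper `index_group()` and the toy `tokens` table are hypothetical and serve only as illustration.

```r
verb_index <- c("VERB", "AUX", "ADV")
noun_index <- c("NOUN", "PROPN", "ADP", "DET", "ADJ")

# Hypothetical helper: map a vector of universal POS tags to an index group.
index_group <- function(upos) {
  ifelse(upos %in% verb_index, "verb",
         ifelse(upos %in% noun_index, "noun", "other"))
}

# Toy token table with speaker and universal POS tag.
tokens <- data.frame(
  speaker = c("SLY", "SLY", "SLY", "SLY",
              "HOSTESS", "HOSTESS", "HOSTESS", "HOSTESS"),
  upos    = c("PRON", "VERB", "VERB", "NOUN",
              "DET", "NOUN", "ADP", "NOUN"),
  stringsAsFactors = FALSE
)

# Label every token, then cross-tabulate speakers against index groups.
tokens$group <- index_group(tokens$upos)
counts <- table(tokens$speaker, tokens$group)
print(counts)
```

Tags outside both indices (pronouns, conjunctions, interjections, and so on) fall into "other" by default, so adding a tag to either index only requires changing the vector.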

Here is how the verbal and nominal indices relate to each other for each of the 14 top speakers.

Figure 1: Tokens in the index groups for nouns, verbs, and “other.”

This graph does not communicate very clearly what we actually want to know, which is the ratio between frequency of tags in the noun group and those in the verb group. So let’s eliminate “other” and focus only on the verb and noun tags.

Figure 2: Relative frequency of POS-tags in the n/v index groups, shown as the ratio of n to v.

Finally, we can break up the noun:verb ratio for each speaker by genre.

Figure 3: Relative frequency of tags in n/v index groups, by genre.
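
The breakdown behind a figure of this kind can be sketched in base R as follows. The toy `tokens` table, with an index group already assigned to each token, is an assumption for illustration and does not reproduce the actual counts.

```r
# Hypothetical token table: speaker, genre, and index group per token.
tokens <- data.frame(
  speaker = c(rep("PETRUCHIO", 6), rep("KATHARINA", 6)),
  genre   = rep(c("verse", "verse", "verse", "prose", "prose", "prose"), 2),
  group   = c("noun", "verb", "verb", "noun", "noun", "other",
              "noun", "noun", "verb", "verb", "other", "verb"),
  stringsAsFactors = FALSE
)

# Drop "other" so that only noun-index and verb-index tokens remain.
nv <- tokens[tokens$group %in% c("noun", "verb"), ]

# Three-way table: speaker x genre x group.
counts <- table(nv$speaker, nv$genre, nv$group)

# Proportions within each speaker-by-genre cell (noun + verb sum to 1).
shares <- prop.table(counts, margin = c(1, 2))
print(round(shares[, , "noun"], 2))
```

Shares are used here rather than a raw noun/verb quotient because a cell without any verb-index tokens would otherwise produce a division by zero.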

Arnold, Taylor. 2017. “A Tidy Data Model for Natural Language Processing Using cleanNLP.” The R Journal 9 (2): 1–20. https://journal.r-project.org/archive/2017/RJ-2017-035/index.html.

Biber, Douglas. 1991. Variation Across Speech and Writing. Cambridge University Press.

Hinrichs, Lars, Nicholas Smith, and Birgit Waibel. 2010. “Manual of Information for the Part-of-Speech-Tagged, Post-Edited ‘Brown’ Corpora.” ICAME Journal 34: 189–231. https://pdfs.semanticscholar.org/9f36/194ba486ea9f785da5a9a1bc5ac9198932c1.pdf.

Mair, Christian. 1997. “Parallel Corpora: A Real-Time Approach to the Study of Language Change in Progress.” In Corpus-Based Studies in English, edited by M. Ljung, 195–209. Amsterdam: Rodopi.

Pennebaker, James W., Cindy K. Chung, Joey Frazee, Gary M. Lavergne, and David I. Beaver. 2014. “When Small Words Foretell Academic Success: The Case of College Admissions Essays.” PloS One 9 (12): e115844. https://doi.org/10.1371/journal.pone.0115844.

Pennebaker, James W., Matthias R. Mehl, and Kate G. Niederhoffer. 2003. “Psychological Aspects of Natural Language Use: Our Words, Our Selves.” Annual Review of Psychology 54 (1): 547–77. https://doi.org/10.1146/annurev.psych.54.101601.145041.
