0 - A Look at the Toolbox

Some background on methods and possibilities of digital text analysis, specifically in studying plays.

Lars Hinrichs https://larshinrichs.site (The University of Texas at Austin)https://liberalarts.utexas.edu/english
10-20-2020

What is digital text analysis?

Having read the NYT piece about social media text analysis, and perhaps from your own life in this digital world of ours!!, you have at least heard about some of the methods that a “computational sociolinguist” or a “digital humanist” might apply to text.

Sentiment analysis

“Positive” versus “negative” words

Sentiment analysis: n pos v. n neg per text.
Sentiment analysis: comparing pos:neg ratios.
Sentiment analysis: split each text into 100 slices, then track sentiment curve.
Sentiment analysis: same as above, but with act boundaries marked.

Tag for more than 2 emotions

Sentiment analysis: pos, neg, and 8 other sentiment categories.

Tag for parts of speech

knitr::include_graphics("pos_tagset.png", dpi=100)

Tag for semantic classes

knitr::include_graphics("semantic_tagset.png", dpi=100)

Start by loading our data

First we’ll get an inventory of all the .DOCX files in our project directory.

(textfiles <- list.files(pattern = ".docx"))
[1] "prose and verse csv 500 words.docx"
[2] "Prose and Verse from Shrew.docx"   

The second file has more formatting and may, for the time being, be more useful. We’ll read it in. Here is the top of its content.

f <- read_docx(textfiles[2])
f %>% head(10)
 [1] "I am  Christophero  Sly; call not me ' honour ' nor 'lordship.' I"
 [2] "ne'er drank sack in my life; and if you give me any conserves,"   
 [3] "give me conserves of beef. Ne'er ask me what raiment I'll wear,"  
 [4] "for I have no more doublets than backs, no more stockings than"   
 [5] "legs, nor no more shoes than feet- nay, sometime more feet than"  
 [6] "shoes, or such shoes as my toes look through the  overleather ."  
 [7] "Lord. Heaven cease this idle  humour  in your  honour !"          
 [8] "What, would you make me mad? Am not I Christopher Sly, old"       
 [9] "Sly's  son of Burton Heath; by birth a  pedlar , by education a"  
[10] "cardmaker , by transmutation a bear-herd, and now by present"     

Here it is, converted to tabular format.

f <- 
  f %>% 
  enframe() %>% 
  select(line = 2) %>% 
  mutate(file = textfiles[2])
f %>% 
  head(10) %>% 
  kbl() %>% 
  kable_styling(bootstrap_options = "striped", 
                full_width = F, 
                position = "left") 
line file
I am Christophero Sly; call not me ’ honour ’ nor ‘lordship.’ I Prose and Verse from Shrew.docx
ne’er drank sack in my life; and if you give me any conserves, Prose and Verse from Shrew.docx
give me conserves of beef. Ne’er ask me what raiment I’ll wear, Prose and Verse from Shrew.docx
for I have no more doublets than backs, no more stockings than Prose and Verse from Shrew.docx
legs, nor no more shoes than feet- nay, sometime more feet than Prose and Verse from Shrew.docx
shoes, or such shoes as my toes look through the overleather . Prose and Verse from Shrew.docx
Lord. Heaven cease this idle humour in your honour ! Prose and Verse from Shrew.docx
What, would you make me mad? Am not I Christopher Sly, old Prose and Verse from Shrew.docx
Sly’s son of Burton Heath; by birth a pedlar , by education a Prose and Verse from Shrew.docx
cardmaker , by transmutation a bear-herd, and now by present Prose and Verse from Shrew.docx

Narrowing down our interest

I have just received the latest notes from class: the interest seems to be in the difference between verse and prose crossed with an interest in character. I interpret this to mean that there are characters that appear both in verse and in prose sections, and we want to know how the language associated with each character differs from verse to prose.

In the text samples I have, character speech is not marked. We might be better off if we pull Taming of the Shrew from www.gutenberg.org.

Luckily there is an R module that interfaces with the Gutenberg archives. It is called gutenbergr. I will pull the whole text from there.

shakespeare <- 
  gutenberg_works(author == "Shakespeare, William") 

shakespeare                    %>% 
  mutate(entry = row_number()) %>% 
  select(entry, title)  %>% 
  kbl() %>% 
  kable_styling(bootstrap_options = "striped", 
                full_width = F, 
                position = "left") 
entry title
1 Shakespeare’s Sonnets
2 Venus and Adonis
3 King Henry VI, First Part
4 History of King Henry the Sixth, Second Part
5 The History of King Henry the Sixth, Third Part
6 The Tragedy of King Richard III
7 The Comedy of Errors
8 The Rape of Lucrece
9 The Tragedy of Titus Andronicus
10 The Taming of the Shrew
11 The Two Gentlemen of Verona
12 Love’s Labour’s Lost
13 King John
14 The Tragedy of King Richard the Second
15 Romeo and Juliet
16 A Midsummer Night’s Dream
17 The Merchant of Venice
18 King Henry IV, the First Part
19 The Merry Wives of Windsor
20 King Henry IV, Second Part
21 Much Ado about Nothing
22 The Life of King Henry V
23 Julius Caesar
24 As You Like It
25 Hamlet, Prince of Denmark
26 The Phoenix and the Turtle
27 Twelfth Night; Or, What You Will
28 The History of Troilus and Cressida
29 All’s Well That Ends Well
30 Measure for Measure
31 Othello, the Moor of Venice
32 The Tragedy of King Lear
33 Macbeth
34 Antony and Cleopatra
35 The Tragedy of Coriolanus
36 The Life of Timon of Athens
37 Pericles, Prince of Tyre
38 Cymbeline
39 The Winter’s Tale
40 The Tempest
41 The Life of Henry the Eighth
42 A Lover’s Complaint
43 The Passionate Pilgrim
44 Twelfth Night
45 Richard II
46 Henry IV, Part 1
47 Henry V
48 Henry VI, Part 1
49 Henry VI, Part 2
50 Henry VI, Part 3
51 Richard III
52 Henry VIII
53 Coriolanus
54 Titus Andronicus
55 Timon of Athens
56 Hamlet
57 King Lear
58 Othello
59 Shakespeare’s First Folio
60 The Tragicall Historie of Hamlet, Prince of Denmarke The First (‘Bad’) Quarto
61 The Tragedie of Hamlet, Prince of Denmark A Study with the Text of the Folio of 1623
62 Shakespeare’s play of the Merchant of Venice Arranged for Representation at the Princess’s Theatre, with Historical and Explanatory Notes by Charles Kean, F.S.A.
63 King Henry the Fifth Arranged for Representation at the Princess’s Theatre
64 The Works of William Shakespeare [Cambridge Edition] [9 vols.] Introduction and Publisher’s Advertising
65 The Tempest The Works of William Shakespeare [Cambridge Edition] [9 vols.]
66 Two Gentlemen of Verona The Works of William Shakespeare [Cambridge Edition] [9 vols.]
67 The Merry Wives of Windsor The Works of William Shakespeare [Cambridge Edition] [9 vols.]
68 Measure for Measure The Works of William Shakespeare [Cambridge Edition] [9 vols.]
69 The Comedy of Errors The Works of William Shakespeare [Cambridge Edition] [9 vols.]
70 The New Hudson Shakespeare: Julius Cæsar
71 The Works of William Shakespeare [Cambridge Edition] [Vol. 2]
72 Shakespeare’s Comedy of The Tempest
73 The Works of William Shakespeare [Cambridge Edition] [Vol. 7 of 9]
74 Shakespeare’s Tragedy of Romeo and Juliet
75 The Works of William Shakespeare [Cambridge Edition] [Vol. 6 of 9 vols.]
76 The Works of William Shakespeare [Cambridge Edition] [Vol. 8 of 9 vols.]
77 The Works of William Shakespeare [Cambridge Edition] [Vol. 5 of 9] I, II, and III King Henry Sixth; King Richard III; and two other related plays.
78 The Works of William Shakespeare - Cambridge Edition (4 of 9) (1863)
79 The Works of William Shakespeare - Cambridge Edition (3 of 9) (1863)

So our text is there and available, at index no. 10.

IDs <-  shakespeare[c(10),]$gutenberg_id

shakespeare %>% 
  filter(gutenberg_id %in% IDs) %>% 
  select(gutenberg_id, title)
# A tibble: 1 x 2
  gutenberg_id title                  
         <int> <chr>                  
1         1508 The Taming of the Shrew

Let’s download the play and store it in an R object. I will call it plays (plural in case we add more texts or something.)

plays <- gutenberg_download(IDs, meta_fields = "title")

plays %>% as_tibble()
# A tibble: 4,829 x 3
   gutenberg_id text                       title                  
          <int> <chr>                      <chr>                  
 1         1508 "THE TAMING OF THE SHREW"  The Taming of the Shrew
 2         1508 ""                         The Taming of the Shrew
 3         1508 "by William Shakespeare"   The Taming of the Shrew
 4         1508 ""                         The Taming of the Shrew
 5         1508 ""                         The Taming of the Shrew
 6         1508 ""                         The Taming of the Shrew
 7         1508 ""                         The Taming of the Shrew
 8         1508 "Dramatis Personae"        The Taming of the Shrew
 9         1508 ""                         The Taming of the Shrew
10         1508 "Persons in the Induction" The Taming of the Shrew
# … with 4,819 more rows

I’ll convert the data to one-word-per row now. It’s the format we need for any of these analyses.

sentiments <- plays %>%
  #group_by(title) %>%
  mutate(line = row_number()) %>% 
  unnest_tokens(word, text,
                to_lower = F) 

sentiments %>% 
  slice(150:190) %>% 
  kbl() %>% 
  kable_styling(bootstrap_options = "striped", 
                full_width = F, 
                position = "left") 
gutenberg_id title line word
1508 The Taming of the Shrew 68 Therefore
1508 The Taming of the Shrew 68 paucas
1508 The Taming of the Shrew 69 pallabris
1508 The Taming of the Shrew 69 let
1508 The Taming of the Shrew 69 the
1508 The Taming of the Shrew 69 world
1508 The Taming of the Shrew 69 slide
1508 The Taming of the Shrew 69 Sessa
1508 The Taming of the Shrew 71 HOSTESS
1508 The Taming of the Shrew 72 You
1508 The Taming of the Shrew 72 will
1508 The Taming of the Shrew 72 not
1508 The Taming of the Shrew 72 pay
1508 The Taming of the Shrew 72 for
1508 The Taming of the Shrew 72 the
1508 The Taming of the Shrew 72 glasses
1508 The Taming of the Shrew 72 you
1508 The Taming of the Shrew 72 have
1508 The Taming of the Shrew 72 burst
1508 The Taming of the Shrew 74 SLY
1508 The Taming of the Shrew 75 No
1508 The Taming of the Shrew 75 not
1508 The Taming of the Shrew 75 a
1508 The Taming of the Shrew 75 denier
1508 The Taming of the Shrew 75 Go
1508 The Taming of the Shrew 75 by
1508 The Taming of the Shrew 75 Saint
1508 The Taming of the Shrew 75 Jeronimy
1508 The Taming of the Shrew 75 go
1508 The Taming of the Shrew 75 to
1508 The Taming of the Shrew 75 thy
1508 The Taming of the Shrew 75 cold
1508 The Taming of the Shrew 75 bed
1508 The Taming of the Shrew 76 and
1508 The Taming of the Shrew 76 warm
1508 The Taming of the Shrew 76 thee
1508 The Taming of the Shrew 78 HOSTESS
1508 The Taming of the Shrew 79 I
1508 The Taming of the Shrew 79 know
1508 The Taming of the Shrew 79 my
1508 The Taming of the Shrew 79 remedy

With my well-trained eye, I notice that “character names designating who’s speaking” are all set in all-caps. Given this material property, we can insert a column into the data that states, for each word, who is speaking it.

play_with_speaker <- 
  sentiments %>% 
  mutate(speaker = NA)

two_word_name <- FALSE

speakername <- NA

for (i in (1:nrow(play_with_speaker))){
  
  # define the word that this loop looks at
  myword = play_with_speaker[i,] %>% 
    pull(word)
  
  # check whether word is identitcal to its all-caps conversion
  # if yes, then set it as speaker name
  if (myword == toupper(myword) & nchar(myword) > 1){
    speakername = myword
  }
  
  # write speaker name in speaker column
  play_with_speaker$speaker[i] = speakername
  cat("\014", i, "of", nrow(play_with_speaker))
}
cat("\014")

We’ll export the dataset we’ve created so that next time the chunk won’t need to run. We’ll only need to import the data then.

play_with_speaker %>% 
  export("play_with_speaker.csv")
play_with_speaker <- 
  import("play_with_speaker.csv")

Let’s get a printout of the number of words spoken by each character.

play_with_speaker %>% 
  count(speaker, sort = T) %>% 
  filter(n > 25) %>% 
  kbl(caption = "Words per character (showing only those with more than 25)") %>% 
  kable_styling(position = "left",
                bootstrap_options = "striped", 
                full_width = F)
Table 1: Words per character (showing only those with more than 25)
speaker n
PETRUCHIO 4494
TRANIO 2423
KATHERINA 1900
HORTENSIO 1882
LUCENTIO 1625
GRUMIO 1540
GREMIO 1354
BAPTISTA 1177
BIONDELLO 933
LORD 808
BIANCA 677
SERVANT 600
SLY 598
VINCENTIO 469
PEDANT 453
PLAYERS 214
CURTIS 199
TAILOR 139
PAGE 136
SERVANTS 112
WIDOW 112
HUNTSMAN 110
SCENE 57
PLAYER 35
HOSTESS 34
NATHANIEL 26

For later use, let’s retain a list of the speakers who have significant enough amounts of speech. Let’s set the cutoff even a bit higher, at 400 words.

sig_speakers <- 
  play_with_speaker %>% 
  count(speaker) %>% 
  arrange(desc(n)) %>% 
  filter(n >400) %>% 
  pull(speaker)
sig_speakers
 [1] "PETRUCHIO" "TRANIO"    "KATHERINA" "HORTENSIO" "LUCENTIO" 
 [6] "GRUMIO"    "GREMIO"    "BAPTISTA"  "BIONDELLO" "LORD"     
[11] "BIANCA"    "SERVANT"   "SLY"       "VINCENTIO" "PEDANT"   

Below is a preview of the data format in which every row shows in the speaker column which character speaks it.

play_with_speaker %>% 
  slice(340:400) %>% 
  kbl() %>% 
  kable_styling(position = "left",
                full_width = F)
gutenberg_id title line word speaker
1508 The Taming of the Shrew 108 esteem LORD
1508 The Taming of the Shrew 108 him LORD
1508 The Taming of the Shrew 108 worth LORD
1508 The Taming of the Shrew 108 a LORD
1508 The Taming of the Shrew 108 dozen LORD
1508 The Taming of the Shrew 108 such LORD
1508 The Taming of the Shrew 109 But LORD
1508 The Taming of the Shrew 109 sup LORD
1508 The Taming of the Shrew 109 them LORD
1508 The Taming of the Shrew 109 well LORD
1508 The Taming of the Shrew 109 and LORD
1508 The Taming of the Shrew 109 look LORD
1508 The Taming of the Shrew 109 unto LORD
1508 The Taming of the Shrew 109 them LORD
1508 The Taming of the Shrew 109 all LORD
1508 The Taming of the Shrew 110 To LORD
1508 The Taming of the Shrew 110 morrow LORD
1508 The Taming of the Shrew 110 I LORD
1508 The Taming of the Shrew 110 intend LORD
1508 The Taming of the Shrew 110 to LORD
1508 The Taming of the Shrew 110 hunt LORD
1508 The Taming of the Shrew 110 again LORD
1508 The Taming of the Shrew 112 FIRST FIRST
1508 The Taming of the Shrew 112 HUNTSMAN HUNTSMAN
1508 The Taming of the Shrew 113 I HUNTSMAN
1508 The Taming of the Shrew 113 will HUNTSMAN
1508 The Taming of the Shrew 113 my HUNTSMAN
1508 The Taming of the Shrew 113 lord HUNTSMAN
1508 The Taming of the Shrew 115 LORD LORD
1508 The Taming of the Shrew 116 Sees LORD
1508 The Taming of the Shrew 116 Sly LORD
1508 The Taming of the Shrew 116 What’s LORD
1508 The Taming of the Shrew 116 here LORD
1508 The Taming of the Shrew 116 One LORD
1508 The Taming of the Shrew 116 dead LORD
1508 The Taming of the Shrew 116 or LORD
1508 The Taming of the Shrew 116 drunk LORD
1508 The Taming of the Shrew 117 See LORD
1508 The Taming of the Shrew 117 doth LORD
1508 The Taming of the Shrew 117 he LORD
1508 The Taming of the Shrew 117 breathe LORD
1508 The Taming of the Shrew 119 SECOND SECOND
1508 The Taming of the Shrew 119 HUNTSMAN HUNTSMAN
1508 The Taming of the Shrew 120 He HUNTSMAN
1508 The Taming of the Shrew 120 breathes HUNTSMAN
1508 The Taming of the Shrew 120 my HUNTSMAN
1508 The Taming of the Shrew 120 lord HUNTSMAN
1508 The Taming of the Shrew 120 Were HUNTSMAN
1508 The Taming of the Shrew 120 he HUNTSMAN
1508 The Taming of the Shrew 120 not HUNTSMAN
1508 The Taming of the Shrew 120 warm’d HUNTSMAN
1508 The Taming of the Shrew 120 with HUNTSMAN
1508 The Taming of the Shrew 120 ale HUNTSMAN
1508 The Taming of the Shrew 121 This HUNTSMAN
1508 The Taming of the Shrew 121 were HUNTSMAN
1508 The Taming of the Shrew 121 a HUNTSMAN
1508 The Taming of the Shrew 121 bed HUNTSMAN
1508 The Taming of the Shrew 121 but HUNTSMAN
1508 The Taming of the Shrew 121 cold HUNTSMAN
1508 The Taming of the Shrew 121 to HUNTSMAN
1508 The Taming of the Shrew 121 sleep HUNTSMAN

Tag for sentiment

For now, we’ll just tag for pos. and neg. sentiment words

sentiment <- 
  play_with_speaker %>% 
  mutate(word = tolower(word)) %>% 
  anti_join(stop_words) %>%          
  inner_join(get_sentiments("bing")) %>% 
  select(speaker, word, sentiment)

sentiment %>% 
  slice(300:330) %>% 
  kbl() %>% 
  kable_styling(position = "left",
                full_width = F)
speaker word sentiment
LUCENTIO sweet positive
LUCENTIO beauty positive
LUCENTIO humble positive
TRANIO scold negative
TRANIO din negative
LUCENTIO sweet positive
TRANIO love positive
TRANIO master positive
TRANIO love positive
LUCENTIO cruel negative
TRANIO master positive
LUCENTIO master positive
LUCENTIO master positive
LUCENTIO charm positive
TRANIO pleasure positive
TRANIO love positive
LUCENTIO loves positive
LUCENTIO slave negative
LUCENTIO rogue negative
BIONDELLO master positive
LUCENTIO quarrel negative
LUCENTIO fear negative
TRANIO faith positive
TRANIO master positive
SLY sly negative
SLY saint positive
SLY sly negative
SLY excellent positive
PETRUCHIO beloved positive
PETRUCHIO knock negative
GRUMIO knock negative

With this data available, we can now plot the sentiment-per-speaker.

First we’ll make a second dataset that is “wider”: it has one row per speaker and it summarizes n_neg, n_pos, n_total, and neg_ratio.

sentiment_wide <- 
  sentiment %>% 
  filter(speaker %in% sig_speakers) %>% 
  count(speaker, sentiment) %>% 
  pivot_wider(
    names_from = sentiment,
    values_from = n
  ) %>% 
  mutate(n_total = negative + positive,
         neg_ratio = negative/positive) %>% 
  rename(n_neg = negative,
         n_pos = positive)

Some of this info will be useful to have in the main dataset sentiment. Therefore we’ll left_join() the two.

sentiment %>% 
  filter(speaker %in% sig_speakers) %>% 
  count(speaker, sentiment) %>% 
  left_join(select(sentiment_wide, speaker, n_total)) %>% 
  #---start plot code---#
  ggplot(aes(x = sentiment,
             y = n/n_total,
             fill = speaker)) +
  scale_y_continuous(labels = scales::percent_format()) +
  geom_bar(stat = "identity",
           position = "dodge") +
  scale_fill_viridis_d() +
  facet_wrap(~speaker) + 
  labs(y="sentiment words scored") +
  theme_classic(base_family = "Roboto Condensed") +
  theme(legend.position = "none")
The ratio between neg. and pos.-coded words by character.

Figure 1: The ratio between neg. and pos.-coded words by character.

Let’s now plot just one bar for each speaker, the neg:pos ratio (or neg_ratio, as I will call it). So the highest value will be for the character with the most negative words.

sentiment_wide %>% 
  #---start plot code---#
  ggplot(aes(x = reorder(speaker, neg_ratio),
             y = neg_ratio,
             fill = -(neg_ratio)
             )) +
   geom_hline(yintercept = 1,
             colour="black", 
             linetype="dashed") +
  geom_col(alpha=0.8) +
  scale_fill_viridis_b() +
  labs(
    title="Sentiment ratio by character",
    subtitle="Values > 1 indicate greater negative counts",
    caption="The Taming of the Shrew",
    y="negative vs. positive words",
       x=NULL) +
  coord_flip() +
  theme_classic(base_family = "Roboto Condensed") +
  theme(legend.position = "none")
Sly must be miserable to be around.

Figure 2: Sly must be miserable to be around.

Citation

For attribution, please cite this work as

Hinrichs (2020, Oct. 20). Genre and Character in The Taming of the Shrew: 0 - A Look at the Toolbox. Retrieved from https://titus-and-shrew.netlify.app

BibTeX citation

@misc{hinrichs2020shrew-0,
  author = {Hinrichs, Lars},
  title = {Genre and Character in The Taming of the Shrew: 0 - A Look at the Toolbox},
  url = {https://titus-and-shrew.netlify.app},
  year = {2020}
}