class: center, middle, inverse, title-slide

# Lec13: Text

## Stat41: Data Viz

### Prof Amanda Luby

### Swarthmore College

---

# Today:

(1) Announcements

(2) Basics of Text Analysis

(3) Some cautionary notes

(4) (If time) Discussion of Examples

---

# Announcements

The meeting sign-up sheet goes "live" in the Google Drive at the beginning of lab

--

**Every** project must sign up for a 15-minute meeting - we may not need the whole time, but I want to check in before the final presentations.

--

The schedule for final presentations will be out tomorrow

--

All missing work and re-submissions are due on Thursday

--

Final papers are due on Sunday

---

# Tomorrow:

Part 1: Formatting .rmd output files, code style guide, etc.

+ If you have specific questions from my comments, please send them to me via Slack!

--

Part 2: Making work public

+ Bring the 3 .rmd files from class whose output you are most happy with

---

class: inverse, center, middle

# Text as Data

---

# Basic steps:

1. Tidy the text data
    - Each row should be a text element (often a word, n-gram, line, etc.). This is called a *token*.
    - Other variables could include title, chapter, line, etc.

--

2. Remove *stopwords*
    - Words like "the" and "is" usually aren't that interesting
    - This also makes the dataset smaller and easier to work with

--

3. Count the words
    - Usually within each book or chapter

--

4. Count the words, but make it fancy
    - Standardize by the total number of words
    - How common is a word in a book *relative to other books*?
    - How common are "good words" or "bad words"?

--

5. Plot the counts

---

class: inverse, center, middle

# Example: Harry Potter Books

---

# Regular text

.small-code[

```
THE BOY WHO LIVED

Mr. and Mrs. Dursley, of number four, Privet Drive, were proud to say that they were perfectly normal, thank you very much. They were the last people you'd expect to be involved in anything strange or mysterious, because they just didn't hold with such nonsense. Mr.
Dursley was the director of a firm called Grunnings, which made drills. He was a big, beefy man with hardly any neck, although he did have a very large mustache. Mrs. Dursley was thin and blonde and had nearly twice the usual amount of neck, which came in very useful as she spent so much of her time craning over garden fences, spying on the neighbors. The Dursleys had a small son called Dudley and in their opinion there was no finer boy anywhere. The Dursleys had everything they wanted, but they also had a secret, and their greatest fear was that somebody would discover it. They didn't think they could bear it if anyone found out about the Potters. Mrs. Potter was Mrs. Dursley's sister, but they hadn't met for several years; in fact, Mrs. Dursley pretended she didn't have a sister, because her sister and her good-for-nothing husband were as unDursleyish as it was possible to be. The Dursleys shuddered to think what the neighbors would say if the Potters a...
```

]

---

# Tidy text

.small-code[

```
# A tibble: 6 x 3
  chapter book                       text                                                
    <int> <chr>                      <chr>                                               
1       1 Harry Potter and the Phil… "THE BOY WHO LIVED Mr. and Mrs. Dursley, of number …
2       2 Harry Potter and the Phil… "THE VANISHING GLASS Nearly ten years had passed si…
3       3 Harry Potter and the Phil… "THE LETTERS FROM NO ONE The escape of the Brazilia…
4       4 Harry Potter and the Phil… "THE KEEPER OF THE KEYS BOOM. They knocked again. D…
5       5 Harry Potter and the Phil… "DIAGON ALLEY Harry woke early the next morning. Al…
6       6 Harry Potter and the Phil… "THE JOURNEY FROM PLATFORM NINE AND THREE-QUARTERS …
```

]

---

# Tokens

.pull-left.small-code[

```
# A tibble: 6 x 3
  word  chapter book            
  <chr>   <int> <chr>           
1 the         1 Harry Potter... 
2 boy         1 Harry Potter... 
3 who         1 Harry Potter... 
4 lived       1 Harry Potter... 
5 mr          1 Harry Potter... 
6 and         1 Harry Potter... 
```

]

--

.pull-right.small-code[

```
# A tibble: 6 x 3
  bigram    chapter book            
  <chr>       <int> <chr>           
1 the boy         1 Harry Potter... 
2 boy who         1 Harry Potter... 
3 who lived       1 Harry Potter... 
4 lived mr        1 Harry Potter... 
5 mr and          1 Harry Potter... 
6 and mrs         1 Harry Potter... 
```

]

---

# Stopwords

.center.small-code[

```
# A tibble: 1,149 x 2
   word        lexicon
   <chr>       <chr>  
 1 a           SMART  
 2 a's         SMART  
 3 able        SMART  
 4 about       SMART  
 5 above       SMART  
 6 according   SMART  
 7 accordingly SMART  
 8 across      SMART  
 9 actually    SMART  
10 after       SMART  
# … with 1,139 more rows
```

]

---

# Counting + plot

<img src="Lec13_files/figure-html/hp-words-1.png" style="display: block; margin: auto;" />

---

# *Fancy* counting + plot

<img src="Lec13_files/figure-html/hp-se-she-1.png" style="display: block; margin: auto;" />

---

# Sentiment analysis

.pull-left[

```r
get_sentiments("bing")
```

```
# A tibble: 6,786 x 2
   word        sentiment
   <chr>       <chr>    
 1 2-faces     negative 
 2 abnormal    negative 
 3 abolish     negative 
 4 abominable  negative 
 5 abominably  negative 
 6 abominate   negative 
 7 abomination negative 
 8 abort       negative 
 9 aborted     negative 
10 aborts      negative 
# … with 6,776 more rows
```

]

--

.pull-right[

```r
get_sentiments("afinn")
```

```
# A tibble: 2,477 x 2
   word       value
   <chr>      <dbl>
 1 abandon       -2
 2 abandoned     -2
 3 abandons      -2
 4 abducted      -2
 5 abduction     -2
 6 abductions    -2
 7 abhor         -3
 8 abhorred      -3
 9 abhorrent     -3
10 abhors        -3
# … with 2,467 more rows
```

]

---

# *Fancy* counting + Plot 2

<img src="Lec13_files/figure-html/hp-net-sentiment-1.png" style="display: block; margin: auto;" />

---

# tf-idf

Stands for: term frequency-inverse document frequency

Idea: how important a term is *within* a document, compared to the rest of the documents

$$
`\begin{aligned} \text{tf}(\text{term}) &= \frac{\text{times term appears in document}}{\text{total words in document}} \\ \text{idf}(\text{term}) &= \ln{\left(\frac{n_{\text{documents}}}{n_{\text{documents containing term}}}\right)} \\ \text{tf-idf}(\text{term}) &= \text{tf}(\text{term}) \times \text{idf}(\text{term}) \end{aligned}`
$$

The `tidytext` package does this for you, which you'll see in lab today!
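---

# tf-idf in code (sketch)

The tokenize, remove-stopwords, count, tf-idf pipeline can be sketched with the `tidytext` and `dplyr` packages. This is a minimal example on a made-up two-book corpus (the book names and sentences here are invented for illustration; they are not the actual Harry Potter data):

```r
# Minimal tf-idf sketch with tidytext + dplyr.
# Toy corpus for illustration only, not the real Harry Potter text.
library(dplyr)
library(tidytext)

books <- tibble(
  book = c("Book A", "Book B"),
  text = c("Harry grabbed the broom and the snitch",
           "The potion simmered in the dungeon")
)

book_tf_idf <- books %>%
  unnest_tokens(word, text) %>%           # 1. tidy: one token (word) per row
  anti_join(stop_words, by = "word") %>%  # 2. drop stopwords ("the", "and", "in")
  count(book, word, sort = TRUE) %>%      # 3. count words within each book
  bind_tf_idf(word, book, n)              # 4. add tf, idf, and tf-idf columns

book_tf_idf
```

`bind_tf_idf()` implements the formulas on the previous slide; note that a word appearing in *every* document gets idf = ln(1) = 0, so it drops out of the tf-idf ranking no matter how frequent it is.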
---

# *Fancy* counting + Plot 3

<img src="Lec13_files/figure-html/hp-tf-idf-1.png" style="display: block; margin: auto;" />

---

# Of course, there's so much more!

+ Part-of-Speech tagging

--

+ Topic Modeling / Latent Dirichlet Allocation
    - Idea: find clusters of related words
    - Label the clusters
    - Track topics over, e.g., time

--

+ Fingerprinting
    - Punctuation, vocabulary, sentence length
    - Can be used to make guesses about authorship

---

# A Quick Note on Word Clouds

From a [blog post](https://www.niemanlab.org/2011/10/word-clouds-considered-harmful/) by a former NYT software architect:

> ...I’ve seen this pattern across many news organizations: reporters sidestepping their limited knowledge of the subject material by peering for patterns in a word cloud — like reading tea leaves at the bottom of a cup. What you’re left with is a shoddy visualization that fails all the principles I hold dear.

> Every time I see a word cloud presented as insight, I die a little inside.

---

# Two visualizations, same data

.pull-left[

<img src="images/nyt-baghdad-print.jpg" width="1365" style="display: block; margin: auto;" />

]

--

.pull-right[

<img src="images/baghdad-wordcloud.jpg" width="800" style="display: block; margin: auto;" />

]

---

class: inverse, center, middle

# With great power comes great responsibility

---

# Cautions

- Be sure to follow all guidelines for taking data from websites (including scraping)

--

- Different stopword lexicons can lead to different results

--

- Don't trust translated texts

--

- Be careful with sentiment analysis

--

- Counting words does not make us experts on texts

---

### Example: Stop words + Sentiment Analysis

--

"I told you she was not happy"

--

After removing stop words: [told, happy]

--

The "not" is gone, so a sentiment lexicon would score this clearly negative sentence as positive.

---

### Example: Translations from Turkish, which has a single, gender-neutral pronoun

[A poem](https://qz.com/1141122/google-translates-gender-bias-pairs-he-with-hardworking-and-she-with-lazy-and-other-examples/#:~:text=In%20the%20Turkish%20language%2C%20there,not%20the%20case%20in%20English.)
(caveat: this is more than 2 years old)

<img src="images/turkish-translation.png" width="2619" style="display: block; margin: auto;" />

---

# Timnit Gebru

- Famous AI ethics researcher and (former) head of ethical AI at Google

--

- Wrote a paper on the risks of large language models trained on text data

--

- Google blocked the paper (through its internal review process)

--

- Gebru questioned the process and was fired

--

- [Overview](https://www.technologyreview.com/2020/12/04/1013294/google-ai-ethics-research-paper-forced-out-timnit-gebru/) of the paper

--

- Moral of the story: text as data is not going away, but there's still a lot we don't understand. Recognize the subjectivity that we insert into analyses through pre-processing decisions. Be cautious when making conclusions.

---

# Examples

+ [She giggles, he gallops](https://pudding.cool/2017/08/screen-direction/)

+ [Kavanaugh and Ford Question Dodging](https://www.vox.com/policy-and-politics/2018/9/28/17914308/kavanaugh-ford-question-dodge-hearing-chart)

+ [Crosswords](https://pudding.cool/2020/11/crossword/)

+ [Applying PCA to Fictional Character Personalities](https://www.alexcookson.com/post/2020-11-19-applying-pca-to-fictional-character-personalities/)

+ ...
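---

# Appendix: the stop-word caution, in code

A minimal sketch of the "I told you she was not happy" example from the Cautions slides, assuming the `tidytext` and `dplyr` packages: "not" appears in the standard stop-word lists, so the negation is removed before the sentiment lexicon is ever joined.

```r
# Sketch of the stop-word pitfall: dropping "not" flips the sentiment.
library(dplyr)
library(tidytext)

sentence <- tibble(text = "I told you she was not happy")

result <- sentence %>%
  unnest_tokens(word, text) %>%            # i, told, you, she, was, not, happy
  anti_join(stop_words, by = "word") %>%   # only "told" and "happy" survive
  inner_join(get_sentiments("bing"), by = "word")

result  # "happy" / positive: the clearly negative sentence scores as positive
```

There is no one-line fix: keeping every stop word adds noise, and negation-aware approaches (e.g., bigrams like "not happy") take extra work. The point is that pre-processing choices change the answer.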