# Web scraping Reply All transcripts

One of my favorite podcasts is Reply All, a show (roughly) about technology and the internet. Hosts PJ Vogt and Alex Goldman, along with a rotating cast of fantastic reporters and producers, tell some of the most fascinating stories about the way we interact with technology. The show has been in production since 2014, and for a time felt like a great little secret, but their website now indicates that the show is downloaded “around 3.5 million times per month.” If you haven’t listened, I’d highly recommend checking it out. Some of my favorite episodes are (in no particular order):

I thought it would be a fun project to take the transcripts from every episode of Reply All and see what we can learn about the show. As is often the case in data science, 80% of the challenge is to gather and clean the data.

### Part 3: Pull transcripts for each episode

Now we can use purrr to iterate through every episode and ‘map’ the function `getTranscript` over each episode link. I learned a lot about iterating with purrr from this tutorial from Jenny Bryan and the chapter on Iteration in R for Data Science by Garrett Grolemund and Hadley Wickham. This takes ~3 minutes to run, depending on your internet connection.

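# use purrr to map the 'getTranscript' function over all of the URLs in the ep_data data frame
# (assumes the URLs are in 'full_link' and the results go in a list-column called 'transcript')
ep_data <- ep_data %>%
  mutate(
    transcript = purrr::map(full_link, getTranscript)
  )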

# unnest the results into one big data frame
tidy_ep_data <- ep_data %>%
  unnest(transcript)

# preview the first few rows as an HTML table
head(tidy_ep_data) %>%
  knitr::kable(format = "html") %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE)
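
To make the map-then-unnest pattern easier to see, here is a minimal sketch that runs without touching the network. The function `fake_transcript` and the example.com URLs are made up for illustration; they stand in for `getTranscript` and the real episode links, and the real function may well return different columns.

```r
library(dplyr)
library(purrr)
library(tidyr)

# hypothetical stand-in for getTranscript: returns a small data frame per URL
fake_transcript <- function(url) {
  tibble(
    speaker = c("PJ", "ALEX"),
    text = paste(c("hello from", "goodbye from"), url)
  )
}

# two made-up episode links
toy_eps <- tibble(
  full_link = c("https://example.com/ep-1", "https://example.com/ep-2")
)

# map() returns a list, so 'transcript' becomes a list-column:
# one data frame per episode row
toy_nested <- toy_eps %>%
  mutate(transcript = map(full_link, fake_transcript))

# unnest() expands each nested data frame into ordinary rows,
# repeating the episode-level columns alongside
toy_tidy <- toy_nested %>%
  unnest(transcript)
```

Here `toy_nested` still has one row per episode, while `toy_tidy` has one row per transcript line.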

### Part 4: Clean the transcripts

Now that we have a big data frame, we can do a little more cleaning of the data. Arguably, this is avoidable with more intelligent regex and string work earlier, but this cleanup will have to do for now. I briefly use the zoo package to fill in some NA values in the speaker column using the previous non-NA value (inspired by this Stack Overflow answer).

# turn empty strings into NA and then fill them using
# the na.locf (last observation carried forward) function from the 'zoo' package
tidy_ep_data <- tidy_ep_data %>%
  mutate(
    speaker = if_else(speaker == "", NA_character_, speaker),
    # na.rm = FALSE keeps any leading NAs so the column length is unchanged
    speaker = zoo::na.locf(speaker, na.rm = FALSE)
  )
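
One detail of `na.locf` is worth a quick illustration: by default it drops leading NAs, which shortens the vector and would make `mutate()` complain about a length mismatch. Here is a small sketch on a made-up speaker vector (the values are invented for the example):

```r
# toy speaker column: empty strings mark lines with no speaker label
speaker <- c(NA, "PJ", "", "", "ALEX", "")
speaker[speaker == ""] <- NA

# default (na.rm = TRUE): the leading NA has nothing to carry forward and is dropped
filled_default <- zoo::na.locf(speaker)

# na.rm = FALSE: the leading NA is kept, so the length matches the input
filled_keep <- zoo::na.locf(speaker, na.rm = FALSE)

length(filled_default)
length(filled_keep)
```

Inside a `mutate()` call, the `na.rm = FALSE` form is the safer choice because the result always lines up with the original rows.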

# clean up the list of speakers
tidy_ep_data_clean <- tidy_ep_data %>%
  filter(
    !grepl("CREDIT", speaker), # remove credit chit-chat
    !grepl("THEME", speaker),  # remove theme chit-chat
    speaker != "OUTPJ",
    speaker != "OUTALEX"
  ) %>%
  mutate(
    speaker = trimws(speaker),
    # collapse the different spellings of each host into one canonical name
    speaker = case_when(
      speaker == "ALEX" ~ "ALEX GOLDMAN",
      speaker == "REPLY ALL ALEX GOLDMAN" ~ "ALEX GOLDMAN",
      speaker == "GOLDMAN" ~ "ALEX GOLDMAN",
      speaker == "AG" ~ "ALEX GOLDMAN",
      speaker == "PJ" ~ "PJ VOGT",
      speaker == "REPLY ALL PJ VOGT" ~ "PJ VOGT",
      speaker == "BLUMBERG" ~ "ALEX BLUMBERG",
      speaker == "AB" ~ "ALEX BLUMBERG",
      speaker == "SRUTHI" ~ "SRUTHI PINNAMANENI",
      TRUE ~ speaker # leave every other speaker unchanged
    )
  )
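
If the list of aliases keeps growing, one alternative to a long `case_when` is a named lookup vector. This is not what the code above does, just a sketch of a different technique on made-up rows; `speaker_lookup` and `toy` are names invented for the example.

```r
library(dplyr)

# hypothetical alias table: names are the raw labels, values the canonical names
speaker_lookup <- c(
  "ALEX" = "ALEX GOLDMAN",
  "AG"   = "ALEX GOLDMAN",
  "PJ"   = "PJ VOGT",
  "AB"   = "ALEX BLUMBERG"
)

toy <- tibble(speaker = c(" PJ", "AG", "SRUTHI PINNAMANENI", "AB "))

toy_clean <- toy %>%
  mutate(
    speaker = trimws(speaker),
    # look up each label; labels missing from the table come back as NA,
    # and coalesce() falls back to the original value for those
    speaker = coalesce(unname(speaker_lookup[speaker]), speaker)
  )
```

The trade-off is that the aliases live in one small table that is easy to extend, at the cost of being a little less explicit than spelling out each case.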

And after all of that, we now have some sort of nice text data from every episode of Reply All!

glimpse(tidy_ep_data_clean)
## Observations: 704,321
## Variables: 6
## $ link           <chr> "/reply-all/135-the-robocall-conundrum#episode-pl…
## $ episode_number <dbl> 135, 135, 135, 135, 135, 135, 135, 135, 135, 135,…
## $ full_link      <chr> "https://www.gimletmedia.com/reply-all/135-the-ro…
## $ speaker        <chr> "PJ VOGT", "PJ VOGT", "PJ VOGT", "PJ VOGT", "PJ V…
## $ linenumber     <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3…
## $ word           <chr> "from", "gimlet", "this", "is", "reply", "all", "…

head(tidy_ep_data_clean) %>%
  knitr::kable(format = "html") %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE)

We can write the data to a .csv for anyone to use in the future.

readr::write_csv(tidy_ep_data_clean, "reply_all_text_data.csv")