How much does breed influence a dog's name?
Do you meet more Beagles than Bulldogs named "Bailey"? Fewer Pugs and more Dachshunds named "Nathan"?
The physical characteristics and personality traits associated with each breed probably have a bigger impact on your dog's name than you think. (Why didn't you consider "Gizmo" for your German Shepherd?)
In 2013, WNYC published an extensive dataset of dogs in New York City as part of their Dogs of NYC project. The dataset includes the name, gender, breed, color and borough of more than 50,000 dogs.
There are plenty of interesting questions to ask of this dataset - Do the boroughs prefer different dog breeds? Do younger dogs have different names than older dogs? - but I decided to focus this analysis on the relationship between dog name and breed. More specifically, I looked at which names were more likely to be given to certain breeds and clustered breeds into groups that are given similar names. I included some code snippets, but you can see the full R code as a Markdown page in my GitHub repo. These methods are very similar to those laid out in an earlier post, Tidy Text Mining Beer Reviews, which clustered beer styles using the text in beer reviews.
What are the most common dog names and breeds?
The most common dog breeds in NYC are Yorkshire Terriers and Shih Tzus.
The most popular names are "Max," "Bella," and "Rocky."
Are certain names more common to certain dog breeds?
Term Frequency-Inverse Document Frequency, or TF-IDF, is a statistic typically used to identify keywords for document retrieval by search engines and in recommender systems to suggest similar items. It looks for terms that are frequent in a particular document but rare in other documents.
Here, TF-IDF was used to identify the most characteristic names for each breed by treating each breed as a document and the names given to its dogs as the terms. It pushes names that are more likely to be given to a particular breed than to other breeds to the top. For example, "Max" is the most popular name in the dataset; overall, about 1.3% of dogs are named "Max." If "Max" is a top result for a specific breed, it means (roughly) that a higher percentage of that breed is named "Max" than the overall 1.3%.
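To make the scoring concrete, here is a minimal base-R sketch that computes TF-IDF by hand on a made-up table of nine dogs. The toy data and the variable names (`counts`, `breed_totals`, `breeds_with_name`) are assumptions for illustration only, not the WNYC data or the code used in this post. It assumes the common definition of term frequency as a name's share within a breed and inverse document frequency as the natural log of (number of breeds / number of breeds using the name).

```r
# Toy data, made up for illustration (not taken from the WNYC dataset)
pets <- data.frame(
  breed    = c("Beagle", "Beagle", "Beagle",
               "Pug", "Pug", "Pug",
               "Dachshund", "Dachshund", "Dachshund"),
  dog_name = c("Snoopy", "Snoopy", "Max",
               "Biggie", "Max", "Max",
               "Oscar", "Max", "Nathan"),
  stringsAsFactors = FALSE
)

# Count how often each name occurs within each breed, dropping empty pairs
counts <- as.data.frame(
  table(dog_name = pets$dog_name, breed = pets$breed),
  responseName = "n", stringsAsFactors = FALSE
)
counts <- counts[counts$n > 0, ]

# Term frequency: share of a breed's dogs that carry the name
breed_totals <- tapply(counts$n, counts$breed, sum)
counts$tf <- counts$n / as.numeric(breed_totals[counts$breed])

# Inverse document frequency: ln(number of breeds / breeds that use the name)
n_breeds <- length(unique(counts$breed))
breeds_with_name <- tapply(counts$breed, counts$dog_name,
                           function(b) length(unique(b)))
counts$idf <- log(n_breeds / as.numeric(breeds_with_name[counts$dog_name]))

# TF-IDF is the product of the two
counts$tf_idf <- counts$tf * counts$idf

# Rank names within each breed, highest score first
counts <- counts[order(counts$breed, -counts$tf_idf), ]
print(counts)
```

In this toy table "Max" appears in every breed, so its inverse document frequency is ln(3/3) = 0 and it drops out of the ranking, while "Snoopy" rises to the top for Beagles because no other breed uses it.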
Beagles are more likely to be named "Snoopy" than other dogs, while Pugs are more likely to be named "Biggie." Bichon Frises are more likely to be named "Snowball," "Fluffy" or "Snow," while Jack Russell Terriers are most likely to be named "Jack" or "Jackie." Dachshunds (a.k.a. "Wiener dogs") are more likely to be named "Nathan" (as in Nathan's Hot Dogs?) and "Oscar" (as in Oscar Mayer Wieners?).
Source: Dogs of NYC
# packages used in this snippet
library(dplyr)     # pipeline verbs
library(tidytext)  # bind_tf_idf()
library(ggplot2)   # plotting

# `pets` is the Dogs of NYC data frame loaded earlier in the full script

# get tf-idf for dog names (terms) by breed (document)
pets_names_tf <- pets %>%
  count(dog_name, breed) %>%
  bind_tf_idf(term = dog_name, document = breed, n) %>%
  # keep informative names given to at least 5 dogs of the breed
  subset(tf_idf > 0 & n >= 5) %>%
  group_by(breed) %>%
  arrange(desc(tf_idf)) %>%
  # rank = position within the breed, max = number of names kept for the breed
  mutate(rank = row_number(), max = n()) %>%
  ungroup()

# plot the top 5 names by tf-idf score for each breed
pets_names_tf %>%
  subset(rank <= 5 & max >= 5) %>%
  ggplot(aes(rank, tf_idf)) +
  geom_bar(stat = "identity", fill = "maroon4", alpha = 0.33) +
  geom_text(aes(label = dog_name, x = rank), color = "black", y = 0, hjust = 0) +
  scale_x_reverse() +
  coord_flip() +
  facet_wrap(~breed, ncol = 3) +
  labs(title = "Most Likely Dog Names by Breed",
       subtitle = "TF-IDF score for names (terms) within breeds (documents)",
       y = "TF-IDF", x = "") +
  theme(axis.ticks.y = element_blank(),
        axis.text.y = element_blank(),
        panel.grid.major.y = element_blank(),
        plot.title = element_text(size = rel(2)),
        plot.subtitle = element_text(size = rel(1.25)))
Which dog breeds are given similar names?
Some names are characteristic of several breeds, such as "Oreo" (Havanese, Boston Terrier, Shih Tzu) and "Bailey" (Puggle, Wheaten Terrier, Labrador Retriever). What's similar about these dog breeds that causes them to be given similar names?
Which breeds are related by the names dog owners give them?
Hierarchical clustering starts by treating each breed as its own cluster, then repeatedly merges the two closest clusters until everything has been combined into one. It produces a dendrogram, or tree plot, that makes it easy to see the distance between breeds.
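The merge process can be seen on a tiny scale with base R's `hclust()`. This is only a sketch under stated assumptions: the name shares and the breed labels (`big_1`, `big_2`, `small_1`, `small_2`) are made up, and the steps (correlate breeds on their name shares, take Euclidean distances, cluster with Ward's method) simply mirror the ones used on the real data in this post.

```r
# Hypothetical name shares: rows are names, columns are breeds (values made up)
shares <- matrix(
  c(0.30, 0.28, 0.02, 0.03,
    0.25, 0.27, 0.05, 0.04,
    0.03, 0.02, 0.30, 0.33,
    0.02, 0.04, 0.28, 0.25),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    c("Rocky", "Bear", "Gizmo", "Coco"),
    c("big_1", "big_2", "small_1", "small_2")
  )
)

# Correlation between breeds based on how they share names
breed_cor <- cor(shares)

# Distance between breeds' correlation profiles, then Ward clustering
breed_dist <- dist(breed_cor, method = "euclidean")
fit <- hclust(breed_dist, method = "ward.D")

# Each row of `merge` is one step; negative entries are single breeds,
# positive entries refer to clusters formed in earlier steps
print(fit$merge)

# Cut the tree into two groups of breeds
groups <- cutree(fit, k = 2)
print(groups)
```

Calling `plot(fit)` on the result draws the dendrogram. With these toy shares the two "big" breeds merge with each other and the two "small" breeds merge with each other before the final step joins the two pairs.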
In the plot below, breeds that share a branch are given more similar names. Breeds that have similar names tend to have shared physical characteristics, such as Akitas and Chow Chows, or similar temperaments, such as Cocker Spaniels and Poodles.
Source: Dogs of NYC
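library(dplyr)     # pipeline verbs
library(reshape2)  # dcast(); assumed here to come from reshape2

# pets_names, top_breeds and top_names are built earlier in the full script

# share of each name within each of the most common breeds, one column per breed
pet_cluster <- pets_names %>%
  subset(breed %in% top_breeds$Breed[top_breeds$Freq > 150]) %>%
  group_by(breed) %>%
  mutate(percent = n / sum(n)) %>%
  ungroup() %>%
  filter(dog_name %in% top_names$Name & breed %in% top_breeds$Breed) %>%
  dcast(dog_name ~ breed, value.var = "percent")

# a name a breed never uses gets a share of 0
pet_cluster[is.na(pet_cluster)] <- 0

# correlate breeds on their name shares, then cluster the correlation profiles
pet_cor <- cor(pet_cluster[, -1])
pet_dist <- dist(pet_cor, method = "euclidean")
fit <- hclust(pet_dist, method = "ward.D")
plot(fit, main = "Which Dog Breeds Are Given Similar Names?", family = "Avenir")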
Dog names are related to breed. Both the physical characteristics and the temperament associated with a breed influence names, such as "Gizmo" or "Oreo" for little lap dogs and "Rocky" for larger dogs.
TL;DR: Your dog's name probably isn't as original as you thought. (Definitely true for Jack Russell Terriers named "Jack"/"Jackie" and Pugs named "Pugsley.")
See the full R Markdown page in the GitHub repo.