Billboard has published a Year-End Hot 100 every December since 1958. The chart measures the performance of singles in the U.S. throughout the year.
Using R, I’ve combined the lyrics from 50 years of Billboard Year-End Hot 100 charts (1965-2015) into one dataset for analysis. You can download that dataset from my GitHub repository here.
Getting the Lyrics
The songs used for analysis were scraped from Wikipedia’s entry for each Billboard Year-End Hot 100 Songs (e.g., 2014). This is the year-end chart, not the weekly rankings. Many artists have made the weekly chart but not the final year-end chart. The final chart is calculated using an inverse point system based on the weekly Billboard charts (100 points for a week at number one, 1 point for a week at number 100, etc.).
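As a rough illustration of the inverse point system described above, here is a minimal Python sketch (the analysis itself was done in R). The function name and the toy weekly charts are made up for the example, and Billboard's real calculation may include details not covered here.

```python
from collections import defaultdict

def year_end_points(weekly_charts):
    """Sum inverse-rank points over a year of weekly charts.

    weekly_charts: iterable of dicts mapping song title -> weekly rank (1-100).
    Returns (song, points) pairs sorted from most to fewest points.
    """
    totals = defaultdict(int)
    for week in weekly_charts:
        for song, rank in week.items():
            # Rank 1 earns 100 points, rank 100 earns 1 point.
            totals[song] += 101 - rank
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

# Toy data: two weeks of a hypothetical chart.
weeks = [
    {"Song A": 1, "Song B": 2, "Song C": 100},
    {"Song B": 1, "Song A": 3},
]
ranking = year_end_points(weeks)
print(ranking)
```

A song that never reaches number one can still finish the year ahead of one that does, as long as it collects more points across the weeks.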
I used the XML and RCurl packages to scrape song and artist names from each Wikipedia entry. I then used that list to scrape lyrics from sites that had predictable URL strings (for example, metrolyrics.com uses metrolyrics.com/SONG-NAME-lyrics-ARTIST-NAME.html). If the scrape of the first site failed, I moved on to the second, and so on. About 78.9% of the lyrics were scraped from metrolyrics.com, 15.7% from songlyrics.com, and 1.8% from lyricsmode.com. About 3.6% (187/5100) were unavailable.
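The URL-building and fallback logic can be sketched as follows. This is a Python illustration under stated assumptions, not the R script used for the dataset: the `slugify` rules are a guess at how titles map to URL strings, only the metrolyrics pattern quoted above is used, and the fetchers are stand-in functions so the example runs without network access.

```python
import re

def slugify(text):
    """Lowercase, drop punctuation, and join words with hyphens (assumed rules)."""
    text = re.sub(r"[^a-z0-9\s-]", "", text.lower())
    return re.sub(r"[\s-]+", "-", text).strip("-")

def metrolyrics_url(song, artist):
    # Pattern quoted in the post: metrolyrics.com/SONG-NAME-lyrics-ARTIST-NAME.html
    return f"metrolyrics.com/{slugify(song)}-lyrics-{slugify(artist)}.html"

def first_successful(song, artist, sources):
    """Try each source in order and stop at the first one that returns lyrics.

    sources: ordered list of (name, fetch) pairs, where fetch(song, artist)
    returns the lyrics text or None on failure.
    """
    for name, fetch in sources:
        lyrics = fetch(song, artist)
        if lyrics:
            return lyrics, name
    return None, None  # no source had the lyrics

# Stand-in fetchers (hypothetical); real ones would download and parse a page.
def always_fails(song, artist):
    return None

def stub_lyrics(song, artist):
    return "la la la"

url = metrolyrics_url("In da Club", "50 Cent")
lyrics, source = first_successful(
    "In da Club", "50 Cent",
    [("site_one", always_fails), ("site_two", stub_lyrics)],
)
print(url, source)
```

Recording which source succeeded is what makes it possible to report the share of lyrics coming from each site.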
The dataset contains 5100 observations with the features rank (1-100), song, artist, year, lyrics, and source. The artist feature is fairly standardized thanks to Wikipedia, but there is still quite a bit of noise when it comes to artist collaborations (Justin Timberlake featuring Timbaland, for example). Any errors in the scraped lyrics, such as misspellings or variants like "nite" instead of "night," have not been corrected.
Exploring the Data
Most Frequent Words
58% One-Hit Wonders
1154 out of 1989 artists (58%) appearing on a year-end chart failed to make it with a second hit. The figure on the right was computed by aggregating songs by artist; any "featured" artist was stripped out ("Rihanna featuring Drake" -> "Rihanna"). This means that only the first artist in the list got credit for the song.
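The aggregation described above can be sketched in a few lines of Python. The function names and sample rows are illustrative only, and the sketch handles just the word "featuring" as the separator, so other collaboration styles ("and", "with", "&") would need extra rules.

```python
import re
from collections import Counter

def primary_artist(artist):
    """Keep only the text before a 'featuring' credit."""
    parts = re.split(r"\s+featuring\s+", artist, maxsplit=1, flags=re.IGNORECASE)
    return parts[0].strip()

def one_hit_share(rows):
    """rows: iterable of (song, artist) pairs, one per charted song.

    Returns (artists_with_one_song, total_artists, share).
    """
    counts = Counter(primary_artist(artist) for _, artist in rows)
    one_hit = sum(1 for n in counts.values() if n == 1)
    return one_hit, len(counts), one_hit / len(counts)

# Toy rows, not taken from the dataset.
rows = [
    ("Song 1", "Rihanna"),
    ("Song 2", "Rihanna featuring Drake"),
    ("Song 3", "Drake"),
    ("Song 4", "Los del Rio"),
]
one_hit, total, share = one_hit_share(rows)
print(one_hit, total, round(share, 2))
```

In the toy rows, Drake counts as a one-hit artist even though he appears twice, which shows the effect of crediting only the first-listed artist.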
Marathon vs. Sprint Careers
It was surprising to see the relatively short career spans of some of the most-charted artists (Rihanna has 28 charted songs in only 10 years), so I looked at the relationship between career length and the average number of songs charted per year and found it to be negative. For each additional year of career span, the average number of songs charted per year decreases by 94% (linear model fitted after taking the log of average songs per year).
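The sketch below shows one way such a model could be set up in Python. It rests on several assumptions not stated in the post: career span is taken as last charted year minus first charted year, songs per year is songs divided by span, artists with a span of zero are dropped, and the fit is plain least squares on the log of songs per year. The rows are toy data.

```python
import math
from collections import defaultdict

def career_stats(rows):
    """rows: iterable of (artist, year) pairs, one per charted song.

    Returns {artist: (span_years, songs_per_year)} for artists with span > 0.
    """
    years = defaultdict(list)
    for artist, year in rows:
        years[artist].append(year)
    stats = {}
    for artist, ys in years.items():
        span = max(ys) - min(ys)  # assumed definition of career span
        if span > 0:
            stats[artist] = (span, len(ys) / span)
    return stats

def fit_log_linear(points):
    """Least-squares fit of log(y) = a + b * x; returns (a, b)."""
    xs = [x for x, _ in points]
    logs = [math.log(y) for _, y in points]
    n = len(points)
    mean_x, mean_log = sum(xs) / n, sum(logs) / n
    slope = sum((x - mean_x) * (l - mean_log) for x, l in zip(xs, logs)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return mean_log - slope * mean_x, slope

# Toy rows, not taken from the dataset.
rows = [
    ("Artist X", 2005), ("Artist X", 2005), ("Artist X", 2007), ("Artist X", 2015),
    ("Artist Y", 1990), ("Artist Y", 1992),
    ("Artist Z", 2001),
]
stats = career_stats(rows)
intercept, slope = fit_log_linear(list(stats.values()))
print(stats, slope)
```

A negative slope means longer careers go with fewer charted songs per year, which is the direction reported above.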
*The dataset does not include 1964, the first year The Beatles made the year-end chart, so technically their career span would be 12 years.
Lyrics Over Time
Growing Vocabulary and Song Length
The songs in the dataset average 332 total/114 unique words. Average word counts (both unique and total) have increased over time. Variance in word counts has also increased, perhaps due to greater genre diversity in the chart rankings over time. Inconstant variance was corrected with a log transform of word count, and two linear models were fit, producing coefficients of 0.01873 for total and 0.0136 for unique word counts. Reading the coefficients as approximate percent changes, each additional year increases the total word count on average by about 1.87% and the unique word count by about 1.36%.
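To make the two measures and the coefficient reading concrete, here is a small Python sketch. The tokenizer is an assumption (lowercase runs of letters, digits, and apostrophes), and the sample string is made up. The second function converts a coefficient from a model of log(word count) on year into an exact percent change, which is slightly larger than the coefficient read directly as a percentage.

```python
import math
import re

def word_counts(lyrics):
    """Return (total_words, unique_words) for one song's lyrics."""
    words = re.findall(r"[a-z0-9']+", lyrics.lower())
    return len(words), len(set(words))

def percent_change_per_unit(coef):
    """Exact percent change in y per one-unit increase in x
    when the model is log(y) = a + coef * x."""
    return (math.exp(coef) - 1) * 100

total_words, unique_words = word_counts("La la la, do re mi")

# Coefficients reported in the post.
total_pct = percent_change_per_unit(0.01873)   # about 1.89
unique_pct = percent_change_per_unit(0.0136)   # about 1.37
print(total_words, unique_words, round(total_pct, 2), round(unique_pct, 2))
```

For coefficients this small the shortcut and the exact value differ only in the second decimal place, so the conclusion is the same either way.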
The increase could be due to longer songs (from 2.5 to 4 minutes since the 1960s), faster-paced music styles, or songs featuring more than one artist.
From Boogie to Bitch: Most Characteristic Lyrics by Decade
Using the log likelihood statistic outlined in my earlier post (Text Mining South Park), I was able to identify the most characteristic words in decade-specific lyrics. In short, words that appear more often in a corpus than would be expected have a higher log likelihood. The 25 strongest results all exceed 81; for reference, a value of 10.83 is significant at p < 0.001.
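For readers unfamiliar with the statistic, the sketch below shows a common two-corpus formulation of log likelihood in Python. It is an assumption that this matches the exact variant used in the earlier post; the function name and the counts are made up. Here `a` is the word's count in the target corpus (one decade), `b` its count in the reference corpus (all other decades), and `c` and `d` are the total word counts of the two corpora.

```python
import math

def log_likelihood(a, b, c, d):
    """Log likelihood (G2) for a word seen a times in a target corpus of
    c words and b times in a reference corpus of d words."""
    expected_target = c * (a + b) / (c + d)
    expected_reference = d * (a + b) / (c + d)
    g2 = 0.0
    if a > 0:  # a zero count contributes nothing to the sum
        g2 += a * math.log(a / expected_target)
    if b > 0:
        g2 += b * math.log(b / expected_reference)
    return 2 * g2

# Same relative frequency in both corpora: no evidence of a difference.
same = log_likelihood(10, 10, 1000, 1000)

# Three times as frequent in the target corpus, but with small counts.
skewed = log_likelihood(30, 10, 1000, 1000)
print(same, round(skewed, 2))
```

The second value comes out just under 10.83, so even a threefold difference is not significant at p < 0.001 when the counts are this small. The statistic is symmetric, so a word counts as characteristic of the target corpus only if it is also relatively more frequent there (a/c greater than b/d).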
It's clear that individual songs that are heavy on repetition ("club", "go head", and "shorty" in the 2000s are almost surely from 50 Cent's "In da Club") influence the results. It raises a good question about log likelihood's applicability to song lyrics - does a single highly repetitive song skew the results?
Billboard Year-End Hot 100 Ranking Policy Changes
General changes in popular song content can be at least partially attributed to the evolution of the ranking method for the Hot 100 over time. Billboard stayed relevant by changing its ranking policy along with the changing methods of discovering and purchasing music.
- 1958-1991: ranking determined by ratio of singles sales and airplay
- 1991: Billboard begins collecting sales data digitally (using SoundScan) for quicker and more accurate charts
- 1998: Billboard drops requirement that song must be released as a single to appear on the chart
- 2005: Digital downloads (iTunes) included
- 2012: On-demand streaming services (Spotify, Rhapsody) included
- 2013: Video views (YouTube) included
Consumers now have more say than ever in which songs chart. Prior to 2005, consumers could impact the charts by purchasing the single or requesting the song on the radio. Now, consumers can view the video, stream the song, download the single, or purchase a physical copy to have a say in what’s popular.
See further discussion at Mic.com: These Charts Show How Drastically Pop Has Changed Over the Past 50 Years