50 Years of Pop Music

Billboard has published a Year-End Hot 100 every December since 1958. The chart measures the performance of singles in the U.S. throughout the year.

Using R, I’ve combined the lyrics from 50 years of Billboard Year-End Hot 100 charts (1965-2015) into one dataset for analysis. You can download that dataset from my GitHub repo here.

Getting the Lyrics

The songs used for analysis were scraped from Wikipedia’s entry for each Billboard Year-End Hot 100 Songs chart (e.g., 2014). This is the year-end chart, not the weekly rankings. Many artists have made the weekly chart but not the final year-end chart. The final chart is calculated using an inverse point system based on the weekly Billboard charts (100 points for a week at number one, 1 point for a week at number 100, etc.).
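The inverse point system can be sketched in a few lines. This is only an illustration of the rule as described above (100 points for a week at number one, down to 1 point at number 100), written in Python rather than the R used for this post. The function name and the weekly runs are hypothetical, and Billboard's actual year-end methodology may include details not covered here.

```python
def year_end_points(weekly_positions):
    """Sum inverse points for one song: position 1 earns 100, position 100 earns 1."""
    total = 0
    for position in weekly_positions:
        if not 1 <= position <= 100:
            raise ValueError(f"chart position out of range: {position}")
        total += 101 - position
    return total


# Hypothetical weekly chart runs (not real chart data).
steady_hit = [1, 1, 2, 5, 10]   # five strong weeks
brief_peak = [1, 40, 90]        # one week at the top, then a quick fall

print(year_end_points(steady_hit), year_end_points(brief_peak))
```

Under this rule a song with many strong weeks outscores one that peaks briefly, which is why a weekly number one does not guarantee a place on the year-end chart.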

I used the XML and RCurl packages to scrape song and artist names from each Wikipedia entry. I then used that list to scrape lyrics from sites that had predictable URL strings (for example, metrolyrics.com uses metrolyrics.com/SONG-NAME-lyrics-ARTIST-NAME.html). If the scrape from the first site failed, I moved on to the second, and so on. About 78.9% of the lyrics were scraped from metrolyrics.com, 15.7% from songlyrics.com, and 1.8% from lyricsmode.com. About 3.6% (187/5100) were unavailable.
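The fallback approach might look roughly like the sketch below. It is written in Python rather than the R used for the actual scraping, and it is not the original code. The slug rules, the http scheme, and every function name are assumptions; only the metrolyrics.com/SONG-NAME-lyrics-ARTIST-NAME.html pattern comes from the description above. Fetchers are passed in as plain functions so the logic can be tried without any network access.

```python
import re


def slugify(text):
    """Lowercase, drop apostrophes, and join alphanumeric runs with hyphens (assumed rules)."""
    text = text.lower().replace("'", "")
    return "-".join(re.findall(r"[a-z0-9]+", text))


def metrolyrics_url(song, artist):
    """Build a URL following the SONG-NAME-lyrics-ARTIST-NAME.html pattern."""
    return f"http://metrolyrics.com/{slugify(song)}-lyrics-{slugify(artist)}.html"


def scrape_with_fallback(song, artist, sources):
    """Try each (name, fetch) pair in order; return (lyrics, name) or (None, None)."""
    for name, fetch in sources:
        try:
            lyrics = fetch(song, artist)
        except Exception:
            lyrics = None  # treat any failure as "not found" and move to the next site
        if lyrics:
            return lyrics, name
    return None, None
```

Recording which source succeeded is what makes it possible to report the share of lyrics that came from each site.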

The dataset features 5100 observations with the features rank (1-100), song, artist, year, lyrics, and source. The artist feature is fairly standardized thanks to Wikipedia, but there is still quite a bit of noise when it comes to artist collaborations (Justin Timberlake featuring Timbaland, for example). If there were any errors in the lyrics that were scraped, such as spelling errors or derivatives like "nite" instead of "night," they haven't been corrected.

Exploring the Data

Most Frequent Words


58% One-Hit Wonders

1154 out of 1989 artists (58%) appearing on a year-end chart failed to make it with a second hit. The figure on the right was computed by aggregating songs by artist; any "featured" artist was stripped out ("Rihanna featuring Drake" -> "Rihanna"). This means that only the first artist in the list got credit for the song.
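A minimal sketch of that aggregation, in Python rather than the R used for this post. The function names are hypothetical, and only the "featuring" separator mentioned above is handled; other collaboration styles (such as "and" or "&") are left untouched.

```python
import re
from collections import Counter

# Matches " featuring " in any letter case, with flexible whitespace around it.
FEATURING = re.compile(r"\s+featuring\s+", flags=re.IGNORECASE)


def lead_artist(credit):
    """Keep only the first-listed artist: 'Rihanna featuring Drake' becomes 'Rihanna'."""
    return FEATURING.split(credit, maxsplit=1)[0].strip()


def songs_per_artist_distribution(credits):
    """Map number of charted songs to the number of artists with that many songs."""
    songs_by_artist = Counter(lead_artist(credit) for credit in credits)
    return Counter(songs_by_artist.values())


# Hypothetical artist credits, one per charted song.
credits = ["Rihanna featuring Drake", "Rihanna", "Drake", "Adele"]
print(songs_per_artist_distribution(credits))
```

The one-hit-wonder share is then the count for one song divided by the number of distinct lead artists, which with the figures above is 1154 / 1989, or about 58%.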

Songs     1     2    3    4   5   6   7   8   9   10
Artists   1154  319  160  90  70  61  31  23  13  18

Songs per artist

Marathon vs. Sprint Careers

It was surprising to see the relatively short career spans of some of the most-charted artists (Rihanna has 28 charted songs in only 10 years), so I looked at the relationship between career length and average number of songs charted per year and found it to be negative. For each additional year of career span, the average number of songs charted per year decreases by 94% (linear model fitted after taking the log of average songs per year).
*The dataset does not include 1964, the first year The Beatles made the year-end chart, so technically their career span would be 12 years.
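For readers unfamiliar with this kind of model, here is a minimal sketch of a log-linear fit: ordinary least squares on the natural log of the outcome. It is written in Python rather than the R used for this post, the function name is hypothetical, and the data points are made up to follow an exact trend. They are not taken from the dataset.

```python
import math


def fit_log_linear(spans, songs_per_year):
    """Least-squares fit of log(songs_per_year) on span; returns (intercept, slope)."""
    logs = [math.log(value) for value in songs_per_year]
    n = len(spans)
    mean_x = sum(spans) / n
    mean_y = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in spans)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(spans, logs))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Made-up data following log(y) = 1.0 - 0.1 * span exactly (illustrative only).
spans = [1, 5, 10, 20]
songs_per_year = [math.exp(1.0 - 0.1 * span) for span in spans]

intercept, slope = fit_log_linear(spans, songs_per_year)
print(intercept, slope)
```

In a model of this form, a slope b means each additional year of career span multiplies the expected songs per year by exp(b), so a negative slope corresponds to a constant percentage decline per year.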

Career Spans of Top 20 Artists

Songs and Career Length

Lyrics Over Time

Growing Vocabulary and Song Length

The songs in the dataset average 332 total and 114 unique words. Average word counts (both unique and total) have increased over time. Variance in word counts has also increased, perhaps due to greater genre diversity in the chart rankings over time. This non-constant variance was corrected with a log transform of word count, and two linear models were fit, producing coefficients of 0.01873 for total and 0.0136 for unique word counts. For each additional year, the total word count increases on average by roughly 1.87% and the unique word count by roughly 1.36%.
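The step from coefficient to percentage can be made explicit. The sketch below is in Python rather than the R used for this post, and it assumes the log transform used the natural log. The function names are hypothetical; the two coefficients are the ones reported above.

```python
import math


def percent_change_per_year(coefficient):
    """Exact percent change in the outcome for a one-year step in a log-linear model."""
    return math.expm1(coefficient) * 100


def projected_multiplier(coefficient, years):
    """Multiplicative change the model implies over a span of years."""
    return math.exp(coefficient * years)


total_pct = percent_change_per_year(0.01873)   # coefficient reported for total words
unique_pct = percent_change_per_year(0.0136)   # coefficient reported for unique words

print(round(total_pct, 2), round(unique_pct, 2))   # prints 1.89 1.37
print(projected_multiplier(0.01873, 50))
```

The exact values are about 1.89% and 1.37%. The 1.87% and 1.36% quoted above correspond to the common approximation of multiplying a small coefficient by 100, which is close enough at this scale. Compounded over 50 years, the total-word coefficient implies roughly a two-and-a-half-fold increase, if the linear trend holds across the whole period.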

Words per Song

The increase could be due to longer songs (from 2.5 to 4 minutes since the 1960s [1]), faster-paced music styles, or songs featuring more than one artist.

Songs with two or more artists

From Boogie to Bitch: Most Characteristic Lyrics by Decade

Using the log likelihood statistic outlined in my earlier post (Text Mining South Park), I was able to identify the most characteristic words in decade-specific lyrics. In short, words that appear more often than would be expected in a corpus have higher log likelihood. The 25 strongest results all exceed 81 (10.83 is significant at p < 0.001).
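For reference, here is a sketch of one common form of the statistic: the two-corpus log-likelihood (G²) used in corpus linguistics, which compares a word's count in one decade against its count in all other decades. This assumes the earlier post used that form. It is written in Python rather than the R used for this post, the function name is hypothetical, and the counts are made up.

```python
import math


def log_likelihood(count_target, total_target, count_rest, total_rest):
    """G2 for one word: observed counts in two corpora versus pooled-rate expectations."""
    combined = count_target + count_rest
    total = total_target + total_rest
    expected_target = total_target * combined / total
    expected_rest = total_rest * combined / total
    g2 = 0.0
    for observed, expected in ((count_target, expected_target), (count_rest, expected_rest)):
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2 * g2


# Made-up counts: a word used 100 times in 10,000 words of one decade's lyrics,
# and 10 times in 10,000 words from all other decades.
score = log_likelihood(100, 10_000, 10, 10_000)
print(score > 10.83)  # exceeds the p < 0.001 critical value for 1 degree of freedom
```

Note that this statistic is large for words that are unusually rare as well as unusually common, so picking out characteristic words also requires checking that the observed count is above the expected one.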

It's clear that individual songs that are heavy on repetition ("club", "go head", and "shorty" in the 2000s are almost surely from 50 Cent's "In da Club") influence the results. It raises a good question about log likelihood's applicability to song lyrics - does a single highly repetitive song skew the results?

Most Characteristic Lyrics by Decade


Billboard Year-End Hot 100 Ranking Policy Changes

General changes in popular song content can be at least partially attributed to the evolution of the ranking method for the Hot 100 over time. Billboard stayed relevant by changing its ranking policy [2] as the ways people discover and purchase music changed.

  • 1958-1991: ranking determined by ratio of singles sales and airplay
  • 1991: Billboard begins collecting sales data digitally (using SoundScan) for quicker and more accurate charts
  • 1998: Billboard drops requirement that song must be released as a single to appear on the chart
  • 2005: Digital downloads (iTunes) included
  • 2012: On-demand streaming services (Spotify, Rhapsody) included
  • 2013: Video views (YouTube) included

Consumers now have more say than ever in which songs chart. Prior to 2005, consumers could affect the charts by purchasing the single or requesting the song on the radio. Now, consumers can view the video, stream the song, download the single, or purchase a physical copy to have a say in what’s popular.

See further discussion at Mic.com: These Charts Show How Drastically Pop Has Changed Over the Past 50 Years