Introduction

Dear CEO and CFO of Budweiser,

We are pleased to present the results of our analysis of the data you provided. Our objective was to gain a deeper understanding of the relationship between alcohol by volume (ABV) and various other factors, such as the number of breweries in each state, the bitterness of beer, and the correlation between bitterness and alcohol content.

In conducting our analysis, we utilized various statistical tools and techniques, including scatter plots, bar charts, and KNN classifiers. Our findings have uncovered several interesting insights that we believe will be of great value to Budweiser.

Please find below a summary of our key findings. If you have any questions or comments, we would be happy to discuss them further. Our aim is to provide you with results that are clear, concise, and actionable.

Breweries by State

The data set contained 558 breweries across the United States. The top five states accounted for 31% of the total (175 breweries): Colorado led with 8.42% of all breweries, followed by California (6.99%), Michigan (5.73%), Oregon (5.20%), and Texas (5.02%).
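As a quick check on the shares above, the arithmetic can be sketched in R. The per-state brewery counts below are an assumption: they are back-calculated from the reported percentages and the 558 total, not read from the raw data.

```r
# Brewery counts for the top five states, back-calculated from the
# percentages reported above (an assumption, not taken from the raw data).
total_breweries <- 558
top_five <- c(CO = 47, CA = 39, MI = 32, OR = 29, TX = 28)

# Share of all breweries held by each state, in percent.
share_pct <- round(top_five / total_breweries * 100, 2)
print(share_pct)

# Combined count and share for the top five states.
top_five_total <- sum(top_five)
top_five_share <- round(top_five_total / total_breweries * 100, 1)
print(c(count = top_five_total, share_pct = top_five_share))
```

With the full data, the same ranking could likely be produced by tabulating the state column of the breweries table and sorting in decreasing order.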

Missing Values

Values were missing in both the ABV and IBU columns (62 and 1,005 entries, respectively), so we needed to make sure we handled them correctly. First, we determined that these values were missing at random. Then we found that a substantial share of rows, 41.7%, had at least one missing value. Unfortunately, a proportion this high means some of our predictions using IBU may be slightly skewed towards the mean.

To deal with the missing values, we used an algorithm known as multivariate imputation by chained equations (MICE) with predictive mean matching. This method estimates each missing value from the other factors that are known for that beer (e.g. predicting missing ABV using IBU):

summary(beer_brews[,c("ABV", "IBU")])
##       ABV               IBU        
##  Min.   :0.00100   Min.   :  4.00  
##  1st Qu.:0.05000   1st Qu.: 21.00  
##  Median :0.05600   Median : 35.00  
##  Mean   :0.05977   Mean   : 42.71  
##  3rd Qu.:0.06700   3rd Qu.: 64.00  
##  Max.   :0.12800   Max.   :138.00  
##  NA's   :62        NA's   :1005
(nrow(beer_brews[!complete.cases(beer_brews),]) / nrow(beer_brews)) * 100.0 # percentage of rows with at least one missing value
## [1] 41.70124
cor(beer_brews$ABV, beer_brews$IBU, use = "complete.obs") # 0.67 correlation between ABV and IBU, good predictor for multi-imputation, "mice"
## [1] 0.6706215
summary(beer_brew_whole[,c("ABV", "IBU")])
##       ABV               IBU       
##  Min.   :0.00100   Min.   :  4.0  
##  1st Qu.:0.05000   1st Qu.: 22.0  
##  Median :0.05600   Median : 35.0  
##  Mean   :0.05975   Mean   : 42.9  
##  3rd Qu.:0.06700   3rd Qu.: 64.0  
##  Max.   :0.12800   Max.   :138.0
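The imputation step itself is not shown above. A minimal sketch of how it might look with the "mice" package follows, run on synthetic data rather than the real beer table. The toy data frame, the number of imputations (m = 5), and the seed are all assumptions made for illustration, not the settings used in our analysis.

```r
library(mice)  # assumes the 'mice' package is installed

# Synthetic stand-in for the beer data: ABV rises with IBU, plus noise.
set.seed(42)
n <- 200
ibu <- round(runif(n, min = 4, max = 138))
abv <- 0.04 + 0.0004 * ibu + rnorm(n, sd = 0.008)
toy <- data.frame(ABV = abv, IBU = ibu)

# Remove some values at random to mimic the missing entries.
toy$IBU[sample(n, 80)] <- NA
toy$ABV[sample(n, 5)] <- NA

# Impute with predictive mean matching ("pmm"), creating m = 5 imputed data sets.
imp <- mice(toy, m = 5, method = "pmm", seed = 42, printFlag = FALSE)

# Extract the first completed data set; no NA's should remain.
toy_whole <- complete(imp, 1)
summary(toy_whole[, c("ABV", "IBU")])
```

Because predictive mean matching fills gaps with values borrowed from observed cases, the imputed columns stay within the range of the observed data, which is consistent with the unchanged minimum and maximum in the second summary above.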

Median ABV and IBU by State

This chart presents a summary of the median Alcohol by Volume (ABV) and International Bitterness Units (IBU) values of beers produced in each state across the United States. To compile the information, the Beer and Breweries data sets were merged, and the median ABV and IBU were calculated for each state of production. The states with the highest median IBU values are Montana (80 IBU), Delaware (77.5 IBU), and Vermont (75 IBU). The states with the highest median ABV values are Nevada (0.085), South Carolina (0.0765), Vermont (0.0715), and Kansas (0.0715).
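The merge-then-summarize step described above can be sketched in base R. The two miniature data frames and their column names (Brewery_id, Brew_ID) are hypothetical stand-ins, not the actual tables.

```r
# Hypothetical miniature versions of the two data sets; column names are assumed.
beers <- data.frame(
  Beer_Name  = c("A", "B", "C", "D", "E"),
  ABV        = c(0.050, 0.070, 0.060, NA, 0.080),
  IBU        = c(20, 60, NA, 35, 90),
  Brewery_id = c(1, 1, 2, 2, 2)
)
breweries <- data.frame(
  Brew_ID = c(1, 2),
  State   = c("CO", "OR")
)

# Attach each beer to the state of the brewery that produces it.
merged <- merge(beers, breweries, by.x = "Brewery_id", by.y = "Brew_ID")

# Median ABV and IBU per state, ignoring missing values within each column.
state_medians <- aggregate(
  merged[, c("ABV", "IBU")],
  by    = list(State = merged$State),
  FUN   = median,
  na.rm = TRUE
)
print(state_medians)
```

Passing na.rm = TRUE through to median() lets a beer with a missing IBU still contribute its ABV, which matters when IBU is missing far more often than ABV.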

Highest ABV & IBU Beers

ABV

We could simply pick the states that produce the individual beers with the greatest ABV and IBU, respectively, but it is more informative to look at trends among the states producing these high-scoring beers. To make sure we were only picking beers with true ABV/IBU numbers, we used the original data set, without imputed values. From these results, we found that the states producing the highest-ABV beers were Colorado, Kentucky, Indiana, New York, and Michigan.

From the plot below, the data appears to indicate that while Colorado produces the single highest-ABV beer, “Lee Hill Series Vol. 5 - Belgian Style Quadrupel Ale,” it is outpaced by Kentucky in concentration of high-ABV beers produced:

beer_brew_abv[order(beer_brew_abv$ABV, decreasing = TRUE )[1:6],c("Beer_Name",  "State", "ABV")] 
##                                                Beer_Name State   ABV
## 384 Lee Hill Series Vol. 5 - Belgian Style Quadrupel Ale    CO 0.128
## 9                                         London Balling    KY 0.125
## 149                                                 Csar    IN 0.120
## 387     Lee Hill Series Vol. 4 - Manhattan Style Rye Ale    CO 0.104
## 344                                               4Beans    NY 0.100
## 57                                  Wizard Burial Ground    MI 0.099
library(ggplot2)  # ggplot(), geom_density()
library(ggthemes) # theme_wsj()

# keep only the five states that produce the highest-ABV beers
hi_abv <- beer_brew_abv[beer_brew_abv$State %in% c("CO", "MI", "KY", "IN", "NY"),]
par(mar=c(10,10,0,0)) # base-graphics margins; this has no effect on the ggplot2 plot below

# ABV density per state; Colorado gets its own outline color and line type
ggplot(data = hi_abv, aes(x = ABV * 100, fill = State, linetype = (State != "CO"), color = (State != "CO"))) +
  geom_density(alpha = 0.3, position="identity") +
  labs(x = "Alcohol By Volume (%)", y = "Density", title="Density of High-ABV \nBeers by State") +
  guides(color = "none", linetype = "none") +
  theme_wsj()+
  theme(axis.title=element_text(size=12))

Alcohol By Volume

In this analysis, we examined the distribution of Alcohol by Volume (ABV) across 2,410 beers. A histogram of ABV revealed a right-skewed distribution, with a mean ABV of 5.97% and a median of 5.60%.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00100 0.05000 0.05600 0.05975 0.06700 0.12800

IBU

We see a similar pattern for IBU. The aptly named “Bitter Bitch Imperial IPA” brings Oregon to the top of the list, even though Oregon appears to have fewer beers at the high end of the IBU distribution.

##                            Beer_Name State IBU
## 1857       Bitter Bitch Imperial IPA    OR 138
## 1719              Troopers Alley IPA    VA 135
## 1305                   Dead-Eye DIPA    MA 130
## 625  Bay of Bengal Double IPA (2014)    OH 126
## 425                     Abrasive Ale    MN 120
## 1452                    Heady Topper    VT 120

Testing the Correlation Between IBU and ABV

As hinted at earlier, during the missing value imputation for ABV and IBU, we noticed that the two columns seemed to be adequate predictors of one another. The data indicates a moderately strong positive correlation between ABV and IBU, with an r-value of 0.652 (r-values range from -1.0 to 1.0):

##       ABV               IBU        
##  Min.   :0.00100   Min.   :  4.00  
##  1st Qu.:0.05000   1st Qu.: 21.00  
##  Median :0.05600   Median : 35.00  
##  Mean   :0.05977   Mean   : 42.71  
##  3rd Qu.:0.06700   3rd Qu.: 64.00  
##  Max.   :0.12800   Max.   :138.00  
##  NA's   :62        NA's   :1005
## [1] 0.6523603

IPAs vs. Ales:

Using K-Nearest Neighbors to Predict IPAs and Ales from IBU/ABV

We wanted to test the hypothesis that IPAs tend to be more bitter and have a higher alcohol content than other types of ales. To do this, we first filtered all non-ales out of the data set. We then used a predictive machine learning algorithm known as K-nearest neighbors, which finds the beers closest to a test beer in terms of IBU and ABV and uses the labels of those neighbors to predict whether the test beer is likely an IPA or another type of ale.
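The K-nearest neighbors idea can be sketched on synthetic data with knn() from the "class" package. Everything below is an assumption made for illustration: the simulated IBU/ABV values, the 70/30 split, the feature scaling, and k = 5 are not the settings behind the results reported in this section.

```r
library(class)  # provides knn(); ships with standard R installations

set.seed(1)
n <- 150
# Synthetic ales: IPAs are simulated with higher IBU and ABV than other ales.
ipa   <- data.frame(IBU = rnorm(n, 70, 15), ABV = rnorm(n, 0.070, 0.010), is_ipa = TRUE)
other <- data.frame(IBU = rnorm(n, 30, 12), ABV = rnorm(n, 0.055, 0.010), is_ipa = FALSE)
ales  <- rbind(ipa, other)

# Scale both features so IBU (tens) does not swamp ABV (hundredths) in the
# distance calculation. For brevity this scales on all rows at once.
features <- scale(ales[, c("IBU", "ABV")])

# 70/30 train/test split.
train_idx <- sample(nrow(ales), size = round(0.7 * nrow(ales)))

# Each test beer is labeled by majority vote among its 5 nearest training beers.
pred <- knn(
  train = features[train_idx, ],
  test  = features[-train_idx, ],
  cl    = factor(ales$is_ipa[train_idx]),
  k     = 5
)

# Cross-tabulate predictions against the true labels and compute accuracy.
reference <- factor(ales$is_ipa[-train_idx], levels = c(FALSE, TRUE))
conf_mat  <- table(Prediction = pred, Reference = reference)
accuracy  <- sum(diag(conf_mat)) / sum(conf_mat)
print(conf_mat)
print(accuracy)
```

Scaling is worth considering here because ABV is recorded as a fraction while IBU runs into the hundreds, so unscaled distances would be driven almost entirely by IBU.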

From the model we built using the K-nearest neighbors approach, we were able to predict with 79.6% accuracy whether a given ale was an IPA or not.

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction FALSE TRUE
##      FALSE   250   49
##      TRUE     45  116
##                                           
##                Accuracy : 0.7957          
##                  95% CI : (0.7559, 0.8316)
##     No Information Rate : 0.6413          
##     P-Value [Acc > NIR] : 4.129e-13       
##                                           
##                   Kappa : 0.5534          
##                                           
##  Mcnemar's Test P-Value : 0.757           
##                                           
##             Sensitivity : 0.8475          
##             Specificity : 0.7030          
##          Pos Pred Value : 0.8361          
##          Neg Pred Value : 0.7205          
##              Prevalence : 0.6413          
##          Detection Rate : 0.5435          
##    Detection Prevalence : 0.6500          
##       Balanced Accuracy : 0.7752          
##                                           
##        'Positive' Class : FALSE           
## 
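The headline statistics can be recomputed directly from the four counts in the confusion matrix above. This sketch uses only those counts; note that the 'Positive' class is FALSE (not an IPA), so sensitivity describes how well non-IPAs are identified and specificity describes how well IPAs are identified.

```r
# Counts copied from the confusion matrix above (rows = prediction, columns = reference).
conf_mat <- matrix(
  c(250, 45, 49, 116),
  nrow = 2,
  dimnames = list(Prediction = c("FALSE", "TRUE"), Reference = c("FALSE", "TRUE"))
)

# Accuracy: correct predictions over all predictions.
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)

# Sensitivity: share of true non-IPAs (the positive class) predicted as non-IPAs.
sensitivity <- conf_mat["FALSE", "FALSE"] / sum(conf_mat[, "FALSE"])

# Specificity: share of true IPAs predicted as IPAs.
specificity <- conf_mat["TRUE", "TRUE"] / sum(conf_mat[, "TRUE"])

# Balanced accuracy: mean of sensitivity and specificity.
balanced_accuracy <- (sensitivity + specificity) / 2

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, balanced = balanced_accuracy), 4))
```

Read this way, the model correctly flags about 70% of true IPAs and about 85% of other ales.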

Interesting Findings:

In this analysis, we examined the relationship between the Alcohol by Volume (ABV) of beer produced in a state and that state's DUI arrest rate. The correlation coefficient between the two variables was -0.07, which is close to zero and indicates essentially no linear relationship between them. This suggests that states producing beer with higher alcohol content do not tend to have higher DUI arrest rates.

Conclusion:

Our investigation raised a number of substantial questions that we were able to answer. We found that just five states accounted for 31% of all breweries, with Colorado alone accounting for an impressive 8%. We found that a significant portion of the available beer data (41.7% of rows) did not list International Bitterness Units (IBU), opening the field to further investigation and analysis. We found the median Alcohol by Volume (ABV) for American beers to be 5.6%, with a mean of about 6.0%. And the data indicated that IBU and ABV together are strongly predictive (79.6% accuracy) of whether or not a beer would be considered an India Pale Ale (IPA).

Outside of the initial topics proposed for investigation, we also gathered relevant Driving Under the Influence (DUI) statistics on a state basis. We compared these additional findings with the data from the beers and breweries provided, and the data indicated no strong correlation between states that produce high-ABV beers and the number of DUI incidents. This result may be confounded by the fact that the beers produced in these states tend to ship both domestically and abroad. Even so, we believe it should have an overall positive impact for the marketing and sales division handling these higher-caliber products.