Question 1. Use read_csv() to read the data sets county.csv and votes.csv into R. Both files can be downloaded from the Exams folder in Blackboard.
The county data contains demographic information on each US county. Variable descriptions:
- fips: unique identifier for counties
- county: name of county
- state: state abbreviation
- pop2014: population estimate, 2014
- pct_bachelors: percent of persons age 25+ with a bachelor's degree or higher
The votes data contains information on voting outcomes for the 2016 presidential election. Variable descriptions:
- fips: unique identifier for counties
- votes_clinton: number of votes for Hillary Clinton
- votes_trump: number of votes for Donald Trump
- total_votes: total number of votes
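A minimal sketch of the read_csv() step, assuming the readr package is installed. The temporary file and its two rows are made-up stand-ins so the snippet runs anywhere; for the exam, pass the path to the downloaded county.csv or votes.csv instead. Reading fips as character is an optional choice, not something the question requires; if you use it, apply it to both files so the key types match.

```r
library(readr)

# Hypothetical stand-in for votes.csv (made-up rows, not real election data)
tmp <- tempfile(fileext = ".csv")
writeLines(
  c("fips,votes_clinton,votes_trump,total_votes",
    "00001,40,60,100",
    "00002,120,80,200"),
  tmp
)

# Read the file; keeping fips as character preserves any leading zeros
votes <- read_csv(tmp, col_types = cols(fips = col_character()))
```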
(a) Use inner_join() to combine the county and votes data frames, using fips as the key. Call the resulting combined data frame county_votes.
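As an illustration of how inner_join() behaves, here is a small sketch on made-up data frames that reuse the column names above (the values are hypothetical). Only rows whose fips appears in both data frames are kept.

```r
library(dplyr)

# Toy stand-ins for the real data (values are made up)
county <- tibble(
  fips = c(1, 2, 3),
  county = c("A County", "B County", "C County"),
  state = c("CA", "CA", "TX"),
  pop2014 = c(1000, 2000, 3000),
  pct_bachelors = c(20, 35, 50)
)
votes <- tibble(
  fips = c(1, 2, 4),
  votes_clinton = c(40, 120, 10),
  votes_trump = c(60, 80, 90),
  total_votes = c(100, 200, 100)
)

# Keep only counties present in both data frames, matched on fips
county_votes <- inner_join(county, votes, by = "fips")
```

In this toy example, fips 3 and fips 4 each appear in only one data frame, so neither ends up in county_votes.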
(b) Use mutate() to add a new column called pct_clinton to the county_votes data frame, defined as the number of votes for Hillary Clinton divided by the total number of votes, multiplied by 100. That is, pct_clinton = 100 * votes_clinton / total_votes.
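The formula can be sketched with mutate() on a toy data frame (made-up values, same column names as above):

```r
library(dplyr)

# Toy stand-in for county_votes (values are made up)
county_votes <- tibble(
  fips = c(1, 2),
  votes_clinton = c(40, 120),
  total_votes = c(100, 200)
)

# Add the percentage of votes for Clinton as a new column
county_votes <- county_votes %>%
  mutate(pct_clinton = 100 * votes_clinton / total_votes)
```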
(c) Which counties had the highest percentage of votes for Clinton? Which counties had the lowest percentage of votes for Clinton? [Hint: use arrange()]
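One possible way to use the hint, sketched on made-up data: arrange() sorts in ascending order by default, and wrapping the column in desc() reverses it. Showing two rows with head() is an arbitrary choice for this toy example.

```r
library(dplyr)

# Toy stand-in for county_votes (values are made up)
county_votes <- tibble(
  county = c("A County", "B County", "C County", "D County"),
  state = c("CA", "CA", "TX", "TX"),
  pct_clinton = c(40, 60, 25, 75)
)

# Largest percentages first
highest <- county_votes %>% arrange(desc(pct_clinton)) %>% head(2)

# Smallest percentages first
lowest <- county_votes %>% arrange(pct_clinton) %>% head(2)
```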
(d) Use ggplot() to make a scatter plot with pct_bachelors on the x-axis and pct_clinton on the y-axis. Use geom_smooth() to add a smooth trend line to the scatter plot. Describe the association between the variables in the scatter plot.
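A minimal sketch of the plot structure, assuming ggplot2 is installed. The data here are synthetic (generated with a fixed seed purely so the code runs) and say nothing about the real association; with the real data, pass county_votes to ggplot() instead.

```r
library(ggplot2)

# Synthetic stand-in data (not real county values)
set.seed(1)
toy <- data.frame(pct_bachelors = seq(10, 60, length.out = 30))
toy$pct_clinton <- 15 + 0.8 * toy$pct_bachelors + rnorm(30, sd = 5)

# Scatter plot with a smooth trend line layered on top
p <- ggplot(toy, aes(x = pct_bachelors, y = pct_clinton)) +
  geom_point() +
  geom_smooth()

# print(p) draws the plot
```

The same structure applies to any subset of the data: only the data frame passed to ggplot() changes.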
(e) Use filter() to subset the rows of county_votes corresponding to counties that are in California. Call the subsetted data frame county_votes_ca.
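A small sketch of the filter() step on made-up data, assuming California is coded as "CA" in the state column (check the actual abbreviation in your data):

```r
library(dplyr)

# Toy stand-in for county_votes (values are made up)
county_votes <- tibble(
  county = c("A County", "B County", "C County"),
  state = c("CA", "TX", "CA"),
  pct_clinton = c(40, 25, 60)
)

# Keep only the rows where state is "CA" (assumed coding)
county_votes_ca <- county_votes %>% filter(state == "CA")
```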
(f) Use ggplot() to make a scatter plot with pct_bachelors on the x-axis and pct_clinton on the y-axis, but only for the subset of counties that are in California. Use geom_smooth() to add a smooth trend line to the scatter plot. Describe the association between the variables in the scatter plot.