Visual Exploration of the Diamonds Dataset Using ggplot2 in R

By Kelvin Kiprono in ggplot2

November 25, 2024

Introduction

The diamonds dataset, included in R’s ggplot2 package, is a widely used dataset for practicing data visualization and analysis.

  • Objective: To uncover patterns, distributions, and relationships among diamond attributes using visualizations.
  • Key Questions: -What are the distributions of key variables like price and carat? -How do attributes like cut, clarity, and color influence price? -Are there patterns or correlations between variables like carat and price?

Description of Variables

-price:Numeric; the price of the diamond in US dollars. -carat:Numeric; weight of the diamond (1 carat = 200 mg). -cut:Categorical; quality of the diamond’s cut: Fair, Good, Very Good, Premium, Ideal. -clarity:Categorical; internal purity: IF, VVS1, VVS2, VS1, VS2, SI1, SI2, I1. -color:Categorical; color grade: D (Colorless) to J (Noticeable Color).

library(ggplot2)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ lubridate 1.9.3     ✔ tibble    3.2.1
## ✔ purrr     1.0.2     ✔ tidyr     1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)

Exploratory Data Analysis (EDA)

Univariate Analysis

Price Distribution -Visualization: Histogram. -Insight: Identifies price range and skewness.

ggplot(diamonds, aes(x = price)) +
  geom_histogram(binwidth = 500, fill = "blue", color = "black") +
  labs(title = "Distribution of Diamond Prices", x = "Price", y = "Count")

Carat Distribution

  • Visualization: Boxplot.
  • Insight: Highlights common carat sizes and outliers.
ggplot(diamonds, aes(y = carat)) +
  geom_boxplot(fill = "lightgreen", color = "black") +
  labs(title = "Boxplot of Carat", y = "Carat")

Cut Quality Distribution

  • Visualization: Bar chart.
  • Insight: Shows the frequency of different cut categories.
ggplot(diamonds, aes(x = cut)) +
  geom_bar(fill = "purple", color = "black") +
  labs(title = "Count of Diamonds by Cut", x = "Cut", y = "Count")

BIVARIATE ANALYSIS

Carat vs. Price

  • Visualization: Scatter plot.
  • Insight: Examines the relationship between carat size and price.
ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(alpha = 0.3, color = "darkblue") +
  labs(title = "Scatter Plot of Carat vs. Price", x = "Carat", y = "Price")

Price by Cut

  • Visualization: Boxplot.
  • Insight: Compares price ranges across cut categories.
ggplot(diamonds, aes(x = cut, y = price)) +
  geom_boxplot(fill = "orange", color = "black") +
  labs(title = "Price Distribution by Cut", x = "Cut", y = "Price")

Price by Clarity

  • Visualization: Violin plot.
  • Insight: Shows the distribution of price for each clarity grade.
ggplot(diamonds, aes(x = clarity, y = price)) +
  geom_violin(fill = "cyan", color = "black") +
  labs(title = "Price Distribution by Clarity", x = "Clarity", y = "Price")

MULTIVARIATE ANALYSIS

Carat vs. Price by Clarity

  • Visualization: Scatter plot with clarity as color.
  • Insight: Explores how clarity influences the carat-price relationship.
ggplot(diamonds, aes(x = carat, y = price, color = clarity)) +
  geom_point(alpha = 0.5) +
  labs(title = "Carat vs. Price by Clarity", x = "Carat", y = "Price")

Carat vs. Price by Cut

  • Visualization: Faceted scatter plot.
  • Insight: Examines variations in the carat-price relationship across cuts.
ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(alpha = 0.8, color = "grey") +
  facet_wrap(~cut) +
  labs(title = "Carat vs. Price Faceted by Cut", x = "Carat", y = "Price")

Price Density Plot

  • Visualization: Density plot.
  • Insight: Highlights peaks and smooth distribution of prices.
options(scipen = 1000)
ggplot(diamonds, aes(x = price)) +
  geom_density(fill = "lightblue", alpha = 0.5) +
  labs(title = "Density Plot of Prices", x = "Price", y = "Density")

Using transition_time (from gganimate package)

This is useful when visualizing how a variable changes over time or sequentially through categories. Although the diamonds dataset doesn’t include a time variable, you can create sequences based on categorical or numeric variables for animation.

Animate Price Distribution by Cut or Clarity Animate how the price distribution evolves for different cut or clarity categories.

library(gganimate)
## Warning: package 'gganimate' was built under R version 4.4.2
ggplot(diamonds, aes(x = price)) +
  geom_histogram(binwidth = 500, fill = "skyblue", color = "black") +
  labs(title = "Price Distribution by Cut", x = "Price", y = "Count") +
  transition_states(cut, transition_length = 2, state_length = 1) +
  labs(subtitle = "Cut: {closest_state}")

Animate Carat vs. Price by Clarity Show how the carat-price relationship changes for different clarity grades.

ggplot(diamonds, aes(x = carat, y = price, color = clarity)) +
  geom_point(alpha = 0.5) +
  labs(title = "Carat vs. Price by Clarity", x = "Carat", y = "Price") +
  transition_states(clarity, transition_length = 2, state_length = 1) +
  labs(subtitle = "Clarity: {closest_state}")

Using patchwork (from patchwork package)

-Side-by-Side Comparison of Univariate Distributions Combine histograms of price and carat distributions.

library(patchwork)
## Warning: package 'patchwork' was built under R version 4.4.2
plot_price <- ggplot(diamonds, aes(x = price)) +
  geom_histogram(binwidth = 500, fill = "blue") +
  labs(title = "Price Distribution", x = "Price", y = "Count")

plot_carat <- ggplot(diamonds, aes(x = carat)) +
  geom_histogram(binwidth = 0.1, fill = "green") +
  labs(title = "Carat Distribution", x = "Carat", y = "Count")

plot_price + plot_carat

#Comparing Relationships Across Variables Place scatter plots of carat vs. price faceted by cut and clarity together for better comparison.

plot_cut <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(alpha = 0.3, color = "red") +
  facet_wrap(~cut) +
  labs(title = "Carat vs. Price by Cut", x = "Carat", y = "Price")

plot_clarity <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(alpha = 0.3, color = "blue") +
  facet_wrap(~clarity) +
  labs(title = "Carat vs. Price by Clarity", x = "Carat", y = "Price")

plot_cut | plot_clarity

Animate Price Density Plot by Clarity

ggplot(diamonds, aes(x = price, fill = clarity)) +
  geom_density(alpha = 0.5) +
  labs(title = "Price Density Across Clarity Grades",
       x = "Price", y = "Density",
       subtitle = "Clarity: {closest_state}") +
  transition_states(clarity, transition_length = 2, state_length = 1) +
  theme_minimal()

Conclusion

Visualizations provide a clear understanding of the diamonds dataset, revealing key trends and relationships.

Posted on:
November 25, 2024
Length:
5 minute read, 966 words
Categories:
ggplot2
Tags:
Statistical Data Analysis
See Also:
Exploring the Iris Dataset: A Visual Analysis of Sepal and Petal Characteristics
Exploring Air Quality in New York: A Predictive Analysis of Ozone Levels Using Environmental Factors
Time Series Analysis