Monday, May 01, 2023

Violin plot versus Box-Whisker Plot

A box and whisker plot (also called a box plot or box-whisker diagram) is a graphical method of displaying variation in a set of data. In most cases a histogram provides a sufficient display, but a box and whisker plot can provide additional detail while allowing multiple data sets to be displayed in the same graph. The box-whisker plot displays the following five values from the data set:

  1. Minimum value: The smallest value in the data set
  2. First quartile (Q1): The value below which the lowest 25% of the data fall
  3. Median value (second quartile, Q2): The middle value of the ordered data
  4. Third quartile (Q3): The value above which the highest 25% of the data fall
  5. Maximum value: The largest value in the data set

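The box-whisker plot can also indicate the mean value (often drawn as a dot). The difference between the mean and the median indicates how skewed the data are: a mean well above the median suggests a right-skewed distribution, and a mean well below it suggests a left-skewed one.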
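As a rough illustration, the five values and the mean-versus-median comparison can be computed with Python's standard `statistics` module. The sample data and the function name `five_number_summary` are made up for this sketch, and the quartiles use the "inclusive" interpolation method; other software such as SAS or R may use a different quartile definition and give slightly different values for Q1 and Q3.

```python
import statistics


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a list of at least two numbers."""
    data = sorted(values)
    # n=4 splits the data into quartiles and returns the three cut points.
    # method="inclusive" is one of several quartile conventions (an assumption
    # made here, not necessarily what a given plotting package uses).
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return data[0], q1, median, q3, data[-1]


# Hypothetical right-skewed sample: one large value pulls the mean upward.
sample = [2, 4, 4, 5, 6, 7, 8, 9, 30]

summary = five_number_summary(sample)
mean = statistics.mean(sample)

print("min, Q1, median, Q3, max:", summary)
print("mean:", round(mean, 2))
```

For this sample the summary is (2, 4.0, 6.0, 8.0, 30) and the mean is about 8.33, above the median of 6, which reflects the single large value pulling the distribution to the right.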


The box and whisker plot can also show outliers, where outliers are defined as values below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR. Here Q1 is the 25th percentile, Q3 is the 75th percentile, and IQR (the interquartile range) is the distance between them, Q3 - Q1.
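The outlier rule can be sketched in a few lines of Python. The function name `iqr_outliers` and the sample data are hypothetical, and the quartiles again use the "inclusive" method from the standard `statistics` module, so the fences may differ slightly from those drawn by a particular plotting package.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return (lower_fence, upper_fence, outliers) using the k * IQR rule."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1                      # interquartile range
    lower = q1 - k * iqr               # values below this are flagged
    upper = q3 + k * iqr               # values above this are flagged
    outliers = [v for v in values if v < lower or v > upper]
    return lower, upper, outliers


# Hypothetical sample with one suspiciously large value.
sample = [2, 4, 4, 5, 6, 7, 8, 9, 30]

lower, upper, outliers = iqr_outliers(sample)
print("fences:", lower, upper)
print("outliers:", outliers)
```

With Q1 = 4 and Q3 = 8 the IQR is 4, so the fences are -2 and 14, and only the value 30 is flagged as an outlier.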


A box plot can also show only the box, with the lower quartile, upper quartile, and median, and omit the minimum and maximum values. In a paper by White et al., "Combination Therapy with Oral Treprostinil for Pulmonary Arterial Hypertension A Double-Blind Placebo-controlled Clinical Trial", box plots without the minimum and maximum were used to present the NT-proBNP data (a measure with a skewed distribution).

Recently, I have seen several papers using violin plots to display data distributions. According to Wikipedia:

A violin plot is a statistical graphic for comparing probability distributions. It is similar to a box plot, with the addition of a rotated kernel density plot on each side.

Violin plots are similar to box plots, except that they also show the probability density of the data at different values, usually smoothed by a kernel density estimator. Typically a violin plot will include all the data that is in a box plot: a marker for the median of the data; a box or marker indicating the interquartile range; and possibly all sample points, if the number of samples is not too high.

A violin plot is more informative than a plain box plot. While a box plot only shows summary statistics such as mean/median and interquartile ranges, the violin plot shows the full distribution of the data. The difference is particularly useful when the data distribution is multimodal (more than one peak). In this case a violin plot shows the presence of different peaks, their position and relative amplitude.

Like box plots, violin plots are used to represent comparison of a variable distribution (or sample distribution) across different "categories" (for example, temperature distribution compared between day and night, or distribution of car prices compared across different car makers).

A violin plot can have multiple layers. For instance, the outer shape represents all possible results. The next layer inside might represent the values that occur 95% of the time. The next layer (if it exists) inside might represent the values that occur 50% of the time.

Although more informative than box plots, they are less popular. Because of their unpopularity, they may be harder to understand for readers not familiar with them. In this case, a more accessible alternative is to plot a series of stacked histograms or kernel density distributions.
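To make the multimodal case concrete, here is a minimal sketch of the kernel density estimate that forms the outline of a violin. It is written in plain Python with a Gaussian kernel; the data, the hand-picked bandwidth of 0.8, and the helper names `gaussian_kde` and `count_peaks` are all assumptions for illustration. Real plotting packages choose the bandwidth automatically and may use different kernels.

```python
import math
import statistics


def gaussian_kde(values, bandwidth):
    """Return a function that estimates the density at x with a Gaussian kernel."""
    norm = 1.0 / (len(values) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        # Each observation contributes one Gaussian bump centred on itself.
        return norm * sum(
            math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values
        )

    return density


def count_peaks(density, lo, hi, steps=200):
    """Count local maxima of the density on an evenly spaced grid."""
    grid = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    ys = [density(x) for x in grid]
    return sum(1 for i in range(1, steps) if ys[i - 1] < ys[i] > ys[i + 1])


# Hypothetical bimodal sample: one cluster around 2, another around 9.
sample = [1.0, 1.5, 2.0, 2.0, 2.5, 3.0, 8.0, 8.5, 9.0, 9.0, 9.5, 10.0]

density = gaussian_kde(sample, bandwidth=0.8)   # bandwidth chosen by hand
peaks = count_peaks(density, min(sample) - 3, max(sample) + 3)
median = statistics.median(sample)

print("number of peaks in the density:", peaks)
print("median:", median)
```

The density has two peaks, while the median of 5.5 falls in the gap between the clusters where no observations exist. A box plot of these data would draw its median line in that empty region, whereas the violin outline would show the two bulges.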


In a paper by Colli et al., "Burden of Nonsynonymous Mutations among TCGA Cancers and Candidate Immune Checkpoint Inhibitor Responses", a violin plot was used to display the distribution of the number of NsM (log10) across different tumor types.


SAS has a procedure, PROC BOXPLOT, for generating box-whisker plots, and SAS code is also available for generating violin plots. Other data analysis software, including R, has packages for generating both box-whisker plots and violin plots.
