Tutorial: Chord Diagram

I’ve seen a lot of tutorials for Chord Diagrams in Tableau that use data densification. Back in the day when I needed to create my own “shapefiles” for Tableau I used data modeling. I’m not positive where the origins came for this term, but I first spotted a few years ago on Twitter by Tableau Zen Master Rody Zakovic. He also wrote about it on his website. With this post I wanted to take the shroud off the chart type and show you how I personally go about creating it.

For this tutorial, I’m going to use superstore data. Chord diagrams look at pairwise relationships. For this I’m going to look at which products are bought together in orders. 

Part I: Modeling the Data

We’re not going to use raw data for this, most of the time we should, but because a chord diagram visualizes pairwise relationships its just easier to prep data prior to bringing it into Tableau.

To prep the data we’re going to use R. We could use a number of tools but the most important piece (for me) is the ability to create a rank. Author note: I recognize we could do this inside of Tableau, but doing this allows us to avoid Table Calculations, which ultimately cause visualizations to render slower. Thats right, modeling the data will render a data faster than using data densification.

 

We’re going to use the tidyverse and magrittr R packages for this prep. If you are new to R then you’ll want to be sure to check out tidyverse.

library(tidyverse)
library(magrittr)

product <- read_csv(".../Sample - Superstore.xls")

Now that our data are loaded we’re going get the rank of every single product. There are other ways to get to the same result of what you are going to see.

##  Get total quantity by product then rank then get summarize other KPIs.
product_summary <-
  product %>%
  group_by(`Product Name`) %>%
  summarise(Quantity = sum(Quantity)) %>%
  arrange(desc(Quantity)) %>%
  mutate(rank = row_number(-Quantity)) %>%
  ungroup() %>%
  select(-Quantity)

Here is the output so far. Just a rank of each product .

# A tibble: 1,850 x 2
   `Product Name`                                 rank
   <chr>                                         <int>
 1 Staples                                           1
 2 Staple envelope                                   2
 3 Easy-staple paper                                 3
 4 Staples in misc. colors                           4
 5 KI Adjustable-Height Table                        5
 6 Avery Non-Stick Binders                           6
 7 Storex Dura Pro Binders                           7
 8 GBC Premium Transparent Covers with Diagonal…     8
 9 Situations Contoured Folding Chairs, 4/Set        9
10 Staple-based wall hangings                       10
# ... with 1,840 more rows

To make our life easy we are going to join this back to our raw dataset. After we join, were going to group by Product ID and Order ID and summarize sales, profit, and quantity. There are additional fields in the group by we want the remain in the data. All of these are product attributes.

product_summary %<>%
left_join(product) %>% filter(!is.na(rank)) %>% group_by(`Category`, `Sub-Category`, `Order ID`, `Product Name`, rank) %>% summarise( Sales = sum(Sales), Profit = sum(Profit), Quantity = sum(Quantity) ) %>% arrange(rank)

Again, a snippet of the output

# A tibble: 9,986 x 8
# Groups:   Category, Sub-Category, Order ID, Product Name [9,986]
   Category        `Sub-Category` `Order ID`     `Product Name`  rank Sales Profit Quantity
   <chr>           <chr>          <chr>          <chr>          <int> <dbl>  <dbl>    <dbl>
 1 Office Supplies Fasteners      CA-2015-102673 Staples            1  7.52   1.41        5
 2 Office Supplies Fasteners      CA-2015-106439 Staples            1 11.3    5.22        3
 3 Office Supplies Fasteners      CA-2015-108189 Staples            1  4.67   1.46        2
 4 Office Supplies Fasteners      CA-2015-116757 Staples            1 21.3    7.19        6
 5 Office Supplies Fasteners      CA-2015-119032 Staples            1  3.76   1.32        2
 6 Office Supplies Fasteners      CA-2015-122336 Staples            1 31.0   10.1        13
 7 Office Supplies Fasteners      CA-2015-139857 Staples            1 12.4    5.80        5
 8 Office Supplies Fasteners      CA-2015-140165 Staples            1  7.10   2.40        2
 9 Office Supplies Fasteners      CA-2015-145317 Staples            1 18.9    5.92        3
10 Office Supplies Fasteners      CA-2015-153983 Staples            1 30.4   15.2         5
# ... with 9,976 more rows

Next comes the fun part. We’re going to take the summary products and orders and change the name of every column from ColumnName to ColumnName2. This going to be the second dataset that will allow us to do pairwise comparisons. Once we have this data we can join the data back to its original data source (I use a right join here so the original version is the left-side of the join, but honestly it doesn’t matter).

Once we have our join on order ID, we can then summarize our data at the product-product2 level (this is our pairwise comparison). Again, we’ll include any product characteristics in the group_by statement to keep it around in the end.

product_summary %<>%
  rename(
    Category2 = Category,
    `Sub-Category2` = `Sub-Category`,
    `Product Name2` = `Product Name`,
    rank2 = rank,
    Sales2 = Sales,
    Profit2 = Profit,
    Quantity2 = Quantity,
  ) %>%
  right_join(product_summary) %>%
  group_by(Category, `Sub-Category`, `Product Name`, rank, Category2, `Sub-Category2`, `Product Name2`, rank2) %>%
  summarise(
    Sales = sum(Sales),
    Sales2 = sum(Sales2),
    Quantity = sum(Quantity),
    Quantity2 = sum(Quantity2),
    Profit = sum(Profit),
    Profit2 = sum(Profit2)
  ) %>%
  arrange(rank, rank2)

While not exactly the output (since we’ll have product characteristics still, here is approximately what we will have produced. In Row 3 you will see that there were 9 “FUR-BO-10000112” and 3 “OFF-PA-10004041” were sold in the same orders. In some cases like row 1 and 4 you see the exact same Product ID with the exact same quantity between the pairs. These are actually orders where no other products were bought at the same time!

# A tibble: 21,632 x 4
   `Product Name` `Product Name2`                              Quantity Quantity2
   <chr>          <chr>                                           <dbl>     <dbl>
 1 Staples        Staples                                           215       215
 2 Staples        KI Adjustable-Height Table                          8         6
 3 Staples        Kingston Digital DataTraveler 16GB USB 2.0          2         1
 4 Staples        Staple holder                                       3         1
 5 Staples        GBC Standard Recycled Report Covers, Clear …       12         5
 6 Staples        Maxell 4.7GB DVD-R                                 13         7
 7 Staples        Wilson Jones Easy Flow II Sheet Lifters             3         2
 8 Staples        Wilson Jones Clip & Carry Folder Binder Too…        4         5
 9 Staples        Adjustable Depth Letter/Legal Cart                  3         4
10 Staples        Eldon Shelf Savers Cubes and Bins                   4         4
# ... with 21,622 more rows

Now its time for the “hard core data modeling”, but before that, I want to get rid of any pairs where Product ID and ProductID2 are the same. I add a column of ones called one and a column called level, which I’ll talk about in a bit. But once I do this, I create a data frame with a two columns: one with straight 1s and the second that goes from 0.00 to 1.00 by .01, but then has 2, 3, 4, 5, 6, 7 in it too. The 0.00 to 1.00 will help form the chords, 2 and 3 will create the dots beyond the chords, and 4 through 7 will create  some bars/lines.

product_summary %>%
  mutate(one = 1) %>%
  left_join(
    data_frame(
      one = 1,
      t = c(c(0:100)/100, c(2:7))
    )
  ) -> tmp

Finally, we just need to export the data for use in Tableau.

write_csv(tmp, ".../superstore_chord.csv")

Part II: Building the Chord Diagram

Are you ready to build this:

Things are about to get a bit crazy. My directions might not be perfect, but when in doubt check the calculations. The most important fields for the chord diagram are rank and rank2. These are the ranks for the pairwise relationships and the basis for the to-from relationship of the chords.

 

First we need to make the source and destination nodes. We do need to know the total nodes on the view to figure this out, which we will dictate with an integer parameter we’ll call [Rank Select]. For this example, our parameter has a range from 1 to 400. We’ll then build a boolean calculation called [Rank Select | TF] that returns only nodes within the range of this parameter. Add this as a context filter on the view.

 

//Rank Select | TF
[Rank] <= [Rank Select]
AND 
[Rank2] <= [Rank Select]

Now that we know the total number of nodes that will be on the view we can identify the source and destination locations for every node. To do so we need to find the x and y coordinates. We can use the rank and tha parameter to identify locations along the unit circle.

//x1 (source)
COS([Rank]/[Rank Select] * PI() * 2)

//x2 (destination)
COS([Rank2]/[Rank Select] * PI() * 2) 

//y1 (source)
SIN([Rank]/[Rank Select] * PI() * 2)

//y2 (destination)
SIN([Rank2]/[Rank Select] * PI() * 2) 

All of these calculations won’t actually end up on the visualization, but we’ll use these to make our chords.

 

Remember we did some data modeling before it came into Tableau. We have a column, [t], that ranges from 0 to 1 by .01 (plus some other values). Each of those values will help us draw a line between the chords.

 

To draw the curves we need to use a formula for a 3 point spline. Which is:

((1-t)2 * pt1) + (2t(1-t) * pt2) + (t2 * pt3)

Now we just need to fill in each of the three points: Point1 is the starting point, point 3 is the end point, and point 2 is some mysterious 3rd point we’ll use. Most visualizations I’ve seen use 0,0 for this midpoint. I don’t personally like this as an option but first I’ll show you what this looks like.

The prototypical "start" pattern with a 0,0 midpoint.
It makes a really nice star pattern, but my issue is that if there are two nodes close together they still stretch out really far into the center of the circle and it hard to tell which is which. What would be nice is if the closer the nodes were together, the less intense of an arc. This means the midpoint would have to be dynamic and based on the length distance two nodes are separated. So after a few days of trial and error I landed on a dynamic midpoint defined by the distance between the two points. First the [segment length].
SQRT(([x2] - [x1])^2 + ([y2] - [y1])^2)
We’ll also use two more formulas that help us find the location of this midpoint:
//[angle x]
COS((ACOS([x1])+ACOS([x2]))/2)

//[angle y]
SIN((ASIN([y1])+ASIN([y2]))/2)
The end result are chords that bend in a more systematic fashion.

So were going to build our chord diagram very soon, but remember when we did our data modeling? We brought in values from 0 to 1 by .01, but we also brought in values of 2, 3, 4, 5, 6, and 7. This will be different components of our dynamic chord diagram. These values will define the locations of other components on this visualization: dots before the chord start, dots after the chord end, bars before the dots and chords, and bars after the dots and chords. Here’s the math to break out the components. 

// Separator
IF t <= 1
THEN "chord"
ELSEIF t = 2
THEN "start point"
ELSEIF t = 3
THEN "end point"
ELSEIF t = 4 OR t = 5
THEN "start bar"
ELSEIF t = 6 OR t = 7
THEN "end bar"
END
The [separator] calculation will help us create this visualization, but we finally build we need to define the height of the bars.
When we make this we’re going to have a set action based on the rank field. We can call the set [Rank | Set]. After that we can build the calculations for the [start bar] and the [end bar].
// Start Bar
{FIXED [Rank] : SUM(IF level = 1 AND t = .5 then [Quantity] end)}
/
{MAX({FIXED [Rank] : SUM(IF level = 1 AND t = .5 then [Quantity] end)})}

// End Bar
({FIXED [Rank2] : SQRT(SUM(IF level = 1 AND t = .5 AND [Rank | Set] THEN [Quantity2] end))}
/
{MAX({FIXED [Rank] : SQRT(SUM(IF level = 1 AND t = .5 AND [Rank | Set] THEN [Quantity2] end))})}) 
* 
[start bar length]

Now its time to build the formulas for the x and y we’ll actually put on the view:

// x
IF t >= 0 AND t <= 1
THEN
(( (1 - [t])^2 ) * [1b. x1]) +
(2 * (1 - [t]) * [t] *  
  (
    ((2 - [segment length])/ 2) *
    [angle x]
  )
) + 
([t]^2 * [x2])
ELSEIF t = 2
THEN 1.07 * [x1]
ELSEIF t = 3
THEN 1.07 * [x2]
ELSEIF t = 4
THEN 1.14 * [x1]
ELSEIF t = 5
THEN (1.14 + ([start bar length]* [bar height])) * [x1]
ELSEIF t = 6
THEN 1.14 * [x2]
ELSEIF t = 7
THEN (1.14 + ([end bar length]* [bar height])) * [x2]
END

//y
IF t >= 0 AND t <= 1
THEN
(( (1 - [t])^2 ) * [1d. y1]) +
(2 * (1 - [t]) * [t] *  
  (
    ((2 - [segment length])/ 2) *
    [angle y] 
  )
) + 
([t]^2 * [y2])
ELSEIF t = 2
THEN 1.07 * [y1]
ELSEIF t = 3
THEN 1.07 * [y2]
ELSEIF t = 4
THEN 1.14 * [y1]
ELSEIF t = 5
THEN (1.14 + ([start bar length]* [bar height])) * [y1]
ELSEIF t = 6
THEN 1.14 * [y2]
ELSEIF t = 7
THEN (1.14 + ([end bar length]* [bar height])) * [y2]
END
Using this we can now start building our view. We already have one dimension as a filter. Place [x] on rows and [y] on columns. Add the [Rank | Set] to color. Change the mark type to line and place t on path. Place [separator], [Product Name], [Rank], and [Rank2] on detail.

Make Rank into a set action. When you try it out, things wont work the way you think. The major issue is that we have more data than we need on this visualization. Thats because we have chords that go in both directions.

Clicking on one node will highlight all the wrong information.

Let’s clean things up. The calculation is a bit nuanced and I’m not going to explain all the details, but after a bunch of trial-and-error, I know this just works.

CASE [separator]
WHEN "chord"
THEN IF [Rank | set]
    THEN TRUE
    ELSE TRUE
//    ELSE {FIXED [Rank] : MIN(Rank2)} = Rank2
    END
WHEN "start point"
THEN IF [Rank | set]
    THEN TRUE
    ELSE {FIXED [Rank] : MIN([Rank2])} = [Rank2]
    END
WHEN "start bar"
THEN IF [Rank | set]
    THEN FALSE
    ELSE IF {COUNTD(IF [Rank | set] THEN [Rank] END)} = 0
         THEN TRUE
         ELSE FALSE
         END
    END
WHEN "end bar"
THEN IF [Rank | set]
    THEN TRUE
    ELSE FALSE
    END
END
Add this to filter, and select TRUE. Now things are mostly finished, but let’s just add a few more features: dynamic labels, and resizing of things:
// dynamic labels
IF [5c. Rank | set] AND [3. separator] = "end bar" AND t = 7
THEN [Product Name2]
ELSEIF [5c. Rank | set] AND [3. separator] = "start bar" AND [t] = 5
THEN [Product Name]
END

// Size
IF CONTAINS([3. separator], "bar")
THEN 20
ELSEIF CONTAINS([3. separator], "point")
THEN ([Quantity]) * 3
ELSE 5
END

Place [dynamic labels] on text and [size] on size.

And there you have it, an amazing chord diagram. (P.S. This workbook is downloadable.)

Comments are closed.