Responsible Data Visualization
When it comes to data visualization, we’ve seen several trends over the past year – first there were visual resumes, then bar chart races, but most recently visualizations have been dominated by COVID-19 data. Some of these visualizations have been very good, while others have been a bit misleading. The prevalence of COVID-19 visualizations means that individuals who do not necessarily interact with data or visualizations on a regular basis are being flooded with graphs and charts. When the visualizations are clear and representative of the data, this is a good thing; when they fail to use best practices, they can mislead or even induce fear or panic.
Some of the traps that poorer visualizations fall into are super common, and as data visualization experts, we know to avoid them. However, sometimes they’re all too easy to fall for – I’m talking about choropleth (or filled) maps colored by the raw number of COVID-19 cases, trend charts that use logarithmic scales without saying so, and poor color choices. In an effort to take data that was relevant to me and visualize it in an effective manner, I’ve connected to data from the Missouri Department of Health and Senior Services. The result is the visualization below.
OK, so that’s the visualization, but here’s the how and – maybe more importantly – the why:
- Color Choice: Throughout the dashboard, I use only two color families: a palette with amber as its root color and another with cyan as its root. This choice avoids the reds that I’ve seen on several COVID-19 vizzes – in Western culture, red is a warning color, and while we’re surely warning viewers of the situation, we don’t want to induce panic. Also, the visualization is colorblind friendly across all types of colorblindness, leveraging the common blue/orange combination. Finally, the colors are consistent across the visualization, making them both easy to read and impactful.
- Data Source and Update Date: Featured prominently at the top of the visualization to provide the viewer a sense of comfort in where the data is coming from as well as how fresh the data is in the viz.
- Visualization Choice – Summary Data: At the top of the dashboard, we have some summary data for both confirmed cases and COVID-19 related deaths. This data includes both the absolute numbers and a rate per 100,000 residents. Additionally, there is an indicator showing the change in the numbers from the prior day. These also serve as buttons that toggle the visualizations below between case and death data.
- Visualization Choice – Current Status of Testing: While the data could be all doom and gloom, I found the data regarding the number of tests administered in the state to be quite interesting. As I’m writing this, there have been roughly 30,000 tests performed, with only 8% coming back positive. While it might not always be quite this sunny, I thought this was a bright spot that deserved to be highlighted.
- Visualization Choice – The Map: I already mentioned above how maps can be super misleading. As data visualization professionals, we’ve all heard that when using a filled map (also called a choropleth map), we should make sure that we’re truly comparing apples to apples. Sure, St. Louis County has the most cases, but it’s also the largest county in the state by number of residents – about 200,000 ahead of the next highest county. When we look at the number of cases per capita, St. Louis County is only third highest, behind Perry County (the Missouri hotspot based on this data) and St. Louis City. Knowing that some counties are small in area and therefore do not show up well on a map (St. Louis City), I also provided a button to switch the view between the map and a list format, with the data presented as a bar chart showing how it varies between counties.
- Visualization Choice – The Trend Line: Currently, the data for Missouri actually presents better on a linear scale, making that an easy choice. Should I eventually have to go to a logarithmic scale, this will be clearly noted.
- Finally, Visualization Choice – Distribution by Age: Once again, this is one where the per capita and absolute numbers tell different stories and where we need to make sure we’re comparing apples to apples. While right now there are more cases amongst people aged 55 to 59 than 65 to 69 (259 versus 183, respectively), the per capita number is higher amongst those aged 65 to 69, and the data should be presented as such.
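To make the per capita point concrete, here is a minimal Python sketch of the per 100,000 calculation. The case counts (259 and 183) are the ones quoted above; the population figures are placeholders I made up for illustration, not Census Bureau numbers, and `per_100k` is just a hypothetical helper name. The dashboard itself does this in Tableau, so treat this as an analogue rather than the actual calculation.

```python
def per_100k(count: int, population: int) -> float:
    """Return a rate per 100,000 residents."""
    if population <= 0:
        raise ValueError("population must be positive")
    return count / population * 100_000


# Case counts come from the post. The populations are PLACEHOLDER values
# chosen only to illustrate the effect; they are not real census figures.
age_groups = {
    "55 to 59": {"cases": 259, "population": 420_000},
    "65 to 69": {"cases": 183, "population": 280_000},
}

rates = {
    group: per_100k(values["cases"], values["population"])
    for group, values in age_groups.items()
}

# The group with the most cases is not necessarily the group with the highest rate.
highest_by_count = max(age_groups, key=lambda g: age_groups[g]["cases"])
highest_by_rate = max(rates, key=rates.get)

print("Most cases:", highest_by_count)
print("Highest rate per 100k:", highest_by_rate)
```

With these placeholder populations the two rankings disagree, which is the same pattern described for the county map and the age distribution.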
There are still some tweaks to make, but these will happen over time, as the data changes and more is added. I’m definitely open to feedback on this one, so let me know if you have any.
Bonus – here’s a quick snap of how I pulled all the data together, leveraging Tableau Prep Builder. I was able to join the population data from the US Census Bureau with the COVID-19 data from the Missouri Department of Health and Senior Services, do some quick calculations and output some very clean data sources. This is the most work I’ve done with Prep Builder to date, but I found it to be pretty straightforward. There are definitely some places (documentation/annotation) where I would like to see improvements, but overall it’s a pretty solid tool for what I was doing.
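For anyone without Prep Builder, the same join-then-calculate idea can be sketched in plain Python. This is not my actual flow, just a rough analogue: the county names and numbers are made-up placeholders, and `normalize` and `join_on_county` are hypothetical helper names. The one design choice worth calling out is that counties with no population match are reported rather than silently dropped.

```python
def normalize(name: str) -> str:
    """Lowercase and collapse whitespace so small naming differences still match."""
    return " ".join(name.lower().split())


def join_on_county(cases: dict, population: dict):
    """Inner-join case counts to populations by county name.

    Returns (joined_rows, unmatched_county_names).
    """
    population_by_key = {normalize(k): v for k, v in population.items()}
    joined, unmatched = [], []
    for county, count in cases.items():
        residents = population_by_key.get(normalize(county))
        if residents is None:
            unmatched.append(county)  # surface the gap instead of hiding it
            continue
        joined.append({
            "county": county,
            "cases": count,
            "population": residents,
            "cases_per_100k": round(count / residents * 100_000, 1),
        })
    return joined, unmatched


# PLACEHOLDER data for illustration only; not real counts or census figures.
cases = {"County A": 120, "county b ": 15, "County C": 4}
population = {"County A": 1_000_000, "County B": 20_000}

rows, unmatched = join_on_county(cases, population)
for row in rows:
    print(row)
print("No population match:", unmatched)
```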