Think like a hacker pt. 1 – Never do it more than twice!

Data tools like Tableau or Alteryx have made data analysis more accessible to the average user than it has ever been. Non-technical users can now, with only a few hours of training, start building data pipelines, data sets and dashboards that used to require complex coding and data skills. It is still hugely beneficial to know basic development concepts and techniques to make sure whatever you build doesn’t fall over with the next data refresh. You can read part 2 and part 3 here.

Many of today’s data tools promise “data democratisation” – although few of them define what that means – with the assumption that things get better and faster if everybody just had access to data. While this might make sense superficially, in the real world it typically leads to a lot of new problems.

One of the main reasons why business processes in Excel fall flat (apart from the fact that it is not made for this purpose) is that the people who develop these workbooks often don’t have a development background. This is not bad per se, but it does mean that they are often not fully aware of concepts like maintainability, robustness and exception handling. What they build usually works in the end, but only at the expense of ongoing time and effort to keep it running.

A new, more powerful tool then means that there is often even more at risk should one of these processes fail.

Organisations will need to acknowledge that democratising data doesn’t stop with the implementation of a new tool. It needs training and guidance on how to use these tools to make sure they don’t build tomorrow’s legacy systems.

What to expect

In this series I will explain some basic development concepts, techniques and rules that are applicable to any tool, language and process. Whether you use Tableau or Power BI, Alteryx or SQL, or stick to Excel, applying these tips will help you develop more robust solutions. You will save time in the long run because you will reduce the need for fixes – and if one is needed, it might be quicker to implement.

I will focus on what I would consider “institutional knowledge”: things you usually learn over time, the hard way, rather than things you’d find in a textbook (I think).

One warning though: most of these tips will slightly increase your effort in the short term. You will need to spend more time developing and use more complex approaches, in order to save a lot of time and headache later on.

As always, the real world is messy and there is never a “one size fits all” solution. I aim to provide tips that apply in most cases, but take everything you read here with a grain of salt. There are always exceptions to every rule, and it’s your responsibility to ask yourself critically whether you have encountered such an exception or are just using it as an excuse to save a little bit of time right now.

Fool me once…

When working with data, you will – very quickly – end up in a situation where the same “thing” is required for different aspects of the data: data in multiple columns needs to be cleaned up in the same way, the same Year over Year comparison needs to be done for multiple measures, or different business units want the same dashboard for their data.

Often, the quickest way to achieve this is to just copy what already exists and make a few minor adjustments. Done!

But at this point the problems usually start. In most cases it doesn’t stop there, and suddenly you need to create a third version. And a fourth.

For each new version you will have to spend the same amount of effort; even worse, if you need to change something, you will suddenly have to change it in multiple places and make sure that you apply those changes consistently.

My rule for this is:

  • If you do something once, do it!
    • You need to create scripts, calculations, tools, etc., so just do these things – but stay aware of what you are doing. (Easier said than done.)
  • If you do something twice, mark it!
    • Be aware that you are repeating an action. Are you creating the same or a very similar calculation? Are you using a tool again in your workflow with the same settings? Are you creating the same chart with just one small change? That is not something to worry about yet, but you should get suspicious. Make a mental note of it; ideally, consider whether it’s likely that you will need to repeat it a third time.
  • If you do something thrice, change it!
    • When you are about to do something a third time, you should stop right there and think about how you can change your whole approach to not do it again.

Why?

With each iteration you create more work for yourself to recreate it, and much more work to change aspects of your solution down the track. You might find a bug, or the requirements change, and before you know it you will need to change multiple versions of your solution. That in itself usually requires significant effort, not to mention that the likelihood of newly introduced errors increases. It’s easy to make a mistake when copy-pasting a formula, miss a setting, or simply forget one of the versions you need to change.

How do you approach this?

The most important aspect for any problem is to fully understand it. Why do you need to do the same thing again? And even if it seems obvious, don’t just take the first reason and run with it!

When you duplicate a dashboard for another business unit, is there a reason why it needs to be a second one? Why can’t you use the first one and just add a filter? Do you run into data security issues? Do the different business units say they require different insights? Do they really, or are they supposed to look at the same metrics? This can quickly become a rabbit hole, but as a data analyst you are always also a business consultant, trying to understand how best to help your customers!

Once you fully understand why this is necessary, you might already know the solution. The biggest problem then is to implement it and – maybe – to convince your users that this is the best way forward.

Examples

1. Dashboards

When working in reporting it is quite common to look at different change rates over time: Current Month Profit, “Month on Month” change and “Year on Year” change. If you do this for Profit, Sales amount and Sales volume each, you have 9 values to calculate – and whenever something changes, you’ll need to change it three times. What other ways are there to solve the same problem?

  1. Let the user choose! Most tools provide the ability for user input. This means you create the calculations only once with a placeholder (often called a parameter) and the user chooses which measure to display.
  2. Pivot the data. Often you can avoid the repetition by transforming the measures from three columns into one column. You end up with only three calculations and can use a filter to select which measure to display (see the sketch after this list).
  3. Don’t calculate at all. MoM and YoY calculations are often needed because there is no other context given, so the audience doesn’t know if a value is good or bad. If you provide a line chart over the last 12 months, your audience can see if the line goes up or down and whether the current value is good or bad.
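
To make option 2 concrete, here is a minimal SQL sketch of the pivot idea, assuming a hypothetical sales table with one row per month and the three measures in separate columns (all names are made up):

    -- Hypothetical table: sales(report_month, profit, sales_amount, sales_volume)
    -- Unpivot the three measure columns into one measure/value pair, so any
    -- MoM or YoY logic downstream only has to be written once.
    SELECT report_month, 'Profit'       AS measure, profit       AS measure_value FROM sales
    UNION ALL
    SELECT report_month, 'Sales Amount' AS measure, sales_amount AS measure_value FROM sales
    UNION ALL
    SELECT report_month, 'Sales Volume' AS measure, sales_volume AS measure_value FROM sales;

With the data in this shape, one MoM and one YoY calculation cover all three measures, and a simple filter on measure selects what to display.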

2. Workflows

I regularly end up in a situation where I branch my flow and suddenly find myself applying the same filter or the same calculation in each of the branches. The solution is usually to “just” move those nodes to a position before the branch takes place, which eliminates the need to repeat them. However… this often requires a bit of refactoring. These processes evolve over time, and you quickly build dependencies where just moving a node might break other aspects of the flow.
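
The same pattern appears in SQL: if every branch of a query repeats the same filter, a common table expression can apply it once before the branches. A minimal sketch, with a hypothetical orders table:

    -- Apply the shared filter once, then let both branches read from it.
    WITH active_orders AS (
        SELECT * FROM orders WHERE status = 'active'   -- shared filter, defined once
    )
    SELECT 'by region'  AS branch, region  AS grouping_key, SUM(amount) AS total
    FROM active_orders GROUP BY region
    UNION ALL
    SELECT 'by product' AS branch, product AS grouping_key, SUM(amount) AS total
    FROM active_orders GROUP BY product;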

3. Databases

Do you keep creating datasets (tables, views, etc.) that are very similar to ones that already exist? Rather than copying the definition and creating yet another one, can you consolidate these views into one, more comprehensive view? Maybe one that has more columns than required for each individual use case? Or one that sits at the lowest level of detail rather than aggregating? With most modern tools, typical data volumes are not a big problem anymore. If none of that is possible, can you create one standard view that applies all the main transformations, and then use it as the baseline for customised views where you drop certain columns or apply additional changes?
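
In SQL terms, that could look like one baseline view carrying the shared transformations, with thin customised views on top. A sketch, with made-up names:

    -- The baseline view holds the shared transformation logic exactly once...
    CREATE VIEW sales_base AS
    SELECT order_id, region, product, amount,
           UPPER(TRIM(customer_name)) AS customer_name   -- cleanup defined once
    FROM raw_sales;

    -- ...and per-use-case views stay trivial: drop columns or filter, nothing more.
    CREATE VIEW sales_emea AS
    SELECT order_id, product, amount
    FROM sales_base
    WHERE region = 'EMEA';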

4. Access

Oftentimes duplication occurs as a solution to access requirements: group A should see all their data without getting access to group B’s data, and vice versa. The absolutely bulletproof way to do that is to “physically” (as far as that is possible in a digital system) separate them, i.e. different sites on a Tableau server, different workbooks within a site, different warehouses in Snowflake, different flows in Alteryx, etc. The more dynamic solution, which requires a little more upfront effort but is equally secure, is row level security: rather than copying a dashboard for two groups, only one dashboard connects to a data source that manages the access. As with all things security, there are a lot of considerations, and you definitely don’t want to “just quickly enable row level security”. Consider this an option to discuss with your Tableau and database admins to figure out what solution is appropriate for your circumstances.
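
To illustrate the row level security idea at the data layer, one common pattern is an entitlement table joined into a single secured view, so both groups share one dashboard and one data source. A simplified sketch with made-up table names – the real setup belongs in that conversation with your admins:

    -- user_entitlements(user_name, business_unit) maps each user to the
    -- business units they are allowed to see.
    CREATE VIEW sales_secured AS
    SELECT s.*
    FROM sales s
    JOIN user_entitlements e
      ON e.business_unit = s.business_unit
    WHERE e.user_name = CURRENT_USER;   -- each user only ever sees their own rows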

Possible solutions

Those were four specific examples of things I have done myself or have seen others do repeatedly. Your situation will likely be different, but when you find yourself recreating the same thing over and over again, there are a few general considerations that might give you an idea of where to look to reduce the redundancy.

1. Outside the box

Is there a completely different approach to solving this problem? As in the Tableau example, can you get rid of the calculations entirely and create a chart instead? Maybe the intended tool is just not the best one for the job? Does it need to be a dashboard, or could you just display a figure somewhere? Are you automating a manual process that shouldn’t be done this way anyway?

While this might be the easiest solution for you in terms of development, it might be much harder to convince your users to accept the change. It’s worth having this discussion, and stakeholders might be more inclined to accept it if you can explain how much more effort (i.e. money) it will cost to do it as originally intended.

The other aspect of this is that you will need to be very comfortable with your tools and options and think very creatively, as “outside the box” usually means “not immediately obvious”!

2. Specialised tool/process

Does your platform provide a native solution to the problem? Alteryx has multi-column calculations, so you can create a calculation once and apply it to multiple columns; there are also batch macros to perform a task repeatedly. Tableau allows in-tool pivoting, which often solves a lot of the duplication problems in calculations. Most database platforms let you define procedures that you can re-use in different situations.
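
As an illustration of the database option, a repeated transformation can be wrapped in a function and reused everywhere. A sketch in PostgreSQL-style syntax (the exact syntax varies by platform, and all names are made up):

    -- Define the cleanup rule once as a function...
    CREATE FUNCTION clean_code(raw_value TEXT) RETURNS TEXT AS $$
        SELECT UPPER(TRIM(raw_value));
    $$ LANGUAGE SQL;

    -- ...and reuse it wherever the same cleanup is needed.
    SELECT clean_code(product_code)  FROM orders;
    SELECT clean_code(supplier_code) FROM suppliers;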

These solutions might be more advanced, so this is not only an opportunity to develop something more efficiently but also a chance to upskill in the tool of your choice. It’s often worth spending some time researching the problem at hand; you might find an aspect of your tool that you weren’t yet aware of.

3. Build a template

Can you build something once and have it dynamically adjust to the audience? Dashboards can be designed once and a global filter can restrict the data for a specific business unit. Workflows and processes can be triggered with parameters that define how much data or which data should be used. The important aspect to keep in mind is that “template” does not refer to a generalised file that can be copied and adjusted. It refers to exactly ONE file that dynamically shows the required content.
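
On the database side, the same idea could be one parameterised function instead of one hard-coded view per business unit. A sketch in PostgreSQL-style syntax, again with made-up names:

    -- Exactly one definition, parameterised by business unit.
    CREATE FUNCTION sales_for(bu TEXT)
    RETURNS TABLE(order_id INT, amount NUMERIC) AS $$
        SELECT s.order_id, s.amount
        FROM sales s
        WHERE s.business_unit = bu;
    $$ LANGUAGE SQL;

    -- Every business unit runs the same template:
    SELECT * FROM sales_for('EMEA');
    SELECT * FROM sales_for('APAC');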

4. Fix it at the source

This one is often easier said than done! If you have to calculate the same ratios every single time you work with a data source, or reconstruct dates because they are stored in some exotic format, it’s usually best to fix it at the source. Create a custom view or a new data source in which you address the problems for everybody who wants to use the data.
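
For instance, if dates arrive as text in a format like '20240131' and the same margin ratio is recomputed in every report, a shared view can fix both once for everybody. A sketch with made-up names (string and date functions vary by platform):

    -- Repair the exotic date format and the recurring ratio at the source.
    CREATE VIEW sales_clean AS
    SELECT order_id,
           CAST(SUBSTRING(order_date_raw, 1, 4) || '-' ||
                SUBSTRING(order_date_raw, 5, 2) || '-' ||
                SUBSTRING(order_date_raw, 7, 2) AS DATE) AS order_date,
           profit / NULLIF(revenue, 0) AS margin   -- avoid division by zero
    FROM raw_sales;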

Especially in larger organisations, that can mean a lot of effort to get it onto the roadmap of the responsible team, not to mention documenting all the requirements and getting it into production. I typically try to at least get an idea of how feasible such a request would be. Do I know anybody in the data team who can give me a realistic estimate? Is the relationship with the team good enough that they understand the benefit of such a request? Or are they based on the other side of the planet, taking care of a lot of high-profile systems, so that my request will fall through the cracks?

Typically, I would raise such a request in the background but keep developing. That way I can meet any deadlines, but may be able to refactor my solution if the source data changes in time.

5. Live with it

Definitely my least favourite option, and one you should only pick if you can honestly look at yourself in the mirror and claim that you have tried everything else! In some cases there is just no way around duplication: business requirements may be so exotic, or you may deal with so many conflicting aspects, that the only way forward is to duplicate.

Summary

A guide like this can never anticipate all the situations you might find yourself in, but I hope it gives you a few things to think about the next time you develop something. Even if the only thing you keep in mind is that you shouldn’t do things more than twice, at least you now know of this article to refer back to.

In the end you will need to decide for yourself whether it is worth putting the effort into refactoring, or whether you just run with a solution that may be hard to maintain or fix in the future.

If you have any questions, something is unclear or you disagree with something I said here, get in touch and let me know! Special thanks to Steve Wood and Heather Cox who provided feedback and pointed out some errors and omissions!

