A classic annoyance of data analysis is the extra work required to properly record what you have done. Documentation lets others pick up where you left off, and it helps you revisit your own work and still understand it later. Despite these benefits, the tedium of documenting properly tends to result in little or no documentation at all. Consequently, the failure to document becomes technical debt: the value of your acquired insight is reduced when you have trouble explaining your results or recreating your findings.
What is technical debt?
Technical debt is a loosely defined term whose meaning depends on the context in which it is used. Essentially, it refers to starting a project and not truly finishing it.
The debt that arises from lack of documentation represents a hidden cost to analyzing data without taking the time to explain the process. This debt only materializes later, once the analytical objectives shift or the analysis must be performed again. Encountering a “What was I thinking?” moment requires you to retrace your steps.
Separate, yet equally troubling, is the widespread problem of not explaining the work at all. Perhaps it is an artifact of academia, or of intellectual hubris. Our goal at Megaputer is to re-envision the cost-benefit analysis that an analyst undertakes when deciding whether to document, making the costs smaller and the benefits larger.
Avoiding technical debt
One of the first steps is recognizing that data analysis is a process that must not only be created but also maintained over time. We think the answer lies in making analysis self-documenting. An analysis becomes self-documenting when no additional effort is required to generate its documentation. Simply performing the work produces the documentation. Personified, this is similar to living a life that is autobiographical. There is no need to hire a biographer towards the end, as the act of living itself memorializes life.
We designed PolyAnalyst with this principle in mind. Merely by using PolyAnalyst to generate analytical results, you record a journal of your work. This journal is more than a textual log, however. Analysis is done by performing individual analytical steps that you compose together to form an analytical step sequence. A step might be something as simple as moving data from one database to another, rearranging the columns and rows of a table, or generating a machine learning model. This overall sequence (or set of sequences) comprises your analysis. The act of adding a step to the sequence is concurrently the act of curating your record of the steps you have done. In naming the step appropriately, you describe what the step does. Moreover, the overhead of naming and sequencing steps is small, imposing very little, if any, extra burden on you. Indirectly, you are deciding how to explain your work to others upfront, at the moment you decide to undertake a particular analytical step.
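To make the idea concrete, here is a minimal sketch in Python of a step sequence in which adding a named step also writes the record of that step. It is an illustration of the principle only, not PolyAnalyst's interface: the names `Step`, `StepSequence`, `add`, `run`, and the sample data are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List


@dataclass
class Step:
    name: str                   # human-readable name; doubles as documentation
    func: Callable[[Any], Any]  # the transformation this step applies


@dataclass
class StepSequence:
    """Hypothetical sequence of named steps that journals itself as it is built."""
    steps: List[Step] = field(default_factory=list)
    journal: List[str] = field(default_factory=list)

    def add(self, name: str, func: Callable[[Any], Any]) -> "StepSequence":
        # Adding a step and recording it are the same act.
        self.steps.append(Step(name, func))
        self.journal.append(f"{len(self.steps)}. {name}")
        return self

    def run(self, data: Any) -> Any:
        # Apply each step in order, feeding its output to the next step.
        for step in self.steps:
            data = step.func(data)
        return data


def sum_by_region(rows):
    # Total the revenue for each region.
    totals = {}
    for row in rows:
        totals[row["region"]] = totals.get(row["region"], 0) + row["revenue"]
    return totals


# Made-up sample data, for illustration only.
rows = [
    {"region": "east", "revenue": 10},
    {"region": "west", "revenue": None},
    {"region": "east", "revenue": 20},
    {"region": "west", "revenue": 5},
]

analysis = (
    StepSequence()
    .add("Drop rows with missing revenue",
         lambda rs: [r for r in rs if r["revenue"] is not None])
    .add("Sum revenue by region", sum_by_region)
)

result = analysis.run(rows)
print(result)
print("\n".join(analysis.journal))
```

The journal is never written separately. It accumulates as a side effect of building the sequence, so the only documentation effort is choosing a descriptive name for each step.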
By baking the requirements of documentation into the approach, your work becomes naturally self-documenting. We think this is a vital feature in the design of an analytical product. Analytical results do not speak for themselves; they must be accompanied by explanation, and we have designed PolyAnalyst to simplify the task of generating such explanation. When the insights you have generated are easily explainable and readily modifiable, you have designed with the future in mind, leaving you better equipped to tackle your future analytical tasks, or to revisit an old one and pick up right where you left off.