I wrote about design REPL previously. Here I’m showing an example how it works, on a real (and rather large) project I did.
This post shows, that:
- I have been using design REPL even before I coined the term
- there’s LOTS of work prior to writing first line of code or even choosing the technology, frameworks and the like
- good design helps tremendously (and how) when you want to deliver
It also talks about Golden Master testing technique, which is one of the best I know when working with untamed code.
Input: monolith system with number of reporting features scattered throughout. Some of those happen to kill the system.
Turn it off, you say? Wise!
Unfortunately, you can’t turn it off. It’s used, it’s wanted, it’s generating revenue – well, if it’s not busy killing off entire system, that is.
Problem: cut it away safely, but quickly. Avoid more performance crashes. Keep it working. Bonus points if you rewrite the code to newer language.
Bonus requirement was something I wanted as well. I was tired of constantly hearing how we’ll rewrite anything to Java and did NOT fancy to become one of those who say they will rewrite… but never actually do.
#1: requirements and priorities
Few things to do here. First, word meanings:
- safely means “no functionality loss, no report data changes, no report result changes, users are NOT to notice (unless by better performance)”,
- quickly means “show results in 4 weeks”,
- avoid more performance crashes really means “no more DB down due to reporting”,
- keep it working is most straightforward – “you cannot shut it down”.
- Newer language is techies preference: instead of Coldfusion, we want Java.
Now, some questions leading to quantification and hard data:
- what exactly brings down DB?
- what can I do to stop it? what can my project do?
- how many reporting features are there?
- how many reports are we talking about?
- how many people will be working on it? for how long?
- where’s the code for reporting currently? what it does?
- no functionality loss: all reports need to be available as they are, all data as well, all related capabilities too (clicking the reports out, scheduling them, etc.).
- users are NOT to notice (unless by better performance): UI stays exactly as it is (hmmm, it’s Coldfusion?)
- also, “no report result changes”, you know that for better performance we need to change the queries? Do it in such a way that result stays the same
- 4 weeks just by myself – cannot happen. You will get a team.
- DBAs help needed and data access. Promised and granted. But there are only 2 DBAs and they are needed for operational matters.
- how many people do I get? One Scrum team can be spared, what are you planning, what do you need?
- there are 3 large reporting modules in our monolith, but only one is said to kill the DB (for now).
- Is “quickly” more important than “safely” and “avoid crashes“? No, avoid crashes is most important. Rest – up to you. Can we have something next week, BTW?
- DB downs are caused by computational-intensive queries: what about solutions on that front? Perhaps limits, indexation, DB redesign… In short only two DBAs busy with day-to-day operations mean slow progress on those fronts. Here is a list of worst 50 queries.
- Do worst queries change depending on time of day or week or users? Hmm, nice point, they do. Still, here are three faulty-reporting-related queries that consistently appear high on the list, no matter the time, the day and the user.
- Will I have reports-only DB?
Keep it working:
- I can’t shut it down? No, you can’t. It’s used. It works. It brings clients. Revenue. No way.
- And if it continues crashing the DB? Still no.
- Really? Isn’t avoid crashes our priority? Not even lessen the volume? OK, you may lessen the volume.
- Impose limits on users? No! Not… OK, perhaps. Bring us data first. Please?
Bonus – move to Java:
- There. Is. A. Lot. Of. Code. 😦
- This guy who wrote it no longer works here. Wait! I talked with him before he left! Where are my notes?!
- There are actually TWO reporting engines. Both seem used.
- It’s OK if Coldfusion stays for UI.
Priorities and meaning:
- avoid crashes == “no more DB down due to reporting”,
- keep it working == “you cannot shut it down, but you can lessen the volume and perhaps even impose limits on usage”,
- safely == “no functionality loss, no report data changes, no report result changes, users are NOT to notice (unless by better performance)”,
- quickly == “show plan in 1 week, first results in 4, though if you have good plans and offer quick remedies we’ll get off your back”,
- Newer language == Coldfusion for UI, Java otherwise.
Needs: plans, project design, more intel, more time.
Problem is obviously not yet solved so we go on. So I need:
- more time with the code (2 reporting engines?!)
- more time with data as DBAs are swamped (quantification, problematic queries)
- pass on to as urgent those three problematic queries to be fixed ASAP as quick remedy.
- 2-hour session with DBAs (no sooner than hell will freeze will I get it)
- track down who worked on that faulty reporting module, is still with the company and what can (s)he tell me about it
#2: problem analysis, finding options
Almost all points from loop directly transfer here. More time is spent, DBAs are pestered regularly with direct and well-researched questions since session is given a green light and urged but will not happen (operational priorities or constant rescue missions), more data comes to light. After few days of reading and copied-DB-tables querying…
- There are “old” and “new” reports. Apparently, after half a year of usage reporting module became so popular we needed to “categorize” reports. So, old engine runs all reports without category, new one adds category. This significantly changes the way queries are built. Hell. Old docs mention that old reports were to be deprecated… but seems that never got time. ARGH.
- Three offending queries are fixed. Fixes will be rolled with new patch. Kinda slow? Ops, DBAs and I don’t have time to press. It does give ample testing time and meets all safety margins. (And it’s nowhere near enough.)
- Data investigation is fruitful. Damn so.
- Reporting users fall into categories: heavy, moderate and light. Or – by impact – murderous, expected/tolerable, non-noticeable.
- Murderous users have both volume and data to kill us.
- Fortunately they are NOT high in numbers. We probably could introduce some payments for such heavy usage and gain from it. Such BI usually doesn’t come cheap. We can make it cheap and more performant. We will have to make it more performant anyway…
- Old reports are few in numbers, but still used.
- There are high numbers of reports that are scheduled monthly, weekly or even daily… and are never opened.
- One third of those belong to users who have not logged in more than a year now.
- Code is much clearer now, though there are some dark spots left, but general mechanisms and it’s components are clear.
Pretty report is compiled with obligatory graphs, showing never opened reports, users categories, data volumes etc. Takes a day, almost two, to transfer all the findings into visuals. Numbers of reports ran, used, scheduled, etc. per day, week, etc. are also compiled.
Hot-fixes are announced and their less-than-expected results as well.
Requests are made:
- legal to see what obligations we have to keep when it comes to report keeping
- especially for heavy users
- if we can ask for payments if reporting is used ABOVE some (any?) level
- product /strategy team if we’re OK with being used as cheap BI
- Ops / DBAs for reporting-only DB parameters
- customer support to (if ok) alert users that never-opened reports will stop being generated and their schedules will be deactivated
- customer support / account managers to ask if we can shut down “old” reports (and if so, on what terms and when)?
- account managers to shut down inactive accounts which generate reports
- Do we have 100% certainty that “never opened” really is what we think it is?
- Shutting down old reports.
- Why hot-fixes improved so little?
- Crashes almost happen: DBAs go query-hunting thrice a day.
Enough data is gathered to start planning what to do and formulate theories to solve the problem (cut it safely, quickly, while avoiding crashes, keeping it working and without functionality loss, while – for bonus points – moving it to Java). Next goal is to offer a plan (and within a day). Asked DBAs if my queries and findings made sense and for more prod data to be copied.
#3: checking out options
- code that deals with “never opened” report, to make absolutely certain prior to contacting users
- UI / engine code, to see how to split them
- responses to requests:
- legal: will take time, three week at least for charging for reporting usage if volume extends… No heavy user has special provisions in their contracts with regards to reporting though.
- accounts managers will begin accounts closure after checking in with finances (if they pay, no closing)
- customer support OKed broadcasting needed messages / asking questions about old reports.
- production data and explain plans for same queries but for various users
- worse queries lists (generated at different moments / days)
- hunted queries (from DBAs manual interventions)
Asked a colleague to independently verify my findings with “never opened”, I found it truthful for all but few conditions, which, when excluded, did not affect overall numbers much (less than 10% IIRC).
No official responses from DBAs and ops about new DB server (or even it’s parameters that they required), coupled with overall work volume and times they had on this, made it clear that all remedies on that side should not be counted upon as quick.
Split was fairly easy if old reporting engine could be removed first. Not so much if it couldn’t (since basically every step of report building had if “old engine” then… and the Coldfusion code dealing with it wasn’t so neatly packed and separated).
Closing the accounts turned to be quite a relief. So was switching off unused reports (though it lasted longer and required more sessions with customer support so all questions incoming were answered). Due to both efforts DB suddenly breathed a sigh of relief! Some even wanted to herald project end. Wise person warned me about it in advance so I planted warnings ahead of time. Project continued.
Evaluating data and the queries showed weaknesses in our data model (which was build more than a decade ago). And the need for DBArchitect, not just DBAdministrator. Oh, our admins could be architects, at least some of them could, but… none had the time. 😛 Nevertheless, few points carried on to print phase.
Our data model weaknesses were large enough that we needed to spread awareness of it and fast. We decided to find a suitable person to handle DB as an architect and started to look closer on queries devs put together. I quickly found few senior devs willing to invest the time and we started having weekly SQL reviews with data pulled out of Git logs about queries folks made. Additionally, SQL got it’s own standards page and I asked code reviewers to make sure everybody adhered. This caused others to add their own best practices and Wiki section on SQL promptly grew. I made sure DBAs actually added their own and revised others.
Two things caused my mood to soar:
- Colleague’s verification turned out same results as mine did.
- I found seams in UI/engine code and with customer support verification that old reports can be archived as they are unused, I could easily complete designing the solution.
And finally, I was able to propose a design. It was iterative and each phase gave us concrete benefit and safety (due to Golden Master technique).
I went to all folks I trusted and involved and talked with them about that design.
I scheduled a call with the promised team so they too could have input here.
Basically… I kept asking for feedback, pointers and the like.
- Iteration #1, reqs and priorities is a must. I can’t stress this enough: you actually learn what you should set out to do.
- Iteration #2, options. Happens automatically when you deep-dive the problem.
- Iteration #3, setting the course. Here is where decisions happen. #1 and #2 serve as input. Eval phase here is truly crucial.
In the end
My plan, the end result and observations from whole thing. But first…
In music business, first record is referred to as “golden master” and serves as a source for all the others. In testing, you capture your program’s output with number of input sets. Then you compare, that after your changes, with same input sets, you get same outputs. Fairly simple and quite powerful. And above most: perfect match for reporting, be it query building or execution. This technique is necessary during usual legacy code retreat, and thus you may know if from SCKRK‘s code retreats.
In our case it went as shown below:
We had two such tests that ran on all reports, via Jenkins and were part of whole project build pipeline. Additionally they were launched especially when we wanted to clear the phase of a plan.
- Teach the team what I learned.
- Set up testing infrastructure (Jenkins) for them to use.
- Set up separate DB schema for all our changes.
- Get us Tomcat server in our dev env and start making provisions for prod.
- Golden Master (hence: GM) all the reports and queries.
Phase 1: rewrite.
- Remove “old engine”.
- GM building queries.
- Rewrite “new engine” to Java (Spring, query building DSL, on Tomcat).
- GM both building queries and their results. (this ended each phase, actually)
Phase 2: REST API
We wanted old GUI to talk with new service via REST API. First we had in mind REST per report, but in the end it was REST for debugging and REST for running/scheduling the report, both parametrized. This was team’s input and I agreed as it was simpler and clearer.
Phase 3: secure it.
Meant to prevent misusing various aspects of both old and new systems from revealing reports meant for someone else. Can’t say much about it, but in the end we changed REST API somewhat.
Phase 4: performance
This was to go query by query and improve performance of it. However it was put off by other efforts and the fact that relief’s been large enough that nobody complained anymore. In that light, spending Scrum team’s time on improving the queries was no longer seen as viable.
Still, some improvements were made and we consistently could identify worst offenders now on our own, without DBAs.
The end result and observations
End result was a simple Java app running on Tomcat, building queries using subset of our DB tables and a special user with STRICTLY limited privileges. There were 2 REST services, query building logic (the engine), and DB access. Once the coding started, I was actually glad, because the most turbulent stage (talks, data gathering and all that which is part of design) ended.
Avoid crashes was actually achieved from the start by manual query hunting done by our DBAs. Then by relief projects with outdated reports and adding a feature to turn off report generation if it’s unopened. Almost all done on DB level.
Account managers and customer support were crucial here in obtaining permissions from client as this broke the safety criteria (functionality change, reports turned off / archived).
Worst offenders turned out to be less problematic even prior to being reworked. Volume of switched off reports helped greatly. Then reworking most awful queries also helped, but the fact that DBAs couldn’t review them for weeks because of other work speaks for itself.
Actual rewrite helped in one part: killing the reporting engine was now an option that didn’t mean killing the entire platform. Not that it was needed that much now. New code was simpler, faster, had unit tests and CI jobs with Golden Masters. Most importantly: it was self-contained. Any work, change etc. was much easier and talking between reporting and monolith now happened via RESTs called from old UI CFML code.
Safety – that is no functionality loss – turned out quite possible because of GM. In the end we did lessen the volume though. And we made it possible to kill the reporting. Which was said never to be done, but ended up being used up in few critical cases months later. Overall… we had only few bugs. Less than 10. And those mostly happened because of badly applied user grants for reporting user.
Quickness was achieved – more than anything else – thanks to cooperation of many people. Those who did the query fixes, those who contacted customers asking them about switching off reports that are unused, about old reports, all those people involved from customer support and account managers they really helped here and did so at the drop of the hat. Rest was fairly straightforward rewrite and nice set up of infrastructure (separate schema where only we changed things, Jenkins jobs that we could run whenever etc.). Of course, relative greenfield-ness of the project helped here as well.
Reports-only DB was kicked off. But seasons’d changed (more than once) before we saw something at this front.
Insight into how people use our reporting module turned out pretty interesting… and was in the end ignored.
We never heard back from legal. 🙂 Perhaps we should be worried? But few weeks later they asked us not to use open-source with viral licenses despite us being SAAS. We asked them for clarifications, but they were busy again.
Work remained on DB front: DB data model redesign, better indexation, DB Architect and all that. That was put off and came back to bite us later.
Overall, project was quite the success. DB downs stopped happening, number of queries disappeared from “worst offenders” lists, SQL reviews kept going for a few months or so, until other projects pulled folks somewhere else. Still, tickets and corrections were done thanks to them and the docs stayed, so had the guidelines. I later noticed former reviewers (and not only) linking them here and there, sometimes during code reviews. We also rewrote part of the system to Java, stood our ground that “yes, Tomcat is QUITE enough for this, no JBoss is required” and fought quite a battle to have new DB user have only strictly necessary privileges. It paid off many times later, which – had we not had to fight so hard for it – perhaps would’ve gone unnoticed (which is a topic for quite an interesting post in itself, about how we miss the good things that happen due to what we’ve done). Some time later there was a serious performance problem and reporting could have been investigated quickly and throttled similarly fast (both under a day) which was quite something compared to my full week of poring through the code, when both old and new engines were supported.
When I coined design REPL this project was something I was going to post as “large example”. Soon however, I realized that it deserves it’s own post and here we are.
Thanks to asking several questions prior to coding, thanks to spending time on evaluating data and (even more so) persistently asking for it, I avoided number of surprises, like the fact that DBAs – despite considering the project essential and vital to their daily job – would sometimes be completely unavailable. Heck, even they found such black scenario unlikely. They had foreseen that full-time support is not possible, but to be entirely out of reach was not considered.
Gathering the options allowed for simultaneous work done here: closing accounts long unopened, archiving unopened reports and turning their ongoing generation off, asking customers to agree to archive old reports and disabling their generation, setting up separate DB schema for the rewrite etc. All of that was done before we finished second stage of designed solution.
So, next time you are to embark on a large project in a large system, think of design REPL. I’ll gladly hear any feedback you have.