War Stories, Episode 002 – Speculative Solutions

Why am I using the term “Scar Wars” to depict the war stories?  Well, it has to do with the habit of techies to try and outdo each other with their war stories; exchanging war stories in a one upmanship manner.  Each participant tells their most painful/shocking/hilarious war story.  As you might imagine, there may be some suspense or even exaggeration in the story, but hopefully a lot fun and even serious take home lessons.



Speculative Solutions – What is he talking about?

Technical problem solving often involves pursuing speculative solutions to see if they work.  I define speculative solutions as a solution that is not strongly backed by a metrics/evidence-based root cause analysis.  For example, you have some strange fluke with your laptop, so you shut it down and restart it to see if that works; not always a bad approach because it does not take too long, but can be problematic if you are in the middle of something important. 

Speculative Solutions – Can Be Problematic

I can’t tell you how many Oracle performance troubleshooting sessions I’ve been in where numerous speculative solutions are thrown out there to see which ones “stick”.  Don’t get me wrong, I’m not adverse to discussing a range of possible root causes/solution, but not to the extent where an overly assertive personality will get their “solution” prioritized even if it is not strongly evidenced based.  I suppose I should say, the context of my distaste for speculative solutions is where one doesn’t really have a handle on the root cause of the performance problem and the solution would be:

1.      production impacting,

2.      require a lengthy or high resource consuming fix/test cycle, and

3.      where the application team can’t afford to waste a fix test cycle on a speculative solution.

Example Speculative Solution - The ol’ “this parameter will fix it”.

In a discussion on Oracle performance (back in the 9i days) I overheard someone mention how the parameters optimizer_index_cost_adj and optimizer_index_caching helped tune an OLTP database.  So I did a little internet research on these and found that these parameters are intended to influence the cost-based optimizer:  a setting of 10 for optimizer_index_cost_adj was touted as a “silver bullet” to provide huge performance gains because it would tune all your SQL to favor indexed scans over full table scans.  A value of 90 for optimizer_index_caching was similarly touted as a “silver bullet” because it would inform the optimizer how much of an index would likely be cached.   The writeups on these parameters seemed very scientific, in that you could verify a setting with a query, etc.  So, me being very keen to use this newfound “knowledge” and because the Oracle defaults were “very authoritatively” said to be “improperly set”, I searched for a problem to apply my ready-made solution.  So along comes an OLTP database application having performance problems [who doesn’t want a “silver bullet” solution?].

Speculative Solution – Applied in Production … and Quickly Rolled Back.

So, I happened to be on hand when the operations DBA (a DB restart was required) applied the solution in production, as was someone from the app team to test it out.  Well, the long and short of it was the speculative solution brought the system to its knees as recognized by the app team tester; I quickly deduced that this needed to be rolled back; contacted the operations DBA and had them reverse it.

Touted as a Hero 

Now this is the strange part.  The next day, I get a nice e-mail from the VP of R&D IT for having saved the day and rescued the system, and, I was to receive a bronze service award for my efforts.  I have no idea how that understanding of the situation was conveyed to the VP, but I kept my mouth shut and took the extra $250 in my paycheck.

Some Lessons

-        Thoroughly test in a pre-prod environment of similar setup.  In this case it was tested in pre-prod, but the differences in data and workload masked the problem.

-        I won’t name the web site where I got this advice from, but needless to say, from then on, I took advice from there with extreme caution.

-        Have a good relationship with your operations DBA and app team tester.  These two are the ones who really saved the day.  The tester identified the problem, and the DBA implemented the rollback very quickly.

-        I was willing to take a ding on that one but couldn’t see the point of shattering the VP’s idealistic vision of my expertise.

-        Put this one in your bag of tricks.  I can’t recall ever again recommending to change these parameter settings.  That is, tread lightly with any parameter which will change the optimization en masse  on many of the queries at once [e.g. cursor_sharing = force].



 

Appeal for More War Stories

That’s it for now.  Again, I’m starting to get volunteers to join me in sharing their war stories, so tune up your war stories.  The idea of a doing the war stories in a podcast format is starting to take shape.  My vision here is to make a place where DBA’s, developers, Oracle “grey beards” and other battle-scarred veteran (and new) techies get together and tell each other their war stories, banter around a bit, and have some fun.  Please contact me if you are interested in having your war story included.  Of course, “the stories you are about to hear are true, only the names have been changed to protect the innocent” [some of you old timers will recall where this quote is from].

 

War Stories, Episode 003 – A Single SQL Statement Crashes a DB Host Machine

War Stories, Episode 001 - Markitecture vs Architecture