War Stories, Episode 003 – A Single SQL Statement Crashes a DB Host Machine


I recall vividly the day I crashed the DB host machine twice with a single SQL statement.  This war story goes way back to the early 1990’s: clustered Oracle, version 5, running on VAX/VMS.  We called it “clustered Oracle,” and I believe it was the precursor to RAC; at least, I don’t recall the term RAC being used at the time.  [If anyone knows the back story on clustered Oracle or RAC, I’d be glad to hear it.]



The Project

My first really big project was a large clinical trials application and adverse events reporting system, roughly a 120 person-year effort spread over 5 years. Back in the late 80’s and early 90’s there were no commercially available solutions for clinical trials data management and adverse events reporting, so we built our own [Side-Bar Note: Ultimately, this system was in production for 10 years and handled about 70 medical development programs worldwide; we were able to do this by making the application metadata driven and persisting the clinical results in a key-value-pair structure].  I was responsible for the design and development of several key subsystems and reported to Jim J. (who was a lot of fun to work for - we were on a username basis; I called him jrj4272 and he called me rdc0208).  Incidentally, Jim was a good ol’ southern boy who went to Woodstock in 1969 with some of his buddies just to see what all the hippies were up to; I was in 4th grade at the time, so I couldn’t join him.
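[Side-Bar Note: For readers wondering what “metadata driven with a key-value-pair structure” means in practice, here is a minimal sketch of the general idea only; the table and column names are hypothetical, not the original schema.  Study questions are rows in a metadata table, and each collected result is one row keyed by subject, visit, and question, so adding a new question never requires a schema change:

CREATE TABLE question_metadata (
  question_id    NUMBER PRIMARY KEY,
  question_text  VARCHAR2(200),
  data_type      VARCHAR2(20)        -- e.g. 'NUMBER', 'TEXT', 'DATE'
);

CREATE TABLE clinical_result (
  subject_id     NUMBER,
  visit_id       NUMBER,
  question_id    NUMBER REFERENCES question_metadata,
  result_value   VARCHAR2(200),      -- the "value" of the key-value pair
  PRIMARY KEY (subject_id, visit_id, question_id)
);

-- Adding a new question to a study is a metadata INSERT, not an ALTER TABLE:
INSERT INTO question_metadata VALUES (42, 'Systolic BP (mmHg)', 'NUMBER');
]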

The First Crash

The system included a large (at the time) compound table (~5000 rows) containing the data for many of the company’s compounds that were being developed into drugs.  I needed to refresh the data in the table on our development database, so my plan was to delete the data and reload it.  You may ask, “Why didn’t you use TRUNCATE?”  It was either ignorance on my part, or TRUNCATE was not available in V5 of Oracle [I like to think that TRUNCATE was not available at the time].  Our system was multi-user, so I figured I’d be nice and lock the table before deleting.  Here is what I did:

LOCK TABLE compound IN SHARE UPDATE MODE NOWAIT;
DELETE FROM compound;

CRASH!  As soon as I hit “semicolon/return” on the last SQL statement, the VAX crashed.  Have you ever touched a doorknob at the same moment a bolt of lightning struck?  For a split second you think you caused the lightning.  Hmm, could it be that I crashed the VAX with that SQL statement … no, it couldn’t be possible.  Anyway, I chalked it up to an interesting coincidence and didn’t give it a second thought.  Since it was time for lunch and no one could work anyway, I headed off to lunch with the usual crew.
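[Side-Bar Note: On a modern Oracle release this refresh would be a non-event.  A minimal sketch of the contrast, assuming the same compound table:

-- DELETE is DML: it generates undo and takes a lock for every row it removes.
DELETE FROM compound;

-- TRUNCATE (not available, or at least not known to me, back then) is DDL:
-- it deallocates the table's storage with only a brief table-level lock,
-- with no per-row locks and no undo for the individual rows.
TRUNCATE TABLE compound;
]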

The Second Crash

Back from lunch, I sat down at my terminal and could see that the VAX was back up.  So where was I?  Oh, I remember …

LOCK TABLE compound IN SHARE UPDATE MODE NOWAIT;
DELETE FROM compound
/* momentary pause;  thinking to myself, “the last time I did this the VAX crashed … ”;  thinking again, “… surely it’s a very low probability that this killed the VAX …”;
“ … type the semicolon and hit return … ” */
;

Semicolon, return, CRASH! Whoops! Did I do that? Perhaps I did, so I sat in my cube in a low crouching posture, hoping it was all just a coincidence that the VAX crashed again on a single SQL statement -- until I got a certain visitor.

The Call from the Systems Manager

My boss, Jim J., popped hurriedly into my cube to let me know that he had just come off a 20-minute chew-out call from an unhappy Systems Manager (let’s call him Joe R.) who wanted to know why rdc0208 had more than 5000 enqueue entries on the VAX, which caused the crash.  Jim’s reply to Joe was that “rdc0208 was just doing his job.”

The Problem for the Business

Everyone in the US R&D center (~2000 people) was using this machine for both production and development, so when it crashed, it went down for the whole company.  Thankfully, only a machine/DB restart and a roll-forward recovery were needed.

The Blame Game

Oracle Corporation blamed the operating system, and Digital Equipment Corporation (the operating system vendor) blamed the Oracle software.  Evidently Oracle was doing row-level locking for the delete, and VMS was using an enqueue entry for each row locked (VMS had a limit of 512 enqueue entries).  Oracle was asking, “Why are you using enqueue entries for this?” and Digital was asking, “Why are you doing row-level locking?”  There was no short-term resolution to the problem.
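[Side-Bar Note: In hindsight, the workaround would have been to keep the number of locks held at any one time bounded, by deleting in small batches and committing between them.  A minimal sketch in PL/SQL (which arrived in later Oracle releases, so it was not an option for us on V5):

BEGIN
  LOOP
    DELETE FROM compound WHERE ROWNUM <= 100;  -- hold at most 100 row locks
    EXIT WHEN SQL%ROWCOUNT = 0;                -- nothing left to delete
    COMMIT;                                    -- release the locks each pass
  END LOOP;
  COMMIT;
END;
/
]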

 

Lessons:

  • Do not run production applications and development on the same box.  From that point forward, PROD and DEV were on separate boxes.

  • Since there was no short-term resolution, the company I was working for gave up on clustered Oracle (for the time being).

  • It is nice to have a boss who will back you up rather than throw you under the bus.  Jim J. could have left me hanging out to dry on that one, but his statement that I was just doing my job felt like it saved my job.

  • Chalk it up to experience and put that one in your bag of tricks.



 

Appeal for More War Stories

That’s it for now.  Once again, volunteers are starting to join me in sharing their war stories, so tune up your own.  The idea of doing the war stories in a podcast format is starting to take shape.  My vision is a place where DBAs, developers, Oracle “grey beards,” and other battle-scarred veterans (and new techies) get together, tell each other their war stories, banter around a bit, and have some fun.  Please contact me if you are interested in having your war story included.  Of course, “the stories you are about to hear are true; only the names have been changed to protect the innocent [guilty]” [some of you old timers will recall where this quote is from].

 
