Wednesday, November 5, 2008

A gender dropdown issue!

It was a festive holiday for my India software engineering team on account of Deepawali. My phone rang at 7:00 in the morning on 27th Oct 2008, the day before Deepawali. The call was from one of the senior managers in the operations team, informing me of a showstopper issue in the application. He said, “Hi, the gender dropdown in the patient demographics screen does not let users select any option, but this field is mandatory for registration batch completion. I have also sent you a separate email about this issue.”

At first, I thought it was the usability issue with the gender dropdown field in the patient demographics screen that Jim, a product manager, had reported a few days back and then found working the next day, so I went back to sleep. I woke up suddenly, comprehending his words “... mandatory for registration batch completion”, and rushed to my BlackBerry to learn the details of the issue. Even in the email it looked like an ordinary functional issue, but I was anxious about it because no functional showstopper was expected: no code changes had been promoted to the live environment in the last few days, as they had been put on hold because of the imminent Deepawali holidays.

In the beginning, I assumed it might be an intermittent performance issue caused either by network bandwidth or by slackness on the service provider's side. I checked with my IT manager and learnt that network connectivity was operating smoothly. Meanwhile, the same manager followed up on his email with another issue in the application. I was now sure of some big mess. On further verification, both issues turned out to be reproducible every time, without fail. I sensed that something had gone badly wrong and called my technical analyst to verify and fix these issues. I wanted to get rid of the problem ASAP because I had not slept the entire night, thanks to the previous night's card party. We Hindus play cards during the Deepawali festival. I don't know the rituals, but I enjoy it.

A few hours went by without any resolution or any clue to the root cause. We did everything we could in technical review to find a difference between the UAT and production code bases. I also called up the tech architect and the release & configuration manager to drill down into the issue, but for the next 4 hours there was still no clue. The tech architect was unable to connect to the production environment and so rushed to the office for better connectivity.

Everyone working on the issue was brought into a Yahoo conference, but surprisingly there seemed to be a bug in Yahoo IM: every new invitation started a new conference bridge.

I dabbled in everything from code to environment and from configuration to manual debugging possibilities in the system. On detailed analysis of our production issue, I conceived the possibility that a Microsoft-recommended patch installed in the production environment over the last weekend might have caused the erratic behavior of the system. Assuming that IIS had not been reset after the MS patch updates on the web server, we reset it as well, but that did not help. There was no choice other than to contact the datacenter and get all the MS patches uninstalled, so I contacted the head of the datacenter in Los Angeles, CA, and explained the issues to him. Initially he was a little reluctant to accept the potential root cause I had identified, but later he had no choice but to trust my words.

Since I had not slept for the last couple of nights, I was very sleepy. I was lying on my bed beside my laptop for a catnap to get some energy back. Looking at my posture, my better half said (sarcastically), “Hey, it's been ages (days) since you cuddled me the way you are holding your laptop today. Is everything all right?” I said (with a heavy sigh!), “I'm walking a tightrope! Let's talk later.”

The process of uninstalling the MS patches and restarting the servers began, but the issue still persisted in the system. Meanwhile, Robert Wong (aka Rob), the datacenter head, learnt of issues induced by the MS patch in other clients' systems as well, so he was convinced that uninstalling the patches would help kill the issues. The patch was removed from all the affected servers, yet the issue kept showing up, because there is always a chance that updates are not completely removed from every corner of the OS. I was running short of time: in the next couple of hours my US clients would start their business day, and if the system still had issues, it was going to be a tough time for engineering. I asked Rob to restore the OS image backed up before the patch updates of the last weekend. He asked for 1 hour (30 min to reach the datacenter and another 30 min for the restore) because he needed to go to the datacenter in person for this. I knew it was an exigent situation and an arduous winter, so everything depended on Rob reaching the datacenter and restoring the OS image within the next hour, leaving me enough bandwidth for a sniff test in the restored environment before US business hours began.

I gave him the green signal despite his disclaimer that he would not be available on IM or by phone and would be reachable only by text, through emails or SMSs. It was already 1:15 pm, and in the next 5 hours my East Coast clients were expected to log in to the system.

The first email from Rob came at 1:19 PM with the subject line “MDSweb patch causing problems”, stating “Enroute to data center for roll back”

Second at 1:47 PM read “I've arrived at data center”

Third one at 1:51 PM stated “The copy process is only about 30percent done”

I kept cogitating until my brain started agitating, because the estimated time had already been exceeded and there was no word from Rob about the situation.


Fourth at 3:24 PM read “Restore is finished but having some boot problems will have to run it again”

Fifth at 3:53 PM read "It's not looking good. I have blue screen I'm doing everything I can do bring system back up"

I was panic-stricken, but despite the overwhelming odds I held my nerve, because this message could have panicked my team, who had been waiting for a final status after struggling through the holiday since morning.


Sixth at 4:16 PM read “It's not the backup the problem its disk partition during restore”

Seventh at 4:54 PM read “Ok! Problem was corruption in “C” drive only backup Confirmed with Zabi the full c and data drive backup image is before patches and restoring from that now About 15 min to go. Also we had a while back prepared a fully working VM of MDSweb for disaster If restore does not work we can fire up that VM and restore most recent data drive and that will be fallback plan Let's hope we don't need to go there Second note. MDSweb is running on older server that no longer has warranty I suggest as diligent exercise foe next week we turn on the backup and let your team qc it”

Eighth at 5:09 PM read “Restore completed and system is booting U cam login and check in few minutes”

It felt like conquering the world. I asked the team to verify the system in production, and it was all OK.

Ninth at 5:23 PM read “I'm headed home now. Please send email once you verified system”

I started my day on US hours, doing my whole morning routine in the India evening :)
