It was international hump day, 12th Nov 2008, when the operations team raised an issue about an application outage caused by erratic internet connectivity. On pinging various websites, we found the network was losing a lot of packets, especially on the MDS application connection, while a few other websites were working fine. Packet loss to sites like Yahoo, Google and call-mustang was only intermittent. We kept running tracert until the team realized that Yahoo and Google were getting stable, but the call-mustang tracert behaved the same as the MDS site. The MDS site was completely down for India users, yet the application was up and running fine for the US. It was an extremely tricky situation to analyze, because the application servers had no issues and the application was smoothly accessible to our US team. The India local service provider also looked fine: other internet sites were working, and the application showed the same behaviour over another internet service as well (verified through a data card connection).
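Looking back, the by-hand ping comparison could have been scripted. Below is a rough sketch of that idea (the host names, including mdsapp.example.com, are placeholders for the sites we were actually probing, not real URLs); it simply shells out to the OS ping tool and reports the packet-loss percentage per site:

    import platform
    import re
    import subprocess

    # Placeholder host list; mdsapp.example.com stands in for the real MDS URL.
    HOSTS = ["www.yahoo.com", "www.google.com", "mdsapp.example.com"]

    def packet_loss(host, count=20):
        """Run the OS ping tool and pull the packet-loss percentage from its summary."""
        flag = "-n" if platform.system() == "Windows" else "-c"
        out = subprocess.run(["ping", flag, str(count), host],
                             capture_output=True, text=True).stdout
        match = re.search(r"(\d+)%\s*(?:packet\s*)?loss", out)
        return match.group(1) + "% loss" if match else "could not parse ping output"

    for host in HOSTS:
        print(host.ljust(25), packet_loss(host))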
I had been coordinating with all the stakeholders: the data center, the service providers, ops users, the engineering team and the internal business team. On further analysis it was unanimously agreed that the Level 3 connection had a problem routing requests back and forth. The data center team started coordinating with Level 3 engineers through their service provider (SAVVIS), and our IT team did the same with our local service provider (Bharti).
Within three hours of the issue being reported at 10:49 am, I could send an email to the India team saying the application was running smoothly and stable with 0% packet loss. I was sure that neither Bharti nor SAVVIS had caused the issue directly, because other internet sites worked fine from our office while the MDS application struggled, and the MDS application was fine from the US (confirmed by my US team member). I concluded the issue lay with the bridge (Level 3) between the India service provider (Bharti) and ATOMIC's service provider (SAVVIS).
Once the application was operating smoothly, I checked the route again and compared it with the earlier trace; it was different. The Level 3 issue had been resolved.
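For anyone curious how a route change like this can be spotted without staring at tracert output, here is a minimal sketch of the comparison, assuming a placeholder host name and a locally saved copy of the last known route; it shells out to tracert/traceroute and diffs the hop list against that baseline:

    import json
    import platform
    import re
    import subprocess

    def trace_route(host):
        """Return the hop addresses reported by tracert (Windows) or traceroute (others)."""
        cmd = ["tracert", "-d", host] if platform.system() == "Windows" else ["traceroute", "-n", host]
        out = subprocess.run(cmd, capture_output=True, text=True).stdout
        hops = []
        for line in out.splitlines():
            addrs = re.findall(r"\d+\.\d+\.\d+\.\d+", line)
            if addrs:
                hops.append(addrs[-1])   # last address on each hop line is enough for a diff
        return hops

    current = trace_route("mdsapp.example.com")          # placeholder host name
    try:
        baseline = json.load(open("last_known_route.json"))
    except FileNotFoundError:
        baseline = []
    if current != baseline:
        print("Route changed.\n old:", baseline, "\n new:", current)
    json.dump(current, open("last_known_route.json", "w"))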
Engineering Management
Wednesday, December 10, 2008
Thursday, November 6, 2008
Rude Awakening in Product Management
Version 2.9 of the application was running in production and version 3.0 was in the final stage of engineering development, but I was not comfortable releasing 3.0 over 2.9 straight into production. Version 3.0 was a complete facelift of the application, with new colours, a totally new navigation system, new modules and more. Despite its strong aesthetics and new features, I had concerns about user acceptance, because user habits had formed over a decade.
All the business stakeholders were convinced by my argument against a big-bang release of 3.0, and the decision was taken to go live with version 3.0 as a beta release on a separate web server with a new URL, pointing to the same database as 2.9. With this strategy, the code and stored procedures had to be compatible with both the old and new web applications accessing the same database.
This move was successful, and version 3.0 continued to be enhanced based on beta feedback, while support for version 2.9 never stopped.
Time went by and we released version 3.2 in production with more features and modules. Since the old 2.9 version had by then been up and running in production for almost three months, we took the call to turn it off, and we did so with ample notice to the application users and our clients.
Since the 3.0 release implemented many new business rules, its new features were bound to bring many enhancement requests from end users. The requests grew to the point where we decided on another release on top of 3.1 to address them all.
Turning off the old application running on the main production URL and routing it to the new application server
As notified to users, we had to turn off the old application, so we routed the main production URL to the beta site (the new application). The change took only a few minutes and was verified by all the engineering and operations users in India, but at the start of US business hours we noticed that clients could not log in through the production URL after DNS had been pointed to the new application server. On further analysis we found that redirecting the main production URL from the previous server to the new one had an issue with URL-to-IP mapping: the new IP for the main production URL had not yet propagated to all DNS servers globally. As a result, the browser (IE) was unable to resolve the IP when global users tried to log in. We had no choice but to point the URL back to the old IPs as a contingency, and users were then fine with the new application on the main production URL while the DNS changes finished propagating globally.
Lesson Learnt - Propagation of a new IP for a URL to all DNS servers globally can take around 24-72 hours.
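A simple pre-cutover check would have told us how far the new record had spread. The sketch below, which assumes the dnspython package and a placeholder hostname, asks a couple of well-known public resolvers what IP they currently return; if they disagree, the old address is still cached somewhere:

    import dns.resolver  # pip install dnspython

    HOSTNAME = "app.example.com"          # stand-in for the real production URL
    RESOLVERS = {"Google": "8.8.8.8", "OpenDNS": "208.67.222.222"}

    for name, server in RESOLVERS.items():
        r = dns.resolver.Resolver(configure=False)
        r.nameservers = [server]          # query this specific public resolver
        try:
            answers = r.resolve(HOSTNAME, "A")
            ips = sorted(a.address for a in answers)
        except Exception as exc:          # NXDOMAIN, timeout, etc.
            ips = ["lookup failed: " + str(exc)]
        print(name.ljust(8), "(" + server + "):", ", ".join(ips))

If every resolver already returns the new IP, the cutover has propagated; if not, some users will still be sent to the old address.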
Wednesday, November 5, 2008
A gender dropdown issue!
It was a festive holiday for my India software engineering team on account of Deepawali. My phone rang at 7:00 in the morning on 27th Oct 2008, a day before Deepawali; the call was from one of the senior managers on the operations team, informing me about a showstopper issue in the application. He said, "Hi, the gender dropdown in the patient demographics screen does not let users select any option, but it is mandatory for registration batch completion. I have also sent you a separate email about this issue."
At the outset I thought it was the usability issue with the gender dropdown field on the patient demographics screen that Jim, a product manager, had reported a few days back and had found working the next day, so I went back to sleep. Then I woke up suddenly, comprehending his words "... mandatory for registration batch completion", and rushed to my BlackBerry to learn the details. Even in the email the issue looked like a normal functional one, but I was anxious about its occurrence because no functional showstopper was expected: no code changes had been promoted to the live environment for the last few days, as they had been put on hold for the imminent Deepawali holidays.
In the beginning I assumed the observations might be an intermittent performance issue, caused either by network bandwidth or by service provider slackness. I checked with my IT manager and learnt that network connectivity was operating smoothly. Meanwhile the issue email was endorsed by the same manager with another issue in the application, and I was sure of some big mess. On further verification the issues became very obvious and could be replicated without fail. Sensing that something had got messed up, I called my technical analyst to verify and fix the issues; I wanted to get rid of them ASAP because I had not slept the whole night, thanks to the previous night's playing-card party. We Hindus play cards during the Deepawali festival; I don't know the rituals, but I enjoy it.
A few hours went by without any resolution or clue about the root cause. We did everything we could in a technical review to find differences between the UAT and production code bases. I also called up the tech architect and the release and configuration manager to drill down into the issue, but even after another four hours there was no clue. The tech architect was unable to connect to the production environment, so he rushed to the office for better connectivity.
Everyone working on the issue was brought onto a Yahoo conference, but surprisingly we noticed a bug in Yahoo IM: every new invitation initiated a new conference bridge.
I dabbled from code to environment, and from configuration to the possibility of a manual change having broken something in the system. On detailed analysis of the production issue, I conceived the possibility that the Microsoft-recommended patch installed in the production environment over the previous weekend might have caused the erratic behaviour. Assuming IIS had not been reset after the MS patch updates on the web server, we reset it as well, but that did not help. There was no choice other than contacting the data center and getting all the MS patches uninstalled, so I contacted the head of the data center in Los Angeles, CA and explained the issues to him. Initially he was a little reluctant about the identified potential root cause, but later he had no choice but to trust my words.
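As an aside, the question "exactly which patches landed on the box last weekend?" can be answered from the server itself rather than from memory. A small sketch, assuming a Windows server of that era with the wmic tool available:

    import subprocess

    # List installed Windows hotfixes (KB IDs) and their install dates via wmic.
    out = subprocess.run(["wmic", "qfe", "get", "HotFixID,InstalledOn"],
                         capture_output=True, text=True).stdout

    for line in out.splitlines():
        line = line.strip()
        if line:                      # skip the blank lines in wmic's padded output
            print(line)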
Since I had not slept for the last couple of nights and was very sleepy, I was lying on my bed beside my laptop for a catnap to get some energy. Looking at me, my better half said (in sarcasm), "Hey, it's been ages (days) since you cuddled me the way you are holding your laptop today. Is everything alright?" I said (with a heavy sigh), "I'm walking a tightrope! Let's talk later."
The process of uninstalling the MS patches and restarting the servers began, but the issue still persisted. Meanwhile Robert Wong (aka Rob), the data center head, learnt that the MS patch had caused issues in other clients' systems as well, so he was convinced that uninstalling the patches would kill the issue. The patch was removed from all the affected servers, yet the issue kept peeping through, because there is always a chance of incomplete removal of updates from every corner of the OS. I was running short of time: in the next couple of hours my US clients would start their business, and if the system still had issues it was going to be a tough time for engineering. I asked Rob to restore the OS image backed up before the patch updates of the previous weekend. He asked for one hour (30 minutes to reach the data center and another 30 minutes to process), because he had to physically rush to the data center for this. I knew it was an exigent situation, and in that arduous winter everything depended on Rob reaching the data center and restoring the OS image within the next hour, so that I would have enough time for a sniff test of the restored environment before US business hours started.
I gave him the green signal despite his disclaimer that he would not be available on IM or over the phone, only reachable by text, either email or SMS. It was already 1:15 pm, and in the next five hours my East Coast clients were expected to log in to the system.
The first email from Rob, at 1:19 PM under the subject line "MDSweb patch causing problems", read "Enroute to data center for roll back".
The second, at 1:47 PM, read "I've arrived at data center".
The third, at 1:51 PM, read "The copy process is only about 30 percent done".
I kept cogitating until my brain started agitating, because the estimated time had already been exceeded and there was no word from Rob about the situation.
The fourth, at 3:24 PM, read "Restore is finished but having some boot problems; will have to run it again".
The fifth, at 3:53 PM, read "It's not looking good. I have a blue screen. I'm doing everything I can to bring the system back up".
I was panic-stricken, but against overwhelming odds I held my horses, because this message could have panicked my team, who had been waiting for a final status after a prolonged struggle on their holiday since morning.
The sixth, at 4:16 PM, read "It's not the backup that's the problem, it's the disk partition during restore".
The seventh, at 4:54 PM, read "Ok! Problem was corruption in 'C' drive only. Backup confirmed with Zabi: the full C and data drive backup image is before patches, restoring from that now. About 15 min to go. Also, we had a while back prepared a fully working VM of MDSweb for disaster; if restore does not work we can fire up that VM and restore the most recent data drive, and that will be the fallback plan. Let's hope we don't need to go there. Second note: MDSweb is running on an older server that no longer has warranty; I suggest, as a diligent exercise for next week, we turn on the backup and let your team QC it".
The eighth, at 5:09 PM, read "Restore completed and system is booting. You can login and check in a few minutes".
It felt like conquering the world. I asked the team to verify the system in production, and it was all OK.
The ninth, at 5:23 PM, read "I'm headed home now. Please send email once you have verified the system".
I started my day on US hours, going through my whole morning routine in the India evening :)