PowerEdge 6650 NIC Issues

Problem

I have a new problem with my Dell PowerEdge 6650 servers. They crash! Well, sort of. Under heavy NFS usage (usually compiling a large program over an NFS mount) the machine seems to hang. I say “seems” because all SSH connections to the machine time out and it does not respond to network traffic. However, a look at the console proves the machine is alive and well.

Platform

  • 2x Dell PowerEdge 6650 Servers (One NFS client, One NFS server)
  • Embedded Broadcom BCM5700 Gigabit NICs (Connected via Cisco Gigabit switch)
  • CentOS 5.2 with a custom 2.6.25 kernel

Evidence

Once on the console, the obvious step is to check the system logs. If you wait long enough, the network interface begins working again and the following messages appear in the system log. Of course, eth1 may be a different interface on your machine.

kernel: NETDEV WATCHDOG: eth1: transmit timed out
kernel: tg3: eth1: transmit timed out, resetting
kernel: tg3: DEBUG: MAC_TX_STATUS[00000008]  MAC_RX_STATUS[00000008]
kernel: tg3: DEBUG: RDMAC_STATUS[00000000]  WDMAC_STATUS[00000000]
kernel: tg3: tg3_stop_block timed out,  ofs=1800 enable_bit=2
kernel: tg3: tg3_stop_block timed out,  ofs=4800 enable_bit=2
kernel: tg3: eth1: Link is down.

This says that the Ethernet watchdog figured out that the network interface hung, crashed, deadlocked or otherwise got stuck. The remedy? The watchdog has the tg3 driver reset the interface.
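To spot these events without reading the log by hand, something like the following could work. It is only a minimal sketch: the regular expression is modeled on the single tg3 reset line quoted above, and the function name `count_tg3_timeouts` is made up for illustration. Other driver versions may word the message differently.

```python
import re

# Modeled on the line "kernel: tg3: eth1: transmit timed out, resetting".
# Assumption: the driver always uses this wording; adjust if yours differs.
TIMEOUT_RE = re.compile(r"tg3: (\w+): transmit timed out")


def count_tg3_timeouts(lines):
    """Return a dict mapping interface name to its number of tg3 transmit timeouts."""
    counts = {}
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            iface = match.group(1)
            counts[iface] = counts.get(iface, 0) + 1
    return counts


# Three of the log lines from the excerpt above; only the tg3 reset line matches.
sample = [
    "kernel: NETDEV WATCHDOG: eth1: transmit timed out",
    "kernel: tg3: eth1: transmit timed out, resetting",
    "kernel: tg3: eth1: Link is down.",
]
print(count_tg3_timeouts(sample))  # prints {'eth1': 1}
```

To run it against a real log, pass an open file object instead of the sample list. On a stock CentOS install that would probably be /var/log/messages, but that path is an assumption about your syslog configuration.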

So every time you get the network crash you have two options:

  1. Wait ~15 minutes for it to fix itself (miraculous, I know!)
  2. Restart the networking service
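Option 2 can be scripted. The sketch below assumes a Red Hat style system where `service network restart` is the way to restart networking, and the helper name `restart_networking` is hypothetical. It defaults to a dry run so nothing is restarted by accident.

```python
import subprocess

# Assumption: a Red Hat style init setup where this command restarts networking.
RESTART_CMD = ["service", "network", "restart"]


def restart_networking(dry_run=True):
    """Return the restart command as a list; also run it unless dry_run is set."""
    if not dry_run:
        # Needs root. Raises CalledProcessError if the command exits non-zero.
        subprocess.check_call(RESTART_CMD)
    return list(RESTART_CMD)


# Dry run: only show what would be executed.
print(" ".join(restart_networking(dry_run=True)))
```

Since SSH is unreachable while the interface is hung, this would have to be started from the console or from a scheduled job on the machine itself.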

Resolution 1

As I mentioned, this was fairly repeatable for me. All I had to do was attempt to compile an application in a remote directory mounted over NFS. The NFS client was always the party to crash. The cheap (well, lazy) fix that I took was to add a spare PCI gigabit NIC to the client machine. This resolved the problem on the client side.

Problem Re-emergence

After a couple of weeks of operating with the client on the PCI NIC and the server on its embedded NIC, the server’s NIC locked up just like the client’s had previously. This time I got a little fed up because I didn’t have a spare gigabit NIC to put in the server.

Resolution 2 – I Hope

This resolution is tentative. I have implemented it and have not had a crash, but I will not trust that it is permanently fixed until more time has passed. I’ll post an update if I have any new issues.

On the advice of the local sysadmin, I went to Dell’s website and poked around until I found an ISO CD image that contained all of the possible firmware updates for the PowerEdge 6650 on one CD. He recommended I give that a try and upgrade every single piece of firmware possible.

The result of the scan from the CD was that I was up to date on everything but the “BMC”, the Baseboard Management Controller. My version was 1.64 and the latest was 1.78. So I let the CD do the firmware upgrade for me.

Since the upgrade (28 days ago) and a reboot, I have not had another NIC crash. I don’t consider this conclusive yet because it is very possible that the conditions have to be just right to trigger the crash.

In summary, the correct solution appears to be to update all of the server firmware (duh?). The easiest way to do that is to get the update CD for your OS from Dell. The CD is called something like “Dell CD ISO – PowerEdge Updates”. Let this also be a warning: until I knew that update CD existed, I thought that I had upgraded all of the firmware possible in the server via individual floppies. Don’t make the same mistake. Try Dell’s update-all CD.

Failure .. Again!

Today I crashed the NIC in the NFS server again… I’m looking for a new fix!

See Comments for updates from me.

Backlight replacement for IBM Thinkpad T42p

A couple of weeks ago I screwed up big time. I wasn’t being too careful and I dropped my laptop when moving it off a bar-height table. The net effect: no image when I turn it on. My nearly instant diagnosis was a broken backlight. Just to make sure nothing else broke, I put it on my docking station and everything worked just fine.

So I ordered a new backlight from eBay: $10 for the light and $8 for shipping from Hong Kong. The annoying part was waiting 2 weeks for it to arrive.

The light finally arrived and I reserved today for making the repair. The job took about 6 hours total. The first part, disassembling and removing the LCD as a whole, was easy thanks to IBM’s good service manual. But the service manual doesn’t describe how to do any work on the LCD itself. The picture below shows as far as the service manual takes you.

The LCD removed as per Service Manual

At that point I was on my own and things got slow.

It took several hours of meticulous care to find and remove screws and figure out where to pull to safely remove parts. Eventually, I got it completely apart. My diagnosis was confirmed when I got down to the light and small pieces of glass sprinkled out. Knowing that these bulbs contain mercury, I backed off and returned to the area a few minutes later. (All the mercury was probably gone before I even left, but I decided to be cautious.)

Here is a shot of the screen in a few different pieces.

Unfortunately I didn’t take any pictures of the actual LCD part once it had been disassembled.

In the picture below you can clearly see the 3 wireless antennas embedded in the LCD housing. The silver metal at the bottom is the Bluetooth antenna. The 2 copper pieces are the Wi-Fi antennas (Main, Aux), as this laptop and its Wi-Fi card support antenna diversity.

Here is another picture, just for fun, of the guts of the laptop. Interestingly enough, I can’t figure out what the chip in the center with no heatsink attached is. My guess is the GPU, but running one without a heatsink just doesn’t seem like a wise idea. It strangely resembles a CPU.

And finally, we see success!

The first boot after the repair had me nervous because it stayed on POST for a long time. When I hit escape I realized why… It was doing an extended RAM check of 2GB because I had removed the BIOS battery, causing it to forget all of its settings. I removed the battery because I learned long ago that when you’re working on a laptop in that many pieces, you remove every source of power: wall power, standard battery, second battery and BIOS battery!

My repair job turned out not to be perfect. :( I failed to keep all of the dust out of the LCD when I had it split into many layers. This resulted in some dust specks that I cannot remove from the screen unless I take it apart again. The image seems to have a greenish/blue tint too. This could be because of the color temperature of the new bulb, the power output of the inverter, or something sitting slightly out of position after my install. I was able to remove some of the tint by altering the video card’s color settings. I can certainly live with it because it really is a minor tint. I guess this just means I’ll have to plan for a new laptop in 1-2 years. I can’t see throwing this thing away before then because it still smokes a lot of laptops on the market.

UPDATE: Read the comments for new information on this subject!

Toshiba Satellite HDD Password

Something like 2+ months ago my girlfriend’s friend asked me if I knew how to fix her laptop. She had been using it, and then after a reboot it said “Enter HDD Password”. The notebook was a Toshiba Satellite.

I had never worked on this brand before, so I couldn’t troubleshoot over the phone and had her leave the laptop with me. 2+ months later I actually took a look at it.

Symptoms:

It spends approx. 2 mins on the Toshiba POST screen… SLOW! There is a faint HDD spinup/spindown sound. Finally it advances to “Enter HDD Password”.

The owner claimed that she did not intentionally enter an HDD password but did acknowledge that she might have without realizing what she had done.

HDD Passwords can’t be cracked:

Or so the rumor goes. Anyway, I did a few searches on the net and, sure enough, the only way to get past the HDD password on that model is to send it to a Toshiba authorized repair center.

What’s that sound:

At this point I decided to turn my stereo off. Now I could hear a faint sound of HDD death! So I pulled the drive out and put one of my own laptop’s drives in. I turned the Toshiba on and it passed the Toshiba POST screen nearly instantly and then gave me the missing NTLDR message. Good, that’s what it’s supposed to do. Then I decided to put the dead-ish Toshiba drive in my laptop’s second drive bay. In there, the sound of HDD death was even louder. The drive made attempt after attempt to spin up, but eventually it gave up.

In Summary:

If your Toshiba laptop suddenly starts asking for an HDD password at boot but you don’t remember setting one… your hard disk drive could be dead. This is a little more comforting than thinking you’re an idiot, right?