Ugggh! We've been fighting a serious network traffic issue the past few weeks. At times the traffic is so heavy it will totally lock up one of our Dell PowerConnect 5324 switches and require a reboot! We've had 3 or 4 different switches lock up recently ... dude, something is not kosher on the GCC network!
We've done our best to sleuth out the problem via "watch the blinky lights" methods, Wireshark packet sniffing, and SNMP monitoring via Cacti (unfortunately I "killed" the desktop that I forgot was running our OpManager install ... doh!). So far we've found:
- a laptop which was in a wireless to wired network loopback
- an incorrectly placed patch cable creating a physical loopback in an IDF
- Our Dell 5324 switches don't come with packet storm control enabled ... it's now on and all our switches are limiting broadcast and multicast traffic to 1000 packets/second (default value) per switch port. It "seems" to have helped.
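To double-check what storm control is doing, the same 1000 packets/second idea can be applied to counters we already collect. The sketch below is only an illustration: the function name, port names, and sample numbers are made up, and it assumes you can get cumulative broadcast + multicast packet counts per port from somewhere (SNMP polling, for example) at two points in time.

```python
# Hypothetical helper: flag ports whose broadcast + multicast rate exceeds a limit.
STORM_LIMIT_PPS = 1000  # the per-port limit mentioned above


def storming_ports(before, after, interval_s, limit_pps=STORM_LIMIT_PPS):
    """Return {port: rate_pps} for every port over the limit.

    before/after map a port name to its cumulative broadcast + multicast
    packet count, sampled interval_s seconds apart.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    flagged = {}
    for port, end_count in after.items():
        start_count = before.get(port)
        if start_count is None or end_count < start_count:
            # New port, or the counter reset/wrapped: skip rather than guess.
            continue
        rate = (end_count - start_count) / interval_s
        if rate > limit_pps:
            flagged[port] = rate
    return flagged


# Made-up samples taken 60 seconds apart.
sample_t0 = {"g1": 10_000, "g2": 50_000, "g3": 7_000}
sample_t60 = {"g1": 16_000, "g2": 170_000, "g3": 7_300}
print(storming_ports(sample_t0, sample_t60, interval_s=60))
```

With these made-up numbers only g2 is over the limit (120,000 packets in 60 seconds). A port that keeps showing up in this list is a good candidate for a closer look with the sniffer.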
The frustrating part is that we can't seem to find anything overly obvious to point our finger at ... of course our lack of switch expertise isn't helping matters. So last Friday I started contacting local consulting companies that area IT folks recommended. Now I'm waiting on proposals to roll in ... and bracing for sticker shock. As of Tuesday we came out of crisis mode ... things seem to be working as expected, but we've still not identified the problem. Last time we had a major networking issue we traced it down to a single cat5e cable that had been pinched under a table leg ... causing it to spew enormous amounts of nonsense data through the network ... a failing network card can do something similar, but typically those are easy to trace down to a single port that is going nuts. This time there is no smoking gun :-(
Other things that have been brought to light through this process (most have been on our to-do list for a long time):
- We need to color code the cables in our IDFs so that it's easy to tell what's an uplink, VLAN, etc.
- We need to finish the project we started back in September to move and replace everything in our MDF. It's a total mess ... pictures of the chaos to follow.
- We need to settle on a quality network monitoring solution ... and we need someone with expertise to set up/configure this solution so we're monitoring the right stuff and know what things to look for that would indicate the initial stages of a potential problem. Everyone says Orion is the Cadillac solution ... and of course it's really expensive.
- We need volunteers who are switch/infrastructure gurus. I'll get something put in the GCC enewsletter shortly.
- We need more managed switches ... and move to eliminate all non-managed switches from here out.
- Perhaps we should move from Dell to, say, HP switches? ... I already know the consultants are going to push for us to move to Cisco switches. I don't believe the cost is justifiable though.
- We need to keep an up-to-date map of switch port to device so we know what thing is plugged in where. How do you keep this up to date?
- We need to keep up-to-date IP/MAC addresses for all Tech Arts gear (AMX, amps, lights, projectors, etc.)
- Implement more VLANs? (start with the tech arts gear?)
- We need experts to come in and optimize our LAN infrastructure. Our video guys push huge multi-gig files all over the network ... how can we keep things speedy for them and the other users on those switches? QoS? Bandwidth throttling? VLANs? Light up more of the fiber in the backbones?
- We have no redundant data pathways. I'd like to see us move to a place where a single switch being down only impacts the users connected to those ports.
- We need packet sniffers in each IDF ... we're thinking old laptops with large drives and an extra PCMCIA NIC or two would fill this need nicely in most IDFs with 1 or 2 switches.
- Did I mention the need for experienced volunteers in these areas? :-)
- What am I forgetting/missing?
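On the port-to-device map question above, one possible approach (an assumption, not something we've tried yet) is to save a snapshot of each managed switch's MAC address table on a schedule and diff it against the last one, so the map only needs a human when something changes. The snapshot format, switch names, and MAC addresses below are all made up for illustration:

```python
def diff_port_maps(old, new):
    """Compare two {(switch, port): mac} snapshots.

    Returns three dicts: ports that appeared, ports that went away,
    and ports whose attached MAC address changed (as (old, new) pairs).
    """
    added = {key: new[key] for key in new.keys() - old.keys()}
    removed = {key: old[key] for key in old.keys() - new.keys()}
    changed = {
        key: (old[key], new[key])
        for key in old.keys() & new.keys()
        if old[key] != new[key]
    }
    return added, removed, changed


# Hypothetical snapshots, e.g. exported from each switch's address table.
last_week = {
    ("idf1-sw1", "g1"): "00:11:22:33:44:55",
    ("idf1-sw1", "g2"): "00:11:22:33:44:66",
    ("mdf-sw1", "g5"): "00:aa:bb:cc:dd:01",
}
today = {
    ("idf1-sw1", "g1"): "00:11:22:33:44:55",
    ("idf1-sw1", "g2"): "00:11:22:33:44:77",
    ("idf1-sw1", "g9"): "00:aa:bb:cc:dd:02",
}

added, removed, changed = diff_port_maps(last_week, today)
print("new ports:", added)
print("gone ports:", removed)
print("different device:", changed)
```

The same snapshot could double as the IP/MAC list for the Tech Arts gear if each MAC is tagged with a device name. Ports behind unmanaged switches would show several MACs, which this simple one-MAC-per-port sketch does not handle.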
So right now I have lots of questions and wants ... it'll be interesting to see what we can get accomplished in the next weeks/months. If you're a guru in any of the above areas and you're interested in helping us (volunteer or pay) shoot me an email at gccjason gmail.com
I am the IT Director for Calvary Melbourne, and when I got here 8 months ago, we had similar things happening frequently. What is running your core? We have a Dell 6024F Full Fiber Switch. We also recently went through the process of adding fiber capacity and now our entire network is in a star configuration. On our back end we used the Layer 3 capabilities of the 6024 to segment all the traffic into multiple VLANs...most of which are IDF-centric.
-Do you have a layer 3 device at your core? L3 Switch / Router / etc
-Dell switches are working great here... We have about 150 employees between Church and School, and our Video production team is constantly doing the same thing...moving whole sermons and other multi-Gig files back and forth.
-Usually segmenting the network if it is getting large will help with the problems.
-At my last company we used What's Up Gold to monitor all our servers / networks. It's not that $$$, but it does a great job. It can be configured to look for specific Event IDs in Windows and alarm on them as well.
Posted by: Chris Kehayias | April 05, 2007 at 08:21 AM
Have you considered the possibility that you may have been pwned? The latest animated cursor overflow bug found in ALL versions of Windows from NT-Vista, including server OS's, is easy to use and until yesterday MS had no patches for it. Yesterday I discovered a Windows 2003 server belonging to a friend of mine that had been exploited. He noticed his network was crawling to a halt. I checked it out and found a new Admin user that wasn't there before and a bunch of hacktools in that user's profile. There was a process that was consistently using 80% of his CPU resources. This was a 2003 server that was fully updated with all MS updates, had Symantec corporate fully updated, and was running behind a firewall.
Of course your problem could be a lot of things, but don't exclude this possibility.
Posted by: Bryson Medlock | April 05, 2007 at 10:43 AM
Just a few thoughts off the top of my head...if you could post more details of your network, specifically your core it would be helpful.
-VLANs...I just started network segmentation here @ NCC and it has helped quite a bit. I'd definitely recommend putting your tech arts on their own VLAN (that's what we did, and it helped considerably) and look at giving them a dedicated gigabit switch for their own private network to use.
-Layer 3...As Chris mentioned, having Layer 3 devices at your core will really help segment and control your traffic, network-wide.
-Switch wise you might want to consider 3com. We use their switches exclusively...they have a great centralized management utility called Network Supervisor (really makes management a snap).
-I'm going to assume your core still consists of 10/100 switches/devices. Sounds like you're in need of a gigabit or total fiber backbone. This would really help alleviate some congestion. You could set up a fiber core linked to all your IDFs, possibly even connect all your servers to fiber as well. You mentioned you'd like redundant data paths; Cisco really shines with redundancy and bandwidth control...you might want to start "building the case" for why Cisco should be at your core.
- Security wise...get all managed switches and enforce MAC-address policies on them...that's a good start towards controlling unauthorized port use. Might be time to start looking at NAC technologies.
In closing...I'd go through your building, pull out the unmanaged switches and replace them with managed ones, create logical VLANs for different departments/areas/etc., upgrade your core/backbone to fiber or gigabit devices, enforce MAC filtering on ports, and maybe even set up some QoS standards as well.
Check out the 3com Network Supervisor; it will work with non-3com devices and it's a free download from their site. On our switches it shows the stress load on each switch and what port(s) is causing it...it even shows logical maps of your network, and in addition it will map IP addresses to their port on the switch. This has been very helpful in diagnosing high network traffic.
If you can provide some more detail about what currently is at your core in terms of switches/routers and how your MDF and IDFs connect, that'd really help...I may have a couple leads on some used fiber 3com equipment from my old job (public school IT)...let me know if you're interested.
Posted by: Travis Kensil | April 05, 2007 at 06:20 PM
Hey, you're making me want to come back and play with the network! I'm actually on the Networking track of the Computer Technology degree at IUPUI (granted, I just started last semester, but it's a strong interest). I may not have the answers, but I'd jump at the chance to learn about how they're discovered!
We're not doing everything perfectly at Lakeview, but because we just got all-managed core switches (HP ProCurve) and matching wireless, we do have extensive VLAN, port security, and other features that I haven't even scratched the surface of yet, but hope to soon. Just having managed switches gives me so much insight into the whole network or any part of it! I haven't done much with 802.1X yet, although I have it working on a test employee wireless network SSID and it's very cool! I want to expand it to the wired switch ports and see how it goes. I have, however, done some Port Security (MAC-based) on ports in public areas. Our nursery check-in system, for example, is in our lobby. Each port is locked to only the MAC address of the check-in computer. Plug another NIC into the port: shutdown! The port then has to be re-enabled administratively, preventing further attack.
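The port-security behavior described above (one allowed MAC, shutdown on violation, manual re-enable) can be modeled in a few lines. This is only a toy model of the logic, not switch configuration; the class name and MAC addresses are made up:

```python
class SecuredPort:
    """Toy model of a switch port locked to a single MAC address."""

    def __init__(self, allowed_mac):
        self.allowed_mac = allowed_mac.lower()
        self.enabled = True

    def frame_from(self, mac):
        """Return True if a frame from this source MAC would be forwarded."""
        if not self.enabled:
            return False  # port stays down until an admin re-enables it
        if mac.lower() != self.allowed_mac:
            self.enabled = False  # violation: shut the port down
            return False
        return True

    def admin_reenable(self):
        """Stand-in for an administrator bringing the port back up."""
        self.enabled = True


# Hypothetical MACs: the check-in computer and a stranger's laptop.
checkin_pc = "00:11:22:33:44:55"
port = SecuredPort(checkin_pc)

print(port.frame_from(checkin_pc))           # check-in PC is forwarded
print(port.frame_from("de:ad:be:ef:00:01"))  # unknown NIC trips the shutdown
print(port.frame_from(checkin_pc))           # even the right PC is blocked now
port.admin_reenable()
print(port.frame_from(checkin_pc))           # forwarded again after re-enable
```

The point the model makes is the third call: after a violation the legitimate device is cut off too, so this fits ports where the attached device rarely changes.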
802.1X's power comes in with multiple VLANs; it's possible to set up all wired ports in our lobby for guest internet access (like the public wireless, only wired), but if someone on staff plugs in with their laptop configured for 802.1X authentication, it will authenticate them and then bump their computer onto the employee internal VLAN, until they unplug, at which time BAM!, back as a guest port! I'm really close to this, with a bit more playing. And port-security with MACs can still be used on ports with copiers and printers connected, that don't change all the time.
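As a rough picture of that guest/employee flip, here is a toy state model. It does no real 802.1X or RADIUS work; the VLAN IDs and class name are invented, and "authenticated" simply stands in for whatever the authentication server decides:

```python
GUEST_VLAN = 20     # hypothetical VLAN IDs, not from any real config
EMPLOYEE_VLAN = 10


class LobbyPort:
    """Toy model of a port that defaults to guest access."""

    def __init__(self):
        self.vlan = GUEST_VLAN

    def plug_in(self, authenticated):
        # A device that passes authentication is moved to the employee VLAN;
        # anything else stays on (or returns to) the guest VLAN.
        self.vlan = EMPLOYEE_VLAN if authenticated else GUEST_VLAN

    def unplug(self):
        # Link down: the port falls back to guest for the next device.
        self.vlan = GUEST_VLAN


port = LobbyPort()
port.plug_in(authenticated=True)   # staff laptop configured for 802.1X
print(port.vlan)
port.unplug()
print(port.vlan)
port.plug_in(authenticated=False)  # visitor laptop
print(port.vlan)
```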
Are all of your subnets on separate VLANs, and if so what does the routing? The switches? Your SonicWall firewall? How many managed switches do you have vs. unmanaged? If only I'd had time to stick around for a network tour on Tuesday! I'll be following this story closely :-)
Posted by: David Szpunar | April 05, 2007 at 10:44 PM
I'm trying to move to colour-coding patch cords here too:
Blue: Stations on primary VLAN
Yellow: Stations on public VLAN
Red: Unsecured internet VLAN
Orange: Uplink/Trunks (Carry multiple VLANs)
White: Voice
What I want to add is :
Green: Management (DRAC cards, UPS, etc)
Purple: Lighting control VLAN
Gray: Non-ethernet data (PRI, RS-232, etc)
Posted by: Ian Beyer | April 05, 2007 at 11:01 PM
Travis, we've had nothing but problems with the 3Com switches - the failure rate we've experienced at COR has been unacceptable (ask Clif about it some time), and we've been exceedingly happy with our HP 4100GL series core switches, and HP 2600 series edge switches. We've also been playing with an HP 1800 series gigabit switch in the IT department that is very attractive from a cost standpoint (I think we paid under $400 for 24 ports, managed - they also have an 8-port version for about $150 which I'd love to have at my desk and be able to tap multiple VLANs - It's been rock solid. )
Posted by: Ian Beyer | April 05, 2007 at 11:06 PM
Ian,
That's very interesting...we had a couple of problems with our 3coms a few years ago after some buggy software updates but have never had any "real" issues. What kind of failures did you get (reboots, total failure, etc.)? I'd like to know so I can start watching for the signs, although we are slated to upgrade these next year and move to Cisco products.
Posted by: Travis Kensil | April 06, 2007 at 04:43 PM
Well, I started out commenting here, but it got so long I made it its own post :-) Consider it included by reference :-) http://infotech.lakeviewchurch.org/2007/04/07/procurve-switches-and-our-network/
Posted by: David Szpunar | April 07, 2007 at 11:22 AM