08-12-2014 02:08 PM
We have been using the Infoblox captive portal (i.e., authenticated DHCP) for a year now. The system has been flawless this entire time. This past week we have been running into performance issues where the front page of the portal takes anywhere from 10 to 90 seconds to render. For example:
jemurray@janus:~$ time curl -s https://portal.ip.wustl.edu/ | grep real
We don't have any problems with the actual registration page where you enter the username and password:
jemurray@janus:~$ time curl -s https://portal.ip.wustl.edu:4433/register-user.html | grep real
It is only an issue with the web server that is listening on ports 80 and 443.
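To compare the two listeners over time rather than from single curl runs, something like the following rough sketch could be used. It assumes plain Python 3 on a test host; the helper names (fetch, time_calls, summarize) are made up for illustration and are not part of any Infoblox tooling.

```python
import statistics
import time
import urllib.request


def fetch(url, timeout=120.0):
    """Download url and return the number of bytes read."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return len(resp.read())


def time_calls(func, repeats=5, pause=0.0):
    """Call func() `repeats` times; return each wall-clock duration in seconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        samples.append(time.perf_counter() - start)
        if pause:
            time.sleep(pause)  # optional gap between requests
    return samples


def summarize(samples):
    """Return (min, median, max) of the timing samples."""
    return min(samples), statistics.median(samples), max(samples)


# Example call (hits the network, so it is left commented out here):
# front = time_calls(lambda: fetch("https://portal.ip.wustl.edu/"), repeats=10)
# print("front page min/median/max:", summarize(front))
```

Running the same call against the :4433 registration URL would give a side-by-side view of the two listeners.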
The servers are an HA pair of 820s running portal services only.
We have made the following big changes (although we didn't notice any problems until this week):
July 30, 2014 enable HA (was a standalone node before)
June 12, 2014 upgrade from 6.8.1-210379 to 6.10.6-240571
Based on support recommendations, we have removed the connection limits on the portal:
set connection_limit https 0
set connection_limit http 0
(Although I don't think this was the cause, since I never saw any of my test IPs blocked in the syslog.)
Another recommendation was to reboot the entire portal cluster. This didn't help.
Support requested information from the following commands:
show cpu 1 10
show process all
show connections numeric
Based on the output of these commands, the server is not CPU or memory/swap limited. A few hundred unregistered devices are active in the portal (probably devices that grab the first available open wireless network) and probably won't register for a while. This is to be expected for a captive portal.
We did run a packet capture on the portal for about 10 minutes and analyzed it through Wireshark. There were about 200 unique clients, and none of them stood out as heavy hitters that were obviously eating up all the resources.
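For anyone wanting to repeat the heavy-hitter check outside of the Wireshark GUI, here is a minimal sketch. It assumes the capture's source addresses have been exported one per line (for example with tshark's `-T fields -e ip.src`); the function name top_talkers is made up for illustration.

```python
from collections import Counter


def top_talkers(sources, n=10):
    """Count packets per source address; return (ip, count, share) for the top n."""
    counts = Counter(s.strip() for s in sources if s.strip())  # skip blank lines
    total = sum(counts.values())
    return [(ip, c, c / total) for ip, c in counts.most_common(n)]


# Example call, assuming a hypothetical export file named sources.txt:
# with open("sources.txt") as fh:
#     for ip, count, share in top_talkers(fh):
#         print(f"{ip:15} {count:8} {share:6.1%}")
```

If no single address holds a large share, that matches what we saw: the load is spread across many clients.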
Without any real access to the backend system to troubleshoot this issue, my theory is that the httpd web server configuration changed between versions 6.8 and 6.10 in a way that is causing some type of performance issue. Since this is a portal system where many devices may live in limbo for a while, I am guessing the trigger is automated app drones (WeatherBug, Facebook, Gmail, etc.) hitting the portal while trying to get their updates.
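One way to test the app-traffic part of this theory would be to look at which hostnames the unregistered clients were actually asking for when they landed on the portal. The sketch below assumes the HTTP Host headers from the port 80 traffic have been exported one per line (for example with tshark's `-T fields -e http.host`); port 443 traffic is encrypted, so it would not show up this way. The function name redirected_hosts is made up for illustration.

```python
from collections import Counter

PORTAL_HOST = "portal.ip.wustl.edu"  # hostname taken from the curl commands above


def redirected_hosts(hosts, portal_host=PORTAL_HOST, n=10):
    """Return (share of requests aimed at non-portal hosts, top n such hosts)."""
    cleaned = [h.strip().lower() for h in hosts if h.strip()]
    if not cleaned:
        return 0.0, []
    others = Counter(h for h in cleaned if h != portal_host)
    return sum(others.values()) / len(cleaned), others.most_common(n)
```

A large share of requests for app update hosts rather than the portal itself would support the theory; a small share would point back at the web server configuration.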
Have any other portal users noticed performance problems on their captive portal devices?
Support is actively working on this issue. However, with student move-in fast approaching, we need a quick resolution. We are very limited in what we can troubleshoot or debug because of the appliance nature of the systems, and I don't like having to rely solely on support, so if anyone has any ideas to try, please share them.
-- Jason E. Murray Sr. Systems Engineer Washington University in St. Louis Phone: 314-935-4865 Email: firstname.lastname@example.org Web: http://nts.wustl.edu/~jemurray/