Jyoti Bansal is the founder of AppDynamics, an application management company that, among other things, “makes sure essential software applications of customers such as Netflix stay up and running.” I gave him a call to talk about some of the technological challenges that have plagued HealthCare.Gov during its first week online, what’s causing those problems, and what he would change if put in charge of the project.
What follows is a transcript of our conversation, lightly edited for clarity.
Sarah Kliff: HealthCare.Gov has obviously had some trouble getting people signed up this week. As an outsider, what do you think is causing the problems?
Jyoti Bansal: Based on my experience, the challenges look like glitches in software code. And the software code didn’t go through enough testing. It would take some time to find the bugs.
Then there are bugs in scalability, what happens when 100 people or more are trying to do the exact same thing. Those are the things that really need tuning at this point.
Most problems like these are in the software. Hardware is the easy part. You can add more hardware and do it easily. Software takes more time. In the rush of getting this out, it seems like testing wasn’t done completely. My expectation is that these problems should go away in the next few weeks. The site still won’t be as fast as something like Netflix, but it should work.
SK: Why wouldn’t it be as fast as a site like Netflix?
JB: Netflix has spent the last 10 years perfecting, redesigning and rearchitecting a very good user experience. This is very new. It will take time to get to that level. It could be there in six months, but I wouldn’t expect it to get there in the first step.
SK: Can you talk a little bit more about the glitches you think are going on, both ones in the code and ones that have to do with scalability? What about the front end of the Web site tips you off to that?
JB: As an outsider, you see the bugs in the code when the site is functional, you click on something, and an error message pops up. Those are bugs in the code, not a performance or scalability issue. That part hasn’t been tested, so when you click, maybe to figure out whether you’re eligible for Medicaid or what the subsidies are, and you get an error message, that’s a bug in the code.
Scalability is when you log in, and there are too many users. What they’ve done, which is a smart thing to do, for now at least, is meter access. At the front they’ll only allow a certain number of users. It’s kind of like forming a line.
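As an illustration of what “metering access” can look like, here is a minimal Python sketch of an admission gate: a fixed number of users are let in, and everyone else waits in a first-come, first-served line. The class and method names (`AdmissionGate`, `arrive`, `leave`) are hypothetical, and nothing here is meant to describe how HealthCare.Gov actually implements its waiting room.

```python
from collections import deque
from typing import Deque, Optional, Set


class AdmissionGate:
    """Admit at most `capacity` users at once; everyone else waits in line."""

    def __init__(self, capacity: int) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.active: Set[str] = set()        # users currently inside the site
        self.waiting: Deque[str] = deque()   # users queued, oldest first

    def arrive(self, user: str) -> bool:
        """Return True if the user is admitted now, False if they must wait."""
        if user in self.active:
            return True
        if len(self.active) < self.capacity:
            self.active.add(user)
            return True
        if user not in self.waiting:
            self.waiting.append(user)
        return False

    def leave(self, user: str) -> Optional[str]:
        """Free the user's slot and admit the next person in line, if any."""
        self.active.discard(user)
        if self.waiting and len(self.active) < self.capacity:
            next_user = self.waiting.popleft()
            self.active.add(next_user)
            return next_user
        return None
```

With a capacity of two, a third arrival is queued and is admitted only when one of the first two users leaves. The gate protects the software behind it from overload, but it does not make that software any faster.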
SK: I know the White House says they’re working hard to add additional servers and increase the site’s capacity. How hard is that to do?
JB: On day one, if you’re having high volume, you could add more servers. Hardware is the easy part. Let’s say you add servers, and hardware isn’t a problem but you still can’t keep up on scalability, then that’s indicative of something wrong in the software. It’s like you have four lanes in the highway converging into three lanes of a bottleneck. If your software isn’t designed to reach all the lanes, that will happen.
SK: The Obama administration has said that all these problems are happening because of overwhelming traffic. How good of an explanation is that?
JB: That seems like not a very good excuse to me. In sites like these there’s a very standard approach to capacity planning. You start with some basic math. Like, in this case, you look at all the federal states and how many uninsured people they have. Out of those you think, maybe 10 percent would log in on the first day. But you model for the worst case, and that’s how you come up with your peak of how many people could try to do the same thing at the same time.
Before you launch you run a lot of load testing with twice the load of the peak, so you can go through and remove glitches. I’m a very, very big supporter of the health-care act, but I don’t buy the argument that the load was too unexpected.
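The “basic math” Bansal describes can be sketched in a few lines of Python. The 10 percent first-day share and the 2x load-test factor come from his answer. Everything else is an assumption made for illustration: the function names, the share of daily visitors who arrive in the busiest hour, the average session length, and the placeholder state populations, which are not real data.

```python
from typing import Mapping


def first_day_visitors(uninsured_by_state: Mapping[str, int],
                       first_day_fraction: float = 0.10) -> float:
    """Estimate first-day visitors as a fraction of the uninsured population."""
    return sum(uninsured_by_state.values()) * first_day_fraction


def peak_concurrent_users(daily_visitors: float,
                          busiest_hour_share: float,
                          session_minutes: float) -> float:
    """Estimate how many users are on the site at once during the busiest hour.

    Assumes arrivals are spread evenly over that hour, so concurrency is the
    arrival rate per minute times the average session length in minutes.
    """
    arrivals_per_minute = daily_visitors * busiest_hour_share / 60
    return arrivals_per_minute * session_minutes


def load_test_target(peak_users: float, safety_factor: float = 2.0) -> int:
    """Load-test at a multiple of the modeled peak (twice, per the interview)."""
    return round(peak_users * safety_factor)


# Placeholder inputs, not real figures.
uninsured = {"state_a": 1_000_000, "state_b": 500_000}
visitors = first_day_visitors(uninsured)
peak = peak_concurrent_users(visitors, busiest_hour_share=0.20, session_minutes=20)
target = load_test_target(peak)
```

With these placeholder inputs, the model works out to roughly 10,000 concurrent users at peak and a load-test target of about 20,000. The point is less the specific numbers than that the peak is something you estimate before launch and then deliberately exceed in testing.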
SK: What would you be doing right now if you were running healthcare.gov?
JB: First I would put some really good instrumentation in place. The problem is if you’re fighting a fire, and it’s dark, you don’t know what’s going on. In other words, you can’t manage what you can’t measure. So first I would put something in place so you can measure what’s happening.
The second thing I’d do is I’d start building a very good load testing environment, so everything could be simulated in a load test, and move faster. Really everything is about speed right now, how quickly can you find problems and fix them. Ninety percent of the effort is really finding what to fix. Making the coding changes is only about 10 percent.
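The two steps above, measuring what happens and simulating load, can be combined in a small harness. The Python sketch below is only an illustration under stated assumptions: `run_load_test` and `fake_eligibility_check` are hypothetical names, the “request” is a stand-in function that fails for every tenth user, and a real load test would drive the actual site through a dedicated tool rather than a thread pool.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from statistics import median
from typing import Callable, Dict, Tuple


def run_load_test(handler: Callable[[int], None],
                  users: int,
                  workers: int = 50) -> Dict[str, float]:
    """Call `handler` once per simulated user and summarize latency and errors."""
    if users < 1:
        raise ValueError("users must be at least 1")

    def one_request(user_id: int) -> Tuple[float, bool]:
        # Instrumentation: time every call and record whether it failed.
        start = time.perf_counter()
        try:
            handler(user_id)
            ok = True
        except Exception:
            ok = False
        return time.perf_counter() - start, ok

    # Run the simulated users concurrently, up to `workers` at a time.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(one_request, range(users)))

    latencies = sorted(seconds for seconds, _ in results)
    errors = sum(1 for _, ok in results if not ok)
    return {
        "requests": users,
        "error_rate": errors / users,
        "median_seconds": median(latencies),
        "p95_seconds": latencies[int(0.95 * (len(latencies) - 1))],
    }


def fake_eligibility_check(user_id: int) -> None:
    """Hypothetical stand-in for a real request; fails for every 10th user."""
    time.sleep(0.001)
    if user_id % 10 == 0:
        raise RuntimeError("simulated bug")


report = run_load_test(fake_eligibility_check, users=100, workers=10)
```

The report separates the two kinds of trouble described earlier in the interview: the error rate points to bugs in the code, while the latency figures show how the system behaves as load grows.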