Q&A: building the Stampede supercomputer
We talk to the man leading production of the 10-petaflop Stampede at TACC
This week, the 10-petaflop Stampede supercomputer went online after two years in development.
One of the prime uses of supercomputers is to provide resources for the scientific community. In 2011, the US’s National Science Foundation put out its solicitation for a new supercomputer and TACC, the Texas Advanced Computing Center, won the contract.
The result is Stampede, a Dell PowerEdge C8220 cluster packed with Intel Xeon Phi coprocessors. It has more than 6,000 compute nodes, more than 96,000 processing cores, and 205TB of memory, for close to ten petaflops of peak performance.
Before the launch, we spoke to its director, Jay Boisseau, to find out what it takes to create the world’s seventh fastest computer, and how TACC worked with Dell and Intel to create a computer that would stand out from the rest.
Q. What’s the timeline for building a supercomputer like Stampede?
A. I think the solicitation came out in December 2010, but our proposals were due in March 2011, or the beginning of spring. That meant that the reviews and awards were made by the end of that summer, which gives you some time to start building your system and everything.
We started building the data centre right away, as soon as we found out we got the award. We spent the end of 2011 and the first part of 2012 building out our data centre, and then started building the cluster.
Q. What are the key moments in the process?
A. When you take delivery of all your nodes and you hook up with the InfiniBand network, you can then start running some apps across the nodes. That was a key moment and that was this past autumn.
We’re in a really crucial phase right now, which we call the early user phase – we have probably 15 research groups that are on the systems right now all doing science and reporting back results. We just got a notification the other day from one user saying the performance was spectacular, so now’s a time for debugging the software stack and identifying any nodes that are sort of straggler nodes, maybe performing slightly slower than expected.
Q. What’s next?
A. [Next we’ll run] two weeks of reliability testing. So we’re really going to hammer the system with lots of jobs and verify that the hardware and software are reliable.
On 7 January, assuming the reliability testing goes well, we’ll put it into production and that means we’ll start accounting for usage. We’ll start decrementing users’ accounts of allocation and for the first two weeks we’ll be measuring the performance as well to make sure the reliability is holding up under the load of a production user community.
We expect to have an acceptance review at the end of January, and we expect NSF [the National Science Foundation] to convene a peer review team that will look at the deployment, the reliability statistics, the initial usage of the first few weeks and make a decision on whether to approve full funding. So an important note is that the system is not formally reviewed yet, but this is standard operating procedure; you get an award to build the system with the expectation of funding being approved.
Q. So how do you pay for everything?
A. You don’t pay for everything until it’s proven to work. It’s not like buying a PC where you pay up front and you get the PC and if it doesn’t work you get your money back. With these very large scale procurements you get an award which gives you the promise of payment pending successful review, but the machine must pass review acceptance tests, and those acceptance tests must be reviewed by a peer review committee convened by the NSF, before the formal authorisation of payment goes through.