Q&A: building the Stampede supercomputer
By Tim Danton
Posted on 8 Jan 2013 at 10:30
This week, the 10-petaflop Stampede supercomputer went online after two years in development.
One of the prime uses of supercomputers is to provide resources for the scientific community. In 2011, the US’s National Science Foundation put out its solicitation for a new supercomputer and TACC, the Texas Advanced Computing Center, won the contract.
The result is Stampede, a Dell PowerEdge C8220 cluster packed with Intel Xeon Phi coprocessors. It has more than 6,000 compute nodes, more than 96,000 processing cores, and 205TB of memory, for close to ten petaflops of peak performance.
Before the launch, we spoke to its director, Jay Boisseau, to find out what it takes to create the world’s seventh-fastest computer, and how TACC worked with Dell and Intel to create a computer that would stand out from the rest.
Q. What’s the timeline for building a supercomputer like Stampede?
A. I think the solicitation came out in December 2010, but our proposals were due in March 2011, or the beginning of spring. That meant that the reviews and awards were made by the end of that summer, which gives you some time to start building your system and everything.
We started building the data centre right away, as soon as we found out we got the award. We spent the end of 2011 and the first part of 2012 building out our data centre, and then started building the cluster.
Q. What are the key moments in the process?
A. When you take delivery of all your nodes and you hook up with the InfiniBand network, you can then start running some apps across the nodes. That was a key moment and that was this past autumn.
We’re in a really crucial phase right now, which we call the early user phase – we have probably 15 research groups that are on the systems right now all doing science and reporting back results. We just got a notification the other day from one user saying the performance was spectacular, so now’s a time for debugging the software stack and identifying any nodes that are sort of straggler nodes, maybe performing slightly slower than expected.
Q. What’s next?
A. [Next we’ll run] two weeks of reliability testing. So we’re really going to hammer the system with lots of jobs and verify that the hardware and software are reliable.
On 7 January, assuming the reliability testing goes well, we’ll put it into production and that means we’ll start accounting for usage. We’ll start decrementing users’ accounts of allocation and for the first two weeks we’ll be measuring the performance as well to make sure the reliability is holding up under the load of a production user community.
We expect to have an acceptance review at the end of January, and we expect NSF [the National Science Foundation] to convene a peer review team that will look at the deployment, the reliability statistics, the initial usage of the first few weeks and make a decision on whether to approve full funding. So an important note is that the system is not formally reviewed yet, but this is standard operating procedure; you get an award to build the system with the expectation of funding being approved.
Q. So how do you pay for everything?
A. You don’t pay for everything until it’s proven to work. It’s not like buying a PC where you pay up front and you get the PC and if it doesn’t work you get your money back. With these very large scale procurements you get an award which gives you the promise of payment pending successful review, but the machine must pass review acceptance tests, and those acceptance tests must be reviewed by a peer review committee convened by the NSF, before the formal authorisation of payment goes through.