How cloud is changing complex computational demand

(Image Credit: Resource Database on Unsplash)

The vast majority of cloud computing has been focused on three key goals. The first, which was the initial driver for many and continues to be so, is lowering costs. The second is expanding the capability of the IT stack by making it easy to scale and add capacity. The third is adding new technology to solve new problems and enhance capability.

Once you move away from general-purpose IT, that third goal opens up a vast range of opportunities for organisations. It can give them access to capabilities, such as high-performance computing, that were previously too expensive or too complex to deploy on their existing infrastructure.

For companies involved in engineering, this has significant benefits. Not only is the cost of some engineering capabilities lower, but the ability of the cloud to scale makes it easier and faster to carry out those tasks.

Dr Neil Ashton, Principal Computational Engineering SA at Amazon Web Services (Image Credit: Amazon)

To get an understanding of what this means, Enterprise Times talked with Dr Neil Ashton, Principal Computational Engineering SA at Amazon Web Services. Ashton is an engineer by trade, and his specialism is Computational Fluid Dynamics (CFD). It has a wide range of applications, from designing cars and aircraft to modelling the airflow in a data centre.

Ashton describes his job as, “essentially, I help customers like Formula One to do those workloads like computational fluid dynamics, and to get more performance out of them.”

How has cloud changed CFD?

CFD can be a highly compute-intensive task, making it the province of a small number of organisations. Ashton talked about the benefits of cloud-based CFD for Formula One. He said, “Formula One is a really good example of an industry that wants to extract maximum performance and needs agility. One of the great things about working with F1 is that by moving to AWS, they were able to get the compute resources they needed when they needed it. One of the statistics that we talk about is 60 hours to 12 hours.”

That is a significant reduction for an industry that is seeking to cut costs wherever possible to close the gap between the richer and poorer teams. But apart from reducing the run time by 80%, what else does it mean?

According to Ashton, “Imagine what that means to an engineer. 60 hours is like you click Go on a simulation, and it takes three days. For those three days, you’re not sure whether that design was better or worse. When you go down to 12 hours, essentially the same day, it makes a really big difference to the efficiency of the engineer. I think it’s an impressive number.”

One of the things making that possible is AWS Graviton, custom-built silicon created for a wide range of compute-intensive workloads. The latest version, Graviton3, was announced at AWS re:Invent 2021 and became generally available in 2022.

How is Graviton impacting CFD and HPC?

Depending on the complexity of the CFD task, engineers will use anything from a workstation, for modelling something like a small data centre, up to a High-Performance Computing (HPC) cluster for something like an F1 car. Enterprise Times asked Ashton how cloud and Graviton are changing CFD. Is it scaling based on CPU, on memory, or on storage? All three have very different requirements and very different IO challenges.

Ashton pointed to the announcements in the keynotes at re:Invent. “We now have an HPC-optimised family of Amazon EC2 instances, for exactly that reason: different HPC workloads, even within computational fluid dynamics, have different characteristics. We have a more compute-optimised option where it’s about scaling out the workload to get more compute throughput, essentially more and more cores with more throughput out of them. But you don’t need lots of memory or local disk.

“But there are some bits of the process, or workloads like finite element analysis and more structural modelling, that actually require a lot more memory and local disk. That’s why we launched the Amazon EC2 Hpc6id, which is targeted at that sort of workload. People who don’t need the local disk and the memory can take the Hpc6a.

“If they need more of the memory and the local disk, they can go to the Hpc6id. We also pre-announced a Graviton-3 version for those who need more networking. It gives the choice, because HPC is very much a workload-dependent thing, and so you need to have a range of options to match what they need.”
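The instance choice Ashton describes comes down to matching workload characteristics to an instance family. The sketch below is purely illustrative: the memory cut-off is an assumed figure, the Graviton-based family name is an assumption rather than something Ashton names, and the Hpc6a/Hpc6id split simply mirrors what he says above.

```python
# Illustrative sketch only. The threshold and the Graviton family name below are
# assumptions for demonstration, not AWS sizing guidance; Hpc6a and Hpc6id are
# the families named in the interview.

def pick_hpc_family(mem_gib_per_core: float, needs_local_disk: bool,
                    prefer_graviton: bool = False) -> str:
    """Suggest an EC2 HPC instance family for a CFD or FEA job."""
    if prefer_graviton:
        # The Graviton-3 option was only pre-announced at the time of the
        # interview; "hpc7g" is assumed here as the eventual family name.
        return "hpc7g (assumed)"
    if needs_local_disk or mem_gib_per_core > 4:   # 4 GiB/core is an assumed cut-off
        return "hpc6id"   # memory- and disk-heavy work such as finite element analysis
    return "hpc6a"        # compute-throughput-oriented work such as scale-out CFD


if __name__ == "__main__":
    print(pick_hpc_family(mem_gib_per_core=2.0, needs_local_disk=False))  # hpc6a
    print(pick_hpc_family(mem_gib_per_core=8.0, needs_local_disk=True))   # hpc6id
```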

Shifting to the GPU

A few years back, IBM released OpenCAPI. It was part of its move to make better use of the GPU and bypass the CPU bottleneck. One thing it did was allow the CPU to offload some work to the GPU and let that GPU talk directly to memory for compute-intensive workloads. CFD is a good example of a task that can benefit from that. Where is AWS with allowing its custom silicon to talk directly to memory?

Ashton replied, “One of the interesting trends we’ve seen is the growth of HPC on GPUs. We often talk about the convergence of HPC and machine learning. It’s why the P4d and the P4de, the NVIDIA A100-based platforms, are also a great choice. The choice is the point, because what we see is that there are some codes that have been ported to be able to run on GPUs. They can get this extra memory bandwidth and the performance and the vectorisation that they want.

“But not all codes are ported to GPUs. Some want the more traditional CPU. So we give them that choice as well. I don’t think there’s a single answer to everything. A lot of the time it’s dependent on what the code can support. Which is why we really believe in the breadth of options and then the customer can pick which suits their workload, the best.

“We did some work with a partner out of Bristol comparing the different options with GPU and CPU. He was showing that GPUs really do give a good performance improvement, but what’s interesting is that they also give a price-performance benefit. That’s why a lot of people move to GPUs, because you get a lot more efficiency out of the box, but again, it depends on the code.”

Is there a move towards people optimising code for GPUs?

“If you look at scientific computing, and computational fluid dynamics in particular, it has had decades of research. Some of the codes are very sophisticated, millions of lines of code. To port those onto GPUs actually requires quite a lot of work, which has taken time. Previously, there were also memory constraints on the GPUs.

“Now, the P4d has 40 gig, and the P4de has 80 gig. So now you are not so memory constrained, and you can use them. That’s been one of the big differences, the increase in memory on the GPUs.

“But every new generation of CPU is still bringing extra performance. It depends where they’re at in that lifecycle. If you give the choice, then the customers can make that decision.”
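To see why the jump from 40 gig to 80 gig matters for CFD, a rough back-of-envelope calculation helps. The working-set figure of around 1KB per mesh cell is an assumption for a typical finite-volume solver, so treat the numbers as an order-of-magnitude sketch rather than a vendor specification.

```python
# Back-of-envelope sketch: how large a CFD mesh fits in GPU memory?
# The ~1 KB-per-cell working set is an assumption for a typical finite-volume
# solver (conserved variables, gradients, fluxes, connectivity); real codes vary.

BYTES_PER_CELL = 1_000                      # assumed working set per mesh cell
GPU_MEMORY_GIB = {"A100 40GB": 40, "A100 80GB": 80}

def max_cells(memory_gib: int, bytes_per_cell: int = BYTES_PER_CELL) -> float:
    """Rough upper bound on mesh cells that fit on a single GPU."""
    return memory_gib * 2**30 / bytes_per_cell

for name, gib in GPU_MEMORY_GIB.items():
    print(f"{name}: roughly {max_cells(gib) / 1e6:.0f} million cells per GPU")
```

On those assumptions, doubling the memory roughly doubles the mesh a single GPU can hold, which is why Ashton flags the memory increase as the big change.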

Don’t forget storage

Ashton also talked about the challenge of storage. Some customers want access to fast storage, which plays to the work Amazon has done with its managed parallel file system, Amazon FSx for Lustre. While some code still wants a local disk, FSx for Lustre is a managed service, which brings its own benefits.

Working with local disk means sizing, costing and buying more disk ahead of demand. As a service, FSx for Lustre expands as you need it. Ashton commented, “As a user, you just say how much storage you want, and it will go and scale it for you. It also links with S3. We had a lot of customers who really like the durability and the cost point of S3, but it’s not a POSIX file system.

“What’s nice with Lustre is you have this direct link between S3 and the file system. You can set up data repository tasks, so it’s automatically syncing between them. You can be working on all these files, and it will be syncing them back to S3, so you’ve automatically got them backed up. If you decide to shut down the cluster, the data’s in S3.
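For readers who want to see what that S3 link looks like in practice, here is a minimal sketch using boto3’s FSx API. The file system ID, paths, bucket and region are placeholders, the calls need valid AWS credentials and permissions, and the exact parameters should be checked against the current AWS documentation.

```python
# Hedged sketch of the Lustre-to-S3 link Ashton describes, using boto3's FSx API.
# All IDs, paths, bucket names and the region are placeholders; running this
# would create or modify real AWS resources.
import boto3

fsx = boto3.client("fsx", region_name="eu-west-1")        # placeholder region

# Link a directory on the Lustre file system to an S3 prefix so that new,
# changed and deleted files are kept in sync automatically.
fsx.create_data_repository_association(
    FileSystemId="fs-0123456789abcdef0",                  # placeholder
    FileSystemPath="/scratch/project-a",
    DataRepositoryPath="s3://example-cfd-results/project-a",
    S3={
        "AutoImportPolicy": {"Events": ["NEW", "CHANGED", "DELETED"]},
        "AutoExportPolicy": {"Events": ["NEW", "CHANGED", "DELETED"]},
    },
)

# Or run an explicit export task so results land in S3 before the cluster is
# shut down (with no Paths given, the whole file system is exported).
fsx.create_data_repository_task(
    FileSystemId="fs-0123456789abcdef0",                  # placeholder
    Type="EXPORT_TO_REPOSITORY",
    Report={"Enabled": False},
)
```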

“That’s one of the big differences with AWS with regards to supercomputing. It’s no longer necessary to have a single monolithic large machine. If you think about the way that a lot of end users work, they’re split into groups, by projects and geographies. You might have a team in Europe, a team in the US, or a team in Asia.

“This is enabling you to have your own cluster. If there’s a group in Oxford who wants a cluster, they can create the cluster themselves. But then the team in the US can also have a cluster. You don’t need to have a single big cluster because, for latency reasons, it can never be ideal.”

Changing our view of the supercomputer

How does all of this change our view of the supercomputer? When we break it down into clusters, why is this more than just a distributed computer?

Ashton replied, “We try to cover both sides. When we speak to customers and ask them what sort of jobs they typically want to run, a lot of users are not needing to run to 1,000 cores. They may need to run 500 cores, or 2,000, or 4,000. We have had customers who have scaled up to more than 40,000 cores for a single job. So you can run large MPI jobs. But we see that, most of the time, they are running smaller jobs in distributed ways.
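The “large MPI jobs” Ashton mentions are tightly coupled: every process has to take part in the communication, which is why they need fast networking within a single cluster. A minimal sketch, assuming mpi4py, NumPy and an MPI runtime such as Open MPI are installed, shows the pattern.

```python
# Minimal tightly coupled MPI sketch (assumes mpi4py, NumPy and an MPI runtime).
# Each rank owns a slab of a 1D "domain" and all ranks join a collective
# reduction -- the communication pattern that makes CFD-style jobs tightly coupled.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# Each rank works on its own slice of the domain.
local = np.full(1_000_000 // size, rank + 1.0)
local_sum = local.sum()

# Collective reduction: every rank must participate, so the ranks are coupled.
total = comm.allreduce(local_sum, op=MPI.SUM)

if rank == 0:
    print(f"{size} ranks, global sum = {total:.0f}")
```

Launched with something like mpirun -n 2000 python solver.py, the job only finishes as fast as its slowest rank, which is what separates it from the loosely coupled work discussed below.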

“We are still laser-focused on performance. So the individual cluster still has the compute, the memory bandwidth, the network. If you look at the reality, a lot of them are split up into separate groups. There’s the flexibility, for example, to have different operating systems. If you have your own cluster, you could have Ubuntu, somebody else could have CentOS, somebody else could have whatever they wanted.

“The other one is if your group wants to have a certain set of modules or compilers, you can do that on your cluster. If you have a single big cluster, the system administrators have to get everyone to agree on a compromise. Then there is the latency side. You want the systems to be close to you. The problem that some people had is when that cluster was based in the US, because you can’t get away from the speed of light. It’s the one thing we can’t change.”

Flexible workloads

When we measure supercomputers, we assume a single job, a single workload, and record performance in petaflops and beyond. But when they are in use, they are rarely single-job focused. There are other jobs running on them. How does this new view of a supercomputer based on clusters change that?

Ashton replied, “One of our key focus areas is more tightly coupled workloads like CFD, or things like that. One of the other big uses, and Arm is an example of that, is very large numbers of loosely coupled, single-node jobs. I think they run over 350,000 jobs, so that’s also a supercomputer if you think about it.

“This is, again, redefining the term. If it’s a lot of high-performance compute architecture, infrastructure, storage and orchestration, that’s a supercomputer. That’s what we’re seeing. It’s a very diverse set of workloads, and it’s also the same with CFD. You could do a CFD simulation of a car. That could be a tightly coupled run on 2,000 cores; scale it out, make it run faster. But we also have customers who may want to run many of those, and they’re not linked. So that’s loosely coupled, because each of them is a 2,000-core job that is not linked to the others.

“What customers want to do is both. They want the ability to create a cluster to run their job, but maybe they also want to distribute lots of them. And that’s what the cloud is perfect for. We have Arm or AstraZeneca, who are doing these more distributed jobs. But then we have Formula One or Joby Aviation, who are doing the more traditional tightly coupled work. The fact that you can do both is a unique advantage of AWS versus a more traditional supercomputer.”
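The loosely coupled side of that picture can be expressed as a fan-out of independent jobs. The sketch below, which assumes boto3 plus an existing AWS Batch job queue and a registered job definition (both names are placeholders), submits one array job that expands into 500 unlinked runs of the kind Ashton contrasts with a single 2,000-core MPI job.

```python
# Hedged sketch of the loosely coupled pattern: one AWS Batch array job that
# fans out into many independent, single-node runs. Queue and job definition
# names are placeholders; running this needs AWS credentials and existing
# Batch resources.
import boto3

batch = boto3.client("batch", region_name="eu-west-1")    # placeholder region

response = batch.submit_job(
    jobName="cfd-design-sweep",
    jobQueue="hpc-queue",                    # placeholder job queue
    jobDefinition="cfd-solver:3",            # placeholder job definition
    arrayProperties={"size": 500},           # 500 independent runs, no MPI between them
    containerOverrides={
        "environment": [{"name": "SWEEP_NAME", "value": "front-wing"}],
    },
)
print("Submitted array job:", response["jobId"])
```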

Enterprise Times: What does this mean?

The ability of cloud computing to improve IT is unquestioned. But the majority of its use is for general-purpose computing, or improving our ability to do certain tasks. When it comes to more complex computational tasks, the focus is often on the ability of cloud to scale resources to deal with higher data loads.

Talking with Dr Ashton gives a different perspective on some of the possibilities. The ability to deliver HPC and take on tasks like CFD has an impact on businesses of all sizes. It doesn’t matter whether you are modelling the airflow in a data centre or around an F1 car. Cloud makes it possible to address such complex challenges.

Just as interesting was his take on the use of clusters that deliver supercomputer-like capabilities without the complexity of a supercomputer. It makes cloud a much more flexible option for a whole new variety of tasks. It will also be interesting to see at the next re:Invent what people have done with the new Graviton-3 silicon.
