Brad Wardell's site for talking about the customization of Windows.
Published on January 14, 2015 By Frogboy In PC Gaming

Unlike previous versions of DirectX, the differences between the new DirectX and previous generations are obvious enough that they can be explained in charts (and maybe someone with some visual design skill can do this).

This article is an extreme oversimplification. If someone wants to send me a chart to put in this article, I’ll update.

Your CPU and your GPU

Since the start of the PC, we have had the CPU and the GPU (or at least, the “video card”).

Up until DirectX 9, the CPU, being 1 core in those days, would talk to the GPU through the “main” thread. 

DirectX 10 improved things a bit by allowing multiple cores to send jobs to the GPU. This was nice but the pipeline to the GPU was still serialized. Thus, you still ended up with 1 CPU core talking to 1 GPU core.

It’s not about getting close to the hardware

Every time I hear someone say “but X allows you to get close to the hardware” I want to shake them.  None of this has to do with getting close to the hardware. It’s all about the cores. Getting “closer” to the hardware is relatively meaningless at this point.  It’s almost as bad as those people who think we should be injecting assembly language into our source code.  We’re way beyond that.

It’s all about the cores

Last fall, Nvidia released the GeForce GTX 970.  It has 5.2 BILLION transistors on it. It already supports DirectX 12. Right now.  It has thousands of cores in it.  And with DirectX 11, I can talk to exactly 1 of them at a time.

Meanwhile, your PC might have 4, 8 or  more CPU cores on it. And exactly 1 of them at a time can talk to the GPU.

Let’s take a pause here. I want you to think about that for a moment.  Think about how limiting that is.  Think about how limiting that has been for game developers. How long has your computer been multi-core?

But DirectX 12? In theory, all your cores can talk to the GPU simultaneously.  Mantle already does this and the results are spectacular.  In fact, most benchmarks that have been talked about have been understated because they seem unbelievable.  I’ve been part of (non-NDA) meetings where we’ve discussed having to low-ball performance gains to “only” 40%.  The reality is that the real-world, non-benchmark results I’ve seen from Mantle (and presumably DirectX 12 when it’s ready) are far beyond this.  The reasons are obvious.

To summarize:

DirectX 11: Your CPU communicates to the GPU 1 core to 1 core at a time.   It is still a big boost over DirectX 9 where only 1 dedicated thread was allowed to talk to the GPU but it’s still only scratching the surface.

DirectX 12: Every core can talk to the GPU at the same time and, depending on the driver, I could theoretically start taking control and talking to all those cores.  

That’s basically the difference. Oversimplified to be sure but it’s why everyone is so excited about this. 

The GPU wars will really take off as each vendor will now be able to come up with some amazing tools to offload work onto GPUs. 

Not just about games

Cloud computing is, ironically, going to be the biggest beneficiary of DirectX 12.  That sounds unintuitive but the fact is, there’s nothing stopping a DirectX 12 enabled machine from fully running VMs on these video cards. Ask your IT manager which they’d rather do? Pop in a new video card or replace the whole box.  Right now, this isn’t doable because cloud services don’t even have video cards in them typically (I’m looking at you Azure. I can’t use you for offloading Metamaps!)

It’s not magic

DirectX 12 won’t make your PC or Xbox One magically faster.

First off, the developer has to write their game so that they’re interacting with the GPU through multiple cores simultaneously. Most games, even today, are still written so that only 1 core is dedicated to interacting with the GPU.

Second, this only benefits you if your game is CPU bound. Most games are. In fact, I’m not sure I’ve ever seen a modern Nvidia card get GPU bound (if anyone can think of an example, please leave it in the comments).

Third, if you’re an Xbox One fan, don’t assume this will give the XBO superiority.  By the time games come out that use this, you can be assured that Sony will have an answer.

Rapid adoption 

There is no doubt in my mind that support for Mantle/DirectX12/xxxx will be rapid because the benefits are both obvious and easy to explain, even to non-technical people.  Giving a presentation on the power of Oxide’s new Nitrous 3D engine is easy thanks to the demos but it’s even easier because it’s obvious why it’s so much more capable than anything out there.

If I am making a game that needs thousands of movie-level CGI elements on today’s hardware, I need to be able to walk a non-technical person through what Nitrous is doing differently. The first game to use it should be announced before GDC and in theory, will be the very first native DirectX 12 and Mantle and xxxx game (i.e. written from scratch for those platforms).

A new way of looking at things: Don’t read this because what is read can’t be unread

DirectX 12/etc. will ruin older movies and game effects a little bit.  It has for me. Let me give you a straightforward example:

Last warning:

Seriously.

Okay. One of the most obvious limitations games have due to the 1 core to 1 core interaction is the number of light sources.  Creating a light source is “expensive” but easily done on today’s hardware.  Creating dozens of light sources on screen at once is basically not doable unless you have Mantle or DirectX 12.  Guess how many light sources most engines support right now? 20? 10? Try 4. Four. Which is fine for a relatively static scene. But it obviously means we’re a long, long way from having true “photo realism”.

So your game might have lots of lasers and explosions and such, but only (at most) a few of them are actually real light sources (and 3 of them are typically reserved for lighting the scene).

As my son likes to say: You may not know that the lights are fake but your brain knows.

image

You’ll never watch this battle the same again.

image

Or this. Wow, those must be magical explosions, they don’t cast shadows… Or maybe it’s a CGI scene.

And once you realize that, you’ll never look at an older CGI movie or a game the same because you’ll see blaster shots and little explosions in a scene and realize they’re not causing shadows or lighting anything in the scene. You subconsciously knew the scene was “fake”. You knew it was filled with CGI but you may not have been able to explain why.  Force lightning or a wizard spell that isn’t casting light or shadows on the scene may not be consciously noticeable but believe me, you’re aware of it (modern CGI fixes this btw but our games are still stuck at a handful).

Why I’ve been covering this

Before DirectX 12, I had never really talked about graphics APIs.  That’s because I found them depressing.  My claim to fame (code-wise) is multithreaded AI programming.  I wrote the first commercial multithreaded game back in the 90s and I’ve been a big advocate of multithreading since.  GalCiv for Windows was the first game to make use of Intel hyperthreading.

Stardock’s games are traditionally famous for good AI. It’s certainly not because I’m a great programmer. It’s because I have always tossed everything from path finding to AI strategy onto threads.  The turn time in say Sorcerer King with > 1000 units running around is typically less than 2 seconds. And those are monsters fighting battles with magical spells and lots of pathfinding.  That’s all because I can throw all this work onto multiple threads that are now on multiple cores.  In essence, I’m cheating.   So next time you’re playing a strategy game where you’re waiting 2 minutes between turns, you know why.
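As a rough illustration of the "toss everything onto threads" idea, here is a minimal sketch in Python. Everything in it (`plan_unit_move`, `run_ai_turn`, the unit dictionaries) is hypothetical and is not taken from any Stardock code; it only shows the shape of fanning independent per-unit jobs out to a worker pool. Note also that in standard CPython, threads do not run Python bytecode in parallel, so this shows the structure rather than the speedup a native engine would get.

```python
from concurrent.futures import ThreadPoolExecutor


def step_toward(current, target):
    """Move one tile toward the target along a single axis."""
    return current + (target > current) - (target < current)


def plan_unit_move(unit):
    """Hypothetical stand-in for per-unit AI work such as pathfinding."""
    x, y = unit["pos"]
    tx, ty = unit["target"]
    return unit["id"], (step_toward(x, tx), step_toward(y, ty))


def run_ai_turn(units, workers=4):
    """Fan the independent per-unit jobs out to a pool and collect the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map keeps results in input order, one (id, move) pair per unit.
        return dict(pool.map(plan_unit_move, units))


# 1000 made-up units, all heading for the same tile.
units = [{"id": i, "pos": (i, 0), "target": (0, 5)} for i in range(1000)]
moves = run_ai_turn(units)
```

The key design point is that each unit's job reads only its own data, so no locking is needed while the jobs run.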

But the graphics side? Depressing.

image

That magical spell is having no effect on the lighting or shadows. You may not notice it consciously but your brain does (DirectX 9). A DirectX 10/11 game would be able to give that spell a point light but as you can see, it’s a stream of light which is a different animal.

 

You don’t need an expert

Assuming you’re remotely technical, the changes from DirectX 11 to DirectX 12/Mantle are obvious enough that you should be able to imagine the benefits.  If before only 1 core could send jobs to your GPU but now you could have all your cores send jobs at the same time, you can imagine what kinds of things can become possible.  Your theoretical improvement in performance is (N-1) × 100%, where N is how many cores you have.  That’s not what you’ll really get. No one writes perfectly parallelized code and no GPU is at 0% saturation.  But you get the idea.
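To put numbers on that, here is a small sketch. The first function is the ideal formula from the paragraph above. The second uses Amdahl's law, which is not mentioned in the article but is the standard way to account for code that is only partly parallel. The 0.8 parallel fraction is an assumed value chosen purely for illustration.

```python
def ideal_improvement_pct(cores):
    """Ideal improvement from the article's formula: (N - 1) x 100%."""
    return (cores - 1) * 100


def amdahl_speedup(cores, parallel_fraction):
    """Amdahl's law: overall speedup when only part of the work runs in parallel."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


for n in (2, 4, 8):
    ideal = ideal_improvement_pct(n)
    # 0.8 is an assumed parallel fraction, not a measured one.
    realistic = (amdahl_speedup(n, 0.8) - 1.0) * 100
    print(f"{n} cores: ideal +{ideal}%, with 80% parallel work about +{realistic:.0f}%")
```

Even with a generous parallel fraction, the gap between the ideal and the achievable figure widens as the core count grows, which is the "that's not what you'll really get" caveat in concrete form.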

GDC

Pay very very close attention to GDC this year. Even if you’re an OpenGL fan.  Nvidia, AMD, Microsoft, Intel and Sony have a unified goal.  Something is about to happen. Something wonderful.


Comments (Page 1)
on Jan 14, 2015

Wow this all sounds amazing.

 

I'm still a little confused on a couple things though, and haven't been able to get a consistent answer from various sources... Will I need a new card to take advantage of this feature? Or will my GTX 760 work after a DirectX and driver upgrade? And will the new version of Windows be required for the DirectX upgrade? 

on Jan 14, 2015

I'm the kind of guy that tends to turn off shadows because they look so bad. Like, when WoW introduced shadows back in the Arthas expansion I thought it was awesome until I realized my massive, light-explosive starfire spells don't do anything to it. Shadows worked better in Skyrim though, because your effects are usually between your viewport and whatever's behind it.

So two big questions:

First, what does this mean for multicore GPUs? And I don't mean "every GPU is multicore", I mean SLI and Crossfire. Are we going to see AMD/Nvidia step up to the plate and build drivers/software with Microsoft to such a point that DX12 doesn't even need to know if you have SLI/Crossfire? Or are devs going to continue to have to make exceptions for those of us with multiple GPUs?

Second, in terms of adoption, is DX12 going to be patched in to Windows 7 or 8? Or is it going to be Windows 10 exclusive? Because you know how people tend to hang on to their OS - heck, most games still don't even dare require a 64-bit CPU. So let's say in two years' time you're looking at maybe 15-20% of gamers using Windows 10 (yes, those numbers were pulled from my ass). If it remains a Win10 exclusive, then DX12 will simply be where Mantle is today - a cool gadget, but games are still shipping with support for DX9 (!!!).

 

on Jan 14, 2015

Very cool getting a developer's take on DX12's changes & how they affect limitations they run into.   Thanks Brad!

 

on Jan 15, 2015

"Cloud computing is, ironically, going to be the biggest beneficiary of DirectX 12.  That sounds unintuitive but the fact is, there’s nothing stopping a DirectX 12 enabled machine from fully running VMs on these video cards. Ask your IT manager which they’d rather do? Pop in a new video card or replace the whole box.  Right now, this isn’t doable because cloud services don’t even have video cards in them typically (I’m looking at you Azure. I can’t use you for offloading Metamaps!)"

 

I would love for someone to elaborate on this. What real world changes and possibilities will be possible in cloud computing as a result of DX 12?

on Jan 15, 2015

You stack a bunch of 8 core, 4 GHz processors up in your server farm and leave yourself with little upgrade potential short of tossing half the components.  A year later when you decide you need more horsepower, instead of junking your servers, or adding to the number of servers, you buy a bunch of GPUs and stick them in the expansion slots.

 

Very little gain can be had without switching out the board and most of its components, so being able to efficiently utilize a graphics card for processing gives you huge upgrade potential.

on Jan 15, 2015

Do you ever do SIMD optimizations (manual vectorization with SSE & AVX)? The internal loop of the code you posted in the GalCiv forums seems a good candidate for it. I'm a software developer myself and I do it quite often when I need to squeeze the maximum out of a CPU core. The results are often impressive, up to a 5x increase in execution speed.

Of course a lot of algorithms cannot be parallelized with SIMD, but some can, and if those take up a large percentage of total time it makes sense to parallelize them.

on Jan 15, 2015

thetrees

"Cloud computing is, ironically, going to be the biggest beneficiary of DirectX 12.  That sounds unintuitive but the fact is, there’s nothing stopping a DirectX 12 enabled machine from fully running VMs on these video cards. Ask your IT manager which they’d rather do? Pop in a new video card or replace the whole box.  Right now, this isn’t doable because cloud services don’t even have video cards in them typically (I’m looking at you Azure. I can’t use you for offloading Metamaps!)"

 

I would love for someone to elaborate on this. What real world changes and possibilities will be possible in cloud computing as a result of DX 12?

It varies depending on the task being done. GPUs are highly specialized processors designed to perform massively parallelized floating point operations orders of magnitude faster than more generalized x86/x64/ARM/etc. chips (which can do everything else that a GPU cannot).

-if- a task can be done using floating point, a GPU can do it amazingly well, but the number of GPU-accelerated applications shows how small that pool is... Buuuut..... let's say that, hypothetically, someone made an alternative mod_SSL Apache module that was able to offload the computationally expensive encryption. It would then be easier to use HTTPS with everything, even on high-traffic web servers.

GPU offloading presents a second set of possible problems, though.
GPUs do two things that are terrible for high-density data centers: they use a large amount of power, and they generate a lot of heat with that power.  Having seen DCO (datacenter operations) techs do hardware replacement tickets with a picture of a charred RAID card (a type of card you can normally -touch- without third-degree burns), and the need to wait for power pole upgrades to plug in a new box, I can say those two things are extremely bad for a datacenter, where a cooling (industrial-strength AC) failure can quickly bring ambient temperatures from 60-some °F up to 100-120 °F within an hour or two.

The idea of turning on the on board gpus or plugging in high end ones in a datacenter makes me cringe for those reasons, but any server farms that do a lot of floating point operations could significantly benefit from it.



 

on Jan 15, 2015

Two questions. First, can't GPUs also do regular processing to help relieve the CPU? I remember reading this. Did I imagine it? Did you actually say that previously the CPUs were not taking advantage of all those GPUs? Both are surprising, considering that the hardware needs to be invented first, meaning that it makes sense for software to catch up to hardware. And yes, I'm talking about Windows. I still believe that you could bypass DirectX and write code that uses the graphics card directly. There are two problems with this. First, it will probably have compatibility issues with different graphics cards, unless they all speak the same language. Second, it would take a lot longer. If someone has already done the work, why do almost the same work again? This is not exactly the same argument as using assembly language, because only C and assembly have binary charts, meaning C is the only compiler that is as efficient as assembly. That is why all these languages that are made to obsolete C can't do their job and obsolete C. Machine language, on the other hand, is more efficient, but both machine and assembly language are harder to program. That is why their programs were smaller.

on Jan 15, 2015

aww http://stackoverflow.com/questions/8524487/kernels-that-run-fast-on-multicores-but-relatively-slow-on-gpu is a pretty good quick summary of what kinds of things can be offloaded. He said it a little more nuanced than "not taking advantage of", but sorta.

 

Writing parallelized multithreaded code is comparatively harder than writing serialized multithreaded code... More importantly, outside of certain enterprise-class environments, it's something that only recently became possible to any significant extent.

Pentium Pros were dual core (among other things), but consumer-grade CPUs only started getting mainstream multicore capabilities in the last 5-10 years.  Early on, low-hanging fruit like splitting Word's spellchecking off into its own thread made good use of that extra core, but with the number of cores really climbing, more focus has gone toward making multithreading easier and outright virtualizing things to take better advantage of those extra cores.

on Jan 15, 2015

Thanks Frogboy I enjoyed reading your article. That was very clear and concise. 

 

on Jan 15, 2015

Only thing that really bugged me with the dx12 demos I have seen is Forza 5 working on a pc... GOD DAMN IT MICROSOFT! ****! 

on Feb 03, 2015

" Your CPU communicates to the GPU 1 core to 1 core at a time."

This is not true. Your CPU communicates with the command buffer (it fills it up), and then GPU cores (as many as it has) can access that command buffer and process commands. If your CPU can fill that buffer fast enough then you are good, if not then your application is CPU bound. There is no 1->1 communication as you say - there is 1->many communication.
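The one-to-many relationship described here can be modeled as a toy producer/consumer program. This is only a sketch of the idea: a real driver and GPU scheduler work very differently, and every name below (`run_frame`, `gpu_core`, the command strings) is hypothetical. One CPU thread fills a shared command buffer, and several worker threads standing in for GPU cores drain it.

```python
import queue
import threading


def run_frame(commands, gpu_workers=4):
    """Toy model: one producer fills a command buffer, many consumers drain it."""
    command_buffer = queue.Queue()
    processed = []
    processed_lock = threading.Lock()

    def gpu_core(core_id):
        while True:
            cmd = command_buffer.get()
            if cmd is None:  # sentinel: no more work for this worker
                break
            with processed_lock:
                processed.append((core_id, cmd))

    workers = [threading.Thread(target=gpu_core, args=(i,)) for i in range(gpu_workers)]
    for worker in workers:
        worker.start()

    # The single CPU thread fills the buffer. If this loop is slower than the
    # consumers, the application is CPU bound in the commenter's sense.
    for cmd in commands:
        command_buffer.put(cmd)
    for _ in workers:
        command_buffer.put(None)  # one sentinel per worker

    for worker in workers:
        worker.join()
    return processed


frame = run_frame([f"draw {i}" for i in range(100)])
print(len(frame), "commands processed")
```

Which worker handles which command varies from run to run, but every command is processed exactly once.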

"None of this has to do with getting close to the hardware. It’s all about the cores."

It's not ALL about the cores. It's also about reducing the CPU overhead that comes from the bookkeeping the API normally needs to do. It exposes internals of the API to the developer, letting him take control of the bookkeeping, which generally means much lower overhead as he knows best what his application needs, instead of letting the API/driver make expensive assumptions. This is important because the overhead I mention is huge, especially when compared to consoles that have very little overhead when communicating with their hardware (huge = order of magnitude). In fact most of the CPU cycles taken up by your API calls are overhead and not the actual commands you are sending, therefore it is important to reduce it. This can mean the difference between 1000 draw calls per frame and 10000 draw calls (on a single core).
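A back-of-the-envelope sketch of why per-call overhead matters so much. The function below assumes CPU overhead per draw call is the only cost in a frame, which is a deliberate simplification, and the two per-call costs in the loop are made-up illustrative values, not measurements of any API or driver.

```python
def max_draw_calls(fps, per_call_overhead_us):
    """Draw calls one core can issue per frame if API overhead were the only cost."""
    frame_budget_us = 1_000_000 / fps  # microseconds available per frame
    return int(frame_budget_us // per_call_overhead_us)


# Both per-call costs are hypothetical, chosen to be an order of magnitude apart.
for label, cost_us in (("heavy API overhead", 16.0), ("thin API overhead", 1.6)):
    calls = max_draw_calls(60, cost_us)
    print(f"{label} ({cost_us} us per call): about {calls} draw calls per frame at 60 fps")
```

Cutting the per-call cost by an order of magnitude raises the draw-call ceiling by the same factor, without touching a second core.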

And you seem to be forgetting or not aware that both DirectX 11 and OpenGL already support command lists which allow you to queue commands from different cores/threads. Therefore DirectX 12 supporting such a thing in itself is not as big of a game changer as you imply. The game changer is that you are now given much better control over how those commands are queued (i.e. getting closer to the hardware).
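The command-list pattern mentioned here can be sketched as follows: several threads each record their own list of commands, and a single thread then submits the finished lists in a fixed order. This is a conceptual model in Python, not the Direct3D or OpenGL API; `record_command_list` and `build_frame` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def record_command_list(chunk):
    """Record draw commands for one slice of the scene (hypothetical helper)."""
    return [f"draw({obj})" for obj in chunk]


def build_frame(objects, threads=4):
    """Record command lists on several threads, then submit them in order."""
    # Split the scene into contiguous chunks, roughly one per recording thread.
    size = max(1, -(-len(objects) // threads))  # ceiling division, at least 1
    chunks = [objects[i:i + size] for i in range(0, len(objects), size)]

    # Recording happens in parallel; pool.map returns results in chunk order.
    with ThreadPoolExecutor(max_workers=threads) as pool:
        command_lists = list(pool.map(record_command_list, chunks))

    # Submission stays on one thread, so the final command order is deterministic.
    submitted = []
    for command_list in command_lists:
        submitted.extend(command_list)
    return submitted


print(build_frame(list(range(10)), threads=4))
```

Recording is the part that scales with cores; submission remains serialized, which is the limitation the commenter says the newer APIs give developers more control over.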

"Cloud computing is, ironically, going to be the biggest beneficiary of DirectX 12. That sounds unintuitive but the fact is, there’s nothing stopping a DirectX 12 enabled machine from fully running VMs on these video cards."

This is simply not possible with the current GPU architecture. Even if it were, the result would be incredibly slow. GPUs are stream processing machines, which means they don't have cache as we know it in a CPU, they don't have out-of-order instruction execution, and they lack branch prediction and many other advanced features. They are simple cores that are meant to run hundreds of homogeneous simple tasks. They rely on hundreds of threads running at once on a single core, switching between work to hide memory latency (compared to a CPU, which uses the cache). Running serial applications like a VM and its programs on one (if it was even possible) would result in abysmal performance.

 

on Apr 19, 2015

Your article may have been very clear and concise but it is also very wrong.  You would do well to tone down what you think you know and begin learning a little bit.

I have been developing a replacement for the DirectX math library, in 100% pure hand-coded assembly.  And I am seeing first hand the horrors therein.  This is not an optimized library.  It is a travesty.  I am consistently seeing speed improvements of 8x to 120x in my version depending on the function.  And yes, those numbers are quite accurate.  It is simply THAT BAD inside C++.

Who in the past 35 years has provided any kind of running comparison?  Everybody "knows" but nobody has put their "knowledge" to the test.  If they did, all the naïve talk about how wonderful C++ is would stop immediately.

Very soon I will be releasing an all-ASM app and its exact counterpart in C++.  You will be able to run both on your own hardware and see the speed difference for yourself.  What made me reply was the very amazing "It’s almost as bad as those people who think we should be injecting assembly language into our source code.  We’re way beyond that."

We're way beyond nothing.  We are thousands of millennia from reaching that point in the first place - where there is no benefit to using ASM.  And we're pedaling backward at light speed.

I am one of those people - not who thinks we should be injecting assembly language into our source code, but one who thinks we should be replacing the crisis that is C++ with real hand-crafted code written by real programmers who understand the hardware and understand their trade.  C++ is nothing but management and macros that ran amok decades ago.  The CPU is still executing native instructions.  Just because they are bundled up into macros with about 1000% management suffocating them does not magically make them faster.

I don't work on theory.  I work on applications.  I will provide you the apps when they are ready.  Then you can tell your hardware that it doesn't know what it's doing when you see how bad the situation truly is.  And when you run these (50/50 that you will even do it, since you will come away looking pretty naïve and childish) keep in mind that the only C++ I have replaced is outside DirectX 12.  What would happen to a game if I were able to get all the way to the core of the OS, DirectX, and the drivers, and clean up everything along the way?

 

on Apr 19, 2015

I remember the ancient days of programming, when REAL programmers wrote in machine code (a huge amount of programmer time to produce the program, but the absolute minimum of computer time), advanced programmers wrote in assembler (a fair bit of programmer time and a bit more computer time), commercial programmers programmed in a high-level compiled language (produced results that did the job with the minimum of programmer time, but used a lot of computer time), and student programmers used interpreted high-level languages (easily changed program, absolute minimum of programmer time BUT absolutely HUGE use of computer time).

Unfortunately the world has been taken over by management and bureaucrats, such that unless non-experts in an area can be shown results in a ridiculously short timeframe, the project/programme will not happen. Then those same management types want it quicker, with more features and fewer bugs, and the same management/marketing types want everything faster (a bigger-numbers-are-better mindset, not a maximum-efficiency mindset): get the MOST results quickly and with minimum effort (management) vs. get the most appropriate results from the inputs (engineering/efficiency).