
It is designed to translate host code (the main, control code) and device code (files with the .cu extension) into object files suitable for assembling the final program or library in any programming environment, for example in NetBeans.

The CUDA architecture is based on a grid memory model, cluster thread modeling, and SIMD instructions. It is applicable not only to high-performance graphics computing but also to various scientific computations using nVidia video cards. Scientists and researchers use CUDA widely in a variety of fields, including astrophysics, computational biology and chemistry, fluid dynamics modeling, electromagnetic interactions, computed tomography, seismic analysis, and more. CUDA can interoperate with applications that use OpenGL and Direct3D. CUDA is cross-platform software for operating systems such as Linux, Mac OS X and Windows.

On March 22, 2010, nVidia released CUDA Toolkit 3.0, which contained support for OpenCL.

Hardware

The CUDA platform first appeared on the market with the release of the eighth-generation NVIDIA G80 chip and has been present in all subsequent series of graphics chips used in the GeForce, Quadro and Tesla accelerator families.

The first series of hardware to support the CUDA SDK, the G8x, used a 32-bit single-precision vector processor with the CUDA SDK as its API (CUDA supports the C double type, but on this hardware its precision is reduced to 32-bit floating point). The later GT200 processors added support for 64-bit precision, but its performance is significantly worse than for 32-bit precision, since there are far fewer double-precision units than scalar processors per streaming multiprocessor. The GPU provides hardware multithreading, which allows all of the GPU's resources to be used. This opens up the prospect of shifting the functions of a physics accelerator onto the graphics accelerator (an example implementation is nVidia PhysX). It also opens up broad possibilities for using a computer's graphics hardware to perform complex non-graphical calculations, for example in computational biology and other branches of science.

Advantages

Compared with the traditional approach of organizing general-purpose computation through the capabilities of graphics APIs, the CUDA architecture has the following advantages in this area:

Restrictions

  • Functions executed on the device do not support recursion (CUDA Toolkit 3.1 adds support for recursion and function pointers) and have some other limitations

Supported GPUs and graphics accelerators

A list of Nvidia devices with declared full support for CUDA technology is provided on the official Nvidia website: CUDA-Enabled GPU Products (in English).

In practice, the following devices on the PC hardware market currently support CUDA technology:

Compute capability (version) / GPUs / Video cards
1.0 G80, G92, G92b, G94, G94b GeForce 8800GTX/Ultra, 9400GT, 9600GT, 9800GT, Tesla C/D/S870, FX4/5600, 360M, GT 420
1.1 G86, G84, G98, G96, G96b, G94, G94b, G92, G92b GeForce 8400GS/GT, 8600GT/GTS, 8800GT/GTS, 9600 GSO, 9800GTX/GX2, GTS 250, GT 120/30/40, FX 4/570, 3/580, 17/18/3700, 4700x2, 1xxM, 32/370M, 3/5/770M, 16/17/27/28/36/37/3800M, NVS420/50
1.2 GT218, GT216, GT215 GeForce 210, GT 220/40, FX380 LP, 1800M, 370/380M, NVS 2/3100M
1.3 GT200, GT200b GeForce GTX 260, GTX 275, GTX 280, GTX 285, GTX 295, Tesla C/M1060, S1070, Quadro CX, FX 3/4/5800
2.0 GF100, GF110 GeForce (GF100) GTX 465, GTX 470, GTX 480, Tesla C2050, C2070, S/M2050/70, Quadro Plex 7000, Quadro 4000, 5000, 6000; GeForce (GF110) GTX 560 Ti 448, GTX 570, GTX 580, GTX 590
2.1 GF104, GF114, GF116, GF108, GF106 GeForce 610M, GT 430, GT 440, GTS 450, GTX 460, GTX 550 Ti, GTX 560, GTX 560 Ti, 500M, Quadro 600, 2000
3.0 GK104, GK106, GK107 GeForce GTX 690, GTX 680, GTX 670, GTX 660 Ti, GTX 660, GTX 650 Ti, GTX 650, GT 640, GeForce GTX 680MX, GTX 680M, GTX 675MX, GTX 670MX, GTX 660M, GT 650M, GT 645M, GT 640M
3.5 GK110
Nvidia GeForce for desktop computers
GeForce GTX 590
GeForce GTX 580
GeForce GTX 570
GeForce GTX 560 Ti
GeForce GTX 560
GeForce GTX 550 Ti
GeForce GTX 520
GeForce GTX 480
GeForce GTX 470
GeForce GTX 465
GeForce GTX 460
GeForce GTS 450
GeForce GTX 295
GeForce GTX 285
GeForce GTX 280
GeForce GTX 275
GeForce GTX 260
GeForce GTS 250
GeForce GT 240
GeForce GT 220
GeForce 210
GeForce GTS 150
GeForce GT 130
GeForce GT 120
GeForce G100
GeForce 9800 GX2
GeForce 9800 GTX+
GeForce 9800 GTX
GeForce 9800 GT
GeForce 9600 GSO
GeForce 9600 GT
GeForce 9500 GT
GeForce 9400 GT
GeForce 9400 mGPU
GeForce 9300 mGPU
GeForce 8800 GTS 512
GeForce 8800 GT
GeForce 8600 GTS
GeForce 8600 GT
GeForce 8500 GT
GeForce 8400GS
Nvidia GeForce for mobile computers
GeForce GTX 580M
GeForce GTX 570M
GeForce GTX 560M
GeForce GT 555M
GeForce GT 540M
GeForce GT 525M
GeForce GT 520M
GeForce GTX 485M
GeForce GTX 480M
GeForce GTX 470M
GeForce GTX 460M
GeForce GT 445M
GeForce GT 435M
GeForce GT 425M
GeForce GT 420M
GeForce GT 415M
GeForce GTX 285M
GeForce GTX 280M
GeForce GTX 260M
GeForce GTS 360M
GeForce GTS 350M
GeForce GTS 160M
GeForce GTS 150M
GeForce GT 335M
GeForce GT 330M
GeForce GT 325M
GeForce GT 240M
GeForce GT 130M
GeForce G210M
GeForce G110M
GeForce G105M
GeForce 310M
GeForce 305M
GeForce 9800M GTX
GeForce 9800M GT
GeForce 9800M GTS
GeForce 9700M GTS
GeForce 9700M GT
GeForce 9650M GS
GeForce 9600M GT
GeForce 9600M GS
GeForce 9500M GS
GeForce 9500M G
GeForce 9300M GS
GeForce 9300M G
GeForce 9200M GS
GeForce 9100M G
GeForce 8800M GTS
GeForce 8700M GT
GeForce 8600M GT
GeForce 8600M GS
GeForce 8400M GT
GeForce 8400M GS
Nvidia Tesla *
Tesla C2050/C2070
Tesla M2050/M2070/M2090
Tesla S2050
Tesla S1070
Tesla M1060
Tesla C1060
Tesla C870
Tesla D870
Tesla S870
Nvidia Quadro for desktop computers
Quadro 6000
Quadro 5000
Quadro 4000
Quadro 2000
Quadro 600
Quadro FX 5800
Quadro FX 5600
Quadro FX 4800
Quadro FX 4700 X2
Quadro FX 4600
Quadro FX 3700
Quadro FX 1700
Quadro FX 570
Quadro FX 470
Quadro FX 380 Low Profile
Quadro FX 370
Quadro FX 370 Low Profile
Quadro CX
Quadro NVS 450
Quadro NVS 420
Quadro NVS 290
Quadro Plex 2100 D4
Quadro Plex 2200 D2
Quadro Plex 2100 S4
Quadro Plex 1000 Model IV
Nvidia Quadro for mobile computing
Quadro 5010M
Quadro 5000M
Quadro 4000M
Quadro 3000M
Quadro 2000M
Quadro 1000M
Quadro FX 3800M
Quadro FX 3700M
Quadro FX 3600M
Quadro FX 2800M
Quadro FX 2700M
Quadro FX 1800M
Quadro FX 1700M
Quadro FX 1600M
Quadro FX 880M
Quadro FX 770M
Quadro FX 570M
Quadro FX 380M
Quadro FX 370M
Quadro FX 360M
Quadro NVS 5100M
Quadro NVS 4200M
Quadro NVS 3100M
Quadro NVS 2100M
Quadro NVS 320M
Quadro NVS 160M
Quadro NVS 150M
Quadro NVS 140M
Quadro NVS 135M
Quadro NVS 130M
  • Models Tesla C1060, Tesla S1070, Tesla C2050/C2070, Tesla M2050/M2070, Tesla S2050 allow GPU calculations with double precision.

Features and Specifications of Various Versions

Feature support (features not listed here are supported for all compute capabilities), by minimum compute capability (version):

  • Integer atomic functions operating on 32-bit words in global memory: 1.1 and higher
  • atomicExch() operating on 32-bit floating-point values in global memory: 1.1 and higher
  • Integer atomic functions operating on 32-bit words in shared memory: 1.2 and higher
  • atomicExch() operating on 32-bit floating-point values in shared memory: 1.2 and higher
  • Integer atomic functions operating on 64-bit words in global memory: 1.2 and higher
  • Warp vote functions: 1.2 and higher
  • Double-precision floating-point operations: 1.3 and higher
  • Atomic functions operating on 64-bit integer values in shared memory: 2.x
  • Floating-point atomic addition operating on 32-bit words in global and shared memory: 2.x
  • __ballot(): 2.x
  • __threadfence_system(): 2.x
  • __syncthreads_count(), __syncthreads_and(), __syncthreads_or(): 2.x
  • Surface functions: 2.x
  • 3D grid of thread blocks: 2.x

Technical specifications, by compute capability (version):

  • Maximum dimensionality of a grid of thread blocks: 2 (1.x), 3 (2.x)
  • Maximum x-, y-, or z-dimension of a grid of thread blocks: 65535
  • Maximum dimensionality of a thread block: 3
  • Maximum x- or y-dimension of a block: 512 (1.x), 1024 (2.x)
  • Maximum z-dimension of a block: 64
  • Maximum number of threads per block: 512 (1.x), 1024 (2.x)
  • Warp size: 32
  • Maximum number of resident blocks per multiprocessor: 8
  • Maximum number of resident warps per multiprocessor: 24 (1.0, 1.1), 32 (1.2, 1.3), 48 (2.x)
  • Maximum number of resident threads per multiprocessor: 768 (1.0, 1.1), 1024 (1.2, 1.3), 1536 (2.x)
  • Number of 32-bit registers per multiprocessor: 8 K (1.0, 1.1), 16 K (1.2, 1.3), 32 K (2.x)
  • Maximum amount of shared memory per multiprocessor: 16 KB (1.x), 48 KB (2.x)
  • Number of shared memory banks: 16 (1.x), 32 (2.x)
  • Amount of local memory per thread: 16 KB (1.x), 512 KB (2.x)
  • Constant memory size: 64 KB
  • Cache working set per multiprocessor for constant memory: 8 KB
  • Cache working set per multiprocessor for texture memory: device dependent, between 6 KB and 8 KB
  • Maximum width for a 1D texture reference bound to a CUDA array: 8192 (1.x), 32768 (2.x)
  • Maximum width for a 1D texture reference bound to linear memory: 2^27
  • Maximum width and number of layers for a 1D layered texture reference: 8192 x 512 (1.x), 16384 x 2048 (2.x)
  • Maximum width and height for a 2D texture reference bound to linear memory or a CUDA array: 65536 x 32768 (1.x), 65536 x 65535 (2.x)
  • Maximum width, height, and number of layers for a 2D layered texture reference: 8192 x 8192 x 512 (1.x), 16384 x 16384 x 2048 (2.x)
  • Maximum width, height, and depth for a 3D texture reference bound to linear memory or a CUDA array: 2048 x 2048 x 2048
  • Maximum number of textures that can be bound to a kernel: 128
  • Maximum width for a 1D surface reference bound to a CUDA array: not supported (1.x), 8192 (2.x)
  • Maximum width and height for a 2D surface reference bound to a CUDA array: 8192 x 8192 (2.x)
  • Maximum number of surfaces that can be bound to a kernel: 8 (2.x)
  • Maximum number of instructions per kernel: 2 million

Example

// Texture reference and kernel: each thread samples one texel and writes it to the output buffer
texture<float, 2> tex;

__global__ void kernel(float* odata, int width, int height)
{
    unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;
    float c = tex2D(tex, x, y);
    odata[y * width + x] = c;
}

// Allocate a CUDA array and copy the image data into it
cudaArray* cu_array;
cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
cudaMallocArray(&cu_array, &desc, width, height);
cudaMemcpyToArray(cu_array, 0, 0, image, width * height * sizeof(float), cudaMemcpyHostToDevice);

// Bind the array to the texture reference
cudaBindTextureToArray(tex, cu_array);

// Run the kernel (d_odata is assumed to be a device buffer allocated with cudaMalloc)
dim3 blockDim(16, 16, 1);
dim3 gridDim(width / blockDim.x, height / blockDim.y, 1);
kernel<<<gridDim, blockDim, 0>>>(d_odata, width, height);
cudaUnbindTexture(tex);

import pycuda.driver as drv
import numpy

drv.init()
dev = drv.Device(0)
ctx = dev.make_context()

mod = drv.SourceModule("""
__global__ void multiply_them(float *dest, float *a, float *b)
{
    const int i = threadIdx.x;
    dest[i] = a[i] * b[i];
}
""")

multiply_them = mod.get_function("multiply_them")

a = numpy.random.randn(400).astype(numpy.float32)
b = numpy.random.randn(400).astype(numpy.float32)
dest = numpy.zeros_like(a)

multiply_them(drv.Out(dest), drv.In(a), drv.In(b), block=(400, 1, 1))

print(dest - a * b)

CUDA as a subject in universities

As of December 2009, the CUDA programming model was being taught in 269 universities around the world. In Russia, training courses on CUDA are given at St. Petersburg Polytechnic University, P. G. Demidov Yaroslavl State University, Moscow, Nizhny Novgorod, St. Petersburg, Tver, Kazan, Novosibirsk, Omsk and Perm State Universities, Novosibirsk State Technical University, the "Dubna" International University of Nature, Society and Man, Ivanovo State Power Engineering University, Belgorod State University, Bauman Moscow State Technical University, the Mendeleev Russian University of Chemical Technology, and the Interregional Supercomputer Center of the Russian Academy of Sciences, among others. In addition, in December 2009 it was announced that the first Russian scientific and educational center "Parallel Computing", located in Dubna, had begun operating; its tasks include training and consulting on solving complex computing problems on GPUs.

In Ukraine, courses on CUDA are taught at the Kiev Institute of System Analysis.

Links

Official resources

  • CUDA Zone (Russian) - official CUDA website
  • CUDA GPU Computing (English) - official web forums dedicated to CUDA computing

Unofficial resources

Tom's Hardware
  • Dmitry Chekanov. nVidia CUDA: computing on a video card or the death of the CPU? . Tom's Hardware (June 22, 2008). Archived
  • Dmitry Chekanov. nVidia CUDA: Benchmarking GPU Applications for the Mass Market. Tom's Hardware (May 19, 2009). Archived from the original on March 4, 2012. Retrieved May 19, 2009.
iXBT.com
  • Alexey Berillo. NVIDIA CUDA - non-graphical computing on GPUs. Part 1 . iXBT.com (September 23, 2008). Archived from the original on March 4, 2012. Retrieved January 20, 2009.
  • Alexey Berillo. NVIDIA CUDA - non-graphical computing on GPUs. Part 2 . iXBT.com (October 22, 2008). - Examples of implementation of NVIDIA CUDA. Archived from the original on March 4, 2012. Retrieved January 20, 2009.
Other resources
  • Boreskov Alexey Viktorovich. CUDA Basics (January 20, 2009). Archived from the original on March 4, 2012. Retrieved January 20, 2009.
  • Vladimir Frolov. Introduction to CUDA technology. Online magazine “Computer Graphics and Multimedia” (December 19, 2008). Archived from the original on March 4, 2012. Retrieved October 28, 2009.
  • Igor Oskolkov. NVIDIA CUDA is an affordable ticket to the world of big computing. Computerra (April 30, 2009). Retrieved May 3, 2009.
  • Vladimir Frolov. Introduction to CUDA Technology (August 1, 2009). Archived from the original on March 4, 2012. Retrieved April 3, 2010.
  • GPGPU.ru. Using video cards for computing
  • Scientific and educational center "Parallel Computing"

Notes

See also

Let's go back to 2003, when Intel and AMD were locked in a race to build the most powerful processor. In just a few years, clock speeds rose significantly as a result of this race, especially after the release of the Intel Pentium 4.

But the race was quickly approaching its limit. After a wave of huge clock speed increases (between 2001 and 2003 the Pentium 4's clock speed doubled from 1.5 to 3 GHz), users had to be content with the tenths of a gigahertz that manufacturers managed to squeeze out (from 2003 to 2005, clock speeds rose only from 3 to 3.8 GHz).

Even architectures optimized for high clock frequencies, such as Prescott, began to run into difficulties, and this time not only manufacturing ones. Chip makers had simply run into the laws of physics. Some analysts even predicted that Moore's Law would cease to apply. But that did not happen. The original meaning of the law is often distorted: it concerns the number of transistors on the surface of the silicon die. For a long time, an increase in the number of transistors in a CPU was accompanied by a corresponding increase in performance, which led to the distortion of its meaning. But then the situation became more complicated. CPU architects ran up against the law of diminishing returns: the number of transistors that had to be added for a given increase in performance grew ever larger, leading to a dead end.



While CPU makers were tearing their hair out trying to solve their problems, GPU makers continued to profit remarkably from Moore's Law.

Why did they not hit the same dead end as the designers of CPU architectures? The reason is very simple: central processors are designed to extract maximum performance from a stream of instructions that process diverse data (both integers and floating-point numbers), perform random memory accesses, and so on. Until now, developers have been trying to provide greater instruction-level parallelism, that is, to execute as many instructions as possible in parallel. With the Pentium, for example, superscalar execution appeared, when under certain conditions two instructions could be executed per clock cycle. The Pentium Pro received out-of-order execution of instructions, which made it possible to make better use of the computing units. The problem is that there are obvious limits to executing a sequential instruction stream in parallel, so blindly increasing the number of computing units provides no benefit, since they will still be idle most of the time.

In contrast, the GPU's job is relatively simple. It consists of taking a group of polygons on one side and generating a group of pixels on the other. Polygons and pixels are independent of each other, so they can be processed in parallel. Thus, in a GPU it is possible to devote a large part of the die to computing units which, unlike those in a CPU, will actually be used.




The GPU differs from the CPU in other ways as well. Memory access in the GPU is highly coherent: if a texel is read, then a few clock cycles later the neighboring texel will be read; when a pixel is written, a few clock cycles later the neighboring one will be written. By organizing memory intelligently, you can achieve performance close to the theoretical bandwidth. This means that the GPU, unlike the CPU, does not need a huge cache, since its role is to speed up texturing operations; a few kilobytes holding the few texels used in bilinear and trilinear filters are enough.




Long live GeForce FX!

The two worlds remained separate for a long time. We used the CPU (or even several CPUs) for office tasks and Internet applications, and GPUs were only good for speeding up rendering. But one feature changed everything: the advent of programmable GPUs. At first, CPUs had nothing to fear. The first so-called programmable GPUs (NV20 and R200) were hardly a threat: the number of instructions in a program remained limited to about 10, and they operated on very exotic data types, such as 9-bit or 12-bit fixed-point numbers.




But Moore's Law again showed its best side. Increasing the number of transistors not only made it possible to increase the number of computing units, but also improved their flexibility. The appearance of the NV30 can be considered a significant step forward for several reasons. Of course, gamers didn't really like the NV30 cards, but the new GPUs began to rely on two features that were designed to change the perception of GPUs as nothing more than graphics accelerators.

  • Support for single-precision floating point calculations (even if it did not comply with the IEEE754 standard);
  • support for more than a thousand instructions.

So we have all the conditions that can attract pioneering researchers who are always looking for additional computing power.

The idea of using graphics accelerators for mathematical calculations is not new. The first attempts were made back in the 1990s. Of course, they were very primitive, limited for the most part to using certain hardware functions, such as rasterization and Z-buffers, to speed up tasks such as pathfinding or constructing Voronoi diagrams.




In 2003, with the advent of more advanced shaders, a new bar was reached: this time, performing matrix calculations. This was the year when an entire section of SIGGRAPH ("Computations on GPUs") was dedicated to the new area of IT. This early initiative was called GPGPU (General-Purpose computation on GPU). An early turning point was the emergence of BrookGPU.

To understand the role of BrookGPU, you need to understand how everything happened before its appearance. The only way to get GPU resources in 2003 was to use one of two graphics APIs - Direct3D or OpenGL. Consequently, developers who wanted GPU capabilities for their computing had to rely on the two mentioned APIs. The problem is that they were not always experts in programming video cards, and this seriously complicated access to technology. If 3D programmers operate with shaders, textures and fragments, then specialists in the field of parallel programming rely on threads, cores, scatters, etc. Therefore, first it was necessary to draw analogies between the two worlds.

  • A stream is a collection of elements of the same type; on the GPU it can be represented by a texture. Its closest analogue in classical programming is an array.
  • A kernel is a function that is applied independently to each element of the stream; it is the equivalent of a pixel shader. In classical programming the closest analogy is a loop applied to a large number of elements.
  • To read the results of applying a kernel to a stream, a texture must be created. There is no equivalent on the CPU, which has full memory access.
  • Control over the memory location to which a write is made (in scatter operations) is done through the vertex shader, since the pixel shader cannot change the coordinates of the pixel being processed.

As you can see, even with the above analogies, the task does not look simple. And that is where Brook came to the rescue. The name refers to a set of extensions to the C language ("C with streams"), as their developers at Stanford called them. In essence, Brook's task was to hide all the components of the 3D API from the programmer, which made it possible to present the GPU as a coprocessor for parallel computing. To do this, the Brook compiler processed a .br file containing C++ code and extensions and then generated C++ code linked against a library with support for different back ends (DirectX, OpenGL ARB, OpenGL NV3x, x86).




Brook deserves credit on several counts, the first of which is bringing GPGPU out of the shadows so that the technology could reach the masses. After the project was announced, a number of IT sites reported somewhat too optimistically that Brook's release cast doubt on the existence of CPUs, which would soon be replaced by more powerful GPUs. But, as we can see, even five years later this has not happened, and honestly, we don't think it ever will. On the other hand, looking at the successful evolution of CPUs, which are increasingly oriented toward parallelism (more cores, SMT multithreading, wider SIMD blocks), and of GPUs, which, on the contrary, are becoming more and more universal (support for single-precision floating-point calculations, integer calculations, and double-precision calculations), it seems that the GPU and the CPU will soon simply merge. What will happen then? Will GPUs be absorbed by CPUs, as happened with math coprocessors? Quite possibly. Intel and AMD are working on similar projects today. But a lot can still change.

But let's return to our topic. Brook's merit was to popularize the GPGPU concept; it significantly simplified access to GPU resources, which allowed more and more users to master the new programming model. On the other hand, despite all of Brook's qualities, there was still a long way to go before GPU resources could be routinely used for computing.

One of the problems is the different levels of abstraction and, in particular, the excessive overhead added by the 3D API, which can be quite noticeable. But a more serious problem was compatibility, about which the Brook developers could do nothing. There is fierce competition between GPU manufacturers, so they frequently optimize their drivers; while such optimizations are mostly good for gamers, they could break Brook's compatibility overnight. It is therefore difficult to imagine using this API in production code that has to keep working, and for a long time Brook remained the preserve of amateur researchers and programmers.

However, Brook's success was enough to attract the attention of ATI and nVidia, who became interested in the initiative, since it could expand the market by opening up an important new sector for both companies.

The researchers initially involved in the Brook project quickly joined the development teams in Santa Clara to present a global strategy for developing the new market. The idea was to create a combination of hardware and software suitable for GPGPU tasks. Since nVidia's developers know all the secrets of their GPUs, they did not have to rely on the graphics API and could communicate with the GPU directly through the driver. Of course, this comes with its own problems. So the CUDA (Compute Unified Device Architecture) development team created a set of software layers for working with the GPU.




As you can see in the diagram, CUDA provides two APIs.

  • High-level API: CUDA Runtime API;
  • low-level API: CUDA Driver API.

Because the high-level API is implemented on top of the low-level API, each Runtime-level function call is broken down into simpler instructions that are processed by the Driver API. Note that the two APIs are mutually exclusive: the programmer can use one or the other, but mixing function calls from both is not possible. In general, the term "high-level API" is relative: even the Runtime API is one that many would consider low-level, yet it still provides functions that are very convenient for initialization or context management. But don't expect too many high-level abstractions - you still need a good understanding of nVidia GPUs and how they work.
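
As a rough illustration of the kind of convenience functions the Runtime API provides (a minimal sketch, not taken from nVidia's documentation), enumerating and selecting a device might look like this:

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int count = 0;
    cudaGetDeviceCount(&count);             // number of CUDA-capable devices in the system
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);  // name, compute capability, multiprocessor count, ...
        printf("Device %d: %s, compute capability %d.%d, %d multiprocessors\n",
               i, prop.name, prop.major, prop.minor, prop.multiProcessorCount);
    }
    cudaSetDevice(0);                       // make device 0 current for subsequent Runtime API calls
    return 0;
}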

The Driver API is even more demanding; it takes more effort to get GPU processing running. On the other hand, the low-level API is more flexible, giving the programmer additional control where needed. Both APIs can work with OpenGL or Direct3D resources (only the ninth version of Direct3D as of today). The benefit of this feature is obvious: CUDA can be used to create resources (geometry, procedural textures, etc.) that are then passed to the graphics API, or, conversely, the 3D API can send its rendering results to the CUDA program, which in turn performs post-processing. There are many examples of such interactions, and the advantage is that the resources stay in GPU memory and do not need to be transferred across the PCI Express bus, which still remains a bottleneck.

However, it should be noted that sharing resources in video memory is not always ideal and can lead to some headaches. For example, when changing resolution or color depth, graphic data takes priority. Therefore, if you need to increase the resources in the frame buffer, the driver will easily do this at the expense of the resources of CUDA applications, which will simply crash with an error. Of course, not very elegant, but such a situation should not happen very often. And since we started talking about disadvantages: if you want to use multiple GPUs for CUDA applications, then you need to disable SLI mode first, otherwise CUDA applications will only be able to “see” one GPU.

Finally, the third software level is dedicated to libraries - two, to be precise.

  • CUBLAS, which provides the building blocks required for linear algebra computations on the GPU;
  • CUFFT, which supports the calculation of Fourier transforms - an algorithm widely used in the field of signal processing.

Before we dive into CUDA, let us define a number of terms scattered throughout nVidia's documentation. The company has chosen very specific terminology that takes some getting used to. First of all, note that a thread in CUDA does not have the same meaning as a CPU thread, nor is it the equivalent of a thread in our GPU articles. A GPU thread in this case is the basic unit of data to be processed. Unlike CPU threads, CUDA threads are very "lightweight": context switching between two threads is not a resource-intensive operation.

The second term often found in the CUDA documentation is warp. There is no risk of confusion here, since there is no direct analogue in Russian (unless you are a fan of Star Trek or of Warhammer). The term actually comes from the textile industry, where the weft yarn is drawn through the warp yarns stretched on a loom. A warp in CUDA is a group of 32 threads and is the minimum amount of data processed in SIMD fashion by a CUDA multiprocessor.
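
In device code this granularity is visible directly: a thread's position within its warp can be computed from its index (a small illustrative sketch; warpSize is the built-in constant, equal to 32 on this hardware):

__global__ void warp_info(int* warp_of_thread, int* lane_of_thread)
{
    int gid  = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    int lane = threadIdx.x % warpSize;                 // position inside the warp (0..31)
    int warp = threadIdx.x / warpSize;                 // which warp of its block this thread belongs to
    warp_of_thread[gid] = warp;
    lane_of_thread[gid] = lane;
}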

But such granularity is not always convenient for the programmer. Therefore, in CUDA, instead of working with warps directly, you work with blocks containing from 64 to 512 threads.

Finally, these blocks are combined into a grid. The advantage of this grouping is that the number of blocks processed simultaneously by the GPU is closely tied to hardware resources, as we will see below. Grouping blocks into a grid allows you to abstract away from this limitation completely and apply the kernel to many threads in a single call without worrying about fixed resources; the CUDA libraries take care of all this. This model also scales well: if the GPU has few resources, it executes the blocks sequentially; if the number of compute processors is large, the blocks can be executed in parallel. That is, the same code can run on entry-level GPUs, on top-end models, and even on future ones.
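
A minimal sketch of that abstraction (hypothetical kernel and names): the same kernel covers an arbitrary number of elements, and the host simply launches enough blocks to cover them, however many multiprocessors the GPU happens to have:

__global__ void scale(float* data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global element index
    if (i < n)                                      // the last block may be only partially filled
        data[i] *= factor;
}

// Host side: launch enough blocks to cover all n elements
int threads = 256;
int blocks  = (n + threads - 1) / threads;
scale<<<blocks, threads>>>(d_data, 2.0f, n);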

There are a couple more terms in the CUDA API that denote the CPU (host) and the GPU (device). If this little introduction hasn't scared you off, then it's time to take a closer look at CUDA.

If you regularly read Tom's Hardware Guide, then the architecture of the latest GPUs from nVidia is familiar to you. If not, we recommend reading the article "nVidia GeForce GTX 260 and 280: a new generation of video cards". When it comes to CUDA, nVidia presents the architecture a little differently, revealing some details that were previously hidden.

As you can see from the illustration above, the nVidia shader core consists of several texture processor clusters (Texture Processor Cluster, TPC). The 8800 GTX, for example, used eight clusters, the 8800 GTS six, and so on. Each cluster essentially consists of a texture unit and two streaming multiprocessors. The latter include a pipeline front end, which reads and decodes instructions and dispatches them for execution, and a pipeline back end, consisting of eight compute units and two super function units (SFU), where instructions are executed according to the SIMD principle: one instruction is applied to all threads in a warp. nVidia calls this method of execution SIMT (single instruction, multiple threads). It is important to note that the back end runs at twice the frequency of the front end. In practice this means that this part looks twice as "wide" as it actually is (that is, like a 16-lane SIMD block instead of an eight-lane one). The streaming multiprocessors work as follows: each clock cycle, the front end selects a warp that is ready for execution and starts executing an instruction. For the instruction to apply to all 32 threads of the warp, the back end needs four clock cycles, but since it runs at twice the frequency of the front end, only two front-end cycles are required. Therefore, so that the front end does not sit idle for a cycle and the hardware stays as busy as possible, in the ideal case instructions can be alternated every cycle: a classic instruction on one cycle and an SFU instruction on the next.

Each multiprocessor also has a specific set of resources worth understanding. There is a small memory area called shared memory, 16 KB per multiprocessor. This is not a cache at all: the programmer can use it at his own discretion. That is, we have something close to the Local Store of the SPUs in Cell processors. This detail is interesting because it emphasizes that CUDA is a combination of software and hardware technologies. This memory area is not used for pixel shaders, which nVidia wryly explains with "we don't like pixels talking to each other."

This memory area opens up the possibility of exchanging information between threads within one block. It is important to emphasize the restriction: all threads of a block are guaranteed to be executed by a single multiprocessor. In contrast, the assignment of blocks to multiprocessors is not specified at all, and two threads from different blocks cannot exchange information with each other during execution. Using shared memory effectively is therefore not so simple. Still, shared memory is justified in all cases except when several threads try to access the same memory bank, causing a conflict; in other situations, access to shared memory is as fast as access to registers.
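
A minimal sketch of what this looks like in practice (hypothetical kernel; it assumes a block size of 256 threads): the threads of one block stage data in shared memory, synchronize, and then read their neighbors' values:

__global__ void neighbor_sum(const float* in, float* out, int n)
{
    __shared__ float tile[256];                 // one element per thread of the block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        tile[threadIdx.x] = in[i];              // each thread stages one value
    __syncthreads();                            // wait until the whole block has written its data
    if (i < n && threadIdx.x > 0 && threadIdx.x < blockDim.x - 1)
        out[i] = tile[threadIdx.x - 1] + tile[threadIdx.x + 1];
}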

Shared memory is not the only memory the multiprocessors can access. They can also use video memory, but it has lower bandwidth and higher latency. Therefore, to reduce the frequency of accesses to this memory, nVidia equipped the multiprocessors with a cache (approximately 8 KB per multiprocessor) that stores constants and textures.

The multiprocessor has 8,192 registers that are common to all threads of all blocks active on the multiprocessor. The number of active blocks per multiprocessor cannot exceed eight, and the number of active warps is limited to 24 (768 threads). Therefore, the 8800 GTX can handle up to 12,288 threads at a time. All of these constraints are worth mentioning because they allow the algorithm to be optimized based on available resources.

Optimizing a CUDA program thus comes down to finding the optimal balance between the number of blocks and their size. More threads per block helps hide memory latency, but it also reduces the number of registers available per thread. Moreover, a block of 512 threads would be inefficient, because only one such block can be active on a multiprocessor, wasting 256 potential threads. Therefore, nVidia recommends blocks of 128 or 256 threads, which gives the best compromise between latency hiding and register count for most kernels.

From a software perspective, CUDA consists of a set of extensions to the C language, reminiscent of BrookGPU, plus several specific API calls. Among the extensions are type qualifiers applied to functions and variables. The key keyword is __global__, which, placed before a function, indicates that the function is a kernel: it will be called by the CPU and executed on the GPU. The __device__ qualifier indicates that a function will be executed on the GPU (which, by the way, is what CUDA calls the "device") and can be called only from the GPU (in other words, from another __device__ function or from a __global__ function). Finally, the __host__ qualifier is optional; it denotes a function that is called by the CPU and executed by the CPU, in other words a regular function.
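
A short sketch (hypothetical functions) showing the three qualifiers side by side:

__device__ float square(float x)        // runs on the GPU, callable only from GPU code
{
    return x * x;
}

__global__ void squares(float* out, const float* in)   // a kernel: called by the CPU, executed on the GPU
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = square(in[i]);
}

__host__ float square_on_cpu(float x)   // optional qualifier: an ordinary CPU function
{
    return x * x;
}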

There are a number of restrictions associated with __device__ and __global__ functions: they cannot be recursive (that is, call themselves), and they cannot have a variable number of arguments. Also, since __device__ functions reside in GPU memory space, it is logical that their address cannot be taken. Variables also have qualifiers indicating the memory area in which they will be stored: a variable with the __shared__ prefix is stored in the shared memory of the streaming multiprocessor. The call of a __global__ function is also slightly different: when calling it, you have to set the execution configuration, more specifically the size of the grid to which the kernel will be applied and the size of each block. Take, for example, a kernel with the following signature.

__global__ void Func(float* parameter);

It will be called as

Func<<< Dg, Db >>>(parameter);

where Dg is the grid size and Db is the block size. These two variables are of dim3, a new vector type introduced with CUDA.

The CUDA API contains functions for working with memory in VRAM: cudaMalloc for allocating memory, cudaFree for freeing it, and cudaMemcpy for copying memory between RAM and VRAM and vice versa.
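
A hedged end-to-end sketch of that workflow (it reuses the hypothetical squares kernel from the earlier sketch, assumes h_in has been filled with data, and leaves out error checking):

int n = 1024;
size_t bytes = n * sizeof(float);
float* h_in  = (float*)malloc(bytes);   // host (RAM) buffers
float* h_out = (float*)malloc(bytes);
float *d_in, *d_out;

cudaMalloc((void**)&d_in,  bytes);                        // allocate VRAM
cudaMalloc((void**)&d_out, bytes);
cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);    // RAM -> VRAM

squares<<<n / 256, 256>>>(d_out, d_in);                   // kernel from the earlier sketch

cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);  // VRAM -> RAM
cudaFree(d_in);                                           // release VRAM
cudaFree(d_out);
free(h_in);
free(h_out);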

We'll finish this overview with a very interesting aspect of how a CUDA program is compiled: compilation is performed in several stages. First, the CPU-specific code is extracted and passed to the standard compiler. Code destined for the GPU is first converted into the PTX intermediate language, which resembles assembly language and allows you to examine the code for potential inefficiencies. Finally, the last phase translates the intermediate language into GPU-specific instructions and creates the binary.

After looking through nVidia's documentation, I was tempted to try CUDA that very week. Really, what better way to evaluate an API than to write your own program with it? That is when most problems should surface, even if everything looks perfect on paper. Practice is also the best way to show how well you have understood the principles laid out in the CUDA documentation.

It’s quite easy to get involved in a project like this. Today there are a large number of free but high-quality tools available for download. For our test we used Visual C++ Express 2005, which has everything we need. The hardest part was finding a program that wouldn't take weeks to port to the GPU, but would also be interesting enough that our efforts wouldn't be in vain. In the end, we chose a piece of code that takes a height map and calculates a corresponding normal map. We will not go into detail about this function, since it is hardly interesting in this article. To be brief, the program deals with the curvature of areas: for each pixel of the initial image, we superimpose a matrix that determines the color of the resulting pixel in the generated image from adjacent pixels, using a more or less complex formula. The advantage of this function is that it is very easy to parallelize, so this test perfectly shows the capabilities of CUDA.
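
The article's actual source code is not reproduced here, but a kernel of this general shape might look roughly like the following hypothetical sketch (the height map is assumed to be a plain float array, and the normal is derived from finite differences between neighboring pixels rather than the article's more elaborate formula):

__global__ void normal_map(const float* heights, float3* normals, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x <= 0 || y <= 0 || x >= width - 1 || y >= height - 1)
        return;                                            // skip the image border

    // Differences between neighboring height values approximate the surface slope
    float dx = heights[y * width + (x + 1)] - heights[y * width + (x - 1)];
    float dy = heights[(y + 1) * width + x] - heights[(y - 1) * width + x];

    float3 n = make_float3(-dx, -dy, 2.0f);
    float len = sqrtf(n.x * n.x + n.y * n.y + n.z * n.z);  // normalize the vector
    normals[y * width + x] = make_float3(n.x / len, n.y / len, n.z / len);
}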


Another advantage is that we already have a CPU implementation, so we can compare the result with the CUDA version - without reinventing the wheel.

Let us repeat once again that the purpose of the test was to get acquainted with the CUDA SDK utilities, and not to compare the versions for CPU and GPU. Since this was our first attempt at creating a CUDA program, we weren't really expecting high performance. Since this part of the code is not critical, the version for the CPU was not optimized, so a direct comparison of the results is hardly interesting.

Performance

However, we measured the execution time to see whether there is an advantage to using CUDA even with the crudest implementation, or whether long and tedious practice would be needed to get any benefit from the GPU. The test machine was taken from our development lab: a laptop with a Core 2 Duo T5450 processor and a GeForce 8600M GT video card, running Vista. This is far from a supercomputer, but the results are interesting precisely because the test is not "tailored" for the GPU. It is always nice to see nVidia show huge gains on systems with monster GPUs and plenty of bandwidth, but in practice many of the 70 million CUDA-capable GPUs on the PC market today are nowhere near as powerful, which is why our test has its place.

For an image of 2048 x 2048 pixels, we got the following results.

  • CPU 1 thread: 1,419 ms;
  • CPU 2 threads: 749 ms;
  • CPU 4 threads: 593 ms
  • GPU (8600M GT) blocks of 256 threads: 109 ms;
  • GPU (8600M GT) blocks of 128 threads: 94 ms;
  • GPU (8800 GTX) blocks of 128 threads/256 threads: 31 ms.

Several conclusions can be drawn from the results. Let's start with the fact that, despite the talk about the obvious laziness of programmers, we did modify the initial CPU version to use multiple threads. As we already mentioned, the code is ideal for this situation: all that is required is to split the initial image into as many zones as there are threads. Note that going from one thread to two on our dual-core CPU gave an almost linear speedup, which also indicates the parallel nature of the test program. Quite unexpectedly, the four-thread version also turned out to be faster, although on our processor this is very strange; one would rather expect a drop in efficiency due to the overhead of managing the additional threads. How can this result be explained? It is hard to say; the Windows thread scheduler may be to blame, but in any case the result was repeatable. With smaller textures (512x512), the gain from threading was not as pronounced (about 35% versus 100%), and the behavior of the four-thread version was more logical, with no increase compared to the two-thread version. The GPU was still faster, but not as noticeably (the 8600M GT was three times faster than the two-thread version).




The second significant observation is that even the slowest GPU implementation was almost six times faster than the best-performing CPU version. For a first program and an unoptimized version of the algorithm, that is very encouraging. Note also that we got a noticeably better result with smaller blocks, although intuition might suggest otherwise. The explanation is simple: our program uses 14 registers per thread, so with 256-thread blocks each block needs 3,584 registers, and 768 threads are required to fully load the multiprocessor, as we showed earlier. In our case that amounts to three blocks, or 10,752 registers. But a multiprocessor has only 8,192 registers, so it can keep only two such blocks active. In contrast, with blocks of 128 threads we need 1,792 registers per block; dividing 8,192 by 1,792 and rounding down gives four blocks. In practice the number of resident threads is the same (512 per multiprocessor, although theoretically 768 are needed for a full load), but the larger number of blocks gives the GPU more flexibility in memory access: when an operation with long latency is in flight, it can start executing instructions from another block while waiting for the results. Four blocks clearly reduce the impact of latency, especially since our program makes multiple memory accesses.
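
For clarity, the same back-of-the-envelope arithmetic written out as code (the numbers are the ones quoted above, for this G8x-class multiprocessor):

// Register-limited blocks per multiprocessor, using the figures from the text
const int regs_per_sm     = 8192;  // registers available on one multiprocessor
const int regs_per_thread = 14;    // what our kernel happens to use

int regs_per_block_256 = 256 * regs_per_thread;    // 3,584 registers per 256-thread block
int blocks_256 = regs_per_sm / regs_per_block_256; // 2 blocks -> 512 resident threads

int regs_per_block_128 = 128 * regs_per_thread;    // 1,792 registers per 128-thread block
int blocks_128 = regs_per_sm / regs_per_block_128; // 4 blocks -> still 512 resident threads,
                                                   // but more blocks to switch between while waiting on memory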

Analysis

Finally, despite what we said above, we couldn't resist the temptation and ran the program on an 8800 GTX, which turned out to be three times faster than the 8600, regardless of block size. You might think that in practice, given the respective architectures, the result would be four or more times higher: 128 ALUs/shader processors versus 32, and a higher clock speed (1.35 GHz versus 950 MHz), but it did not work out that way. Most likely, memory access was the limiting factor. To be more precise, the initial image is accessed as a multidimensional CUDA array, a rather grand term for what is nothing more than a texture. But there are several advantages:

  • accesses benefit from the texture cache;
  • we use wrapping mode, in which there is no need to process image boundaries, unlike the CPU version.

In addition, we could benefit from "free" filtering with normalized addressing (coordinates in the range [0, 1] instead of integer pixel positions), but in our case this is unlikely to be useful. As you know, the 8600 has 16 texture units, compared to 32 for the 8800 GTX, so the ratio between the two architectures is only two to one. Add the difference in frequencies and we get (32 x 0.575) / (16 x 0.475) = 2.4, close to the "three to one" ratio we actually observed. This theory also explains why the block size changes little on the G80: the ALUs are still limited by the texture units.




Beyond the promising results, our first acquaintance with CUDA went very well, considering that the conditions chosen were not the most favorable. Developing on a Vista laptop meant using CUDA SDK 2.0, still in beta, with the 174.55 driver, also a beta. Despite this, we have no unpleasant surprises to report, only the initial errors during the first debugging sessions, when our still very buggy program tried to address memory outside the allocated space.

The monitor started flickering wildly, then the screen went black... until Vista ran the driver recovery service and everything was fine again. It is still somewhat surprising to see this if you are used to the typical Segmentation Fault of ordinary programs like ours. Finally, a small criticism of nVidia: in all the documentation available for CUDA there is no short guide that explains step by step how to set up a development environment for Visual Studio. This is not a big problem, since the SDK contains a full set of examples you can study to understand the framework of CUDA applications, but a guide for beginners would be helpful.




Nvidia introduced CUDA with the release of the GeForce 8800. And at the time the promises seemed very tempting, but we held back our enthusiasm until we actually tested it. Indeed, at the time it seemed more like marking the territory to stay on the GPGPU wave. Without an available SDK, it’s hard to say that this isn’t just another marketing dud that won’t work. This is not the first time that a good initiative has been announced too early and failed to see the light of day at the time due to lack of support - especially in such a competitive sector. Now, a year and a half after the announcement, we can confidently say that nVidia kept its word.

The SDK quickly appeared in beta in early 2007, and since then it has been quickly updated, which proves the importance of this project for nVidia. Today, CUDA is developing very nicely: the SDK is already available in beta version 2.0 for major operating systems (Windows XP and Vista, Linux, as well as 1.1 for Mac OS X), and nVidia has dedicated an entire section of the site for developers.

On a more professional level, the impression from the first steps with CUDA turned out to be very positive. Even if you are familiar with GPU architecture, you can easily figure it out. When an API looks clear at first glance, you immediately begin to believe that you will get convincing results. But won't computing time be lost from numerous transfers from the CPU to the GPU? And how to use these thousands of threads with virtually no synchronization primitive? We started our experiments with all these concerns in mind. But they quickly dissipated when the first version of our algorithm, albeit very trivial, turned out to be significantly faster than on the CPU.

So CUDA is not a magic wand for researchers who want to convince university management to buy them a GeForce. CUDA is already a completely accessible technology that can be used by any programmer with knowledge of C, as long as they are willing to put in the time and effort to get used to the new programming paradigm. This effort will not be wasted if your algorithms are well parallelized. We would also like to thank nVidia for providing complete and high-quality documentation where novice CUDA programmers will find answers.

What does CUDA need to become a recognized API? In one word: portability. We know the future of IT lies in parallel computing; everyone is already preparing for this change, and all initiatives, both software and hardware, point in that direction. At the moment, however, in terms of development paradigms we are still at the initial stage: we create threads manually and try to plan access to shared resources by hand. All this can still be managed while the number of cores can be counted on the fingers of one hand, but in a few years, when processors number in the hundreds, it will no longer be possible. With the release of CUDA, nVidia has taken the first step toward solving this problem - but, of course, this solution is suitable only for this company's GPUs, and even then not all of them. Only the GeForce 8 and 9 series (and their Quadro/Tesla derivatives) can run CUDA programs today, along with the new GTX 260/280 line, of course.




nVidia may boast that it has sold 70 million CUDA-compatible GPUs worldwide, but that is still not enough to become the de facto standard, especially since competitors are not sitting idly by: AMD offers its own SDK (Stream Computing), and Intel has announced a solution (Ct), although it is not yet available. A standards war is coming, and there will clearly not be room in the market for three competitors unless another player, such as Microsoft, comes out with a common API, which would certainly make life easier for developers.

nVidia therefore faces many difficulties in getting CUDA adopted. Although technologically it is undoubtedly a successful solution, it still remains to convince developers of its prospects, and that will not be easy. However, judging by the many recent announcements and news items around the API, the future does not look bleak.

DirectX – a set of low-level programming interfaces (APIs) for creating games and other high-performance multimedia applications. Includes high-performance support for 2D and 3D graphics, sound, and input devices.

Direct3D (D3D) – an interface for displaying three-dimensional primitives (geometric bodies). Part of DirectX.

OpenGL (from English Open Graphics Library, literally "open graphics library") – a specification defining a language-independent, cross-platform programming interface for writing applications that use two-dimensional and three-dimensional computer graphics. Includes over 250 functions for drawing complex 3D scenes from simple primitives. Used in creating video games, virtual reality, and visualization in scientific research. On the Windows platform it competes with Direct3D/DirectX.

OpenCL (from English Open Computing Language, literally "open computing language") – a framework for writing computer programs involving parallel computation on various graphics processors (GPU) and central processors (CPU). The OpenCL framework includes a programming language and an application programming interface (API). OpenCL provides parallelism at the instruction level and at the data level and is an implementation of the GPGPU technique.

GPGPU (abbreviated from English General-Purpose Graphics Processing Units, literally "general-purpose GPUs") – a technique of using a graphics processing unit (GPU), or video card, for general-purpose computations that are usually performed by the CPU.

Shader – a program used in three-dimensional graphics to determine the final parameters of an object or image. It typically includes arbitrarily complex descriptions of light absorption and scattering, texture mapping, reflection and refraction, shading, surface displacement, and post-processing effects. Complex surfaces can be rendered using simple geometric shapes.

Rendering – visualization; in computer graphics, the process of obtaining an image from a model using software.

SDK (abbreviated from English Software Development Kit) – a set of software development tools.

CPU (abbreviated from English Central Processing Unit, literally "central processing device") – the central processor (microprocessor); the device that executes machine instructions; the part of the hardware responsible for performing computational operations (specified by the operating system and application software) and for coordinating the operation of all devices.

GPU (abbreviated from English Graphics Processing Unit, literally "graphics processing device") – the graphics processor; a separate device in a personal computer or game console that performs graphics rendering (visualization). Modern GPUs are very efficient at processing and displaying computer graphics realistically. The graphics processor in modern video adapters is used as a 3D graphics accelerator, but in some cases it can also be used for computation (GPGPU).

CPU problems

For a long time, the performance of traditional CPUs grew mainly through a steady increase in clock frequency (about 80% of performance was determined by clock frequency) together with an increase in the number of transistors on a single chip. However, further increases in clock frequency (above roughly 3.8 GHz the chips simply overheat!) run into a number of fundamental physical barriers, since the process technology has come very close to atomic dimensions (the size of a silicon atom is approximately 0.543 nm):

First, as transistor dimensions shrink and the clock frequency rises, the leakage current of the transistors grows. This leads to higher power consumption and greater heat dissipation;

Second, the benefits of higher clock speeds are partially negated by memory access latency, as memory access times do not keep up with increasing clock speeds;

Third, for some applications, traditional serial architectures become inefficient as clock speeds increase due to the so-called “von Neumann bottleneck,” a performance limitation resulting from sequential computation flow. At the same time, the resistive-capacitive signal transmission delays increase, which is an additional bottleneck associated with an increase in the clock frequency.

GPU development

In parallel with this, GPUs were (and still are!) developing:

November 2008 – Intel introduced the 4-core Intel Core i7 line, based on the new-generation Nehalem microarchitecture. The processors operate at clock frequencies of 2.6-3.2 GHz and are made using a 45-nm process technology.

December 2008 – deliveries of the 4-core AMD Phenom II 940 (codename Deneb) began. It operates at 3 GHz and is produced using a 45-nm process technology.

May 2009 – AMD introduced a version of the ATI Radeon HD 4890 GPU with the core clock increased from 850 MHz to 1 GHz. This was the first graphics processor to run at 1 GHz. The chip's computing power, thanks to the frequency increase, grew from 1.36 to 1.6 teraflops. The processor contains 800 (!) compute cores and supports GDDR5 video memory, DirectX 10.1, ATI CrossFireX, and all the other technologies found in modern video cards. The chip is manufactured using 55-nm technology.

Main differences of the GPU

The distinctive features of the GPU (compared with the CPU) are:

– an architecture maximally aimed at increasing the speed of calculation of textures and complex graphic objects;

– the peak computing power of a typical GPU is much higher than that of a CPU;

– thanks to a specialized pipeline architecture, the GPU is much more efficient at processing graphical information than the CPU.

"Crisis of the genre"

The "crisis of the genre" for CPUs had matured by 2005. Despite advances in technology, the growth in performance of conventional CPUs slowed noticeably, while GPU performance continued to grow. Thus, by 2003 a revolutionary idea had crystallized: use the computing power of the graphics card for your own needs. GPUs came to be used more and more for "non-graphical" computing (physics simulation, signal processing, computational mathematics and geometry, database operations, computational biology, computational economics, computer vision, etc.).

The main problem was that there was no standard interface for programming the GPU. Developers used OpenGL or Direct3D, but this was not very convenient. NVIDIA Corporation (one of the largest manufacturers of graphics, media and communications processors, as well as wireless media processors; founded in 1993) began developing a unified and convenient standard, and introduced the CUDA technology.

How it started

2006 – NVIDIA demonstrates CUDA™; the beginning of the GPU computing revolution.

2007 – NVIDIA releases the CUDA architecture (the original version of the CUDA SDK was released on February 15, 2007); the technology receives a "Best New Product" nomination from Popular Science magazine and a "Readers' Choice" award from HPCWire.

2008 – NVIDIA CUDA technology wins in the "Technical Excellence" category from PC Magazine.

What is CUDA

CUDA (abbreviated from English Compute Unified Device Architecture, literally "unified device computing architecture") – an architecture (a set of software and hardware) that allows general-purpose computations to be performed on the GPU, with the GPU effectively acting as a powerful coprocessor.

NVIDIA CUDA™ technology is the only C-language development environment that allows developers to create software that solves complex computing problems in less time by using the processing power of GPUs. Millions of CUDA-capable GPUs are already at work around the world, and thousands of programmers already use the (free!) CUDA tools to accelerate applications and solve the most complex, resource-intensive tasks, from video and audio encoding to oil and gas exploration, product modeling, medical imaging, and scientific research.

CUDA gives the developer the ability to organize, at his own discretion, access to the instruction set of the graphics accelerator, to manage its memory, and to organize complex parallel computations on it. A CUDA-capable graphics accelerator becomes a powerful programmable open architecture, similar to today's CPUs. All this gives the developer low-level, distributed, and high-speed access to the hardware, making CUDA a necessary foundation for building serious high-level tools such as compilers, debuggers, mathematical libraries, and software platforms.

Uralsky, a leading technology specialist at NVIDIA, comparing the GPU and the CPU, puts it this way: "The CPU is an SUV. It will drive always and everywhere, but not very fast. A GPU is a sports car. On a bad road it simply won't go anywhere, but give it a good surface and it will show a speed the SUV has never even dreamed of!.."

Capabilities of CUDA technology

CUDA cores are the scalar computational units in NVidia video chips, beginning with the G80 (GeForce 8xxx, Tesla C/D/S870, FX4/5600, 360M). The chips themselves are derivatives of a RISC architecture. Incidentally, this may be why NVidia so willingly took up development of its own Tegra series processors, also based on a RISC architecture: the company has plenty of experience with these architectures.

A CUDA core contains one vector unit and one scalar unit, which perform one vector and one scalar operation per clock cycle, passing the results on to another multiprocessor or to other blocks for further processing. An array of hundreds or thousands of such cores represents significant computing power and can perform a wide variety of tasks, given suitable supporting software. The applications can be varied: video stream decoding, 2D/3D graphics acceleration, cloud computing, specialized mathematical analysis, and so on.

Quite often, arrays of professional NVidia Tesla and NVidia Quadro cards form the backbone of modern supercomputers.

CUDA cores have not changed significantly since the G80, but their number keeps growing (together with other blocks: ROPs, texture units, etc.), as does the efficiency of their parallel interaction (the GigaThread modules are being improved).

For example:

GeForce GTX 460 – 336 CUDA cores
GeForce GTX 580 – 512 CUDA cores
GeForce 8800 GTX – 128 CUDA cores

Performance in shader calculations grows almost proportionally with the number of stream processors (CUDA cores), assuming a uniform increase in the number of the other elements.

Starting with the Kepler chips (the GK104 in the NVidia GeForce GTX 680), the CUDA cores no longer run at a doubled frequency but share a common clock with all other blocks of the chip. Instead, their number was increased roughly threefold compared to the previous-generation GF110.

According to Darwin's theory of evolution, the first ape (to be precise, homo antecessor, the predecessor of humans) eventually turned into us. Multi-ton computing centers with a thousand or more vacuum tubes, occupying entire rooms, have been replaced by half-kilogram laptops which, by the way, are in no way inferior to the former in performance. Antediluvian typewriters have turned into multifunctional devices that print anything on anything (even on the human body). The processor giants suddenly decided to wall a graphics core into their "stone". And video cards began not only to show a picture at acceptable FPS and graphics quality, but also to perform all kinds of calculations. And how they perform them! This article is about the technology of multi-threaded computing on the GPU.

Why GPU?

One wonders why they decided to shift all the computing power onto the graphics adapter. As you can see, CPUs are still in fashion and are unlikely to give up their warm spot. But the GPU has a couple of aces up its sleeve, along with a joker, and sleeves to spare. A modern central processor is designed to achieve maximum performance when processing integer and floating-point data, without worrying much about parallel processing of information. The architecture of a video card, on the other hand, makes it quick and painless to "parallelize" data processing. On one side, polygons are being computed (thanks to the 3D pipeline); on the other, pixel texture processing. There is a harmonious division of the load in the card's core. In addition, the interplay of memory and the video processor is more efficient than the "RAM-cache-CPU" combination. The moment one unit of data in the video card starts being processed by one GPU stream processor, another unit is loaded in parallel into another, and in principle it is easy to reach a GPU load comparable to the bus bandwidth; however, for this to happen the pipelines must be loaded uniformly, without any conditional jumps and branches. The central processor, by virtue of its versatility, requires a full-fledged cache for its processing needs.

Pundits pondered the use of GPUs for parallel computing and mathematics and came up with the theory that many scientific calculations are in many ways similar to 3D graphics processing. Many experts believe that the decisive factor in the development of GPGPU (General-Purpose computation on GPU, general-purpose computing on a video card) was the appearance of the Brook GPU project in 2003.

The project's creators at Stanford University had to solve a difficult problem: forcing the graphics adapter, in hardware and software, to perform diverse calculations. And they succeeded. Using the generic C language, the American scientists made the GPU work like a processor, adjusted for parallel processing. After Brook, a number of projects on VGA computing appeared, such as the Accelerator library, the Brahma library, the GPU++ metaprogramming system and others.

CUDA!

A premonition of promising prospects made AMD and NVIDIA latch onto Brook GPU like pit bulls. Marketing policy aside, by implementing everything correctly you can gain a foothold not only in the graphics sector of the market but also in computing (look at the special computing cards and Tesla servers with hundreds of multiprocessors), displacing the usual CPUs.

Naturally, the "FPS lords" parted at the stumbling block, each going its own way, but the basic principle remained unchanged: perform calculations on the GPU. And now we will take a closer look at the "green" technology, CUDA (Compute Unified Device Architecture).

Our "heroine's" job is to provide an API, in fact two at once. The first is high-level, the CUDA Runtime, whose functions are broken down into simpler pieces and passed on to the lower API, the CUDA Driver. So applying the phrase "high-level" to the process is a bit of a stretch. All the salt is in the driver itself, and the libraries kindly created by NVIDIA's developers will help you get at it: CUBLAS (tools for mathematical calculations) and CUFFT (calculations based on the fast Fourier transform). Well then, let's move on to the practical part of the material.
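As a first taste of the practical side, here is a minimal sketch (my own illustration, not taken from the NVIDIA documentation) of offloading a SAXPY computation to the GPU through CUBLAS. It uses the modern cublas_v2 interface, so it must be linked against the cublas library; the array size is arbitrary.

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main(void)
{
    const int n = 1 << 20;          /* one million elements */
    const float alpha = 2.0f;

    /* host data */
    float *h_x = (float*)malloc(n * sizeof(float));
    float *h_y = (float*)malloc(n * sizeof(float));
    for (int i = 0; i < n; ++i) { h_x[i] = 1.0f; h_y[i] = 2.0f; }

    /* device data */
    float *d_x, *d_y;
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMalloc(&d_y, n * sizeof(float));
    cudaMemcpy(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, h_y, n * sizeof(float), cudaMemcpyHostToDevice);

    /* y = alpha * x + y, computed entirely on the GPU */
    cublasHandle_t handle;
    cublasCreate(&handle);
    cublasSaxpy(handle, n, &alpha, d_x, 1, d_y, 1);
    cublasDestroy(handle);

    cudaMemcpy(h_y, d_y, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("y[0] = %f\n", h_y[0]);  /* expect 4.0 */

    cudaFree(d_x); cudaFree(d_y);
    free(h_x); free(h_y);
    return 0;
}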

CUDA Terminology

NVIDIA uses rather peculiar definitions for the CUDA API. They differ from the definitions used when working with a central processor.

Thread – a set of data to be processed (it does not require large processing resources).

Warp – a group of 32 threads. Data is processed only in whole warps, so a warp is the minimum amount of data.

Block – a set of threads (from 64 to 512) or a set of warps (from 2 to 16).

Grid – a collection of blocks. This division of data is used solely to improve performance. If the number of multiprocessors is large, the blocks will be executed in parallel. If you are unlucky with the card (the developers recommend using an adapter no lower than a GeForce 8800 GTS 320 MB), the blocks will be processed sequentially.

NVIDIA also introduces such concepts as kernel, host and device; the sketch below shows how these terms map onto code.
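A minimal sketch (my own; the kernel name, array and sizes are arbitrary) of how the pieces fit together: the kernel runs on the device, is launched from the host, and is executed by a grid of blocks of 256 threads (that is, 8 warps) each.

#include <cuda_runtime.h>

/* kernel: runs on the device, launched from the host */
__global__ void scale(float *data, float factor, int n)
{
    /* each thread computes its global index from its position
       inside the block and the block's position in the grid */
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

int main(void)
{
    int n = 100000;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    int threadsPerBlock = 256;  /* the block: 256 threads = 8 warps */
    int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;  /* the grid */

    scale<<<blocksPerGrid, threadsPerBlock>>>(d_data, 2.0f, n);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}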

We're working!

To work fully with CUDA you need to:

1. Know the structure of the GPU's shader cores, since the essence of the programming consists in evenly distributing the load between them.
2. Be able to program in C, taking a few specifics into account.

NVIDIA's developers present the "insides" of the video card somewhat differently from what we are used to seeing. So, willy-nilly, you will have to study all the subtleties of the architecture. Let's look at the structure of the legendary G80 "stone", the GeForce 8800 GTX.

The shader core consists of eight TPCs (Texture Processor Clusters), clusters of texture processors (the GeForce GTX 280 has 15 of them, the 8800 GTS has six, the 8600 has four, and so on). These, in turn, consist of two streaming multiprocessors (hereinafter SM). Each SM (there are 16 in total) consists of a front end (which handles reading and decoding instructions) and a back end (the final issue of instructions) of the pipeline, as well as eight scalar SPs (shader processors) and two SFUs (super function units). Each clock cycle (unit of time) the front end selects a warp and processes it. For all the threads of a warp (let me remind you, there are 32 of them) to be processed, 32/8 = 4 cycles are required at the end of the pipeline.

Each multiprocessor has so-called shared memory. Its size is 16 kilobytes, and it gives the programmer complete freedom of action: distribute it as you wish :). Shared memory provides communication between the threads of one block and is not intended for working with pixel shaders.
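To make this concrete, here is a minimal sketch (my own, assuming 256-thread blocks) of threads within one block exchanging data through shared memory, with __syncthreads() as the barrier that makes the exchange safe.

__global__ void reverse_in_block(float *data)
{
    /* 16 KB of shared memory per SM on the G80; this buffer uses 1 KB of it */
    __shared__ float buf[256];

    int tid  = threadIdx.x;
    int base = blockIdx.x * blockDim.x;

    buf[tid] = data[base + tid];  /* each thread writes its own element */
    __syncthreads();              /* wait until the whole block has written */

    /* read an element written by a *different* thread of the same block */
    data[base + tid] = buf[blockDim.x - 1 - tid];
}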

SMs can also access GDDR. For this, each was given 8 kilobytes of cache memory that stores the things most important for its work (for example, computing constants).

The multiprocessor has 8192 registers. The number of active blocks cannot exceed eight, and the number of warps cannot exceed 768/32 = 24. From this it is clear that the G80 can process a maximum of 32*16*24 = 12288 threads per unit of time. These figures have to be taken into account when optimizing the program later (block size on one side of the scale, the number of threads on the other). The balance of parameters can play an important role, which is why NVIDIA recommends using blocks of 128 or 256 threads. A block of 512 threads is inefficient because it has increased latencies. Taking into account all the subtleties of the GPU's structure, plus good programming skills, you can create a very productive tool for parallel computing. By the way, about programming...
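Before that, here is a minimal sketch (my own, using the standard cudaGetDeviceProperties call of the runtime API) of how these per-multiprocessor limits can be queried at run time rather than hard-coded.

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  /* device 0 */

    printf("Multiprocessors      : %d\n",        prop.multiProcessorCount);
    printf("Registers per block  : %d\n",        prop.regsPerBlock);        /* 8192 on the G80  */
    printf("Shared mem per block : %zu bytes\n", prop.sharedMemPerBlock);   /* 16 KB on the G80 */
    printf("Max threads per block: %d\n",        prop.maxThreadsPerBlock);
    printf("Warp size            : %d\n",        prop.warpSize);            /* 32 */
    return 0;
}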

Programming

For "creativity" with CUDA you need a GeForce video card no older than the eighth series. From the official website you need to download three software packages: a driver with CUDA support (each OS has its own), the CUDA SDK package itself (the second beta version) and additional libraries (the CUDA toolkit). The technology supports the Windows (XP and Vista), Linux and Mac OS X operating systems. For my study I chose Vista Ultimate Edition x64 (looking ahead, I will say that the system behaved just fine). At the time of writing, the relevant driver was ForceWare 177.35. The Borland C++ 6 Builder package was used as the toolset (although any environment that works with the C language will do).

For someone who knows the language it will be easy to get used to the new environment. All that is required is to remember the basic keywords. The __global__ qualifier (placed before a function) indicates that the function is a kernel: it will be called by the central processor, and all the work will happen on the GPU. Calling a __global__ function requires more specific details, namely the grid size and the block size of the kernel being launched. For example, the line saxpy_parallel<<<X, Y>>>(...), where X is the grid size and Y is the block size, specifies these parameters.

The __device__ qualifier means that the function will be called by the graphics core, which will also execute its instructions. This function resides in the memory of the multiprocessor, so it is impossible to obtain its address. The __host__ prefix means that the call and the processing will take place with the participation of the CPU only. Keep in mind that a __device__ function cannot call a __global__ one, and recursion is not supported in device code.
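To tie the qualifiers together, here is a minimal sketch (my own; the function names are arbitrary) with a __device__ helper, a __global__ kernel and a __host__ launcher.

#include <cuda_runtime.h>

/* __device__: callable only from GPU code, executed on the GPU */
__device__ float axpy_one(float a, float x, float y)
{
    return a * x + y;
}

/* __global__: a kernel, called from the host, executed on the GPU */
__global__ void saxpy_parallel(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = axpy_one(a, x[i], y[i]);
}

/* __host__ is implied for ordinary functions; this one runs on the CPU */
__host__ void run_saxpy(int n, float a, const float *d_x, float *d_y)
{
    int block = 256;                      /* Y: block size */
    int grid  = (n + block - 1) / block;  /* X: grid size  */
    saxpy_parallel<<<grid, block>>>(n, a, d_x, d_y);
}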

The CUDA language also has a number of functions for working with video memory: cudaMalloc (memory allocation on the device), cudaMemcpy and cudaMemcpy2D (copying memory between GDDR and RAM) and cudaFree (freeing device memory).
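A minimal sketch of the usual memory round trip (my own illustration; it assumes the saxpy_parallel kernel defined above is in the same file).

#include <cuda_runtime.h>

/* defined in the previous sketch, assumed to be in the same file */
__global__ void saxpy_parallel(int n, float a, const float *x, float *y);

void saxpy_on_gpu(int n, float a, const float *h_x, float *h_y)
{
    float *d_x, *d_y;
    size_t bytes = n * sizeof(float);

    cudaMalloc(&d_x, bytes);                              /* allocate GDDR   */
    cudaMalloc(&d_y, bytes);
    cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);  /* RAM -> GDDR     */
    cudaMemcpy(d_y, h_y, bytes, cudaMemcpyHostToDevice);

    int block = 256;
    saxpy_parallel<<<(n + block - 1) / block, block>>>(n, a, d_x, d_y);

    cudaMemcpy(h_y, d_y, bytes, cudaMemcpyDeviceToHost);  /* GDDR -> RAM     */
    cudaFree(d_x);                                        /* free GDDR       */
    cudaFree(d_y);
}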

All program code is compiled through the CUDA toolchain. First, the code intended exclusively for the central processor is taken and subjected to standard compilation, while the code intended for the graphics adapter is rewritten into the intermediate language PTX (much like assembler), which allows possible errors to be identified. After all these "dances" comes the final translation of the commands into a language the GPU/CPU understands.
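To show where PTX and the low-level CUDA Driver API sit, here is a hedged sketch (my own; the file name scale.ptx and the kernel name are assumptions, the PTX is assumed to have been produced beforehand, e.g. with nvcc -ptx, and the kernel is assumed to be declared extern "C" so its name is not mangled) of loading precompiled PTX by hand.

#include <cuda.h>  /* the CUDA Driver API, as opposed to cuda_runtime.h */

int main(void)
{
    CUdevice    dev;
    CUcontext   ctx;
    CUmodule    module;
    CUfunction  kernel;
    CUdeviceptr d_data;
    int n = 100000;
    float factor = 2.0f;

    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);

    cuModuleLoad(&module, "scale.ptx");           /* load the PTX module    */
    cuModuleGetFunction(&kernel, module, "scale");/* look up the kernel     */

    cuMemAlloc(&d_data, n * sizeof(float));

    void *args[] = { &d_data, &factor, &n };
    cuLaunchKernel(kernel,
                   (n + 255) / 256, 1, 1,  /* grid dimensions  */
                   256, 1, 1,              /* block dimensions */
                   0, NULL, args, NULL);   /* shared mem, stream, parameters */
    cuCtxSynchronize();

    cuMemFree(d_data);
    cuModuleUnload(module);
    cuCtxDestroy(ctx);
    return 0;
}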

Study Kit

Almost all aspects of programming are described in the documentation that comes with the driver and the two applications, as well as on the developers' website. The size of this article is not enough to describe them (the interested reader should put in a little effort and study the material on their own).

The CUDA SDK Browser was developed especially for beginners. Anyone can feel the power of parallel computing firsthand (the best stability test: the examples run without artifacts or crashes). The application contains a large number of illustrative mini-programs (61 "tests"). Each example comes with detailed documentation of the program code plus PDF files. It is immediately obvious that the people represented in the browser by their creations are doing serious work. You can also compare the speed of the processor and the video card when processing data. For example, scanning multidimensional arrays on a GeForce 8800 GT video card (512 MB) with a block of 256 threads takes 0.17109 milliseconds. The technology does not recognize SLI tandems, so if you have a duo or trio, disable the "pairing" function before working, otherwise CUDA will see only one device. A dual-core AMD Athlon 64 X2 (core frequency 3000 MHz) passes the same test in 2.761528 milliseconds. It turns out the G92 is more than 16 times faster than the AMD "stone"! As you can see, even this far-from-extreme system, in tandem with an operating system unloved by the masses, shows good results.
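If you want to reproduce this kind of comparison in your own code, kernel time is usually measured with CUDA events. A minimal sketch follows (my own; it assumes the saxpy_parallel kernel from the earlier sketch is in the same file).

#include <cuda_runtime.h>

__global__ void saxpy_parallel(int n, float a, const float *x, float *y);

/* measure how long a kernel takes on the GPU, in milliseconds */
float time_kernel(int n, float a, float *d_x, float *d_y)
{
    cudaEvent_t start, stop;
    float ms = 0.0f;

    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    saxpy_parallel<<<(n + 255) / 256, 256>>>(n, a, d_x, d_y);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);              /* wait for the kernel to finish      */

    cudaEventElapsedTime(&ms, start, stop);  /* GPU-side time between the events   */
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}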

In addition to the browser, there are a number of programs useful to society. Adobe has adapted its products to the new technology: Photoshop CS4 now makes full use of the resources of graphics adapters (you need to download a special plugin). With programs such as the Badaboom media converter and RapiHD you can decode video into the MPEG-2 format. For audio processing, the free Accelero utility will do nicely. The amount of software tailored to the CUDA API will undoubtedly grow.

And at this time...

While you are reading this material, the hard workers at the processor concerns are developing their own technologies for integrating GPUs into CPUs. With AMD everything is clear: it has the tremendous experience acquired together with ATI.

Its creation of "micro-devices", Fusion, will consist of several cores codenamed Bulldozer and the RV710 video chip (Kong). They will communicate over an improved HyperTransport bus. Depending on the number of cores and their clock characteristics, AMD plans to build a whole price hierarchy of "stones". It also plans to produce processors for laptops (Falcon) and for multimedia gadgets (Bobcat). Moreover, it is precisely the use of the technology in portable devices that will be the initial challenge for the Canadians. With the development of parallel computing, such "stones" should become very popular.

Intel is a little behind schedule with its Larrabee. AMD's products, if nothing happens, will appear on store shelves in late 2009 or early 2010, while the rival's solution will see the light only almost two years later.

Larrabee will have a large number of cores (read: hundreds). Initially there will also be products designed for 8 to 64 cores. They are very similar to the Pentium, but quite heavily reworked. Each core has 256 kilobytes of L2 cache (its size will grow over time). Communication will take place over a 1024-bit bidirectional ring bus. Intel says its "child" will work perfectly with the DirectX and OpenGL APIs (the latter for Apple developers), so no software intervention will be required.

Why did I tell you all this? It is obvious that Larrabee and Fusion will not push ordinary desktop processors out of the market, just as they will not push video cards out of it. For gamers and enthusiasts the ultimate dream will still be a multi-core CPU and a tandem of several top-end VGAs. But the fact that even the processor companies are switching to parallel computing on principles similar to GPGPU says a lot. In particular, it says that a technology like CUDA has a right to exist and will most likely become very popular.

A short summary

Parallel computing on a video card is just a good tool in the hands of a hardworking programmer. Processors, driven by Moore's law, are hardly coming to an end. NVIDIA still has a long way to go in promoting its API to the masses (the same can be said about the ATI/AMD brainchild). What it will look like, the future will show. So CUDA will be back :).

P.S. I recommend that beginner programmers and anyone interested visit the following "virtual establishments": the official NVIDIA website and GPGPU.com. All the information there is in English, but at least be thankful it is not in Chinese. So go for it! I hope the author has helped you at least a little in your exciting journey into exploring CUDA!