#u280
macro-group · 5 years
Text
The Alveo U280 accelerator card from Xilinx
Xilinx has started production of the Alveo U280 accelerator card. It differs from the previously released U200/U250 accelerators by its 8 GB of HBM2 memory with a total bandwidth of 460 GB/s, something of a record for memory bandwidth. The Alveo family thus now consists of three types of accelerator cards.
The Xilinx Alveo U280 accelerator card is aimed at the workloads of modern data centers. Like the previous Alveo boards, the U280 is available with active or passive cooling and is designed for continuous 24/7 operation.
On database workloads the U280 reaches 1,200 million queries per second while cutting response time by a factor of four compared to a CPU. The Alveo U280 is also well suited to artificial-intelligence tasks: on a text-translation workload, for example, it delivers more than 10,000 characters per second at 6 ms latency. Details are in the article "Supercharge Your AI and Database Applications with Xilinx's HBM-Enabled UltraScale+ Devices Featuring Samsung HBM2".
The Alveo U280 is programmed with the SDAccel development environment.
More details on the new card and its specifications are available on the product page: https://www.xilinx.com/products/boards-and-kits/alveo/u280.html
0 notes
darsh2604 · 5 years
Link
The PALESSE K-G-6 forage harvester consists of a universal power vehicle (PALESSE 2U280, 2U250 or U280) and the semi-mounted forage harvesting combine PALESSE FH40 with an adapter kit. The harvester is available in two models: PALESSE FH40 and PALESSE FH40A. If the harvester is used to chop maize for silage, the PALESSE FH40A, with an active-type device for regrinding corn of wax and full ripeness, is the better choice. When harvesting grass, the crusher unit is dismounted and replaced by a spacer plate with smooth walls. If the harvester reaps only grass, the PALESSE FH40 model without the corn crusher unit is the more practical purchase. The semi-mounted harvester PALESSE FH40 is equipped with a radial-disc chopping device and is available in two modifications: with or without an active-type device for regrinding corn.
1 note · View note
bigjoe11 · 2 years
Text
Aavik U180, U280, and U580 integrated amplifiers Review
https://www.hifi-advice.com/blog/amplifier-reviews/integrated-amplifier-reviews/aavik-u180-u280-and-u580-integrated-amplifiers/ The Pascal UMAC power amp module was easy to spot and I was pleased to see such a neatly laid-out preamplifier board. There’s an XMOS input receiver and a PCM1792 DAC chip. Mounted on the power amp module’s heatsink are three full-width PCBs full of Tesla coils that…
0 notes
dailytechnologynews · 6 years
Photo
Xilinx Alveo U280 Launched Possibly with AMD EPYC CCIX Support https://ift.tt/2Kgmo51
2 notes · View notes
shadowner · 4 years
Photo
$45.88 (17% off) - Women's U280 Cat-Eye Sunglasses - 57 mm - Black Fade - C31296VODMV. Imported; a functional fashion accessory. More products to select from in our Cat Eye range: https://www.shadowner.com/49-cat-eye
0 notes
stampuscollection · 4 years
Photo
USA SPECIMEN 2c Bidders Sample Special Request UPSS 818 U280 Stationery 86067 https://ift.tt/2YDhlTM
0 notes
audiointerfaceker · 4 years
Photo
((*)) Tripp Lite U280-016-RM interface hub - U280-016-RM https://ift.tt/3fmPqxZ
0 notes
fastcompression · 5 years
Text
Jpeg2jpeg Acceleration with CUDA MPS on Linux
The task of fast JPEG-to-JPEG resize is essential for high-load web services. Users create most of their images with smartphones and cameras in JPEG format, which is the most popular format today. To offer high-quality services and to cut storage expenses, providers strive to implement JPEG resize on-the-fly so that just one image is stored instead of several dozen copies at different resolutions.
Solutions for fast JPEG resize, also called Jpeg2jpeg, have been implemented on CPU, GPU, FPGA and mobile platforms. The highest performance for this task has been demonstrated on GPU and FPGA, which used to be considered roughly on par when comparing NVIDIA Tesla T4 against Xilinx VCU1525 or Alveo U280 hardware.
Bottlenecks for high performance Jpeg2jpeg solutions on GPU
Implementing fast JPEG resize (the Jpeg2jpeg transform) is a complicated task, and it is not easy to speed up an already highly optimized solution. Nevertheless, we can point out some areas that could still be improved:
better GPU utilization
batch mode implementation
performance optimization of JPEG decoder (this is the slowest part of the pipeline)
In general, a GPU offers very high performance only when there is a sufficient amount of data for parallel processing. If there is not enough data, GPU occupancy is low and we are far from maximum performance. JPEG resize on GPU has exactly this issue: a single request usually does not supply enough parallel work, so GPU occupancy is low. One way to address this is to implement a batch mode.
Batch mode means that the same algorithm is applied at the same time to many items. This does not map directly onto JPEG resize: batch processing can be implemented for JPEG decoding, but resize is hard to include in the same batch because each image needs its own set of interpolation coefficients and the scaling ratio usually differs between images. That is why batching can be limited to JPEG decoding only. Existing FPGA-based Jpeg2jpeg solutions all use batch JPEG decoding followed by individual resize and encoding to get better performance.
The JPEG decoder can be optimized by taking the latest NVIDIA architectures into account to boost the entropy decoder, which is the most time-consuming part of the JPEG algorithm. Apart from the entropy decoder, it also makes sense to accelerate all other parts of the JPEG algorithm.
Finally, we have found a way to accelerate the current version of JPEG Resize on GPU from the Fastvideo Image Processing SDK, and the answer is NVIDIA CUDA MPS. Below we consider in detail what CUDA MPS is and how we can utilize it for this task.
CUDA Multi-Process Service
The Multi-Process Service (MPS) is an alternative, binary-compatible implementation of the CUDA Application Programming Interface (CUDA API). The MPS runtime architecture is designed to transparently enable co-operative multi-process CUDA applications, typically MPI jobs, to utilize Hyper-Q capabilities on the latest NVIDIA (Kepler-based) GPUs. Hyper-Q allows CUDA kernels to be processed concurrently on the same GPU. This can benefit performance when the GPU compute capacity is underutilized by a single application process.
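To illustrate the underutilization problem that Hyper-Q and MPS address, here is a minimal CUDA C++ sketch (our own illustration, not part of the Fastvideo SDK or the MPS documentation): a small kernel that occupies only a fraction of the GPU is launched either once or concurrently on several streams. On an underutilized GPU the eight concurrent launches finish in roughly the same time as a single one, which is the same effect MPS provides across separate processes.

// mps_occupancy_demo.cu -- build with: nvcc -O2 mps_occupancy_demo.cu -o demo
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// A deliberately small kernel: a few blocks cannot occupy the whole GPU.
__global__ void busyKernel(float *data, int n, int iters)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        float v = data[idx];
        for (int i = 0; i < iters; ++i)
            v = v * 1.000001f + 0.5f;
        data[idx] = v;
    }
}

// Launch one small grid per stream and time the whole batch with CUDA events.
static float runLaunches(int numStreams, float *d_data, int nPerStream)
{
    std::vector<cudaStream_t> streams(numStreams);
    for (auto &s : streams) cudaStreamCreate(&s);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int i = 0; i < numStreams; ++i)
        busyKernel<<<4, 256, 0, streams[i]>>>(d_data + i * nPerStream, nPerStream, 200000);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    for (auto &s : streams) cudaStreamDestroy(s);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main()
{
    const int maxStreams = 8, nPerStream = 4 * 256;
    float *d_data = nullptr;
    cudaMalloc(&d_data, maxStreams * nPerStream * sizeof(float));
    cudaMemset(d_data, 0, maxStreams * nPerStream * sizeof(float));

    printf("1 stream : %.2f ms\n", runLaunches(1, d_data, nPerStream));
    printf("8 streams: %.2f ms\n", runLaunches(8, d_data, nPerStream));

    cudaFree(d_data);
    return 0;
}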
MPS is a binary-compatible client-server runtime implementation of the CUDA API, which consists of several components:
Control Daemon Process: the control daemon is responsible for starting and stopping the server, as well as coordinating connections between clients and servers.
Client Runtime: the MPS client runtime is built into the CUDA Driver library and may be used transparently by any CUDA application.
Server Process: the server is the clients' shared connection to the GPU and provides concurrency between clients.
To balance workloads between CPU and GPU tasks, MPI processes are often allocated individual CPU cores in a multi-core CPU machine to provide CPU-core parallelization of potential Amdahl bottlenecks. As a result, the amount of work each individual MPI process is assigned may underutilize the GPU when the MPI process is accelerated using CUDA kernels. While each MPI process may end up running faster, the GPU is being used inefficiently. The Multi-Process Service takes advantage of the inter-MPI rank parallelism, increasing the overall GPU utilization.
NVIDIA Volta architecture has introduced new MPS capabilities. Compared to MPS on pre-Volta GPUs, Volta MPS provides a few key improvements:
Volta MPS clients submit work directly to the GPU without passing through the MPS server.
Each Volta MPS client owns its own GPU address space instead of sharing GPU address space with all other MPS clients.
Volta MPS supports limited execution resource provisioning for Quality of Service (QoS).
Fig.1. Pascal and Volta MPS architectures (picture from NVIDIA MPS Documentation)
CUDA MPS Benefits
GPU utilization: a single process may not utilize all the compute and memory-bandwidth capacity available on the GPU. MPS allows kernel and memcopy operations from different processes to overlap on the GPU, achieving higher utilization and shorter running times.
Reduced on-GPU context storage: without MPS, each CUDA process using a GPU allocates separate storage and scheduling resources on the GPU. In contrast, the MPS server allocates one copy of GPU storage and scheduling resources shared by all its clients. Volta MPS supports increased isolation between MPS clients, so the resource reduction is to a much lesser degree.
Reduced GPU context switching: without MPS, when processes share the GPU their scheduling resources must be swapped on and off the GPU. The MPS server shares one set of scheduling resources between all of its clients, eliminating the overhead of swapping when the GPU is scheduling between those clients.
CUDA MPS Limitations
MPS is only supported on the Linux operating system. The MPS server will fail to start when launched on an operating system other than Linux.
MPS is not supported on NVIDIA Jetson platforms. The MPS server will fail to start when launched on Jetson platforms.
MPS requires a GPU with compute capability version 3.5 or higher. The MPS server will fail to start if one of the GPUs visible after applying CUDA_VISIBLE_DEVICES is not of compute capability 3.5 or higher.
The Unified Virtual Addressing (UVA) feature of CUDA must be available, which is the default for any 64-bit CUDA program running on a GPU with compute capability version 2.0 or higher. If UVA is unavailable, the MPS server will fail to start.
The amount of page-locked host memory that can be allocated by MPS clients is limited by the size of the tmpfs filesystem (/dev/shm).
Exclusive-mode restrictions are applied to the MPS server, not to MPS clients.
Only one user on a system may have an active MPS server.
The MPS control daemon will queue MPS server activation requests from separate users, leading to serialized exclusive access of the GPU between users regardless of GPU exclusivity settings.
All MPS client behavior will be attributed to the MPS server process by system monitoring and accounting tools (e.g. nvidia-smi, NVML API).
GPU Compute Modes
Three Compute Modes are supported via settings accessible in nvidia-smi.
PROHIBITED – the GPU is not available for compute applications.
EXCLUSIVE_PROCESS – the GPU is assigned to only one process at a time, and individual process threads may submit work to the GPU concurrently.
DEFAULT – multiple processes can use the GPU simultaneously. Individual threads of each process may submit work to the GPU simultaneously.
Using MPS effectively causes EXCLUSIVE_PROCESS mode to behave like DEFAULT mode for all MPS clients. MPS will always allow multiple clients to use the GPU via the MPS server.
When using MPS, it is recommended to use EXCLUSIVE_PROCESS mode to ensure that only a single MPS server is using the GPU, which provides additional insurance that the MPS server is the single point of arbitration between all CUDA processes for that GPU.
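The active compute mode can also be read programmatically through the CUDA runtime, which is convenient for a service that wants to verify its environment at startup. Below is a minimal sketch (our own example, not from the MPS documentation); it assumes device 0 is the GPU of interest.

// query_compute_mode.cu -- build with: nvcc -O2 query_compute_mode.cu -o query_mode
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        printf("No CUDA device found\n");
        return 1;
    }
    // computeMode reflects the mode set with `nvidia-smi -c ...`.
    switch (prop.computeMode) {
        case cudaComputeModeDefault:
            printf("DEFAULT: multiple processes may use the GPU\n"); break;
        case cudaComputeModeExclusiveProcess:
            printf("EXCLUSIVE_PROCESS: one process at a time (recommended with MPS)\n"); break;
        case cudaComputeModeProhibited:
            printf("PROHIBITED: the GPU is not available for compute\n"); break;
        default:
            printf("Other compute mode (%d)\n", prop.computeMode); break;
    }
    return 0;
}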
Client-Server Architecture
This diagram shows a likely schedule of CUDA kernels when running an MPI application consisting of multiple OS processes without MPS. Note that while the CUDA kernels from within each MPI process may be scheduled concurrently, each MPI process is assigned a serially scheduled time-slice on the whole GPU.
Fig.2. Multi-Process Sharing GPU without MPS (picture from NVIDIA MPS Documentation)
Fig.3. Multi-Process Sharing GPU with MPS (picture from NVIDIA MPS Documentation)
When using pre-Volta MPS, the server manages the hardware resources associated with a single CUDA context. The CUDA contexts belonging to MPS clients funnel their work through the MPS server. This allows the client CUDA contexts to bypass the hardware limitations associated with time-sliced scheduling and permits their CUDA kernels to execute simultaneously.
Volta provides new hardware capabilities to reduce the types of hardware resources the MPS server must manage. A client CUDA context manages most of the hardware resources on Volta and submits work to the hardware directly. The Volta MPS server mediates the remaining shared resources required to ensure simultaneous scheduling of work submitted by individual clients, and stays out of the critical execution path.
The communication between the MPS client and the MPS server is entirely encapsulated within the CUDA driver behind the CUDA API. As a result, MPS is transparent to the MPI program. MPS clients CUDA contexts retain their upcall handler thread and any asynchronous executor threads. The MPS server creates an additional upcall handler thread and creates a worker thread for each client.
Server
The MPS control daemon is responsible for the startup and shutdown of MPS servers. The control daemon allows at most one MPS server to be active at a time. When an MPS client connects to the control daemon, the daemon launches an MPS server if there is no server active. The MPS server is launched with the same user id as that of the MPS client.
If there is an MPS server already active and the user id of the server and client match, then the control daemon allows the client to proceed to connect to the server. If there is an MPS server already active, but the server and client were launched with different user id’s, the control daemon requests the existing server to shutdown once all its clients have disconnected. Once the existing server has shutdown, the control daemon launches a new server with the same user id as that of the new user's client process.
The MPS control daemon does not shutdown the active server if there are no pending client requests. This means that the active MPS server process will persist even if all active clients exit. The active server is shutdown when either a new MPS client, launched with a different user id than the active MPS server, connects to the control daemon or when the work launched by the clients has caused an exception.
The control daemon executable also supports an interactive mode where a user with sufficient permissions can issue commands, for example to see the current list of servers and clients or startup and shutdown servers manually.
Client Attach/Detach
When CUDA is first initialized in a program, the CUDA driver attempts to connect to the MPS control daemon. If the connection attempt fails, the program continues to run as it normally would without MPS. If however, the connection attempt succeeds, the MPS control daemon proceeds to ensure that an MPS server, launched with same user id as that of the connecting client, is active before returning to the client. The MPS client then proceeds to connect to the server.
All communication between the MPS client, the MPS control daemon, and the MPS server is done using named pipes and UNIX domain sockets. The MPS server launches a worker thread to receive commands from the client. Upon client process exit, the server destroys any resources not explicitly freed by the client process and terminates the worker thread.
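As a small sketch of the client side (our own example; it relies only on the documented CUDA_MPS_PIPE_DIRECTORY environment variable and standard CUDA initialization), a client process can report which pipe directory is configured and then force context creation, at which point the driver transparently attempts the daemon connection described above:

// mps_client_init.cu -- build with: nvcc -O2 mps_client_init.cu -o client_init
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

int main()
{
    // The driver reaches the control daemon via named pipes; the pipe directory can be
    // redirected with CUDA_MPS_PIPE_DIRECTORY (the default location is /tmp/nvidia-mps).
    const char *pipeDir = std::getenv("CUDA_MPS_PIPE_DIRECTORY");
    printf("CUDA_MPS_PIPE_DIRECTORY = %s\n", pipeDir ? pipeDir : "(not set, default location)");

    // cudaFree(0) is a common idiom to force CUDA context creation. If an MPS control
    // daemon is running, this context becomes an MPS client; otherwise CUDA runs as usual.
    cudaError_t err = cudaFree(0);
    printf("CUDA initialization: %s\n", cudaGetErrorString(err));
    return 0;
}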
Important Application Considerations
The NVIDIA VIDEO Codec SDK is not supported under MPS on pre-Volta MPS clients.
Only 64-bit applications are supported. The MPS server will fail to start if the CUDA application is not 64-bit. The MPS client will fail CUDA initialization.
If an application uses the CUDA driver API, then it must use headers from CUDA 4.0 or later (i.e. it must not have been built by setting CUDA_FORCE_API_VERSION to an earlier version). Context creation in the client will fail if the context version is older than 4.0.
Dynamic parallelism is not supported. CUDA module load will fail if the module uses dynamic parallelism features.
MPS server only supports clients running with the same UID as the server. The client application will fail to initialize if the server is not running with the same UID.
Stream callbacks are not supported on pre-Volta MPS clients. Calling any stream callback APIs will return an error.
CUDA graphs with host nodes are not supported under MPS on pre-Volta MPS clients.
The amount of page-locked host memory that pre-Volta MPS client applications can allocate is limited by the size of the tmpfs filesystem (/dev/shm). Attempting to allocate more page-locked memory than the allowed size using any of relevant CUDA APIs will fail.
Terminating an MPS client without synchronizing with all outstanding GPU work (via Ctrl-C / program exception such as segfault / signals, etc.) can leave the MPS server and other MPS clients in an undefined state, which may result in hangs, unexpected failures, or corruptions.
Performance measurements for Jpeg2jpeg application
For software testing we used the following scenario:
Source 24-bit RGB images with JPEG quality 90%, subsampling 4:2:0, restart interval 1
Initial image resolution: 1920×1080 (2K) or 1280×720 (1K)
2K resize: 1920×1080 to 480×270
1K resize: 1280×720 to 320×180
Output JPEG compression: quality 90%, subsampling 4:2:0, restart interval 10
Hardware and software
CPU Intel Core i7-5930K (Haswell-E, 6 cores, 3.5–3.7 GHz)
NVIDIA Quadro GV100
Linux Ubuntu 18.04 and CUDA-10.0
Fastvideo SDK 0.14.2.4
These are the main components of the Jpeg2jpeg software:
Server is responsible for image processing on GPU.
Client is responsible for reading images from disk, sending them for processing and storing the processed images after conversion. We need at least two Clients per Server to hide load/store operations.
If we are working with CUDA MPS activated, then the total number of processes in the Jpeg2jpeg software is limited by the number of available CPU cores (see the sizing sketch below).
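A small illustration of that sizing rule (our own sketch, not Fastvideo code): a launcher can derive the number of Server processes from the CPU core count before spawning anything.

// choose_process_count.cpp -- sizing sketch for Servers/Clients per machine
#include <algorithm>
#include <cstdio>
#include <thread>

int main()
{
    const unsigned cpuCores = std::max(1u, std::thread::hardware_concurrency());
    const unsigned clientsPerServer = 2;   // at least two Clients per Server to hide load/store
    // With MPS enabled, the total number of processes should not exceed the number of CPU cores,
    // so each Server plus its Clients is given its own group of cores.
    const unsigned servers = std::max(1u, cpuCores / (1 + clientsPerServer));
    printf("CPU cores: %u -> %u Servers x %u Clients\n", cpuCores, servers, clientsPerServer);
    return 0;
}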
To check CUDA MPS mode, we executed the following commands
nvidia-smi -i 0 -c EXCLUSIVE_PROCESS
nvidia-cuda-mps-control -d
Then we started 2/4/6 daemons of JPEG Resize application on NVIDIA Quadro GV100 GPU.
We've also done the same without CUDA MPS to make a comparison.
How we measured the performance
To get reliable results that correspond well to the JPEG Resize algorithm parameters, for each test we used the same image and the same parameters for resizing and encoding. We repeated each series 1,000 times and calculated the average FPS (number of frames per second). Speedup is calculated as the current value of "FPS with MPS" divided by the best value from the "FPS without MPS" column.
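The measurement loop can be reproduced with a simple harness like the sketch below (our own illustration; dummyResize is a hypothetical stand-in for one real decode-resize-encode call): run the same operation many times, average, and convert to FPS; the speedup is the ratio of the two FPS values.

// fps_benchmark.cpp -- averaging harness for FPS measurements
#include <chrono>
#include <cstdio>
#include <thread>

// Hypothetical stand-in for one decode-resize-encode call; replace with the real pipeline.
static void dummyResize()
{
    std::this_thread::sleep_for(std::chrono::microseconds(1200));   // ~1.2 ms per image
}

static double measureFps(int iterations)
{
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i)
        dummyResize();
    auto t1 = std::chrono::steady_clock::now();
    double seconds = std::chrono::duration<double>(t1 - t0).count();
    return iterations / seconds;
}

int main()
{
    const int iterations = 1000;               // same series length as in our tests
    double fps = measureFps(iterations);
    printf("average FPS: %.0f\n", fps);
    // In practice the binary is run twice, once with the MPS daemon stopped and once with it
    // active, and the speedup is "FPS with MPS" divided by the best "FPS without MPS".
    return 0;
}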
Jpeg2jpeg performance with and without MPS for 1K JPEG Resize from 1280×720 to 320×180
Source image | Servers | Clients per Server | FPS without MPS | FPS with MPS | Speedup
1K | 2 | 1 | 837 | 1198 | 1.4
1K | 2 | 2 | 815 | 1633 | 2.0
1K | 4 | 1 | 815 | 2326 | 2.9
1K | 4 | 2 | 813 | 2581 | 3.2
1K | 6 | 1 | 795 | 2857 | 3.4
1K | 6 | 2 | 805 | 2871 | 3.4
Jpeg2jpeg performance with and without MPS for 2K JPEG Resize from 1920×1080 to 480×270
Source image | Servers | Clients per Server | FPS without MPS | FPS with MPS | Speedup
2K | 2 | 1 | 769 | 1124 | 1.4
2K | 2 | 2 | 762 | 1368 | 1.8
2K | 4 | 1 | 761 | 1826 | 2.4
2K | 4 | 2 | 748 | 1975 | 2.8
2K | 6 | 1 | 769 | 2143 | 2.8
2K | 6 | 2 | 696 | 2087 | 2.7
We see that performance saturation in this task is probably connected with the number of utilized CPU cores. We will check the performance on a multicore Intel Xeon CPU to find the solution with the best balance between CPU and GPU and achieve maximum acceleration for the Jpeg2jpeg application. This is an essentially heterogeneous task, and all hardware components should be carefully chosen and thoroughly tested.
Jpeg 2 Jpeg acceleration benchmarks for CUDA MPS on Linux
We have been able to boost the Jpeg2jpeg software with CUDA MPS on Linux significantly. According to our time measurements, total performance for the JPEG Resize application in CUDA MPS mode increased by a factor of 2.8–3.4, which is difficult to believe. We have been able to accelerate a solution that was already well optimized and was one of the fastest on the market. For standard use cases on NVIDIA Quadro GV100 we got benchmarks around 760–830 fps (images per second), and with CUDA MPS, under the same test conditions and on the same hardware, we reached 2140–2870 fps.
Such an impressive performance boost is absolutely astonishing, and we have checked it many times. It works well and very fast. Moreover, we have a fair chance of getting even better acceleration by utilizing a more powerful multicore CPU.
GPU and FPGA solutions for Jpeg2jpeg applications were on par until recently, but this is not the case anymore. Now an NVIDIA GPU with the Jpeg2jpeg software from Fastvideo has left both CPU and FPGA solutions behind.
References
Documentation for NVIDIA CUDA MPS
Web Resize on-the-fly: one thousand images per second on Tesla V100 GPU
JPEG Resize on-demand: FPGA vs GPU. Which is the fastest?
Original article see at: https://www.fastcompression.com/blog/jpeg2jpeg-cuda-mps-acceleration.htm
0 notes
mixaromasepeti · 6 years
Link
0 notes
dolly-momma · 3 years
Photo
☮️💟🕉 https://www.instagram.com/p/CQGdE7BHeemTEccLrba83DIMiA_gjrqnr_-U280/?utm_medium=tumblr
0 notes
loskindividuo · 6 years
Photo
#Repost @ictdynamicteam · 1/365/BASILICA DI MASSENZIO. The Basilica of Maxentius is one of the most imposing buildings of ancient Rome that can still be admired today. The basilica was founded on the site previously occupied by the triple portico that served as a vestibule to Nero's Domus Aurea and was later turned into a warehouse for exotic and luxury goods from the East. Built at the beginning of the 4th century AD by Maxentius, who was emperor of Rome for only a few years before being defeated by Constantine, it covered an immense area. All that remains today is a single minor aisle, whose outer wall can be seen from Via dei Fori Imperiali. The rest of the building was lost in a 9th-century earthquake, and the rubble was reused to embellish the old basilica of Saint Peter. The Basilica of Maxentius was one of the favourite models of Renaissance architecture, in particular for Bramante. #365postiaroma The guide by @ictdynamicteam #basilica #basilicadimassenzio #roma #romantica #bramante #domusaurea #u280 #nerone #viadeiforiimperiale #ruinesromaines #architetturarinascimentale #archittetura #cosedavedere #365 https://www.instagram.com/p/BoluB3ZjzuW/?utm_source=ig_tumblr_share&igshid=1wcn1pe1gma0d
0 notes
Obtain Some Automotive Diagnostic Devices
For many car owners (see http://bestobd2scannerreviews.com/), it is hard to keep a car in good shape without a few automotive diagnostic tools of one's own. Even a simple diagnostic tool is a great help in checking the trouble codes from the engine. Thanks to the rapid progress of science and technology, these devices can easily be updated with specific software, so there is no need to buy new ones year after year, which is a good way to save money and reduce how often we depend on mechanics for help.
To stay within budget, you can buy a few automotive tools for vehicle diagnostics and scanning, since they are usually inexpensive and practical to use. A memo scanner is made to read and erase trouble codes in vehicles, especially for do-it-yourself repairers. If you need to program or reprogram modules, you will need programming tools or commanders. These tools have simple interfaces and are easy to operate.
The first thing to keep in mind before buying any diagnostic tool is your own need and the level at which you will be able to handle current technology, whether for personal or professional use. Some examples of such products are the U280 VW/AUDI memo scanner, VAG PROG CZ Version, BMW OPPS, BMW OPS, BMW GT1, VAG pin reader, OBD2 ELM327 USB CAN-BUS Scanner and so on. You will find them practical, I promise. Spend some money on these tools at first, and then you can scan your car's problems yourself and solve them yourself. In other words, you buy some basic diagnostic tools at a relatively low price, yet they are able to work with many types of vehicles, give access to all kinds of trouble codes and then diagnose problems connected with the internal programming of the car. In doing so, you save the money you would otherwise spend on the diagnosis of mechanical problems. It is really a good deal.
0 notes
stampuscollection · 4 years
Photo
USA SPECIMEN 2c Bidders Sample Special Request UPSS 818 U280 Stationery 86067 https://ift.tt/3avEb3G
0 notes
audiointerfaceker · 4 years
Photo
((*)) Tripp Lite U280-016-RM interface hub - U280-016-RM https://ift.tt/36DJ0rb
0 notes
fastcompression · 5 years
Text
JPEG Resize on-demand: FPGA vs GPU. Which is the fastest?
https://www.fastcompression.com/blog/jpeg-resize-on-demand-fpga-vs-gpu-performance-comparison.htm
High-performance image processing is an essential task for web applications. It is not a rare situation when companies need to process hundreds of millions of images every day. In some services that number can reach several tens of billions of images per day. Here we review available hardware and software solutions for such tasks. We will not consider CPU-based implementations here and focus on FPGA vs GPU as the fastest hardware for parallel computations.
Internet traffic is increasing by ~25% annually (according to a white paper from Cisco, see below), and images account for a considerable part of that data. Many companies handle huge volumes of images in their data centers:
Cloud storage, CDN
Mobile instant messaging
Images for social networks, image social platforms, cloud albums, photo hosting centers
User-Generated Content platforms (UGC)
E-Commerce platforms
Market demand
Users generate more images and video data every day
User devices have higher resolution in capture and display
Users strive to better viewing experience
Better image quality and QoS
Customers demand instant access to the resource (reduced latency)
Challenges
Huge consumption of computational and storage resources
Server and storage performance is not enough
As a task for comparing the performance and latency of FPGA vs GPU in image processing, we will consider JPEG Resize on-demand, which is widely used in web applications. The total number of JPEGs in the world is huge and it increases significantly every day. That is why companies spend quite a lot on storage for these pictures and on hardware/software for image processing. The simplest approach is to store the same JPEG picture at several different resolutions and to serve an image that is slightly bigger than requested. Such an approach cannot match the desired image resolution exactly. To cut storage expenses and respond to users faster, we can resize JPEG images in real time according to the required dimensions, matching the requirements of the user's device exactly. In that case we need to store just one original picture in the data center, and it is processed according to the parameters of each individual request. The idea of on-demand resize for JPEG images is not new, and it has already been implemented in several internet services on CPU, GPU and FPGA.
Image processing pipeline for JPEG Resize on-demand
JPEG Decoding
Resizing and Cropping
Sharpening
Color profile handling
JPEG Encoding
The pipeline consists of different algorithms, and we could hardly implement batch processing for the whole pipeline, since the final dimensions differ and images can have different subsampling, individual quantization tables and Huffman tables. That means we have to run these algorithms sequentially, though there is still a way to benefit from batching. Most of the time in the pipeline is spent on JPEG decoding, so we can implement batching just for the decoding stage. We can also prepare images for future batching by using the same subsampling, quantization tables and Huffman tables for all of them. The file size increases a little, but we get an opportunity to accelerate JPEG decoding.
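As a hedged sketch of just the resize stage of that pipeline (decode and encode are omitted, and the input is assumed to be an already-decoded 8-bit plane resident in GPU memory), a bilinear downscale kernel looks roughly like this; a real Jpeg2jpeg implementation would run one such pass per image with its own scaling ratio:

// resize_stage.cu -- bilinear downscale of one 8-bit plane, build with: nvcc -O2 resize_stage.cu
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

__global__ void resizeBilinear(const unsigned char *src, int srcW, int srcH,
                               unsigned char *dst, int dstW, int dstH)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= dstW || y >= dstH) return;

    // Map the output pixel back to source coordinates.
    float sx = (x + 0.5f) * srcW / dstW - 0.5f;
    float sy = (y + 0.5f) * srcH / dstH - 0.5f;
    int xf = (int)floorf(sx), yf = (int)floorf(sy);
    float fx = sx - xf, fy = sy - yf;
    int x0 = max(0, xf), x1 = min(srcW - 1, xf + 1);
    int y0 = max(0, yf), y1 = min(srcH - 1, yf + 1);

    float top = src[y0 * srcW + x0] * (1 - fx) + src[y0 * srcW + x1] * fx;
    float bot = src[y1 * srcW + x0] * (1 - fx) + src[y1 * srcW + x1] * fx;
    dst[y * dstW + x] = (unsigned char)(top * (1 - fy) + bot * fy + 0.5f);
}

int main()
{
    const int srcW = 1920, srcH = 1080, dstW = 480, dstH = 270;   // the 2K test case above
    std::vector<unsigned char> h_src(srcW * srcH, 128);

    unsigned char *d_src, *d_dst;
    cudaMalloc(&d_src, h_src.size());
    cudaMalloc(&d_dst, dstW * dstH);
    cudaMemcpy(d_src, h_src.data(), h_src.size(), cudaMemcpyHostToDevice);

    dim3 block(16, 16);
    dim3 grid((dstW + 15) / 16, (dstH + 15) / 16);
    resizeBilinear<<<grid, block>>>(d_src, srcW, srcH, d_dst, dstW, dstH);
    cudaDeviceSynchronize();
    printf("resize done: %s\n", cudaGetErrorString(cudaGetLastError()));

    cudaFree(d_src);
    cudaFree(d_dst);
    return 0;
}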
Image Processing on FPGA
An FPGA (Field-Programmable Gate Array) is specialized reconfigurable hardware that can also be used for fast image and video processing. It is a very complicated platform to program, and FPGA engineers need to know quite a lot: hardware architecture, the Verilog or VHDL languages, Intel (Altera) or Xilinx development tools, etc. A thorough understanding of how the hardware works is a must, so not everyone can do it. Still, the approach is evolving rapidly and there are some outstanding results in the field.
FPGA-based solutions possess some features that can hardly be beaten: hardware-based computations are usually very fast, the hardware can be very small, and it can have extremely low power consumption. These are indispensable conditions for many embedded applications. In comparison with FPGA, any CPU- or GPU-based solution with the same performance has bigger dimensions and needs more power.
Despite the fact that FPGA programming is very complicated, there are quite a lot of imaging solutions based on FPGA/ASIC hardware. Moreover, neural networks have already been implemented on FPGA, and that hardware platform is considered very promising for deep learning and artificial intelligence tasks. Currently, neural networks are much easier to implement on GPU than on FPGA, but FPGA is a competitive option, although at the moment it requires very high skills to work with.
Some frame grabbers have a built-in library for internal FPGA-based image processing. Silicon Software frame grabbers (the company now belongs to Basler) have such a feature, and this is a way to implement FPGA-based image processing for camera applications without writing any VHDL or Verilog code. This is important for ensuring faster-than-realtime image processing on FPGA for high-speed and high-performance cameras.
CTAccel solutions on Intel FPGAs
The CTAccel company is based in Hong Kong and Shenzhen. It is a team of FPGA professionals with a solid record of achievements in FPGA design, hardware/software co-design, system optimization and software engineering. The core team of CTAccel comes from the FPGA team of Clustertech Ltd.
CTAccel Image Processing Solutions
CTAccel Image Processor (CIP) effectively accelerates the following image processing/analytics workflows:
Thumbnail Generation/Transcoding
Image processing (sharpen/color filter)
Image analytics
CIP includes the following FPGA-based accelerated functions:
Decoder: JPEG
Pixel processing: Resize/Crop
Encoder: JPEG, WebP, Lepton
Software compatibility with OpenCV, ImageMagick and Lepton
CIP Image Processing Pipeline (image from that PDF)
The diagram of the suggested image processing pipeline shows that the Crop, Sharp and Color conversions are implemented on the host CPU, not on the FPGA. In that case we get a combined heterogeneous solution, where image processing is done both on FPGA and on CPU. This leads to additional load on the CPU.
Intel FPGA board (image from that PDF)
The CIP accelerator features an Intel® Programmable Acceleration Card (PAC) with Intel Arria® 10 GX FPGA.
Performance evaluation for CTAccel CIP solution on Intel PAC
The JPEG Resize solution from CTAccel for the Intel® Programmable Acceleration Card can do JPEG decoding, image resize and JPEG encoding. CTAccel publishes the following real-world customer use case on its site:
An 8 MPix image is converted to a thumbnail on an Intel Xeon E5-2639 v2 CPU within 10 ms (100 sec / 10,000 images), which means that on CTAccel CIP this image can be processed within 2 ms (the total computation time of 10 ms reduced by 80%). For further comparison we can conclude that a 3 MPix JPEG image can be resized on the Intel PAC with CTAccel CIP within 0.8 ms, which corresponds to 1250 fps (frames per second).
The averaged latency for 3 MPix image processing on FPGA is: 240 ms * (100 - 80)/100 * 3/8 = 18 ms. This actually means that the FPGA is processing around 18 / 0.8 ≈ 22 images at the same time. It looks like batch JPEG decoding has already been implemented in CIP.
Unfortunately, we do not have full information about the parameters of that pipeline. Nevertheless, we can expect that the original compressed image should be close to visually lossless compression. The "thumbnail generation" approach means that the final image resolution is very small, so the resize takes less time and the output JPEG encoding takes almost no time. In the real-life case of JPEG Resize on-demand we need to fit the resolution of a smartphone screen, which is more computationally intensive (that scenario was used in the tests with the NVIDIA Tesla T4 GPU). Still, these performance results can serve as a basis for an indirect comparison with the benchmarks on NVIDIA Tesla T4. To the best of our knowledge, the pricing of the Intel PAC board and NVIDIA T4 is comparable.
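For readers who want to retrace that arithmetic, the 2 ms, 0.8 ms, 1250 fps and 18 ms figures can be re-derived from the published CPU baseline with a few lines (a sketch that uses only the numbers quoted above):

// cip_estimates.cpp -- re-derive the CIP throughput/latency estimates quoted above
#include <cstdio>

int main()
{
    const double cpuSecPer10kImages = 100.0;                                  // Xeon E5-2639 v2 baseline
    const double cpuMsPerImage8MPix = cpuSecPer10kImages / 10000.0 * 1000.0;  // 10 ms per 8 MPix image
    const double cipMsPerImage8MPix = cpuMsPerImage8MPix * (1.0 - 0.80);      // 80% time reduction -> 2 ms
    const double cipMsPerImage3MPix = cipMsPerImage8MPix * 3.0 / 8.0;         // 0.75 ms (rounded to 0.8 ms above)
    const double cipFps3MPix        = 1000.0 / cipMsPerImage3MPix;            // ~1300 fps (0.8 ms gives the quoted 1250 fps)

    const double cpuLatencyMs       = 240.0;                                  // CPU latency from the use case
    const double cipLatencyMs3MPix  = cpuLatencyMs * (1.0 - 0.80) * 3.0 / 8.0;    // 18 ms
    const double imagesInFlight     = cipLatencyMs3MPix / cipMsPerImage3MPix;     // ~24 (the text uses 18 / 0.8 ≈ 22)

    printf("CIP estimate: %.2f ms/image (3 MPix), %.0f fps, latency %.0f ms, ~%.0f images in flight\n",
           cipMsPerImage3MPix, cipFps3MPix, cipLatencyMs3MPix, imagesInFlight);
    return 0;
}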
XILINX
Xilinx Alveo is a portfolio of powerful accelerator cards designed to dramatically increase performance in various tasks. Xilinx® Alveo™ U280 Data Center accelerator cards are designed to meet the constantly changing needs of the modern data center. Built on the Xilinx 16nm UltraScale™ architecture, the Alveo U280 ES offers 8 GB of HBM2 with 410 GB/s of bandwidth to provide high-performance, adaptable acceleration for memory-bound, compute-intensive applications including database, analytics and machine learning inference. The U280 ES acceleration card includes PCI Express 4.0 support to leverage the latest server interconnect infrastructure for high-bandwidth host processors. And like all Xilinx technology, customers can reconfigure the hardware, enabling them to optimize for shifting workloads, new standards and updated algorithms without incurring replacement costs.
Alveo accelerator cards are adaptable to changing acceleration requirements and algorithm standards, capable of accelerating any workload without changing hardware, and reduce overall cost of ownership.
Xilinx Alveo U280 Data Center accelerator card (image is taken here)
There are libraries that allow image processing algorithms to be implemented on FPGA via C++ instead of Verilog/VHDL, although VHDL code is still generated in the end. The Xilinx Video and Image Processing Pack is an example of this approach. It includes full licenses to the following LogiCORE IP cores:
Chroma Resampler
Color Correction
Color Filter Array Interpolation
Gamma Correction
Image Edge Enhancement and Image Noise Reduction
On-Screen Display
Video Deinterlacer
Video DMA
Video Timing Controller
The Xilinx Video and Image Processing Pack provides bundled licensing option for all of the LogiCORE™ IP blocks. Video processing blocks provide optimized hardware implementations for video scaling, on-screen display, picture-in-picture, text overlay, video and image analysis and more. Image processing blocks enable pre-processing of images captured by a color image sensor with a Bayer Color Filter Array (CFA), correcting defective pixels, interpolating missing color components for every pixel, correcting colors to adjust to lighting conditions, setting gamma to compensate for the intensity distortion of different display devices and more.
CTAccel on Virtex UltraScale+ VCU1525
The above set of image processing features from Xilinx LogiCORE IP cores is not enough to accomplish the task of JPEG Resize on-demand. That task was solved on Xilinx FPGA hardware by CTAccel company, as in the case with Intel FPGA. That solution from CTAccel on Xilinx is utilized by Huawei Cloud for FPGA Image Transcoding Acceleration on Virtex VU9P to provide easy-to-use and cost-effective image transcoding services. For the task of thumbnail generation (which is alike the task of JPEG Resize on-demand, but more simple), published benchmarks for performance and latency are the same as for Intel FPGA solution - this is 0.8 ms for 3 MPix jpeg image. It should be noted that the same CTAccel CIP solution can work on Xilinx Alveo accelerator cards as well.
Xilinx VCU1525 developer board (image is taken here)
GPU Image Processing on NVIDIA CUDA
GPU architecture was initially created for image display and was eventually transformed for parallel computations. Image processing can be considered a native task for the GPU, though we need an SDK to avoid programming the GPU at a low level. There are quite a lot of such SDKs with sophisticated feature sets.
A full set of features for JPEG resize on demand exists in the Fastvideo Image Processing SDK. The SDK contains a wide variety of image processing modules that show very high performance on all NVIDIA GPUs, from Jetson to high-end Quadro and Tesla products. The highly optimized modules show exceptional results; for many features the performance is much higher than the bandwidth of PCIe Gen3 x16. This is an important point for offering a fast solution, particularly for JPEG resize on demand.
The solution has been heavily tested for years, which is proof of its reliability in the task of JPEG Resize on-demand. Several high-load internet services have been using the software, and the total number of processed JPEGs exceeds several quadrillion. In comparison with recently developed FPGA-based solutions it looks much more reliable.
The Fastvideo Image Processing SDK on NVIDIA T4 can do decode-resize-sharp-encode in 1.2 ms for a 3 MPix image, which is around 800 fps, and this is the result without batch mode. Since image resize takes only a small part of that time, the JPEG decoder performance is the key to achieving maximum speed for this task. The resized image usually has a smaller resolution, so the output JPEG encoder processes it very quickly. In general, the GPU JPEG encoder is much faster than the GPU JPEG decoder, which is why JPEG decoding is the bottleneck for this task.
Although we have not yet implemented our own batch mode for JPEG Resize on-demand, there is an opportunity to increase GPU occupancy with CUDA MPS on Linux. CUDA MPS makes it possible to utilize NVIDIA Hyper-Q technology in a multi-process environment, and this is the way to get much better performance when the GPU is underutilized (which is exactly our case). We ran 4 processes of JPEG Resize on-demand on NVIDIA T4 and achieved a significant speedup with CUDA MPS, reaching a final performance of 2200 fps on NVIDIA Tesla T4.
The feature of FPGA partial reconfiguration doesn't bring any additional benefits in comparison with GPU solutions for high performance image processing tasks. Basically, the performance is defined by the hardware and software. And if we need to run any other task on the same hardware, we just need to load another software from available image processing modules. In the case with GPU, this is actually a question about availability of necessary modules in the software. Fastvideo Image Processing SDK has great variety of such modules and it can also work with any other third-party SDK, so the choice of available software features for CUDA image processing on Tesla T4 is impressive.
It should be noted that currently, versatility of GPU is much more profound in comparison with FPGA. NVIDIA GPUs have a lot of libraries which help developers to build their applications. And the idea of utilizing COTS (Commercial Off-The-Shelf) is very handy. GPU and FPGA have totally different architectures and that's why their "killer applications" are also quite different. NVIDIA, Intel and Xilinx invest a lot into their software, hardware, infrastructure, community and try to promote their solutions at full extent. They do compete everywhere and NVIDIA is now succeeding with implementation of embedded solutions on Jetson platform, though embedded solutions were always the main market niche for FPGA/ASIC. Intel and Xilinx are moving towards DL/AI to compete with NVIDIA as well.
Conclusions
We have published available results for GPU vs FPGA performance and latency for JPEG Resize on-demand task. That comparison is indirect because we have benchmarks for different scenarios: for FPGA this is thumbnail generation case for JPEG images, for GPU this is standard JPEG resize to most frequently used resolution which was acquired from statistics.
The achieved performance looks very good in both cases: the NVIDIA Tesla T4 result is around 2200 fps for JPEG Resize with CUDA MPS. The Intel/Xilinx result is around 1250 fps for thumbnail generation with internal batch decoding, which is a simpler task because it requires less computation. We can see that NVIDIA Tesla T4 significantly outperforms the Intel/Xilinx FPGA boards, and we need to bear in mind that the FPGA benchmarks were obtained in a simplified test.
Computational latency in these tasks is 2–3 ms for NVIDIA Tesla T4 and 18 ms for FPGA.
FPGA dimensions and power consumption are not vitally important for the JPEG Resize on-demand task. FPGA boards look like GPU boards and their dimensions are the same, though power consumption is still lower for FPGA-based solutions: 20–40 W per FPGA card versus 70 W for NVIDIA Tesla T4.
Ease of programming also looks similar, because developers mostly use high-level SDKs. If we look at low-level programming, then FPGA is much more complicated to handle.
Partial Reconfiguration and flexibility are essential benefits of FPGA, but for the current task they are not really important.
Reliability of GPU-based solution is higher due to long-term presence on the market.
Both GPU and FPGA could be utilized for other tasks, including DL/AI applications.
GPU-based solutions are basically software implementations, while FPGA solutions imply non-standard hardware to work with. This is a great advantage of NVIDIA GPUs: they are commodity items. At the moment we cannot yet call the FPGA-based boards from Intel and Xilinx commodity items.
NVIDIA, Intel (Altera), Xilinx, CTAccel, Fastvideo are great companies which move progress forward and create impressive software and hardware solutions for high performance applications.
What we finally get
GPU and FPGA high-performance processing capabilities offer the following benefits:
Increased image processing throughput
Reduced TCO
Reduced computational latency
Reduced size of cluster or less total number of servers
Low CPU utilization
Better user experience and QoS:
Less time to wait for image loading
Less power consumption on the user's device
Less traffic due to optimal image resolutions
Better image quality (no local image resize on user's device)
Links
Cisco Visual Networking Index: Forecast and Trends, 2017–2022 White Paper (Internet traffic growth estimation)
JPEG Resize on-demand on NVIDIA Tesla V100 with Fastvideo software
Accelerating JPEG, WebP and Lepton decoding, encoding and resizing on Intel® Xeon®-based servers by offloading all functions to the Intel FPGA.
Xilinx Alveo U200, U250 and U280
Xilinx Video and Image Processing Pack
NVIDIA Tesla T4 specification
Fastvideo Image Processing SDK for NVIDIA Tesla GPUs
Jpeg2jpeg Acceleration with CUDA MPS on Linux at NVIDIA Quadro GV100
Original article see at:
0 notes
jaiybi-gaumza · 7 years
Photo
AUTHENTIC CHANEL LAMB SKIN BLACK QUILTED CHAIN SHOULDER BAG GHW W23.5 U280 http://rover.ebay.com/rover/1/711-53200-19255-0/1?ff3=2&toolid=10044&campid=5337410312&customid=&lgeo=1&vectorid=229466&item=311965974202
0 notes