100 NVIDIA Research Summit 2010 paper posters

Edison · Posted on 2010-9-27 01:04

http://www.nvidia.com/object/research_summit_posters_2010.html

Algorithms & Numerical Techniques

A02 - Accelerating Symbolic Computations  on NVIDIA Fermi
  We present the first implementation of a complete modular resultant algorithm on graphics hardware. Our recent developments, which take advantage of the new NVIDIA Fermi GPU architecture and instruction set, allowed us to achieve a speed-up of about 150x over a modular resultant algorithm from Maple 13.
  Author:  Pavel  Emeliyanenko (Max-Planck Institute for Informatics)
A03 - Particle-In-Cell Simulations on the  GPU
  Particle-In-Cell simulations represent an important technique in the field of kinetic plasma simulations. 2D particle pushing and conserved current aggregation have been implemented in CUDA. On a Tesla C1060, the CUDA code is 4 times faster than SSE2-optimized code on a quad-core Intel Xeon processor.
  Author:  Hartmut Ruhl  (Ludwig-Maximilians-University)
A04 - Parallel Ant Colony Optimization  with CUDA
  The Ant Colony Optimization (ACO) algorithm is a metaheuristic that is used to find shortest paths in graphs. By using CUDA to implement an ACO algorithm, we achieved a significant improvement in performance over a highly tuned sequential CPU implementation. The construction step of the ACO algorithm consists of each ant creating an independent solution, and this step is where most of the computation is spent. Since the construction step is the same for most ACO variations, parallelizing this step will also allow for easy adaptation to different pheromone updating functions. Currently, our research tests this hypothesis on the travelling salesman problem.
  Author:  Octavian  Nitica (University of Delaware)
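The construction step described in A04 can be illustrated with a small sequential sketch. This is not the poster's CUDA code: the function names, the `alpha`/`beta` exponents, and the choice to weight each candidate city by pheromone strength times inverse distance are assumptions taken from the textbook ACO formulation for the travelling salesman problem.

```python
import random


def construct_tour(dist, pheromone, alpha=1.0, beta=2.0, rng=random):
    """Build one ant's tour over all cities, starting from city 0.

    dist and pheromone are square matrices (lists of lists).
    alpha and beta are assumed textbook weighting exponents.
    """
    n = len(dist)
    tour = [0]
    unvisited = set(range(1, n))
    while unvisited:
        current = tour[-1]
        candidates = sorted(unvisited)
        # Weight each candidate by pheromone strength and inverse distance.
        weights = [
            (pheromone[current][j] ** alpha) * ((1.0 / dist[current][j]) ** beta)
            for j in candidates
        ]
        nxt = rng.choices(candidates, weights=weights, k=1)[0]
        tour.append(nxt)
        unvisited.remove(nxt)
    return tour


def tour_length(dist, tour):
    """Length of the closed tour (returns to the starting city)."""
    return sum(dist[tour[i]][tour[(i + 1) % len(tour)]] for i in range(len(tour)))
```

Each ant's call to `construct_tour` reads the shared matrices but writes only its own tour, which is why this step is a natural target for parallelization.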
A05 - High Performance and Scalable Radix  Sorting for GPU Stream Architectures
  The need  to rank and order data is pervasive, and sorting operations are fundamental to  many algorithms.  This poster presents a  very efficient method for sorting large sequences of fixed-length keys (and  values) using GPU stream processors. Compared to the state-of-the-art, our implementation demonstrates  multiple factors of speedup (up to 3.8x) for all NVIDIA GPGPUs.  For this domain of sorting problems, we  believe our sorting primitive to be the fastest available for any  fully-programmable microarchitecture: our stock NVIDIA GTX480 sorting results  exceed the 1G keys/sec average sorting rate (i.e., one billion 32-bit keys  sorted per second).
  Author:  Duane Merrill  (University of Virginia)
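For readers unfamiliar with radix sorting of fixed-length keys, here is a minimal sequential sketch of a least-significant-digit radix sort on 32-bit unsigned keys. It only illustrates the count / prefix-sum / scatter structure of each pass; it is not the poster's GPU implementation, and the function name and the 8-bit digit width are assumptions.

```python
def radix_sort_u32(keys, bits_per_pass=8):
    """Sort non-negative integers below 2**32 with an LSD radix sort."""
    mask = (1 << bits_per_pass) - 1
    out = list(keys)
    for shift in range(0, 32, bits_per_pass):
        # 1. Histogram: count how many keys fall into each digit bucket.
        counts = [0] * (mask + 1)
        for k in out:
            counts[(k >> shift) & mask] += 1
        # 2. Exclusive prefix sum turns counts into bucket start offsets.
        offsets = [0] * (mask + 1)
        running = 0
        for digit in range(mask + 1):
            offsets[digit] = running
            running += counts[digit]
        # 3. Stable scatter: place each key at the next free slot of its bucket.
        scattered = [0] * len(out)
        for k in out:
            digit = (k >> shift) & mask
            scattered[offsets[digit]] = k
            offsets[digit] += 1
        out = scattered
    return out
```

The scatter must be stable so that the ordering established by earlier (lower) digits survives later passes.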
A06 - Task Management for Irregular  Workloads on the GPU
  We explore software mechanisms for managing irregular tasks on graphics processing units. Traditional GPU programming guidelines teach us how to efficiently program the GPU for data parallel pipelines with regular input and output. We present a strategy for solving task parallel pipelines which can handle irregular workloads on the GPU. We demonstrate that dynamic scheduling and efficient memory management are critical problems in achieving high efficiency on irregular workloads. We showcase our results on a real time Reyes rendering pipeline.
  Author:  Stanley Tzeng  (University of California, Davis)
A07 - A Hybrid Method for Solving  Tridiagonal Systems on GPU
  Tridiagonal  linear systems are of importance to many problems in numerical analysis and  computational fluid dynamics, as well as to computer graphics applications in  video games and computer-animated films. This poster presents our study on the  performance of multiple tridiagonal algorithms on a GPU. We design a novel  hybrid algorithm that combines a work-efficient algorithm with a step-efficient  algorithm in a way well-suited for a GPU architecture. Our hybrid solver  achieves 8x and 2x speedup respectively in single precision and double  precision over a multi-threaded highly-optimized CPU solver and a 2x speedup  over a basic GPU solver.
  Author:  Yao Zhang  (University of California, Davis)
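As a point of reference for what a tridiagonal solve computes, the sketch below uses the classic sequential Thomas algorithm. This is explicitly not the hybrid GPU algorithm from A07, which the abstract does not spell out; it is only the textbook serial baseline, and the function name and argument layout are assumptions.

```python
def solve_tridiagonal(a, b, c, d):
    """Solve a tridiagonal system with the sequential Thomas algorithm.

    a: sub-diagonal (a[0] is unused), b: main diagonal,
    c: super-diagonal (c[-1] is unused), d: right-hand side.
    Assumes no zero pivots occur (e.g. a diagonally dominant matrix).
    """
    n = len(b)
    cp = [0.0] * n  # modified super-diagonal
    dp = [0.0] * n  # modified right-hand side
    cp[0] = c[0] / b[0]
    dp[0] = d[0] / b[0]
    # Forward elimination.
    for i in range(1, n):
        denom = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / denom if i < n - 1 else 0.0
        dp[i] = (d[i] - a[i] * dp[i - 1]) / denom
    # Back substitution.
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x
```

Both loops carry a dependency from one row to the next, which is why GPU solvers turn to different, more parallel formulations.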
A08 - Development of Desktop Computing  Applications and Engineering Tools on GPUs
  A GPU competence center and laboratory for research and collaboration between academia and industry partners was established in 2008 at the Section for Scientific Computing, DTU Informatics, Technical University of Denmark. In GPULab we focus on the utilization of GPUs for high-performance computing applications and software tools in science and engineering, inverse problems, visualization, imaging, and dynamic optimization. This poster illustrates the latest and most interesting projects that have been developed at our center.
  Author:  Hans Henrik B.  Soerensen (Technical University of  Denmark)
A09 - Ballot Counting for Optimal Binary  Prefix Sum
  This  poster describes a new technique for performing binary prefix sums using  Fermi's new __ballot() and __popc() functions. These instructions greatly increase intra-warp communication, allowing  for an 80% speedup over standard GPU methods in applications like Radix  Sort.  It also points to future research  that will enable suffix array construction, Burrows-Wheeler Transform, and the  BZIP algorithm to take advantage of these instructions for efficient GPU  compression.
  Author:  David  Whittaker (University of Alabama at  Birmingham)
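The idea behind A09 can be sketched by emulating one warp in plain Python. In CUDA, `__ballot()` returns a 32-bit mask with one bit per lane whose predicate is non-zero, and `__popc()` counts the set bits of an integer; the Python functions `ballot`, `popc` and `warp_binary_exclusive_scan` below are hypothetical stand-ins for illustration, not the poster's code.

```python
WARP_SIZE = 32


def ballot(predicates):
    """Emulate __ballot(): bit i of the result is set if lane i's predicate is true."""
    mask = 0
    for lane, pred in enumerate(predicates):
        if pred:
            mask |= 1 << lane
    return mask


def popc(x):
    """Emulate __popc(): number of set bits in x."""
    return bin(x).count("1")


def warp_binary_exclusive_scan(predicates):
    """Exclusive prefix sum of 0/1 flags across one emulated warp."""
    if len(predicates) > WARP_SIZE:
        raise ValueError("a warp holds at most 32 lanes")
    mask = ballot(predicates)
    # Each lane counts the set bits belonging to lanes below it.
    return [popc(mask & ((1 << lane) - 1)) for lane in range(len(predicates))]
```

On the GPU every lane would evaluate its own mask-and-count expression at the same time, so the whole scan costs one ballot and one population count per lane instead of a multi-step shared-memory scan.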
A10 - Deriving Parallelism and GPU  Acceleration of Algorithms with Inter-Dependent Data Fields
  This poster presents an approach to derive parallelism in algorithms that involve building a sparse matrix representing relationships between inter-dependent data fields, and to enhance their performance on the GPU. This work compares the algorithm's performance on the GPU to its CPU variant that employs the traditional sparse matrix-vector multiplication (SpMV) approach. We have also compared our algorithm's performance with CUSP SpMV on the GPU. The software used in this work is MATLAB and Jacket, a GPU engine for MATLAB.
  Author:  James Malcolm  (Accelereyes)
A11 - Parallelizing the Particle Level Set  Method
  The particle  level set is widely used as an accurate interface tracking tool in simulation,  computer vision and other related fields. However, high computation cost  prevents applying this method to real-time and interactive scenarios.
  This work makes intensive use of parallel design patterns implemented in the Thrust library, such as compaction, reduction and scattering, to parallelize the particle level set method in order to attain real-time performance.
  Author:  Wen Zheng  (Stanford University)
A12 - Accelerating CUDA Graph Algorithms at Maximum Warp
  Graphs are powerful data representations favored in many computational domains. GPUs have shown promising results in this domain, but their performance suffers when the graph is highly irregular. In this study, we propose three general schemes to accelerate graph algorithms on a modern GPU architecture: (i) deferred processing of outliers, (ii) efficient dynamic workload balancing and (iii) warp-based execution exploiting threads in a SIMD-like manner. Our evaluation reveals that our schemes exhibit up to 9x speedup over previous GPU algorithms and 23x over single CPU execution on irregular graphs. They also yield up to 30% improvement, even for regular graphs.
  Author:  Sungpack Hong  (Stanford University)
A13 - Implementation of Adaptive Cross  Approximation on NVIDIA GPUs
  The  Method of Moments is a popular computational method for solving integral  equations in electromagnetics. However, it suffers from high computational and  memory costs since it requires the solution of a dense linear system. The  Adaptive Cross Approximation (ACA) is an effective technique for compressing  the system matrix thereby reducing the necessary storage as well as the number  of operations required to solve the system. Acceleration of the ACA MoM with NVIDIA GPUs can finally enable the  solution of "real world" scattering problems on a personal  workstation in a practical timeframe.
  Author:  Daniel  Faircloth (Georgia Tech Research  Institute)
A14 - A GPU Accelerated Continuous-based  Discrete Element Method for Elastodynamics Analysis
  The Continuum-based Distinct Element Method (CDEM) is a combination of the Finite Element Method (FEM) and the Discrete Element Method (DEM), mainly used in general structural analyses as well as landslide stability evaluations and coal and gas outburst analyses. Using CUDA and a GTX 285 graphics card, the GPU version achieves a speedup of hundreds of times.
  Author:  Zhaosong Ma  (Institute of Mechanics, Chinese Academy of  Sciences)
A15 - GPU Algorithms for NURBS Minimum  Distance and Clearance Computations
  We  present GPU algorithms and strategies for accelerating distance queries and  clearance computations on models made of trimmed NURBS surfaces. We provide a  generalized framework for using GPUs as co-processors in accelerating CAD  operations. The accuracy of our algorithm is based on the model space  precision, unlike earlier graphics algorithms that were based only on image  space precision. Our algorithms are at least an order of magnitude faster and  about two orders of magnitude more accurate than the commercial solid modeling  kernel ACIS.
  Author:  Adarsh  Krishnamurthy (University of California,  Berkeley)
A16 - Gate-Level Simulation with GP-GPUs
  This poster describes my research on how to leverage GP-GPU execution parallelism to achieve high performance in the time-consuming problem of gate-level simulation of digital hardware designs.
  Author:  Debapriya  Chatterjee (University of Michigan)
A17 - CUDA Implementation of Barrier Option Valuation using Jump-Diffusion Model and Brownian Bridge
  Impressive speedups of up to 100x using GPUs compared to CPUs are achieved by taking advantage of data parallelism, increased bandwidth and the ability to hide latency. We have implemented a Monte Carlo valuation of a barrier option modeled by a standard diffusion process with a jump diffusion term obeying an underlying Poisson process to account for rare events. In addition, a Brownian Bridge is incorporated to account for barrier crossings in between diffusion trajectories and to reduce bias. This option is representative of exotic options which lack a closed-form solution and are amenable to Monte Carlo type methods for valuation.
  Author: Vincent Natoli (Stone Ridge Technology)

Astronomy & Astrophysics

B01 - Black  Holes in Galactic Nuclei Simulated with Large GPU Clusters in CAS
  Many, if not all, galaxies harbour supermassive black holes. If galaxies merge, which is quite common in the process of hierarchical structure formation in the universe, their black holes sink to the centre of the merger remnant and form a tight binary. Depending on initial conditions and time, supermassive black hole binaries are prominent gravitational wave sources if they ultimately come close together and coalesce. We model such systems as gravitating N-body systems (stars) with two or more massive bodies (black holes), including if necessary relativistic corrections to the classical Newtonian gravitational forces (Kupi et al. 2006, Berentzen et al. 2009).
  Author: Rainer Spurzem (National Astronomical Observatories, Chinese Academy of Sciences)

Audio Processing

C01 - Exploring Recognition Network  Representations for Efficient Speech Inference on the GPU
  We explore two contending recognition network representations for speech inference engines: the linear lexical model (LLM) and the weighted finite state transducer (WFST) on NVIDIA GTX285 and GTX480 GPUs. We demonstrate that while an inference engine using the simpler LLM representation evaluates 22x more transitions per second than the advanced WFST representation, the simple structure of the LLM representation allows 4.7-6.4x faster evaluation and 53-65x faster operands gathering for each state transition. We illustrate that the performance of a speech inference engine based on the LLM representation is competitive with the WFST representation on highly parallel GPUs.
  Author:  Jike Chong  (Parasians, LLC)
C02 - Efficient Automatic Speech  Recognition on the GPU
  Automatic speech recognition (ASR) technology is emerging as a critical component in data analytics for the wealth of media data being generated every day. ASR-based applications contain fine-grained concurrency that has great potential to be exploited on the GPU. However, the state-of-the-art ASR algorithm involves a highly parallel graph traversal on an irregular graph with millions of states and arcs, making efficient parallel implementations highly challenging. We present four generalizable techniques: dynamic data-gather buffer, find-unique, lock-free data structures using atomics, and hybrid global/local task queues. When used together, these techniques can effectively resolve ASR implementation challenges on an NVIDIA GPU.
  Author: Jike Chong (Parasians, LLC)

Computational Fluid Dynamics

D01 - High-Order Unstructured Compressible  Flow Solver on the GPU
  The  objective of this project is to develop a scalable and efficient high-order  unstructured compressible flow solver for GPUs. The solver allows the  achievement of arbitrary order of accuracy for flows over complex geometries.  High-order solvers require more operations per degree of freedom, thus making  them highly suitable for massively parallel processors. Preliminary results  indicate speed-ups up to 70x with the Tesla C1060 compared to the Intel i7 CPU.  Memory access was optimized using shared and texture memory.
  Author:  Patrice  Castonguay (Stanford University)
D02 - Parallel 3D Geometric Multigrid  Solver on GPU Clusters
  An  investigation of the performance and scalability of a multigrid pressure  Poisson equation solver running on a GPU cluster.
  Author:  Dana Jacobsen  (Boise State University)
D03 - Acceleration of mesh-free CFD using  CUDA
  In this  work, the acceleration of a mesh-free Computational Fluid Dynamics (CFD) code  is performed using CUDA. The poster gives an overview of the CUDA implementation  strategy and the resulting performance increase.
  Author:  Gilles Civario  (Irish Centre for High-End Computing)
D04 - Airblast Modelling on Multiple Tesla  units
  We used  NVIDIA Tesla GPUs to accelerate the solution of hyperbolic partial differential  equations, with application to modelling airblast generated by industrial bench  mining operations. Parallelisation over multiple GPUs was achieved using MPI.
  Author:  Sean Lovett  (University of Cambridge)
D05 - Implementation of High-Order  Adaptive  CFD Methods on GPGPUs
  This  poster describes our implementation of adaptive high-order CFD methods on GPUs.  A speedup factor of up to 44 has been achieved for 2D flow problems.
  Author:  Z.J. Wang  (Iowa State University)
D06 - Computational Fluid Dynamics on GPU
  Computational Fluid Dynamics, an important branch of the HPC field, has a history of seeking and requiring higher computational performance. The traditional way to satisfy this quest is to use faster machines or supercomputers. Yet these approaches seem inconvenient and costly to many individual researchers. We investigated the use of GPUs to accelerate CFD codes and tested performance on the CUDA and OpenCL platforms. We have ported 2D cave flow, 2D Riemann, and 2D flow over a RAE2882 airfoil to the GPU and explored some GPU-specific optimization strategies. In most cases, a speedup of approximately 16x to 63x can be achieved.
  Author: Long Wang (Supercomputing Center, Chinese Academy of Sciences)

Computer Graphics

E01 - Dynamic and Implicit Trees for  Graphics and Visualization on the GPU
  We  propose a new way to represent trees that allows for faster algorithms, that  are simple to implement (especially on the GPU), and with a lower memory  overhead than previous approaches.  Using  our data structure, we have seen significant improvements in both volume ray  casting and ray tracing applications over previous state-of-the-art methods.
  Author:  Nathan  Andrysco (Purdue University)
E02 - Fragment-Parallel Composite and  Filter
  In this  poster, we describe our recent work in the area of programmable graphics  pipelines by presenting a fragment-parallel formulation of an A-buffer-style  composite and filter equation, and describe its implementation on a modern GPU.
  Author: Anjul Patney (University of California, Davis)

Computer Vision

F01 - Architecture Aware Design for a  Parallel Object Recognition System
  We have developed a parallel object recognition system using CUDA, achieving 70x-80x speedup against the original serial implementation. In order to optimize our implementation, we evaluated the performance of different parallelization strategies on some key computations in the object recognition system. Finally, we concluded that the parallel implementation performance is sensitive to input data properties. Therefore, we should dynamically adjust the parallelization strategy at runtime to optimize key computations.
  Author:  Bor-Yiing Su  (University of California, Berkeley)
F02 - Dense Point Trajectories by  GPU-Accelerated Large Displacement Optical Flow
  In this poster we discuss a method for computing point trajectories based on a fast parallel implementation of a recent optical flow algorithm that tolerates fast motion. The parallel implementation of large displacement optical flow runs about 78x faster than the serial C++ version. We use this implementation in a point tracking application. Our resulting technique tracks up to three orders of magnitude more points and is 46% more accurate than the Kanade-Lucas-Tomasi tracker. Compared to the Particle Video tracker, we achieve 66% better accuracy while retaining the ability to handle large displacements and running an order of magnitude faster.
  Author:  Narayanan  Sundaram (University of California,  Berkeley)
F03 - Visual Cortex on a Chip:  Large-scale, Real-Time Functional Models of Visual Cortex on a GPGPU
  Los Alamos National Laboratory’s Petascale Synthetic Visual Cognition project is exploring full-scale, real-time functional models of human visual cortex to understand how human vision achieves its accuracy, robustness and speed. Commercial-off-the-shelf hardware to support this modeling is rapidly improving, e.g., a teraflop GPGPU card costs ~$500 and is roughly the size of mouse cortex. We present results demonstrating image classification on UAV aerial video with a visual cortex model running on a 240-core NVIDIA GeForce GTX285, and see a >10x speed-up. As this technology continues to improve, cortical modeling on GPGPU devices has the potential to revolutionize computer vision.
  Author:  Steven Brumby  (Los Alamos National Laboratory)
F04 - Fermi in Action: Robust Background  Subtraction for Real-time Video Analysis
  Background subtraction is an important image processing step for video surveillance and many computer vision problems such as tracking and recognition. However, robust background subtraction that adapts well to changes in the environment is computationally intensive and consumes a large amount of memory, so its practical application is often limited. Here, we aim to expand its usage and tackle vision problems that require high-frame-rate cameras, such as real-time sports analysis and real-time object detection and recognition. Using recent advances in accelerator hardware (the NVIDIA Fermi architecture) and taking advantage of heterogeneous computing, we achieve performance good enough for use in these practical applications.
  Author:  Melvin Wong  (Institute for Infocomm Research)
F05 - Bridging Neuroscience and GPU  Computing to Build General Purpose Computer Vision
  The  construction of artificial vision systems and the study of biological vision  are naturally intertwined as they represent simultaneous efforts to forward-  and reverse-engineer systems with similar goals. Here, we present a  high-throughput approach to more expansively explore biologically-inspired models  by leveraging GPUs.  We show that this  approach can yield significant gains in performance on object and face  recognition (including "Labeled Faces in the Wild" challenge and  faces from Facebook), consistently outperforming the state-of-the-art.  We highlight how the application of flexible  programming tools, such as high-level scripting, template  metaprogramming/auto-tuning, can enable large performance gains, while managing  complexity for the developer.
  Author:  Nicolas Pinto  (Massachusetts Institute of Technology)
F06 - CUDA for Vision and Imaging Library
  CUVI Lib (CUDA for Vision and Imaging Library) is a software library that provides a set of GPU-accelerated computer vision and image processing functions. CUVI can be used either as an add-on library for NVIDIA's NPP (NVIDIA Performance Primitives), since it complements the functionality present in NPP, or as a standalone library ready to be plugged into end-user C/C++ applications.
  Author:  Salman Ul Haq  (TunaCode)
F07 - GPU-Friendly Multi-View Stereo  Reconstruction Using Surfel Representation and Graph Cuts
  We present a new surfel (surface element) based multi-view stereo algorithm which runs entirely on the GPU. We combine the flexibility of surfel-based 3D shape representation with global optimization by graph cuts in a single framework. The orientation of the constructed surfel candidates imposes an effective constraint that reduces the effect of the minimal surface bias. The entire processing pipeline is implemented on the latest GPU to speed up the processing significantly. Experimental results show that the proposed approach reconstructs the 3D shape of an object accurately and efficiently, running more than 100 times faster than on the CPU.
  Author:  In Kyu Park  (Inha University)
F08 - CUDA Accelerated Face Recognition
  A GPU  based implementation of a face recognition solution using PCA with Eigenfaces  algorithm.
  Author:  Jayadeep  Vijayan (NeST Software)
F09 - GPU Driven Dense Reconstruction for  Community Photo Collections
  We present a system to reconstruct dense 3D models from community photo collections. First, images are described using GIST and clustered using Hamming distances. Each of these clusters is geometrically verified and connected using geotags. Connected clusters are bundle adjusted, and the obtained registration is used to estimate depth maps that are finally fused to obtain dense 3D models. Each of the above steps, except bundle adjustment, is implemented in CUDA and runs on multiple GPUs. Our pipeline is two orders of magnitude faster, on an order of magnitude more images, than the state-of-the-art method.
  Author:  Jan-Michael  Frahm (University of North Carolina,  Chapel Hill)
F10 - Portable Central Vision Enhancement  System for Macular Degeneration Patients
  Vision enhancement systems are alternative visual aid devices that enhance the remaining vision of visually impaired subjects. Our aim is to develop a mobile central vision enhancement system for macular degeneration patients. Three different types of enhancement algorithms have been developed and their efficiency was tested on low vision patients. These three algorithms have been implemented on a portable low-power device. The NVIDIA Tegra system-on-a-chip has been chosen for this implementation.
  Author:  Chloe Vaniet  (Imperial College London)
F11 - Dense Stereo Vision on GPU
  A dense stereo vision system for a material-handling dual-arm industrial robot has been implemented, with rectification, stereo correspondence and 3D pose from depth ported to the GPU using CUDA.
  Author:  Esubalew  Bekele (Universal Robotics Inc.)
F12 - Upsampling Range Data in Dynamic  Environments
  We  present a flexible, parallelized method for fusing information from optical and  range sensors based on an accelerated high-dimensional filtering approach. Our  system takes as input a sequence of monocular camera images as well as a stream  of sparse range measurements as obtained from a laser or other sensor system.  Our method produces a dense, high-resolution depth map of the scene, automatically  generating confidence values for every interpolated depth point. We describe  how to integrate priors on object shape, motion and appearance and how to  achieve an efficient implementation using parallel processing hardware such as  GPUs.
  Author:  Hendrik  Dahlkamp (Stanford University)
F13 - GPU Accelerated Marker-less Motion  Capture
  In this work, we derive an efficient filtering algorithm for tracking human pose at 4-10 frames per second using a stream of monocular depth images. The key idea is to combine an accurate generative model (which is achievable in this setting using state-of-the-art GPU hardware) with a discriminative model that feeds data-driven evidence about body part locations. We describe a novel algorithm for propagating the noisy evidence about body part locations up the kinematic chain using the unscented transform. We provide extensive experimental results on 28 real-world sequences using automatic ground-truth annotations from a commercial motion capture system.
  Author:  Varun Ganapathi  (Stanford University)
F14 - 3D Facial Feature Modeling with  Active Appearance Models
  Active  Appearance Models (AAM) is a powerful tool for modeling and matching objects  under shape deformations and texture variations. It learns characteristics of  objects by building a compact statistical model from applying Principal  Component Analysis (PCA) to a set of labeled data. Although AAM has been widely  applied in the fields of computer vision, due to its flexible framework, it  still cannot satisfy the requirement of real-time situations. To alleviate this  problem, we address the computational complexity of the fitting procedure by  running the AAM optimization algorithm on a GPU using a hybrid CPU / GPU block  processing architecture.
  Author:  Tim Llewellynn  (nViso / EPFL)
F15 - OpenCV on GPU
  OpenCV is  a free open source library of computer vision algorithms. Recently a new module  consisting of functions implemented on GPU was introduced in OpenCV. It  consists of several methods for calculating stereo correspondence between two  images that is used to reconstruct a 3D scene. A simple block-matching  algorithm works up to 10x faster compared to a CPU implementation in OpenCV  providing real-time processing of HD stereo pairs on Tesla cards. Belief  propagation-based algorithms show 20-50x speedup compared to a CPU  implementation.
  Author: Anatoly Baksheev (ITEEZ)

Databases & Data Mining

G02 - Speculative Query Processing
  With an  increasing amount of data and user demands for fast query processing, the  optimization of database operations continues to be a challenging task. A  common optimization method is to leverage parallel hardware architectures. With  the introduction of general-purpose GPU computing, massively parallel hardware  has become available within commodity hardware. To efficiently exploit this  technology, we introduce the method of speculative query processing. This  speculative query processing works on index structures to efficiently support  heavily used database operations. To show the benefits and opportunities of our  approach, we present a fine and coarse grain implementation for  multidimensional queries.
  Author:  Peter Volk  (Technische Universität Dresden)
G03 - Virtual Local Stores
  We propose a mechanism to provide the benefits of a software-managed memory hierarchy on top of a hierarchy of hardware-managed caches. A virtual local store (VLS) is mapped into the virtual address space of a process and backed by physical main memory, but is stored in a partition of the hardware-managed cache when active. This reduces context switch cost, and allows VLSs to migrate with their process thread. The partition allocated to the VLS can be rapidly reconfigured without flushing the cache, allowing programmers to selectively use VLS in a library routine with low overhead.
  Author: Henry Cook (University of California, Berkeley)

Embedded & Automotive

H01 - Driver Assistance: Speed-Limit Sign  Recognition on the GPU
  We investigate the use of different GPU-based implementations for performing real-time speed limit sign recognition on a resource-constrained embedded system. The system recognized US and European Union speed limits with over 88% accuracy while running in real time. The system is hardware-accelerated using CUDA and OpenGL. It introduces a novel technique for detecting speed-limit signs which is only possible with the aid of GPU processing.
  Author:  Vladimir  Glavtchev (BMW)
H02 - Complex Automotive Applications
  The NVIDIA GPU architecture is becoming a very interesting hardware target for complex automotive applications. We implemented the same automotive application on several different hardware targets and analyzed the maximum frame rate and the effective CPU load. This paper shows how real-time applications like pedestrian detection and driving assistance benefit from a massively parallel “central” architecture like GPU/CUDA. Real-time performance and zero-delay transfers can be achieved using a fully asynchronous implementation. The same approach can multiply the application performance by the number of GPU devices present on the embedded system, at a reasonable power consumption.
  Author: Marius Vasiliu (University of Paris Sud)

High Performance Computing

I01 - A GPU-based Architecture for  Real-Time Data  Assessment at Synchrotron  Experiments
  Modern X-ray imaging cameras provide millions of pixels and several thousand frames per second. To process such an amount of information we have optimized the reconstruction software employed at the tomography beamlines of the ANKA and ESRF synchrotrons to use the computational power of modern graphics cards. Using GPUs as compute coprocessors we were able to reduce the reconstruction time by a factor of 30 and process a typical data set of 20GB in 40 seconds. The time needed for the first evaluation of the reconstructed sample is reduced significantly and quasi real-time visualization is now possible.
  Author:  Suren  Chilingaryan (Karlsruhe Institute of  Technology)
I02 - Automatic High-Performance GPU code  Generation using CUDA-CHiLL
  This poster presents a system to automatically generate high-performance GPU code starting from an input sequential loop nest computation. The compiler analyzes the input computation in C and automatically generates a set of equivalent code variants represented by transformation recipes. These recipes guide the underlying code transformation and generation framework to apply code transformations and ultimately produce CUDA code.
  We use the system to generate high-performing CUDA code for four BLAS functions, matrix transpose and convolution stencils. The results mostly outperform CUBLAS2.2/CUDA_SDK2.2 and naive GPU kernels, achieving up to 435GF(mm) with an average speedup of up to 1.78x.
  Author:  Malik M Khan  (USC/ UoU)
I03 - CSIRO Advances in GPU Computing.  What could you do with 256 GPUs?
  The Commonwealth Scientific and Industrial Research Organisation (CSIRO) is Australia's national science agency. CSIRO is currently applying GPU computing on a scale ranging from single-GPU workstations through to their 256-GPU cluster. This poster showcases some of CSIRO's work in the areas of GPU-accelerated biological imaging, image deconvolution, synchrotron science and CT reconstruction, and statistical inference in complex environmental models. Speedups of between 8x and 230x have been seen across these application areas using a broad range of GPU computing platforms.
  Author:  Luke Domanski  (CSIRO)
I04 - High Performance Agent-Based  Simulation with FLAME for the GPU
  The Flexible Large-scale Agent Modelling Environment for the GPU (FLAME GPU) addresses the performance and architecture limitations of previous work by presenting a flexible framework approach to ABM on the GPU. Most importantly it addresses the issue of agent heterogeneity through the use of a state machine based agent representation. This representation allows agents to be separated into associated state lists which are processed in batches to allow a very diverse population of agents whilst avoiding large divergence in parallel code kernels. The use of the GPU allows AB models to be visualised in real time, which further widens the application of ABM to real-time simulations.
  Author:  Paul Richmond  (University of Sheffield)
I05 - The Scalable HeterOgeneous Computing  (SHOC) Benchmark Suite
  SHOC is a  benchmark suite for heterogeneous systems. This poster describes the suite and presents recent performance  measurements.
  Author:  Kyle Spafford  (Oak Ridge National Lab)
I06 - HyperFlow: An Efficient Dataflow  Architecture for Multi CPU-GPU Systems
  We  propose a new pipeline architecture that can take advantage of the many  processing elements available in modern CPU-GPU systems to maximize performance  in visualization and computational tasks. Our architecture is very flexible and  allows the construction of classical parallel algorithms such as data streamers  and map/reduce templates. We also discuss examples and performance benchmarks  that demonstrate the potential of our system.
  Author:  Huy Vo  (University of Utah)
I07 - MPI-CUDA Applications Checkpointing
  We propose a checkpoint/restart tool for multi-GPU applications such as MPI-CUDA applications.
  Author:  Nguyen Toan  (Tokyo Institute of Technology)
I08 - Particle Simulations using DEM on  GPUs
  Particle-based numerical methods have been an emerging field since the GPU/CUDA technique became widely accepted in recent years.
  80% of all material used in pharmaceutical technology is powder. Numerical simulation of such material is possible using the Discrete Element Method (DEM). The main restrictions here are compute power and problem size. Only a few tens of thousands of particles lead to weeks to months of compute time in order to reflect processes of a few minutes in real time. DEM scales excellently with the massively parallel CUDA environment, enabling us to access the million-particle range in acceptable job runtimes.
  Author:  Charles Radeke  (University Graz)
I09 - Mastering Multi-GPU Computing on a  Torus Network
  We describe APEnet+, the new generation of our 3D torus network which scales up to tens of thousands of cluster nodes with linear cost. The basic component is a custom PCIe adapter with six high-speed links, designed around a programmable HW component (FPGA), a nice environment for studying integration techniques between GPUs and network interfaces. The high-level programming model is MPI, while a low-level RDMA API is also available.
  Author:  Davide  Rossetti (National Institute of Nuclear  Physics)
I10 - Poster: Atmospheric Modelling,  Simulation and Visualization using CUDA
  The Laboratory Meteorological Dynamics (LMD) weather model by CNRS is used extensively for research and weather forecasting purposes.
  Simulation of atmospheric climate is one of the most challenging computational tasks because of its numerical complexity and simulation time. To be usable in decision support, the numerical simulations must run faster than real time.
  Author:  Priyanka Sah  (Indian Institute of Technology, Delhi)
I11 - Automatic Program Generation for the  Fermi - DFT Transform
  The goal of SPIRAL is to push the limits of automation in software and hardware development and optimization of numerical kernels beyond what is possible with current tools. In this research, we address the problem of automatically generating efficient high-performance libraries for NVIDIA GPU architectures. Spiral generates code that automatically bypasses the architectural restrictions on GPUs (shared memory bank conflicts, global memory coalescing) and pushes code to the limits (maximum number of threads, register pressure, etc.). The procedure of code generation is fast, platform dependent, easy to rewrite and problem adaptable.
  Author:  Christos  Angelopoulos (Carnegie Mellon University)
I12 - Fast N-body Algorithms for Dynamic  Problems on the GPU
  We present an extension of the earlier algorithm by Gumerov & Duraiswami (J. Comput. Phys., 2008) which adapts the FMM to the GPU, where the data structures are efficiently generated on the GPU as well. Details and performance on current architectures will be presented.
  Author:  Qi Hu (University of Maryland)
I13 - GPU Acceleration of Cube Calculus  Operations
  In our current work, we present the first massively parallel, GPU-accelerated implementation of the Cube Calculus operations for multivalued and binary logic, also called Cube Calculus Machine (CCM). Substantial speedups of up to the order of 85x are achieved using the CUDA-enabled NVIDIA Tesla GPU compared to the CPU implementation on a sequential processor. CC is a very efficient and convenient mathematical formalism for representation, processing and synthesis of binary and multivalued logic which has significant applications in logic synthesis, image processing and machine learning. Thus, the massive speedups achieved using GPUs are very encouraging for building future parallel VLSI EDA systems.
  Author:  Vamsi Parasa  (Portland State University)
I14 - An Atomic Tesla
  We  examined the possibility of using an Atom-based host system to control a Tesla  S1070. Our simple benchmarks found that Atom-based systems should be viable for  codes with serial portions small enough to make Amdahl's Law irrelevant. Such  systems would have a much lower power draw than 'traditional' GPU clusters.
  Author:  Richard Edgar  (Massachusetts General Hospital)
I15 - ICHEC’s GPU Research: Porting of  Scientific Application on NVIDIA GPU
  ICHEC is the Irish National HPC centre, with a mission to provide both high performance computing resources and expertise for the Irish research community. In addition to its core mission of research enablement, ICHEC started an exploratory activity in GPGPU and CUDA programming in May 2009. Quantum Espresso is an increasingly popular molecular dynamics package, mainly developed by the DEMOCRITOS group in Trieste (IT). PWscf is the part of the Quantum Espresso suite which performs electronic and ionic structure calculations. An interesting part of the PWscf port is a high-performance [ZD]gemm which executes in parallel across CPU and GPU.
  Author:  Ivan Girotto  (Irish Centre for High-End Computing)
I16 - Implementation of Smith-Waterman  algorithm in OpenCL for GPUs
  This poster presents an implementation of the Smith-Waterman algorithm in OpenCL. This implementation is capable of computing similarity indexes between query sequences and a reference sequence with or without sequence alignment paths. In accordance with the requirements of the target application in cancer research, the implementation supports processing of very long reference sequences (on the order of millions of nucleotides). Performance compares favorably against CPU, being on the order of 14 to 610 times faster, and 4.5 times faster than Farrar's implementation. It is also on par with CUDASW++v2.0.1 performance, but with fewer constraints on sequence length.
  Author:  Dzmitry  Razmyslovich (Institute of Computer  Engineering, University of Heidelberg)
I17 - Computing Strongly Connected  Components in Parallel on CUDA
  The problem of decomposing a directed graph into its strongly connected components is a fundamental graph problem inherently present in many scientific and commercial applications. We show how existing parallel algorithms can be reformulated in order to be accelerated by NVIDIA CUDA technology. We design a new CUDA-aware procedure for pivot selection and we redesign the parallel algorithms in order to allow for CUDA-accelerated computation. We experimentally demonstrate that with a single GTX 280 GPU card we can easily outperform the optimal serial CPU algorithm.
  Author:  Milan Ceska  (Masaryk University)
I18 - A CUDA Runtime Target for the  Sequoia Compiler
  We  describe an implementation of the Sequoia Runtime interface in CUDA that  enables the Sequoia compiler to target programs written in Sequoia for single  and multiple GPU systems.
  Author:  Michael Bauer  (Stanford University)
I19 - GPU Computing for Real-Time Optical  Measurement Techniques
  Measuring displacement and strains during deformation of advanced materials which are too small, big, compliant, soft or hot are typical scenarios where non-contact techniques are needed. Using Digital Image Correlation and Tracking, strain can be calculated from a series of consecutive images with sub-pixel resolution. However, the image processing is a computation-intensive task and can't be performed in real time using general purpose processors. We implemented a 3-stage pipelined architecture: images are loaded, preprocessed using the CPU, and correlated on GPUs. Using two GTX295 cards we were able to reach a 35x speedup compared to the fastest Core i7 processor.
  Author:  Suren  Chilingaryan (Karlsruhe Institute of  Technology)
I20 - An MPI/CUDA Implementation of  Discontinuous Galerkin Time Domain Method for Maxwell’s Equations
  We describe an MPI/CUDA approach to solve Maxwell's equations in the time domain by means of an Interior Penalty Discontinuous Galerkin Time Domain Method and a local time stepping algorithm. We show that MPI/CUDA provides a 10x speedup versus MPI/CPU, in double precision. Moreover, we present scalability results and an 85% parallelization efficiency up to 40 GPUs on the Glenn cluster of the Ohio Supercomputing Center. Finally, we study an electromagnetic cloaking example for a broadband signal (8-11 GHz), to show the potential of our approach to solve real-life examples in short simulation times.
  Author:  Stylianos  Dosopoulos (Ohio State University)
I22 – Development and  Application of a Peta-Scale GPU Cluster for Multi-Scale Discrete Simulation –  Mole-8.5
  Mole-8.5 is the world's first petascale GPGPU supercomputer using Tesla C2050, designed and established in April 2010 by the Institute of Process Engineering (IPE), Chinese Academy of Sciences. It embodies a design philosophy utilizing the similarity between hardware, software and the problems to be solved, based on the multi-scale method and discrete simulation approaches developed at IPE. With the multi-scale discrete software developed by IPE, Mole-8.5 has already carried out large-scale simulations of high scientific significance covering areas such as chemical engineering, oil exploitation and metallurgy, demonstrating the supercomputer as a paradigm of green computation in innovative architecture.
  Author:  Xiaowei Wang (Institute of Process  Engineering, Chinese Academy of Sciences)
I23 – Early Linpack  Performance Benchmarking on IPE Mole-8.5 Fermi GPU Cluster
  Linpack is a de facto standard benchmark for supercomputers. We introduce the implementation and tuning technology of the Linpack benchmark on the IPE Mole-8.5 Cluster equipped with NVIDIA Tesla C2050 (Fermi) GPUs, including CPU/GPU overlap, streaming (pipeline) technology and CPU/GPU affinity. As a result, we achieved 207.3 TFlops and the IPE Mole-8.5 Cluster ranked No. 19 on the Top500 June 2010 list. In addition, we analyze the bottleneck of the Linpack benchmark on this system.
  Author:  Xianyi Zhang (Institute of Software, Chinese  Academy of Sciences)
I24  - Atomic Hedgehog:  Productive  High-Performance Computing with Python
  cl.oquence is a new programming language which embeds  OpenCL's semantics into Python as a library, allowing the intermixing of  dynamically typed Python code and statically typed OpenCL code and demonstrating  new concepts in programming language design. By utilizing automatic type  inference and other features, it aims to make programming highly productive  without sacrificing any of the performance associated with GPU languages. We  describe this system as well as an application of it to large-scale  simulations, particularly those used in theoretical neurobiology.
  Author:  Cyrus Omar (Carnegie Mellon University)

Imaging

J01 - Neurite Detection using CUDA, GPU  Accelerated Biological Imaging for High-Content Analysis
  The analysis of microscopic neurite structures in images is important for studying the effects of lead compounds on brain diseases or the regeneration of brain cells after trauma. In High-Content Analysis (HCA) 100s to 1000s of microscopy images are processed during automated experiments. The speed of the image processing in these situations greatly affects the workflow throughput. We report some early results on GPU acceleration of the Neurite Detection module in our group's HCA-Vision. The most time-consuming algorithm steps are accelerated by up to 13.6x, resulting in a 3.3x speedup for the entire algorithm (70% of theoretical maximum).
  Author:  Luke Domanski  (CSIRO)
J02 - Fast Radon Transform via Fast  Non-uniform FFTs on GPUs
  Fast  Radon Transform is required in X-ray Phase Contrast Tomography performed at the  Advanced Light Source, Lawrence Berkeley National Lab. We describe a fast  implementation based on fast non-uniform FFTs on GPUs.
  Author:  Chao Yang  (Lawrence Berkeley National Laboratory)
J03 - Projected Conjugate Gradient Solvers  on GPU and its Applications
  In this work, the focus is specifically on how to speed up the projected CG algorithm utilizing the GPU. It is shown that the projected CG method can be used within the single precision accuracy of the current GPU. One benefit gained through use of the projected CG is that it reduces the total number of matrix-vector multiplications, which is usually a bottleneck for an efficient GPU-based Krylov algorithm. A modified projection-based CG algorithm, which shows better performance, is further proposed in the thesis. Numerical results using the GPU are provided to support the proposed algorithm.
  Author:  Youzuo Lin  (Arizona State University)
J04 - Real-time Direct Georeferencing of  Images from Airborne Line Scan Cameras
  The Norwegian Defence Research Establishment (FFI) is developing a technology demonstrator for airborne real-time hyperspectral target detection. The system includes two nadir-pointing line scan cameras. The line-scanned images are georeferenced in real time by intersecting rays cast from the cameras with a 3D model of the terrain underneath. The georeferenced images may then easily be ortho-rectified (e.g. by using texture mapping in OpenGL) and overlaid on digital maps. This poster presents the performance of a CUDA implementation of the georeferencing method.
  Author:  Trym Vegard  Haavardsholm (Norwegian Defence Research  Establishment (FFI))
J05 - CUDA Acceleration of Color Histogram  Matching
  Histogram matching techniques are methods for adjusting color in a pair of images. They can be used as a preliminary stage for several video applications, for example 3D content creation. In such an application, two cameras separated by a known distance acquire video streams that can be combined in order to compute a depth map. As both cameras capture slightly different scenes, they can be lit by different sources, causing a possible color shift between their streams and thus penalizing the quality and the user experience. Our approach considers the use of an NVIDIA 3D broadcast solution system with professional HD cameras.
  Author:  Antonio Sanz (Universidad Rey Juan Carlos)

Life Sciences

K01 - Generalized Linear Model (GLM) Based  Quantitative Trait Locus (QTL) Analysis
  Relating Genotype to Phenotype in Complex Environments has been identified as one of the grand challenges of plant sciences. Under the umbrella of the iPlant Collaborative funded by the Plant Science Cyberinfrastructure Collaborative program of the NSF, our goal is to develop a GPU implementation of the General Linear Model (GLM) to statistically link genotype to phenotype and dramatically decrease the execution time for GLM analyses. The GPU-based, highly parallelized Forward Regression stage of the GLM achieved a 177x speedup over the Matlab-based serial version. Results of this study will enable larger, more intensive genetic mapping analyses to be conducted.
  Author:  Ali Akoglu  (University of Arizona)
K02 - GPU-REMuSiC: The Implementation of  Constrain Multiple Sequence Alignment on Graphics Processing Unit
  We implement the RE-MuSiC tool on multiple GPUs (called GPU-REMuSiC) with NVIDIA CUDA. With a special model implementation, the DP computation in GPU-REMuSiC running on one and two GeForce GTX 260 cards achieves speedups of more than 75x and 130x, respectively, compared to sequential RE-MuSiC running on an Intel i7 920 CPU.
  Author:  Chun-Yuan Lin  (Chang Gung University)
K03 – The Virtual Heart:  Working Towards Interactive CUDA Based Simulations of Cardiac Function
  Heart  disease is the leading cause of death in the developed world. Despite this, our  understanding of cardiac dysfunction is limited. Our goal is to create a  realistic virtual model of the heart to develop insight into this clinically  important problem. The computational complexity of the ‘virtual heart’ has been  prohibitive until very recently. However, the continued development of massive  parallelization using CUDA and GPU technology has now made this a realistic and  achievable goal.
  Author:  Stefano Charissis (Victor Chang Cardiac Research Institute)

Machine Learning & Artificial Intelligence

L01 - CUDA Creatures
  CUDA  Creatures applies parallel algorithms to the iterated Prisoner's Dilemma, a  classic study of the evolution of cooperation. We bring interactivity to  parameter space exploration by achieving 600x to 800x speedups on GTX 260.
  Author:  Andrew Hershberger (Stanford University)

Medical Imaging & Visualization

M01 - Real-time Ultrasound Data Processing for Regional Anesthesia Guidance
  Ultrasound  imaging techniques such as Doppler flow imaging and acoustic radiation force  impulse (ARFI) imaging require estimation of velocity or displacement from the  received echoes.  Real-time processing  and display of images allows for real-time guidance of procedures, improving patient  safety and efficacy.  Using CUDA, the  processing code has been implemented in pre-clinical regional anesthesia  studies investigating new methods for localizing where fluid is being injected.  The computation time has been reduced from 20  minutes to 18 seconds, resulting in the rapid display of dynamic images of the  fluid being injected.
  Author:  Stephen  Rosenzweig (Duke University)
M02 - GPU-Accelerated Texture  Decompression of Biomedical Image Stacks
  Histopathology is the microscopic examination of tissue in order to study the manifestations of disease. High-resolution images are vital for accurate diagnoses, and a major obstacle to the use of digital imaging in histopathology has been the inability to display these large images at interactive rates. We have created a tool for interactive visualization of biomedical image stacks using GPU-accelerated on-the-fly texture decompression. The image stacks are compressed using a novel approach custom-tailored for the data we are dealing with, i.e. data exhibiting exceptionally high coherence between the slices of each image stack.
  Author:  Chirantan  Ekbote (Harvard University)
M03 - Accelerated Large Scale Spherical  Model Forward Solutions for the EEG/MEG using CUDA
  The study  presented in the poster looks at the utility of a CUDA based approach to  improve the computational speed of the spherical model EEG and MEG forward  solution for large scale 3-D dipole grid (on order of 1000 and up) and sensor  locations (on order of 100 and up). Fast computation of the forward solution is  critical in improving the speed of the inverse solution in biosource imaging.  The inverse solution gives the location of the epileptogenic foci from the EEG  and MEG measurements.
  Author:  Nitin Bangera  (MIND Research Network)
M04 - CUDA Accelerated Real Time  Volumetric Cardiac Image Enhancement
  CUDA  enables high data rate real time volumetric cardiac ultrasound image  enhancement.  Substantial improvements in  processing data rate and memory bandwidth demand over a CPU based approach were  found with CUDA.
  Author:  Ismayil Guracar (Siemens Medical Solutions)
M05 - Efficient Visualization of Salient  Manifolds in Scalar, Vector, and Tensor Fields
  Our research focuses on harnessing the massively parallel compute power of the GPU to visually explore complex datasets. We propose adaptive GPU-based approaches that intertwine computation and rendering. Alongside, we present novel dynamic data structures for the GPU. Our research includes the visualization of salient structures in vector fields using LCS, extraction of ridge and valley surfaces from volumetric scalar fields with scale analysis, and efficient volume / surface rendering.
  Author:  Samer Barakat  (Purdue University)
M06 - Highly Parallel Image Reconstruction  for Positron Emission Tomography (PET)
  We  present a novel method of computing line projection operations required for  list-mode ordered-subsets expectation-maximization (OSEM) for fully 3-D PET  image reconstruction on a GPU using the CUDA framework. Our method overcomes  challenges such as compute thread divergence and exploits GPU capabilities such  as shared memory and atomic operations. This new GPU-CUDA implementation is  120X faster than a reference CPU implementation. The image quality is preserved  with root mean squared (RMS) deviation between the images generated using the  CPU and the GPU being 0.08%, which has negligible effect in typical clinical  applications.
  Author:  Jingyu Cui (Stanford University)

Molecular Dynamics

N01 - Energy Evaluation of Rosetta  Proteins Using CUDA
  In this  poster, we describe preliminary results using CUDA to accelerate the energy  evaluation of proteins folded by the Rosetta software suite.
  Author:  Will Kohut  (University of California, Davis)
N02 - GPU Accelerated Molecular Dynamics  Algorithms for Soft Matter Systems using HOOMD-Blue
  The rheological, thermodynamic, and self-assembly behavior of liquids, colloids, polymers, foams, gels, granular materials and biological systems is often studied in simulation using coarse-grained models based on molecular dynamics algorithms. The open source general purpose particle dynamics code HOOMD-Blue has been expanded to include the simulation techniques and pair potentials used to study this class of problems.
  Author:  Carolyn  Phillips (University of Michigan)
N03 – Accelerating  Molecular Modeling using GPUs
  Computing electrostatic  interactions in a biomolecule contributes towards the understanding of its  structure and function, e.g., ligand binding, complex formation, and proton  transport.  However, such calculations on a desktop computer can take on  the order of days, or even weeks, to run.  Consequently, scientists seek  to either reduce the algorithmic complexity, massively accelerate the  computation with a GPU, or both.  Our approach, based on an analytical  linearized Poisson Boltzmann algorithm, delivers a 120-fold speed-up on a GPU  (vs. a CPU-optimized -O3 with hand-tuned SSE).  When combined with our  hierarchical charge partitioning (HCP) multiscale method, however, the  delivered speed-up approaches 20,000-fold.
  Author:  Wuchun Feng (Virginia Tech)

Neuroscience

O01 - Distributed Multi-Level Out-of-Core  Volume Rendering
  In  neuroscience, scans of brain tissue are acquired using electron microscopy,  resulting in extremely high-resolution volume data with sizes of many  terabytes. To support the work of neurobiologists, interactive exploration of  such volumes requires new approaches for distributed out-of-core volume  rendering. A major goal of our distributed GPU volume rendering system is to  sustain a pixel-to-voxel ratio of about 1:1. This display-aware approach  effectively bounds the working set size required for ray-casting, which makes  it largely independent of the volume resolution. Currently, our system achieves  interactive volume rendering of 43GB and 92GB volumes on 1 to 8 Tesla nodes.
  Author:  Markus Hadwinger (King Abdullah University of Science and Technology)

Programming Languages & Techniques

P01 - GPU-to-CPU Callbacks
  Our  poster outlines GPU-to-CPU callbacks, a method for the GPU to request work from  the CPU. We give some motivation, demonstrate the code architecture, and give  samples of CPU and GPU code that show callbacks being executed.
  Author:  Jeff Stuart (University of California, Davis)

Physics Simulation

Q01 - Acceleration of Computational  Electromagnetics Physical Optics - Shooting and Bouncing Ray Method
  Electromagnetic fields radiated by a 1964 Ford Thunderbird are calculated over 50 times faster than on a standard CPU by using a Quadro FX 5800 GPU.
  Author:  Huan-Ting Meng  (University of Illinois at Urbana-Champaign)
Q02 - Massively Parallel Micromagnetic FEM  Calculations with Graphical Processing Units
  We  adapted our Micromagnetic Simulator "TetraMag" to NVIDIA's CUDA  architecture, resulting in a significant increase in calculation speed and cost  efficiency over the most recent PC-based machines. The poster gives an outline  of the general challenges and the methods used to adapt the solutions to GPUs  as well as benchmark results obtained using standard micromagnetic problems.
  Author:  Elmar Westphal  (Forschungszentrum Juelich)
Q03 - Multiplying Speedups:  GPU-Accelerated Fast Multipole BEM, for Applications in Protein Electrostatics
  We have  developed a fast multipole boundary element method (BEM) for biomolecular  electrostatics.  With GPU acceleration of  the FMM, there is a multiplicative speed-up resulting from the fast O(N)  algorithm and GPU hardware.  With this  method, we can obtain converged results for multi-million atom systems in less than  an hour, using multi-GPU clusters.
  Author:  Lorena Barba  (Boston University)
Q04 - GPU-Powered Control of a Compliant  Humanoid Robot
  The ECCEROBOT project deals with the construction and control of a robot with a humanoid skeleton and muscle-like compliant, elastic actuators. The nonlinear passive and active coupling between the skeletal elements, combined with the effect of environmental interaction, presents an extremely complex control problem. Our solution: motor programs are found using physics-based simulation of both the robot and its environment to locate candidate movements. For real-time control, multiple copies of the simulation must run faster than real time, requiring the use of GPU acceleration. Further, in order to capture the environment we use GPU-accelerated dense reconstruction vision.
  Author:  Alan Diamond (University Of Sussex, UK)

Programming Languages & Techniques

R01 - A Speech Recognition Application  Framework for Highly Parallel Implementations on the GPU
  Data layout, data placement, and synchronization processes are not usually part of a speech application expert's daily concerns. Yet failure to carefully take these concerns into account in a highly parallel implementation on the graphics processing unit (GPU) could mean an order of magnitude loss in application performance. We present an application framework for parallel programming of automatic speech recognition (ASR) applications that allows a speech application expert to effectively implement speech applications on the GPU, and demonstrate how the ASR application framework has enabled a Matlab/Java programmer to achieve a 20x speedup in application performance on a GPU.
  Author:  Jike Chong  (Parasians, LLC)
R02 - Scalable Computer Vision  Applications
  We are  developing a domain specific language for computer vision algorithms that  facilitates rapid implementation of algorithms that are scalable and portable  across CPU-GPU architectures.  The  presented approach significantly lowers the barrier of implementation of  computer vision algorithms for heterogeneous CPU-GPU architectures, and enables  a single implementation to automatically scale to use additional hardware as it  becomes available.
  Author:  Rami Mukhtar  (NICTA)
R03 - Language and Compiler Extensions for  Heterogeneous Computing
  GPGPU  architectures offer large performance gains over their traditional CPU  counterparts for many applications. However, current GPU programming models  present numerous challenges to the programmer: lower-level languages, explicit  data movement, loss of portability, and performance optimization  challenges.  In this paper, we present  novel methods and compiler transformations that increase productivity by  enabling users to easily program GPUs using the high productivity programming  language Chapel.
  Author:  Albert Sidelnik (University of Illinois at Urbana-Champaign)

Signal processing

S01 - Achieving 1 TFLOP for the Radio  Astronomy Correlator
  In this work we apply CUDA, using the Fermi architecture, to the problem of cross-correlation arising in radio astronomy. This accounts for the bulk of computation in radio astronomy, and is essentially described by vector outer-products. Traditionally this task is performed using FPGAs, and the goal of this work was to see how efficiently GPUs could be used for this task. We describe the tiling strategies and optimization techniques employed to maximize performance. We achieve in excess of 1 teraflop per second using a single GeForce GTX 480, which corresponds to 78% of peak performance.
  Author:  Michael Clark  (Harvard University)
S02 - CUDA Implementation of Software for  Identifying Post-Translational Modifications
  InsPecT is software for identifying post-translational modifications (PTMs) of proteins. With the help of the MS-Alignment algorithm, InsPecT can search for PTMs in unrestrictive mode and even reveal unknown types of modifications. However, MS-Alignment has a tremendous time complexity and takes more than 99% of the computing time of InsPecT. We accelerated MS-Alignment on GPUs. After optimization and parallelization with MPI, the result is cuda-InsPecT, a new, highly efficient open source software package based on MPI+CUDA.
  Author:  Long Wang (Supercomputing Center, Chinese Academy of Sciences)

Tools & Libraries

U01 - Mint: An OpenMP to CUDA Translator
  We aim to  facilitate GPU programming for finite difference applications. We have  developed Mint, a source to source compiler to generate CUDA code from OpenMP  code. Mint transforms omp parallel for loops into CUDA kernels and applies  domain specific optimizations such as shared memory, register and kernel fuse  optimizations. Since our translator targets structured grid problems, it  optimizes the code better than the general purpose compilers. In this poster,  we present translation and optimization steps along with our initial  performance results.
  Author:  Didem Unat  (University of California, San Diego)
U02 - Real-Time Particle Simulation in the  Blender Game Engine with OpenCL
  The goal  of this project is to produce interactive scientific visualizations that can be  used in educational games. We use the computational power of OpenCL to enable  features in the Blender Game Engine that would otherwise not be possible in  real-time. By adding an interactive particle system to the game engine, we set  the stage to demonstrate many interesting scientific phenomena (molecular  dynamics, fluid dynamics, statistics) with the added benefit of real-time  special effects for games in general.
  Author:  Ian Johnson  (Florida State University)
U03 - GStream: A General-Purpose Data Streaming Framework on GPU Clusters
  In this poster, we propose GStream, a general-purpose, scalable data streaming framework on GPUs. The contributions of GStream are as follows: (1) We provide powerful yet concise language abstractions suitable for describing conventional algorithms as streaming problems. (2) We project these abstractions onto GPUs to fully exploit their inherent massive data-parallelism. (3) We demonstrate the viability of streaming on accelerators. Experiments show that the proposed framework provides flexibility, programmability, and performance gains for various benchmarks from a variety of domains, including but not limited to data streaming, data parallel problems, numerical codes, and text search.
  Author:  Yongpeng Zhang  (North Carolina State University)
U04 - NukadaFFT: An Auto-Tuning FFT Library for CUDA GPUs
  We have released our FFT library for CUDA GPUs. Most of the algorithms and auto-tuning technologies for FFT on CUDA have already been published. The library now supports the new Fermi architecture and works with CUDA 3.0 or later.
  Author:  Akira Nukada  (Tokyo Institute of Technology)

Video Processing

V01 - Real-Time Color Space Conversion for High Resolution Video
  Color space conversion, or color correction, is a widely used technique for adapting the color characteristics of video material to the display technology employed (e.g. CRT, LCD, projection) or for creating a certain artistic look. Because color correction is often an interactive task and colorists need a direct response, state-of-the-art real-time color correction systems for video have so far been based on expensive dedicated hardware. This submission shows the feasibility of replacing dedicated color correction systems with general-purpose GPUs. It is shown that a single Tesla C2050 GPU supports real-time color correction up to a resolution of 4096x2048 pixels.
  Author:  Klaus Gaedke  (Technicolor)
V02 - 3D Object Detection in Digital Holographic Microscope Images
  Digital Holographic Microscopy (DHM) is based on the classical holographic principle invented by Hungarian physicist Dennis Gabor. The holographic images are acquired by a CCD camera. Depth slices can be reconstructed using a Fourier transform. The numerical reconstruction and further image processing for object detection are done using general-purpose graphics processing units (GPGPUs).
  Author:  Vilmos Szabo  (Pazmany Peter Catholic University)

nom8393 · posted on 2010-9-27 08:51

The posters are all just brief introductions, right? Are there no papers?

smsunny · posted on 2010-11-9 08:45

Urgently needed, thanks for uploading.
