Contents

Preface
Acknowledgments

CHAPTER 1 The Linux/ARM embedded platform
1.1 Performance-Oriented Programming
1.2 ARM Technology
1.3 Brief History of ARM
1.4 ARM Programming
1.5 ARM Instruction Set Architecture
1.5.1 ARM general purpose registers
1.5.2 Status register
1.5.3 Memory addressing modes
1.5.4 GNU ARM assembler
1.6 Assembly Optimization #1: Sorting
1.6.1 Reference implementation
1.6.2 Assembly implementation
1.6.3 Result verification
1.6.4 Analysis of compiler-generated code
1.7 Assembly Optimization #2: Bit Manipulation
1.8 Code Optimization Objectives
1.8.1 Reducing the number of executed instructions
1.8.2 Reducing average CPI
1.9 Runtime Profiling with Performance Counters
1.9.1 ARM performance monitoring unit
1.9.2 Linux Perf_Event
1.9.3 Performance counter infrastructure
1.10 Measuring Memory Bandwidth
1.11 Performance Results
1.12 Performance Bounds
1.13 Basic ARM Instruction Set
1.13.1 Integer arithmetic instructions
1.13.2 Bitwise logical instructions
1.13.3 Shift instructions
1.13.4 Movement instructions
1.13.5 Load and store instructions
1.13.6 Comparison instructions
1.13.7 Branch instructions
1.13.8 Floating-point instructions
1.14 Chapter Wrap-Up
Exercises

CHAPTER 2 Multicore and data-level optimization: OpenMP and SIMD
2.1 Optimization Techniques Covered by this Book
2.2 Amdahl's Law
2.3 Test Kernel: Polynomial Evaluation
2.4 Using Multiple Cores: OpenMP
2.4.1 OpenMP directives
2.4.2 Scope
2.4.3 Other OpenMP directives
2.4.4 OpenMP synchronization
2.4.5 Debugging OpenMP code
2.4.6 The OpenMP parallel for pragma
2.4.7 OpenMP with performance counters
2.4.8 OpenMP support for the Horner kernel
2.5 Performance Bounds
2.6 Performance Analysis
2.7 Inline Assembly Language in GCC
2.8 Optimization #1: Reducing Instructions per Flop
2.9 Optimization #2: Reducing CPI
2.9.1 Software pipelining
2.9.2 Software pipelining Horner's method
2.10 Optimization #3: Multiple Flops per Instruction with Single Instruction, Multiple Data
2.10.1 ARM11 VFP short vector instructions
2.10.2 ARM Cortex NEON instructions
2.10.3 NEON intrinsics
2.11 Chapter Wrap-Up
Exercises

CHAPTER 3 Arithmetic optimization and the Linux Framebuffer
3.1 The Linux Framebuffer
3.2 Affine Image Transformations
3.3 Bilinear Interpolation
3.4 Floating-Point Image Transformation
3.4.1 Loading the image
3.4.2 Rendering frames
3.5 Analysis of Floating-Point Performance
3.6 Fixed-Point Arithmetic
3.6.1 Fixed point versus floating point: Accuracy
3.6.2 Fixed point versus floating point: Range
3.6.3 Fixed point versus floating point: Precision
3.6.4 Using fixed point
3.6.5 Efficient fixed-point addition
3.6.6 Efficient fixed-point multiplication
3.6.7 Determining radix point position
3.6.8 Range and accuracy requirements for image transformation
3.6.9 Converting from floating-point to fixed-point arithmetic
3.7 Fixed-Point Performance
3.8 Real-Time Fractal Generation
3.8.1 Pixel coloring
3.8.2 Zooming in
3.8.3 Range and accuracy requirements
3.9 Chapter Wrap-Up
Exercises

CHAPTER 4 Memory optimization and video processing
4.1 Stencil Loops
4.2 Example Stencil: The Mean Filter
4.3 Separable Filters
4.3.1 Gaussian blur
4.3.2 The Sobel filter
4.3.3 The Harris corner detector
4.3.4 Lucas-Kanade optical flow
4.4 Memory Access Behavior of 2D Filters
4.4.1 2D data representation
4.4.2 Filtering along the row
4.4.3 Filtering along the column
4.5 Loop Tiling
4.6 Tiling and the Stencil Halo Region
4.7 Example 2D Filter Implementation
4.8 Capturing and Converting Video Frames
4.8.1 YUV and chroma subsampling
4.8.2 Exporting tiles to the frame buffer
4.9 Video4Linux Driver and API
4.10 Applying the 2D Tiled Filter
4.11 Applying the Separated 2D Tiled Filter
4.12 Top-Level Loop
4.13 Performance Results
4.14 Chapter Wrap-Up
Exercises

CHAPTER 5 Embedded heterogeneous programming with OpenCL
5.1 GPU Microarchitecture
5.2 OpenCL
5.3 OpenCL Programming Model, Idioms, and Abstractions
5.3.1 The host/device programming model
5.3.2 Error checking
5.3.3 Platform layer: Initializing the platforms
5.3.4 Platform layer: Initializing the devices
5.3.5 Platform layer: Initializing the context
5.3.6 Platform layer: Kernel control
5.3.7 Platform layer: Kernel compilation
5.3.8 Platform layer: Device memory allocation
5.4 Kernel Workload Distribution
5.4.1 Device memory
5.4.2 Kernel parameters
5.4.3 Kernel vectorization
5.4.4 Parameter space for Horner kernel
5.4.5 Kernel attributes
5.4.6 Kernel dispatch
5.5 OpenCL Implementation of Horner's Method: Device Code
5.5.1 Verification
5.6 Performance Results
5.6.1 Parameter exploration
5.6.2 Number of workgroups
5.6.3 Workgroup size
5.6.4 Vector size
5.7 Chapter Wrap-Up
Exercises

Appendix A Adding PMU support to Raspbian for the Generation 1 Raspberry Pi
Appendix B NEON intrinsic reference
Appendix C OpenCL reference
Index