<?xml version="1.0" encoding="ISO-8859-1"?>
<rss version="2.0">
<channel>
<title>IEEE Computer Architecture Letters</title>
<link>http://www.computer.org/cal</link>
<description>	</description>
	<language>en-us</language>
	<pubDate>Wed, 4 Jan 2012 11:00:01 GMT</pubDate>
	<image>
		<url>http://csdl.computer.org/common/images/logos/cal.gif</url>
		<title>IEEE Computer Society</title>
		<description>List of recently published journal articles</description>
		<link>http://www.computer.org/cal</link>
	</image>
  <item>
     <title>PrePrint: Leveraging Sharing in Second Level Translation-Lookaside Buffers for Chip Multiprocessors</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/L-CA.2011.35</link>
     <description>Traversing the page table during virtual-to-physical address translation causes significant pipeline stalls when misses occur in the translation-lookaside buffer (TLB). To mitigate this penalty, we propose a fast, scalable, multi-level TLB organization that leverages page sharing behaviors and performs efficient TLB entry placement. Our proposed partial sharing TLB (PSTLB) reduces TLB misses by around 60%. PSTLB also improves TLB performance by nearly 40% compared to traditional private TLBs and 17% over the state-of-the-art scalable TLB proposal.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/L-CA.2011.35</guid>
  </item>
  <item>
     <title>PrePrint: B-Fetch: Branch Prediction Directed Prefetching for In-Order Processors</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/L-CA.2011.33</link>
     <description>Computer architecture is beset by two opposing trends. Technology scaling and deep pipelining have led to high memory access latencies; meanwhile, power and energy considerations have revived interest in traditional in-order processors. In-order processors, unlike their superscalar counterparts, do not allow execution to continue around data cache misses. In-order processors, therefore, suffer a greater performance penalty in the light of the current high memory access latencies. Memory prefetching is an established technique to reduce the incidence of cache misses and improve performance. In this paper, we introduce B-Fetch, a new technique for data prefetching which combines branch prediction based lookahead deep path speculation with effective address speculation, to efficiently improve performance in in-order processors. Our results show that B-Fetch improves performance by 38.8% on SPEC CPU2006 benchmarks, beating a current, state-of-the-art prefetcher design at ~1/3 the hardware overhead.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/L-CA.2011.33</guid>
  </item>
  <item>
     <title>PrePrint: Cache Impacts of Datatype Acceleration</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/L-CA.2011.25</link>
     <description>Hardware acceleration is a widely accepted solution for performance and energy efficient computation because it removes unnecessary hardware for general computation while delivering exceptional performance via specialized control paths and execution units. The spectrum of accelerators available today ranges from coarse-grain off-load engines such as GPUs to fine-grain instruction set extensions such as SSE. This research explores the benefits and challenges of managing memory at the data-structure level and exposing those operations directly to the ISA. We call these instructions Abstract Datatype Instructions (ADIs). This paper quantifies the performance and energy impact of ADIs on the instruction and data cache hierarchies. For instruction fetch, our measurements indicate that ADIs can result in 21&amp;#8211;48% and 16&amp;#8211;27% reductions in instruction fetch time and energy respectively. For data delivery, we observe a 22&amp;#8211;40% reduction in total data read/write time and 9&amp;#8211;30% in total data read/write energy.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/L-CA.2011.25</guid>
  </item>
  <item>
     <title>PrePrint: Atomic Streaming: A Framework of On-Chip Data Supply System for Task-Parallel MPSoCs</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/L-CA.2011.21</link>
     <description>State-of-the-art fabrication technology for integrating numerous hardware resources such as Processors/DSPs and memory arrays into a single chip enables the emergence of the Multiprocessor System-on-Chip (MPSoC). The stream programming paradigm based on MPSoC is highly efficient for single-functionality scenarios due to its dedicated and predictable data supply system. However, when memory traffic is heavily shared among parallel tasks in applications with multiple interrelated functionalities, performance suffers through task interference and shared memory congestion, which lead to poor parallel speedup and memory bandwidth utilization. This paper proposes a framework of stream processing based on-chip data supply system for task-parallel MPSoCs. In this framework, stream address generations and data computations are decoupled and parallelized to allow full utilization of on-chip resources. Task granularities are dynamically tuned to jointly optimize the overall application performance. Experiments show that the proposed framework and the tuning scheme are effective for joint optimization in task-parallel MPSoCs.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/L-CA.2011.21</guid>
  </item>
  <item>
     <title>PrePrint: Mitigating the Effects of Process Variation in Ultra-low Voltage Chip Multiprocessors using Dual Supply Voltages and Half-Speed Units</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/L-CA.2011.36</link>
     <description>Energy efficiency is a primary concern for microprocessor designers. One very effective approach to improving energy efficiency is to lower chip supply voltage very near to the transistor threshold voltage. This reduces power consumption dramatically, improving energy efficiency by an order of magnitude. Low voltage operation, however, increases the effects of parameter variation, resulting in significant frequency heterogeneity between otherwise identical cores. This heterogeneity severely limits the maximum frequency of the entire CMP. We present a combination of techniques aimed at reducing the effects of variation on the performance and energy efficiency of near-threshold, many-core CMPs. Dual Voltage Rail mitigates core-to-core variation with a dual-rail power delivery system that allows post-manufacturing assignment of different supply voltages to individual cores. This speeds up slow cores by assigning them to a higher voltage and saves power on fast cores by assigning them to a lower voltage. Half-Speed Unit mitigates within-core variation by halving the frequency of select functional blocks with the goal of boosting the frequency of individual cores, thus raising the frequency ceiling for the entire CMP. Together, these variation-reduction techniques result in almost 50% improvement in CMP performance for the same power consumption over a mix of workloads.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/L-CA.2011.36</guid>
  </item>
  <item>
     <title>PrePrint: Instruction Shuffle Achieving MIMD-like Performance on SIMD Architectures</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/L-CA.2011.34</link>
     <description>SIMD architectures are less efficient for applications with diverse control-flow behavior, which can be mainly attributed to the requirement of identical control-flow. In this paper, we propose a novel instruction shuffle architecture that features an efficient control-flow handling mechanism. The cornerstones are a shuffle source instruction buffer array and a shuffle unit. The shuffle unit can concurrently deliver instructions of multiple distinct control-flows from the instruction buffer array to eligible SIMD lanes. Our instruction shuffle scheme combines the best attributes of both the SIMD and MIMD execution paradigms. Applications with diverse control-flows are evaluated on our instruction shuffle architecture. Experimental results show that the architecture can achieve 74% performance improvement on average, at the cost of only 5.8% area overhead introduced by the instruction shuffling mechanism.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/L-CA.2011.34</guid>
  </item>
  <item>
     <title>PrePrint: DRAM Scheduling Policy for GPGPU Architectures Based on a Potential Function</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/L-CA.2011.32</link>
     <description>GPGPU architectures (applications) have several different characteristics compared to traditional CPU architectures (applications): highly multithreaded architectures and SIMD-execution behavior are the two important characteristics of GPGPU computing. In this paper, we propose a potential function that models the DRAM behavior in GPGPU architectures and a DRAM scheduling policy, Alpha-SJF, to minimize the potential function. The scheduling policy essentially chooses between SJF and FR-FCFS at run-time based on the number of requests from each thread and whether the thread has a row buffer hit.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/L-CA.2011.32</guid>
  </item>
  <item>
     <title>IEEE Computer Architecture Letters - July-December 2011 (Vol. 10, No. 2)</title>
     <link>http://opac.ieeecomputersociety.org/opac?year=2011&amp;volume=10&amp;issue=02&amp;acronym=cal</link>
     <description>IEEE Computer Architecture Letters</description>
     <guid isPermaLink="true">http://www.computer.org/portal/site/cal/</guid>
  </item>
  <item>
     <title>PrePrint: Including Variability in Large-Scale Cluster Power Models</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/L-CA.2011.27</link>
     <description>Studying the energy efficiency of large-scale computer systems requires models of the relationship between resource utilization and power consumption. Prior work on power modeling assumes that models built for a single node will scale to larger groups of machines. However, we find that inter-node variability in homogeneous clusters leads to substantially different models for different nodes. Moreover, ignoring this variability will result in significant prediction errors when scaled to the cluster level. We report on inter-node variation for four homogeneous five-node clusters using embedded, laptop, desktop, and server processors. The variation is manifested quantitatively in the prediction error and qualitatively on the resource utilization variables (features) that are deemed relevant for the models. These results demonstrate the need to sample multiple machines in order to produce accurate cluster models.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/L-CA.2011.27</guid>
  </item>
  <item>
     <title>PrePrint: An Overview of Static Pipelining</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/L-CA.2011.26</link>
     <description>A new generation of mobile applications requires reduced energy consumption without sacrificing execution performance. In this paper, we propose to respond to these conflicting demands with an innovative statically pipelined processor supported by an optimizing compiler. The central idea of the approach is that the control during each cycle for each portion of the processor is explicitly represented in each instruction. Thus the pipelining is in effect statically determined by the compiler. The benefits of this approach include simpler hardware and that it allows the compiler to perform optimizations that are not possible on traditional architectures. The initial results indicate that static pipelining can significantly reduce power consumption without adversely affecting performance.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/L-CA.2011.26</guid>
  </item>
  <item>
     <title>PrePrint: A High-Level Power Model for MPSoC on FPGA</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/L-CA.2011.24</link>
     <description>This paper presents a framework for high-level power estimation of multiprocessor systems-on-chip (MPSoC) architectures on FPGA. The technique is based on abstract execution profiles, called event signatures. As a result, it is capable of achieving good evaluation performance, thereby making the technique highly useful in the context of early system-level design space exploration. We have integrated the power estimation technique in a system-level MPSoC synthesis framework. Using this framework, we have designed a range of different candidate MPSoC architectures and compared our power estimation results to those from real measurements on a Virtex-6 FPGA board.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/L-CA.2011.24</guid>
  </item>
  <item>
     <title>PrePrint: A HW/SW Co-designed Programmable Functional Unit</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/L-CA.2011.23</link>
     <description>In this paper, we propose a novel programmable functional unit (PFU) to accelerate general purpose application execution on a modern out-of-order x86 processor. Code is transformed and instructions are generated that run on the PFU using a co-designed virtual machine (Cd-VM). Results presented in this paper show that this HW/SW co-designed approach produces average speedups in performance of 29% in SPECFP and 19% in SPECINT, and up to 55%, over a modern out-of-order processor.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/L-CA.2011.23</guid>
  </item>
  <item>
     <title>PrePrint: A Case for Hybrid Discrete-Continuous Architectures</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/L-CA.2011.22</link>
     <description>Current technology trends indicate that power- and energy-efficiency will limit chip throughput in the future. Current solutions to these problems, either in the way of programmable or fixed-function digital accelerators, will soon reach their limits as microarchitectural overheads are successively trimmed. A significant departure from current computing methods is required to carry forward computing advances beyond digital accelerators. In this paper we describe how the energy-efficiency of a large class of problems can be improved by employing a hybrid of the discrete and continuous models of computation instead of the ubiquitous, traditional discrete model of computation. We present preliminary analysis of domains and benchmarks that can be accelerated with the new model. Analysis shows that machine learning, physics and up to one-third of SPEC, RMS and Berkeley suite of applications can be accelerated with the new hybrid model.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/L-CA.2011.22</guid>
  </item>
  <item>
     <title>PrePrint: Decoupling Datacenter Studies from Access to Large-Scale Applications: A Modeling Approach for Storage Workloads</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/L-CA.2011.37</link>
     <description>The cost and power impact of suboptimal storage configurations is significant in datacenters (DCs). Designing performance, power and cost-optimized systems requires a deep understanding of target workloads, and mechanisms to effectively model different storage design choices. Traditional benchmarking is invalid in cloud data-stores, representative storage profiles are hard to obtain, while replaying applications in all storage configurations is highly impractical. However, current workload generators cannot reproduce key aspects of real application patterns (e.g., spatial/temporal locality). We propose a modeling and characterization framework for large-scale storage applications. We use a state diagram-based model, extend it to a hierarchical representation and implement a tool that consistently recreates DC application I/O loads. We present the framework and the validation process performed against ten original DC application traces. Finally, we examine two use cases of this methodology: the benefits of SSD caching and defragmentation in enterprise storage. Since knowledge of the workload&amp;#8217;s spatial and temporal locality is necessary to model these use cases, our framework was instrumental in quantifying their performance benefits. The proposed methodology provides detailed understanding of the storage activity of large-scale applications and enables a wide spectrum of storage studies without the requirement to access application code and full application deployment.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/L-CA.2011.37</guid>
  </item>
   </channel>
</rss>
