<?xml version="1.0" encoding="ISO-8859-1"?>
<rss version="2.0">
<channel>
<title>IEEE Computer Architecture Letters</title>
<link>http://www.computer.org/cal</link>
<description>	</description>
	<language>en-us</language>
	<pubDate>Fri, 17 May 2013 10:00:05 GMT</pubDate>
	<image>
		<url>http://csdl.computer.org/common/images/logos/cal.gif</url>
		<title>IEEE Computer Society</title>
		<description>List of recently published journal articles</description>
		<link>http://www.computer.org/cal</link>
	</image>
  <item>
     <title>PrePrint: Cache-aware Roofline model: Upgrading the loft</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/L-CA.2013.6</link>
     <description>The Roofline model graphically represents the attainable upper bound performance of a computer architecture. This paper analyzes the original Roofline model and proposes a novel approach to provide a more insightful performance modeling of modern architectures by introducing cache-awareness, thus significantly improving the guidelines for application optimization. The proposed model was experimentally verified for different architectures by taking advantage of built-in hardware counters with a curve fitness above 90%.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/L-CA.2013.6</guid>
  </item>
  <item>
     <title>PrePrint: Software Transactional Memory for GPU Architectures</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/L-CA.2013.4</link>
     <description>To make applications with dynamic data sharing among threads benefit from GPU acceleration, we propose a novel software transactional memory system for GPU architectures (GPU-STM). The major challenges include ensuring good scalability with respect to the massive multithreading of GPUs, and preventing livelocks caused by the SIMT execution paradigm of GPUs. To this end, we propose (1) a hierarchical validation technique and (2) an encounter-time lock-sorting mechanism to deal with the two challenges, respectively. Evaluation shows that GPU-STM outperforms coarse-grain locks on GPUs by up to 20x.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/L-CA.2013.4</guid>
  </item>
  <item>
     <title>PrePrint: Energy Aware Race to Halt: A Down to EARtH Approach for Platform Energy Management</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/L-CA.2012.32</link>
     <description>The EARtH algorithm finds the optimal voltage and frequency operational point of the processor in order to achieve minimum energy of the computing platform. The algorithm is based on a theoretical model employing a small number of parameters, which are extracted from real systems using off-line and run-time methods. The model and algorithm have been validated on real systems using 45nm, 32nm and 22nm Intel&#x00AE; Core processors. The algorithm can save up to 44% energy compared with the commonly used fixed frequency policies.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/L-CA.2012.32</guid>
  </item>
  <item>
     <title>PrePrint: Accelerator Memory Reuse in the Dark Silicon Era</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/L-CA.2012.29</link>
     <description>Accelerators integrated on-die with General-Purpose CPUs (GP-CPUs) can yield significant performance and power improvements. Their extensive use, however, is ultimately limited by their area overhead; due to their high degree of specialization, the opportunity cost of investing die real estate on accelerators can become prohibitive, especially for general-purpose architectures. In this paper we present a novel technique aimed at mitigating this opportunity cost by allowing GP-CPU cores to reuse accelerator memory as a non-uniform cache architecture (NUCA) substrate. On a system with a last level-2 cache of 128 kB, our technique achieves on average a 25% performance improvement when reusing four 512 kB accelerator memory blocks to form a level-3 cache. Making these blocks reusable as NUCA slices incurs on average a 1.89% area overhead with respect to equally-sized ad hoc cache slices.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/L-CA.2012.29</guid>
  </item>
  <item>
     <title>PrePrint: Demystifying multicore throughput metrics</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/L-CA.2012.25</link>
     <description>Several different metrics have been proposed for quantifying the throughput of multicore processors. There is no clear consensus about which metric should be used. Some studies even use several throughput metrics. We show that there exists a relation between single-thread average performance metrics and throughput metrics, and that throughput metrics inherit the meaning or lack of meaning of the corresponding single-thread metric. We show that two popular throughput metrics, the weighted speedup and the harmonic mean of speedups, are inconsistent: they do not give equal importance to all benchmarks. Moreover we demonstrate that the weighted speedup favors unfairness. We show that the harmonic mean of IPCs, a seldom used throughput metric, is actually consistent and has a physical meaning. We explain under which conditions the arithmetic mean or the harmonic mean of IPCs can be used as a strong indicator of throughput increase.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/L-CA.2012.25</guid>
  </item>
  <item>
     <title>IEEE Computer Architecture Letters</title>
     <link>http://opac.ieeecomputersociety.org/opac?year=&amp;volume=&amp;issue=&amp;acronym=cal</link>
     <description>IEEE Computer Architecture Letters</description>
     <guid isPermaLink="true">http://www.computer.org/portal/site/cal/</guid>
  </item>
  <item>
     <title>PrePrint: Optimization of Application-Specific Memories</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/L-CA.2013.7</link>
     <description>Memory access times are the primary bottleneck for many applications today. This &#x0022;memory wall&#x0022; is due to the performance disparity between processor cores and main memory. To address the performance gap, we propose the use of custom memory subsystems tailored to the application rather than attempting to optimize the application for a fixed memory subsystem. Custom subsystems can take advantage of application-specific properties as well as memory-specific properties to improve access times or write-backs given constraints on size or power.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/L-CA.2013.7</guid>
  </item>
  <item>
     <title>PrePrint: Exploiting Virtual Addressing for Increasing Reliability</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/L-CA.2013.2</link>
     <description>This paper proposes a novel method to protect a system against soft errors in structures that store virtual addresses (VAs), such as translation lookaside buffers (TLBs), the physical register file (PRF), and the program counter (PC). The work is motivated by showing how soft errors impact the structures that store virtual page numbers (VPNs). The proposed solution uses linear block encoding as a virtual addressing scheme at link time. Using the encoding scheme to assign VPNs for VAs, it is shown that the system can tolerate soft errors using software with the help of the discussed decoding techniques applied to the page fault handler. The proposed solution can be used on all architectures that use virtually indexed addressing. The main contribution of this paper is a reduction in AVF of 42.5% for the data TLB, 40.3% for the instruction TLB, 69.2% for the PC, and 33.3% for the PRF.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/L-CA.2013.2</guid>
  </item>
  <item>
     <title>PrePrint: Exploiting Webpage Characteristics for Energy-Efficient Mobile Web Browsing</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/L-CA.2012.33</link>
     <description>Web browsing on mobile devices is undoubtedly the future. However, with the increasing complexity of webpages, the mobile device&#8217;s computation capability and energy consumption become major pitfalls for a satisfactory user experience. In this paper, we propose a mechanism to effectively leverage processor frequency scaling in order to balance the performance and energy consumption of mobile web browsing. This mechanism explores the performance and energy tradeoff in webpage loading, and schedules webpage loading according to the webpages&#8217; characteristics, using the different frequencies. The proposed solution achieves 20.3% energy saving compared to the performance mode, and improves webpage loading performance by 37.1% compared to the battery saving mode.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/L-CA.2012.33</guid>
  </item>
  <item>
     <title>PrePrint: A Case for a Value-Aware Cache</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/L-CA.2012.31</link>
     <description>Replication of values causes poor utilization of on-chip cache memory resources. This paper addresses the question: How much cache resources can be theoretically and practically saved if value replication is eliminated? We introduce the concept of value-aware caches and show that a sixteen times smaller value-aware cache can yield the same miss rate as a conventional cache. We then make a case for a value-aware cache design using Huffman-based compression. Since the value set is rather stable across the execution of an application, one can afford to reconstruct the coding tree in software. The decompression latency is kept short by our proposed novel pipelined Huffman decoder that uses canonical codewords. While the (loose) upper-bound compression factor is 5.2X, we show that, by eliminating cache-block alignment restrictions, it is possible to achieve a compression factor of 3.4X for practical designs.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/L-CA.2012.31</guid>
  </item>
  <item>
     <title>PrePrint: Block Unification IF-conversion for High Performance Architectures</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/L-CA.2012.28</link>
     <description>Graphics Processing Units accelerate data-parallel graphic calculations using wide SIMD vector units. IF-conversion is a compiler transformation, which converts control dependencies into data dependencies, and it is used by vectorizing compilers to eliminate control flow and enable efficient code generation. In this work we enhance the IF-conversion transformation by using a block unification method to improve the currently used block flattening method. Our experimental results demonstrate that our IF-conversion method is effective in reducing the number of predicated instructions and in boosting kernel execution speed.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/L-CA.2012.28</guid>
  </item>
  <item>
     <title>PrePrint: Multicore Model from Abstract Single Core Inputs</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/L-CA.2012.27</link>
     <description>This paper describes a first-order multicore model to project a tighter upper bound on performance than previous Amdahl's Law based approaches. The speedup over a known baseline is a function of the core performance, microarchitectural features, application parameters, chip organization, and multicore topology. The model is flexible enough to consider both CPU and GPU like organizations as well as modern topologies from symmetric to aggressive heterogeneous (asymmetric, dynamic, and fused) designs. This extended model incorporates first-order effects, exposing more bottlenecks than previous applications of Amdahl's Law while remaining simple and flexible enough to be adapted for many applications.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/L-CA.2012.27</guid>
  </item>
  <item>
     <title>PrePrint: A Hybrid PRAM and STT-RAM Cache Architecture for Extending the Lifetime of PRAM Caches</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/L-CA.2012.24</link>
     <description>To extend the lifetime of phase change RAM (PRAM) caches, we propose a hybrid cache architecture that integrates a relatively small capacity of spin transfer torque RAM (STT-RAM) write buffer with a PRAM cache. Our hybrid cache improves the endurance limitation of the PRAM cache by judiciously redirecting the write traffic from an upper memory layer to the STT-RAM write buffer. We have demonstrated through simulation that the proposed hybrid cache outperforms existing write-traffic reduction schemes with the same area overhead. Moreover, our approach is orthogonal to the existing schemes, providing an effective way of investing die area for cache lifetime extension by being used in combination with them.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/L-CA.2012.24</guid>
  </item>
  <item>
     <title>PrePrint: Clumsy Flow Control for High-Throughput Bufferless On-Chip Networks</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/L-CA.2012.22</link>
     <description>Bufferless on-chip networks are an alternative type of on-chip network organization that can improve the cost-efficiency of an on-chip network by removing router input buffers. However, bufferless on-chip network performance degrades at high load because of the increased network contention and large number of deflected packets. The energy benefit of bufferless networks is also reduced because of the increased deflection. In this work, we propose a novel flow control for bufferless on-chip networks in high-throughput manycore accelerator architectures to reduce the impact of deflection routing. By using a clumsy flow control (CFC), instead of the per-hop flow control that is commonly used in buffered on-chip networks, we are able to reduce the amount of deflection by up to 92% on high-throughput workloads. As a result, on average, CFC can approximately match the performance of a baseline buffered router while reducing the energy consumption by approximately 39%.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/L-CA.2012.22</guid>
  </item>
  <item>
     <title>PrePrint: GreenRouter: Reducing Power by Innovating Router&#8217;s Architecture</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/L-CA.2012.23</link>
     <description>High-speed routers in the Internet are becoming more powerful, as well as more energy hungry. In this paper, we present a new router architecture named GreenRouter, which separates a line card into two parts: a network interface card (DB) and a packet processing card (MB), connected by a two-stage switch fabric in the traffic flows&#8217; ingress and egress directions, respectively. Traffic from all DBs shares all the MBs in GreenRouter, and thus can be aggregated to a few active MBs on demand while the other MBs are shut down to save power. Several key issues of this new architecture are addressed. We evaluate the power saving efficiency and give preliminary simulation results. GreenRouter adapts well to traffic fluctuation, and real-trace evaluations over one week show that up to 63.7% power saving can be achieved while QoS constraints are guaranteed.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/L-CA.2012.23</guid>
  </item>
  <item>
     <title>PrePrint: High Performance, Energy Efficient Chipkill Correct Memory with Multidimensional Parity</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/L-CA.2012.21</link>
     <description>It is well-known that a significant fraction of server power is consumed in memory; this is especially the case for servers with chipkill correct memories. We propose a new chipkill correct memory organization that decouples correction of errors due to local faults that affect a single symbol in a word from correction of errors due to device-level faults that affect an entire column, sub-bank, or device. By using a combination of two codes that separately target these two fault modes, the proposed chipkill correct organization reduces code overhead by half as compared to conventional chipkill correct memories for the same rank size. Alternatively, this allows the rank size to be reduced by half while maintaining roughly the same total code overhead. Simulations using PARSEC and SPEC benchmarks show that, compared to a conventional double chipkill correct baseline, the proposed memory organization, by providing double chipkill correct at half the rank size, reduces power by up to 41%, 32% on average over a conventional baseline with the same chipkill correct strength and access granularity that relies on linear block codes alone, at only 1% additional code overhead.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/L-CA.2012.21</guid>
  </item>
  <item>
     <title>PrePrint: Metrics for Early-Stage Modeling of Many-Accelerator Architectures</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/L-CA.2012.9</link>
     <description>The term &#8220;Dark Silicon&#8221; has been coined to describe the threat to microprocessor performance caused by increasing transistor power density. Improving energy efficiency is now the primary design goal for all market segments of microprocessors from mobile to server. Specialized hardware accelerators, designed to run only a subset of workloads with orders of magnitude energy efficiency improvement, are seen as a potential solution. Selecting an ensemble of accelerators to best cover the workloads run on a platform remains a challenge. We propose metrics for accelerator selection derived from a detailed communication-aware performance model and present an automated methodology to populate this model. Employing a combination of characterized RTL and our selection metrics, we evaluate a set of accelerators for a sample application and compare performance to selections based on execution time and Pollack&#8217;s rule. We find that the architecture selected by our communication-aware metric shows improved performance over architectures selected based on execution time and Pollack&#8217;s rule, as they do not account for speedup being limited by communication.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/L-CA.2012.9</guid>
  </item>
  <item>
     <title>PrePrint: Compiler-Assisted, Selective Out-Of-Order Commit</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/L-CA.2012.8</link>
     <description>This paper proposes an out-of-order instruction commit mechanism using a novel compiler/architecture interface. The compiler creates instruction &#8220;blocks&#8221; guaranteeing some commit conditions and the processor uses the block information to commit certain instructions out of order. Micro-architectural support for the new commit mode is made on top of the standard, ROB-based processor and includes out-of-order instruction commit with register and load queue entry release. The commit mode may be switched multiple times during execution. Initial results for a 4-wide processor show that, on average, 52% of instructions are committed out of order, resulting in 10% to 26% speedups over in-order commit, with minimal hardware overhead. The performance improvement is a result of an effectively larger instruction window that allows more cache misses to be overlapped for both L1 and L2 caches.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/L-CA.2012.8</guid>
  </item>
  <item>
     <title>PrePrint: Shrink-Fit: A Framework for Flexible Accelerator Sizing</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/L-CA.2012.7</link>
     <description>RTL design complexity discouraged adoption of reconfigurable logic in general purpose systems, impeding opportunities for performance and energy improvements. Recent improvements to HLS compilers simplify RTL design and are easing this barrier. A new challenge will emerge: managing reconfigurable resources between multiple applications with custom hardware designs. In this paper, we propose a method to "shrink-fit" accelerators within widely varying fabric budgets. Shrink-fit automatically shrinks existing accelerator designs within small fabric budgets and grows designs to increase performance when larger budgets are available. Our method takes advantage of current accelerator design techniques and introduces a novel architectural approach based on fine-grained virtualization. We evaluate shrink-fit using a synthesized implementation of an IDCT for decoding JPEGs and show the IDCT accelerator can shrink by a factor of 16x with minimal performance and area overheads. Using shrink-fit, application designers can achieve the benefits of hardware acceleration with single RTL designs on FPGAs large and small.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/L-CA.2012.7</guid>
  </item>
  <item>
     <title>PrePrint: Enhanced Duplication: a Technique to Correct Soft Errors in Narrow Values</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/L-CA.2012.6</link>
     <description>Soft errors are transient errors that can alter the logic value of a register bit causing data corruption. They can be caused by radiation particles such as neutrons or alpha particles. Narrow values are commonly found in the data consumed or produced by processors. Several techniques have recently been proposed to exploit the unused bits in narrow values to protect them against soft errors. These techniques replicate the narrow value over the unused register bits such that errors can be detected when the value is duplicated and corrected when the value is tripled. In this letter, a technique that can correct errors when the narrow value is only duplicated is presented. The proposed approach stores a modified duplicate of the narrow value such that errors on the original value and the duplicate can be distinguished and therefore corrected. The scheme has been implemented at the circuit level to evaluate its speed and also at the architectural level to assess the benefits in correcting soft errors. The results show that the scheme is significantly faster than a parity check and can substantially improve the number of soft errors that are corrected compared to existing techniques.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/L-CA.2012.6</guid>
  </item>
  <item>
     <title>PrePrint: A New Optimal Worst-Case Throughput Bound for Oblivious Routing in Odd Radix Mesh Network</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/L-CA.2012.5</link>
     <description>1/2 network capacity is often believed to be the limit of worst-case throughput for mesh networks. However, this letter provides a new optimal worst-case throughput bound, which is higher than 1/2 network capacity, for odd radix two-dimensional mesh networks. In addition, we propose a routing algorithm called U2TURN that can achieve this worst-case throughput bound. U2TURN considers all routing paths with at most 2 turns and distributes the traffic loads uniformly in both X and Y dimensions. Theoretical analysis and simulation results show that U2TURN outperforms existing routing algorithms in worst-case throughput. Moreover, U2TURN achieves good average-throughput as well as good latency performance.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/L-CA.2012.5</guid>
  </item>
  <item>
     <title>PrePrint: Network-on-SSD: A Scalable and High-Performance Communication Design Paradigm for SSDs</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/L-CA.2012.4</link>
     <description>In recent years, flash memory solid state disks (SSDs) have shown great potential to change storage infrastructure because of their advantages of high speed and high-throughput random access. This promising storage, however, greatly suffers from performance loss because of frequent &#8220;erase-before-write&#8221; and &#8220;garbage collection&#8221; operations. Thus, novel circuit-level, architectural, and algorithmic techniques are currently being explored to address these limitations. In parallel with others, the current study investigates replacing shared buses in the multi-channel architecture of SSDs with an interconnection network to achieve scalable, high-throughput, and reliable SSD storage systems. Roughly speaking, such a communication scheme provides superior parallelism that allows us to compensate for the main part of the performance loss related to the aforementioned limitations by increasing data storage and retrieval processing throughput.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/L-CA.2012.4</guid>
  </item>
   </channel>
</rss>