<?xml version="1.0" encoding="ISO-8859-1"?>

<?xml-stylesheet href="/css/rss20.xsl" type="text/xsl"?>
<rss xmlns:pheedo="http://www.pheedo.com/namespace/pheedo" version="2.0">
<channel>
<title>IEEE Computer Architecture Letters</title>
<link>http://www.computer.org/cal</link>
<description>	</description>
	<language>en-us</language>
	<pubDate>Sun, 7 Sep 2008 10:00:02 GMT</pubDate>
	<image>
		<url>http://csdl.computer.org/common/images/logos/cal.gif</url>
		<title>IEEE Computer Society</title>
		<description>List of recently published journal articles</description>
		<link>http://www.computer.org/cal</link>
	</image>
  <item>
     <title>PrePrint: DDMR: Dynamic and Scalable Dual Modular Redundancy with Short Validation Intervals</title>
     <link>http://www.pheedo.com/click.phdo?i=4a995815b8a9b26a5cf1a500ad7b6d05</link>
<pheedo:origLink>http://doi.ieeecomputersociety.org/10.1109/L-CA.2008.12</pheedo:origLink>
     <description>Dual Modular Redundancy (DMR) has been proposed as a way to increase reliability. Classical DMR consists of pairs of cores that check each other and are pre-connected during manufacturing by dedicated links. In this paper we introduce the Dynamic Dual Modular Redundancy (DDMR) architecture. DDMR supports run-time scheduling of redundant threads, which has significant benefits relative to static binding. To allow dynamic pairing, DDMR replaces these dedicated links with a novel ring architecture. DDMR uses short instruction sequences for validation, smaller than the processor reorder buffer. Such short sequences reduce latencies in parallel programs and save resources needed to buffer uncommitted data. DDMR scales with the number of cores and may be used in large multicore architectures.</description>
     <guid isPermaLink="false">http://doi.ieeecomputersociety.org/10.1109/L-CA.2008.12</guid>
  </item>
  <item>
     <title>PrePrint: BENoC: A Bus-Enhanced Network on-Chip for a Power Efficient CMP</title>
     <link>http://www.pheedo.com/click.phdo?i=d387ca95b61f187441f818cc5f2bd3e9</link>
<pheedo:origLink>http://doi.ieeecomputersociety.org/10.1109/L-CA.2008.11</pheedo:origLink>
     <description>Networks-on-Chip (NoCs) outperform buses in terms of scalability, parallelism and system modularity, and are therefore considered the main interconnect infrastructure for future chip multi-processors (CMPs). However, while NoCs are very efficient at delivering high-throughput point-to-point data from sources to destinations, their multi-hop operation is too slow for latency-sensitive signals. In addition, current NoCs are inefficient for broadcast operations and centralized control of CMP resources. Consequently, state-of-the-art NoCs may not meet the needs of future CMP systems. In this paper, the benefit of adding a low-latency, customized shared bus as an internal part of the NoC architecture is explored. BENoC (Bus-Enhanced Network on-Chip) possesses two main advantages. First, the bus is inherently capable of performing broadcast transmission efficiently. Second, the bus has lower and more predictable propagation latency. To demonstrate the potential benefit of the proposed architecture, an analytical comparison of the power saving in BENoC versus a standard NoC providing similar services is presented. Then, simulation is used to evaluate BENoC in a dynamic non-uniform cache access (DNUCA) multiprocessor system.</description>
     <guid isPermaLink="false">http://doi.ieeecomputersociety.org/10.1109/L-CA.2008.11</guid>
  </item>
  <item>
     <title>PrePrint: Proactive Use of Shared L3 Caches to Enhance Cache Communications in Multi-Core Processors</title>
     <link>http://www.pheedo.com/click.phdo?i=7e37ea5e1dc23e89c3c4734670ffe8e4</link>
<pheedo:origLink>http://doi.ieeecomputersociety.org/10.1109/L-CA.2008.10</pheedo:origLink>
     <description>The software and hardware techniques to exploit the potential of multi-core processors are falling behind, even though the number of cores and cache levels per chip is increasing rapidly. There is no explicit communications support available, and hence inter-core communications depend on cache coherence protocols, resulting in demand-based cache line transfers with their inherent latency and overhead. In this paper, we present Software Controlled Eviction (SCE) to improve the performance of multithreaded applications running on multi-core processors by moving shared data to shared cache levels before it is demanded from remote private caches. Simulation results show that SCE offers significant performance improvement (8-28%) and reduces L3 cache misses by 88-98%.</description>
     <guid isPermaLink="false">http://doi.ieeecomputersociety.org/10.1109/L-CA.2008.10</guid>
  </item>
  <item>
     <title>PrePrint: Transaction-Aware Network-on-Chip Resource Reservation</title>
     <link>http://www.pheedo.com/click.phdo?i=fdf9417d025a604886fefdd663163c56</link>
<pheedo:origLink>http://doi.ieeecomputersociety.org/10.1109/L-CA.2008.9</pheedo:origLink>
     <description>Performance and scalability are critically important for on-chip interconnect in many-core chip-multiprocessor systems. Packet-switched interconnect fabric, widely viewed as the de facto on-chip data communication backplane in the many-core era, offers high throughput and excellent scalability. However, these benefits come at the price of router latency due to run-time multi-hop data buffering and resource arbitration. The network accounts for a majority of on-chip data transaction latency. In this work, we propose dynamic in-network resource reservation techniques to optimize run-time on-chip data transactions. This idea is motivated by the need to preserve existing abstraction and general-purpose network performance while optimizing for frequently occurring network events such as data transactions. Experimental studies using multithreaded benchmarks demonstrate that the proposed techniques can reduce on-chip data access latency by 28.4% on average in a 16-node system and 29.2% on average in a 36-node system.</description>
     <guid isPermaLink="false">http://doi.ieeecomputersociety.org/10.1109/L-CA.2008.9</guid>
  </item>
  <item>
     <title>PrePrint: Beyond Fat-tree: Unidirectional Load-Balanced Multistage Interconnection Network</title>
     <link>http://www.pheedo.com/click.phdo?i=234e941d7db93071705ad6fbe091d71b</link>
<pheedo:origLink>http://doi.ieeecomputersociety.org/10.1109/L-CA.2008.8</pheedo:origLink>
     <description>The fat-tree is one of the topologies most widely used by interconnection network manufacturers. Recently, it has been demonstrated that a deterministic routing algorithm that optimally balances the network traffic can not only match the performance of an adaptive routing algorithm but even outperform it. On the other hand, fat-trees require a high number of switches with non-negligible wiring complexity. In this paper, we propose replacing the fat-tree with a unidirectional multistage interconnection network (UMIN) that uses a traffic-balancing deterministic routing algorithm. As a consequence, switch hardware is reduced by almost half, which decreases the power consumption, the arbitration complexity, the switch size itself, and the network cost. Preliminary evaluation results show that the UMIN with the load-balancing scheme obtains lower latency than the fat-tree for low and medium traffic loads. Furthermore, in networks with a high number of stages or with high-radix switches, it obtains the same, or even higher, throughput than the fat-tree.</description>
     <guid isPermaLink="false">http://doi.ieeecomputersociety.org/10.1109/L-CA.2008.8</guid>
  </item>
  <item>
     <title>PrePrint: Hierarchical Instruction Register Organization</title>
     <link>http://www.pheedo.com/click.phdo?i=9f2539c44b32a50a4c679901e887833d</link>
<pheedo:origLink>http://doi.ieeecomputersociety.org/10.1109/L-CA.2008.7</pheedo:origLink>
     <description>This paper analyzes a range of architectures for efficient delivery of VLIW instructions for embedded media kernels. The analysis takes an efficient Filter Cache as a baseline and examines the benefits from 1) removing the tag overhead, 2) distributing the storage, 3) adding indirection, 4) adding efficient NOP generation, and 5) sharing instruction memory. The result is a hierarchical instruction register organization that provides 56% energy savings and 40% area savings over an already efficient Filter Cache.</description>
     <guid isPermaLink="false">http://doi.ieeecomputersociety.org/10.1109/L-CA.2008.7</guid>
  </item>
  <item>
     <title>PrePrint: Randomized Partially-Minimal Routing on Three-Dimensional Mesh Networks</title>
     <link>http://www.pheedo.com/click.phdo?i=fc0afb5ca624c180f8cb3be50a312a43</link>
<pheedo:origLink>http://doi.ieeecomputersociety.org/10.1109/L-CA.2008.6</pheedo:origLink>
     <description>This letter presents a new oblivious routing algorithm for 3D mesh networks called Randomized Partially-Minimal (RPM) routing that provably achieves optimal worst-case throughput for 3D meshes when the network radix k is even and within a factor of 1/k^2 of optimal when k is odd. Although this optimality result has been achieved with the minimal routing algorithm O1TURN [9] for the 2D case, the worst-case throughput of O1TURN degrades tremendously in higher dimensions. Other existing routing algorithms suffer from either poor worst-case throughput (DOR [10], ROMM [8]) or poor latency (VAL [14]). RPM, on the other hand, achieves near-optimal worst-case throughput and good average-case throughput, as well as good latency performance.</description>
     <guid isPermaLink="false">http://doi.ieeecomputersociety.org/10.1109/L-CA.2008.6</guid>
  </item>
  <item>
     <title>PrePrint: Pipelined Architecture for Multi-String Matching</title>
     <link>http://www.pheedo.com/click.phdo?i=b0b34fc8038311a024fdb9e416a2086c</link>
<pheedo:origLink>http://doi.ieeecomputersociety.org/10.1109/L-CA.2008.5</pheedo:origLink>
     <description>We present P-AC, a pipelined approach to hardware implementation of the Aho-Corasick (AC) algorithm for string matching. By incorporating pipelined processing, the state graph is reduced to a character trie that only contains forward edges. Edge reduction in P-AC is very impressive and is guaranteed algorithmically. For a signature set with 4434 strings extracted from the Snort rule set, the memory cost of P-AC is only 21.5 bits/char. The simplicity of the pipeline control plus the availability of 2-port memories allows us to implement two pipelines sharing the set of lookup tables on the same device. By doing so, the system throughput can be doubled with little overhead. The throughput of our method is up to 8.8 Gbps when the system is implemented on a 550 MHz FPGA.</description>
     <guid isPermaLink="false">http://doi.ieeecomputersociety.org/10.1109/L-CA.2008.5</guid>
  </item>
  <item>
     <title>PrePrint: A Parallel Deadlock Detection Algorithm with O(1) Overall Run-time Complexity</title>
     <link>http://www.pheedo.com/click.phdo?i=d40d3dcfda71d271a16739cdba0cade8</link>
<pheedo:origLink>http://doi.ieeecomputersociety.org/10.1109/L-CA.2008.4</pheedo:origLink>
     <description>This article proposes a novel parallel, hardware-oriented deadlock detection algorithm for multiprocessor systems-on-chip. The proposed algorithm takes full advantage of hardware parallelism in computation and maintains the information needed for deadlock detection by classifying all resource allocation events and performing class-specific operations, which together make the overall run-time complexity of the new method O(1). We implement the proposed algorithm in Verilog HDL and demonstrate in simulation that each algorithm invocation takes at most four clock cycles in hardware.</description>
     <guid isPermaLink="false">http://doi.ieeecomputersociety.org/10.1109/L-CA.2008.4</guid>
  </item>
  <item>
     <title>IEEE Computer Architecture Letters - January-June 2008 (Vol. 7, No. 1)</title>
     <link>http://www.computer.org/portal/site/cal/</link>
     <description>IEEE Computer Architecture Letters</description>
     <guid isPermaLink="true">http://www.computer.org/portal/site/cal/</guid>
  </item>
   </channel>
</rss>