<?xml version="1.0" encoding="ISO-8859-1"?>
<rss version="2.0">
<channel>
<title>IEEE/ACM Transactions on Computational Biology and Bioinformatics</title>
<link>http://www.computer.org/tcbb</link>
<description>The IEEE/ACM Transactions on Computational Biology and Bioinformatics is a new quarterly that will publish archival research results related to the algorithmic, mathematical, statistical, and computational methods that are central in bioinformatics and computational biology; the development and testing of effective computer programs in bioinformatics; the development and optimization of biological databases; and important biological results that are obtained from the use of these methods, programs, and databases.</description>
	<language>en-us</language>
	<pubDate>Wed, 4 Jan 2012 11:00:01 GMT</pubDate>
	<image>
		<url>http://csdl.computer.org/common/images/logos/tcbb.gif</url>
		<title>IEEE Computer Society</title>
		<description>List of recently published journal articles</description>
		<link>http://www.computer.org/tcbb</link>
	</image>
  <item>
     <title>PrePrint: Mutation Region Detection for Closely Related Individuals without a Known Pedigree Using High-Density Genotype Data</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TCBB.2011.134</link>
     <description>The fundamental problem in linkage analysis is to identify regions whose allele is shared by all or almost all affected members but by none or few unaffected members. Almost all the existing methods for linkage analysis are for families with clearly given pedigrees. Little work has been done for the case where the sampled individuals are closely related, but their pedigree is not known. This situation occurs very often when the individuals share a common ancestor at least six generations ago. Solving this case will tremendously extend the use of linkage analysis for finding genes that cause genetic diseases. In this paper, we propose a mathematical model (the shared center problem) for inferring the allele-sharing status of a given set of individuals using a database of confirmed haplotypes as reference. We show the NP-completeness of the shared center problem and present a ratio-2 polynomial-time approximation algorithm. We then convert the approximation algorithm into a heuristic algorithm for the shared center problem. Based on this heuristic, we finally design a heuristic algorithm for mutation region detection. We further implement the algorithms to obtain a software package. Our experimental data shows that the software works very well. The package is available at http://www.cs.cityu.edu.hk/~lwang/software/LDWP/index.html for non-commercial use.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TCBB.2011.134</guid>
  </item>
  <item>
     <title>PrePrint: SimBioNeT: A Simulator of Biological Network Topology</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TCBB.2011.116</link>
     <description>Studying biological networks at the topological level is a major issue in computational biology, and simulation is often used in this context, either to assess reverse engineering algorithms or to investigate how topological properties depend on network parameters. In both contexts, it is desirable for a topology simulator to reproduce the current knowledge on biological networks, to be able to generate a number of networks with the same properties, and to be flexible enough to mimic networks of different organisms. We propose a biological network topology simulator, SimBioNeT, in which module structures of different types and sizes are replicated at different levels of network organization and interconnected, so as to obtain the desired degree distribution, e.g. scale-free, and a clustering coefficient that is constant with the number of nodes in the network, a typical characteristic of biological networks. Empirical assessment of the ability of the simulator to reproduce characteristic properties of biological networks and comparison with E. coli and S. cerevisiae transcriptional networks demonstrate the effectiveness of our proposal.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TCBB.2011.116</guid>
  </item>
  <item>
     <title>PrePrint: Exploiting Intra-Structure Information for Secondary Structure Prediction with Multifaceted Pipelines</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TCBB.2011.159</link>
     <description>Predicting the secondary structure of proteins is still a typical step in several bioinformatic tasks, in particular for tertiary structure prediction. Notwithstanding the impressive results obtained so far, mostly due to the advent of sequence encoding schemes based on multiple alignment, in our view the problem should be studied from a novel perspective, in which understanding how available information sources are dealt with plays a central role. After revisiting a well-known secondary structure predictor viewed from this perspective (with the goal of identifying which sources of information have been considered and which have not), we propose a generic software architecture designed to account for all relevant information sources. To demonstrate the validity of the approach, a predictor compliant with the proposed generic architecture has been implemented and compared with several state-of-the-art secondary structure predictors. Experiments have been carried out on standard datasets, and the corresponding results confirm the validity of the approach. The predictor is available at http://iasc.diee.unica.it/ssp2/ through the corresponding web application or as a downloadable, stand-alone, portable unpack-and-run bundle.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TCBB.2011.159</guid>
  </item>
  <item>
     <title>PrePrint: A Co-clustering Approach for Mining Large Protein-protein Interaction Networks</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TCBB.2011.158</link>
     <description>Several approaches have been presented in the literature to cluster Protein-Protein Interaction (PPI) networks. They can be grouped into two main categories: those allowing a protein to participate in different clusters and those generating only non-overlapping clusters. In both cases, a challenging task is to find a suitable compromise between the biological relevance of the results and a comprehensive coverage of the analyzed networks. Indeed, methods returning highly accurate results are often able to cover only small parts of the input PPI network, especially when poorly characterized networks are considered. We present a co-clustering based technique able to generate both overlapping and non-overlapping clusters. The density of the clusters to search for can also be set by the user. We tested our method on the two networks of yeast and human, and compared it to five other well-known techniques on the same interaction datasets. The results showed that, for all the examples considered, our approach always reaches a good compromise between accuracy and network coverage. Furthermore, the behavior of our algorithm is not influenced by the structure of the input network, unlike all the techniques considered in the comparison, which returned very good results on the yeast network but rather poor results on the human network.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TCBB.2011.158</guid>
  </item>
  <item>
     <title>PrePrint: A Metric for Phylogenetic Trees Based on Matching</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TCBB.2011.157</link>
     <description>Comparing two or more phylogenetic trees is a fundamental task in computational biology. The simplest outcome of such a comparison is a pairwise measure of similarity, dissimilarity, or distance. A large number of such measures have been proposed, but so far all suffer from problems varying from computational cost to lack of robustness; many can be shown to behave unexpectedly under certain plausible inputs. For instance, the widely used Robinson-Foulds distance is poorly distributed and thus affords little discrimination, while also lacking robustness in the face of very small changes: reattaching a single leaf elsewhere in a tree of any size can instantly maximize the distance. In this paper, we introduce a new pairwise distance measure, based on matching, for phylogenetic trees. We prove that our measure induces a metric on the space of trees, show how to compute it in low polynomial time, verify through statistical testing that it is robust, and finally note that it does not exhibit unexpected behavior under the same inputs that cause problems with other measures. We also illustrate its usefulness in clustering trees, demonstrating significant improvements in the quality of hierarchical clustering as compared to the same collections of trees clustered using the Robinson-Foulds distance.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TCBB.2011.157</guid>
  </item>
  <item>
     <title>PrePrint: Predicting Protein Function by Multi-label Correlated Semi-supervised Learning</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TCBB.2011.156</link>
     <description>Assigning biological functions to uncharacterized proteins is a fundamental problem in the postgenomic era. The increasing availability of large amounts of protein-protein interaction (PPI) data has led to the emergence of a considerable number of computational methods for determining protein function in the context of a network. These algorithms, however, treat each functional class in isolation and thereby often suffer from the scarcity of labeled data. In reality, different functional classes are naturally interdependent. We propose a new algorithm, Multi-label Correlated Semi-supervised Learning (MCSL), to incorporate the intrinsic correlations among functional classes into protein function prediction by leveraging the relationships provided by the PPI network and the functional class network. The guiding intuition is that the classification function should be sufficiently smooth on subgraphs where the respective topologies of these two networks are a good match. We encode this intuition as regularized learning with intra-class and inter-class consistency, which can be understood as an extension of the graph-based learning with local and global consistency (LGC) method. Cross validation on the yeast proteome illustrates that MCSL consistently outperforms several state-of-the-art methods. Most notably, it effectively overcomes the problem associated with the scarcity of labeled data. The supplementary files are freely available at http://sites.google.com/site/csaijiang/MCSL</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TCBB.2011.156</guid>
  </item>
  <item>
     <title>PrePrint: On the Application of Active Learning and Gaussian Processes in Post-Cryopreservation Cell Membrane Integrity Experiments</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TCBB.2011.155</link>
     <description>Biological cell cryopreservation permits storage of specimens for future use. Stem cell cryostorage in particular is fast becoming a widespread practice due to the cells' potential for use in regenerative medicine. For the optimal cryopreservation process, ultra-low temperatures are needed. However, elevated temperatures are often unavoidable in a typical sample handling cycle, which in turn negatively affects post-cryopreservation integrity of cells. In this paper, we present an application of active learning using an underlying Gaussian Process (GP) model in an experimental study on post-cryopreservation membrane integrity response to a range of elevated temperature conditions. We developed an algorithm which enabled identification of the sampling locations for the experiments in order to obtain the highest information return from a limited-size sample set. We applied this algorithm in the experimental study investigating the effects of severe temperature elevation (ranging from -40&amp;#x00B0;C to 20&amp;#x00B0;C) over a short-term event (48 hours) on the post-cryopreservation membrane integrity of Mesenchymal Stem Cells (MSCs) derived from human bone marrow. The algorithm showed excellent performance by selecting the locations which maximised the reduction of variance of the process response estimate. An approximating GP model developed from this experimental data shows that the elevated temperatures during cryopreservation have an imminent detrimental effect on cell integrity.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TCBB.2011.155</guid>
  </item>
  <item>
     <title>PrePrint: Quantitative Analysis of the Self-assembly Strategies of Intermediate Filaments from Tetrameric Vimentin</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TCBB.2011.154</link>
     <description>In vitro assembly of intermediate filaments from tetrameric vimentin consists of a very rapid phase of tetramers laterally associating into unit-length filaments and a slow phase of filament elongation. We focus in this paper on a systematic quantitative investigation of two molecular models for filament assembly, recently proposed in (Kirmse et al., J. Biol. Chem. 282, 52 (2007), 18563-18572), through mathematical modeling, model fitting, and model validation. We analyze the quantitative contribution of each filament elongation strategy: with tetramers, with unit-length filaments, with longer filaments, or combinations thereof. In each case, we discuss the numerical fitting of the model with respect to one set of data, and its separate validation with respect to a second, different set of data. We introduce a high-resolution model for vimentin filament self-assembly, able to capture the detailed dynamics of filaments of arbitrary length. This provides much more predictive power for the model, in comparison to previous models where only the mean length of all filaments in the solution could be analyzed. We show how kinetic observations on low-resolution models can be extrapolated to the high-resolution model and used for lowering its complexity.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TCBB.2011.154</guid>
  </item>
  <item>
     <title>PrePrint: Stochastic Gene Expression Modeling with Hill Function for Switch-like Gene Responses</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TCBB.2011.153</link>
     <description>Gene expression models play a key role in understanding the mechanisms of gene regulation, which include graded and switch-like responses. Though many stochastic approaches attempt to explain gene expression mechanisms, the Gillespie algorithm, which is commonly used to simulate stochastic models, requires an additional gene cascade to explain the switch-like behaviors of gene responses. In this study, we propose a stochastic gene expression model describing the switch-like behaviors of a gene by incorporating Hill functions into the conventional Gillespie algorithm. We assume eight processes of gene expression, and their biologically appropriate reaction rates are estimated based on the published literature. We observed that the state of the system of the toggle switch model rarely changes, since the Hill function prevents the activation of the involved proteins when their concentrations stay below a criterion. In the ScbA-ScbR system, which can control the antibiotic metabolite production of microorganisms, our modified Gillespie algorithm successfully describes the switch-like behaviors of gene responses and oscillatory expressions, which are consistent with the published experimental study.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TCBB.2011.153</guid>
  </item>
  <item>
     <title>PrePrint: Gene Classification using Parameter-free Semi-supervised Manifold Learning</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TCBB.2011.152</link>
     <description>A new manifold learning method, called parameter-free semi-supervised local Fisher discriminant analysis (pSELF), is proposed to map gene expression data into a low-dimensional space for tumor classification. Motivated by the fact that being semi-supervised and parameter-free are two desirable and promising characteristics for dimension reduction, a new difference-based optimization objective function with unlabeled samples has been designed. The proposed method preserves the global structure of unlabeled samples in addition to separating labeled samples in different classes from each other. The semi-supervised method has an analytic form of the globally optimal solution, which can be computed efficiently by eigendecomposition. Experimental results on synthetic data and the SRBCT, DLBCL and Brain Tumor gene expression datasets demonstrate the effectiveness of the proposed method.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TCBB.2011.152</guid>
  </item>
  <item>
     <title>PrePrint: A top-r Feature Selection Algorithm for Microarray Gene Expression Data</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TCBB.2011.151</link>
     <description>Most of the conventional feature selection algorithms have a drawback whereby a weakly ranked gene that could perform well in terms of classification accuracy with an appropriate subset of genes will be left out of the selection. Considering this shortcoming, we propose a feature selection algorithm in gene expression data analysis of sample classifications. The proposed algorithm first divides genes into subsets, the sizes of which are relatively small (roughly of size h), then selects informative smaller subsets of genes (of size r&amp;#x003C;h) from a subset and merges the chosen genes with another gene subset (of size r) to update the gene subset. We repeat this process until all subsets are merged into one informative subset. We illustrate the effectiveness of the proposed algorithm by analyzing three distinct gene expression datasets. Our method shows promising classification accuracy for all the test datasets. We also show the relevance of the selected genes in terms of their biological functions.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TCBB.2011.151</guid>
  </item>
  <item>
     <title>PrePrint: Eigen-genomic System Dynamic-pattern Analysis (ESDA): Modeling mRNA Degradation and Self-regulation</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TCBB.2011.150</link>
     <description>High-throughput methods systematically measure the internal state of the entire cell, but powerful computational tools are needed to infer dynamics from their raw data. Therefore, we have developed a new computational method, Eigen-genomic System Dynamic-pattern Analysis (ESDA), which uses systems theory to infer dynamic parameters from a time series of gene expression measurements. As many genes are measured at a modest number of time points, estimation of the system matrix is underdetermined and traditional approaches for estimating dynamic parameters are ineffective; thus, ESDA uses the principle of dimensionality reduction to overcome the data imbalance. Since degradation rates are naturally confounded by self-regulation, our model estimates an effective degradation rate that is the difference between self-regulation and degradation. We demonstrate that ESDA is able to recover effective degradation rates with reasonable accuracy in simulation. We also apply ESDA to a budding yeast dataset, and find that effective degradation rates are normally slower than experimentally measured degradation rates. Our results suggest that either self-regulation is widespread in budding yeast and that self-promotion dominates self-inhibition, or that self-regulation may be rare and that experimental methods for measuring degradation rates based on transcription arrest may severely overestimate true degradation rates in healthy cells.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TCBB.2011.150</guid>
  </item>
  <item>
     <title>PrePrint: Designing Filters for Fast Known NcRNA Identification</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TCBB.2011.149</link>
     <description>Detecting members of known non-coding RNA (ncRNA) families in genomic DNA is an important part of sequence annotation. However, the most widely used tool for modeling ncRNA families, the covariance model (CM), incurs a high computational cost when used for genome-wide search. This cost can be reduced by using a filter to exclude sequence that is unlikely to contain the ncRNA of interest, applying the CM only where it is likely to match strongly. Despite recent advances, designing an efficient filter that can detect ncRNA instances lacking strong sequence conservation remains challenging. In this work, we design three types of filters based on multiple secondary structure profiles (SSPs). An SSP augments a regular profile (i.e. a position weight matrix) with secondary structure information but can still be efficiently scanned against long sequences. Multi-SSP-based filters combine evidence from multiple SSP matches and can achieve high sensitivity and specificity. Our SSP-based filters are tested on the BRAliBase III data set, Rfam, and a published metagenomic data set. We compare SSP-based filters with Infernal (with profile HMMs as filters), ERPIN, and tRNAscan-SE. Our experiments demonstrate that carefully designed SSP filters can achieve significant speedup over unfiltered CM search while maintaining high sensitivity. The designed filters and filter-scanning programs are available at: www.cse.msu.edu/~yannisun/ssp/.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TCBB.2011.149</guid>
  </item>
  <item>
     <title>PrePrint: A Framework for Incorporating Functional Inter-relationships into Protein Function Prediction Algorithms</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TCBB.2011.148</link>
     <description>The functional annotation of proteins is one of the most important tasks in the post-genomic era. In this study, we propose a new functional similarity measure in the form of a Jaccard coefficient to quantify functional inter-relationships and also develop a framework for incorporating GO term similarity into the protein function prediction process. The experimental results of cross-validation on S. cerevisiae and Homo sapiens data sets demonstrate that our method is able to improve the performance of protein function prediction. In addition, we find that small-size terms associated with only a few proteins obtain more benefit than large-size ones when functional inter-relationships are considered. We also compare our similarity measure with two other widely used measures, and the results indicate that, when incorporated into function prediction algorithms, our proposed measure is more effective. Finally, we show that our method is robust to annotations in the database which are not complete at present.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TCBB.2011.148</guid>
  </item>
  <item>
     <title>PrePrint: Identification of Essential Proteins Based on Edge Clustering Coefficient</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TCBB.2011.147</link>
     <description>Identification of essential proteins is key to understanding the minimal requirements for cellular life and important for drug design. The rapid increase of available protein-protein interaction data has made it possible to detect protein essentiality at the network level. A series of centrality measures have been proposed to discover essential proteins based on network topology. However, most of them tend to focus only on the topologies of single proteins and ignore the relevance between interactions and protein essentiality. In this paper, a new centrality measure based on the edge clustering coefficient, named NC, is proposed. NC considers both the centrality of a node and the relationship between it and its neighbors. A node's essentiality is determined by the sum of the edge clustering coefficients of interactions connecting it and its neighbors. The new centrality measure NC is applied to three different types of yeast protein-protein interaction networks, which are obtained from the DIP database, the MIPS database and the BioGRID database, respectively. The experimental results on the three different networks show that the number of essential proteins discovered by NC universally exceeds that discovered by the six other centrality measures: DC, BC, CC, SC, EC and IC. Moreover, the essential proteins discovered by NC show a significant cluster effect.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TCBB.2011.147</guid>
  </item>
  <item>
     <title>PrePrint: A New Efficient Data Structure for Storage and Retrieval of Multiple Biosequences</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TCBB.2011.146</link>
     <description>Today's genome analysis applications require sequence representations allowing for fast access to their contents while also being memory-efficient enough to facilitate analyses of large-scale data. While a wide variety of sequence representations exist, lack of a generic implementation of efficient sequence storage has led to a plethora of poorly reusable or programming language-specific implementations. We present a novel, space-efficient data structure (GtEncseq) for storing multiple biological sequences of variable alphabet size, with customizable character transformations, wildcard support and an assortment of internal representations optimized for different distributions of wildcards and sequence lengths. For the human genome (3.1 gigabases, including 237 million wildcard characters) our representation requires only 2 + 8 &amp;#x00D7; 10^-6 bits per character. Implemented in C, our portable software implementation provides a variety of methods for random and sequential access to characters and substrings (including different reading directions) using an object-oriented interface. In addition, it includes access to metadata like sequence descriptions or character distributions. The library is extensible to be used from various scripting languages. GtEncseq is much more versatile than previous solutions, adding features that were previously unavailable. Benchmarks show that it is competitive with respect to space and time requirements.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TCBB.2011.146</guid>
  </item>
  <item>
     <title>IEEE/ACM Transactions on Computational Biology and Bioinformatics - January/February 2012 (Vol. 9, No. 1)</title>
     <link>http://opac.ieeecomputersociety.org/opac?year=2012&amp;volume=9&amp;issue=01&amp;acronym=tcbb</link>
     <description>IEEE/ACM Transactions on Computational Biology and Bioinformatics</description>
     <guid isPermaLink="true">http://www.computer.org/portal/site/tcbb/</guid>
  </item>
  <item>
     <title>PrePrint: The GA and the GWAS: Using Genetic Algorithms to Search for Multi-locus Associations</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TCBB.2011.145</link>
     <description>Enormous data collection efforts and improvements in technology have made large genome-wide association studies a promising approach for better understanding the genetics of common diseases. Still, the knowledge gained from these studies may be extended even further by testing the hypothesis that genetic susceptibility is due to the combined effect of multiple variants or interactions between variants. Here we explore and evaluate the use of a genetic algorithm to discover groups of SNPs (of size 2, 3, or 4) that are jointly associated with bipolar disorder. The algorithm is guided by the structure of a gene interaction network, and is able to find groups of SNPs that are strongly associated with the disease, while performing far fewer statistical tests than other methods.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TCBB.2011.145</guid>
  </item>
  <item>
     <title>PrePrint: GSGS: A Computational Approach to Reconstruct Signaling Pathway Structures from Gene Sets</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TCBB.2011.143</link>
     <description>Reconstruction of signaling pathway structures is essential to decipher complex regulatory relationships in living cells. Existing approaches often rely on unrealistic biological assumptions and do not explicitly consider signal transduction mechanisms. Signal transduction events refer to linear cascades of reactions from cell surface to nucleus and characterize a signaling pathway. We propose a novel approach, Gene Set Gibbs Sampling, to reverse engineer signaling pathway structures from gene sets related to pathways. We hypothesize that signaling pathways are structurally an ensemble of overlapping linear signal transduction events which we encode as Information Flows (IFs). We infer signaling pathway structures from gene sets, referred to as Information Flow Gene Sets (IFGSs), corresponding to these events. Thus, an IFGS only reflects which genes appear in the underlying IF but not their ordering. GSGS offers a Gibbs sampling procedure to reconstruct the underlying signaling pathway structure by sequentially inferring IFs from the overlapping IFGSs related to the pathway. In the proof-of-concept studies, our approach is shown to outperform existing network inference approaches using data generated from benchmark networks in DREAM. We perform a sensitivity analysis to assess the robustness of our approach. Finally, we implement GSGS to reconstruct signaling mechanisms in breast cancer cells.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TCBB.2011.143</guid>
  </item>
  <item>
     <title>PrePrint: Clustering 100,000 Protein Structure Decoys in Minutes</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TCBB.2011.142</link>
     <description>Ab initio protein structure prediction methods first generate large sets of structural conformations as candidates (called decoys), and then select the most representative decoys through clustering techniques. Classical clustering methods are inefficient due to the pairwise distance calculation, and thus become infeasible when the number of decoys is large. In addition, the existing clustering approaches suffer from the arbitrariness in determining a distance threshold for proteins within a cluster: a small distance threshold leads to many small clusters, while a large distance threshold results in the merging of several independent clusters into one cluster. In this paper, we propose an efficient clustering method based on fast estimation of cluster centroids and efficient pruning of rotation spaces. The number of clusters is automatically detected by information distance criteria. A package named ONION, which can be downloaded freely, is implemented accordingly. Experimental results on benchmark data sets suggest that ONION is 14 times faster than existing tools, and that, compared to SPICKER's selections, ONION obtains better selections for 31 targets and worse selections for 19 targets. On an average PC, ONION can cluster 100,000 decoys in around 12 minutes.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TCBB.2011.142</guid>
  </item>
  <item>
     <title>PrePrint: Molecular Dynamics Trajectory Compression with a Coarse-Grained Model</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TCBB.2011.141</link>
     <description>Molecular dynamics trajectories are very data-intensive thereby limiting sharing and archival of such data. One possible solution is compression of trajectory data. Here, trajectory compression based on conversion to the coarse-grained model PRIMO is proposed. The compressed data is about one third of the original data and fast decompression is possible with an analytical reconstruction procedure from PRIMO to all-atom representations. This protocol largely preserves structural features and to a more limited extent also energetic features of the original trajectory.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TCBB.2011.141</guid>
  </item>
  <item>
     <title>PrePrint: A Hybrid EKF and Switching PSO Algorithm for Joint State and Parameter Estimation of Lateral Flow Immunoassay Models</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TCBB.2011.140</link>
     <description>In this paper, a hybrid extended Kalman filter (EKF) and switching particle swarm optimization (SPSO) algorithm is proposed for jointly estimating both the parameters and states of the lateral flow immunoassay model through available short time-series measurement. Our proposed method generalizes the well-known EKF algorithm by imposing physical constraints on the system states. Note that state constraints are encountered very often in practice and give rise to considerable difficulties in system analysis and design. The main purpose of this paper is to handle the dynamic modeling problem with state constraints by combining the extended Kalman filtering and constrained optimization algorithms via the maximization probability method. More specifically, a recently developed SPSO algorithm is used to cope with the constrained optimization problem by converting it into an unconstrained optimization one through adding a penalty term to the objective function. The proposed algorithm is then employed to simultaneously identify the parameters and states of a lateral flow immunoassay model. It is shown that the proposed algorithm gives much improved performance over the traditional EKF method.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TCBB.2011.140</guid>
  </item>
  <item>
     <title>PrePrint: The LASSO and Sparse Least Square Regression Methods for SNP Selection in Predicting Quantitative Traits</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TCBB.2011.139</link>
     <description>Recent work concerning quantitative traits of interest has focused on selecting a small subset of single nucleotide polymorphisms (SNPs) from amongst the SNPs responsible for the phenotypic variation of the trait. When considered as covariates, the large number of variables (SNPs) and their association with those in close proximity pose challenges for variable selection. The features of sparsity and shrinkage of regression coefficients of the least absolute shrinkage and selection operator (LASSO) method appear attractive for SNP selection. Sparse partial least squares (SPLS) is also appealing as it combines the features of sparsity in subset selection and dimension reduction to handle correlations amongst SNPs. In this paper we investigate application of the LASSO and SPLS methods for selecting SNPs that predict quantitative traits. We evaluate the performance of both methods with different criteria and under different scenarios using simulation studies. Results indicate that these methods can be effective in selecting SNPs that predict quantitative traits but are limited by some conditions. Both methods perform similarly overall, but each exhibits advantages over the other in given situations. Both methods are applied to Canadian Holstein cattle data to compare their performance.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TCBB.2011.139</guid>
  </item>
  <item>
     <title>PrePrint: Efficient Approaches for Retrieving Protein Tertiary Structures</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TCBB.2011.138</link>
     <description>The 3D conformation of a protein in space is the main factor which determines its function in living organisms. Due to the huge number of newly discovered proteins, there is a need for fast and accurate computational methods for retrieving protein structures. Their purpose is to speed up the process of understanding the structure-to-function relationship which is crucial in the development of new drugs. There are many algorithms addressing the problem of protein structure retrieval. In this paper, we present several novel approaches for retrieving protein tertiary structures. We present our voxel based descriptor. Then we present our protein ray based descriptors, which are applied on the interpolated protein backbone. We introduce five novel wavelet descriptors which perform wavelet transforms on the protein distance matrix. We also propose an efficient algorithm for distance matrix alignment, MASASW (Matrix Alignment by Sequence Alignment within Sliding Window), which has been shown to be much faster than DALI, CE and MatAlign. We compared our approaches with each other and with several existing algorithms, and they generally prove to be fast and accurate. MASASW achieves the highest accuracy. The ray and wavelet based descriptors as well as MASASW are more accurate than CE.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TCBB.2011.138</guid>
  </item>
  <item>
     <title>PrePrint: Algorithms for Reticulate Networks of Multiple Phylogenetic Trees</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TCBB.2011.137</link>
     <description>A reticulate network N of multiple phylogenetic trees may have vertices with two or more parents (called reticulation vertices). There are two ways to define the reticulation number of N. One is to define it as the number of reticulation vertices in N; in this case, a reticulate network with the smallest reticulation number is called an optimal type-I reticulate network of the trees. The other is to define it as the total number of parents of reticulation vertices in N minus the number of reticulation vertices in N; in this case, a reticulate network with the smallest reticulation number is called an optimal type-II reticulate network of the trees. In this paper, we present a fast algorithm for constructing one or all optimal type-I reticulate networks of multiple phylogenetic trees. We then use the algorithm together with other ideas to obtain an algorithm for estimating a lower bound on the reticulation number of an optimal type-II reticulate network of the input trees. To our knowledge, these are the first fast algorithms for the problems. Our experimental data shows that our algorithms can construct optimal type-I reticulate networks rapidly and can compute better lower bounds for optimal type-II reticulate networks within much shorter time than the previously best program.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TCBB.2011.137</guid>
  </item>
  <item>
     <title>PrePrint: Predicting Ligand Binding Residues and Functional Sites using Multi-positional Correlations with Graph Theoretic Clustering and Kernel CCA</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TCBB.2011.136</link>
     <description>We present a new computational method for predicting ligand binding residues and functional sites in protein sequences. These residues and sites tend not only to be conserved but also to exhibit strong correlation due to the selection pressure during evolution in order to maintain the required structure and/or function. To explore the effect of correlations among multiple positions in the sequences, the method uses graph theoretic clustering and kernel-based canonical correlation analysis (kCCA) to identify binding and functional sites in protein sequences as the residues that exhibit strong correlation between the residues' evolutionary characterization at the sites and the structure based functional classification of the proteins in the context of a functional family. The results of testing the method on two well curated datasets show that the prediction accuracy as measured by ROC scores improves significantly when multi-positional correlations are accounted for.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TCBB.2011.136</guid>
  </item>
  <item>
     <title>PrePrint: Robust Classification Method of Tumor Subtype by Using Correlation Filters</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TCBB.2011.135</link>
     <description>Tumor classification based on gene expression profiles, which is of great benefit to the accurate diagnosis and personalized treatment for different types of tumor, has drawn great attention in recent years. This paper proposes a novel tumor classification method based on correlation filters to identify the overall pattern of tumor subtype hidden in differentially expressed genes. Concretely, two correlation filters, i.e., Minimum Average Correlation Energy (MACE) and Optimal Tradeoff Synthetic Discriminant Function (OTSDF), are introduced to determine whether a test sample matches the templates synthesized for each subclass. The experiments on six publicly available datasets indicate that the proposed method is robust to noise, and can more effectively avoid the effects of the curse of dimensionality. Compared with many model-based methods, the correlation filter based method can achieve better performance when balanced training sets are exploited to synthesize the templates. In particular, the proposed method can detect the similarity of overall pattern while ignoring small mismatches between test sample and the synthesized template, and it performs well even if only a few training samples are available. More importantly, the experimental results can be visually represented, which is helpful for the further analysis of results.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TCBB.2011.135</guid>
  </item>
  <item>
     <title>PrePrint: On Complexity of Protein Structure Alignment Problem under Distance Constraint</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TCBB.2011.133</link>
     <description>We study the well known LCP (Largest Common Point-Set) under Bottleneck Distance Problem. Given two proteins a and b (as sequences of points in 3D space) and a distance cutoff &amp;#x03C3;, the goal is to find a spatial superposition and an alignment that maximizes the number of pairs of points from a and b that can be fit under the distance &amp;#x03C3; from each other. The best to date algorithms for approximate and exact solution to this problem run in time O(n^8) and O(n^32), respectively, where n represents the protein length. This work improves the runtime of the approximation algorithm and the algorithm for absolute optimum for both order-dependent and order-independent alignments. More specifically, our algorithms for near-optimal and optimal sequential alignments run in time O(n^7 log n) and O(n^14 log n), respectively. For non-sequential alignments, corresponding running times are O(n^7.5) and O(n^14.5).</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TCBB.2011.133</guid>
  </item>
  <item>
     <title>PrePrint: Quantifying Dynamic Stability of Genetic Memory Circuits</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TCBB.2011.132</link>
     <description>Bistability/Multistability has been found in many biological systems including genetic memory circuits. Proper characterization of system stability helps to understand biological functions and has potential applications in fields such as synthetic biology. Existing methods of analyzing bistability are either qualitative or in a static way. Assuming the circuit is in a steady state, the latter can only reveal the susceptibility of the stability to injected DC noises. However, this can be inappropriate and inadequate as dynamics are crucial for many biological networks. In this paper, we quantitatively characterize the dynamic stability of a genetic conditional memory circuit by developing new dynamic noise margin (DNM) concepts and associated algorithms based on system theory. Taking into account the duration of the noisy perturbation, the DNMs are more general cases of their static counterparts. Using our techniques, we analyze the noise immunity of the memory circuit and derive insights on dynamic hold and write operations. Considering cell-to-cell variations, our parametric analysis reveals that the dynamic stability of the memory circuit has significantly varying sensitivities to underlying biochemical reactions attributable to differences in structure, time scales and nonlinear interactions between reactions. With proper extensions, our techniques are broadly applicable to other multi-stable biological systems.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TCBB.2011.132</guid>
  </item>
  <item>
     <title>PrePrint: Subcellular Localization Prediction through Boosting Association Rules</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TCBB.2011.131</link>
     <description>Computational methods for predicting protein subcellular localization have used various types of features, including N-terminal sorting signals, amino acid compositions, and text annotations from protein databases. Our approach does not use biological knowledge such as the sorting signals or homologues, but uses only protein sequence information. The method divides a protein sequence into short k-mer sequence fragments which can be mapped to word features in document classification. A large number of class association rules are mined from the protein sequence examples that range from the N-terminus to the C-terminus. Then, a boosting algorithm is applied to those rules to build up a final classifier. Experimental results using benchmark datasets show that our method performs well in terms of both classification performance and test coverage. The results also imply that the k-mer sequence features which determine subcellular locations do not necessarily exist in specific positions of a protein sequence. An online prediction service implementing our method is available at http://isoft.postech.ac.kr/research/BCAR/subcell.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TCBB.2011.131</guid>
  </item>
  <item>
     <title>PrePrint: Comment on "SCS: Signal, Context, and Structure Features for Genome-Wide Human Promoter Recognition"</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TCBB.2011.130</link>
     <description>We comment on the flexibility profiles calculated by Zeng et al., and show that these profiles do not represent the local flexibility of the DNA molecule. If one takes into account the physics of elasticity, the averaged flexibility profile shows an additional peak which is missed in the original calculation. We show that it is not possible to calculate the flexibility of a 6-mer using tetranucleotide elastic constants; the shortest sequence is a 7-mer. For 6-mers, dinucleotide or trinucleotide parameters are needed. We present calculations for dinucleotide flexibility parameters and show that the same additional peak is present for both 7-mers and 6-mers.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TCBB.2011.130</guid>
  </item>
  <item>
     <title>PrePrint: DICLENS: Divisive Clustering Ensemble with Automatic Cluster Number</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TCBB.2011.129</link>
     <description>Clustering has a long and rich history in a variety of scientific fields. Finding natural groupings of a data set is a hard task as attested by hundreds of clustering algorithms in the literature. Each clustering technique makes some assumptions about the underlying data set. If the assumptions hold, good clusterings can be expected. It is hard, in some cases impossible, to satisfy all the assumptions. Therefore, it is beneficial to apply different clustering methods on the same data set, or the same method with varying input parameters or both. We propose a novel method, DICLENS, which combines a set of clusterings into a final clustering having better overall quality. Our method produces the final clustering automatically and does not take any input parameters, a feature missing in many existing algorithms. Extensive experimental studies on real, artificial, and gene expression data sets demonstrate that DICLENS produces very good quality clusterings in a short amount of time. DICLENS implementation runs on standard personal computers by being scalable, and by consuming very little memory and CPU.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TCBB.2011.129</guid>
  </item>
  <item>
     <title>PrePrint: On the Elusiveness of Clusters</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TCBB.2011.128</link>
     <description>Rooted phylogenetic networks are often used to represent conflicting phylogenetic signals. Given a set of clusters, a network is said to represent these clusters in the softwired sense if, for each cluster, at least one tree embedded in the network contains it. Motivated by parsimony we might wish to construct such a network using as few reticulations as possible, or minimizing the level of the network, i.e. the maximum number of reticulations used in any "tangled" region of the network. Although these are NP-hard problems, here we prove that, for every fixed k &amp;#x2265; 0, it is polynomial-time solvable to construct a phylogenetic network with level equal to k representing a cluster set, or to determine that no such network exists. However, this algorithm does not lend itself to a practical implementation. We also prove that the comparatively efficient CASS algorithm correctly solves this problem (and also minimizes the reticulation number) when input clusters are obtained from two not necessarily binary gene trees on the same set of taxa but does not always minimize level for general cluster sets. Finally, we describe a new algorithm which generates in polynomial-time all binary phylogenetic networks with exactly r reticulations representing a set of input clusters (for every fixed r &amp;#x2265; 0).</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TCBB.2011.128</guid>
  </item>
  <item>
     <title>PrePrint: Efficient Maximal Repeat Finding Using the Burrows-Wheeler Transform and Wavelet Tree</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TCBB.2011.127</link>
     <description>Finding repetitive structures in genomes and proteins is important to understand their biological functions. Many data compressors for modern genomic sequences rely heavily on finding repeats in the sequences. The notion of maximal repeats captures all the repeats in the data in a space-efficient way. Prior work on maximal repeat finding used either a suffix tree or a suffix array along with other auxiliary data structures. Their space usage is 19 to 50 times the text size with the best engineering efforts, prohibiting their usability on massive data. Our technique uses the Burrows-Wheeler Transform and wavelet trees. For data sets consisting of natural language texts, the space usage of our method is no more than three times the text size. For genomic sequences stored using one byte per base, the space usage is less than double the sequence size. Our method is also orders of magnitude faster than the prior methods for processing massive texts, since the prior methods must use external memory. For the first time, our method enables a desktop computer with 8GB internal memory to find all the maximal repeats in the whole human genome in less than 17 hours. We have implemented our method as general-purpose open-source software for public use.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TCBB.2011.127</guid>
  </item>
  <item>
     <title>PrePrint: Inference of Biological S-system Using Separable Estimation Method and Genetic Algorithm</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TCBB.2011.126</link>
     <description>Reconstruction of a biological system from its experimental time series data is a challenging task in systems biology. The S-system, which consists of a group of nonlinear ordinary differential equations, is an effective model to characterize molecular biological systems and analyze the system dynamics. However, inference of S-systems without knowledge of the system structure is not a trivial task due to their nonlinearity and complexity. In this paper, a pruning separable parameter estimation algorithm is proposed for inferring S-systems. This novel algorithm combines the separable parameter estimation method and a pruning strategy, which includes adding an &amp;#8467;1 regularization term to the objective function and pruning the solution with a threshold value. Then, this algorithm is combined with the continuous genetic algorithm to form a hybrid algorithm that has the properties of both combined algorithms. The performance of the pruning strategy in the proposed algorithm is evaluated from two aspects: the parameter estimation error and structure identification accuracy. The results show that the proposed algorithm with the pruning strategy has much lower estimation error and much higher identification accuracy than the existing method.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TCBB.2011.126</guid>
  </item>
  <item>
     <title>PrePrint: Algorithms to Detect Multiprotein Modularity Conserved during Evolution</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TCBB.2011.125</link>
     <description>Detecting essential multiprotein modules that change infrequently during evolution is a challenging algorithmic task that is important for understanding the structure, function, and evolution of the biological cell. In this paper, we define a measure of modularity for interactomes and present a linear-time algorithm, Produles, for detecting multiprotein modularity conserved during evolution that improves on the running time of previous algorithms for related problems and offers desirable theoretical guarantees. We present a biologically motivated graph theoretic set of evaluation measures complementary to previous evaluation measures, demonstrate that Produles exhibits good performance by all measures, and describe certain recurrent anomalies in the performance of previous algorithms that are not detected by previous measures. Consideration of the newly defined measures and algorithm performance on these measures leads to useful insights on the nature of interactomics data and the goals of previous and current algorithms. Through randomization experiments we demonstrate that conserved modularity is a defining characteristic of interactomes. Computational experiments on current experimentally derived interactomes for Homo sapiens and Drosophila melanogaster, combining results across algorithms, show that nearly 10% of current interactome proteins participate in multiprotein modules with good evidence in the protein interaction data of being conserved between human and Drosophila.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TCBB.2011.125</guid>
  </item>
  <item>
     <title>PrePrint: Identifying Bacterial Virulent Proteins by Fusing a Set of Classifiers Based on Variants of Chou's Pseudo Amino Acid Composition and on Evolutionary Information</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TCBB.2011.117</link>
     <description>The study of reliable automatic systems for predicting bacterial virulent proteins has several important applications in finding novel drugs/vaccines and in understanding virulence mechanisms in pathogens. In this work we study several feature extraction approaches for representing proteins and propose a novel bacterial virulent protein prediction method based on an ensemble of classifiers where the features are extracted directly from the amino acid sequence and from the evolutionary information of a given protein. In particular, several ensembles are evaluated and compared, obtained by combining six feature extraction methods and several classification approaches based on two general purpose classifiers (i.e. support vector machine and a variant of input decimated ensemble) and their random subspace versions. An extensive evaluation performed according to a blind testing protocol, where the parameters of the system are optimized using the training set and the system is validated on three different independent datasets, allows selection of the best-performing system and demonstrates the validity of the proposed method. Based on the results obtained using the blind testing protocol, it is interesting to note that, even if the best-performing stand-alone method is not always the same in each independent dataset, the fusion of different methods yields stable performance across all the tested independent datasets.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TCBB.2011.117</guid>
  </item>
  <item>
     <title>PrePrint: Constructing and Drawing Regular Planar Split Networks</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TCBB.2011.115</link>
     <description>Split networks are commonly used to visualize collections of bipartitions, also called splits, of a finite set. Such collections arise, for example, in evolutionary studies. Split networks can be viewed as a generalization of phylogenetic trees and may be generated using the SplitsTree package. Recently, the NeighborNet method for generating split networks has become rather popular, in part because it is guaranteed to always generate a circular split system, which can always be displayed by a planar split network. Even so, labels must be placed on the "outside" of the network, which might be problematic in some applications. To help circumvent this problem, it can be helpful to consider so-called flat split systems, which can be displayed by planar split networks where labels are allowed on the inside of the network too. Here we present a new algorithm that is guaranteed to compute a minimal planar split network displaying a flat split system in polynomial time, provided the split system is given in a certain format. We will also briefly discuss two heuristics that could be useful for analyzing phylogeographic data and that allow the computation of flat split systems in this format in polynomial time.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TCBB.2011.115</guid>
  </item>
  <item>
     <title>PrePrint: Structural SCOP Superfamily Level Classification Using Unsupervised Machine Learning</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TCBB.2011.114</link>
     <description>One of the major research directions in bioinformatics is that of assigning superfamily classification to a given set of proteins. The classification reflects the structural, evolutionary, and functional relatedness. These relationships are embodied in a hierarchical classification, such as the Structural Classification of Protein (SCOP), which is mostly manually curated. Such a classification is essential for the structural and functional analyses of proteins. Yet a large number of proteins remains unclassified. In this study, we have proposed an unsupervised machine learning approach to classify and assign a given set of proteins to SCOP superfamilies. In this method, we have constructed a database and similarity matrix using P-values obtained from an all-against-all BLAST run and trained the network with ART2 unsupervised learning algorithm using the rows of the similarity matrix as input vectors, enabling the trained network to classify the proteins from 0.82 to 0.97 f-measure accuracy. The performance of ART2 has been compared with that of spectral clustering, Random forest, SVM and HHpred. ART2 performs better than the others except HHpred. HHpred performs better than ART2 and the sum of errors is smaller than that of the other methods evaluated.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TCBB.2011.114</guid>
  </item>
  <item>
     <title>PrePrint: Exploiting the Functional and Taxonomic Structure of Genomic Data by Probabilistic Topic Modeling</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TCBB.2011.113</link>
     <description>In this paper, we present a method that enables both the homology-based approach and the composition-based approach to further study the functional core (i.e. microbial core and gene core, correspondingly). We first show that a generative topic model can be used to model the taxon abundance information obtained by the homology-based approach and to study the microbial core. The model considers each sample as a 'document', which has a mixture of functional groups, while each functional group (also known as a 'latent topic') is a weighted mixture of species. Therefore, estimating the generative topic model for taxon abundance data will uncover the distribution over latent functions (latent topics) in each sample. Second, we show that a generative topic model can also be used to study the genome-level composition of 'N-mer' features (DNA sub-reads obtained by composition-based approaches). The model considers each genome as a mixture of latent genetic patterns (latent topics), while each functional pattern is a weighted mixture of the 'N-mer' features; thus the existence of core genomes can be indicated by a set of common N-mer features. After studying the mutual information between latent topics and gene regions, we provide an explanation of the functional roles of the uncovered latent genetic patterns. The experimental results demonstrate the effectiveness of the proposed method.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TCBB.2011.113</guid>
  </item>
  <item>
     <title>PrePrint: Output-Sensitive Algorithms for Finding the Nested Common Intervals of Two General Sequences</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TCBB.2011.112</link>
     <description>The focus of this paper is the problem of finding all nested common intervals of two general sequences. Blin, Faye, and Stoye introduced three models to define nested common intervals of two sequences: the uniqueness, the free-inclusion, and the bijection models. We consider all the three models. For the uniqueness and the bijection models, we give O(n + N_out)-time algorithms, where N_out denotes the size of the output. For the free-inclusion model, we give an O(n^(1+e) + N_out)-time algorithm, where e &amp;#x003E; 0 is an arbitrarily small constant. We also present an upper bound on the size of the output for each model. For the uniqueness and the free-inclusion models, we show that N_out = O(n^2). Let C = Sum_{a in A}(o1(a)o2(a)), where A is the set of distinct genes, and o1(a) and o2(a) are, respectively, the numbers of copies of gene a in the two given sequences. For the bijection model, we show that N_out = O(Cn). In this paper, we also study the problem of finding all approximate nested common intervals of two sequences on the bijection model. An O(dn + N_out)-time algorithm is presented, where d denotes the maximum number of allowed gaps. In addition, we show that for this problem N_out is O(dn^3).</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TCBB.2011.112</guid>
  </item>
  <item>
     <title>PrePrint: The Impact of Normalization and Phylogenetic Information on Estimating the Distance for Metagenomes</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TCBB.2011.111</link>
     <description>Metagenomics enables the study of unculturable microorganisms in different environments directly. Discriminating between the compositional differences of metagenomes is an important and challenging problem. Several distance functions have been proposed to estimate the differences based on functional profiles or taxonomic distributions; however, the strengths and limitations of such functions are still unclear. Initially, we analyzed three well-known distance functions and found very little difference between them in the clustering of samples. This motivated us to incorporate suitable normalizations and phylogenetic information into the functions so that we could cluster samples from both real and synthetic datasets. The results indicate significant improvement in sample clustering over that derived by rank-based normalization with phylogenetic information, regardless of whether the samples are from real or synthetic microbiomes. Furthermore, our findings suggest that considering suitable normalizations and phylogenetic information is essential when designing distance functions for estimating the differences between metagenomes. We conclude that incorporating rank-based normalization with phylogenetic information into the distance functions helps achieve reliable clustering results.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TCBB.2011.111</guid>
  </item>
  <item>
     <title>PrePrint: On Parameter Synthesis by Parallel Model Checking</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TCBB.2011.110</link>
     <description>An important problem in current computational systems biology is to analyse models of biological systems dynamics under parameter uncertainty. This paper presents a novel algorithm for parameter synthesis based on parallel model checking. The algorithm is conceptually universal with respect to the modelling approach employed. We introduce the algorithm, show its scalability, and examine its applicability on several biological models.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TCBB.2011.110</guid>
  </item>
  <item>
     <title>PrePrint: Optimizing Phylogenetic Networks for Circular Split Systems</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TCBB.2011.109</link>
     <description>We address the problem of realizing a given distance matrix by a planar phylogenetic network with a minimum number of faces. With the help of the popular software SplitsTree4, we start by approximating the distance matrix with a distance metric that is a linear combination of circular splits. The main results of this paper are the necessary and sufficient conditions for the existence of a network with a single face. We show how such a network can be constructed, and we present a heuristic for constructing a network with few faces using the first algorithm as the base case. Experimental results on biological data show that this heuristic algorithm can produce phylogenetic networks with far fewer faces than the ones computed by SplitsTree4, without affecting the approximation of the distance matrix.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TCBB.2011.109</guid>
  </item>
  <item>
     <title>PrePrint: Smoldyn on Graphics Processing Units: Massively Parallel Brownian Dynamics Simulation</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TCBB.2011.106</link>
     <description>Space is a very important aspect in the simulation of biochemical systems; recently, the need for simulation algorithms able to cope with space has become more and more compelling. A common drawback of spatial models lies in their complexity: models can become very large, and their simulation can be time consuming, especially if we want to capture the system's behaviour in a reliable way using stochastic methods in conjunction with a high spatial resolution. In order to deliver on the promise made by systems biology of being able to understand a system as a whole, we need to scale up the size of the models we are able to simulate, moving from sequential to parallel simulation algorithms. In this paper we analyse Smoldyn, a widely used algorithm for stochastic simulation of chemical reactions with spatial resolution and single-molecule detail, and we propose an alternative, innovative implementation that exploits the parallelism of Graphics Processing Units (GPUs). The implementation executes the most computationally demanding steps (computation of diffusion, unimolecular and bimolecular reactions, and the most common cases of molecule-surface interaction) on the GPU, computing them in parallel on each molecule of the system. The implementation offers good speed-ups and real-time, high-quality graphics output.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TCBB.2011.106</guid>
  </item>
  <item>
     <title>PrePrint: A Sparse Regulatory Network of Copy-number Driven Gene Expression Reveals Putative Breast Cancer Oncogenes</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TCBB.2011.105</link>
     <description>Copy number aberrations are recognized to be important in cancer as they may localize to regions harboring oncogenes or tumor suppressors. Such genomic alterations mediate phenotypic changes through their impact on expression. Both cis- and trans-acting alterations are important since they may help to elucidate putative cancer genes. However, trans-effects are less well studied due to the computational difficulty in detecting weak and sparse signals in the data, and yet may influence multiple genes on a global scale. We propose an integrative approach to learn a sparse interaction network of DNA copy-number regions with their downstream transcriptional targets in breast cancer. With respect to goodness of fit on both simulated and real data, the performance of sparse network inference is no worse than other state-of-the-art models but with the advantage of simultaneous feature selection. Further, our approach yields a quantitative copy-number dependency score, which distinguishes cis- versus trans-effects. When applied to a breast cancer dataset, numerous expression profiles were impacted by cis-acting copy-number alterations, including several known oncogenes such as GRB7, ERBB2 and LSM1. Several trans-acting alterations were also identified, impacting genes such as ADAM2 and BAGE, which warrant further investigation.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TCBB.2011.105</guid>
  </item>
  <item>
     <title>PrePrint: Quantum Gate Circuit Model of Signal Integration in Bacterial Quorum Sensing</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TCBB.2011.104</link>
     <description>Bacteria evolved cell-to-cell communication processes to gain information about their environment and regulate gene expression. Quorum sensing is such a process, in which signaling molecules, called autoinducers, are produced, secreted and detected. In several cases bacteria use more than one autoinducer and integrate the information conveyed by them. It has not yet been explained adequately why bacteria evolved such signal integration circuits and what they can learn about their environments using more than one autoinducer, since all signaling pathways merge into one. Here quantum information theory, which includes classical information theory as a special case, is used to construct a quantum gate circuit that reproduces recent experimental results. Although the conditions in which bio-systems exist do not allow for the appearance of quantum mechanical phenomena, the powerful computation tools of quantum information processing can be carefully used to cope with signal and information processing by these complex systems. A simulation algorithm based on this model has been developed, and numerical experiments that analyse the dynamical operation of the quorum sensing circuit were performed for various cases of autoinducer variations. These experiments revealed that the variations contain significant information about the environment in which bacteria exist.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TCBB.2011.104</guid>
  </item>
  <item>
     <title>PrePrint: A New Efficient Algorithm for the Gene Team Problem on General Sequences</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TCBB.2011.96</link>
     <description>A well-known model for capturing the essential biological features of a conserved gene cluster is the gene team model. The problem of finding the gene teams of two general sequences is the focus of this paper. For this problem, He and Goldwasser gave an efficient algorithm that requires O(mn) time using O(m + n) working space, where m and n are, respectively, the numbers of genes in the two given sequences. In this paper, a new efficient algorithm is presented. Assume m &amp;#x2264; n. Let C = &amp;#x03A3;_&amp;#x03B1;&amp;#x2208;&amp;#x03A3; o&amp;#x2081;(&amp;#x03B1;)o&amp;#x2082;(&amp;#x03B1;), where &amp;#x03A3; is the set of distinct genes, and o&amp;#x2081;(&amp;#x03B1;) and o&amp;#x2082;(&amp;#x03B1;) are, respectively, the numbers of copies of &amp;#x03B1; in the two given sequences. Our new algorithm requires O(min{C lg n, mn}) time using O(m + n) working space. Compared with He and Goldwasser's algorithm, our new algorithm is more practical, as C is likely to be much smaller than mn in practice. In addition, our new algorithm is output-sensitive: its running time is O(lg n) times the size of the output. Moreover, our new algorithm can be efficiently extended to find the gene teams of k general sequences in O(k C lg (n&amp;#x2081;n&amp;#x2082; ... n_k)) time, where n_i is the number of genes in the i^th input sequence.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TCBB.2011.96</guid>
  </item>
  <item>
     <title>PrePrint: A Swarm Intelligence Framework for Reconstructing Gene Networks: Searching for Biologically Plausible Architectures</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TCBB.2011.87</link>
     <description>In this paper, we investigate the problem of reverse-engineering the topology of gene regulatory networks from temporal gene expression data. We adopt a computational intelligence approach comprising swarm intelligence techniques, namely particle swarm optimization (PSO) and ant colony optimization (ACO). In addition, the recurrent neural network (RNN) formalism is employed for modelling the dynamical behaviour of gene regulatory systems. More specifically, ACO is used for searching the discrete space of network architectures and PSO for searching the corresponding continuous space of RNN model parameters. We propose a novel solution construction process in the context of ACO for generating biologically plausible candidate architectures. The objective is to concentrate the search effort into areas of the structure space that contain architectures which are feasible in terms of their topological resemblance to real-world networks. The proposed framework is first applied to an artificial data set with added noise for reconstructing a subnetwork of the genetic interaction network of S. cerevisiae (yeast). The framework is also applied to a real-world data set for reverse-engineering the SOS response system of the bacterium Escherichia coli. Results demonstrate the relative advantage of utilizing problem-specific knowledge regarding biologically plausible structural properties of gene networks over conducting a problem-agnostic search in the vast space of network architectures.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TCBB.2011.87</guid>
  </item>
  <item>
     <title>PrePrint: Mutual Information Optimization for Mass Spectra Data Alignment</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TCBB.2011.80</link>
     <description>"Signal" alignments play critical roles in many clinical settings. This is the case for mass spectrometry data, an important component of many types of proteomic analysis. A central problem occurs when one needs to integrate (mass spectrometry) data produced by different sources, e.g., different equipment and/or laboratories. In these cases some form of "data integration" or "data fusion" may be necessary in order to discard some source-specific aspects and improve the ability to perform a classification task such as inferring the "disease classes" of patients. The need for new high-performance data alignment methods is therefore particularly important in these contexts. In this paper we propose an approach based both on an information theory perspective, generally used in feature construction problems, and on the application of a mathematical programming task (i.e., the weighted bipartite matching problem). We present the results of a comparative analysis of our method against other approaches. The analysis was conducted on data from plasma/ethylenediaminetetraacetic acid (EDTA) of "control" and Alzheimer patients collected from three different hospitals. The results point to a significant performance advantage of our method with respect to the competing ones tested.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TCBB.2011.80</guid>
  </item>
   </channel>
</rss>
