<?xml version="1.0" encoding="ISO-8859-1"?>
<rss version="2.0">
<channel>
<title>IEEE Transactions on Knowledge and Data Engineering</title>
<link>http://www.computer.org/tkde</link>
<description>The IEEE Transactions on Knowledge and Data Engineering is an archival journal published monthly. The information published in this Transactions is designed to inform researchers, developers, managers, strategic planners, users, and others interested in state-of-the-art and state-of-the-practice activities in the knowledge and data engineering area. We are interested in well-defined theoretical results and empirical studies that have potential impact on the acquisition, management, storage, and graceful degeneration of knowledge and data, as well as in provision of knowledge and data services. Specific topics include, but are not limited to: a) artificial intelligence techniques, including speech, voice, graphics, images, and documents; b) knowledge and data engineering tools and techniques; c) parallel and distributed processing; d) real-time distributed; e) system architectures, integration, and modeling; f) database design, modeling and management; g) query design and implementation languages; h) distributed database control; j) algorithms for data and knowledge management; k) performance evaluation of algorithms and systems; l) data communications aspects; m) system applications and experience; n) knowledge-based and expert systems; and, o) integrity, security, and fault tolerance.</description>
	<language>en-us</language>
	<pubDate>Fri, 24 May 2013 10:00:16 GMT</pubDate>
	<image>
		<url>http://csdl.computer.org/common/images/logos/tkde.gif</url>
		<title>IEEE Computer Society</title>
		<description>List of recently published journal articles</description>
		<link>http://www.computer.org/tkde</link>
	</image>
  <item>
     <title>PrePrint: Runtime Optimizations for Prediction with Tree-Based Models</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TKDE.2013.73</link>
     <description>Tree-based models have proven to be an effective solution for web ranking as well as other problems in diverse domains. This paper focuses on optimizing the runtime performance of applying such models to make predictions, given an already-trained model. Although exceedingly simple conceptually, most implementations of tree-based models do not efficiently utilize modern superscalar processor architectures. By laying out data structures in memory in a more cache-conscious fashion, removing branches from the execution flow using a technique called predication, and micro-batching predictions using a technique called vectorization, we are able to better exploit modern processor architectures and significantly improve the speed of tree-based models over hard-coded if-else blocks. Our work represents the first instance of an architecture-conscious runtime implementation of tree-based models that we are aware of.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TKDE.2013.73</guid>
  </item>
  <item>
     <title>PrePrint: EMR: A Scalable Graph-Based Ranking Model for Content-Based Image Retrieval</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TKDE.2013.70</link>
     <description>Graph-based ranking models have been widely applied in the information retrieval area. In this paper, we focus on a well-known graph-based model: the Ranking on Data Manifold model, or Manifold Ranking (MR). In particular, it has been successfully applied to content-based image retrieval because of its outstanding ability to discover the underlying geometrical structure of the given image database. However, manifold ranking is computationally very expensive, which significantly limits its applicability to large databases, especially when the queries are outside the database (new samples). We propose a novel scalable graph-based ranking model called Efficient Manifold Ranking (EMR), which tries to address the shortcomings of MR from two main perspectives: scalable graph construction and efficient ranking computation. Specifically, we build an anchor graph on the database instead of a traditional k-nearest neighbor graph, and design a new form of adjacency matrix to speed up the ranking. An approximate method is adopted for efficient out-of-sample retrieval. Experimental results on some large scale image databases demonstrate that EMR is a promising method for real world retrieval applications.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TKDE.2013.70</guid>
  </item>
  <item>
     <title>PrePrint: Quasi-SLCA Based Keyword Query Processing Over Probabilistic XML Data</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TKDE.2013.67</link>
     <description>The probabilistic threshold query is one of the most common queries in uncertain databases, where a result satisfying the query must also have a probability that meets the threshold requirement. In this paper, we investigate probabilistic threshold keyword queries (PrTKQ) over XML data, which have not been studied before. We first introduce the notion of quasi-SLCA and use it to represent results for a PrTKQ with the consideration of possible world semantics. Then we design a probabilistic inverted (PI) index that can be used to quickly return the qualified answers and filter out the unqualified ones based on our proposed lower/upper bounds. After that, we propose two efficient and comparable algorithms: Baseline Algorithm and PI index-based Algorithm. To accelerate the algorithms, we also utilize a probability density function. An empirical study using real and synthetic data sets has verified the effectiveness and the efficiency of our approaches.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TKDE.2013.67</guid>
  </item>
  <item>
     <title>PrePrint: Clustering-Guided Sparse Structural Learning for Unsupervised Feature Selection</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TKDE.2013.65</link>
     <description>Many pattern analysis and data mining problems have witnessed high-dimensional data represented by a large number of features, which are often redundant and noisy. Feature selection is one main technique for dimensionality reduction that involves identifying a subset of the most useful features. In this paper, a novel unsupervised feature selection algorithm, named Clustering-Guided Sparse Structural Learning (CGSSL), is proposed by integrating cluster analysis and sparse structural analysis into a joint framework and experimentally evaluated. Nonnegative spectral clustering is developed to learn more accurate cluster labels of the input samples, which guide feature selection simultaneously. Meanwhile, the cluster labels are also predicted by exploiting the hidden structure shared by different features, which can uncover feature correlations to make the results more reliable. Row-wise sparse models are leveraged to make the proposed model suitable for feature selection. To optimize the proposed formulation, we propose an efficient iterative algorithm. Finally, extensive experiments are conducted on 12 diverse benchmarks, including face data, handwritten digit data, document data and biomedical data. The encouraging experimental results in comparison with several representative algorithms and the theoretical analysis demonstrate the efficiency and effectiveness of the proposed algorithm for feature selection.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TKDE.2013.65</guid>
  </item>
  <item>
     <title>PrePrint: Bias correction in small sample from big data</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TKDE.2012.220</link>
     <description>This paper discusses the bias problem when estimating the population size of big data such as online social networks (OSN) using uniform random sampling and simple random walk. Unlike the traditional estimation problem where the sample size is not very small relative to the data size, in big data a small sample relative to the data size is already very large and costly to obtain. We point out that when small samples are used, there is a bias that is no longer negligible. This paper shows analytically that the relative bias can be approximated by the reciprocal of the number of collisions; on that basis, a bias correction estimator is introduced. The result is further supported by both simulation studies and the real Twitter network that contains 41.7 million nodes.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TKDE.2012.220</guid>
  </item>
  <item>
     <title>PrePrint: Event Characterization and Prediction Based on Temporal Patterns in Dynamic Data System</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TKDE.2013.60</link>
     <description>The new method proposed in this paper applies a Multivariate Reconstructed Phase Space (MRPS) for identifying multivariate temporal patterns that are characteristic and predictive of anomalies or events in a dynamic data system. The new method extends the original univariate reconstructed phase space framework, which is based on a fuzzy unsupervised clustering method, by incorporating a new mechanism of data categorization based on the definition of events. In addition to modeling temporal dynamics in a multivariate phase space, a Bayesian approach is applied to model the first-order Markov behavior in the multi-dimensional data sequences. The method utilizes an exponential loss objective function to optimize a hybrid classifier which consists of a radial basis kernel function and a log-odds ratio component. We performed an experimental evaluation on three data sets to demonstrate the feasibility and effectiveness of the proposed approach.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TKDE.2013.60</guid>
  </item>
  <item>
     <title>PrePrint: Responsibility Analysis for Lineages of Conjunctive Queries with Inequalities</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TKDE.2013.58</link>
     <description>This paper investigates the problem of efficiently computing responsibility for lineages of conjunctive queries with inequalities on databases. We classify the lineages of a class of queries with inequalities, called IQ queries, into path and composite lineages. We first compile path lineages into lineage graphs and transform lineage graphs into matrices. Then we reduce the problem of computing responsibility for path lineages to the shortest path problem, which can be solved by the dynamic programming algorithm in PTIME. We further prove composite lineages can be decomposed into path lineages for responsibility analysis. Thus, our first main result shows it is in PTIME to compute responsibility for lineages of IQ queries. We generalize the previous results on dichotomy of responsibility analysis for lineages of conjunctive queries with equalities, now in the presence of inequalities. After decomposing composite lineages into path lineages, the data population needed for computing responsibility decreases by one order of magnitude. Thus, our algorithm can efficiently compute responsibility for composite lineages. In order to compute responsibility for lineages in general, we introduce a greedy algorithm, consisting of a reduction to the set cover problem. Finally, we demonstrate the benefits of the proposed algorithms with extensive experimental results.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TKDE.2013.58</guid>
  </item>
  <item>
     <title>PrePrint: Efficient Ranking on Entity Graphs with Personalized Relationships</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TKDE.2013.52</link>
     <description>Authority flow techniques like PageRank and ObjectRank can provide personalized ranking of typed entity-relationship graphs. There are two main ways to personalize authority flow ranking: Node-based personalization, where authority originates from a set of user-specific nodes; Edge-based personalization, where the importance of different edge types is user-specific. We propose the first approach to achieve efficient edge-based personalization using a combination of precomputation and runtime algorithms. In particular, we apply our method to ObjectRank, where a personalized weight assignment vector (WAV) assigns different weights to each edge type or relationship type. Our approach includes a repository of rankings for various WAVs. We consider the following two classes of approximation: (a) SchemaApprox is formulated as a distance minimization problem at the schema level; (b) DataApprox is a distance minimization problem at the data graph level. SchemaApprox is not robust since it does not distinguish between important and trivial edge types based on the edge distribution in the data graph. In contrast, DataApprox has a provable error bound. Both SchemaApprox and DataApprox are expensive so we develop efficient heuristic implementations, ScaleRank and PickOne respectively. Extensive experiments on the DBLP data graph show that ScaleRank provides a fast and accurate personalized authority flow ranking.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TKDE.2013.52</guid>
  </item>
  <item>
     <title>PrePrint: Semi-Supervised Heterogeneous Fusion for Multimedia Data Co-Clustering</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TKDE.2013.47</link>
     <description>Co-clustering is a commonly used technique for tapping the rich meta-information of multimedia web documents, including category, annotation, and description, for associative discovery. However, most co-clustering methods proposed for heterogeneous data do not consider the representation problem of short and noisy text and their performance is limited by the empirical weighting of the multi-modal features. In this paper, we propose a generalized form of Heterogeneous Fusion Adaptive Resonance Theory, called GHF-ART, for co-clustering of large-scale web multimedia documents. By extending the two-channel Heterogeneous Fusion ART (HF-ART) to multiple channels, GHF-ART is designed to handle multimedia data with an arbitrarily rich level of meta-information. For handling short and noisy text, GHF-ART does not learn directly from the textual features. Instead, it identifies key tags by learning the probabilistic distribution of tag occurrences. More importantly, GHF-ART incorporates an adaptive method for effective fusion of multi-modal features, which weights the features of multiple data sources by incrementally measuring the importance of feature modalities through the intra-cluster scatters. Extensive experiments on two web image data sets and one text document set have shown that GHF-ART achieves significantly better clustering performance and is much faster than many existing state-of-the-art algorithms.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TKDE.2013.47</guid>
  </item>
  <item>
     <title>PrePrint: Secure Mining of Association Rules in Horizontally Distributed Databases</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TKDE.2013.41</link>
     <description>We propose a protocol for secure mining of association rules in horizontally distributed databases. The current leading protocol is that of Kantarcioglu and Clifton [18]. Our protocol, like theirs, is based on the Fast Distributed Mining (FDM) algorithm of Cheung et al. [8], which is an unsecured distributed version of the Apriori algorithm. The main ingredients in our protocol are two novel secure multi-party algorithms: one that computes the union of private subsets that each of the interacting players holds, and another that tests the inclusion of an element held by one player in a subset held by another. Our protocol offers enhanced privacy with respect to the protocol in [18]. In addition, it is simpler and is significantly more efficient in terms of communication rounds, communication cost and computational cost.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TKDE.2013.41</guid>
  </item>
  <item>
     <title>PrePrint: On-Demand Snapshot: An Efficient Versioning File System for Phase-Change Memory</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TKDE.2013.35</link>
     <description>Versioning file systems are widely used in modern computer systems as they provide system recovery and old data access functions by retaining previous file system snapshots. However, existing versioning file systems do not perform well with the emerging PCM (phase change memory) storage, because they are optimized for hard disks. Specifically, the large number of additional writes incurred by maintaining snapshots seriously degrades the performance of PCM, as write operations are the performance bottleneck of PCM. This paper presents a novel versioning file system, designed for PCM, that reduces the writing overhead of a snapshot significantly. Unlike existing versioning file systems that incur cascade writes up to the file system root, our scheme breaks the recursive update chain at the immediate parent level. The proposed file system is implemented on Linux 2.6 as a prototype. Measurement studies with various I/O benchmarks show that the proposed file system improves the I/O throughput by 144% on average, compared to ZFS, a representative versioning file system.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TKDE.2013.35</guid>
  </item>
  <item>
     <title>PrePrint: The Role of Hubness in Clustering High-Dimensional Data</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TKDE.2013.25</link>
     <description>High-dimensional data arise naturally in many domains, and have regularly presented a great challenge for traditional data-mining techniques, both in terms of effectiveness and efficiency. Clustering becomes difficult due to the increasing sparsity of such data, as well as the increasing difficulty in distinguishing distances between data points. In this paper we take a novel perspective on the problem of clustering high-dimensional data. Instead of attempting to avoid the curse of dimensionality by observing a lower-dimensional feature subspace, we embrace dimensionality by taking advantage of inherently high-dimensional phenomena. More specifically, we show that hubness, i.e., the tendency of high-dimensional data to contain points (hubs) that frequently occur in k-nearest neighbor lists of other points, can be successfully exploited in clustering. We validate our hypothesis by demonstrating that hubness is a good measure of point centrality within a high-dimensional data cluster, and by proposing several hubness-based clustering algorithms, showing that major hubs can be used effectively as cluster prototypes or as guides during the search for centroid-based cluster configurations. Experimental results demonstrate good performance of our algorithms in multiple settings, particularly in the presence of large quantities of noise.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TKDE.2013.25</guid>
  </item>
  <item>
     <title>PrePrint: m-Privacy for Collaborative Data Publishing</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TKDE.2013.18</link>
     <description>In this paper, we consider the collaborative data publishing problem for anonymizing horizontally partitioned data at multiple data providers. We consider a new type of &#x0022;insider attack&#x0022; by colluding data providers who may use their own data records (a subset of the overall data) to infer the data records contributed by other data providers. The paper addresses this new threat, and makes several contributions. First, we introduce the notion of m-privacy, which guarantees that the anonymized data satisfies a given privacy constraint against any group of up to m colluding data providers. Second, we present heuristic algorithms exploiting the monotonicity of privacy constraints for efficiently checking m-privacy given a group of records. Third, we present a data provider-aware anonymization algorithm with adaptive m-privacy checking strategies to ensure high utility and m-privacy of anonymized data with efficiency. Finally, we implement the m-privacy anonymization and verification algorithms with a trusted third party (TTP), and propose secure multiparty computation protocols for scenarios without TTP. All protocols are extensively analyzed and their security and efficiency are formally proved. Experiments on real-life datasets suggest that our approach achieves better or comparable utility and efficiency than existing and baseline algorithms while satisfying m-privacy.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TKDE.2013.18</guid>
  </item>
  <item>
     <title>PrePrint: CoDe Modeling of Graph Composition for Data Warehouse Report Visualization</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TKDE.2013.24</link>
     <description>The visualization of information contained in reports is an important aspect of human-computer interaction, for both the accuracy and the complexity of relationships between data must be preserved. Greater attention has been paid to individual report visualization through different types of standard graphs (Histograms, Pies, etc.). However, this kind of representation provides separate information items and gives no support for visualizing their relationships, which are extremely important for most decision processes. This paper presents a design methodology exploiting the visual language CoDe based on a logic paradigm. CoDe allows the visualization to be organized through the CoDe model, which graphically represents relationships between information items and can be considered a conceptual map of the view. The proposed design methodology is composed of four phases: the CoDe Modeling and OLAP Operation pattern definition phases define the CoDe model and underlying meta-data information, the OLAP Operation phase physically extracts data from a data warehouse, and the Report Visualization phase generates the final visualization. Moreover, a case study on real data is provided.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TKDE.2013.24</guid>
  </item>
  <item>
     <title>PrePrint: Linkable Ring Signature with Unconditional Anonymity</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TKDE.2013.17</link>
     <description>In this paper, we construct a linkable ring signature scheme with unconditional anonymity. The construction of an unconditionally anonymous linkable ring signature scheme has been regarded as an open problem in [LiuWeWo04b] since 2004. We are the first to solve this open problem by giving a concrete instantiation, which is proven secure in the random oracle model. Our construction is even more efficient than other schemes that can only provide computational anonymity. At the same time, our scheme can act as a counterexample to show that Theorem 1 stated in [JeongKL08], which states that a linkable ring signature scheme cannot provide strong anonymity, is not always true. Indeed, we prove that our scheme can achieve strong anonymity (under one of the interpretations).</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TKDE.2013.17</guid>
  </item>
  <item>
     <title>PrePrint: A Probabilistic Approach to String Transformation</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TKDE.2013.11</link>
     <description>Many problems in natural language processing, data mining, information retrieval, and bioinformatics can be formalized as string transformation, which is the following task: given an input string, the system generates the k most likely output strings corresponding to the input string. This paper proposes a novel and probabilistic approach to string transformation, which is both accurate and efficient. The approach includes the use of a log linear model, a method for training the model, and an algorithm for generating the top k candidates, whether or not there is a predefined dictionary. The log linear model is defined as a conditional probability distribution of an output string and a rule set for the transformation conditioned on an input string. The learning method employs maximum likelihood estimation for parameter estimation. The string generation algorithm based on pruning is guaranteed to generate the optimal top k candidates. The proposed method is applied to correction of spelling errors in queries as well as reformulation of queries in web search. Experimental results on large scale data show that the proposed approach is very accurate and efficient, improving upon existing methods in terms of accuracy and efficiency in different settings.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TKDE.2013.11</guid>
  </item>
  <item>
     <title>PrePrint: Privacy-Preserving Enhanced Collaborative Tagging</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TKDE.2012.248</link>
     <description>Collaborative tagging is one of the most popular services available online, and it allows end users to loosely classify either online or offline resources based on their feedback, expressed in the form of free-text labels (i.e., tags). Although tags are not per se sensitive information, the wide use of collaborative tagging services increases the risk of cross referencing, thereby seriously compromising user privacy. In this paper, we make a first contribution in this direction by showing how a specific privacy-enhancing technology, namely tag suppression, can be used to protect end-user privacy. Moreover, we analyze how our approach can affect the effectiveness of a policy-based collaborative tagging system which supports enhanced Web access functionalities, like content filtering and discovery, based on preferences specified by end users.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TKDE.2012.248</guid>
  </item>
  <item>
     <title>PrePrint: Typicality-Based Collaborative Filtering Recommendation</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TKDE.2013.7</link>
     <description>Collaborative filtering (CF) is an important and popular technology for recommender systems. However, current CF methods suffer from problems such as data sparsity, inaccurate recommendations, and big-error predictions. In this paper, we borrow ideas of object typicality from cognitive psychology and propose a novel typicality-based collaborative filtering recommendation method named TyCo. A distinct feature of typicality-based CF is that it finds &#x2018;neighbors&#x2019; of users based on user typicality degrees in user groups (instead of the co-rated items of users or common users of items in traditional CF). To the best of our knowledge, there is no prior work on investigating CF recommendation by combining object typicality. The proposed method outperforms the compared CF recommendation methods on recommendation accuracy (in terms of MAE). In particular, this method works well even with sparse training data and has a lower time cost than previous CF methods. Further, it can obtain more accurate predictions with fewer big-error predictions.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TKDE.2013.7</guid>
  </item>
  <item>
     <title>PrePrint: Mining Weakly-Labeled Web Facial Images for Search-Based Face Annotation</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TKDE.2012.240</link>
     <description>This paper investigates a framework of Search-Based Face Annotation (SBFA) by mining weakly-labeled facial images that are freely available on the World Wide Web (WWW). One challenging problem for a search-based face annotation scheme is how to effectively perform annotation by exploiting the list of most similar facial images and their weak labels, which are often noisy and incomplete. To tackle this problem, we propose an effective Unsupervised Label Refinement (ULR) approach for refining the labels of web facial images using machine learning techniques. We formulate the learning problem as a convex optimization and develop effective optimization algorithms to solve the large-scale learning task efficiently. To further speed up the proposed scheme, we also propose a clustering-based approximation algorithm which can improve the scalability considerably. We have conducted an extensive set of empirical studies on a large-scale web facial image testbed, in which encouraging results showed that the proposed ULR algorithms can significantly boost the performance of the promising SBFA scheme.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TKDE.2012.240</guid>
  </item>
  <item>
     <title>PrePrint: Disputant Relation-based Classification for Contrasting Opposing Views of Contentious News Issues</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TKDE.2012.238</link>
     <description>Contentious news issues, such as the health care reform debate, draw much interest from the public; however, it is not simple for an ordinary user to search and contrast the opposing arguments and gain a comprehensive understanding of the issues. Providing a classified view of the opposing views of the issues can help readers to easily understand the issue from multiple perspectives. We present a disputant relation-based method for classifying news articles on contentious issues. We observe that the disputants of a contention are an important feature for understanding the discourse. The method performs unsupervised classification on news articles based on disputant relations, and helps readers intuitively view the articles through the opponent-based frame and attain a balanced understanding, free from a specific biased viewpoint. The method is performed in three stages: disputant extraction, disputant partitioning, and article classification. We apply a modified version of the HITS algorithm and an SVM classifier trained with pseudo-relevant data for article analysis. We conduct an accuracy analysis and an upper bound analysis for evaluation of the method.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TKDE.2012.238</guid>
  </item>
  <item>
     <title>PrePrint: A Cocktail Approach for Travel Package Recommendation</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TKDE.2012.233</link>
     <description>This paper provides a study of exploiting online travel information for personalized travel package recommendation. A critical challenge along this line is to address the unique characteristics of travel data, which distinguish travel packages from traditional items for recommendation. To that end, we first analyze the characteristics of the travel packages and develop a Tourist-Area-Season Topic (TAST) model. This TAST model can represent travel packages and tourists by topic distributions, where the topic extraction is conditioned on both the tourists and the intrinsic features of the landscapes. Then, based on this representation, we propose a cocktail approach to generate the lists for personalized travel package recommendation. Furthermore, we extend the TAST model to the Tourist-Relation-Area-Season Topic (TRAST) model for capturing the latent relationships among the tourists in each travel group. Finally, we evaluate the TAST model, the TRAST model, and the cocktail recommendation approach on the real-world travel package data. Experimental results show that the TAST model can effectively capture the unique characteristics of the travel data and the cocktail approach is thus much more effective than traditional recommendation techniques for travel package recommendation. Also, the TRAST model can be used as an effective assessment for travel group formation.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TKDE.2012.233</guid>
  </item>
  <item>
     <title>PrePrint: Learning Conditional Preference Networks From Inconsistent Examples</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TKDE.2012.231</link>
     <description>The problem of learning Conditional Preference Networks (CP-nets) from a set of examples has received great attention recently. However, because of the randomness of the users' behaviors and the observation errors, there is always some noise making the examples inconsistent, namely, there exists at least one outcome preferred over itself (by transferring) in examples. Existing CP-nets learning methods cannot handle inconsistent examples. In this work, we introduce the model of learning consistent CP-nets from inconsistent examples and present a method to solve this model. We do not learn the CP-nets directly. Instead, we first learn a preference graph from the inconsistent examples, because dominance testing and consistency testing in preference graphs are easier than those in CP-nets. The problem of learning preference graphs is translated into a 0-1 programming problem and is solved by branch-and-bound search. Then the obtained preference graph is transformed into an equivalent CP-net, which can entail a subset of examples with maximal sum of weight. Examples are given to show that our method can obtain consistent CP-nets over both binary and multivalued variables from inconsistent examples. The proposed method is verified on both simulated data and real data, and it is also compared with existing methods.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TKDE.2012.231</guid>
  </item>
  <item>
     <title>PrePrint: Effective Online Group Discovery in Trajectory Databases</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TKDE.2012.193</link>
     <description>GPS-enabled devices are pervasive nowadays. Finding movement patterns in trajectory data streams is gaining in importance. We propose a group discovery framework that aims to efficiently support the online discovery of moving objects that travel together. The framework adopts a sampling-independent approach that makes no assumptions about when positions are sampled, gives no special importance to sampling points, and naturally supports the use of approximate trajectories. The framework's algorithms exploit state-of-the-art, density-based clustering (DBScan) to identify groups. The groups are scored based on their cardinality and duration, and the top-$k$ groups are returned. To avoid returning similar subgroups in a result, notions of domination and similarity are introduced that enable the pruning of low-interest groups. Empirical studies on real and synthetic data sets offer insight into the effectiveness and efficiency of the proposed framework.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TKDE.2012.193</guid>
  </item>
  <item>
     <title>PrePrint: Efficient Index-Based Approaches for Skyline Queries in Location-Based Applications</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TKDE.2012.216</link>
     <description>Enriching many location-based applications, various new skyline queries are proposed and formulated based on the notion of locational dominance, which extends the conventional one by taking objects&#8217; nearness to query positions into account in addition to objects&#8217; non-spatial attributes. To answer a representative class of skyline queries for location-based applications efficiently, this paper presents two index-based approaches, namely, Augmented R-tree and Dominance Diagram. Augmented R-tree extends R-tree by including aggregated non-spatial attributes in index nodes to enable dominance checks during index traversal. Dominance Diagram is a solution-based approach, by which each object is associated with a precomputed non-dominance scope wherein query points should have the corresponding object not locationally dominated by any other. Dominance Diagram enables skyline queries to be evaluated via parallel and independent comparisons between non-dominance scopes and query points, providing very high search efficiency. The performance of these two approaches is evaluated via empirical studies, in comparison with other possible approaches.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TKDE.2012.216</guid>
  </item>
  <item>
     <title>PrePrint: Supervised Multiple Kernel Embedding for Learning Predictive Subspaces</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TKDE.2012.213</link>
     <description>For supervised learning problems, dimensionality reduction is generally applied as a preprocessing step. However, coupled training of dimensionality reduction and supervised learning steps may improve the prediction performance. In this paper, we propose a novel dimensionality reduction algorithm coupled with a supervised kernel-based learner, called supervised multiple kernel embedding, that integrates multiple kernel learning to dimensionality reduction and performs prediction on the projected subspace with a joint optimization framework. Combining multiple kernels allows us to combine different feature representations and/or similarity measures towards a unified subspace. We perform experiments on one digit recognition and two bioinformatics data sets. Our proposed method significantly outperforms multiple kernel Fisher discriminant analysis followed by a standard kernel-based learner, especially on low dimensions.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TKDE.2012.213</guid>
  </item>
  <item>
     <title>PrePrint: Clustering Based on Enhanced &amp;#x03B1;-Expansion Move</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TKDE.2012.202</link>
     <description>The exemplar-based data clustering problem can be formulated as minimizing an energy function defined on a Markov random field (MRF). However, most algorithms for optimizing MRF energy functions can&#8217;t be directly applied to the task of clustering, as the problem has a high-order energy function. In this paper, we first show that the high-order energy function for the clustering problem can be simplified as a pairwise energy function with the metric property, and consequently it can be optimized by the alpha-expansion move algorithm based on graph cut. Then, the original expansion move algorithm is improved in the following two aspects: (I) Instead of solving a minimal s-t graph cut problem, we show that there is an explicit and interpretable solution for minimizing the energy function in the clustering problem. Based on this interpretation, a fast alpha-expansion move algorithm is proposed, which is much more efficient than the graph cut based algorithm. (II) The fast alpha-expansion move algorithm is further improved by extending its move space so that a larger energy value reduction can be achieved in each iteration. Experiments on benchmark datasets show the enhanced expansion move algorithm has a better performance, compared to other state-of-the-art exemplar-based clustering algorithms.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TKDE.2012.202</guid>
  </item>
  <item>
     <title>PrePrint: EnBay: A Novel Pattern-Based Bayesian Classifier</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TKDE.2012.197</link>
     <description>A promising approach to Bayesian classification is based on exploiting frequent patterns, i.e., patterns that frequently occur in the training dataset, to estimate the Bayesian probability. Pattern-based Bayesian classification focuses on building and evaluating reliable probability approximations by exploiting a subset of frequent patterns tailored to a given test case. This paper proposes a novel and effective approach to estimate the Bayesian probability. Unlike previous approaches, the Entropy-based Bayesian classifier, namely EnBay, focuses on selecting the minimal set of long and non-overlapping patterns that best complies with a conditional-independence model, based on an entropy-based evaluator. Furthermore, the probability approximation is separately tailored to each class. An extensive experimental evaluation, performed on both real and synthetic datasets, shows that EnBay is significantly more accurate than most state-of-the-art classifiers, Bayesian or not.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TKDE.2012.197</guid>
  </item>
  <item>
     <title>PrePrint: iLike: Bridging the Semantic Gap in Vertical Image Search by Integrating Text and Visual Features</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TKDE.2012.192</link>
     <description>With the development of the Internet and Web 2.0, a large volume of multimedia content has been made available online. It is highly desired to provide easy accessibility to such content. Towards this goal, content-based image retrieval has been intensively studied in the research community, while text-based search is better adopted in the industry. Both approaches have inherent disadvantages and limitations. Therefore, unlike the great success of text search, Web image search engines are still premature. In this paper, we present iLike, a vertical image search engine which integrates textual and visual features to improve retrieval performance. We bridge the semantic gap by capturing the meaning of each text term in the visual feature space, and re-weight visual features according to their significance to the query terms. We also bridge the user intention gap since we are able to infer the &#x0022;visual meanings&#x0022; behind the textual queries. Last but not least, we provide a visual thesaurus, which is generated from the statistical similarity between the visual space representations of textual terms. Experimental results show that our approach improves both precision and recall, compared with content-based or text-based image retrieval techniques. More importantly, search results from iLike are more consistent with users' perception of the query terms.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TKDE.2012.192</guid>
  </item>
  <item>
     <title>PrePrint: Efficient Cluster Labeling for Support Vector Clustering</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TKDE.2012.190</link>
     <description>We propose a new efficient algorithm for solving the cluster labeling problem in Support Vector Clustering (SVC). The proposed algorithm analyzes the topology of the function describing the SVC cluster contours and explores interconnection paths between critical points separating distinct cluster contours. This process allows distinguishing disjoint clusters and associating each point to its respective one. The proposed algorithm implements a new fast method for detecting and classifying critical points while analyzing the interconnection patterns between them. Experiments indicate that the proposed algorithm significantly improves the accuracy of the SVC labeling process in the presence of clusters of complex shape, while reducing the processing time required by existing SVC labeling algorithms by orders of magnitude.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TKDE.2012.190</guid>
  </item>
  <item>
     <title>PrePrint: A Family of Joint Sparse PCA Algorithms for Anomaly Localization in Network Data Streams</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TKDE.2012.176</link>
     <description>Determining anomalies in data streams that are collected and transformed from various types of networks has recently attracted significant research interest. Principal Component Analysis (PCA) has been extensively applied to detecting anomalies in network data streams. However, none of the existing PCA-based approaches addresses the problem of identifying the sources that contribute most to the observed anomaly, or anomaly localization. In this paper, we propose novel sparse PCA methods to perform anomaly detection and localization for network data streams. Our key observation is that we can localize anomalies by identifying a sparse low dimensional space that captures the abnormal events in data streams. To better capture the sources of anomalies, we incorporate the structure information of the network stream data in our anomaly localization framework. Furthermore, we extend our joint sparse PCA framework with multi-dimensional Karhunen-Lo&#232;ve Expansion (KLE) that considers both spatial and temporal domains of data streams to stabilize localization performance. We have performed comprehensive experimental studies of the proposed methods, and have compared our methods with the state-of-the-art using three real-world data sets from different application domains. Our experimental studies demonstrate the utility of the proposed methods.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TKDE.2012.176</guid>
  </item>
  <item>
     <title>PrePrint: Dealing with Uncertainty: A Survey of Theories and Practices</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TKDE.2012.179</link>
     <description>Uncertainty accompanies our life processes, and covers almost all fields of scientific studies. Two general categories of uncertainty, namely, aleatory uncertainty and epistemic uncertainty, exist in the world. While aleatory uncertainty refers to the inherent randomness in nature, derived from natural variability of the physical world (e.g., the random outcome of a flipped coin), epistemic uncertainty originates from humans' lack of knowledge of the physical world, as well as of the ability to measure and model the physical world (e.g., computation of the distance between two cities). Different kinds of uncertainty call for different handling methods. Aggarwal, Yu, Sarma, Zhang et al. have made good surveys on uncertain database management based on the probability theory. This paper reviews multi-disciplinary uncertainty processing activities in diverse fields. Beyond the dominant probability theory and fuzzy theory, we also review information-gap theory and recently derived uncertainty theory. Practices of these uncertainty handling theories in the domains of economics, engineering, ecology, and information sciences are also described. It is our hope that this study could provide insights to the database community on how uncertainty is managed in other disciplines; and further challenge and inspire database researchers to develop more advanced data management techniques and tools to cope with a variety of uncertainty issues in the real world.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TKDE.2012.179</guid>
  </item>
  <item>
     <title>PrePrint: A Group Incremental Approach to Feature Selection Applying Rough Set Technique</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TKDE.2012.146</link>
     <description>Many real data sets increase dynamically in size. This phenomenon occurs in several fields including economics, population studies and medical research. As an effective and efficient mechanism to deal with such data, incremental techniques have been proposed in the literature and have attracted much attention, which motivates the results in this paper. When a group of objects are added to a decision table, we first introduce incremental mechanisms for three representative information entropies and then develop a group incremental rough feature selection algorithm based on information entropy. When multiple objects are added to a decision table, the algorithm aims to find the new feature subset in a much shorter time. Experiments have been carried out on nine UCI data sets and the experimental results show that the algorithm is effective and efficient.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TKDE.2012.146</guid>
  </item>
  <item>
     <title>PrePrint: Towards the Automatic Extraction of Policy Networks using Web Links and Documents</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TKDE.2012.159</link>
     <description>Policy networks are widely used by political scientists and economists to explain various financial and social phenomena, such as the development of partnerships between political entities or institutions from different levels of governance. The analysis of policy networks demands a series of arduous and time consuming manual steps including interviews and questionnaires. In this paper, we estimate the strength of relations between actors in policy networks using features extracted from data harvested from the web. Features include webpage counts, outlinks, and lexical information extracted from web documents or web snippets. The proposed approach is automatic and does not require any external knowledge source, other than the specification of the word forms that correspond to the political actors. The features are evaluated both in isolation and jointly for both positive and negative (antagonistic) actor relations. The proposed algorithms are evaluated on two EU policy networks from the political science literature. Performance is measured in terms of correlation and mean square error between the human rated and the automatically extracted relations. Correlation of up to 0.74 is achieved for positive relations. The extracted networks are validated by political scientists and useful conclusions about the evolution of the networks over time are drawn.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TKDE.2012.159</guid>
  </item>
  <item>
     <title>PrePrint: Omnivariate Rule Induction Using a Novel Pairwise Statistical Test</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TKDE.2012.155</link>
     <description>Rule learning algorithms, for example, Ripper, induce univariate rules, that is, a propositional condition in a rule uses only one feature. In this paper, we propose an omnivariate induction of rules where at each condition, both a univariate and a multivariate condition are trained and the best is chosen according to a novel statistical test. This paper has three main contributions: First, we propose a novel statistical test, the combined 5$\times$2 cv $t$ test, to compare two classifiers, which is a variant of the 5$\times$2 cv $t$ test, and give the connections to other tests such as the 5$\times$2 cv $F$ test and the $k$-fold paired $t$ test. Second, we propose a multivariate version of Ripper where Support Vector Machine (SVM) with linear kernel is used to find multivariate linear conditions. Third, we propose an omnivariate version of Ripper where the model selection is done via the combined 5$\times$2 cv $t$ test. Our results indicate that (1) the combined 5$\times$2 cv $t$ test has higher power (lower type II error), lower type I error, and higher replicability compared to the 5$\times$2 cv $t$ test, (2) omnivariate rules are better in that they choose whichever condition is more accurate, selecting the right model automatically and separately for each condition in a rule.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TKDE.2012.155</guid>
  </item>
  <item>
     <title>PrePrint: Optimizing Multi-Top-k Queries over Uncertain Data Streams</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TKDE.2012.126</link>
     <description>Query processing over uncertain data streams, in particular top-$k$ query processing, has become increasingly important due to its wide application in many fields such as sensor network monitoring and internet traffic control. In many real applications, multiple top-$k$ queries are registered in the system. Sharing the results of these queries is a key factor in saving the computation cost and providing real time response. However, due to the complex semantics of uncertain top-$k$ query processing, it is nontrivial to implement sharing among different top-$k$ queries and few works have addressed the sharing issue. In this paper, we formulate various types of sharing among multiple top-$k$ queries over uncertain data streams based on the frequency upper bound of each top-$k$ query. We present an optimal dynamic programming solution as well as a more efficient (in terms of time and space complexity) greedy algorithm to compute an execution plan for the queries that saves computation cost among them. Experiments have demonstrated that the greedy algorithm can find the optimal solution in most cases, and it can achieve almost the same performance (in terms of latency and throughput) as the dynamic programming approach.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TKDE.2012.126</guid>
  </item>
  <item>
     <title>PrePrint: NHOP: A Nested Associative Pattern for Analysis of Consensus Sequence Ensembles</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TKDE.2012.151</link>
     <description>In this research, we introduce a novel, complex associative pattern that is found to be very useful because it identifies the core associative structure from the data. We refer to it as nested high-order pattern (or NHOP). The pattern is more specific than associative patterns represented as multiple variables. It also generalizes sequential patterns, as the outcomes need not be contiguous. This paper outlines two search algorithms, r-Tree and Best-k, for its detection. The pattern is then applied to the analysis of a biomolecule using the aligned sequence family of the molecule. In the SH3 protein, a model for protein-protein interaction mediator, we identify functional groups (core and binding sites) in the three-dimensional structure as well as amino acid patterns dominating certain species.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TKDE.2012.151</guid>
  </item>
  <item>
     <title>PrePrint: Static and Dynamic Structural Correlations in Graphs</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TKDE.2012.133</link>
     <description>Real-life graphs not only contain nodes and edges, but also have events taking place, e.g., product sales in social networks. Some events exhibit strong correlations with the network structure, while others do not. Such structural correlations will shed light on viral influence existing in the corresponding network. Unfortunately, the traditional association mining concept is not applicable in graphs since it only works on homogeneous datasets. We propose a novel measure for assessing structural correlations in graphs with events. The measure applies hitting time to aggregate the proximity among nodes that have the same event. In order to calculate correlation scores for many events in large networks, we develop a scalable framework, called gScore, using sampling and approximation. By comparing to the situation where events are randomly distributed in the network, our method is able to discover events that are highly correlated with the graph structure. We test gScore's effectiveness by synthetic events on the DBLP co-author network and report interesting correlation results in a social network extracted from TaoBao.com, the largest online shopping network in China. Scalability of gScore is tested on the Twitter network. We also propose a dynamic measure which can be used for discovering detailed evolutionary patterns.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TKDE.2012.133</guid>
  </item>
  <item>
     <title>PrePrint: Latent Structured Perceptrons for Large-Scale Learning with Hidden Information</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TKDE.2012.129</link>
     <description>Many real-world data mining problems contain hidden information (e.g., unobservable latent dependencies). We propose a perceptron-style method, latent structured perceptron, for fast discriminative learning of structured classification with hidden information. We also give theoretical analysis and demonstrate good convergence properties of the proposed method. Our method extends the perceptron algorithm for the learning task with hidden information, which can hardly be captured by traditional models. It relies on Viterbi decoding over latent variables, combined with simple additive updates. We perform experiments on one synthetic dataset and two real-world structured classification tasks. Compared to conventional non-latent models (e.g., conditional random fields, structured perceptrons), our method is significantly more accurate on real-world tasks. Compared to existing heavy probabilistic models of latent variables (e.g., latent conditional random fields), our method lowers the training cost significantly (almost one order of magnitude faster) yet with comparable or even superior classification accuracy. In addition, experiments demonstrate that the proposed method has good scalability on large-scale problems.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TKDE.2012.129</guid>
  </item>
  <item>
     <title>PrePrint: The Adaptive Clustering Method for the Long Tail Problem of Recommender Systems</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TKDE.2012.119</link>
     <description>This is a study of the long tail problem of recommender systems when many items in the long tail have only a few ratings, thus making it hard to use them in recommender systems. The approach presented in this paper clusters items according to their popularities, so that the recommendations for tail items are based on the ratings in more intensively clustered groups and for the head items are based on the ratings of individual items or groups, clustered to a lesser extent. We apply this method to two real-life datasets and compare the results with those of the non-grouping and fully grouped methods in terms of recommendation accuracy and scalability. The results show that if such adaptive clustering is done properly, this method reduces the recommendation error rates for the tail items, while maintaining reasonable computational performance.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TKDE.2012.119</guid>
  </item>
  <item>
     <title>PrePrint: Harnessing Folksonomies to Produce a Social Classification of Resources</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TKDE.2012.115</link>
     <description>In our daily lives, organizing resources like books or web pages into a set of categories to ease future access is a common task. The usual largeness of these collections requires a vast endeavor and an outrageous expense to organize manually. As an approach to effectively produce an automated classification of resources, we consider the immense amounts of annotations provided by users on social tagging systems in the form of bookmarks. In this paper, we deal with the utilization of these user-provided tags to perform a social classification of resources. For this purpose, we have created three large-scale social tagging datasets including tagging data for different types of resources, web pages and books. Those resources are accompanied by categorization data from sound expert-driven taxonomies. We analyze the characteristics of the three social tagging systems, and perform an analysis on the usefulness of social tags to perform a social classification of resources that resembles the classification by experts as much as possible. We analyze 6 different representations using tags, and compare to other data sources by using 3 different settings of SVM classifiers. Finally, we explore the appropriateness of combining different data sources with tags using classifier committees to best classify the resources.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TKDE.2012.115</guid>
  </item>
  <item>
     <title>PrePrint: COSAC: A framework for COmbinatorial Statistical Analysis on Cloud</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TKDE.2012.113</link>
     <description>In many scientific applications, it is critical to determine if there is a relationship between a combination of objects. The strength of such an association is typically computed using some statistical measures. In order not to miss any important associations, it is not uncommon to exhaustively enumerate all possible combinations of a certain size. However, discovering significant associations among hundreds of thousands or even millions of objects is a computationally-intensive job that typically takes days, if not weeks, to complete. In this paper, we propose a framework, COSAC, for such combinatorial statistical analysis for large-scale datasets over a MapReduce-based cloud computing platform. COSAC operates in two key phases: (a) In the distribution phase, a novel load balancing scheme distributes the combination enumeration tasks across the processing units; (b) In the statistical analysis phase, each unit optimizes the processing of the allocated combinations by salvaging computations that can be reused. COSAC also supports a more practical scenario where only a selected subset of objects need to be analyzed against all the objects. We have evaluated our framework on a cluster of more than 40 nodes. The experimental results show that our framework is computationally practical, efficient, scalable and flexible.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TKDE.2012.113</guid>
  </item>
  <item>
     <title>PrePrint: Towards Multi-Tenant Performance SLOs</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TKDE.2013.74</link>
     <description>As traditional and mission-critical relational database workloads migrate to the cloud in the form of Database-as-a-Service (DaaS), there is an increasing motivation to provide performance goals in Service Level Objectives (SLOs). Providing such performance goals is challenging for DaaS providers as they must balance the performance that they can deliver to tenants and the data center&#x2019;s operating costs. In general, aggressively aggregating tenants on each server reduces the operating costs but degrades performance for the tenants, and vice versa. In this paper, we present a framework that takes as input the tenant workloads, their performance SLOs, and the server hardware that is available to the DaaS provider, and outputs a cost-effective recipe that specifies how much hardware to provision and how to schedule the tenants on each hardware resource. We evaluate our method and show that it produces effective solutions that can reduce the costs for the DaaS provider while meeting performance goals.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TKDE.2013.74</guid>
  </item>
  <item>
     <title>PrePrint: A Myopic Approach to Ordering Nodes for Parameter Elicitation in Bayesian Belief Networks</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TKDE.2013.72</link>
     <description>Building Bayesian belief networks in the absence of data involves the challenging task of eliciting conditional probabilities from experts to parameterize the model. In this paper, we develop an analytical method for determining the optimal order for eliciting these probabilities. Our method uses prior distributions on network parameters and a novel expected proximity criterion to propose an order that maximizes information gain per unit elicitation time. We present analytical results when priors are uniform Dirichlet; for other priors, we find through experiments that the optimal order is strongly affected by which variables are of primary interest to the analyst. Our results should prove useful to researchers and practitioners involved in belief network model building and elicitation.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TKDE.2013.72</guid>
  </item>
  <item>
     <title>PrePrint: Infrequent Weighted Itemset Mining Using Frequent Pattern Growth</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TKDE.2013.69</link>
     <description>Frequent weighted itemsets represent correlations that hold frequently in data in which items may be weighted differently. However, in some contexts, e.g., when the need is to minimize a certain cost function, discovering rare data correlations is more interesting than mining frequent ones. This paper tackles the issue of discovering rare and weighted itemsets, i.e., the Infrequent Weighted Itemset (IWI) mining problem. Two novel quality measures are proposed to drive the IWI mining process. Furthermore, two algorithms that perform IWI and Minimal IWI mining efficiently, driven by the proposed measures, are presented. Experimental results show the efficiency and effectiveness of the proposed approach.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TKDE.2013.69</guid>
  </item>
  <item>
     <title>PrePrint: Fast Nearest Neighbor Search with Keywords</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TKDE.2013.66</link>
     <description>Conventional spatial queries, such as range search and nearest neighbor retrieval, involve only conditions on objects' geometric properties. Today, many applications call for novel forms of queries that aim to find objects satisfying both a spatial predicate and a predicate on their associated texts. For example, instead of considering all the restaurants, a nearest neighbor query would ask for the restaurant that is the closest among those whose menus contain &#x0022;steak, spaghetti, brandy&#x0022; all at the same time. Currently the best solution to such queries is based on the IR&#x00B2;-tree, which, as shown in this paper, has a few deficiencies that seriously impact its efficiency. Motivated by this, we develop a new access method called the spatial inverted index that extends the conventional inverted index to cope with multidimensional data, and comes with algorithms that can answer nearest neighbor queries with keywords in real time. As verified by experiments, the proposed techniques outperform the IR&#x00B2;-tree in query response time significantly, often by orders of magnitude.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TKDE.2013.66</guid>
  </item>
  <item>
     <title>PrePrint: A robust, distortion minimizing technique for watermarking relational databases using once-for-all usability constraints</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TKDE.2012.227</link>
     <description>Ownership protection on relational databases - shared with collaborators (or intended recipients) - demands developing a watermarking scheme that must be able to meet four challenges: (1) it should be robust against different types of attacks that an intruder could launch to corrupt the embedded watermark; (2) it should be able to preserve the knowledge in the databases to make them an effective component of knowledge-aware decision support systems; (3) it should try to strike a balance between the conflicting requirements of database owners, who require soft usability constraints, and database recipients, who want tight usability constraints that ensure minimum distortions in the data; and (4) last but not least, it should not require that a database owner define usability constraints for each type of application and every recipient separately. The major contribution of this paper is a robust and efficient watermarking scheme for relational databases that is able to meet all four of the above-mentioned challenges. The results of our experiments show that the proposed scheme achieves 100% decoding accuracy even if only one watermarked row is left in the database.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TKDE.2012.227</guid>
  </item>
  <item>
     <title>PrePrint: Dynamic Query Forms for Database Queries</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TKDE.2013.62</link>
     <description>Modern scientific and web databases maintain large and heterogeneous data. These real-world database schemas contain hundreds or even thousands of attributes and relations. Traditional predefined query forms are not able to satisfy various ad-hoc queries from users. This paper proposes DQF, a novel database query form interface, which is able to dynamically generate query forms. The essence of DQF is to capture the user&#x2019;s preference and rank query form components. The generation of the query form is an iterative process and is guided by the user. At each iteration, the system automatically generates ranking lists of form components and the user then adds the desired form components into the query form. The ranking of form components is based on the captured user preference. The user can also fill in the query form and submit queries to view the query result at each iteration. In this way, the query form can be dynamically refined until the user is satisfied with the query results. We propose a metric for measuring the goodness of a query form. A probabilistic model is developed for estimating the goodness of a query form in DQF. Our experimental evaluation and user study demonstrate the effectiveness and efficiency of the system.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TKDE.2013.62</guid>
  </item>
  <item>
     <title>PrePrint: A Two-Level Topic Model Towards Knowledge Discovery from Citation Networks</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TKDE.2013.56</link>
     <description>Knowledge discovery from scientific articles has received increasing attention recently since huge repositories are made available by the development of the Internet and digital databases. In a corpus of scientific articles such as a digital library, documents are connected by citations and one document plays two different roles in the corpus: the document itself and a citation of other documents. In the existing topic models, little effort is made to differentiate these two roles. We believe that the topic distributions of these two roles are different and related in a certain way. In this paper we propose a Bernoulli Process Topic (BPT) model which considers the corpus at two levels: document level and citation level. In the BPT model, each document has two different representations in the latent topic space associated with its roles. Moreover, the multi-level hierarchical structure of the citation network is captured by a generative process involving a Bernoulli process. The comparisons against state-of-the-art methods demonstrate a very promising performance. The implementations and the datasets are available online.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TKDE.2013.56</guid>
  </item>
  <item>
     <title>PrePrint: Local Thresholding in General Network Graphs</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TKDE.2013.59</link>
     <description>Local thresholding algorithms were first presented more than a decade ago and have since been applied to a variety of data mining tasks in peer-to-peer systems, wireless sensor networks, and grid systems. One critical assumption made by those algorithms has always been cycle-free routing. The existence of even one cycle may lead all peers to the wrong outcome. Outside the lab, unfortunately, cycle freedom is not easy to achieve. This work is the first to lift the requirement of cycle freedom by presenting a local thresholding algorithm suitable for general network graphs. The algorithm relies on a new repositioning of the problem in weighted vector arithmetic, on a new stopping rule whose proof does not require that the network be cycle free, and on new methods for balance correction when the stopping rule fails. The new stopping and update rules permit calculation of the very same functions that were calculable using previous algorithms, which do assume cycle freedom. The algorithm is implemented on a standard peer-to-peer simulator and is validated for networks of up to 80,000 peers, organized in three different topologies representative of major current distributed systems: the Internet, structured peer-to-peer systems, and wireless sensor networks.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TKDE.2013.59</guid>
  </item>
  <item>
     <title>IEEE Transactions on Knowledge and Data Engineering</title>
     <link>http://www.computer.org/portal/site/tkde/</link>
     <description>IEEE Transactions on Knowledge and Data Engineering</description>
     <guid isPermaLink="true">http://www.computer.org/portal/site/tkde/</guid>
  </item>
   </channel>
</rss>