<?xml version="1.0" encoding="ISO-8859-1"?>
<rss version="2.0">
<channel>
<title>IEEE Transactions on Knowledge and Data Engineering</title>
<link>http://www.computer.org/tkde</link>
<description>The IEEE Transactions on Knowledge and Data Engineering is an archival journal published monthly. The information published in this Transactions is designed to inform researchers, developers, managers, strategic planners, users, and others interested in state-of-the-art and state-of-the-practice activities in the knowledge and data engineering area. We are interested in well-defined theoretical results and empirical studies that have potential impact on the acquisition, management, storage, and graceful degeneration of knowledge and data, as well as in provision of knowledge and data services. Specific topics include, but are not limited to: a) artificial intelligence techniques, including speech, voice, graphics, images, and documents; b) knowledge and data engineering tools and techniques; c) parallel and distributed processing; d) real-time distributed; e) system architectures, integration, and modeling; f) database design, modeling and management; g) query design and implementation languages; h) distributed database control; j) algorithms for data and knowledge management; k) performance evaluation of algorithms and systems; l) data communications aspects; m) system applications and experience; n) knowledge-based and expert systems; and, o) integrity, security, and fault tolerance.	</description>
	<language>en-us</language>
	<pubDate>Wed, 4 Jan 2012 11:00:01 GMT</pubDate>
	<image>
		<url>http://csdl.computer.org/common/images/logos/tkde.gif</url>
		<title>IEEE Computer Society</title>
		<description>List of recently published journal articles</description>
		<link>http://www.computer.org/tkde</link>
	</image>
  <item>
     <title>PrePrint: Information-Theoretic Outlier Detection for Large-Scale Categorical Data</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TKDE.2011.261</link>
     <description>Outlier detection can usually be considered as a pre-processing step for locating, in a dataset, those objects that do not conform to well-defined notions of expected behavior. It is very important in data mining for discovering novel or rare events, anomalies, vicious actions, exceptional phenomena, etc. We are investigating outlier detection for categorical datasets. This problem is especially challenging because of the difficulty of defining a meaningful similarity measure for categorical data. In this paper, we propose a formal definition of outliers and an optimization model of outlier detection, via a new concept of holo-entropy that takes both entropy and total correlation into consideration. Based on this model, we define a function for the outlier factor of an object which is solely determined by the object itself and can be updated efficiently. We propose two practical 1-parameter outlier detection methods, named ITB-SS and ITB-SP, which require no user-defined parameters for deciding whether an object is an outlier. Users need only provide the number of outliers they want to detect. Experimental results show that ITB-SS and ITB-SP are more effective and efficient than mainstream methods and can be used to deal with both large and high-dimensional datasets where existing algorithms fail.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TKDE.2011.261</guid>
  </item>
  <item>
     <title>PrePrint: Distributed Line Graphs: A Universal Technique for Designing DHTs Based on Arbitrary Regular Graphs</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TKDE.2011.258</link>
     <description>Most proposed DHTs engage certain topology maintenance mechanisms specific to the static graphs on which they are based. The designs of these mechanisms are complicated and repeated with graph-relevant concerns. In this paper we propose the &#x201C;distributed line graphs&#x201D; (DLG), a universal technique for designing DHTs based on arbitrary regular graphs. Using DLG, the main features of the initial graphs are preserved, and thus people can design a new DHT by simply choosing the graph with desirable features and applying DLG to it. We demonstrate the power of DLG by illustrating four DLG-enabled DHTs based on different graphs, namely, Kautz, de Bruijn, butterfly and hypertree graphs. The effectiveness of our proposals is demonstrated through analysis, simulation and implementation.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TKDE.2011.258</guid>
  </item>
  <item>
     <title>PrePrint: Transfer Across Completely Different Feature Spaces via Spectral Embedding</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TKDE.2011.252</link>
     <description>In many applications, it is difficult to obtain a lot of labeled examples. One practically important problem is: can the labeled data from other related sources help predict the target task, even if they have (a) different feature spaces (e.g., image vs. text data), (b) different data distributions, and (c) different output spaces? This paper proposes a solution and discusses the conditions where this is highly likely to produce better results. It first unifies the feature spaces of the target and source data sets by spectral embedding, even when they have completely different feature spaces. The principle is to devise an optimization objective that preserves the original structure of the data while, at the same time, maximizing the similarity between the two. A linear projection model and a non-linear approach are derived on the basis of this principle, both in closed form. Second, a judicious sample selection strategy is applied to select only those related source examples. Finally, a Bayesian-based approach is applied to model the relationship between different output spaces. The three steps can bridge related heterogeneous sources in order to learn the target task. Across the 20 experimental data sets, the proposed models can reduce the error rate by as much as 50%.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TKDE.2011.252</guid>
  </item>
  <item>
     <title>PrePrint: A New Algorithm for Inferring User Search Goals with Feedback Sessions</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TKDE.2011.248</link>
     <description>For a broad-topic and ambiguous query, different users may have different search goals when they submit it to a search engine. The inference and analysis of user search goals can be very useful in improving search engine relevance and user experience. In this paper, we propose a novel approach to infer user search goals by analyzing search engine query logs. Firstly, we propose a framework to discover different user search goals for a query by clustering the proposed feedback sessions. Feedback sessions are constructed from user click-through logs and can efficiently reflect the information needs of users. Secondly, we propose a novel approach to generate pseudo-documents to better represent the feedback sessions for clustering. Finally, we propose a new criterion 'Classified Average Precision (CAP)' to evaluate the performance of inferring user search goals. Experimental results are presented using user click-through logs from a commercial search engine to validate the effectiveness of our proposed methods.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TKDE.2011.248</guid>
  </item>
  <item>
     <title>PrePrint: Fuzzy Web Data Tables Integration Guided by an Ontological and Terminological Resource</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TKDE.2011.245</link>
     <description>In this paper, we present the design of the ONDINE system which allows the loading and the querying of a data warehouse opened on the Web, guided by a domain Termino-Ontological Resource (TOR). The data warehouse, composed of data tables extracted from Web documents, has been built to supplement existing local data sources. First we present the main steps of our semi-automatic method to annotate Web data tables driven by a domain TOR. The output of this method is an XML/RDF data warehouse composed of XML documents representing Web data tables with their fuzzy RDF annotations. We then present our flexible querying system which allows the local data sources and the XML/RDF data warehouse to be simultaneously and uniformly queried, using the domain TOR. This system relies on SPARQL and permits the retrieval of approximate answers extracted from Web data tables by comparing preferences, expressed in selection criteria using fuzzy sets, with fuzzy RDF annotations.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TKDE.2011.245</guid>
  </item>
  <item>
     <title>PrePrint: Radio Database Compression for Accurate Energy-Efficient Localization in Fingerprinting Systems</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TKDE.2011.241</link>
     <description>Location fingerprinting is a positioning method that exploits the already existing infrastructures such as cellular networks or WLANs. Regarding the recent demand for energy efficient networks and the emergence of issues like green networking, we propose a clustering technique to compress the radio database in the context of cellular fingerprinting systems. The aim of the proposed technique is to reduce the computation cost and transmission load in the mobile-based implementations. The presented method may be called Block-based Weighted Clustering (BWC) technique, which is applied in a concatenated location-radio signal space, and attributes different weight factors to the location and radio components. Computer simulations and real experiments have been conducted to evaluate the performance of our proposed technique in the context of a GSM network. The obtained results confirm the efficiency of the BWC technique, and show that it improves the performance of standard k-means and hierarchical clustering methods.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TKDE.2011.241</guid>
  </item>
  <item>
     <title>PrePrint: Facilitating Effective User Navigation through Web Site Structure Improvement</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TKDE.2011.238</link>
     <description>Designing well-structured Web sites to facilitate effective user navigation has long been a challenge. A primary reason is that the Web developers' understanding of how a Web site should be structured can be considerably different from that of the users. While various methods have been proposed to re-link Web pages to improve navigability using user navigation data, the completely reorganized new structure can be highly unpredictable, and the cost of disorienting users after the changes remains unanalyzed. This paper addresses how to improve a Web site without introducing substantial changes. Specifically, we propose a mathematical programming model to improve the user navigation on a Web site while minimizing alterations to its current structure. Results from extensive tests conducted on a publicly available real data set indicate that our model not only significantly improves the user navigation with very few changes, but also can be effectively solved. We have also tested the model on large synthetic data sets to demonstrate that it scales up very well. In addition, we define two evaluation metrics and use them to assess the performance of the improved Web site using the real data set. Evaluation results confirm that the user navigation on the improved structure is indeed greatly enhanced.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TKDE.2011.238</guid>
  </item>
  <item>
     <title>PrePrint: A Bound on Kappa-Error Diagrams for Analysis of Classifier Ensembles</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TKDE.2011.234</link>
     <description>Kappa-error diagrams are used to gain insights about why an ensemble method is better than another on a given data set. A point on the diagram corresponds to a pair of classifiers. The x-axis is the pairwise diversity (kappa), and the y-axis is the averaged individual error. In this study, kappa is calculated from the 2x2 correct/wrong contingency matrix. We derive a lower bound on kappa which determines the feasible part of the kappa-error diagram. Simulations and experiments with real data show that there is unoccupied feasible space on the diagram corresponding to (hypothetical) better ensembles, and that individual accuracy is the leading factor in improving the ensemble accuracy.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TKDE.2011.234</guid>
  </item>
  <item>
     <title>PrePrint: Range-Based Skyline Queries in Mobile Environments</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TKDE.2011.229</link>
     <description>Skyline query processing in location-based services, which considers both spatial and non-spatial attributes of the objects being queried, has recently received increasing attention. Existing solutions focus on solving point- or line-based skyline queries, in which the query location is an exact location point or a line segment. However, due to privacy considerations and the limited precision of localization devices, the input of a user location is often a two-dimensional range. This paper studies a new problem on how to process such range-based skyline queries. Two novel algorithms are proposed: one is index-based (I-SKY) and the other is not based on any index (N-SKY). To handle frequent movements of the objects being queried, we also propose incremental versions of I-SKY and N-SKY, which avoid recomputing the query index and results from scratch. Additionally, we develop efficient solutions for probabilistic and continuous range-based skyline queries. Experimental results show that our proposed algorithms clearly outperform the baseline algorithm that simply adopts the existing line-based skyline solution. Moreover, the incremental versions of I-SKY and N-SKY save substantial computation costs, especially when the objects move frequently.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TKDE.2011.229</guid>
  </item>
  <item>
     <title>PrePrint: Building a Scalable Database-Driven Reverse Dictionary</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TKDE.2011.225</link>
     <description>In this paper, we describe the design and implementation of a reverse dictionary. Unlike a traditional forward dictionary, which maps from words to their definitions, a reverse dictionary takes a user input phrase describing the desired concept, and returns a set of candidate words that satisfy the input phrase. This work has significant application not only for the general public, particularly those who work closely with words, but also in the general field of conceptual search. We present a set of algorithms and the results of a set of experiments showing the retrieval accuracy of our methods and the runtime response time performance of our implementation. Our experimental results show that our approach can provide significant improvements in performance scale without sacrificing the quality of the result. Our experiments comparing the quality of our approach to that of currently available reverse dictionaries show that our approach can provide significantly higher quality than either of the other currently available implementations.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TKDE.2011.225</guid>
  </item>
  <item>
     <title>PrePrint: Mining User Queries with Markov Chains: Application to Online Image Retrieval</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TKDE.2011.219</link>
     <description>We propose a novel method for automatic annotation, indexing and annotation-based retrieval of images. The new method, which we call Markovian Semantic Indexing (MSI), is presented in the context of an online image retrieval system. Assuming such a system, the users' queries are used to construct an Aggregate Markov Chain (AMC) through which the relevance between the keywords seen by the system is defined. The users' queries are also used to automatically annotate the images. A stochastic distance between images, based on their annotation and the keyword relevance captured in the AMC, is then introduced. Geometric interpretations of the proposed distance are provided and its relation to a clustering in the keyword space is investigated. By means of a new measure of Markovian state similarity, the mean first cross passage time (CPT), optimality properties of the proposed distance are proved. Images are modeled as points in a vector space and their similarity is measured with MSI. The new method is shown to possess certain theoretical advantages and also to achieve better Precision vs Recall results when compared to Latent Semantic Indexing (LSI) and probabilistic Latent Semantic Indexing (pLSI) methods in Annotation Based Image Retrieval (ABIR) tasks.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TKDE.2011.219</guid>
  </item>
  <item>
     <title>PrePrint: Scalable and Parallel Boosting with MapReduce</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TKDE.2011.208</link>
     <description>In this era of data abundance, it has become critical to be able to process large volumes of data at much faster rates than ever before. Boosting is a powerful predictive model that has been successfully used in many real-world applications. However, due to its inherent sequential nature, achieving scalability for boosting is not trivial and demands the development of new parallelized versions that can efficiently handle large-scale data. In this paper, we propose two parallel boosting algorithms, AdaBoost.PL and LogitBoost.PL, which facilitate simultaneous participation of multiple computing nodes to construct a boosted ensemble classifier. The proposed algorithms are competitive with the corresponding serial versions in terms of generalization performance. In addition, our algorithms achieve significant speedup since our approach does not require individual computing nodes to communicate with each other for sharing their data. Hence, they are applicable and are robust in preserving privacy of computations as well. We used the MapReduce framework to implement our algorithms and demonstrated their performance in terms of classification accuracy, speedup and scaleup using a wide variety of synthetic and real-world data sets.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TKDE.2011.208</guid>
  </item>
  <item>
     <title>PrePrint: k-Pattern Set Mining under Constraints</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TKDE.2011.204</link>
     <description>We introduce the problem of k-pattern set mining, concerned with finding a set of k related patterns under constraints. This contrasts with regular pattern mining, where one searches for many individual patterns. The k-pattern set mining problem is a very general problem that can be instantiated to a wide variety of well-known mining tasks including concept-learning, rule-learning, redescription mining, conceptual clustering and tiling. To this end, we formulate a large number of constraints for use in k-pattern set mining, both at the local level, that is, on individual patterns, and at the global level, that is, on the overall pattern set. Building general solvers for the pattern set mining problem remains a challenge. Here, we investigate to what extent constraint programming (CP) can be used as a general solution strategy. We present a mapping of pattern set constraints to constraints currently available in CP. This allows us to investigate a large number of settings within a unified framework and to gain insight into the possibilities and limitations of these solvers. This is important as it allows us to create guidelines on how to model new problems successfully and how to model existing problems more efficiently. It also opens up the way for other solver technologies.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TKDE.2011.204</guid>
  </item>
  <item>
     <title>PrePrint: Detecting Intrinsic Loops Underlying Data Manifold</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TKDE.2011.191</link>
     <description>Detecting intrinsic loop structures of a data manifold is the necessary pre-step for the proper employment of the manifold learning techniques and of fundamental importance in the discovery of the essential representational features underlying the data lying on the loopy manifold. An effective strategy is proposed to solve this problem in this study. In line with our intuition, a formal definition of a loop residing on a manifold is first given. Based on this definition, theoretical properties of loopy manifolds are rigorously derived. In particular, a necessary and sufficient condition for detecting essential loops of a manifold is derived. An effective algorithm for loop detection is then constructed. The soundness of the proposed theory and algorithm is validated by a series of experiments performed on synthetic and real-life data sets. In each of the experiments, the essential loops underlying the data manifold can be properly detected, and the intrinsic representational features of the data manifold can be revealed along the loop structure so detected. Particularly, some of these features can hardly be discovered by the conventional manifold learning methods.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TKDE.2011.191</guid>
  </item>
  <item>
     <title>PrePrint: Evaluating Data Reliability: An Evidential Answer with Application to a Web-Enabled Data Warehouse</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TKDE.2011.179</link>
     <description>There are many available methods to integrate information source reliability in an uncertainty representation, but there are only a few works focusing on the problem of evaluating this reliability. However, data reliability and confidence are essential components of a data warehousing system, as they influence subsequent retrieval and analysis. In this paper, we propose a generic method to assess data reliability from a set of criteria using the theory of belief functions. Customizable criteria and insightful decisions are provided. The chosen illustrative example comes from real-world data issued from the Sym'Previus predictive microbiology oriented data warehouse.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TKDE.2011.179</guid>
  </item>
  <item>
     <title>PrePrint: An Unsupervised Approach for Person Name Bipolarization Using Principal Component Analysis</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TKDE.2011.177</link>
     <description>A topic is usually associated with a specific time, place, and person(s). Generally, topics that involve bipolar or competing viewpoints are attention-getting and are thus reported in a large number of documents. Identifying the association between important persons mentioned in numerous topic documents would help readers comprehend topics more easily. In this paper, we propose an unsupervised approach for identifying bipolar person names in a set of topic documents. Specifically, we employ principal component analysis (PCA) to discover bipolar word usage patterns of person names in the documents, and show that the signs of the entries in the principal eigenvector of PCA partition the person names into bipolar groups spontaneously. To reduce the effect of data sparseness, we introduce two techniques, called the weighted correlation coefficient and off-topic block elimination. We also present a timeline system that shows the intensity and activeness development of the identified bipolar person groups. Empirical evaluations demonstrate the efficacy of the proposed approach in identifying bipolar person names in topic documents, while the generated timelines provide comprehensive storylines of topics.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TKDE.2011.177</guid>
  </item>
  <item>
     <title>PrePrint: Hybrid Generative/Discriminative Approaches for Proportional Data Modeling and Classification</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TKDE.2011.162</link>
     <description>The work proposed in this paper is motivated by the need to develop powerful models and approaches to classify and learn proportional data. Indeed, an abundance of interesting data in several applications occurs naturally in this form. Our goal is to discover and capture the intrinsic nature of the data by proposing some approaches that combine the major advantages of generative models, namely finite mixtures, and discriminative techniques, namely support vector machines (SVMs). Indeed, SVMs often rely on classic kernels which are not generally meaningful for proportional data. One serious limitation of these kernels is that they do not take into account the nature of the data to classify, and choosing a suitable kernel continues to be a formidable challenge for data mining and machine learning researchers. Our approach builds on selecting accurate kernels generated from finite mixtures of Dirichlet, generalized Dirichlet and Beta-Liouville distributions, whose chief advantage is their flexibility and explanatory capabilities in the case of heterogeneous proportional data. Using extensive simulations and a number of experiments involving scene modeling and classification, and automatic image orientation detection, we show the merits of the proposed mixture models and the accuracy of the generated kernels.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TKDE.2011.162</guid>
  </item>
  <item>
     <title>PrePrint: Multi-View Semi-Supervised Learning with Consensus</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TKDE.2011.160</link>
     <description>Obtaining high-quality and up-to-date labeled data can be difficult in many real-world machine learning applications. Semi-supervised learning aims to improve the performance of a classifier trained with a limited amount of labeled data by utilizing the unlabeled data. This paper demonstrates a way to improve the transductive SVM, which is an existing semi-supervised learning algorithm, by employing a multi-view learning paradigm. We propose a novel two-view transductive SVM that takes advantage of both the abundant amount of unlabeled data and their multiple representations to improve the performance of classifiers. The idea is fairly simple: train a classifier on each of the two views of both labeled and unlabeled data, and impose a global constraint requiring each classifier to assign the same class label to each labeled and unlabeled data point. We also incorporate manifold regularization into our learning framework. The proposed two-view transductive SVM was evaluated on both synthetic and real-life datasets. Experimental results show that our algorithm performs up to 10% better than a single-view learning approach, especially when the amount of labeled data is small. The other advantage of our two-view semi-supervised learning approach is its significantly improved stability, which is especially useful for noisy real-world data.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TKDE.2011.160</guid>
  </item>
  <item>
     <title>PrePrint: Joint Optimization of Index Freshness and Coverage in Real-Time Search Engines</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TKDE.2011.144</link>
     <description>Real-time search engines are increasingly indexing web content using data streams, since a number of web sources including news and social media sites are now delivering up-to-date information via streams. Accordingly, it is a crucial challenge for a real-time search engine using data streams to improve index freshness that primarily depends on the latencies involved during fetching and indexing processes. Retrieval latency is a time lag between document publication and fetching while indexing latency is a delay required for a fetched document to be indexed, which is caused by finiteness of indexing capacity. The problem of retrieval latency can be satisfactorily addressed by use of appropriate fetching scheduling or recent real-time content notification protocols. However, as the entire volume of real-time content rapidly grows, the indexing latency becomes a challenging problem. Furthermore, the need for maximizing index coverage makes it more difficult to reduce the indexing latency under the limited indexing capacity. We consider a problem of jointly optimizing the indexing latency as well as index coverage, in which their relative importance can be adjusted, and propose an optimization model based on inventory control theory. Extensive experiments have been conducted to validate the proposed model, and suggest that the proposed approach outperforms the other alternatives.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TKDE.2011.144</guid>
  </item>
  <item>
     <title>PrePrint: Toward Private Joins on Outsourced Data</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TKDE.2011.142</link>
     <description>In an outsourced database framework, clients place data management with specialized service providers. Of essential concern in such frameworks is data privacy. Potential clients are reluctant to outsource sensitive data to a foreign party without strong privacy assurances beyond policy "fine prints". In this paper we introduce a mechanism for executing general binary JOIN operations (for predicates that satisfy certain properties) in an outsourced relational database framework with computational privacy and low overheads - a first, to the best of our knowledge. We illustrate via a set of relevant instances of JOIN predicates, including: range and equality (e.g., for geographical data), Hamming distance (e.g., for DNA matching) and semantics (i.e., in health-care scenarios - mapping antibiotics to bacteria). We experimentally evaluate the main overhead components and show they are reasonable. The initial client computation overhead for 100000 data items is around 5 minutes and our privacy mechanisms can sustain theoretical throughputs of several million predicate evaluations per second, even for an un-optimized OpenSSL based implementation.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TKDE.2011.142</guid>
  </item>
  <item>
     <title>PrePrint: Constructing a New-Style Conceptual Model of Brain Data for Systematic Brain Informatics</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TKDE.2011.139</link>
     <description>The development of brain science has led to a vast increase of brain data. To meet requirements of a systematic methodology of Brain Informatics (BI), this paper proposes a new conceptual model of brain data, namely Data-Brain, which explicitly represents various relationships among multiple human brain data sources, with respect to all major aspects and capabilities of human information processing systems (HIPS). A multi-dimension framework and a BI methodology based ontological modeling approach have been developed to implement a Data-Brain. The Data-Brain, Data-Brain based BI provenances, and heterogeneous brain data can be used to construct a Data-Brain based brain data center which provides a global framework to integrate data, information and knowledge coming from the whole research process for systematic BI study. Such a Data-Brain modeling approach represents a radically new way for domain-driven conceptual modeling of brain data, which models a whole process of systematically investigating human information processing mechanisms.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TKDE.2011.139</guid>
  </item>
  <item>
     <title>PrePrint: Efficient Multi-Dimensional Fuzzy Search for Personal Information Management Systems</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TKDE.2011.126</link>
     <description>With the explosion in the amount of semi-structured data users access and store in personal information management systems, there is a critical need for powerful search tools to retrieve often very heterogeneous data in a simple and efficient way. Existing tools typically support some IR-style ranking on the textual part of the query, but only consider structure (e.g., file directory) and metadata (e.g., date, file type) as filtering conditions. We propose a novel multi-dimensional search approach that allows users to perform fuzzy searches for structure and metadata conditions in addition to keyword conditions. Our techniques individually score each dimension and integrate the three dimension scores into a meaningful unified score. We also design indexes and algorithms to efficiently identify the most relevant files that match multi-dimensional queries. We perform a thorough experimental evaluation of our approach and show that our relaxation and scoring framework for fuzzy query conditions in non-content dimensions can significantly improve ranking accuracy. We also show that our query processing strategies perform and scale well, making our fuzzy search approach practical for everyday usage.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TKDE.2011.126</guid>
  </item>
  <item>
     <title>PrePrint: Discovering the Most Influential Sites over Uncertain Data: A Rank Based Approach</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TKDE.2011.121</link>
     <description>With the rapidly increasing availability of uncertain data in many important applications such as location-based services, sensor monitoring and biological information management systems, uncertainty-aware query processing has received a significant amount of research effort from the database community in recent years. In this paper, we investigate a new type of query in the context of uncertain databases, namely the uncertain top-k influential sites query (UTkIS query for short), which can be applied in a wide range of application areas such as marketing analysis and mobile services. Since it is not straightforward to precisely define the semantics of a top-k query with uncertain data, in this paper we introduce a novel and more intuitive formulation of the query on the basis of expected rank semantics. To address the efficiency issue caused by possible worlds exploration, we propose effective pruning rules and a divide-and-conquer paradigm such that the number of candidates as well as the number of possible worlds to be considered can be significantly reduced. Finally we conduct extensive experiments on real datasets to verify the effectiveness and efficiency of the new methods proposed in this paper.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TKDE.2011.121</guid>
  </item>
  <item>
     <title>PrePrint: Clustering with Multi-Viewpoint Based Similarity Measure</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TKDE.2011.86</link>
     <description>All clustering methods have to assume some cluster relationship among the data objects that they are applied on. Similarity between a pair of objects can be defined either explicitly or implicitly. In this paper, we introduce a novel multi-viewpoint based similarity measure and two related clustering methods. The major difference between a traditional dissimilarity/similarity measure and ours is that the former uses only a single viewpoint, which is the origin, while the latter utilizes many different viewpoints, which are objects assumed to not be in the same cluster with the two objects being measured. Using multiple viewpoints, more informative assessment of similarity could be achieved. Theoretical analysis and empirical study are conducted to support this claim. Two criterion functions for document clustering are proposed based on this new measure. We compare them with several well-known clustering algorithms that use other popular similarity measures on various document collections to verify the advantages of our proposal.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TKDE.2011.86</guid>
  </item>
  <item>
     <title>PrePrint: Efficient and Progressive Algorithms for Distributed Skyline Queries over Uncertain Data</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TKDE.2011.77</link>
     <description>The skyline operator has received considerable attention from the database community, due to its importance in many applications including multi-criteria decision making, preference answering, and so forth. In many applications, uncertain data inherently exist: data collected from different sources in distributed locations usually carry imprecise measurements, and thus exhibit some kind of uncertainty. Taking into account the network delay and economic cost associated with sharing and communicating large amounts of distributed data over a network, an important problem in this scenario is to retrieve the global skyline tuples from all the distributed local sites with minimum communication cost. Based on the well-known notion of the probabilistic skyline query over centralized uncertain data, in this paper, we propose the notion of distributed skyline queries over uncertain data. Furthermore, two communication- and computation-efficient algorithms are proposed to retrieve the qualified skylines from distributed local sites. Extensive experiments have been conducted to verify the efficiency, the effectiveness and the progressiveness of our algorithms on both synthetic and real data sets.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TKDE.2011.77</guid>
  </item>
  <item>
     <title>PrePrint: Adding Temporal Constraints to XML Schema</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TKDE.2011.74</link>
     <description>If past versions of XML documents are retained, what of the various integrity constraints defined in XML Schema on those documents? This paper describes how to interpret such constraints as sequenced constraints, applicable at each point in time. We also consider how to add new variants that apply across time, so-called non-sequenced constraints. Our approach supports temporal documents that vary over both valid and transaction time, whose schema can vary over transaction time. We do this by replacing the schema with a (possibly time-varying) temporal schema and replacing the document with a temporal document, both of which are upward compatible with conventional XML and with conventional tools like XMLLINT, which we have extended to support the temporal constraints introduced here.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TKDE.2011.74</guid>
  </item>
  <item>
     <title>PrePrint: Scalable Scheduling of Updates in Streaming Data Warehouses</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TKDE.2011.45</link>
     <description>We discuss update scheduling in streaming data warehouses, which combine the features of traditional data warehouses and data stream systems. In our setting, external sources push append-only data streams into the warehouse with a wide range of inter-arrival times. While traditional data warehouses are typically refreshed during downtimes, streaming warehouses update base tables and layers of materialized views as new data arrive. We model the streaming warehouse update problem as a scheduling problem, where jobs correspond to processes that load new data into tables, and whose objective is to minimize data staleness over time (at time t, if a table has been updated with information up to some earlier time r, its staleness is t minus r). We then propose a scheduling framework that handles the complications encountered by a stream warehouse: view hierarchies and priorities, data consistency, inability to preempt updates, heterogeneity of update jobs caused by different inter-arrival times and data volumes among different sources, and handling transient overload. A novel feature of our framework is that scheduling decisions do not depend on properties of update jobs (such as deadlines), but rather on the effect of update jobs on data staleness. Finally, we present a suite of update scheduling algorithms and extensive simulation experiments.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TKDE.2011.45</guid>
  </item>
  <item>
     <title>PrePrint: Visual Role Mining: A Picture Is Worth a Thousand Roles</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TKDE.2011.37</link>
     <description>This paper offers a new role engineering approach to Role-Based Access Control (RBAC), referred to as visual role mining. The key idea is to graphically represent user-permission assignments to enable quick analysis and elicitation of meaningful roles. First, we formally define the problem by introducing a metric for the quality of the visualization. Then, we prove that finding the best representation according to the defined metric is an NP-hard problem. In turn, we propose two algorithms: ADVISER and EXTRACT. The former is a heuristic used to best represent the user-permission assignments of a given set of roles. The latter is a fast probabilistic algorithm that, when used in conjunction with ADVISER, allows for a visual elicitation of roles even in the absence of pre-defined roles. Besides being rooted in sound theory, our proposal is supported by extensive simulations run over real data. Results confirm the quality of the proposal and demonstrate its viability in supporting role engineering decisions.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TKDE.2011.37</guid>
  </item>
  <item>
     <title>PrePrint: A Unified Probabilistic Framework for Name Disambiguation in Digital Library</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TKDE.2011.13</link>
     <description>Despite years of research, the name ambiguity problem remains largely unresolved. Outstanding issues include how to capture all information for name disambiguation in a unified approach, and how to determine the number of people K in the disambiguation process. In this paper, we formalize the problem in a unified probabilistic framework, which incorporates both attributes and relationships. Specifically, we define a disambiguation objective function for the problem and propose a two-step parameter estimation algorithm. We also investigate a dynamic approach for estimating the number of people K. Experimental results show that our proposed framework significantly outperforms four baseline methods of using traditional clustering algorithms and two other previous methods. Experiments also indicate that the number K automatically found by our method is close to the actual number. We apply the result of name disambiguation by the proposed method to expert finding and obtain clear improvement on the performance of expert finding.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TKDE.2011.13</guid>
  </item>
  <item>
     <title>PrePrint: Efficient Service Skyline Computation for Composite Service Selection</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TKDE.2011.268</link>
     <description>Service composition is emerging as an effective vehicle for integrating existing Web services to create value-added and personalized composite services. As Web services with similar functionality are expected to be provided by competing providers, a key challenge is to find the "best" Web services to participate in the composition. When multiple quality aspects (e.g., response time, fee, etc.) are considered, a weighting mechanism is usually adopted by most existing approaches, which requires users to specify their preferences as numeric values. We propose to exploit the dominance relationship among service providers to find a set of "best" possible composite services, referred to as a composite service skyline. We develop efficient algorithms that allow us to find the composite service skyline from a significantly reduced search space instead of considering all possible service compositions. We propose a novel bottom-up computation framework that enables the skyline algorithm to scale well with the number of services in a composition. We conduct a comprehensive analytical and experimental study to evaluate the effectiveness, efficiency, and scalability of the composite skyline computation approaches.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TKDE.2011.268</guid>
  </item>
  <item>
     <title>PrePrint: The Minimum Consistent Subset Cover Problem: A Minimization View of Data Mining</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TKDE.2011.260</link>
     <description>In this paper, we introduce and study the minimum consistent subset cover (MCSC) problem. Given a finite ground set X and a constraint t, find the minimum number of consistent subsets that cover X, where a subset of X is consistent if it satisfies t. The MCSC problem generalizes the traditional set covering problem and has minimum clique partition, a dual problem of graph coloring, as an instance. The problem reflects a minimization view of data mining. Many common data mining tasks in rule learning, clustering, and pattern mining can be formulated as MCSC instances. In particular, we discuss the minimum rule set problem that minimizes model complexity of decision rules, the converse k-clustering problem that minimizes the number of clusters, and the pattern summarization problem that minimizes the number of patterns. Our proposed generic algorithm CAG is directly applicable to any of these MCSC instances. CAG starts by constructing a maximal optimal partial solution, then performs an example-driven specific-to-general search on a dynamically maintained bipartite assignment graph to simultaneously learn a set of consistent subsets with small cardinality covering the ground set.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TKDE.2011.260</guid>
  </item>
  <item>
     <title>IEEE Transactions on Knowledge and Data Engineering - February 2012 (Vol. 24, No. 2)</title>
     <link>http://opac.ieeecomputersociety.org/opac?year=2012&amp;volume=24&amp;issue=02&amp;acronym=tkde</link>
     <description>IEEE Transactions on Knowledge and Data Engineering</description>
     <guid isPermaLink="true">http://www.computer.org/portal/site/tkde/</guid>
  </item>
  <item>
     <title>PrePrint: Data Cube Materialization and Mining over MapReduce</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TKDE.2011.257</link>
     <description>Computing interesting measures for data cubes and subsequent mining of interesting cube groups over massive datasets are critical for many important analyses done in the real world. Previous studies have focused on algebraic measures such as SUM that are amenable to parallel computation and can easily benefit from the recent advancement of parallel computing infrastructure such as MapReduce. Dealing with holistic measures such as TOP-K, however, is non-trivial. In this paper we detail real-world challenges in cube materialization and mining tasks on Web-scale datasets. Specifically, we identify an important subset of holistic measures and introduce MR-Cube, a MapReduce based framework for efficient cube computation and identification of interesting cube groups on holistic measures. We provide extensive experimental analyses over both real and synthetic data. We demonstrate that, unlike existing techniques which cannot scale to the 100 million tuple mark for our datasets, MR-Cube successfully and efficiently computes cubes with holistic measures over billion-tuple datasets.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TKDE.2011.257</guid>
  </item>
  <item>
     <title>PrePrint: &#x03BB;-Diverse Nearest Neighbors Browsing for Multi-Dimensional Data</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TKDE.2011.251</link>
     <description>Traditional search methods try to obtain the most relevant information and rank it according to the degree of similarity to the queries. Diversity in query results is also preferred by a variety of applications since results very similar to each other cannot capture all aspects of the queried topic. In this work, we focus on the &#x03BB;-diverse k-nearest neighbor search problem on spatial and multi-dimensional data. Unlike the approach of diversifying query results in a post-processing step, we naturally obtain diverse results with the proposed geometric and index-based methods. We first make an analogy with the concept of natural neighbors and propose a natural neighbor-based method for 2D and 3D data and an incremental browsing algorithm based on Gabriel graphs for higher dimensional spaces. We then introduce a diverse browsing method based on the distance browsing feature of spatial index structures, such as R-trees. The algorithm maintains a priority queue with mindivdist of the objects depending on both relevancy and angular diversity and efficiently prunes non-diverse items and nodes. We experimented with a number of spatial and high-dimensional datasets, including Factual's US points-of-interest dataset with 13M entries. With effective pruning, our diverse browsing method is shown to be more efficient and more effective than KNN and KNDN techniques.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TKDE.2011.251</guid>
  </item>
  <item>
     <title>PrePrint: Cutting Plane Training for Linear Support Vector Machines</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TKDE.2011.247</link>
     <description>Support Vector Machines (SVMs) have been shown to achieve high performance on classification tasks across many domains, and a great deal of work has been dedicated to developing training algorithms for linear SVMs which are computationally tractable on large data sets. One approach [1] approximately minimizes risk through use of cutting planes, and is improved by [2], [3]. We build upon this work, presenting a modification to the algorithm developed by [2]. We demonstrate empirically that our changes can reduce cutting-plane training time by up to 40%, and discuss how the effectiveness of our method is impacted by changes in data sets and parameter settings.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TKDE.2011.247</guid>
  </item>
  <item>
     <title>PrePrint: Clustering Large Probabilistic Graphs</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TKDE.2011.243</link>
     <description>We study the problem of clustering probabilistic graphs. Similar to the problem of clustering standard graphs, probabilistic graph clustering has numerous applications, such as finding complexes in probabilistic protein-protein interaction networks and discovering groups of users in affiliation networks. We extend the edit-distance based definition of graph clustering to probabilistic graphs. We establish a connection between our objective function and correlation clustering to propose practical approximation algorithms for our problem. A benefit of our approach is that our objective function is parameter-free. Therefore, the number of clusters is part of the output. We also develop methods for testing the statistical significance of the output clustering and study the case of noisy clusterings. Using a real protein-protein interaction network and ground-truth data, we show that our methods discover the correct number of clusters and identify established protein relationships. Finally, we show the practicality of our techniques using a large social network of Yahoo! users consisting of one billion edges.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TKDE.2011.243</guid>
  </item>
  <item>
     <title>PrePrint: Event Tracking for Real-Time Unaware Sensitivity Analysis (EventTracker)</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TKDE.2011.240</link>
     <description>A novel platform for instantaneous Sensitivity Analysis applicable to large scale real-time data acquisition systems is introduced. The drive for the proposed EventTracker platform is the assumption that modern industrial systems are sufficiently flexible and equipped to capture data. This flexibility to adapt can only be assured if the data collected can be succinctly interpreted and translated into corrective actions in a timely manner. An important factor that will help in data interpretation and information modelling is the appreciation of the effect system inputs have on each output at their time of occurrence. Existing sensitivity analysis methods appear to hamper efficient and timely sensitivity analysis due to their heavy reliance on historical data, or their sluggishness in providing a timely solution that is of use in real-time applications. This inefficiency is compounded by computational limitations and the complexity of existing models. Dealing with real-time event driven systems, the underpinning logic of the proposed approach is the assumption that, in the vast majority of cases, changes to input variables trigger events. The proposed event tracking sensitivity analysis method describes variables and the system state as a collection of events. Compared with the entropy-based method, the proposed event tracking sensitivity analysis demonstrates a 10% improvement in computational efficiency with the same accuracy.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TKDE.2011.240</guid>
  </item>
  <item>
     <title>PrePrint: Simple Hybrid and Incremental Post-Pruning Techniques for Rule Induction</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TKDE.2011.237</link>
     <description>Pruning achieves the dual goal of reducing the complexity of the final hypothesis for improved comprehensibility, and improving its predictive accuracy by minimizing the overfitting due to noisy data. This paper presents a new hybrid pruning technique for rule induction, as well as an incremental post-pruning technique based on a misclassification tolerance. Although both have been designed for RULES-7, the latter is also applicable to any rule induction algorithm in general. A thorough empirical evaluation reveals that the proposed techniques enable RULES-7 to outperform other state-of-the-art classification techniques. The improved classifier is also more accurate and up to two orders of magnitude faster than before.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TKDE.2011.237</guid>
  </item>
  <item>
     <title>PrePrint: Modeling and Solving Distributed Configuration Problems: A CSP-Based Approach</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TKDE.2011.236</link>
     <description>Product configuration can be defined as the task of tailoring a product according to the specific needs of a customer. Due to the inherent complexity of this task, which for example includes the consideration of complex constraints or the automatic completion of partial configurations, various Artificial Intelligence techniques have been explored in the last decades to tackle such configuration problems. Most of the existing approaches adopt a single-site, centralized approach. In modern supply-chain settings, however, the components of a customizable product may themselves be configurable, thus requiring a multi-site, distributed approach. In this paper, we analyze the challenges of modeling and solving such distributed configuration problems and propose an approach based on Distributed Constraint Satisfaction. In particular, we advocate the use of Generative Constraint Satisfaction for knowledge modeling and show in an experimental evaluation that the use of generic constraints is particularly advantageous also in the distributed problem solving phase.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TKDE.2011.236</guid>
  </item>
  <item>
     <title>PrePrint: Reassessing Top-Down Join Enumeration</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TKDE.2011.235</link>
     <description>Finding an optimal execution order of join operations is a crucial task in every cost-based query optimizer. Since there are many possible join trees for a given query, the overhead of the join (tree) enumeration algorithm per valid join tree should be minimal. In the case of a clique-shaped query graph, the best known top-down algorithm has a complexity of &#920;(n^2) per join tree, where n is the number of relations. In this paper, we present an algorithm that has a corresponding O(1) complexity in this case. We show experimentally that this more theoretical result indeed has a high impact on the performance in other non-clique settings. This is especially true for cyclic query graphs. Further, we evaluate the performance of our new algorithm and compare it with the best top-down and bottom-up algorithms described in the literature.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TKDE.2011.235</guid>
  </item>
  <item>
     <title>PrePrint: Discovering Temporal Change Patterns in the Presence of Taxonomies</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TKDE.2011.233</link>
     <description>Frequent itemset mining is a widely used exploratory technique that focuses on discovering correlations among data. The steadfast evolution of markets and business environments prompts the need for data mining algorithms that discover significant correlation changes in order to reactively suit product and service provision to customer needs. Change mining, in the context of frequent itemsets, investigates changes in the set of mined itemsets from one time period to another. The discovery of frequent generalized itemsets, i.e., itemsets that provide a high level abstraction of the mined knowledge, poses new challenges in the analysis of itemsets that become rare, and thus are no longer extracted, from a certain point. This paper proposes a novel kind of dynamic pattern, namely the HIGEN (HIstory GENeralized Pattern), that represents the evolution of an itemset in consecutive time periods, by reporting the information about its frequent generalization characterized by minimal redundancy in case it becomes infrequent at a certain time period. To address HIGEN mining, it proposes a novel algorithm that avoids itemset mining followed by postprocessing by exploiting a support-driven itemset generalization approach. Experiments performed on both real and synthetic datasets show the efficiency and the effectiveness of the proposed approach as well as its usefulness in a real application context.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TKDE.2011.233</guid>
  </item>
  <item>
     <title>PrePrint: Anonymization of Centralized and Distributed Social Networks by Sequential Clustering</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TKDE.2011.232</link>
     <description>We study the problem of privacy-preservation in social networks. We consider the distributed setting in which the network data is split between several data holders. The goal is to arrive at an anonymized view of the unified network without revealing to any of the data holders information about links between nodes that are controlled by other data holders. To that end, we start with the centralized setting and offer two variants of an anonymization algorithm which is based on sequential clustering. Our algorithms significantly outperform the SaNGreeA algorithm due to Campan and Truta which is the leading algorithm for achieving anonymity in networks by means of clustering. We then devise secure distributed versions of our algorithms. To the best of our knowledge, this is the first study of privacy preservation in distributed social networks. We conclude by outlining future research proposals in that direction.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TKDE.2011.232</guid>
  </item>
  <item>
     <title>PrePrint: Finding Rare Classes: Active Learning with Generative and Discriminative Models</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TKDE.2011.231</link>
     <description>Discovering rare categories and classifying new instances of them is an important data mining issue in many fields, but fully supervised learning of a rare class classifier is prohibitively costly in labeling effort. There has therefore been increasing interest both in active discovery: to identify new classes quickly, and active learning: to train classifiers with minimal supervision. These goals occur together in practice and are intrinsically related because examples of each class are required to train a classifier. Nevertheless, very few studies have tried to optimise them together, meaning that data mining for rare classes in new domains makes inefficient use of human supervision. Developing active learning algorithms to optimise both rare class discovery and classification simultaneously is challenging because discovery and classification have conflicting requirements in query criteria. In this paper we address these issues with two contributions: a unified active learning model to jointly discover new categories and learn to classify them by adapting query criteria online; and a classifier combination algorithm that switches generative and discriminative classifiers as learning progresses. Extensive evaluation on a batch of standard UCI and vision datasets demonstrates the superiority of this approach over existing methods.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TKDE.2011.231</guid>
  </item>
  <item>
     <title>PrePrint: Reinforced Similarity Integration in Image-Rich Information Networks</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TKDE.2011.228</link>
     <description>Social multimedia sharing and hosting websites, such as Flickr and Facebook, contain billions of user-submitted images. Popular Internet commerce websites such as Amazon.com are also furnished with tremendous amounts of product-related images. In addition, images in such social networks are also accompanied by annotations, comments and other information, thus forming heterogeneous image-rich information networks. In this paper, we introduce the concept of (heterogeneous) image-rich information network and the problem of how to perform information retrieval and recommendation in such networks. We propose a fast algorithm HMok-SimRank (heterogeneous minimum order k-SimRank) to compute link-based similarity in weighted heterogeneous information networks. Then, we propose an algorithm Integrated Weighted Similarity Learning (IWSL) to account for both link-based and content-based similarities by considering the network structure and mutually reinforcing link similarity and feature weight learning. Both local and global feature learning methods are designed. Experimental results on Flickr and Amazon datasets show that our approach is significantly better than traditional methods in terms of both relevance and speed. A new product search and recommendation system for e-commerce has been implemented based on our algorithm.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TKDE.2011.228</guid>
  </item>
  <item>
     <title>PrePrint: A Generalized Flow Based Method for Analysis of Implicit Relationships on Wikipedia</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TKDE.2011.227</link>
     <description>We focus on measuring relationships between pairs of objects in Wikipedia whose pages can be regarded as individual objects. Two kinds of relationships between two objects exist: in Wikipedia, an explicit relationship is represented by a single link between the two pages for the objects, and an implicit relationship is represented by a link structure containing the two pages. Some of the previously proposed methods for measuring relationships are cohesion-based methods, which underestimate objects having high degrees, although such objects could be important in constituting relationships in Wikipedia. The other methods are inadequate for measuring implicit relationships because they use only one or two of the following three important factors: distance, connectivity, and co-citation. We propose a new method using a generalized maximum flow which reflects all three factors and does not underestimate objects having high degrees. We confirm through experiments that our method can measure the strength of a relationship more appropriately than these previously proposed methods do. Another remarkable aspect of our method is mining elucidatory objects, that is, objects constituting a relationship. We explain that mining elucidatory objects would open a novel way to deeply understand a relationship.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TKDE.2011.227</guid>
  </item>
  <item>
     <title>PrePrint: Change Detection in Streaming Multivariate Data Using Likelihood Detectors</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TKDE.2011.226</link>
     <description>Change detection in streaming data relies on a fast estimation of the probability that the data in two consecutive windows come from different distributions. Choosing the criterion is one of the multitude of questions that need to be addressed when designing a change detection procedure. This paper gives a log-likelihood justification for two well-known criteria for detecting change in streaming multidimensional data: Kullback-Leibler (K-L) distance and Hotelling's T-square test for equal means (H). We propose a semi-parametric log-likelihood criterion (SPLL) for change detection. Compared to the existing log-likelihood change detectors, SPLL trades some theoretical rigour for computational simplicity. We examine SPLL together with K-L and H on detecting induced change on 30 real data sets. The criteria were compared using the area under the respective ROC curve (AUC). SPLL was found to be on a par with H and better than K-L for the non-normalised data, and better than both on the normalised data.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TKDE.2011.226</guid>
  </item>
  <item>
     <title>PrePrint: Energy-Aware Set-Covering Approaches for Approximate Data Collection in Wireless Sensor Networks</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TKDE.2011.224</link>
     <description>To conserve energy, sensor nodes with similar readings can be grouped such that readings from only the representative nodes within the groups need to be reported. However, efficiently identifying sensor groups and their representative nodes is a very challenging task. In this paper, we propose a centralized algorithm to determine a set of representative nodes with high energy levels and wide data coverage ranges. Here, the data coverage range of a sensor node is considered to be the set of sensor nodes that have reading behaviors very close to the particular sensor node. To further reduce the extra cost incurred in messages for selection of representative nodes, a distributed algorithm is developed. Furthermore, maintenance mechanisms are proposed to dynamically select alternative representative nodes when the original representative nodes run low on energy, or cannot capture spatial correlation within their respective data coverage ranges. Using experimental studies on both synthetic and real datasets, our proposed algorithms are shown to effectively and efficiently provide approximate data collection while prolonging the network lifetime.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TKDE.2011.224</guid>
  </item>
  <item>
     <title>PrePrint: An Information-Preserving Watermarking Scheme for Right Protection of EMR Systems</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TKDE.2011.223</link>
     <description>Recently, significant interest has developed in motivating physicians to use Electronic Medical Records (EMR) systems. An important utility of such EMR systems is that a next generation of Clinical Decision Support Systems (CDSS) will extract knowledge from the EMR to enable physicians to make accurate and effective diagnoses. In the future, such medical records will be shared through the cloud among different physicians to improve the quality of health care. Therefore, their right protection is important to protect their ownership once they are shared with third parties. Watermarking is a proven, well-known technique to achieve this objective. The challenges associated with watermarking of EMR systems are: (1) some fields in EMR are more relevant in the diagnosis process; therefore, small variations in them could change the diagnosis, and (2) a misdiagnosis might not only result in a life-threatening scenario but also lead to significant treatment costs for the patients. The major contribution of this paper is an information-preserving watermarking scheme to address the above-mentioned challenges. Our pilot studies reveal that our scheme never degrades the classification accuracy by more than 1%. In comparison, a well-known threshold-based technique degrades the accuracy by more than 18%.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TKDE.2011.223</guid>
  </item>
  <item>
     <title>PrePrint: Comparable Entity Mining from Comparative Questions</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TKDE.2011.210</link>
     <description>Comparing one thing with another is a typical part of the human decision-making process. However, it is not always easy to know what to compare and what the alternatives are. In this paper, we present a novel way to automatically mine comparable entities from comparative questions that users posted online to address this difficulty. To ensure high precision and high recall, we develop a weakly-supervised bootstrapping approach for comparative question identification and comparable entity extraction by leveraging a large online question archive. The experimental results show our method achieves an F1-measure of 82.5% in comparative question identification and 83.3% in comparable entity extraction. Both significantly outperform an existing state-of-the-art method. Additionally, our ranking results are highly relevant to users' comparison intents on the web.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TKDE.2011.210</guid>
  </item>
  <item>
     <title>PrePrint: A Survey of XML Tree Patterns</title>
     <link>http://doi.ieeecomputersociety.org/10.1109/TKDE.2011.209</link>
     <description>With XML becoming a ubiquitous language for data interoperability purposes in various domains, efficiently querying XML data is a critical issue. This has led to the design of algebraic frameworks based on tree-shaped patterns akin to the tree-structured data model of XML. Tree patterns are graphic representations of queries over data trees. They are actually matched against an input data tree to answer a query. Since the turn of the twenty-first century, an astounding research effort has been focusing on tree pattern models and matching optimization (a primordial issue). This paper is a comprehensive survey of these topics, in which we outline and compare the various features of tree patterns. We also review and discuss the two main families of approaches for optimizing tree pattern matching, namely pattern tree minimization and holistic matching. We finally present actual tree pattern-based developments, to provide a global overview of this significant research topic.</description>
     <guid isPermaLink="true">http://doi.ieeecomputersociety.org/10.1109/TKDE.2011.209</guid>
  </item>
   </channel>
</rss>
